Parsing in Fexl

A parse function reads a stream of input characters and returns some kind of structured result based on what it sees in the stream.

Since Fexl is a functional programming language, it is tempting to represent the input as a literal list of characters, and write the parse functions as pure functions of that list. I have done that in the past, and it works out pretty well, but I find it more complex than necessary. For one thing, I have to pass the stream around as an argument everywhere, and each parse function has to return the tail of the stream not yet read. By using a clever point-free or monadic style, I can avoid explicitly mentioning the stream everywhere. However, I think that makes the code deceptively simple on the surface but excessively complex in reality.

Consequently I take a very conventional procedural approach to parsing. The look function returns the "current character" of the input stream. The skip function advances to the next character in the stream. This approach is explicitly stateful and imperative.
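
To make the look/skip idea concrete, here is a minimal sketch in Python rather than Fexl. The Stream class and the get_word and skip_white helpers are hypothetical names invented for this illustration; they only model the behavior described above and are not the Fexl implementation. The point is that the position lives inside the stream, so the helpers neither receive nor return a "rest of input" value.

```python
class Stream:
    """Hypothetical stateful character stream modelled on look and skip."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def look(self):
        """Return the current character, or None at end of input."""
        if self.pos < len(self.text):
            return self.text[self.pos]
        return None

    def skip(self):
        """Advance to the next character; do nothing at end of input."""
        if self.pos < len(self.text):
            self.pos += 1


def skip_white(stream):
    """Advance past any white space at the current position."""
    while stream.look() is not None and stream.look().isspace():
        stream.skip()


def get_word(stream):
    """Collect characters up to white space or end of input."""
    chars = []
    while stream.look() is not None and not stream.look().isspace():
        chars.append(stream.look())
        stream.skip()
    return "".join(chars)


# Each call simply continues from wherever the previous one stopped.
stream = Stream("buy 100 IBM")
first = get_word(stream)
skip_white(stream)
second = get_word(stream)
```

After these calls, first holds "buy", second holds "100", and the stream is positioned on the space before "IBM".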

Parsing the SSV format

In our investment accounting business I make extensive use of the "SSV" (space-separated value) format to specify event histories, such as trades, capital contributions, dividends, interest, etc. This is a series of lines, with each line consisting of a series of items separated by white space. An item can be a plain word such as "trade" (without the quotes), or a string enclosed in quotes such as "Alice Smith" (with the quotes), or a "tilde string" as used in Fexl itself, where you can choose an arbitrary delimiter such as "~END" and then enclose any block of text you like within a pair of those delimiters. Text within a quoted or tilde-quoted string may include line feeds, so you can have multi-line items.

The format allows comments, indicated by a "#" and continuing to the end of the line. (Of course, a "#" inside a quoted string does not indicate a comment.)
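
As a rough illustration of these rules, the following Python sketch (not Fexl) splits a made-up event history into rows of items. It is a simplified model under stated assumptions: tilde strings are left out entirely, a "#" is treated as a comment only where an item could begin, and an unclosed quote raises an error. None of that is claimed to match the Fexl module in every corner case. The name parse_ssv and the sample data are invented for this example.

```python
def parse_ssv(text):
    """Parse SSV text into a list of rows, each a list of item strings.

    Simplified model: plain words, quoted strings, and comments only.
    """
    rows = []
    row = []
    pos = 0
    n = len(text)
    while pos < n:
        ch = text[pos]
        if ch == "\n":
            # End of line closes the current row, if it has any items.
            if row:
                rows.append(row)
                row = []
            pos += 1
        elif ch.isspace():
            pos += 1
        elif ch == "#":
            # Comment: ignore everything up to the end of the line.
            while pos < n and text[pos] != "\n":
                pos += 1
        elif ch == '"':
            # Quoted item: may contain spaces, "#", and line feeds.
            end = text.find('"', pos + 1)
            if end < 0:
                raise ValueError("unterminated quoted string")
            row.append(text[pos + 1:end])
            pos = end + 1
        else:
            # Plain item: runs to white space, a quote, or end of input.
            start = pos
            while pos < n and not text[pos].isspace() and text[pos] != '"':
                pos += 1
            row.append(text[start:pos])
    if row:
        rows.append(row)
    return rows


# Made-up sample input: one comment and one multi-line quoted item.
history = (
    'trade 2024-01-05 "Alice Smith" 100  # a comment\n'
    'dividend "Line one\nline two" 12.50\n'
)
rows = parse_ssv(history)
```

Here rows contains two rows. The comment is dropped, "Alice Smith" arrives as a single item without its quotes, and the second row's quoted item keeps its embedded line feed.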

Here is the module for parsing the SSV format:

read_ssv.fxl:

# Parse the SSV (space-separated value) format.
\\get_plain_item=
    (
    collect_to
        (
        at_white T;
        at_ch QU T;
        at_ch "~" T;
        at_eof T;
        F
        )
    )

\\get_quote_item=
    (
    skip
    \buf=buf_new
    collect_to_ch buf QU (buf_get buf) void
    )

\\get_tilde_item=
    (
    \buf=buf_new
    eq 1 (collect_tilde_string buf) (buf_get buf) void
    )

\\get_item=
    (
    at_eof void;
    at_ch QU get_quote_item;
    at_ch "~" get_tilde_item;
    at_eol (skip void);
    get_plain_item
    )

\\get_row=
    (@\\loop
    skip_match (at_eol F; at_white)
    \item=get_item
    is_undef item void;
    \row=loop
    is_undef row [item] [item;row]
    )

\\get_rows=
    (@\\loop
    skip_white
    \row=get_row
    is_undef row [];
    \rows=loop
    [row;rows]
    )

\parse=(\read\x read x get_rows)

\read_ssv_string=(parse read_stream)
\read_ssv_chars=(parse read_chars)
\read_ssv_file=(parse read_file)

define "read_ssv_string" read_ssv_string
define "read_ssv_chars" read_ssv_chars
define "read_ssv_file" read_ssv_file

Parsing the CSV format

Data feeds from broker accounts often use the "CSV" (comma-separated value) format. Here is the module for parsing the CSV format:

read_csv.fxl:

# Parse the CSV (comma-separated value) format.
# NOTE: https://www.ietf.org/rfc/rfc4180.txt
# "Spaces are considered part of a field and should not be ignored."
\get_plain_item=
    (\sep
    collect_to
        (
        at_ch sep T;
        at_eol T;
        at_eof T;
        F
        )
    )

# Get a quoted item.  A single QU char is treated as end of string.  Two QU
# chars in a row are treated as a single QU character which appears in the
# string.
\\get_quote_item=
    (
    skip
    \buf=buf_new
    @\\loop
    at_ch QU
        (
        skip
        at_ch QU
            (
            buf_keep buf
            loop
            )
            (buf_get buf)
        );
    at_eof void;
    buf_keep buf
    loop
    )

\get_item=
    (\sep
    at_ch QU get_quote_item;
    at_eof void;
    get_plain_item sep
    )

\get_row=
    (\sep
    @\\loop
    \item=(get_item sep)
    is_undef item void;
    at_ch sep
        (
        skip
        \row=loop
        is_undef row [item] [item;row]
        )
        [item]
    )

\get_rows=
    (\sep
    @\\loop
    skip_match at_eol
    \row=(get_row sep)
    is_undef row [];
    \rows=loop
    [row;rows]
    )

\parse=(\read\sep\x read x (get_rows sep))

# Use arbitrary separator.
\read_xsv_string=(parse read_stream)
\read_xsv_chars=(parse read_chars)
\read_xsv_file=(parse read_file)

# comma-separated
\read_csv_string=(read_xsv_string ",")
\read_csv_chars=(read_xsv_chars ",")
\read_csv_file=(read_xsv_file ",")

# tab-separated
\read_tsv_string=(read_xsv_string TAB)
\read_tsv_chars=(read_xsv_chars TAB)
\read_tsv_file=(read_xsv_file TAB)

define "read_xsv_string" read_xsv_string
define "read_xsv_chars" read_xsv_chars
define "read_xsv_file" read_xsv_file
define "read_csv_string" read_csv_string
define "read_csv_chars" read_csv_chars
define "read_csv_file" read_csv_file
define "read_tsv_string" read_tsv_string
define "read_tsv_chars" read_tsv_chars
define "read_tsv_file" read_tsv_file
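
The quoting rule in get_quote_item above (a single quote ends the string, two in a row stand for one literal quote) can be modelled in Python as follows. This is only a sketch of the rule, not a translation of the Fexl code: it works on a string and an index instead of a stateful stream, and it returns None where the Fexl version returns void.

```python
def get_quote_item(text, pos):
    """Read a quoted field whose opening quote is at text[pos].

    Returns (value, next_pos). The value is None if the input ends
    before the closing quote is seen.
    """
    pos += 1  # skip the opening quote
    chars = []
    while pos < len(text):
        ch = text[pos]
        if ch == '"':
            pos += 1
            if pos < len(text) and text[pos] == '"':
                # Two quotes in a row: keep one literal quote.
                chars.append('"')
                pos += 1
                continue
            # A single quote: end of the string.
            return "".join(chars), pos
        chars.append(ch)
        pos += 1
    # Reached end of input without a closing quote.
    return None, pos


value, next_pos = get_quote_item('"say ""hi"" now",next', 0)
unclosed, _ = get_quote_item('"oops', 0)
```

In this example value is the text say "hi" now, next_pos points at the comma that follows the closing quote, and unclosed is None because the string was never closed.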

Parsing the JSON format

I also have a JSON parser which I use for many API data feeds, but it's not in the standard library yet. I'll write something about it here later.

Validating the parsers

To validate the SSV and CSV parsers, I have devised an extensive test suite, including:

test/b12.fxl (test driver)
test/test.csv (separate test input file)
test/test.ssv (separate test input file)
out/b12 (reference test output file)

The suite covers not only normal cases but also intricate corner cases, such as a stream that ends without a line feed, or a quoted string that is never closed. I have deliberately "fuzzed" the test cases to exercise many possibilities.

Supporting modules

There is a supporting module "read.fxl" which defines some common functions used by read_ssv.fxl and read_csv.fxl above:

read.fxl:

# Skip matching characters.
\skip_match=
    (\\is_match
    @\\loop
    is_match (skip loop);
    )

# Collect characters up to an ending condition.
\collect_to=
    (\\is_end
    \buf=buf_new
    @\\loop
    is_end (buf_get buf);
    buf_keep buf
    loop
    )

\read_file=(\name read_stream (fopen name "r"))

\read_chars=
    (
    \flatten=
        (
        \buf=buf_new
        @\\loop\\xs
        xs (buf_get buf) \x\xs
        buf_put buf x
        loop xs
        )
    \xs
    read_stream (flatten xs)
    )

define "skip_match" skip_match
define "collect_to" collect_to
define "read_chars" read_chars
define "read_file" read_file
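
The two helpers skip_match and collect_to are higher-order functions: each takes a condition and loops over the stream until that condition decides otherwise. Here is a hedged Python model of that shape. The Stream class and the at_eof and at_white predicates are hypothetical stand-ins written for this sketch, and the condition is passed the stream explicitly, whereas the Fexl conditions read the current stream implicitly.

```python
class Stream:
    """Hypothetical stateful character stream with look and skip."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def look(self):
        """Return the current character, or None at end of input."""
        return self.text[self.pos] if self.pos < len(self.text) else None

    def skip(self):
        """Advance to the next character; do nothing at end of input."""
        if self.pos < len(self.text):
            self.pos += 1


def skip_match(stream, is_match):
    """Skip characters for as long as is_match holds."""
    while is_match(stream):
        stream.skip()


def collect_to(stream, is_end):
    """Collect characters until is_end holds, and return them as a string.

    The condition must become true at end of input, otherwise this loops
    forever; the conditions in the modules above all test for end of input.
    """
    chars = []
    while not is_end(stream):
        chars.append(stream.look())
        stream.skip()
    return "".join(chars)


def at_eof(stream):
    return stream.look() is None


def at_white(stream):
    return stream.look() is not None and stream.look().isspace()


stream = Stream("   alpha,beta")
skip_match(stream, at_white)
item = collect_to(stream, lambda s: at_eof(s) or s.look() == ",")
```

After this runs, item is "alpha" and the stream is left on the comma, which is the same division of labor the CSV module relies on: the collector stops at the separator and the caller decides whether to skip it.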