A parse function reads a stream of input characters and returns some kind of structured result based on what it sees in the stream.
Since Fexl is a functional programming language, it is tempting to represent the input as a literal list of characters and write the parse functions as pure functions of that list. I have done that in the past, and it works out pretty well, but I find it more complex than necessary. For one, I have to pass the stream around as an argument everywhere, and each parse function has to return the tail of the stream not yet read. By using a very clever point-free or monadic style, I can avoid explicitly mentioning the stream everywhere. However, I think that makes the code deceptively simple on the surface but excessively complex in reality.
Consequently I take a very conventional procedural approach to parsing. The look function returns the "current character" of the input stream. The skip function advances to the next character in the stream. This approach is explicitly stateful and imperative.
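For readers who don't know Fexl, the look/skip idea can be sketched in Python. This is only an analogy, not the Fexl implementation: the `Reader` class and the `read_word` helper are hypothetical names invented for this illustration, and in Fexl the current stream is implicit rather than an object you pass around.

```python
class Reader:
    """Hypothetical stand-in for the implicit input stream (not Fexl's API)."""

    def __init__(self, text):
        self._text = text
        self._pos = 0

    def look(self):
        """Return the current character, or None at end of input."""
        if self._pos < len(self._text):
            return self._text[self._pos]
        return None

    def skip(self):
        """Advance to the next character (does nothing at end of input)."""
        if self._pos < len(self._text):
            self._pos += 1


def read_word(r):
    """Collect characters up to the next space or the end of input."""
    chars = []
    while r.look() is not None and r.look() != " ":
        chars.append(r.look())
        r.skip()
    return "".join(chars)


r = Reader("buy 100")
first = read_word(r)   # consumes "buy"
r.skip()               # step over the space
second = read_word(r)  # consumes "100"
print(first, second)
```

Each parse function here just inspects and advances the shared position, so nothing has to return "the rest of the stream".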
In our investment accounting business I make extensive use of the "SSV" (space-separated value) format to specify event histories, such as trades, capital contributions, dividends, interest, etc. This is a series of lines, with each line consisting of a series of items separated by white space. An item can be a plain word such as "trade" (without the quotes), or a string enclosed in quotes such as "Alice Smith" (with the quotes), or a "tilde string" as used in Fexl itself, where you can choose an arbitrary delimiter such as "~END" and then enclose any block of text you like within a pair of those delimiters. Text within a quoted or tilde-quoted string may include line feeds, so you can have multi-line items.
The format allows comments, indicated by a "#" and continuing to the end of the line. (Of course, a "#" inside a quoted string does not indicate a comment.)
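To make the format concrete, here is a rough Python sketch of how such input breaks into rows and items. It is not a port of the Fexl module: the function name `parse_ssv` is made up, tilde strings are left out entirely, and raising an error on an unclosed quote is my own assumption rather than documented behavior.

```python
def parse_ssv(text):
    """Return a list of rows, each row a list of item strings.

    Illustrative only: handles plain words, quoted strings, and
    '#' comments. Tilde strings are deliberately not handled.
    """
    rows, row, i, n = [], [], 0, len(text)

    def end_row():
        nonlocal row
        if row:  # blank and comment-only lines produce no row
            rows.append(row)
        row = []

    while i < n:
        ch = text[i]
        if ch == "\n":
            end_row()
            i += 1
        elif ch.isspace():
            i += 1
        elif ch == "#":
            # Comment: ignore everything up to the end of the line.
            while i < n and text[i] != "\n":
                i += 1
        elif ch == '"':
            # Quoted item: may contain spaces, '#', and line feeds.
            end = text.find('"', i + 1)
            if end < 0:
                raise ValueError("unclosed quoted string")  # assumption
            row.append(text[i + 1:end])
            i = end + 1
        else:
            # Plain word: runs to the next white space or quote.
            start = i
            while i < n and not text[i].isspace() and text[i] != '"':
                i += 1
            row.append(text[start:i])
    end_row()
    return rows


sample = '''# event history
trade "Alice Smith" 100
dividend "Bob # not a comment" 2.50  # trailing comment
'''
rows = parse_ssv(sample)
print(rows)
```

The sample yields two rows: the comment-only first line disappears, and the "#" inside the quoted string stays part of the item.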
Here is the module for parsing the SSV format:
    # Parse the SSV (space-separated value) format.

    \\get_plain_item=
        (
        collect_to
            (
            at_white T;
            at_ch QU T;
            at_ch "~" T;
            at_eof T;
            F
            )
        )

    \\get_quote_item=
        (
        skip
        \buf=buf_new
        collect_to_ch buf QU (buf_get buf) void
        )

    \\get_tilde_item=
        (
        \buf=buf_new
        eq 1 (collect_tilde_string buf) (buf_get buf) void
        )

    \\get_item=
        (
        at_eof void;
        at_ch QU get_quote_item;
        at_ch "~" get_tilde_item;
        at_eol (skip void);
        get_plain_item
        )

    \\get_row=
        (@\\loop
        skip_match (at_eol F; at_white)
        \item=get_item
        is_undef item void;
        \row=loop
        is_undef row [item] [item;row]
        )

    \\get_rows=
        (@\\loop
        skip_white
        \row=get_row
        is_undef row ;
        \rows=loop
        [row;rows]
        )

    \parse=(\read\x read x get_rows)

    \read_ssv_string=(parse read_stream)
    \read_ssv_chars=(parse read_chars)
    \read_ssv_file=(parse read_file)

    \form
    def "read_ssv_string" read_ssv_string;
    def "read_ssv_chars" read_ssv_chars;
    def "read_ssv_file" read_ssv_file;
    form
Data feeds from broker accounts often use the "CSV" (comma-separated value) format. Here is the module for parsing the CSV format:
    # Parse the CSV (comma-separated value) format.
    # NOTE: https://www.ietf.org/rfc/rfc4180.txt
    # "Spaces are considered part of a field and should not be ignored."

    \get_plain_item=
        (\sep
        collect_to
            (
            at_ch sep T;
            at_eol T;
            at_eof T;
            F
            )
        )

    # Get a quoted item.  A single QU char is treated as end of string.  Two QU
    # chars in a row are treated as a single QU character which appears in the
    # string.
    \\get_quote_item=
        (
        skip
        \buf=buf_new
        @\\loop
        at_ch QU
            (
            skip
            at_ch QU
                (
                buf_keep buf
                loop
                )
                (buf_get buf)
            );
        at_eof void;
        buf_keep buf
        loop
        )

    \get_item=
        (\sep
        at_ch QU get_quote_item;
        at_eof void;
        get_plain_item sep
        )

    \get_row=
        (\sep @\\loop
        \item=(get_item sep)
        is_undef item void;
        at_ch sep
            (
            skip
            \row=loop
            is_undef row [item] [item;row]
            )
            [item]
        )

    \get_rows=
        (\sep @\\loop
        skip_match at_eol
        \row=(get_row sep)
        is_undef row ;
        \rows=loop
        [row;rows]
        )

    \parse=(\read\sep\x read x (get_rows sep))

    # Use arbitrary separator.
    \read_xsv_string=(parse read_stream)
    \read_xsv_chars=(parse read_chars)
    \read_xsv_file=(parse read_file)

    # comma-separated
    \read_csv_string=(read_xsv_string ",")
    \read_csv_chars=(read_xsv_chars ",")
    \read_csv_file=(read_xsv_file ",")

    # tab-separated
    \read_tsv_string=(read_xsv_string TAB)
    \read_tsv_chars=(read_xsv_chars TAB)
    \read_tsv_file=(read_xsv_file TAB)

    \form
    def "read_xsv_string" read_xsv_string;
    def "read_xsv_chars" read_xsv_chars;
    def "read_xsv_file" read_xsv_file;

    def "read_csv_string" read_csv_string;
    def "read_csv_chars" read_csv_chars;
    def "read_csv_file" read_csv_file;

    def "read_tsv_string" read_tsv_string;
    def "read_tsv_chars" read_tsv_chars;
    def "read_tsv_file" read_tsv_file;
    form
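The quoting rule in get_quote_item is the part of CSV that most often trips people up, so here is a small Python sketch of the same loop. It is an illustration under my own assumptions, not the Fexl code: `read_quoted` is a hypothetical name, it works on a string index rather than a stream, and it returns None where the Fexl version returns void.

```python
def read_quoted(text, i):
    """Read a quoted CSV field that starts at text[i] == '"'.

    Returns (value, next_index), or None if the input ends before
    the closing quote is seen.
    """
    assert text[i] == '"'
    i += 1  # skip the opening quote
    buf = []
    while i < len(text):
        if text[i] == '"':
            i += 1
            if i < len(text) and text[i] == '"':
                buf.append('"')  # two quotes in a row: keep one literal quote
                i += 1
            else:
                return "".join(buf), i  # a single quote ends the field
        else:
            buf.append(text[i])
            i += 1
    return None  # end of input inside the quoted field


print(read_quoted('"a,b",c', 0))
print(read_quoted('"say ""hi"""', 0))
print(read_quoted('"unclosed', 0))
```

The first call shows that a separator inside quotes belongs to the field; the caller resumes at the returned index, which points at the comma that really does separate fields.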
I also have a JSON parser which I use for many API data feeds, but it's not in the standard library yet. I'll write something about it here later.
To validate the SSV and CSV parsers, I have devised an extensive test suite.
It covers not only normal cases, but also very intricate corner cases, such as having the stream end without a line feed, or failing to close a quoted string properly. I have very deliberately "fuzzed" the test cases to exercise many possibilities.
There is a supporting module "read.fxl" which defines some common functions used by read_ssv.fxl and read_csv.fxl above:
    # Skip matching characters.
    \skip_match=
        (\\is_match @\\loop
        is_match (skip loop);
        )

    # Collect characters up to an ending condition.
    \collect_to=
        (\\is_end
        \buf=buf_new
        @\\loop
        is_end (buf_get buf);
        buf_keep buf
        loop
        )

    \read_file=(\name read_stream (fopen name "r"))

    \read_chars=
        (
        \flatten=
            (
            \buf=buf_new
            @\\loop\\xs
            xs (buf_get buf) \x\xs
            buf_put buf x
            loop xs
            )
        \xs read_stream (flatten xs)
        )

    \form
    def "skip_match" skip_match;
    def "collect_to" collect_to;
    def "read_chars" read_chars;
    def "read_file" read_file;
    form
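The two helpers skip_match and collect_to are small combinators: each takes a condition and loops over the stream until that condition changes. A loose Python analogue might look like the following. The `Reader` class and the `at_eof`, `at_white`, and `at_end_of_word` predicates are hypothetical stand-ins written for this sketch; in Fexl the stream is implicit, so the conditions take no argument.

```python
class Reader:
    """Hypothetical stand-in for the current input stream."""

    def __init__(self, text):
        self.text, self.pos = text, 0

    def look(self):
        return self.text[self.pos] if self.pos < len(self.text) else None

    def skip(self):
        if self.pos < len(self.text):
            self.pos += 1


def skip_match(r, is_match):
    """Skip characters for as long as is_match(r) holds."""
    while is_match(r):
        r.skip()


def collect_to(r, is_end):
    """Collect characters until is_end(r) holds, then return them."""
    buf = []
    while not is_end(r):
        buf.append(r.look())
        r.skip()
    return "".join(buf)


# Conditions are plain functions of the reader. An ending condition
# must be true at end of input, otherwise collect_to never stops.
def at_eof(r):
    return r.look() is None

def at_white(r):
    return r.look() is not None and r.look().isspace()

def at_end_of_word(r):
    return at_eof(r) or at_white(r)


r = Reader("  alpha beta")
skip_match(r, at_white)               # step over the leading spaces
word = collect_to(r, at_end_of_word)  # gather "alpha"
print(word)
```

This mirrors how the parsers above pass a compound condition to collect_to that always ends with an at_eof check.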