As many computer programmers have observed: "Parsing is hard!" I concur, but it can be simplified. I like to think in terms of functions, so I take a functional programming approach instead of the more common imperative or stateful approach.
A parse function takes a list of characters (a "stream") as input and returns some type of result. In some cases it may return multiple types of results, such as one for failure and one for success. In that case the parse function would receive two handler functions as input, typically called no for failure and yes for success. This allows you to capture syntax errors directly, instead of having the code die per the usual practice.
When the parse function calls a handler function, it passes in the data resulting from the parse, but it also passes in the tail of the input stream after the parse. This allows you to chain parse functions sequentially, consuming the input stream in logical segments.
The two basic functions that work directly with an input stream are:
\scan=(\no\yes\xs xs (no ) yes) \push=(\x\yes\xs yes [x;xs])
The scan function is what you use to examine the next character in the stream. It takes two handlers no and yes, along with the input stream xs, and returns (no ) if the stream is empty, or (yes x xt) if the stream consists of a character x followed by the tail stream xt.
The push function is what you use to push a character that you've already read back onto the front of the input stream, so it can be read subsequently by some other function.
Note that those are the only two parse functions which refer to the input stream xs directly. All other parse functions are built on top of those, using a point-free and monadic style so you can compose parse functions without passing the input stream around explicitly.
Now consider a rather interesting parsing task. Write a function which collects all characters in the input up to the next occurrence of a given terminator string. For example, if the input stream starts with "123~END456", and you use the terminator string "~END", the function should return "123", and leave the input sitting at the "4" character after the terminator.
Note that you have to handle the case where the terminator is not found. If the stream is "123END456", and you use the terminator "~END", then it is not found.
Here's the outline of that function:
# Collect a string up to the next occurrence of the terminator string. # Return (yes str) if the term was found, otherwise return no. \collect_string= (\term\no\yes ... )
Note that because I'm using the point-free monadic style, I do not mention the input stream xs explicitly. However, xs is in fact the fourth input to the collect_string function. Also, for the same reason, it "goes without saying" that the function will pass in the tail of the list to the handlers, so the actual return values will by (yes str xt) if the term was found, or (no xt) if not found — where xt is the tail of the list after the parse.
So here's an example of how it might be called:
collect_string term ( say "missing terminator" ) (\str say ["str = "(as_str str)] )
I show the full implementation of collect_string in the next section.
In our investment accounting business I make extensive use of the "SSV" (space-separated value) format to specify event histories, such as trades, capital contributions, dividends, interest, etc. This is a series of lines, with each line consisting of a series of items separated by white space. An item can be a plain word such as "trade" (without the quotes), or a string enclosed in quotes such as "Alice Smith" (with the quotes), or a "tilde string" as used in Fexl itself, where you can choose an arbitrary delimiter such as "~END" and then enclose any block of text you like within a pair of those delimiters. Text within a quoted or tilde-quoted string may include line feeds, so you can have multi-line items.
The format allows comments, indicated by a "#" and continuing to the end of the line. (Of course, a "#" inside a quoted string does not indicate a comment.)
Here is the module for parsing the SSV format:
# Parse the SSV (space-separated value) format. # Collect a string up to the next occurrence of the terminator string. \collect_string= (\term\no\yes \term=(str_bytes term) \buf_result=buf_new \buf_match=buf_new (@\collect\in_term\term_tail term_tail (\str=(buf_get buf_result) yes str) \term_ch\term_tail scan no \x eq x term_ch ( # Current char matches the next term char, so save it. buf_put buf_match x collect T term_tail ); # Current char does not match the next term char. in_term ( # Add the previously matched terminator chars to the result. buf_put buf_result (buf_get buf_match) push x ) ( # Add current char to the result and proceed. buf_put buf_result x ); collect F term ) F term ) \get_plain_item= (\yes \buf=buf_new \done==(\str=(buf_get buf) yes str) @\loop scan done \x is_white x (push x done); eq x QU (push x done); eq x "~" (push x done); buf_put buf x loop ) \get_item= (\no\yes scan no \x is_eol x no; eq x QU (collect_to (eq QU) no yes); eq x "~" ( collect_to is_white no \term collect_string (. x term) no yes ); push x; get_plain_item yes ) \get_row= (@\loop\no\yes skip_match (\x is_eol x F; is_white x); get_item no \item loop (yes [item]) \row yes [item;row] ) \get_rows= (@\loop\yes skip_match is_white; get_row (yes ) \row loop \rows yes [row;rows] ) \read_ssv_string=(read_string get_rows) \read_ssv_chars=(read_chars get_rows) \read_ssv_file=(read_file get_rows) def "read_ssv_string" read_ssv_string; def "read_ssv_chars" read_ssv_chars; def "read_ssv_file" read_ssv_file; void
Data feeds from broker accounts often use the "CSV" (comma-separated value) format. Here is the module for parsing the CSV format:
# Parse the CSV (comma-separated value) format. # NOTE: https://www.ietf.org/rfc/rfc4180.txt # "Spaces are considered part of a field and should not be ignored." \get_plain_item= (\yes \buf=buf_new \done==(\str=(buf_get buf) yes str) @\loop scan done \x eq x "," (push x done); is_eol x (push x done); buf_put buf x loop ) # Get a char inside a quoted string. A single QU char is treated as end of # string. Two QU chars in a row are treated as a single QU character which # appears in the string. \get_quote_char= (\fail\done\keep scan fail \x eq x QU ( scan done \x eq x QU (keep QU) (push x done) ) (keep x) ) # Get a quoted item. \get_quote_item= (\no\yes \buf=buf_new @\loop get_quote_char no (\str=(buf_get buf) yes str) (\x buf_put buf x loop) ) \get_item= (\no\yes scan no \x eq x QU (get_quote_item no yes); push x; get_plain_item yes ) \get_row= (@\loop\no\yes get_item no \item scan (yes [item]) \x eq x "," (loop (yes [item]) \row yes [item;row]) (yes [item]) ) \get_rows= (@\loop\yes skip_match is_eol; get_row (yes ) \row loop \rows yes [row;rows] ) \read_csv_string=(read_string get_rows) \read_csv_chars=(read_chars get_rows) \read_csv_file=(read_file get_rows) def "read_csv_string" read_csv_string; def "read_csv_chars" read_csv_chars; def "read_csv_file" read_csv_file; void
(Note: Sometimes the data uses TAB instead of comma to separate the items on a row, so I have another version which allows you to pass in the separator as a parameter.)
I also have a JSON parser which I use for many API data feeds, but it's not in the standard library yet. I'll write something about it here later.
To validate the SSV and CSV parsers, I have devised an extensive test suite, including:
This includes not only normal cases, but also very intricate corner cases, such as having the stream end without a line feed, or failing to close a quoted string properly. I have very deliberately "fuzzed" the test cases to exercise many possibilities.
There is a supporting module "read.fxl" which defines some common functions used by read_ssv.fxl and read_csv.fxl above:
# \scan=(\no\yes\xs xs (no ) yes) \push=(\x\yes\xs yes [x;xs]) \is_eol=(\x eq x LF T; eq x CR) \is_white=(ge " ") # Skip matching characters. \skip_match= (\is_match @\loop\yes scan yes \x is_match x (loop yes) (push x yes) ) # Collect characters up to an ending condition. \collect_to= (\is_end\no\yes \buf=buf_new @\loop scan no \x is_end x (\str=(buf_get buf) yes str) (buf_put buf x loop) ) # Map a get function to a stream of characters. \stream=(\get @\loop \x=get is_undef x  [x;loop]) \read_chars=(\read read (\data\xt data)) \read_string= (\read\str \fh=(readstr str) read_chars read (stream; sgetc fh) ) \read_file= (\read\name \fh=(fopen name "r") is_void fh void; \data=(read_chars read (stream; fgetc fh)) fclose fh data ) def "scan" scan; def "push" push; def "is_eol" is_eol; def "is_white" is_white; def "skip_match" skip_match; def "collect_to" collect_to; def "read_chars" read_chars; def "read_string" read_string; def "read_file" read_file; void