# Parsing in Fexl

As many computer programmers have observed: "Parsing is hard!" I concur, but it can be simplified.

Some functional programming languages insist on a "pure" approach to parsing, involving a stream object which is passed around implicitly by use of a monad. While it is possible to take this approach in Fexl, I don't recommend it. I've tried it, and I find that the resulting code is far less clear and more painful to write.

Instead, I take a stateful approach to reading the input stream, just as I would in an imperative language such as C. The stream is set up so that the functions look and skip are defined. The look function returns the current character in the stream. The skip function skips to the next character, returning nothing. It also keeps track of the current line and column number, so I can give a helpful message in case of a syntax error. (I use the "var" type in Fexl to track these.)

Now consider a rather interesting parsing task. Write a function which collects all characters in the input up to the next occurrence of a given terminator string. For example, if the input stream starts with "123~END456", and you use the terminator string "~END", the function should return "123", and leave the input sitting at the "4" character after the terminator.

Note that you have to handle the case where the terminator is not found. If the stream is "123END456", and you use the terminator "~END", then it is not found.

Here's how I specify the interface to that function:

# (collect_to_term term no yes)
#
# Collect all characters up to the next occurrence of term in the input stream.
# Return (yes str) if the term was found, otherwise return no.


I use a "continuation" style to return no if the terminator is not found, or (yes str) if the terminator was found. That way I don't have to encode a "success flag" into the return value which the caller has to examine.

So here's an example of how it might be called:

collect_to_term term
(
say "DID NOT FIND IT!"
)
(\str
say ["str  = "(as_str str)]
\ch=look
say ["ch   = "(as_str ch)]
)


Not only does that show you the resulting string if successful, it also shows you the next character after the occurrence of the terminator.

So without further ado, here is an implementation of collect_to_term:

# (collect_to_term term no yes)
#
# Collect all characters up to the next occurrence of term in the input stream.
# Return (yes str) if the term was found, otherwise return no.

\collect_to_term=
(\term\no\yes
eq "" term (yes "");
\buf_text=buf_new
\buf_term=buf_new

@\restart_term
\fh=(readstr term)
\t_ch=(sget fh)

@\outside_term
\ch=look
is_eof ch no;
eq ch t_ch
(
buf_put buf_term ch
@\inside_term
\t_ch=(sget fh)
is_eof t_ch
(
skip
\str=(buf_get buf_text)
yes str
);
skip
\ch=look
is_eof ch no;
eq ch t_ch
(
buf_put buf_term ch
inside_term
)
(
buf_put buf_text (buf_get buf_term)
restart_term
)
)
(
buf_put buf_text ch
skip
outside_term
)
)


Note that my use of the fixpoint operator @ resembles the use of goto in a traditional programming language like C! The lambda parameter following the @ behaves like a "label", to which you can jump simply by mentioning its name. I've got three such "labels": restart_term, outside_term, and inside_term.

The restart_term function starts reading the terminator string from its beginning: it uses readstr to open the string as if it were a file, and sget to read its (UTF-8) characters one at a time, again as if it's reading a file.

The outside_term function is what we do when we're "outside" the terminator: meaning that we haven't yet found the first terminator character.

The inside_term function is what we do when we're "inside" the terminator. As we are matching terminator characters with input characters, we save the characters in a buffer called buf_term. That's just in case we start to match the terminator, but later find out that we haven't. In that case I throw the buf_term characters onto the end of the buf_text buffer I use to collect the string result.

# Parsing the SSV format

In our hedge fund accounting business I make extensive use of the "SSV" (space-separated value) format to specify event histories, such as trades, capital contributions, dividends, interest, etc. This is a series of lines, with each line consisting of a series of items separated by white space. An item can be a plain word such as "trade" (without the quotes), or a string enclosed in quotes such as "Alice Smith" (with the quotes), or a "tilde string" as used in Fexl itself, where you can choose an arbitrary delimiter such as "~END" and then enclose any block of text you like within a pair of those delimiters. Text within a quoted or tilde-quoted string may include line feeds, so you can have multi-line items.

The format allows comments, indicated by a "#" and continuing to the end of the line. (Of course, a "#" inside a quoted string does not indicate a comment.)
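For example, a small event history in SSV might look like this (the field layout here is invented purely for illustration):

```text
# Hypothetical capital activity
contribution 2024-01-02 "Alice Smith" 100000   # quoted multi-word item
dividend     2024-03-15 acme_fund     2500
note         2024-03-15 ~END
A multi-line item: anything up to
the closing delimiter is kept verbatim.
~END
```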

Here is the module for parsing the SSV format:

# SSV (space-separated value) format

# The end-of-line character is LF.  Any CR is ignored as white space.
\is_eol=(eq LF)
\is_not_eol=(ne LF)

# A blank character is any white space except end-of-line.
\is_blank=(\x is_white x (is_not_eol x) F)

\at_eol=
(\ch
is_eof ch T;
is_eol ch T;
eq ch "#" T;
F
)

\skip_blank==(skip_match is_blank)
\skip_to_eol==(skip_match is_not_eol)

\skip_filler==
(@\loop
skip_white
\ch=look
is_eof ch ();
ne ch "#" ();
skip
skip_to_eol
skip
loop
)

(
collect_match
(\ch
is_white ch F;
eq ch QU F;
eq ch "~" F;
eq ch "#" F;
T
)
)

(\err\good
get_pos \row\col
skip
collect_to (eq QU) (err "Unclosed string" row col) good
)

(\err\good
get_pos \row\col
# Gather the terminator up to the next white space.
collect_to is_white (err "Incomplete string terminator" row col) \term
collect_to_term term (err "Unclosed string" row col);
good
)

(\err
\ch=look
)

(
skip_blank
at_eol
)

# Read the stream in SSV format.  Return (good rows) if valid, or
# (err msg row col) otherwise.
(
skip_filler
is_eof
)

standard


# Parsing the CSV format

Data feeds from broker accounts often use the "CSV" (comma-separated value) format. Sometimes the data uses TAB instead of comma to separate the items on a row, so I accept a parameter to specify any separator you like.

Here is the module for parsing the CSV format:

# Parse the CSV (comma-separated value) format.
# NOTE: https://www.ietf.org/rfc/rfc4180.txt
# "Spaces are considered part of a field and should not be ignored."

\is_eol=(\ch eq ch CR T; eq ch LF T; F)

# Read all characters up to a comma or end of line, and skip any comma.
(\sep
collect_match
(\ch
eq ch sep (skip F);
is_eol ch F;
T
)
)

# Collect characters up to the closing quote.  Two QU chars in a row are
# treated as a single QU character which appears in the string.
# If an ending quote was found, return (yes str), otherwise return no.
\collect_to_end_quote=
(\no\yes
\buf=buf_new
\done==
(
\str=(buf_get buf)
yes str
)
@\loop
skip
\ch=look
is_eof ch no;
eq ch QU
(
skip
\ch=look
is_eof ch done;
eq ch QU
(
buf_put buf ch
loop
)
(
skip
done
)
)
(
buf_put buf ch
loop
)
)

# Get a quoted item.
(\err\good
get_pos \row\col
collect_to_end_quote
(err "Unclosed string" row col)
good
)

(\sep\err
\ch=look
)

(\sep
()
(\ch
is_eof ch T;
is_eol ch T;
F
)
)

# Read the stream in CSV format with the given separator, usually "," or TAB.
# Return (good rows) if valid, or (err msg row col) otherwise.
(\sep
(skip_match is_eol)
is_eof
)

standard


# Validating the parsers

To validate the SSV and CSV parsers, I have devised an extensive test suite. This includes not only normal cases, but also very intricate corner cases, such as having the stream end without a line feed, or failing to close a quoted string properly. I have very deliberately "fuzzed" the test cases to exercise all possibilities.

# Supporting modules

There is a supporting module "read.fxl" which defines some common functions used by read_ssv.fxl and read_csv.fxl above:

# A white space character is anything SPACE or below.
\is_white=(ge " ")

# Skip matching characters.
\skip_match=
(\match
@\loop
\ch=look
(is_eof ch F; match ch)
(skip loop)
()
)

\skip_white==(skip_match is_white)

# Collect matching characters.
\collect_match=
(\match\return
\buf=buf_new
@\loop
\ch=look
(is_eof ch F; match ch)
(
skip
buf_put buf ch
loop
)
(
\item=(buf_get buf)
return item
)
)

# (collect_to match no yes)
# Collect characters up to an ending condition.  Return (yes str) if the
# condition is found, otherwise return no.
\collect_to=
(\match\no\yes
\buf=buf_new
@\loop
\ch=look
is_eof ch no;
skip
match ch
(
\item=(buf_get buf)
yes item
)
(
buf_put buf ch
loop
)
)

# (collect_to_term term no yes)
# Collect all characters up to the next occurrence of term in the input stream.
# Return (yes str) if the term was found, otherwise return no.
\collect_to_term=
(\term\no\yes
eq "" term (yes "");
\buf_text=buf_new
\buf_term=buf_new

@\restart_term
\fh=(readstr term)
\t_ch=(sget fh)

@\outside_term
\ch=look
is_eof ch no;
eq ch t_ch
(
buf_put buf_term ch
@\inside_term
\t_ch=(sget fh)
is_eof t_ch
(
skip
\str=(buf_get buf_text)
yes str
);
skip
\ch=look
is_eof ch no;
eq ch t_ch
(
buf_put buf_term ch
inside_term
)
(
buf_put buf_text (buf_get buf_term)
restart_term
)
)
(
buf_put buf_text ch
skip
outside_term
)
)

# Read a sequence of items.
skip_prefix
\ch=look
at_end ch (good []);
loop \list
good [item;list]
)

define "is_white" is_white;
define "skip_match" skip_match;
define "skip_white" skip_white;
define "collect_match" collect_match;
define "collect_to" collect_to;
define "collect_to_term" collect_to_term;
standard


And finally, there is a module "stream.fxl" which is used to stream over things like text blocks, named files, open file handles, or even a sequence of characters in a Fexl list. You can see how it uses variables (var_new) to track things like the current character in the stream, the current row (line) and column number, and the current "get" function used to return the sequence of characters. The row and column numbers are needed to provide helpful error messages.

The stream_fn function is the base function for initiating a stream and running an arbitrary piece of code in the context of that stream. It saves and restores the current variable values so you can call it in a nested fashion if you wish.

An opportunity exists to optimize the stream.fxl module, by implementing the low-level variable manipulation directly in C. In particular this would accelerate the skip function, which tracks the current character and row and column position. In the meantime, I find the native Fexl implementation to be plenty fast for production work.

stream.fxl:

# Stream operations.
\v_ch=var_new
\v_row=var_new
\v_col=var_new
\v_get=var_new

\look==(var_get v_ch)
\row==(var_get v_row)
\col==(var_get v_col)

\is_eof=(\ch is_defined ch F T)

\skip==
(
var_get v_get \ch
var_put v_ch ch
is_eof ch ();
eq ch LF
(
var_put v_row (+ 1 row)
var_put v_col 0
)
(
var_put v_col (+ 1 col)
)
)

# Get the current row and column position.
\get_pos=
(\return
\curr_row=row
\curr_col=col
return curr_row curr_col
)

\stream_fn=
(\get\body

\save_ch=(var_get v_ch)
\save_row=(var_get v_row)
\save_col=(var_get v_col)
\save_get=(var_get v_get)

var_put v_row 1
var_put v_col 0
var_put v_get
(\return
\ch=get
return ch
)

skip # Read the first character.
\result=body

var_put v_ch save_ch
var_put v_row save_row
var_put v_col save_col
var_put v_get save_get

result
)

\stream_text=
(\text\body
\fh=(readstr text)
\get==(sget fh)
stream_fn get body
)

\stream_chars=
(\chars\body
\text=(to_str chars)
stream_text text body
)

\stream_fh=
(\fh\body
\get==(fget fh)
stream_fn get body
)

\stream_file=
(\name\no\body
\fh=(fopen name "r")
is_void fh no;
\result=(stream_fh fh body)
fclose fh
result
)

define "look" look;
define "is_eof" is_eof;
define "skip" skip;
define "get_pos" get_pos;
define "stream_fn" stream_fn;
define "stream_text" stream_text;
define "stream_chars" stream_chars;
define "stream_fh" stream_fh;
define "stream_file" stream_file;
standard