An XML parser for Common Lisp programs

086 March 1, 2019 -- (tech tmsr)

This post is part of a series of published artifacts that (will) represent components for The Republic's RSS bot, Feedbot. The idea behind this series is to grow Feedbot piece by piece[1], starting from the smallest elements that fit in head, then using them as building blocks for the actual product, which will flow downstream from the botworks V tree.

The first item to be published is S-XML, an XML parser written in Common Lisp. Both the name and the code have been lifted from files published by one nonperson known as Sven Van Caekenberghe, who, fortunately, wrote a library that is relatively small (around a thousand LoC), is organized so that it can be grasped in a relatively short time, and is known to work[2]. Unfortunately though, as with most (all?) heathen programs encountered, this one isn't without warts. Thus, in addition to providing a patch, this article discusses the structure of S-XML and its current problems.

The patch for S-XML is available in my V source repository. Now, as to the library itself, it is structured as follows.

S-XML contains three layers of abstraction: a. the core parsing code, which reads characters from a stream and returns XML elements, stored in xml.lisp; b. a series of so-called "wrappers" over the parser that take its results and give them a particular structure; and c. an interface between (a) and (b), stored in dom.lisp. In fact, one could say that the layering goes exactly the other way around: the code in (b) provides a set of functions for the parser (a), while the parser takes a string/stream, processes it and calls the functions provided by (b) so that it can decide what to do once it has all the required data, e.g. tags, attributes etc. The wrappers in (b) are stored in xml-struct-dom.lisp, lxml-dom.lisp and sxml-dom.lisp[3].
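
To make the inverted layering concrete, here is a minimal sketch of the idea, which is not S-XML's actual code. A stand-in "parser" walks a pre-tokenized list of events and calls whatever hooks the wrapper hands it, threading a seed value through them. All the names below (run-events, events->tree, count-elements, *sample-events*) are hypothetical, and the event list is an assumption standing in for real character-level parsing.

```lisp
;; Illustrative only: these names are hypothetical, not part of S-XML.

(defun run-events (events &key new-element-hook finish-element-hook text-hook seed)
  "Stand-in for the parser: feed EVENTS to the hooks, return the final seed.
Each event is (:start name attributes), (:text string) or (:end name)."
  (let ((stack '()))                    ; saved (parent-seed . attributes) pairs
    (dolist (event events seed)
      (ecase (first event)
        (:start
         (destructuring-bind (name attributes) (rest event)
           (push (cons seed attributes) stack)
           (setf seed (funcall new-element-hook name attributes seed))))
        (:text
         (setf seed (funcall text-hook (second event) seed)))
        (:end
         (destructuring-bind (parent-seed . attributes) (pop stack)
           (setf seed (funcall finish-element-hook (second event)
                               attributes parent-seed seed))))))))

(defun events->tree (events)
  "Wrapper 1: build (name attributes child...) lists."
  (run-events events
              :new-element-hook (lambda (name attributes seed)
                                  (declare (ignore name attributes seed))
                                  '())  ; start with no children
              :text-hook (lambda (string seed) (cons string seed))
              :finish-element-hook (lambda (name attributes parent-seed seed)
                                     (cons (list* name attributes (reverse seed))
                                           parent-seed))))

(defun count-elements (events)
  "Wrapper 2: same parser, but the seed is just a running element count."
  (run-events events
              :seed 0
              :new-element-hook (lambda (name attributes seed)
                                  (declare (ignore name attributes))
                                  (1+ seed))
              :text-hook (lambda (string seed)
                           (declare (ignore string))
                           seed)
              :finish-element-hook (lambda (name attributes parent-seed seed)
                                     (declare (ignore name attributes parent-seed))
                                     seed)))

;; Events for <item id="1">hello<b>x</b></item>
(defparameter *sample-events*
  '((:start :item ((:id . "1"))) (:text "hello")
    (:start :b ()) (:text "x") (:end :b) (:end :item)))

(events->tree *sample-events*)
;; => ((:ITEM ((:ID . "1")) "hello" (:B NIL "x")))
(count-elements *sample-events*)
;; => 2
```

The stand-in parser never learns what a tree looks like: swapping the set of hooks swaps the output, which is the whole point (and, arguably, the whole cost) of the design.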

The advantage of this design is that it doesn't constrain the user to any particular DOM tree representation. Personally, I find this so-called feature to be entirely useless, as the need to parse XML files into multiple tree formats using a single library looks like the perfect recipe for hallucinated freedom, not to mention the extra lines of (mostly dead) code added. This cleverness is more for its own sake than anything of substance, and thus eventually some hero or another will surgically excise this particular tumour.

Until then, however, the thing works, so it's all the better to publish it than to wait for the moment when said hero gets off his or her ass and makes the thing shine. Meanwhile, the more pressing matter for yours truly, and the next episode of this series, will involve publishing a small RSS/Atom parser based on S-XML DOM trees.

  1. This style, nowadays immediately recognizable by TMSR citizens as the "FFA style", draws from Asciilifeform's Finite Field Arithmetic series.

  2. It's been powering Feedbot for some months now.

  3. Corresponding, respectively, to a defstruct-based format, Franz's LXML format and the so-called SXML. The latter two structure XML markup as S-expressions. Currently, Feedbot's RSS parser uses LXML, for no reason in particular other than it being the implicit option.
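
For a rough idea of how the two S-expression formats differ, the sketch below writes out by hand what I understand each shape to be for the same small document. The literals are my own assumption about the formats, not output captured from S-XML, and the helpers lxml-tag and sxml-tag are hypothetical names introduced only for illustration.

```lisp
;; <foo id="top"><bar>text</bar></foo>, hand-written in both shapes.
;; Assumed layouts, not captured from the library.

;; LXML: attributes ride along with the tag, ((tag attr value ...) child...);
;; an element without attributes is just (tag child...).
(defparameter *as-lxml* '((:|foo| :|id| "top") (:|bar| "text")))

;; SXML: attributes sit in a separate (:@ ...) sublist after the tag.
(defparameter *as-sxml* '(:|foo| (:@ (:|id| "top")) (:|bar| "text")))

(defun lxml-tag (element)
  "Tag of an LXML element: a bare symbol, (tag ...) or ((tag attr value ...) ...)."
  (cond ((symbolp element) element)
        ((consp (first element)) (first (first element)))
        (t (first element))))

(defun sxml-tag (element)
  "Tag of an SXML element, assuming it is always a list headed by the tag."
  (first element))

(lxml-tag *as-lxml*)            ; the root tag, from the nested head
(lxml-tag (second *as-lxml*))   ; the child tag, from the flat form
(sxml-tag *as-sxml*)            ; the root tag
```

The practical difference for a consumer such as an RSS parser is that LXML code has to check which of the element forms it was handed, while the SXML shape keeps the tag in one place.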