Hunchentoot: requests and replies [a]

09c August 23, 2019 -- (tech tmsr)

This post is part of a series on Common Lisp WWWism, more specifically dissecting the Common Lisp web server known as Hunchentoot.

This post is a two-parter (see below why) that will discuss the objects known as "requests" and "replies", as they are part of the very same fundamental mechanism.

The reader has probably noticed that little to nothing has been discussed about the core of this whole orchestra, the core being the HTTP piece -- yes, we're what looks like more than halfway through this series and most of what we've discussed comprises TCPisms and CL-on-PC. So let's begin our incursion into the subject with a likbez:

The idea behind HTTP is simple: let there be a network of nodes N; let the nodes be divided into client nodes C and server nodes S [1]; let every node s in S be associated with a set of resources R_s. In this framework, HTTP specifies a means for a client c in C to access a resource r in R_s, knowing that s in S is a server. Furthermore, it allows a c to interact with an r owned by s in other ways, such as by "posting" data to that r. Additionally, newer specifications of the protocol have introduced other so-called "methods" of interaction; I will deliberately omit them, both for the sake of brevity and because all or most of the "additional" stuff is to be burned down and left out of any such future "HTTP" protocol [2].

So let's say that a c in C wants to interact with an s in S as per above. Given that c and s communicate using HTTP, c will send s a message called a request, which contains the resource to be accessed, the method and other such information, as specified in the RFC. Upon receiving this request message, s will respond to c with a message called a response, which contains a status code, the message size, the data and so on, again as per the RFC. This is then (as viewed from the airplane) the whole mechanism that our Hunchentoot needs to implement in order to be able to communicate with curls, web clients, proxies and so on: receiving and processing requests; and baking and sending replies. Note that Hunchentoot merely provides the mechanism for this; the actual policy (i.e. whether a resource is to be associated with a file on the disk, or a set of database entries or whatever) is implemented by the user.
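To make the request message a bit more concrete, here's a minimal sketch of pulling apart the first line of a request. The name parse-request-line is made up for illustration, and the sketch assumes well-formed input with exactly the three space-separated fields; it is not how Hunchentoot does it.

```lisp
;; Hypothetical helper, for illustration only.
(defun parse-request-line (line)
  "Split a line such as \"GET /index.html HTTP/1.0\" into three values:
the method (as a keyword), the resource and the protocol version.
Assumes LINE is well-formed; no error checking is done."
  (let ((first-space (position #\Space line))
        (last-space (position #\Space line :from-end t)))
    (values (intern (subseq line 0 first-space) :keyword)
            (subseq line (1+ first-space) last-space)
            (subseq line (1+ last-space)))))
```

Calling it on "GET /index.html HTTP/1.0" yields the keyword for the method, then the resource string, then the protocol string.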

In what is becoming a traditional tarpitian style of documenting code, we will move directly to:

[[]] request: Object holding everything pertaining to an HTTP request: headers, a method, local/remote address/ports, the protocol version, a socket stream, GET/POST parameters, the resource being requested; additionally: the raw POST data, a reference to the acceptor on which the request was made, "auxiliary data" to be employed by the user however he or she wishes.

Note that (somewhat counter-intuitively) request parsing and object creation don't occur in request.lisp, but upstream in process-connection; more specifically, headers are parsed in headers.lisp, in get-request-data [3]. So then what does request.lisp do? Well, it: defines the data structure; implements a lot of scaffolding for GET and POST parameter parsing; implements a user interface for request handlers; and finally, it creates a context in which request handlers can be safely executed, i.e. if something fails, execution can be unwound to the place where handle-request was called and an error response can be logged and returned. Let's look at each of these pieces.
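The "safe execution context" idea can be sketched in a few lines of plain Common Lisp. This is only an illustration of the unwinding pattern: call-handler-safely, the error body and the literal status codes are my own assumptions, not Hunchentoot's actual names or mechanism.

```lisp
;; Hypothetical wrapper, for illustration only.
(defun call-handler-safely (handler)
  "Call HANDLER, a function of no arguments returning a response body.
Return two values: the body and a status code. If HANDLER signals an
error, the stack is unwound to this point, the error is logged and an
error body is returned together with status 500."
  (handler-case (values (funcall handler) 200)
    (error (condition)
      (format *error-output* "~&handler failed: ~A~%" condition)
      (values "Internal Server Error" 500))))
```

A handler that returns normally passes its result through untouched; one that blows up never gets to take the whole server down with it.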

The first set of functions deals with parameter parsing. In particular, GET parameter parsing is performed when a request object is instantiated, while POST parameters are parsed on demand, i.e. when the accessor method is called. Let's see:

[ii2] initialize-instance: Similarly to other pieces of code under review, this is an ":after" method which gets called immediately after an object instantiation. a. an error handling context is defined; in which b. script-name and query-string are set based on the request URI [4]; c. get-parameters are set [5]; d. cookies-in are set [6]; e. session is set [7]; and finally, if anything fails, f. log-message* is called to log the error and return-code* is set to http-bad-request.
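Steps (b) and (c) can be sketched as follows. All three function names are hypothetical, URL decoding is deliberately left out, and representing a missing value as the empty string is just a choice made for this sketch.

```lisp
;; Hypothetical helpers, for illustration only; no URL decoding is done.
(defun split-on (char string)
  "Split STRING at every occurrence of CHAR, returning a list of substrings."
  (loop with start = 0
        for pos = (position char string :start start)
        collect (subseq string start pos)
        while pos
        do (setf start (1+ pos))))

(defun split-uri (uri)
  "Return two values: the part of URI before the first question mark
and the part after it (NIL when there is no question mark)."
  (let ((qmark (position #\? uri)))
    (if qmark
        (values (subseq uri 0 qmark) (subseq uri (1+ qmark)))
        (values uri nil))))

(defun query-string-to-alist (query-string)
  "Turn \"p1=v1&p2=v2\" into ((\"p1\" . \"v1\") (\"p2\" . \"v2\")).
Empty elements are skipped; a name without a value gets the empty string."
  (loop for pair in (and query-string (split-on #\& query-string))
        for eq-pos = (position #\= pair)
        unless (string= pair "")
          collect (cons (subseq pair 0 eq-pos)
                        (if eq-pos (subseq pair (1+ eq-pos)) ""))))
```

Feeding the second value of split-uri into query-string-to-alist gives the association list that ends up playing the role of get-parameters.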

By the way, since HTTP hasn't escaped Unicode, URL decoding needs a character encoding; this is derived from the content-type field in the header, using the external-format-from-content-type function.

[effct] external-format-from-content-type: Takes a content-type string as an argument; if this argument is non-nil, then take the charset from parse-content-type [8] and try to convert the result into a flexi-streams "external-format" via make-external-format. If this fails, signal a warning.
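For a feel of what the charset extraction involves, here's a self-contained sketch in plain Common Lisp. The names parse-content-type-sketch and charset-to-keyword are hypothetical, the parsing is far more naive than the real thing, and the last step (building an actual flexi-streams external-format) is left out because it needs the library.

```lisp
;; Hypothetical helpers, for illustration only.
(defun split-on (char string)
  "Split STRING at every occurrence of CHAR, returning a list of substrings."
  (loop with start = 0
        for pos = (position char string :start start)
        collect (subseq string start pos)
        while pos
        do (setf start (1+ pos))))

(defun parse-content-type-sketch (header)
  "Return three values for a header such as \"text/html; charset=utf-8\":
the type, the subtype and the charset (NIL when no charset is given)."
  (destructuring-bind (media-type &rest parameters)
      (mapcar (lambda (s) (string-trim " " s)) (split-on #\; header))
    (let* ((slash (position #\/ media-type))
           (type (subseq media-type 0 slash))
           (subtype (and slash (subseq media-type (1+ slash))))
           (charset (loop for parameter in parameters
                          for eq-pos = (position #\= parameter)
                          when (and eq-pos
                                    (string-equal (subseq parameter 0 eq-pos)
                                                  "charset"))
                            return (string-trim "\""
                                                (subseq parameter (1+ eq-pos))))))
      (values type subtype charset))))

(defun charset-to-keyword (charset)
  "Turn a charset name such as \"utf-8\" into a keyword, or return NIL."
  (and charset (intern (string-upcase charset) :keyword)))
```

The keyword is the sort of thing one would then hand to make-external-format, inside a handler that catches unknown charsets.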

[mrpp] maybe-read-post-parameters: This does quite a bit of checking on the parameters it receives, namely it only does something when: the content-type header is set; and the request method is POST; and [the "force" parameter is set; or the raw-post-data slot is not set]; and the raw-post-data slot is not set to t -- to quote from a comment, "can't reparse multipart posts, even when FORCEd". Furthermore, for the function to do anything, the content-length header must be set or input chunking must be enabled; otherwise, a warning is logged and the function returns.

If all checks pass, then wrapped in a condition handler: a. parse-content-type (see footnote #8 for details), yielding a type, subtype and charset; b. pick an external-format: b1. the external-format parameter, if given; or, failing that, b2. one made from the charset found at (a); or, b3. if all fails, fall back to *hunchentoot-default-external-format*.

Once we have an external-format, c. populate the post-parameters slot: c1. if content-type is "application/x-www-form-urlencoded", then use form-url-encoded-list-to-alist (see footnote #5); c2. if content-type is "multipart/form-data", then it is parsed using parse-multipart-form-data.

Finally, d. if something fails in one of the previous steps, then d1. an error is logged; d2. the return-code is set to http-bad-request; and d3. the request is aborted.
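The pile of preconditions is easier to digest as a predicate. The sketch below takes plain keyword arguments instead of a request object; the function names and the argument list are my own invention, only the logic follows the description above.

```lisp
;; Hypothetical predicates, for illustration only.
(defun should-read-post-parameters-p (&key content-type method force raw-post-data)
  "True when the POST body is worth (re)parsing: a content-type is set,
the method is POST, either FORCE is given or nothing was read yet, and
RAW-POST-DATA is not T (the marker for an already-consumed multipart body)."
  (and content-type
       (eq method :post)
       (or force (not raw-post-data))
       (not (eq raw-post-data t))
       t))

(defun body-readable-p (&key content-length chunking)
  "True when the end of the body can be determined: either a
content-length is known or input chunking is enabled."
  (if (or content-length chunking) t nil))
```

Note how the fourth condition beats the third: forcing a reparse does nothing once the slot has been set to t.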

[pmfd] parse-multipart-form-data: a. in a condition-handling context; b. make a new content-stream with the external-format set to latin-1 [9]; then on that content-stream, c. parse-rfc2388-form-data; then d. get-post-data; and e. if the result from (d) is a non-empty string, it's considered "stray data" and reported; finally, f. if an error occurs, it's logged and nothing is returned.

Otherwise, the result from (c) is returned, as per prog1.

[prfd] parse-rfc2388-form-data: Fortunately for us, parsing multipart-blah-blah is encapsulated in yet another RFC of its own, for which there already exists a CL "library" [10]. Unfortunately, the coad written around said "library" is still kludgy. Let's see.

a. parse the content-type header; then b. look for a "boundary" content-type parameter, and return empty-handed if that doesn't exist; otherwise c. for each MIME part; d. get the MIME headers; and e. particularly, the content-disposition header; and f. particularly, the "name" field of that header.

g. when the item at (f) exists, append the following to the result: g1. the item at (f), converted using convert-hack; and g2. the contents, converted using the same convert-hack. However, mime-part-contents can return either [11] g2i. a path to a local file, in which case the coad stores the path, the (converted) filename and its content-type; or g2ii. a string, in which case the (converted) string is stored.
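The two shapes a result entry can take, as per (g2i) and (g2ii), can be sketched like this. The function name and its keyword arguments are hypothetical, the conversion step is omitted, and the exact list layout is my reading of the description rather than a quote from the code.

```lisp
;; Hypothetical constructor, for illustration only.
(defun post-parameter-entry (name contents &key filename content-type)
  "Build one result entry from a part NAME and its CONTENTS.
A pathname (the part was spooled to a file) yields
  (name path filename content-type)
while a string yields the plain pair (name . string)."
  (etypecase contents
    (pathname (list name contents filename content-type))
    (string (cons name contents))))
```

Whoever consumes the result thus has to check whether the value of a parameter is a string or a list, which is about as pleasant as it sounds.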

[ch] convert-hack: You might wonder what this does and why it exists in the first place. Let's quote from the documentation itself:

The rfc2388 package is buggy in that it operates on a character stream and thus only accepts encodings which are 8 bit transparent. In order to support different encodings for parameter values submitted, we post process whatever string values the rfc2388 package has returned.

I don't know what the fuck "8 bit transparent" means, but the function does exactly this: it converts the input string to a raw vector of octets, then converts said vector (using octets-to-string) to a string of the encoding given by the external-format parameter. So this is just dancing around the previous latin1 game -- yes, if you send a UTF-8-encoded file wrapped in a (ISO-8859-1-encoded) POST request, the result will be mixed-encoding data, and whoever gets said data will have to make heads or tails of the resulting pile of shit.
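Here's the same round trip in plain Common Lisp, with UTF-8 hard-coded as the target encoding. Everything in it is an assumption made for illustration: the function names are mine, the decoder is hand-rolled and does no validation (the real code relies on flexi-streams' octets-to-string), and the Lisp implementation is assumed to support Unicode characters. Presumably this is also what "8 bit transparent" is getting at: Latin-1 maps each octet to the character with the same code, so a string read that way can be turned back into the original octets without loss.

```lisp
;; Hypothetical helpers, for illustration only.
(defun string-to-latin-1-octets (string)
  "Map every character of STRING to its code. Only meaningful when all
codes are below 256, i.e. when STRING was read as Latin-1."
  (map '(vector (unsigned-byte 8)) #'char-code string))

(defun utf-8-octets-to-string (octets)
  "Decode OCTETS as UTF-8. No validation is done: malformed input
yields garbage or signals an error."
  (with-output-to-string (out)
    (let ((i 0)
          (n (length octets)))
      (loop while (< i n)
            do (let* ((b (aref octets i))
                      ;; number of continuation octets that follow
                      (extra (cond ((< b #x80) 0)
                                   ((< b #xE0) 1)
                                   ((< b #xF0) 2)
                                   (t 3)))
                      ;; payload bits of the leading octet
                      (code (case extra
                              (0 b)
                              (1 (logand b #x1F))
                              (2 (logand b #x0F))
                              (t (logand b #x07)))))
                 (loop repeat extra
                       do (incf i)
                          (setf code (logior (ash code 6)
                                             (logand (aref octets i) #x3F))))
                 (incf i)
                 (write-char (code-char code) out))))))

(defun convert-hack-sketch (string)
  "Reinterpret a string that was read as Latin-1 as UTF-8 text."
  (utf-8-octets-to-string (string-to-latin-1-octets string)))
```

So a two-character Latin-1 sequence with codes #xC3 and #xA9 comes out as the single character with code #xE9, which is what the sender meant in the first place.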

I can't wait for the moment when the ban on this multipart fungus comes into effect, it'll be a joyous day.

[gpd] get-post-data: Reads data from the request stream and sets the raw-post-data slot:

a. if the want-stream argument is set, then the stream is converted to latin-1-encoded (as per above) and the slot is set to this stream, bound by the content-length (if this field exists).

b. if content-length is set and it's greater than the already-read argument -- i.e. there is still data to be read from the stream, assuming the user has already read some of it -- then check whether chunking is enabled and, if so, log a warning; either way, read the content and let it be assigned to raw-post-data.

c. if chunking is enabled, then c1. set up two arrays: an adjustable "content" array and a buffer; c2. set up a position marker for the content array; c3. read into the buffer; then c4. adjust the content array to the new size; then c5. copy data from the buffer into the content array at the current position; and finally, c6. stop when there's no more content to be read.
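The loop at (c) looks roughly like the sketch below. To keep it self-contained I've made it work on a character stream with a string buffer, whereas the real thing reads octets; the function name, the buffer-size parameter and the stop-on-empty-read condition are assumptions of this sketch.

```lisp
;; Hypothetical reader, for illustration only; works on characters, not octets.
(defun read-all-in-chunks (stream &key (buffer-size 4))
  "Read everything from character STREAM through a fixed-size buffer,
growing an adjustable array as data comes in. Return that array."
  (let ((content (make-array 0 :element-type 'character :adjustable t))
        (buffer (make-string buffer-size))
        (end 0))                        ; position marker into CONTENT
    (loop for count = (read-sequence buffer stream)
          until (zerop count)           ; nothing read: no more content
          do (setf content (adjust-array content (+ end count)))
             (replace content buffer :start1 end :end2 count)
             (incf end count))
    content))
```

For instance, reading "hello, world" with a buffer of five characters takes three non-empty reads (5, 5 and 2 characters) followed by an empty one that ends the loop.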

As you can well see, I am running out of space, so contrary to the schedule I'm going to split this into two pieces, the second part to be published next week. Annoyingly enough, this is also delaying other work, including the fabled tarpitian-comment-server, so for now the venues for comments remain #spyked and (if you know where you're heading) #trilema.

P.S.: As per discussion in the forum, the next item in the Hunchentoot series, following this "requests and replies" miniseries, will be a genesis for the whole thing.

  1. In practice some c in C can also be a s in S and vice-versa, why not? In a sane world C and S would be the same set, and thus our client-server-herpderp model would become that of a peer-to-peer network, that is, one in which all nodes would both host resources and ask for them. Again, I ask: why not? And if not, then pray tell, why does the Incan star topology appeal so much to you?

  2. Witness, just as an example, the difference between RFCs 1945 and 2616: the former specifies precisely three methods -- of which the second is, for purely practical reasons, a subset of the first -- while the latter... well!

    By the way, do you think this is all? Nope: current specifications split HTTP into no fewer than seven parts, which makes this a tome of its own. And as if this wasn't enough, as of 2019 RFC 7230 has been updated by 8615, and if this keeps up at the current pace God help us, by 2029 we'll probably get to RFCs numbered in the hundreds of thousands.

    Long story short, fuck these sons of bitches and all their cancerous "improvements".

  3. And here's where I find out I've actually been reading all this in the correct order, given that I know where this particular bit occurs and I don't need to spend hours digging into finding out. Pretty neat, huh?

    Now how 'bout you get a blog and start doing this for the coad that you're using? Wouldn't that be neat?

  4. Given a URI of the general form http://my-resource.tld?p1=v1&p2=v2..., script-name denotes the part before the question mark, while query-string denotes the part after it.

  5. The query-string is split by ampersands and passed to form-url-encoded-list-to-alist, which takes this list and splits each element by the equals sign. Thus the string p1=v1&p2=v2... ends up being represented as the association list:

    (("p1" . "v1") ("p2" . "v2") ...)

  6. The process is similar to the previous footnote. The cookie string is split and passed to cookies-to-alist which does pretty much the same thing as the previously-described function, only there's no URL decoding going on.

  7. I've set to deliberately omit this part since the beginning, so I won't go into details here.

  8. Did I by any chance mention HTTP has grown into a tumour-like structure? As but one of many examples of malignant cells: the "content-type" field contains a type/subtype part (e.g. "text/plain", or "application/octet-stream" or whatever); but it also contains parameters, such as a charset, or "multipart" markers, into which I won't get just yet, because my blood pressure is already going up.

    Anyway, parse-content-type reads all these and returns a type, a subtype and a charset, if it exists.

    By the by, unlike the previous footnotes, the "parameter" alist is constructed using Chunga's read-name-value-pairs, which seems to do more or less the same thing as those functions. So where the fuck does all this duplication come from?

  9. This looks confusing after all the previous "external-format" fudge. Note that the content has a user-provided format, while the multipart-blah-blah syntax is by default Latin1. This is not fucking amusing, I know.

  10. Written by one Janis Dzerins. There's also a variation/"improvement" on this coad, rfc-2388-binary, which is used by another Lisp piece that I have on my disk, I forgot which one.

    Speaking of which, I expect I'll run into more of these "duplicate libraries" problems in the future, which will require porting/adaptation work. No, I'm not gonna use a thousand libraries for XML parsing, there's one, and that's it. If you're gonna complain, then you'd better have a good reason, and be ready to explain where the fuck you were when I needed that code.

  11. Y'know, I didn't set out to review this piece of code back when I started this, but it can't be helped. mime-part-contents is the "-contents" part of a defstruct built using parse-mime. Now, this code behaves in (what I believe to be) a very odd manner, namely: when it encounters arbitrary string input, it returns it as-is; the alternative is to encounter a MIME part that contains a filename field, in which case this coad creates a new temporary file and stores the content there; in this case, the mime-part structure will contain a bunch of headers and a path to the contents, which is an unexpected extra layer of indirection, because why the fuck not.

    This behaviour can be overridden by setting the "write-content-to-file" parameter to nil. However, this is the default behaviour, and what Hunchentoot expects. Fuck my life.