Gutenberg ASCII archive updated, now with 0.5% less junk

March 9, 2019 by Lucian Mogosanu

The updated ASCII text archive of Project Gutenberg is available at lucian.mogosanu.ro and mirrored at nosuchlabs.com.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

$ ksum gutentext.tar.xz > gutentext.tar.xz.ksum
$ cat gutentext.tar.xz.ksum
e0e3bbc7677365f8503a5a10c7e1a3ab28864dd7c8e87d1aace7b11c4985b4dd3383dfca616ee96ad0cc1f1d36312b0c7ba544a29189ed2d2ea36cb13a687df9  gutentext.tar.xz
-----BEGIN PGP SIGNATURE-----

iQIcBAEBCgAGBQJcg7CjAAoJEL2unQUaPTuVMwIQAJV7J8JeHuiU6ZnqXSAdO4aC
n7nzs4mchEGEaGXttTMFnunvaZ5GgpejGB/puGwYKjhUXPcTdoGgPMowLJV4F4Y4
Ispev9b6K2/7AIDTbAZEI+rqkf1aE4sJob68KjBQjrOFgBNbgEvCHIjdTY7x/zbp
Jz19yo06/31E8TUUMDTW2BSwPC4gzAK15OBvSwjE6fUJVIt/ffMs1y/HX++09jO0
H/1bYEdQ9WOSGxHkO7siSaQa3uKyW6K7Le3XPK+bp4XGJX4z0k7ZNAOC7Ard8Izl
uLJi4ROtOV+UqLv7oR2cPXOgSCNEWnnqNxyRopHOesUB1rdboGylYmTC/z49qXS2
nqCmzvu9xCxApkhv6oxf9swhSTpw/2c6ioP5Ze/LBWEoVUe3l8EWtc9TIAUXpXu7
Bfk6XUhRFGpLC46Y7MG8Bj3bOmypu5lH3ksgo5QaUkcVecRpOYj6Mp2IWHlvNgvp
cnd0iuiIaK24rIl74elEvi2xyN3W8IwGtYhv8CBwYI1rvnsDcJrWD7xvQGVwf/SD
PiklJ/IM2Dev4AJubnT0U3N5xdo7mXVBhzu3Ky4qjJiRl1CYcdVNHxPRxT8XHqay
KzLvY6NgfFrdpL5uGRop9F5qi0Ax1dNlg4u/6oqd7ryKA6g5X8nuQi7IHMc4B3J+
pGobf0U4ao162tm9jyBV
=Jy3B
-----END PGP SIGNATURE-----

Read further for details, technical or otherwise.

The differences between this version and the previous one mostly amount to removing the crap that bled the reader's eyes, or, to be more precise:

mircea_popescu: BingoBoingo and it is VERY HARMFUL fucking junk. having "All donations should be made to "Project Gutenberg/CMU": and are tax deductible to the extent allowable by law. (CMU = Carnegie- Mellon University)." or "Copyright laws are changing all over the world, be sure to check the copyright laws for your country before posting these files!!" in the lede of "The Merchant of Venice by William Shakespeare" promotes a most harmful and in any case uncountenable view whereby the fucktarded usgistan is at least more important than fucking shakespeare.
mircea_popescu: it very well fucking is not. it's not even remotely as important. having usg.cmu or usg.anything-else spew on actual literature is nothing short of vandalism. i don't want their grafitti, and i don't care why they think they're owed it.
mircea_popescu: this without even going into ridiculous nonsense a la "We produce about two million dollars for each hour we work. The time it takes us, a rather conservative estimate, is fifty hours to get any etext selected, entered, proofread, edited, copyright searched and analyzed, the copyright letters written, etc. This projected audience is one hundred million readers. If our value per text is nominally estimated at one dollar then we produce $2 million dollars per hour" ; apparently nobody fucking there bothered to EVER confront http://btcbase.org/log/2017-05-15#1656097 or http://btcbase.org/log/2017-07-15#1684170 etcetera.

Taking a cursory look at some of the books, one can immediately notice a message along the lines of:

*** START OF THIS PROJECT GUTENBERG EBOOK yadda-yadda ***

or maybe:

***START**THE SMALL PRINT!**FOR PUBLIC DOMAIN EBOOKS**START***
... screens of legalese, followed by:
*END*THE SMALL PRINT! FOR PUBLIC DOMAIN ETEXTS*Ver.04.29.93*END*

which brings to mind the name of one Guillotin, and of one device that sharply separates the head(er) from the body -- in our case, the embodiment of this device being ye olde text processing tools grep, head and tail. We can thusly get some preliminary results by running the following ugly but very effective bash snippet:

marker='\*\*\* ?START OF (THIS|THE) PROJECT GUTENBERG EBOOK'

# For each file, attempt guillotine
find "${GUTENDIR}" -name '*.txt' | while read -r f; do
        fname=$(basename "${f}")
        dirname=$(dirname "${f}" | sed "s|^${GUTENDIR}/||")

        # Look for end-header marker; -E because the marker is an
        # extended regex, -m 1 to stop at the first match
        mloc=$(grep -E -n -m 1 "${marker}" "${f}")
        if [ -n "${mloc}" ]; then
                # If found, say something
                >&2 echo "${dirname}/${fname} -- found:${mloc}"

                # Copy guillotined file from source to target directory;
                # comment the lines below to do a dry run.
                linenum=$(echo "${mloc}" | cut -d":" -f1)

                mkdir -p "${TARGETDIR}/${dirname}"

                >&2 echo "Guillotining ${f} into ${TARGETDIR}/${dirname}/${fname}"
                tail -n +$((linenum + 1)) "${f}" > "${TARGETDIR}/${dirname}/${fname}"
        else
                # If not found, say something else
                >&2 echo "${dirname}/${fname} -- not-found"

                # Copy file as-is; comment the lines below to do a dry run.
                mkdir -p "${TARGETDIR}/${dirname}"
                >&2 echo "Copying ${f} to ${TARGETDIR}/${dirname}/${fname}"
                cp "${f}" "${TARGETDIR}/${dirname}/${fname}"
        fi
done

After giving this a try, we observe that at some point our script fails, because for some files the output of grep is "Binary file ... matches" instead of what we'd expect. The reason for this is that some of the files are not actually ASCII, so we go and sed the letters to their ASCII equivalents (e.g. é to e), or, where orcograms cannot be easily disposed of, we use some encoding that our grep can recognize.
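One way of doing this transliteration -- a sketch only, as the post doesn't show the exact commands, and the character list below is purely illustrative -- is a blunt sed substitution:

```shell
# Illustrative sketch: map a few common accented letters to their
# ASCII equivalents; the actual set of characters handled is an
# assumption, not shown in the post.
sed 's/é/e/g; s/è/e/g; s/ê/e/g; s/à/a/g; s/ç/c/g' orcbook.txt > asciibook.txt
```

For files that grep still insists on calling binary, GNU grep's -a flag forces it to treat them as text.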

This first episode successfully concluded, we rerun the script, only to find that not all headers have been processed, judging by the files left in the "not-found" set. We run once again, setting $marker to some other relevant value, whereupon we observe that some output files in $TARGETDIR are now empty.
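For instance -- the exact regex here being my assumption, not necessarily the one used -- a later pass could target the "small print" end marker quoted earlier:

```shell
# Assumed second-pass marker, matching the "small print" end line;
# the guillotine loop is then rerun with this value.
marker='\*END\*THE SMALL PRINT'
grep -E -n -m 1 "${marker}" "${f}"
```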

The reason for this is that not only is there no single "standard" copyrast header inserted into each book, but some "headers" are actually footers. Thus we use a heuristic to determine whether the end marker of a "small print" notice lies at the beginning or at the end of a file1 and we cut accordingly, leaving us with another batch of non-standard copyshit snippets.
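When the marker turns out to sit at the end, the cut is the mirror image of the one in the main loop -- head instead of tail. A sketch, reusing the same variable names:

```shell
# If the marker is a footer, keep only the lines above it
# (linenum is the marker's line number, as in the main loop):
head -n $((linenum - 1)) "${f}" > "${TARGETDIR}/${dirname}/${fname}"
```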

After repeated pruning, we're left with a few (96, more precisely) files that we -- by which "we" mean I -- checked manually to make sure that they're clean, which they were. Hopefully I didn't miss anything, please complain if I did.

Bottom line: a total of 46150 files were processed, of which 46028 were deheadered, leaving us with 96 books that had no headers to begin with (or readmes, addenda to musical scores, etc.) and 26 index files. The total size of the headers was a not-so-measly 88.9MB that are now forever gone into the void.
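The tallies above do add up, as a one-liner confirms:

```shell
# deheadered + headerless books/readmes/etc. + index files = total
echo $((46028 + 96 + 26))   # prints 46150
```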


  1. For example:

    mloc=$(grep -E -n -m 1 "${marker}" "${f}")
    tnloc=$(wc -l < "${f}")
    mnloc=$(echo "${mloc}" | cut -d":" -f1)
    if [ $((tnloc - mnloc)) -le 10 ]; then
            echo "at-end"
    else
            echo "at-beginning"
    fi
    

    where the magic value of 10 is conveniently chosen. And so on. 
