Gutenberg ASCII archive updated, now with 0.5% less junk

088 March 9, 2019 -- (tmsr)

The updated ASCII text archive of Project Gutenberg is available at lucian.mogosanu.ro and mirrored at: nosuchlabs.com.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

$ ksum gutentext.tar.xz > gutentext.tar.xz.ksum
$ cat gutentext.tar.xz.ksum
-----BEGIN PGP SIGNATURE-----

n7nzs4mchEGEaGXttTMFnunvaZ5GgpejGB/puGwYKjhUXPcTdoGgPMowLJV4F4Y4
Ispev9b6K2/7AIDTbAZEI+rqkf1aE4sJob68KjBQjrOFgBNbgEvCHIjdTY7x/zbp
Jz19yo06/31E8TUUMDTW2BSwPC4gzAK15OBvSwjE6fUJVIt/ffMs1y/HX++09jO0
H/1bYEdQ9WOSGxHkO7siSaQa3uKyW6K7Le3XPK+bp4XGJX4z0k7ZNAOC7Ard8Izl
uLJi4ROtOV+UqLv7oR2cPXOgSCNEWnnqNxyRopHOesUB1rdboGylYmTC/z49qXS2
nqCmzvu9xCxApkhv6oxf9swhSTpw/2c6ioP5Ze/LBWEoVUe3l8EWtc9TIAUXpXu7
Bfk6XUhRFGpLC46Y7MG8Bj3bOmypu5lH3ksgo5QaUkcVecRpOYj6Mp2IWHlvNgvp
cnd0iuiIaK24rIl74elEvi2xyN3W8IwGtYhv8CBwYI1rvnsDcJrWD7xvQGVwf/SD
PiklJ/IM2Dev4AJubnT0U3N5xdo7mXVBhzu3Ky4qjJiRl1CYcdVNHxPRxT8XHqay
KzLvY6NgfFrdpL5uGRop9F5qi0Ax1dNlg4u/6oqd7ryKA6g5X8nuQi7IHMc4B3J+
pGobf0U4ao162tm9jyBV
=Jy3B
-----END PGP SIGNATURE-----

Read further for details, technical or otherwise.

The differences brought by this version over the previous one have to do mostly with a lack of crap to bleed the reader's eyes, or, to be more precise:

mircea_popescu: BingoBoingo and it is VERY HARMFUL fucking junk. having "All donations should be made to "Project Gutenberg/CMU": and are tax deductible to the extent allowable by law. (CMU = Carnegie- Mellon University)." or "Copyright laws are changing all over the world, be sure to check the copyright laws for your country before posting these files!!" in the lede of "The Merchant of Venice by William Shakespeare" promotes a most harmful and in any case uncountenable view whereby the fucktarded usgistan is at least more important than fucking shakespeare.
mircea_popescu: it very well fucking is not. it's not even remotely as important. having usg.cmu or usg.anything-else spew on actual literature is nothing short of vandalism. i don't want their grafitti, and i don't care why they think they're owed it.
mircea_popescu: this without even going into ridiculous nonsense a la "We produce about two million dollars for each hour we work. The time it takes us, a rather conservative estimate, is fifty hours to get any etext selected, entered, proofread, edited, copyright searched and analyzed, the copyright letters written, etc. This projected audience is one hundred million readers. If our value per text is nominally estimated at one dollar then we produce $2 million dollars per hour" ; apparently nobody fucking there bothered to EVER confront http://btcbase.org/log/2017-05-15#1656097 or http://btcbase.org/log/2017-07-15#1684170 etcetera.

Taking a cursory look at some of the books, one can immediately notice a message along the lines of:

*** START OF THIS PROJECT GUTENBERG EBOOK yadda-yadda ***

or maybe:

***START**THE SMALL PRINT!**FOR PUBLIC DOMAIN EBOOKS**START***

... screens of legalese, followed by:

*END*THE SMALL PRINT! FOR PUBLIC DOMAIN ETEXTS*Ver.04.29.93*END*

which brings to mind the name of one Guillotin, and of one device that sharply separates the head(er) from the body -- in our case, the embodiment of this device being ye olde text processing tools grep, head and tail. We can thusly get some preliminary results by running the following ugly but very effective bash snippet:

marker='\*\*\* \?START OF \(THIS\|THE\) PROJECT GUTENBERG EBOOK'

# For each file, attempt guillotine
find "${GUTENDIR}" -name '*.txt' | while read -r f; do
    fname=$(basename "${f}")
    # Path relative to GUTENDIR; '|' as the sed delimiter so that
    # slashes in GUTENDIR do not break the expression
    dirname=$(dirname "${f}" | sed "s|^${GUTENDIR}/||")

    # Look for end-header marker
    mloc=$(grep -n -m 1 "$marker" "$f")
    if [ ! -z "${mloc}" ]; then
        # If found, say something
        >&2 echo "$dirname/$fname -- found: $mloc"

        # Copy guillotined file from source to target directory;
        # comment the lines below to do a dry run.
        linenum=$(echo "$mloc" | cut -d":" -f1)

        mkdir -p "${TARGETDIR}/${dirname}"

        >&2 echo "Guillotining ${f} into ${TARGETDIR}/${dirname}/${fname}"
        tail -n +$(($linenum + 1)) "${f}" > "${TARGETDIR}/${dirname}/${fname}"
    else
        >&2 echo "$dirname/$fname -- not-found"

        # Copy file as-is; comment the lines below to do a dry run.
        mkdir -p "${TARGETDIR}/${dirname}"
        >&2 echo "Copying ${f} to ${TARGETDIR}/${dirname}/${fname}"
        cp "${f}" "${TARGETDIR}/${dirname}/${fname}"
    fi
done
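
To see what the marker matches and what the cut leaves behind, here is a minimal sketch on a made-up three-line "book". It assumes GNU grep (for \? and \| in basic regexes); the guillotine function name and the sample text are hypothetical and only serve the illustration.

```shell
# Marker for the end of the header, as a GNU grep basic regex.
marker='\*\*\* \?START OF \(THIS\|THE\) PROJECT GUTENBERG EBOOK'

# Hypothetical helper: print everything after the first marker line,
# or the whole file unchanged if no marker is found.
guillotine() {
    f=$1
    linenum=$(grep -n -m 1 -e "$marker" "$f" | cut -d: -f1)
    if [ -n "$linenum" ]; then
        tail -n +$((linenum + 1)) "$f"
    else
        cat "$f"
    fi
}

# Made-up sample: one line of junk, the marker, one line of body.
sample=$(mktemp)
printf '%s\n' \
    'All donations are tax deductible.' \
    '*** START OF THIS PROJECT GUTENBERG EBOOK THE MERCHANT OF VENICE ***' \
    'ACT I' > "$sample"

guillotine "$sample"    # prints: ACT I
rm -f "$sample"
```

The space after the three asterisks is optional in the pattern, so a line starting with ***START OF THE PROJECT GUTENBERG EBOOK is caught as well.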

After giving this a try, we observe that at some point our script fails, because for some files the output of grep is "Binary file ... matches" instead of what we'd expect. The reason for this is that some of the files are not actually ASCII, so we go and sed the letters to their ASCII equivalents (e.g. é to e), or, where orcograms cannot be easily disposed of, we use some encoding that our grep can recognize.
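
A rough sketch of how such files can be spotted and flattened follows. It is not the exact procedure used for the archive: it uses tr instead of sed, it assumes the offending file is encoded as ISO-8859-1 (which has to be checked per file), and the function names and the four-letter mapping are made up for the example.

```shell
# Count the bytes that fall outside the 7-bit ASCII range;
# anything above zero means the file is not plain ASCII.
count_non_ascii() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

# Assumption: the input is ISO-8859-1, one byte per letter.
# Octal 351 = e-acute, 350 = e-grave, 340 = a-grave, 374 = u-umlaut.
latin1_to_ascii() {
    LC_ALL=C tr '\351\350\340\374' 'eeau' < "$1"
}

sample=$(mktemp)
printf 'caf\351 cr\350me\n' > "$sample"

count_non_ascii "$sample"    # prints: 2
latin1_to_ascii "$sample"    # prints: cafe creme
rm -f "$sample"
```

GNU grep also has an -a switch that forces a file to be treated as text, which avoids the "Binary file" message without touching the file.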

This first episode successfully ended, we rerun the script, finding out that not all headers have been processed, judging by the fact that they're in the "not-found" set. We run once again, setting $marker to some other relevant value, and upon this we observe that some output files in $TARGETDIR are now empty.
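
Instead of editing $marker by hand between runs, the candidates can be tried in order. The sketch below is one way to do it, assuming GNU grep; first_marker_line is a hypothetical helper, and the second pattern is only an example taken from the "small print" notice quoted earlier, not necessarily the value used in the actual reruns. The last lines also show where the empty files come from: when the marker sits on the last line, nothing is left after it.

```shell
# Hypothetical helper: print the line number of the first marker
# (tried in the order given) that matches; return 1 if none does.
first_marker_line() {
    file=$1
    shift
    for m in "$@"; do
        loc=$(grep -a -n -m 1 -e "$m" "$file" | cut -d: -f1)
        if [ -n "$loc" ]; then
            echo "$loc"
            return 0
        fi
    done
    return 1
}

m1='\*\*\* \?START OF \(THIS\|THE\) PROJECT GUTENBERG EBOOK'
m2='\*END\*THE SMALL PRINT'

# Made-up file whose only marker is on its last line.
sample=$(mktemp)
printf '%s\n' 'Some book' 'THE END' \
    '*END*THE SMALL PRINT! FOR PUBLIC DOMAIN ETEXTS*Ver.04.29.93*END*' > "$sample"

loc=$(first_marker_line "$sample" "$m1" "$m2")
echo "$loc"                                  # prints: 3
tail -n +$((loc + 1)) "$sample" | wc -c      # zero bytes survive the cut
rm -f "$sample"
```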

The reason for this is that not only is there no single "standard" copyrast header inserted into each book, but some "headers" are actually footers. Thus we use a heuristic to determine whether the end marker of a "small print" notice is at the beginning or the end of a file [1] and we cut accordingly, leaving us with another batch of non-standard copyshit snippets.

After repeated pruning, we're left with a few (96, more precisely) files that we -- and by "we" I mean I -- checked manually to make sure that they're clean, which they were. Hopefully I didn't miss anything; please complain if I did.

Bottom line: a total of 46150 files were processed, of which 46028 were deheadered, leaving us with 96 books that had no headers to begin with (or readmes, addenda to musical scores, etc.) and 26 index files. The total size of the headers was a not-so-measly 88.9MB that are now forever gone into the void.
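
For anyone wishing to reproduce the figures, the sketch below checks the tally and shows one way to measure what was cut. How the 88.9MB figure was actually obtained is not stated here, so the byte-difference method, the function names and the directory arguments are all assumptions.

```shell
# Total size, in bytes, of all .txt files under a directory.
tree_bytes() {
    find "$1" -type f -name '*.txt' -exec cat {} + | wc -c | tr -d ' '
}

# Bytes removed = size of the source tree minus size of the target tree.
bytes_removed() {
    echo $(( $(tree_bytes "$1") - $(tree_bytes "$2") ))
}

# Sanity check on the tally: deheadered + headerless + index files.
echo $((46028 + 96 + 26))    # prints: 46150
```

With the variables from the script above, bytes_removed "$GUTENDIR" "$TARGETDIR" would give the number of bytes that went into the void.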

1. For example:

mloc=$(grep -n -m 1 "$marker" "$f")
tnloc=$(wc -l "$f" | cut -d" " -f1)
mnloc=$(echo "$mloc" | cut -d":" -f1)
if [ $(($tnloc - $mnloc)) -le 10 ]; then
    echo "at-end"
else
    echo "at-beginning"
fi

where the magic value of 10 is conveniently chosen. And so on.
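
Put together with the cut itself, the heuristic might look like the sketch below. This is a guess at what "cutting accordingly" means, not the code that produced the archive: strip_small_print is a hypothetical name, the two patterns are derived from the start and end lines of the "small print" notice quoted in the article, and for a footer it assumes that everything from the notice's start line to the end of the file can go.

```shell
# Hypothetical helper: print a file without its "small print" notice,
# whether the notice sits at the top (header) or at the bottom (footer).
strip_small_print() {
    f=$1
    start_re='\*\*\*START\*\*THE SMALL PRINT'
    end_re='\*END\*THE SMALL PRINT'

    total=$(wc -l < "$f")
    end=$(grep -a -n -m 1 -e "$end_re" "$f" | cut -d: -f1)

    # No end marker: nothing to cut, pass the file through.
    if [ -z "$end" ]; then
        cat "$f"
        return 0
    fi

    if [ $((total - end)) -le 10 ]; then
        # at-end: keep only what precedes the start of the notice;
        # fall back to the end marker if no start marker is found.
        start=$(grep -a -n -m 1 -e "$start_re" "$f" | cut -d: -f1)
        [ -n "$start" ] || start=$end
        head -n $((start - 1)) "$f"
    else
        # at-beginning: keep only what follows the end marker.
        tail -n +$((end + 1)) "$f"
    fi
}
```

The same magic value of 10 is used, so a file with fewer than ten lines of text after a header notice would be misjudged as having a footer; the leftovers are what the manual check is for.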