May-June 2018 Shithub SSH key harvest, data and preliminary analysis

July 7, 2018 by Lucian Mogosanu

As you may know if you've been following the logs of The Most Serene Republic and the posts on Trilema [1], RSA keys everywhere have been under the scrutiny of The Supercollider, and there's nothing anyone anywhere can do to stop that. And as it happens, I've been looking to add my humble incremental contribution on top of others', a contribution which consisted of two months of scraping SSH keys off the imperial hub of gits, also known as "GitHub".

The experimental methodology consists of the following pipeline. On one side of the pipeline:

a. start by querying the GitHub API for users, i.e. getting -- this will yield a list of users U;
b. for each u in U, push the user's ID u[i] and name [2] u[n] into a queue Q;
c. take the highest user ID maxi found and head back to step (a), with since=maxi as an argument; repeat until there are no users left to process.

On the other side of the pipeline:

a. pop a u from Q;
b. access /u[n].keys to download a list of (newline-separated) keys K;
c. for each k in K, process it and append it to a text file; repeat until the other side of the pipeline has nothing else to give us.
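The two sides of the pipeline can be sketched as a producer and a consumer sharing a queue. This is a minimal illustration, not the harvester actually used: `fetch_page` and `fetch_keys` are hypothetical callables supplied by the caller, the `id` and `login` field names are assumed to match what the user listing returns, and results are collected in a list rather than appended to a text file.

```python
import queue
import threading

def enumerate_users(fetch_page, q):
    """Producer: walk the user list page by page and push (id, name) into q."""
    since = 0
    while True:
        users = fetch_page(since)   # hypothetical: list of {"id": ..., "login": ...}
        if not users:
            break                   # no users left to process
        for u in users:
            q.put((u["id"], u["login"]))
        since = max(u["id"] for u in users)   # resume after the highest ID seen
    q.put(None)                     # sentinel: tells the consumer to stop

def harvest_keys(fetch_keys, q, out):
    """Consumer: pop users from q and record one (id, name, key) row per key."""
    while True:
        item = q.get()
        if item is None:
            break
        uid, name = item
        body = fetch_keys(name) or ""   # hypothetical: newline-separated keys, or None
        for key in body.splitlines():
            if key.strip():
                out.append((uid, name, key.strip()))

def run_pipeline(fetch_page, fetch_keys):
    """Run both sides concurrently and return the collected rows."""
    q, out = queue.Queue(), []
    producer = threading.Thread(target=enumerate_users, args=(fetch_page, q))
    consumer = threading.Thread(target=harvest_keys, args=(fetch_keys, q, out))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return out
```

Passing the two fetchers in as arguments keeps the rate-limited side and the unlimited side independent, so either can be swapped or throttled without touching the other.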

Note that the side of the pipeline described first, i.e. user enumeration, requires access to the GitHub API, which is rate-limited; the rate limit threshold can be raised by using an API key generated by an account. The other side of the pipeline can avoid calling the API altogether by getting /username.keys, which is pretty convenient [3].
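A sketch of the API-free side, fetching /username.keys over plain HTTP, might look as follows. Several things here are assumptions rather than facts from the harvest: that a "Not Found" page arrives with HTTP status 404, that the timeout page described further down can be recognised by a caller-supplied predicate `is_timeout_page`, and that exponential backoff is an acceptable retry policy. The function names are made up for illustration.

```python
import time
import urllib.error
import urllib.parse
import urllib.request

def keys_url(username, base="https://github.com"):
    """Build the /username.keys URL; no API key is involved."""
    return f"{base}/{urllib.parse.quote(username)}.keys"

def http_get(url, timeout=30):
    """Return the response body as text, or None on HTTP 404."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        if err.code == 404:     # assumption: "Not Found" pages carry status 404
            return None
        raise

def fetch_keys(username, get=http_get, is_timeout_page=lambda body: False,
               retries=5, delay=1.0, sleep=time.sleep):
    """Fetch a user's keys, retrying for as long as the timeout page is served."""
    url = keys_url(username)
    for attempt in range(retries):
        body = get(url)
        if body is None or not is_timeout_page(body):
            return body
        sleep(delay * 2 ** attempt)     # back off before the next attempt
    raise RuntimeError(f"gave up on {url} after {retries} attempts")
```

The `get` and `sleep` parameters exist only so the retry logic can be exercised without touching the network.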

So what does this give us? In theory, this gives us a list of SSH keys and some metadata, e.g. user names. In practice, moreover:

[i] Some "users", e.g. readme, session and a few others, are in fact hard-coded to some page or the other. These are a minuscule part of the GitHub database.

[ii] Sometimes, when /username.keys is under heavy usage, GitHub just throws a timeout message, probably due to a rate limit on database requests. This is easy to detect (the error page has a very specific layout) and retries can be automated or whatever.

[iii] The last user ID registered on GitHub circa the 1st of July was 40,743,719. That's how many users had been registered by then, some of them meanwhile deleted, and some of them... see below.

[iv] For a large proportion of users, accessing /username.keys gives out a black-on-white page with the message "Not Found". What is this, then? Looking at the user names that yield this, one can notice that many of them look as if they were the product of a random string generator. So someone, probably more than one entity, is simply filling the GitHub database (and my poor text file) with spam accounts, which are then deleted using some spam filtering machine. It's obvious that the spam filtering mechanism is automated, because some of the users (e.g. user 40,743,719 above) were freshly registered and automatically banned. Nice, huh?

[v] The number of "Not Found" spam-users amounted to a total of 7,747,033 at the time when I stopped, so about 19% of total registrations ever. Not bad for a bunch of spammers, I guess.

[vi] The distribution of keys by type at the 1st of July is: 6426 ecdsa-sha2-nistp256, 474 ecdsa-sha2-nistp384, 3729 ecdsa-sha2-nistp521, 23408 ssh-dss, 33222 ssh-ed25519 and 4605938 ssh-rsa, or in a nice bar chart [4]:

[Figure: bar chart of the number of keys per key type, log-scale ordinate]

[vii] As per the figures above, there were only about 4.6 million RSA keys in existence on GitHub on the 1st of July 2018, as opposed to the approximately 6.9 million found by JuroV in his 2015 harvest. I can't really explain this. I'm pretty damn sure that I haven't missed anything, so I'll leave further discussion to the reader.
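The proportions behind items [iii], [v] and [vi] can be checked with a few lines of arithmetic. The numbers are copied from the list above; the variable names are just for illustration, and the last user ID is used as a stand-in for the total number of registrations.

```python
# Figures quoted in items [iii], [v] and [vi].
last_user_id = 40_743_719
not_found = 7_747_033
key_counts = {
    "ecdsa-sha2-nistp256": 6426,
    "ecdsa-sha2-nistp384": 474,
    "ecdsa-sha2-nistp521": 3729,
    "ssh-dss": 23408,
    "ssh-ed25519": 33222,
    "ssh-rsa": 4605938,
}

total_keys = sum(key_counts.values())
not_found_share = not_found / last_user_id
rsa_share = key_counts["ssh-rsa"] / total_keys

print(f"'Not Found' share of user IDs: {not_found_share:.1%}")
print(f"total keys harvested: {total_keys:,}")
print(f"ssh-rsa share of all keys: {rsa_share:.2%}")
```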

This about sums it up. Now that the reader has gone through my not-so-short and possibly boring list, they can find the data here.

$ sha512sum phathub-201806.csv.xz | gpg --clearsign -u
Hash: SHA512

2a5aee998cfb33f1c2b1ebf2c0bdbe1f7a648c8d260e5a3a41f7cff0ecf97be0f33e5ca542e34aa7b8cfb41dc259c89809303c6a743cb2582ccec480fcf43813  phathub-201806.csv.xz
Version: GnuPG v1.4.10 (GNU/Linux)


Update: an archive with raw non-RSA keys, github-non-rsa.tar.xz. Each file in the archive is a set of three comma-separated values, where the first value is the user ID, the second the user name, and the third the SSH key.
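Reading such records back and tallying them by key type could be sketched as below. It assumes one record per line, that user names contain no commas, and that the key field starts with the key type followed by whitespace (as in the usual OpenSSH public key layout); `parse_record` and `count_key_types` are illustrative names, not part of any published tooling.

```python
from collections import Counter

def parse_record(line):
    """Split 'id,name,key' into (int id, name, key); commas in the key are kept."""
    uid, name, key = line.rstrip("\n").split(",", 2)
    return int(uid), name, key

def count_key_types(lines):
    """Tally records by key type, i.e. the first whitespace-separated key field."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue                # skip blank lines
        _, _, key = parse_record(line)
        fields = key.split()
        if fields:
            counts[fields[0]] += 1
    return counts
```

Splitting with a maximum of two splits means any stray comma in the third field does not shift the columns.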

$ sha512sum github-non-rsa.tar.xz | gpg --clearsign -u
Hash: SHA512

87a706b9649ebc1456c4331ac0225bd24b0b40fa72bc19af5164be090a15b09d10128e15e80daaa0ed55e6c2db6c474626e5629514d50798a73ee8b5dc107b9c  github-non-rsa.tar.xz
Version: GnuPG v1.4.10 (GNU/Linux)
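For readers without sha512sum at hand, the digests listed above can be compared against a download with a short script. This is only a sketch: it checks the SHA-512 digest and does nothing about the GPG signature, and the function names are made up for illustration.

```python
import hashlib

def sha512_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-512 hex digest of a file, reading it in chunks."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def matches_digest(path, expected_hex):
    """True if the file's SHA-512 equals the expected hex digest."""
    return sha512_of_file(path) == expected_hex.strip().lower()
```

Usage would be along the lines of `matches_digest("github-non-rsa.tar.xz", digest)`, with `digest` pasted from the corresponding listing above.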


  1. And if you haven't, I wonder what you're doing here. No really, I'm curious, you're invited to explain if you wish. Anyway, here's a bunch of references, for reference: 1, 2, and last but not least, 3

  2. And whatever other metadata may seem useful. I for one haven't found anything else of interest there. 

  3. For the harvester, definitely not for the GitHub git admins maintaining their shitty infrastructure. What can I say, life's so unfair when it's filled to the brim with hallucinated choice. Or how did that go? 

  4. Yes, the ordinate is a log scale, problem? It's not my fault that RSA dominates the set by a few orders of magnitude. 

Filed under: computing.