May-June 2018 Shithub SSH key harvest, data and preliminary analysis

076 July 7, 2018 -- (tech)

As you may know if you've been following the logs of The Most Serene Republic and the posts on Trilema¹, RSA keys everywhere have been under the scrutiny of The Supercollider, and there's nothing anyone anywhere can do to stop that. And as it happens, I've been looking to add my humble incremental contribution on top of others', a contribution which consisted of two months of scraping SSH keys off the imperial hub of gits, also known as "GitHub".

The experimental methodology consists of the following pipeline. On one side of the pipeline:

a. start by querying the GitHub API for users, i.e. getting https://api.github.com/users?since=0 -- this will yield a list of users U;
b. for each u in U, push the user's ID u[i] and name² u[n] into a queue Q;
c. take the highest user ID max_i found and head back to step (a), with since=max_i as an argument; repeat until there are no users left to process.
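For the curious, steps (a) to (c) above can be sketched roughly as follows. This is a minimal sketch and not the harvester actually used: the names `enumerate_users`, `FetchPage` and `github_fetch_page` are made up for illustration, and the page fetcher is passed in as a parameter so that the loop can be tried without touching the network. The only things assumed about the API response are the `id` and `login` fields of each user object.

```python
import json
import urllib.request
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

# Hypothetical type: takes the `since` value and returns one page of user
# records, each a dict with at least "id" and "login".
FetchPage = Callable[[int], List[Dict]]


def github_fetch_page(since: int) -> List[Dict]:
    """One possible fetcher: an unauthenticated GET on the users listing."""
    url = f"https://api.github.com/users?since={since}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def enumerate_users(fetch_page: FetchPage, queue: Deque[Tuple[int, str]]) -> int:
    """Walk the listing page by page, pushing (id, login) onto `queue`.

    Returns the highest user ID seen.
    """
    since = 0
    while True:
        users = fetch_page(since)          # step (a)
        if not users:                      # no users left to process
            return since
        for u in users:                    # step (b)
            queue.append((u["id"], u["login"]))
        since = max(u["id"] for u in users)  # step (c)
```

Calling `enumerate_users(github_fetch_page, deque())` would run against the live API; any other callable with the same shape works for a dry run.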

On the other side of the pipeline:

a. pop a u from Q;
b. access https://github.com/u[n].keys to download a list of (newline-separated) keys K;
c. for each k in K, process it and append it to a text file; repeat until the other side of the pipeline has nothing else to give us.
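The consuming side can be sketched in the same spirit. Again, this is an illustration under assumptions, not the original code: `parse_keys` and `harvest` are hypothetical names, the "processing" of each key is reduced to stripping whitespace, and the output layout (user ID, user name, key, comma-separated) is an assumption. The function `fetch_keys` stands in for whatever downloads /username.keys and is assumed to return `None` when there is nothing to fetch.

```python
import csv
import io
from collections import deque
from typing import Callable, Deque, List, Optional, TextIO, Tuple


def parse_keys(body: str) -> List[str]:
    """Split a newline-separated .keys response into individual key lines."""
    return [line.strip() for line in body.splitlines() if line.strip()]


def harvest(
    queue: Deque[Tuple[int, str]],
    fetch_keys: Callable[[str], Optional[str]],
    out: TextIO,
) -> int:
    """Drain `queue`, appending one row per key to `out`.

    Returns the number of keys written.
    """
    writer = csv.writer(out, lineterminator="\n")
    written = 0
    while queue:
        user_id, login = queue.popleft()      # step (a)
        body = fetch_keys(login)              # step (b)
        if body is None:                      # nothing usable for this user
            continue
        for key in parse_keys(body):          # step (c)
            writer.writerow([user_id, login, key])
            written += 1
    return written
```

In a real run `out` would be a file opened in append mode; an `io.StringIO()` is enough to try the loop out.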

Note that the side of the pipeline described first, i.e. user enumeration, requires access to the GitHub API, which is rate-limited; the rate limit threshold can be raised by using an API key generated by an account. The other side of the pipeline can avoid calling the API altogether by getting /username.keys, which is pretty convenient³.
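As a rough illustration of living with the rate limit: the GitHub API reports the remaining quota in the `X-RateLimit-Remaining` response header and the reset time (epoch seconds) in `X-RateLimit-Reset`, and accepts a token in the `Authorization` header. The helper names `build_request` and `seconds_to_wait` below are hypothetical, and whether the original harvester did anything like this is not stated anywhere above.

```python
import time
import urllib.request
from typing import Mapping, Optional


def build_request(url: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request, attaching the API token when one is given."""
    req = urllib.request.Request(url)
    if token:
        req.add_header("Authorization", f"token {token}")
    return req


def seconds_to_wait(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """How long to sleep before the next API call, judging by the headers.

    Returns 0.0 while quota remains; otherwise the time until the reset.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    reset = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset - now)
```

The enumeration loop would then call `time.sleep(seconds_to_wait(resp.headers))` between pages, while the /username.keys side needs none of this.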

So what does this give us? In theory, this gives us a list of SSH keys and some metadata, e.g. user names. In practice, moreover:

[i] Some "users", e.g. readme, session and a few others, are in fact hard-coded to some page or the other. These are a minuscule part of the GitHub database.

[ii] Sometimes, when /username.keys is under heavy usage, GitHub just throws a timeout message, probably due to a rate limit on database requests. This is easy to detect (the error page has a very specific layout) and retries can be automated or whatever.

[iii] The last user ID registered on GitHub as of circa the 1st of July was 40,743,719. That's how many users had been registered by then, some of them meanwhile deleted, and some of them... see below.

[iv] For a large proportion of users, accessing /username.keys gives out a black-on-white page with the message "Not Found". What is this, then? Looking at the user names that yield this, one can notice that many of them look as if they were the product of a random string generator. So someone, probably more than one entity, is simply filling the GitHub database (and my poor text file) with spam accounts, which are then deleted using some spam filtering machine. It's obvious that the spam filtering mechanism is automated, because some of the users (e.g. user 40,743,719 above) were freshly registered and automatically banned. Nice, huh?

[v] The number of "Not Found" spam-users amounted to a total of 7,747,033 at the time when I stopped, so about 19% of total registrations ever. Not bad for a bunch of spammers, I guess.

[vi] The distribution of keys by type as of the 1st of July is: 6426 ecdsa-sha2-nistp256, 474 ecdsa-sha2-nistp384, 3729 ecdsa-sha2-nistp521, 23408 ssh-dss, 33222 ssh-ed25519 and 4605938 ssh-rsa, or in a nice bar chart⁴:

[Figure: bar chart of key counts by type, log-scale ordinate]

[vii] As per the figures above, there were only about 4.6 million RSA keys in existence on GitHub on the 1st of July 2018, as opposed to the approximately 6.9 million found by JuroV in his 2015 harvest. I can't really explain this. I'm pretty damn sure that I haven't missed anything, so I'll leave further discussion to the reader.
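The figures in the list above are easy to sanity-check. The sketch below shows one way to tally keys by type (the first whitespace-separated field of each key line) and then recomputes the totals and shares from the numbers quoted above; `count_key_types` is a hypothetical helper, not something taken from the harvester.

```python
from collections import Counter
from typing import Iterable


def count_key_types(keys: Iterable[str]) -> Counter:
    """Tally key lines by their first field, e.g. 'ssh-rsa'."""
    return Counter(k.split(None, 1)[0] for k in keys if k.strip())


# Counts by type as quoted in item [vi].
reported = {
    "ecdsa-sha2-nistp256": 6426,
    "ecdsa-sha2-nistp384": 474,
    "ecdsa-sha2-nistp521": 3729,
    "ssh-dss": 23408,
    "ssh-ed25519": 33222,
    "ssh-rsa": 4605938,
}

total_keys = sum(reported.values())
rsa_share = reported["ssh-rsa"] / total_keys

# "Not Found" users from item [v] over the last user ID from item [iii].
spam_share = 7_747_033 / 40_743_719

print(f"total keys: {total_keys}")
print(f"ssh-rsa share of keys: {rsa_share:.2%}")
print(f"'Not Found' share of user IDs: {spam_share:.2%}")
```

Note that the denominator for the spam share is the highest user ID, which is only a stand-in for the number of registrations.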

This about sums it up. Now that the reader has gone through my not-so-short and possibly boring list, they can find the data here.

$ sha512sum phathub-201806.csv.xz | gpg --clearsign -u lucian@mogosanu.ro
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

2a5aee998cfb33f1c2b1ebf2c0bdbe1f7a648c8d260e5a3a41f7cff0ecf97be0f33e5ca542e34aa7b8cfb41dc259c89809303c6a743cb2582ccec480fcf43813  phathub-201806.csv.xz
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)

iQIcBAEBCgAGBQJbP1muAAoJEL2unQUaPTuVcXUQAKRnR/MUUaxi6uEcZVYVJA6s
lGTSOM32gGXDXI3o+HidALDd2kaovIkOCxm2JAX08QtleQIbdrkquMlMHNNYBYh6
bkqUSUo905LRwbuwkPP7pOKxR4zuwyj2XFFAH8hShe5ZKBgzbC+jBi21mwT8B7i9
QsOtYGxwy3A3oLUAZDQaGZh59dGlJoRvQiJ3TX3m//rNczLqFd0DuIMfkAltjtj4
Cyr0tgMY3wnQduXngRW+4TvIIq6lSV4b8n+vnG2ksKCdw/mfL/gCitLItA6F2XeL
taSCZmLPn88so0XbLm55uEHiYsuaRCRKaqD+NcdZDZ5SJ6wiu01zfmgEHdmlNlGv
ujB6NdDPCkFBOC/0l860/+f1G4z98DQMouM/A7AdeVJOEZCwPsnaZQS/FE3zbRWz
v+PMj1nIsgG+kTMT01IoQn4c/vHgFxG+j4x7Y1vD0uF+iS3C0Dn7waFCVuD04dx9
mhv0CPDhu7eWsCpM7+ilf5gpLAjg1oLjX7bGSJK/yCbBlN3gd6W/zt+fOMALrFb7
WxN8JsAGGgP6VXI1nAj4xg75sV8gfdFHIz/kh/nmpaD1dmNkX7LTxc651ynQCMjd
yzB2nmCQI3+CNVu6Gcd3JPthsLXDC2Qi6f1rkiYhWBgZbLZV8Tvhn/5ogshJJWEP
uwFQM86fHrXTTBKr72Ma
=ZYuq
-----END PGP SIGNATURE-----
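For those who would rather check the download than take it on faith, the digest in the signed message above can be recomputed locally. A minimal sketch, with `sha512_of_file` and `verify` being hypothetical helper names; it only covers the hash comparison, while the signature itself still has to be checked separately, e.g. with gpg --verify on the clearsigned text.

```python
import hashlib


def sha512_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-512 of a file, read in chunks so large archives fit in memory."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """True if the file's SHA-512 matches the expected hex digest."""
    return sha512_of_file(path) == expected_hex.strip().lower()
```

Usage would be `verify("phathub-201806.csv.xz", "<the hex digest from the signed message>")`.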

Update: an archive with raw non-RSA keys, github-non-rsa.tar.xz. Each file in the archive is a set of three comma-separated values, where the first value is the user ID, the second the user name, and the third the SSH key.

$ sha512sum github-non-rsa.tar.xz | gpg --clearsign -u lucian@mogosanu.ro
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

87a706b9649ebc1456c4331ac0225bd24b0b40fa72bc19af5164be090a15b09d10128e15e80daaa0ed55e6c2db6c474626e5629514d50798a73ee8b5dc107b9c  github-non-rsa.tar.xz
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)

iQIcBAEBCgAGBQJbQd7DAAoJEL2unQUaPTuVQHkQAKCgTVYP/2lSzgmLdQpXHETR
/pSeBxAOp4FSjbqlipspnb+z8MLAUVs4l6kHyjdtjLK7B9KXi+lANLPXSOqAeMMQ
gJm/T51Yy/JVJEsgWyoLwnw4XQcRX2kOxzpdpd+ERDOvB+d2Zzvw1amDUz2S6Z9C
9QuIc005VJIZkMB+eC8uN7jb06leHcR6Q2bfB6WgbDAPvLBcrfRXGoZrmKAPgmrY
Y22s1Kug1TnSaDEJs6m5QNzNLl9iDkU8o6aHgydYgtUDQSfN6jUZI0ZrdysEDxDf
mu+jtHBgL/XoclaP08oy7LLhzWJ8qC0pA8Wixm26IRNQB2qWRXndWV5eCdtE2NNg
4RtNfW9GF4pZ3hZSQ/ciDfWrRIK90sEzEHnd1Lw8vtIPVN3Naj3tDVNVYZBrIQDc
7aK4QIIJTBuS5V2VvsiS1X/yKEZkJ9ZHMH4jBFIqQkwsbYYcTHT+UWY4iiqcqyHT
DwpvksS1926xK4T8CF3IpwYknVhKa7l1MjC5/XVo6HdcSJ/DXJPUClqGoaBjoET3
e+lP+eCXPiDzApb0vj+5SjgICgX4vRjupxc4lQR6o6C6qgW1OeZKUmBACgfIkymD
P/c3iuxyeHY3hI5Q8kiT8qXt33vrwD2wLyHK1tnq0Pm+7gnOlQTCDMOYLA670mQ2
F2IqZdc8aRMpU42RZWS8
=iOmn
-----END PGP SIGNATURE-----
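A record in the three-value layout described above (user ID, user name, SSH key) can be picked apart as sketched below. The names `Record`, `parse_record` and `embedded_type` are hypothetical, and the sketch assumes the key field looks like the usual "type base64-blob" pair. As a small bonus, the decoded blob starts with a length-prefixed copy of the key type (the standard SSH wire encoding of a string), which allows cross-checking the declared type.

```python
import base64
import struct
from typing import NamedTuple


class Record(NamedTuple):
    user_id: int
    login: str
    key_type: str
    blob: bytes


def parse_record(line: str) -> Record:
    """Parse 'id,name,key' where key is '<type> <base64 blob>'."""
    # Split on the first two commas only, leaving the key field intact.
    user_id, login, key = line.rstrip("\n").split(",", 2)
    key_type, b64 = key.split()[:2]
    return Record(int(user_id), login, key_type, base64.b64decode(b64))


def embedded_type(blob: bytes) -> str:
    """Read the key type stored at the start of the decoded key blob."""
    (length,) = struct.unpack(">I", blob[:4])   # big-endian uint32 length
    return blob[4:4 + length].decode("ascii")
```

A record whose `embedded_type(rec.blob)` differs from `rec.key_type` is worth a second look.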

  1. And if you haven't, I wonder what you're doing here. No really, I'm curious, you're invited to explain if you wish. Anyway, here's a bunch of references, for reference: 1, 2, and last but not least, 3.

  2. And whatever other metadata may seem useful. I for one haven't found anything else of interest there.

  3. For the harvester, definitely not for the GitHub git admins maintaining their shitty infrastructure. What can I say, life's so unfair when it's filled to the brim with hallucinated choice. Or how did that go?

  4. Yes, the ordinate is a log scale, problem? It's not my fault that RSA dominates the set by a few orders of magnitude.