<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>
<channel>
	<title>Comments on: A cursory look at the infamous TRB "wedge" bug</title>
	<atom:link href="http://thetarpit.org/2020/a-cursory-look-at-the-infamous-trb-wedge-bug/feed" rel="self" type="application/rss+xml" />
	<link>http://thetarpit.org/2020/a-cursory-look-at-the-infamous-trb-wedge-bug</link>
	<description>"Now I feel like I know less about what that blog is about than I did before."</description>
	<pubDate>Mon, 13 Apr 2026 18:23:13 +0000</pubDate>
	<generator>http://thetarpit.org</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: spyked</title>
		<link>http://thetarpit.org/2020/a-cursory-look-at-the-infamous-trb-wedge-bug#comment-1192</link>
		<dc:creator>spyked</dc:creator>
		<pubDate>Fri, 10 Dec 2021 21:12:30 +0000</pubDate>
		<guid isPermaLink="false">http://thetarpit.org/?p=382#comment-1192</guid>
		<description>Hey cgra! First off, thanks for your write-ups and efforts to untangle this; I still have your latest article in the to-read list, hoping to get to it by the end of the year.

Regarding your question: I didn't see it as possible either upon reading the code, but I did observe at run-time that plenty of transactions were scheduled for retrieval, while no blocks would be requested according to the logs. This might actually be related to your observation on duplicate requests, or there might be some other reason altogether for this behaviour; in any case, it was a purely empirical observation which might not hold up on careful inspection.

Unfortunately I've done very little independent technical work lately and none related to Bitcoin, as my node has been keeping up with the tip of the chain consistently after the initial sync. This tangled mess of a code be damned, but it works surprisingly well after all these years... at least as far as I've used it.

Now, the day will come when I try to get another node up and running, and I expect to run into this issue again. Not sure if it's the cleanest solution or just an ad-hoc attempt at a bootstrapping mechanism, but since I already have a list of (known, trusted) hashes synced with my previous node, this &lt;em&gt;oughta&lt;/em&gt; help in steering the new node to speed up its initial sync by asking for precisely the blocks that it needs... and now that I think of it, perhaps the whole mempool mechanism should stay put while this initial sync takes place.</description>
		<content:encoded><![CDATA[<p>Hey cgra! First off, thanks for your write-ups and efforts to untangle this; I still have your latest article in the to-read list, hoping to get to it by the end of the year.</p>
<p>Regarding your question: I didn't see it as possible either upon reading the code, but I did observe at run-time that plenty of transactions were scheduled for retrieval, while no blocks would be requested according to the logs. This might actually be related to your observation on duplicate requests, or there might be some other reason altogether for this behaviour; in any case, it was a purely empirical observation which might not hold up on careful inspection.</p>
<p>Unfortunately I've done very little independent technical work lately and none related to Bitcoin, as my node has been keeping up with the tip of the chain consistently after the initial sync. This tangled mess of a code be damned, but it works surprisingly well after all these years... at least as far as I've used it.</p>
<p>Now, the day will come when I try to get another node up and running, and I expect to run into this issue again. Not sure if it's the cleanest solution or just an ad-hoc attempt at a bootstrapping mechanism, but since I already have a list of (known, trusted) hashes synced with my previous node, this <em>oughta</em> help in steering the new node to speed up its initial sync by asking for precisely the blocks that it needs... and now that I think of it, perhaps the whole mempool mechanism should stay put while this initial sync takes place.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: cgra</title>
		<link>http://thetarpit.org/2020/a-cursory-look-at-the-infamous-trb-wedge-bug#comment-1188</link>
		<dc:creator>cgra</dc:creator>
		<pubDate>Fri, 10 Dec 2021 17:18:27 +0000</pubDate>
		<guid isPermaLink="false">http://thetarpit.org/?p=382#comment-1188</guid>
		<description>Can you still recall how you concluded that &lt;a href="http://thetarpit.org/2020/a-cursory-look-at-the-infamous-trb-wedge-bug?b=and%20the%20latter&#38;e=the%20former#select" rel="nofollow"&gt;transactions prioritize over blocks&lt;/a&gt;?

If it really was the case, I'd like a clue, because I don't see it as possible (or am I mixing things up?). I see &lt;b&gt;nLastTime&lt;/b&gt; being effectively a same-second call counter embedded into the &lt;b&gt;nNow&lt;/b&gt; value. This should guarantee &lt;b&gt;mapAskFor&lt;/b&gt; processing order between non-repeating items, even if calls came in rapid succession (they do when receiving responses to &lt;b&gt;getblocks&lt;/b&gt;). And the 2-minute increment only affects repeated items.</description>
		<content:encoded><![CDATA[<p>Can you still recall how you concluded that <a href="http://thetarpit.org/2020/a-cursory-look-at-the-infamous-trb-wedge-bug?b=and%20the%20latter&amp;e=the%20former#select" rel="nofollow">transactions prioritize over blocks</a>?</p>
<p>If it really was the case, I'd like a clue, because I don't see it as possible (or am I mixing things up?). I see <b>nLastTime</b> being effectively a same-second call counter embedded into the <b>nNow</b> value. This should guarantee <b>mapAskFor</b> processing order between non-repeating items, even if calls came in rapid succession (they do when receiving responses to <b>getblocks</b>). And the 2-minute increment only affects repeated items.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: cgra</title>
		<link>http://thetarpit.org/2020/a-cursory-look-at-the-infamous-trb-wedge-bug#comment-1149</link>
		<dc:creator>cgra</dc:creator>
		<pubDate>Sun, 28 Nov 2021 09:26:24 +0000</pubDate>
		<guid isPermaLink="false">http://thetarpit.org/?p=382#comment-1149</guid>
		<description>Regarding &lt;a href="http://thetarpit.org/2020/a-cursory-look-at-the-infamous-trb-wedge-bug?b=and%20eventually&#38;e=restart" rel="nofollow"&gt;getting permanently stuck&lt;/a&gt;:

Peers keep track of which blocks they've already advertised to you, in CNode.setInventoryKnown. Then, when responding to your getblocks, they won't ever advertise the same item again, no matter how often you request it. Only reconnecting with the peer will reset this "spam guard".

I believe the root of this problem is that the current code worked OK when TRB still kept orphans (the ~infinite RAM usage issue aside), because one way or the other, you will end up requesting/receiving blocks out of order, while the next block in your current chain state is scheduled to be asked for later, from another node, as the first one failed to deliver.

Now, with enough of this shuffle, you may end up in a situation where you've already asked block B from every peer, but always too early. When block B is finally the next in your chain state, and ready to be accepted, every peer will refuse to advertise the block B again! Your request mechanism requires one advert for every getdata request, so you're stuck until a new peer shows up, or a current peer reconnects.</description>
		<content:encoded><![CDATA[<p>Regarding <a href="http://thetarpit.org/2020/a-cursory-look-at-the-infamous-trb-wedge-bug?b=and%20eventually&amp;e=restart" rel="nofollow">getting permanently stuck</a>:</p>
<p>Peers keep track of which blocks they've already advertised to you, in CNode.setInventoryKnown. Then, when responding to your getblocks, they won't ever advertise the same item again, no matter how often you request it. Only reconnecting with the peer will reset this "spam guard".</p>
<p>I believe the root of this problem is that the current code worked OK when TRB still kept orphans (the ~infinite RAM usage issue aside), because one way or the other, you will end up requesting/receiving blocks out of order, while the next block in your current chain state is scheduled to be asked for later, from another node, as the first one failed to deliver.</p>
<p>Now, with enough of this shuffle, you may end up in a situation where you've already asked block B from every peer, but always too early. When block B is finally the next in your chain state, and ready to be accepted, every peer will refuse to advertise the block B again! Your request mechanism requires one advert for every getdata request, so you're stuck until a new peer shows up, or a current peer reconnects.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: spyked</title>
		<link>http://thetarpit.org/2020/a-cursory-look-at-the-infamous-trb-wedge-bug#comment-364</link>
		<dc:creator>spyked</dc:creator>
		<pubDate>Sun, 28 Jun 2020 20:21:41 +0000</pubDate>
		<guid isPermaLink="false">http://thetarpit.org/?p=382#comment-364</guid>
		<description>Thanks for sharing this, Jacob! I'll take a deeper look into the logs you shared sometime this week and I'll share any interesting bits that I may find.

Regarding mechanical HDDs: although I'm not claiming it's impossible, I for one &lt;em&gt;never&lt;/em&gt; managed to sync an HDD-based node, though I tried several times and in several configurations (including -connect to another node). I'm not sure how many disk operations are required on average to add a block to the database; depending on this, you might run into hard processing limitations given anything but the first couple of years of blocks. IMHO it's worth measuring this either way.</description>
		<content:encoded><![CDATA[<p>Thanks for sharing this, Jacob! I'll take a deeper look into the logs you shared sometime this week and I'll share any interesting bits that I may find.</p>
<p>Regarding mechanical HDDs: although I'm not claiming it's impossible, I for one <em>never</em> managed to sync an HDD-based node, though I tried several times and in several configurations (including -connect to another node). I'm not sure how many disk operations are required on average to add a block to the database; depending on this, you might run into hard processing limitations given anything but the first couple of years of blocks. IMHO it's worth measuring this either way.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: #ossasepia Logs for Jun 2020 &#171; Ossa Sepia</title>
		<link>http://thetarpit.org/2020/a-cursory-look-at-the-infamous-trb-wedge-bug#comment-362</link>
		<dc:creator>#ossasepia Logs for Jun 2020 &#171; Ossa Sepia</dc:creator>
		<pubDate>Fri, 26 Jun 2020 13:38:40 +0000</pubDate>
		<guid isPermaLink="false">http://thetarpit.org/?p=382#comment-362</guid>
		<description>[...] http://thetarpit.org/2020/a-cursory-look-at-the-infamous-trb-wedge-bug &#60;&#60; The Tar Pit -- A cursory look at the infamous TRB &#34;wedge&#34; bug [...]</description>
		<content:encoded><![CDATA[<p>[...] <a href="http://thetarpit.org/2020/a-cursory-look-at-the-infamous-trb-wedge-bug" rel="nofollow">http://thetarpit.org/2020/a-cursory-look-at-the-infamous-trb-wedge-bug</a> &lt;&lt; The Tar Pit -- A cursory look at the infamous TRB &quot;wedge&quot; bug [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jacob Welsh</title>
		<link>http://thetarpit.org/2020/a-cursory-look-at-the-infamous-trb-wedge-bug#comment-358</link>
		<dc:creator>Jacob Welsh</dc:creator>
		<pubDate>Wed, 24 Jun 2020 19:27:23 +0000</pubDate>
		<guid isPermaLink="false">http://thetarpit.org/?p=382#comment-358</guid>
		<description>@spyked: I've tried your patch by way of a trivial regrind onto my tree. It doesn't seem to help with my issue - or at least with what I'm observing now, which is quite possibly different from yours or even from what I've had before, but now seems reproducible 'in vitro' between two nodes on my LAN.

Node 1 is a healthy, well-provisioned, fully synced node maintaining outbound connections (and listening for inbound but firewalled upstream). Node 2 is a VM with 3G RAM sitting on mechanical HDDs, running with -connect to Node 1. It takes ~4 minutes to connect up a valid block and so would never get synced in any bearable time on its own, but was seeded with a filesystem-level copy of Node 1's datadir so is able to keep up now. (Copied "cold" to rule out database consistency questions.)

What seems to happen is Node 2 (on which I installed the patch) gets caught up to where the block height was when it started catchup, then fails to get anything but the latest "pushes" of tip from Node 1, leaving a gap of blocks from its catchup interval unfilled.

Log excerpt covering one "bastard cycle": http://welshcomputing.com/files/debug.log.20200624.gz

The main thing I see in there ("grep -i block") is that it supposedly does send a "getblocks" upon receiving a bastard, but its peer never replies with the necessary block "inv"s, so "getdata" doesn't even enter into the question. On restarting Node 2 it always gets going again.

Secondarily, I see that SendMessages is run on a 100ms polling loop, so the debug output from the patch is a bit noisy. Perhaps better to print nothing in the EMPTY case. (I always run with debug=1 because otherwise it truncates debug.log on startup.)</description>
		<content:encoded><![CDATA[<p>@spyked: I've tried your patch by way of a trivial regrind onto my tree. It doesn't seem to help with my issue - or at least with what I'm observing now, which is quite possibly different from yours or even from what I've had before, but now seems reproducible 'in vitro' between two nodes on my LAN.</p>
<p>Node 1 is a healthy, well-provisioned, fully synced node maintaining outbound connections (and listening for inbound but firewalled upstream). Node 2 is a VM with 3G RAM sitting on mechanical HDDs, running with -connect to Node 1. It takes ~4 minutes to connect up a valid block and so would never get synced in any bearable time on its own, but was seeded with a filesystem-level copy of Node 1's datadir so is able to keep up now. (Copied "cold" to rule out database consistency questions.)</p>
<p>What seems to happen is Node 2 (on which I installed the patch) gets caught up to where the block height was when it started catchup, then fails to get anything but the latest "pushes" of tip from Node 1, leaving a gap of blocks from its catchup interval unfilled.</p>
<p>Log excerpt covering one "bastard cycle": <a href="http://welshcomputing.com/files/debug.log.20200624.gz" rel="nofollow">http://welshcomputing.com/files/debug.log.20200624.gz</a></p>
<p>The main thing I see in there ("grep -i block") is that it supposedly does send a "getblocks" upon receiving a bastard, but its peer never replies with the necessary block "inv"s, so "getdata" doesn't even enter into the question. On restarting Node 2 it always gets going again.</p>
<p>Secondarily, I see that SendMessages is run on a 100ms polling loop, so the debug output from the patch is a bit noisy. Perhaps better to print nothing in the EMPTY case. (I always run with debug=1 because otherwise it truncates debug.log on startup.)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: spyked</title>
		<link>http://thetarpit.org/2020/a-cursory-look-at-the-infamous-trb-wedge-bug#comment-356</link>
		<dc:creator>spyked</dc:creator>
		<pubDate>Sat, 20 Jun 2020 14:46:52 +0000</pubDate>
		<guid isPermaLink="false">http://thetarpit.org/?p=382#comment-356</guid>
		<description>@Stan: Re. catch-up reliability, I'm still testing various scenarios. So far it works fine on my setup when the node is ~100 blocks behind the tip; I haven't yet had the chance to see how it behaves when it's lagging 1K-10K blocks behind. Also, I'm not sure whether there's any way to measure performance for the transaction relay, and I saw that e.g. some versions of PRB decoupled the two types of getdata (for blocks and txs) entirely.

@Jacob: I'm curious to see whether this fixes your issue, since as per Stan's comment above, there's more than one type of "wedge", and who knows, maybe there's actually more than two types.

In other news, I've been meaning to test your rawtransaction work for a while, but I have to get these weird issues out of the way first. To be honest, there's a whole pile of TRB work that I'm not quite ready to take at the moment, more for the lack of familiarity with the project and the code base than anything else. But I'll get there, even if at snail's pace...</description>
		<content:encoded><![CDATA[<p>@Stan: Re. catch-up reliability, I'm still testing various scenarios. So far it works fine on my setup when the node is ~100 blocks behind the tip; I haven't yet had the chance to see how it behaves when it's lagging 1K-10K blocks behind. Also, I'm not sure whether there's any way to measure performance for the transaction relay, and I saw that e.g. some versions of PRB decoupled the two types of getdata (for blocks and txs) entirely.</p>
<p>@Jacob: I'm curious to see whether this fixes your issue, since as per Stan's comment above, there's more than one type of "wedge", and who knows, maybe there's actually more than two types.</p>
<p>In other news, I've been meaning to test your rawtransaction work for a while, but I have to get these weird issues out of the way first. To be honest, there's a whole pile of TRB work that I'm not quite ready to take at the moment, more for the lack of familiarity with the project and the code base than anything else. But I'll get there, even if at snail's pace...</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jacob Welsh</title>
		<link>http://thetarpit.org/2020/a-cursory-look-at-the-infamous-trb-wedge-bug#comment-355</link>
		<dc:creator>Jacob Welsh</dc:creator>
		<pubDate>Fri, 19 Jun 2020 22:08:22 +0000</pubDate>
		<guid isPermaLink="false">http://thetarpit.org/?p=382#comment-355</guid>
		<description>@spyked: Thanks for posting; I've suffered this flavor of intermittent wedge many times, including on fully synced nodes. Prioritizing blocks over transactions is quite agreeable to me; likewise getting rid of the pretense of multithreading as a longer-term thing (either fully embracing kernel threads or else the internal non-preemptive event scheduler). I'll give the patch a closer look and test on top of &lt;a href="http://fixpoint.welshcomputing.com/2020/build-system-overhaul-for-bitcoind/" rel="nofollow"&gt;my latest&lt;/a&gt;.</description>
		<content:encoded><![CDATA[<p>@spyked: Thanks for posting; I've suffered this flavor of intermittent wedge many times, including on fully synced nodes. Prioritizing blocks over transactions is quite agreeable to me; likewise getting rid of the pretense of multithreading as a longer-term thing (either fully embracing kernel threads or else the internal non-preemptive event scheduler). I'll give the patch a closer look and test on top of <a href="http://fixpoint.welshcomputing.com/2020/build-system-overhaul-for-bitcoind/" rel="nofollow">my latest</a>.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Stanislav Datskovskiy</title>
		<link>http://thetarpit.org/2020/a-cursory-look-at-the-infamous-trb-wedge-bug#comment-354</link>
		<dc:creator>Stanislav Datskovskiy</dc:creator>
		<pubDate>Thu, 18 Jun 2020 14:14:33 +0000</pubDate>
		<guid isPermaLink="false">http://thetarpit.org/?p=382#comment-354</guid>
		<description>Re #2 -- indeed, 'feeder' requires the existence of a separate, up-to-date node somewhere, from which to feed. I used it recently to resync my desk node when I suffered a temporary loss of connectivity at my desk.

And indeed Mod6's DoS doesn't seem to happen especially often in the wild, but is 100% reproducible on demand via 'wedger', and so I no longer operate any nodes where the underlying issue is not patched, and do not recommend doing so -- in principle, anyone can perma-wedge a vulnerable TRB remotely.

It will be interesting to see whether your patch results in a reliable "catch-up" mechanism.</description>
		<content:encoded><![CDATA[<p>Re #2 -- indeed, 'feeder' requires the existence of a separate, up-to-date node somewhere, from which to feed. I used it recently to resync my desk node when I suffered a temporary loss of connectivity at my desk.</p>
<p>And indeed Mod6's DoS doesn't seem to happen especially often in the wild, but is 100% reproducible on demand via 'wedger', and so I no longer operate any nodes where the underlying issue is not patched, and do not recommend doing so -- in principle, anyone can perma-wedge a vulnerable TRB remotely.</p>
<p>It will be interesting to see whether your patch results in a reliable "catch-up" mechanism.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: spyked</title>
		<link>http://thetarpit.org/2020/a-cursory-look-at-the-infamous-trb-wedge-bug#comment-353</link>
		<dc:creator>spyked</dc:creator>
		<pubDate>Thu, 18 Jun 2020 14:07:59 +0000</pubDate>
		<guid isPermaLink="false">http://thetarpit.org/?p=382#comment-353</guid>
		<description>Indeed, the scope of this article is only the second variety of "wedge", thanks for the clarification.

If I understand correctly, your feeder.py script presupposes that I already have the blocks and I can deliver them directly to the node. This isn't the case here, as the initial sync works fine; instead, I was having a lot of trouble &lt;em&gt;keeping&lt;/em&gt; the node up to date, so after the sync, no getdata requests were sent and the node remained at the mercy of such external "feeders".

I'll also check out the DoS pill posted by you and Mod6. I think I've encountered it in the past, but it didn't happen often enough to encourage me to get off my ass and try it out.</description>
		<content:encoded><![CDATA[<p>Indeed, the scope of this article is only the second variety of "wedge", thanks for the clarification.</p>
<p>If I understand correctly, your feeder.py script presupposes that I already have the blocks and I can deliver them directly to the node. This isn't the case here, as the initial sync works fine; instead, I was having a lot of trouble <em>keeping</em> the node up to date, so after the sync, no getdata requests were sent and the node remained at the mercy of such external "feeders".</p>
<p>I'll also check out the DoS pill posted by you and Mod6. I think I've encountered it in the past, but it didn't happen often enough to encourage me to get off my ass and try it out.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
