A cursory look at the infamous TRB "wedge" bug

June 18, 2020 by Lucian Mogosanu

Long story short, for those who are still tuned in: following my recent adventures getting a TRB node up and running on Debian, I proceeded to get it to sync with the rest of the network over some period of time. Alas, although syncing with -connect and -nolisten didn't pose any problems, my new node remained consistently behind and eventually got entirely stuck at some height until the next restart. In other words, it exhibited a behaviour that people often refer to as a "wedge"¹, wherein for some reason the node stops asking its peers for new blocks. In this article I'll take a brief look at the problem² and propose a stopgap solution.

For starters, we recall that Bitcoin is a peer-to-peer protocol, from which follows that the structure of the network is flat: as opposed to the hierarchical "client-server" paradigm, where "servers" listen and "clients" initiate new connections, here each of the parties may initiate connections to each other at any moment. Although this is in principle not necessary³ for the parties involved, it is necessary for the overall health of the network, so that the distributed system converges to a relatively stable state at a given point, i.e. the vast majority of the nodes have the same view of the blockchain, at least until the next "best" block is mined.

For this to be possible, nodes may in principle exchange data (transactions, blocks) and metadata, i.e. hashes identifying said data on the network. So for example if a node A wants to see what's new, it's going to connect to another node, B, and ask for its "inventory":

0. A -> B : getblocks
1. B -> A : inv [ block h1, block h2, tx h3, ... ]
2. A -> B : getdata [block h1, block h2]
3. A -> B : getdata [tx h3]
4. B -> A : block h1_data
5. B -> A : block h2_data
6. B -> A : tx    h3_data

At 0, A asks B for its inventory; B then responds (1) with a list of blocks and/or transactions it has available; A follows up on the hashes and sends getdata requests (2, 3), to which B responds with the actual data (4, 5, 6). So for starters, we'll assume that the "wedge" occurs somewhere in, or between steps 1 and (2, 3). Actually, in my observations steps (2, 3) were never reached at all, but I did notice the following detail: after receiving an inventory from the peer, the node doesn't immediately proceed to issue getdata. Instead, it marks the data as being "asked for", for example:

askfor block 0000000000000000000e7f0c37df73c236ed49da75dbf3ce3ed5576137228bb4   0
sending getdata: block 0000000000000000000e7f0c37df73c236ed49da75dbf3ce3ed5576137228bb4
askfor tx 9680cf6c8d585a0257e09101e95b7d8c6a055d5fba8e95c58ec7184d01823640   0
sending getdata: tx 9680cf6c8d585a0257e09101e95b7d8c6a055d5fba8e95c58ec7184d01823640

If we grep for "askfor" in the Bitcoin sources, we get:

$ grep -rw askfor src
$ src/net.h:        printf("askfor %s   %"PRI64d"n", inv.ToString().c_str(), nRequestTime);

Looking at src/net.h, we notice that this printf is part of a function called AskFor, which, by the way, is called from ProcessMessage over in src/main.cpp, in response to inv messages. Let's expose AskFor here in all its nude glory:

void AskFor(const CInv& inv)
{
    // We're using mapAskFor as a priority queue,
    // the key is the earliest time the request can be sent
    int64& nRequestTime = mapAlreadyAskedFor[inv];
    printf("askfor %s   %"PRI64d"n", inv.ToString().c_str(), nRequestTime);

    // Make sure not to reuse time indexes to keep things in the same order
    int64 nNow = (GetTime() - 1) * 1000000;
    static int64 nLastTime;
    ++nLastTime;
    nNow = std::max(nNow, nLastTime);
    nLastTime = nNow;

    // Each retry is 2 minutes after the last
    nRequestTime = std::max(nRequestTime + 2 * 60 * 1000000, nNow);
    mapAskFor.insert(std::make_pair(nRequestTime, inv));
}

So we get an inv, which we use to perform some weird computation related to some timestamps: is it ever possible that nLastTime is bigger than nNow? I couldn't say from reading that piece of code, despite the fact that nLastTime is scoped within the very code that I shared with you. So anyway, the request time and the inventory item are placed together in an object, mapAskFor, which is used the following way:

$ grep -rw mapAskFor src
src/net.h:    std::multimap<int64, CInv> mapAskFor;
src/net.h:        // We're using mapAskFor as a priority queue,
src/net.h:        mapAskFor.insert(std::make_pair(nRequestTime, inv));
src/main.cpp:        while (!pto->mapAskFor.empty() && (*pto->mapAskFor.begin()).first <= nNow)
src/main.cpp:            const CInv& inv = (*pto->mapAskFor.begin()).second;
src/main.cpp:            pto->mapAskFor.erase(pto->mapAskFor.begin());

The first is the definition within the CNode class; the second and the third are part of AskFor; so we need to look further at src/main.cpp -- more precisely, at SendMessages, more precisely at the portion that issues getdata commands to peers:

    //
    // Message: getdata
    //
    vector<CInv> vGetData;
    int64 nNow = GetTime() * 1000000;
    CTxDB txdb("r");
    while (!pto->mapAskFor.empty() && (*pto->mapAskFor.begin()).first <= nNow)
    {
        const CInv& inv = (*pto->mapAskFor.begin()).second;
        if (!AlreadyHave(txdb, inv))
        {
            printf("sending getdata: %sn", inv.ToString().c_str());
            vGetData.push_back(inv);
            if (vGetData.size() >= 1000)
            {
                pto->PushMessage("getdata", vGetData);
                vGetData.clear();
            }
        }
        mapAlreadyAskedFor[inv] = nNow;
        pto->mapAskFor.erase(pto->mapAskFor.begin());
    }
    if (!vGetData.empty())
        pto->PushMessage("getdata", vGetData);

This piece of code is what's supposed to turn inv items into getdata requests, by grabbing items from the mapAskFor priority queue and trying to push them on the wire. The more interesting part to my eye is the:

!pto->mapAskFor.empty() && (*pto->mapAskFor.begin()).first <= nNow

condition, which not only expects that there are items in the queue, but it also expects that the so-called time index of the first item in the queue indicates that the message must be sent earlier than now... which sorta makes sense, since messages may be scheduled to be sent sometime in the future. This is how we realize that this whole consumer/producer thing is the implementation of an algorithm for scheduling getdata requests⁴. The algorithm, as far as I can tell so far, is:

# Upon receiving an inv...
1. Compute nRequestTime based on previous occurences of inv;
2. Update nLastTime based on nNow and previous nLastTime value;
3. Compute schedule time as
   a. nRequestTime + 2 minutes,
   b. or now, whichever is higher;
3. Place (schedule time, inv) in priority queue mapAskFor.

# mapAskFor is required to be:
# - traversable; and
# - sorted by schedule time, from oldest to newest?

# On transmit
1. while (mapAskFor has items and first item is schedulable)
   a. schedule getdata

The problem, as far as I could tell from my observation, and please do correct me if I'm wrong, is that either a. mapAskFor is not persistent between connections, so invs get lost; or b. mapAskFor does in fact not keep items sorted, which would be just plain weird, because that's as far as C++'s std::multimap guarantee goes⁵.

Anyway, in the common case mapAskFor will end up populated with hashes for a few blocks (as they come) and a whole shitload for relayed transactions, and the latter for some reason get prioritized over the former. This is why on my setup -connect (in conjunction with -nolisten) works just fine: no one connects to the node, so this scheduling queue doesn't get all filled up, so eventually the node will eventually send all getdata messages. Really, no extra "aggression" required, at least not more than is provided in Stan's "aggressive pushgetblocks" patch.

Alas, I see only one solution to this issue, and that is a complete rewrite of this getdata-scheduler, which I suppose would require a complete redesign of TRB's current multithreaded model. I'm not keen to dive into all that at the moment, so I am going to propose the following change⁶ in AskFor:

diff -uNr a/bitcoin/src/net.h b/bitcoin/src/net.h
--- a/bitcoin/src/net.h 492c9cc92a504bb8174d75fafcbee6980986182a459efc9bfa1d64766320d98ba2fa971d78d00a777c6cc50f82a5d424997927378e99738b1b3b550bdaa727f7
+++ b/bitcoin/src/net.h b83e0fe060dd237715ea511de804aebc49b0868614ef368673cb6ddd2014fc8126b92ee4cacc3d158862904d778f0adb50b6e19751ef4b9f22a4ca93e591402f
@@ -266,8 +266,11 @@
         nNow = std::max(nNow, nLastTime);
         nLastTime = nNow;

-        // Each retry is 2 minutes after the last
-        nRequestTime = std::max(nRequestTime + 2 * 60 * 1000000, nNow);
+       // Queue requests for TX later on to prioritize getting new blocks
+        if (inv.type == MSG_TX)
+            nRequestTime += 2 * 60 * 1000000;
+
+        nRequestTime = std::max(nRequestTime, nNow);
         mapAskFor.insert(std::make_pair(nRequestTime, inv));
     }

This programs the scheduling policy to prioritize "block" inv hashes over "tx" ones. As I was saying, I'm skeptical that this is an actual solution, but so far it's managed to help bring my node up to speed really quickly without any noticeable damage. Now, why I experienced this issue and others didn't, and why the magic numbers were chosen as they are, this is an entire mystery to me at the moment.

Meanwhile, my changes are available (along with all the others) in the following V patch:

spyked_prio_askfor_msg_block.vpatch (seal)

Some relevant bits:

635283 spyked_prio_askfor_msg_block spyked

Prioritize MSG_BLOCK hashes in the askfor priority queue;

Print askfor debug data when -debug is passed;

Removes max connection/IP address retardation, stolen from ben_vulpes_increase_aggression_levels experimental vpatch

Comments and questions are, as usual, welcome.

The problem has been widely discussed in #trilema, #asciilifeform and other places over the years. More recently, Mod6 has been looking at it in his "getdata" series. For all it's worth, I can't claim to grasp the issue very well, as I have very limited knowledge related to both the Bitcoin protocol and all of TRB's idiosyncratic behaviours.

Either way, I had to start looking somewhere and this is what I've found so far. ↩
To be entirely honest, I can't say that this solves the historical "wedge" issue that folks have been running into, but it does give me some insight into the particular problem that I personally encountered. Let me know if this does anything for you. ↩
And fiercely hated by the enemies of Bitcoin, who discourage direct interaction between individuals in favour of interaction mediated by imperial "platforms".

Why do "they" do this? So that "they", in effect, own you.

How do "they" do this? Through routing policies: tell me, what's your public IP address? do you have any guarantees, or at least a promise that it will stay the same in, say, a month from now? If you're connecting to the 'net from home, then chances are that no, you don't own an IP address, thus you can't possibly own a fully-fledged Bitcoin node. ↩
Now, where did I see that before? ↩
And if items do indeed remain in this scheduling queue between connections, then one can't help but beg the question: when do they get garbage collected? This might be a great recipe for a memory leak. Just imagine mapAskFor growing in size indefinitely, just because the items contained aren't sorted properly. ↩
Based upon Mod6's latest. ↩

Filed under: computing.
RSS 2.0 feed. Comment. Send trackback.

11 Responses to “A cursory look at the infamous TRB "wedge" bug”

#1:
Stanislav Datskovskiy says:

June 18, 2020 at 16:52

The term "wedge" in early TRBism history referred to "node stops receiving blocks", which as it turned out, may happen from two entirely dissimilar conditions. The confusion is largely my fault, so I will detail:

In one form, "wedge" is a node that is actually wedged, i.e. does not answer either remote or local commands. This is Mod6's wedge, and it turned out to be a reproducible (via my wedger.py) remote DOS -- a vulnerable TRB, when 'wedger' is fired at it, will immediately eat all available RAM and do no further useful work until restarted, guaranteed.

In the other form, a node falls behind (for whatever reason) and is unable to catch up, while still answering commands. Such a node can be e.g. fed via "feeder" ( dulap.xyz/pub/feeder.py ) and is otherwise functional.

Your patch appears to be a potential pill against the latter (for the former, there is an existing pill on Mod6's WWW and mine.) I will test it on one of my nodes.
#2:
spyked says:

June 18, 2020 at 17:07

Indeed, the scope of this article is only the second variety of "wedge", thanks for the clarification.

If I understand correctly, your feeder.py script presupposes that I already have the blocks and I can deliver them directly to the node. This isn't the case here, as the initial sync works fine; instead, I was having a lot of trouble keeping the node up to date, so after the sync, no getdata requests were sent and the node remained at the mercy of such external "feeders".

I'll also check out the DoS pill posted by you and Mod6. I think I've encountered it in the past, but it didn't happen often enough to encourage me to get off my ass and try it out.
#3:
Stanislav Datskovskiy says:

June 18, 2020 at 17:14

Re #2 -- indeed, 'feeder' requires the existence of a separate, up-to-date node somewhere, from which to feed. I used it recently to resync my desk node when I suffered a temporary loss of connectivity at my desk.

And indeed Mod6's DoS doesn't seem to happen especially often in the wild, but is 100% reproducible on demand via 'wedger', and so I no longer operate any nodes where the underlying issue is not patched, and do not recommend doing so -- in principle, anyone can perma-wedge a vulnerable TRB remotely.

Will be interesting to see whether your patch results in reliable "catch up" mechanism.
#4:
Jacob Welsh says:

June 20, 2020 at 01:08

@spyked: Thanks for posting; I've suffered this flavor of intermittent wedge many times, including on fully synced nodes. Prioritizing blocks over transactions is quite agreeable to me; likewise getting rid of the pretense of multithreading as a longer-term thing (either fully embracing kernel threads or else the internal non-preemptive event scheduler). I'll give the patch a closer look and test on top of my latest.
#5:
spyked says:

June 20, 2020 at 17:46

@Stan: Re. catch-up reliability, I'm still testing various scenarios. So far it works fine on my setup when the node is ~100 blocks behind the tip, I haven't yet had the chance to see how it behaves when it's lagging 1K-10K blocks behind. Also, I'm not sure whether there's any way to measure performance for the transaction relay, and I saw e.g. some versions of PRB decoupled the two types of getdata (for blocks and txs) entirely.

@Jacob: I'm curious to see whether this fixes your issue, since as per Stan's comment above, there's more than one type of "wedge", and who knows, maybe there's actually more than two types.

In other news, I've been meaning to test your rawtransaction work for a while, but I have to get these weird issues out of the way first. To be honest, there's a whole pile of TRB work that I'm not quite ready to take at the moment, more for the lack of familiarity with the project and the code base than anything else. But I'll get there, even if at snail's pace...
#6:
Jacob Welsh says:

June 24, 2020 at 22:27

@spyked: I've tried your patch by way of a trivial regrind onto my tree. It doesn't seem to help with my issue - or at least what I'm observing now which is quite possibly different from yours or even what I've had before, but now seems reproducible 'in vitro' between two nodes on my LAN.

Node 1 is a healthy, well-provisioned, fully synced node maintaining outbound connections (and listening for inbound but firewalled upstream). Node 2 is a VM with 3G RAM sitting on mechanical HDDs, running with -connect to Node 1. It takes ~4 minutes to connect up a valid block and so would never get synced in any bearable time on its own, but was seeded with a filesystem-level copy of Node 1's datadir so is able to keep up now. (Copied "cold" to rule out database consistency questions.)

What seems to happen is Node 2 (on which I installed the patch) gets caught up to where the block height was when it started catchup, then fails to get anything but the latest "pushes" of tip from Node 1, leaving a gap of blocks from its catchup interval unfilled.

Log excerpt covering one "bastard cycle": http://welshcomputing.com/files/debug.log.20200624.gz

The main thing I see in there ("grep -i block") is that it supposedly does send a "getblocks" upon receiving a bastard, but its peer never replies with the necessary block "inv"s, so "getdata" doesn't even enter into the question. On restarting Node 2 it always gets going again.

Secondarily, I see that SendMessages is run on a 100ms polling loop, so the debug output from the patch is a bit noisy. Perhaps better to print nothing in the EMPTY case. (I always run with debug=1 because otherwise it truncates debug.log on startup.)
#7:
spyked says:

June 28, 2020 at 23:21

Thanks for sharing this, Jacob! I'll take a deeper look into the logs you shared sometime this week and I'll share any interesting bits that I may find.

Regarding mechanical HDDs: although I'm not claiming it's not possible, I for one never managed to sync a HDD-based node, although I tried several times and in several configurations (including -connect to another node). I'm not sure how many disk operations are required on average to add a block to the database; depending on this, you might run into hard processing limitations given anything but the first couple of years of blocks. IMHO it's worth measuring this either way.
#8:
cgra says:

November 28, 2021 at 12:26

Regarding getting permanently stuck:

Peers keep track what blocks they've already advertised to you, in CNode.setInventoryKnown. Then, when responding to your getblocks, they won't ever advertise the same item again, no matter how often you requested. Only reconnecting with the peer will reset this "spam guard".

I believe the root of this problem is that the current code worked OK when TRB still kept orphans (the issue of ~infinite RAM usage issue aside), because one way or the other, you will end up requesting/receiving blocks out-of-order, while the next block in your current chain state is scheduled to be asked later, from another node, as the first one failed to deliver.

Now, with enough of this shuffle, you may end up in a situation where you've already asked block B from every peer, but always too early. When block B is finally the next in your chain state, and ready to be accepted, every peer will refuse to advertise the block B again! Your request mechanism requires one advert for every getdata request, so you're stuck until a new peer shows up, or a current peer reconnects.
#9:
cgra says:

December 10, 2021 at 20:18

Can you still recall how did you conclude that transactions prioritize over blocks?

If it really was the case, I'd like a clue, because I don't see it as possible (or am I mixing things up?). I see nLastTime being effectively a same-second call counter embedded into the nNow value. This should guarantee mapAskFor processing order between non-repeating items, even if calls came in rapid succession (they do when receiving responses to getblocks). And the 2-minute increment only affects repeated items.
#10:
spyked says:

December 11, 2021 at 00:12

Hey cgra! First off, thanks for your write-ups and efforts to untangle this; I still have your latest article in the to-read list, hoping to get to it by the end of the year.

Regarding your question: I didn't see it as possible either upon reading the code, but I did observe at run-time that plenty of transactions were scheduled for retrieval, while no blocks would be requested according to the logs. This might actually be related to your observation on duplicate requests or there might be some other reason altogether for this behaviour, in any case, it was a purely empirical observation which might not actually hold on careful inspection.

Unfortunately I've done very little independent technical work lately and none related to Bitcoin, as my node has been keeping up with the tip of the chain consistently after the initial sync. This tangled mess of a code be damned, but it works surprisingly well after all these years... at least as far as I've used it.

Now, the day should come when I will try to get another node up and running and I should run into this issue again. Not sure if it's the cleanest solution or just an ad-hoc attempt at a bootstrapping mechanism, but since I already have a list of (known, trusted) hashes synced with my previous node, this oughta help into steering the new node to speed up its initial sync by asking for precisely the blocks that it needs... and now that I think of it, perhaps the whole mempool mechanism should stay put while this initial sync takes place.
#11:
#ossasepia Logs for Jun 2020 « Ossa Sepia says:

June 26, 2020 at 16:38

[...] http://thetarpit.org/2020/a-cursory-look-at-the-infamous-trb-wedge-bug << The Tar Pit -- A cursory look at the infamous TRB "wedge" bug [...]

The Tar Pit

A cursory look at the infamous TRB "wedge" bug

11 Responses to “A cursory look at the infamous TRB "wedge" bug”

Leave a Reply