We are propagating the wrong context (mostly from the Cluster top-level
methods). This makes that request cancellations (and cancellations of the
associated contexts) are not propagated to many methods, and can result in
deadlocks when an operation that is holding a lock is not aborted.
This affects for example the operation tracker. Getting all operations from
the tracker relies on someone reading from the out channel, or on the context
being cancelled. When a request is aborted in the middle of the response, and
the context is not cancelled, everything that wants to list operations would
become deadlocked, including operations that need write locks like
TrackNewOperation.
This fixes it.
When IPFS starts failing or doesn't respond (i.e. during a restart), cluster
is likely to start sending requests at very fast rates. i.e. if there are 100k
items to be pinned, and pins start failing immediately, cluster will consume
the pin queue really fast and it will all be failures. At the same time, ipfs
is hammered non-stop until recover, which may make it harder.
This commits introduces a rate-limit when requests to IPFS fail. After 10
failed requests, requests will be sent at most at 1req/s rate. Once a requests
succeeds, the rate-limit is raised.
This should prevent hammering the IPFS daemon, but also increased CPU in
cluster as it burns through pinning queues when IPFS is offline, making the
situation in machines worse (and emitting way more logs).
This fixes a bug in API code that made it return 204-No content when the RPC
methods failed with an error before any items were returned on the channel.
Unfortunately we were not paying attentions to errors while rpc-streaming pins
in the pintracker. The result is that the StatusAll operation would list all
the pins as unexpectedly unpinned when ipfs is offline, and this would result in
recover/requeing operations for all pins when ipfs is offline.
This commits changes the behaviour so that if IPFS Pin/ls has resulted in an
error, then the StatusAll operation cannot complete at all.
This commit introduces unlimited waiting on start until a request to `ipfs id`
succeeds.
Waiting has some consequences:
* State watching (recover/sync) and metrics publishing does not start until ipfs is ready
* swarm/connect is not triggered until ipfs is ready.
Once the first request to ipfs succeeds everything goes to what it was before.
This alleviates trying operations like sending our IDs in metrics when IPFS is
simply not there.
This fixes two bugs. First, the "blockPut response CID does not match the
multihash" warning was coming up when it shouldn't. Particularly, the
multipart reader called Node() several times for the same block, resulting in
CIDs been removed from the Seen set, and causing the warning when there were
several blocks (usually the empty dir block).
This also means we were counting Blockputs (and total data added) wrong in the
metrics, double-counting some blocks as these were recorded in Node() calls.
The fix makes the tracking in Next(), which is only called once for each
block. To avoid timing issues between Block reads from the channel and
blockput responses, the Seen set now stores how many times we have seen a
block. Thus a duplicated block that will get two BlockPut responses will not
trigger a warning regardless of the time when those responses arrive.
Fixes#1706.
It has been observed that some peers have a growing number of goroutines,
usually stuck in go-libp2p-gorpc.MultiStream() function, which is waiting to
read items from the arguments channel.
We suspect this is due to aborted /add requests. In situations when the add
request is aborted or fails, Finalize() is never called and the blocks channel
stays open, so MultiStream() can never exit, and the BlockStreamer can never
stop streaming etc.
As a fix, we added the requirement to call Close() when we stop using a
ClusterDAGService (error or not). This should ensure that the blocks channel
is always closed and not just on Finalize().
This implements committing batches on shutdown properly.
Now the batchWorker will only finish when there are no more things queued to
be included in the final batch(es).
LogPin/Unpin operations will fail while we are shutting down and they cannot
be included in the batch.
This attempt to commit any pending batches when the crdt component is being
shutudown. A commit should succeed if the new DAG node is created, heads are
replaced and broadcast.
The latest version of CRDT ensure that the datastore does not unnecessarily
gets marked as dirty when a broadcasted head cannot be fetched/processed, so
the side effect of publishing a head before shutting down should be under
control at least.