# IPFS Cluster architecture
## Definitions

These definitions are still evolving and may change:

- Peer: an `ipfs-cluster-service` node which is a member of the consensus. Alternatively: node, member.
- ipfs-cluster: the IPFS Cluster software.
- `ipfs-cluster-ctl`: the IPFS Cluster client command-line application.
- `ipfs-cluster-service`: the IPFS Cluster node application.
- API: the REST-ish API implemented by the `RESTAPI` component and used by the clients.
- RPC API: the internal API that cluster peers and components use.
- Go API: the public interface offered by the `Cluster` object in Go.
- Component: an ipfs-cluster module which performs specific functions and uses the RPC API to communicate with other parts of the system. It implements the `Component` interface.
## General overview

### Modularity

The `ipfs-cluster` architecture attempts to be as modular as possible, with the interfaces between its components clearly defined, so that each of them can:

- Be swapped for an alternative or improved implementation without affecting the rest of the system
- Be easily tested in isolation
### Components

`ipfs-cluster` consists of:

- The definitions of components, their interfaces and related types (`ipfscluster.go`).
- The `Cluster` main component, which binds together the whole system and offers the Go API (`cluster.go`). This component takes an arbitrary:
  - API: a component which offers a public-facing API. Default: `RESTAPI`
  - IPFSConnector: a component which talks to the IPFS daemon and provides a proxy to it. Default: `IPFSHTTPConnector`
  - State: a component which keeps a list of Pins (maintained by the Consensus component)
  - PinTracker: a component which tracks the pin set and makes sure it is persisted by the IPFS daemon as intended. Default: `MapPinTracker`
- The Consensus component. This component is separate but internal to `Cluster`, in the sense that it cannot be provided arbitrarily during initialization. It uses `go-libp2p-raft` via `go-libp2p-consensus`. While it attempts to stay agnostic of the underlying consensus implementation, this is not possible everywhere; those places are, however, well marked.
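The component boundaries above can be sketched as Go interfaces. This is a simplified, hypothetical rendering: the names follow the text, but the method signatures are illustrative only and do not match the real `ipfscluster.go` definitions.

```go
package main

import "fmt"

// RPCClient stands in for the go-libp2p-gorpc client that every
// component receives; here it is just a placeholder type.
type RPCClient struct{}

// Component is the interface every module implements. Components are
// constructed first and only become fully functional once SetClient
// hands them an RPC client to talk to the rest of the system.
type Component interface {
	SetClient(c *RPCClient)
	Shutdown() error
}

// API is a public-facing endpoint (e.g. the RESTAPI component).
type API interface {
	Component
}

// IPFSConnector talks to the IPFS daemon (e.g. IPFSHTTPConnector).
type IPFSConnector interface {
	Component
	Pin(cid string) error
	Unpin(cid string) error
}

// PinTracker tracks the local pin set (e.g. MapPinTracker).
type PinTracker interface {
	Component
	Track(cid string) error
	Untrack(cid string) error
}

// dummyTracker is a no-op PinTracker used only to show the wiring.
type dummyTracker struct{ client *RPCClient }

func (t *dummyTracker) SetClient(c *RPCClient)    { t.client = c }
func (t *dummyTracker) Shutdown() error           { return nil }
func (t *dummyTracker) Track(cid string) error    { return nil }
func (t *dummyTracker) Untrack(cid string) error  { return nil }

func main() {
	var tracker PinTracker = &dummyTracker{}
	tracker.SetClient(&RPCClient{}) // components are mute until this call
	fmt.Println(tracker.Track("QmExample") == nil)
}
```

The design choice to pass components into `Cluster` as interfaces is what allows them to be swapped or tested in isolation, as the Modularity section describes.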
Components perform a number of functions and need to be able to communicate with each other, for example:

- the API needs to use functionality provided by the main component
- the PinTracker needs to use functionality provided by the IPFSConnector
- the main component needs to use functionality provided by the main component of a different peer (the leader)
## RPC API

Communication between components happens through the RPC API: a set of functions which establishes which operations are available to components (`rpc_api.go`).

The RPC API uses `go-libp2p-gorpc`. The main `Cluster` component runs an RPC server, and RPC clients are provided to all components for their use. The main feature of this setup is that components can use `go-libp2p-gorpc` to perform operations on the local node and on any remote cluster node using the same API.

This makes broadcasting operations and contacting the Cluster leader very easy. It also allows us to envision a future in which components are completely arbitrary and run from different applications. Local RPC calls, in turn, do not suffer any penalty, as execution is shortcut directly to the server component of the Cluster, without network intervention.
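`go-libp2p-gorpc` follows the conventions of Go's standard `net/rpc` package. The sketch below uses `net/rpc` over an in-memory pipe to illustrate the same pattern: a service object is registered on a server and its methods are invoked through a client by name. This is an analogy only; the real RPC API travels over libp2p streams addressed by peer ID and supports the local shortcut described above, and the `PinArgs`/`PinReply` names here are hypothetical.

```go
package main

import (
	"fmt"
	"net"
	"net/rpc"
)

// PinArgs/PinReply mimic the argument/reply structs used by RPC
// methods in rpc_api.go (names here are hypothetical).
type PinArgs struct{ Cid string }
type PinReply struct{ Ok bool }

// ClusterRPC plays the role of the RPCAPI object registered on the server.
type ClusterRPC struct{}

// Pin is an RPC method: exported, takes args plus a reply pointer,
// and returns an error, exactly as net/rpc (and gorpc) require.
func (r *ClusterRPC) Pin(args PinArgs, reply *PinReply) error {
	reply.Ok = args.Cid != ""
	return nil
}

func main() {
	server := rpc.NewServer()
	server.Register(&ClusterRPC{})

	// In the real system the transport is a libp2p stream to a peer;
	// here an in-memory pipe stands in for it.
	cConn, sConn := net.Pipe()
	go server.ServeConn(sConn)

	client := rpc.NewClient(cConn)
	defer client.Close()

	var reply PinReply
	if err := client.Call("ClusterRPC.Pin", PinArgs{Cid: "QmExample"}, &reply); err != nil {
		panic(err)
	}
	fmt.Println(reply.Ok) // true
}
```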
## Code layout

Currently, all components live in the same `ipfscluster` Go module, but they can be moved out to their own submodules in the future without trouble.
## Applications

### ipfs-cluster-service

This is the service application of IPFS Cluster. It brings up a cluster peer, connects to other peers, fetches the latest consensus state and participates in the cluster.

### ipfs-cluster-ctl

This is the client/control application of IPFS Cluster. It is a command-line interface which uses the REST API to communicate with Cluster.
## Code paths

This section explains how some things work in Cluster.
### Startup

`NewCluster` triggers the Cluster bootstrap. The `IPFSConnector`, `State` and `PinTracker` components are provided as arguments. These components are up but possibly waiting for a `SetClient(RPCClient)` call; without an RPCClient, they are unable to communicate with the other components.

1. The first bootstrapping step is to create the libp2p `Host`. It uses configuration keys to set up the public and private keys, the swarm network etc.
2. The `peerManager` is initialized. It allows listing the known cluster peers and keeps components in the loop about changes.
3. The `RPCServer` (`go-libp2p-gorpc`) is set up, along with an `RPCClient`. The client can communicate with any RPC server using libp2p, or with the local one (shortcutting). The `RPCAPI` object is registered: it implements all the RPC operations supported by cluster and used for inter-component/inter-peer communication.
4. The `Consensus` component is bootstrapped. It:
   - Sets up a new Raft node (with the `cluster_peers` from the configuration) from scratch, including snapshots, stable datastore (boltDB), log store etc.
   - Initializes the `go-libp2p-consensus` components (`Actor` and `LogOpConsensus`) using `go-libp2p-raft`.
   - Returns a new `Consensus` object while asynchronously waiting for a Raft leader, and then for the current Raft peer to catch up with the latest index of the log, when there is anything to catch up with.
   - Waits for an RPCClient to be set. When that happens, it triggers an asynchronous RPC request (`StateSync`) to the local `Cluster` component and reports `Ready()`.
   - The `StateSync` operation (from the main `Cluster` component) makes a diff between the local `MapPinTracker` state and the consensus-maintained state. It triggers asynchronous local RPC requests (`Track` and `Untrack`) to the `MapPinTracker`.
   - The `MapPinTracker` receives the `Track` requests, checks, pins or unpins the items, and updates the local status of each pin (see the Pinning section below).
5. While the consensus is being bootstrapped, `SetClient(RPCClient)` is called on all components (tracker, IPFS connector, API and consensus).
6. The new `Cluster` object is returned.
7. Asynchronously, a thread triggers the bootstrap process and then waits for the `Consensus` component to report `Ready()`.
8. The bootstrap process uses the configured multiaddresses from the `Bootstrap` section of the configuration to perform a remote RPC `PeerAdd` request (it tries each of them until one works).
9. The remote node then performs a consensus log operation, logging the multiaddress and peer ID of the new cluster member. If the log operation is successful, the new member is added to Raft (which internally logs it as well). The new node is contacted by Raft and its internal peerset is updated. It receives the state from the cluster, including pins (which are synced to the `PinTracker`) and other peers (which are handled by the `peerManager`).
10. If the bootstrap was successful or not necessary, Cluster reports itself `Ready()`. At this moment, all components are up, consensus is working and the cluster is ready to perform any operations. The consensus state may still be syncing, or, most likely, the `MapPinTracker` may still be verifying pins against the daemon, but this does not cause any problems when using the cluster.
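The `StateSync` diff in step 4 can be sketched as follows. This is a hypothetical simplification using plain string CIDs: pins present in the consensus state but unknown to the tracker must be Tracked, and pins the tracker knows about but which are gone from the shared state must be Untracked. The real code works on typed CIDs and issues the Track/Untrack operations as asynchronous local RPC requests.

```go
package main

import (
	"fmt"
	"sort"
)

// stateSyncDiff computes which CIDs must be tracked and which must be
// untracked, given the consensus-maintained pin set and the pins the
// local tracker currently knows about.
func stateSyncDiff(consensus, tracked []string) (toTrack, toUntrack []string) {
	inConsensus := make(map[string]bool, len(consensus))
	for _, c := range consensus {
		inConsensus[c] = true
	}
	isTracked := make(map[string]bool, len(tracked))
	for _, c := range tracked {
		isTracked[c] = true
	}
	// In the consensus state but not tracked locally -> Track.
	for _, c := range consensus {
		if !isTracked[c] {
			toTrack = append(toTrack, c)
		}
	}
	// Tracked locally but gone from the consensus state -> Untrack.
	for _, c := range tracked {
		if !inConsensus[c] {
			toUntrack = append(toUntrack, c)
		}
	}
	sort.Strings(toTrack)
	sort.Strings(toUntrack)
	return toTrack, toUntrack
}

func main() {
	consensus := []string{"QmA", "QmB", "QmC"}
	tracked := []string{"QmB", "QmD"}
	toTrack, toUntrack := stateSyncDiff(consensus, tracked)
	fmt.Println(toTrack)   // [QmA QmC]
	fmt.Println(toUntrack) // [QmD]
}
```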
### Adding a peer

1. The `RESTAPI` component receives a `PeerAdd` request on the respective endpoint. It makes a `PeerAdd` RPC request.
2. The local RPC server receives it and calls the `PeerAdd` method in the main `Cluster` component.
3. If there is a connection from the given PID, we use it to determine the true multiaddress (with the right remote IP), in case it was not provided correctly (usually when using `Join`, since then it cannot be determined; see below). This allows peers to join correctly from across NATs.
4. The peer is added to the `peerManager` so that we can perform RPC requests to it.
5. A remote `RemoteMultiaddressForPeer` RPC request is made to the new peer, so that it reports what our multiaddress looks like from its side.
6. The address of the new peer is then committed to the consensus state. Since this is a new peer, the Raft peerstore is also updated. Raft starts sending updates to the new peer, among them one prompting it to update its own peerstore with the list of peers in this cluster.
7. The list of cluster peers (plus ourselves, with our real multiaddress) is sent to the new peer in a remote `PeerManagerAddMultiaddrs` RPC request.
8. A remote `ID` RPC request is performed and the information about the new node is returned.
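The address correction in step 3 can be sketched as follows. This is a hypothetical helper working on multiaddress strings: it keeps the transport and port the peer announced, but swaps in the IP actually observed on the incoming connection, which is what lets peers behind NATs be reachable. The real code manipulates typed multiaddrs rather than strings.

```go
package main

import (
	"fmt"
	"strings"
)

// fixupAddr replaces the IP portion of a peer-announced multiaddress
// with the IP observed on the live connection, keeping the announced
// transport and port. (Hypothetical sketch.)
//
// announced: "/ip4/10.0.0.5/tcp/9096"      (private address behind NAT)
// observed:  "/ip4/203.0.113.7/tcp/41002"  (source of the connection)
func fixupAddr(announced, observed string) string {
	ap := strings.Split(announced, "/") // ["", "ip4", "10.0.0.5", "tcp", "9096"]
	op := strings.Split(observed, "/")
	if len(ap) < 5 || len(op) < 5 {
		return announced // not in the expected shape; leave untouched
	}
	// observed protocol+IP, announced transport+port
	return "/" + op[1] + "/" + op[2] + "/" + ap[3] + "/" + ap[4]
}

func main() {
	fmt.Println(fixupAddr("/ip4/10.0.0.5/tcp/9096", "/ip4/203.0.113.7/tcp/41002"))
	// /ip4/203.0.113.7/tcp/9096
}
```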
### Joining a cluster

- Joining means calling `PeerAdd` on the cluster we want to join.
- This is also used for bootstrapping.
- Join is disallowed if we are not a single-peer cluster.
### Pinning

1. The `RESTAPI` component receives a `Pin` request on the respective endpoint. This triggers a `Pin` RPC request.
2. The local RPC server receives it and calls the `Pin` method in the main `Cluster` component.
3. The main cluster component calls `LogPin` in the `Consensus` component.
4. The Consensus component checks that it is the Raft leader; otherwise it makes a `ConsensusLogPin` RPC request to whoever the leader is.
5. The Consensus component of the current Raft leader builds a pin `clusterLogOp` operation and performs a `consensus.Commit(op)`. This is handled by `go-libp2p-raft` and, when the operation makes it into the Raft log, it triggers `clusterLogOp.ApplyTo(state)` in every peer.
6. `ApplyTo` modifies the local cluster state by adding the pin and triggers an asynchronous `Track` local RPC request. It returns without waiting for the result. While `ApplyTo` is running, no other state updates may be applied, so it is a critical code path.
7. The RPC server hands the request to the `MapPinTracker` component and its `Track` method. It marks the local state of the pin as `pinning` and makes a local RPC request to the `IPFSHTTPConnector` component (`IPFSPin`) to pin it.
8. The `IPFSHTTPConnector` component receives the request and uses the configured IPFS daemon API to send a pin request with the given hash. It waits until the call comes back and returns success or failure.
9. The `MapPinTracker` component receives the result of the RPC request to the `IPFSHTTPConnector` and updates the local status of the pin to `pinned` or `pin_error`.
## Legacy illustrations

See: https://ipfs.io/ipfs/QmWhbV7KX9toZbi4ycj6J9GVbTVvzGx5ERffc6ymYLT5HS

They need to be updated, but they are mostly accurate.