* Go API: the public interface offered by the Cluster object in Go.
* Component: an ipfs-cluster module which performs specific functions and uses the RPC API to communicate with other parts of the system. Implements the Component interface.
`ipfs-cluster` architecture attempts to be as modular as possible, with interfaces between its modules (components) clearly defined, in a way that they can:
* Be swapped for alternative or improved implementations without affecting the rest of the system
* Be easily tested in isolation
### Components
`ipfs-cluster` consists of:
* The definitions of components and their interfaces and related types (`ipfscluster.go`)
* The **Cluster** main-component, which binds together the whole system and offers the Go API (`cluster.go`). This component takes arbitrary implementations of:
* **API**: a component which offers a public facing API. Default: `RESTAPI`
* **IPFSConnector**: a component which talks to the IPFS daemon and provides a proxy to it. Default: `ipfshttp`
* **State**: a component which keeps a list of Pins (maintained by the Consensus component). Default: `mapstate`
* **PinTracker**: a component which tracks the pin set and makes sure that it is persisted by the IPFS daemon as intended. Default: `maptracker`
* **PeerMonitor**: a component to log metrics and detect peer failures. Default: `basic`
* **PinAllocator**: a component to decide which peers should pin a CID given some metrics. Default: `descendalloc`
* **Informer**: a component to collect metrics which are used by the `PinAllocator` and the `PeerMonitor`. Default: `disk`
* The **Consensus** component. This component is separate but internal to Cluster, in the sense that it cannot be provided arbitrarily during initialization. The consensus component uses `go-libp2p-raft` via `go-libp2p-consensus`. While it attempts to remain agnostic of the underlying consensus implementation, this is not possible everywhere. Those places are, however, well marked (everything that calls `Leader()`).
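As a rough orientation, every component satisfies a small common interface so that the main Cluster component can wire it up with an RPC client and shut it down. The sketch below is hedged; the real definitions live in `ipfscluster.go` and may differ in detail:

```go
// Hedged sketch of the common Component interface; the actual definition
// in ipfscluster.go may differ in detail.
package ipfscluster

import rpc "github.com/libp2p/go-libp2p-gorpc"

// Component is implemented by every module listed above. The main Cluster
// component hands each of them an RPC client so they can talk to the rest
// of the system, and calls Shutdown() when the cluster stops.
type Component interface {
	SetClient(*rpc.Client)
	Shutdown() error
}
```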
Communication between components happens through the RPC API: a set of functions which establishes what functionality is available to components (`rpc_api.go`).
The RPC API uses `go-libp2p-gorpc`. The main Cluster component runs an RPC server. RPC Clients are provided to all components for their use. The main feature of this setup is that **Components can use `go-libp2p-gorpc` to perform operations in the local cluster and in any remote cluster node using the same API**.
This makes broadcasting operations and contacting the Cluster leader really easy. It also opens the door to a future in which components may be completely arbitrary and run from different applications. Local RPC calls, on their side, do not suffer any penalty, as execution is short-cut directly to the corresponding component of the Cluster without network intervention.
On the downside, the RPC API involves `reflect` magic, and it is not easy to verify that a call targets a method actually registered on the RPC server. Every RPC-based functionality should therefore be tested. Bad operations result in errors, so they are easy to catch in tests.
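To illustrate the register/call pattern: `go-libp2p-gorpc` follows conventions similar to Go's standard `net/rpc` package but transports calls over libp2p streams. The sketch below uses plain `net/rpc` only so it stays self-contained; the service, argument and reply types are hypothetical:

```go
package main

import (
	"fmt"
	"net"
	"net/rpc"
)

// Hypothetical argument/reply types for illustration only.
type PinArgs struct{ Cid string }
type PinReply struct{ OK bool }

// Cluster is a stand-in RPC service. Exported methods of the form
// Method(args T, reply *T) error become callable by name, which mirrors
// the convention used by the Cluster RPC server.
type Cluster struct{}

func (c *Cluster) Pin(in PinArgs, out *PinReply) error {
	out.OK = true // pretend the pin was accepted
	return nil
}

func main() {
	srv := rpc.NewServer()
	srv.Register(&Cluster{})

	// net.Pipe gives an in-memory connection, mimicking the "local call"
	// case in which execution is short-cut without network intervention.
	clientConn, serverConn := net.Pipe()
	go srv.ServeConn(serverConn)

	client := rpc.NewClient(clientConn)
	var reply PinReply
	err := client.Call("Cluster.Pin", PinArgs{Cid: "Qm..."}, &reply)
	fmt.Println(reply.OK, err)
}
```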
Components are organized in different submodules (e.g. `pintracker/maptracker` contains the `MapPinTracker` implementation of the `PinTracker` component). Interfaces for all components live in the base module. The executables (`ipfs-cluster-service` and `ipfs-cluster-ctl`) are also submodules of the base module.
A `config` module provides support for a central configuration file with sections defined by each component, via configuration objects implementing a `ComponentConfig` interface.
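A hedged sketch of what such a configuration object might satisfy (method names here are illustrative; the real `ComponentConfig` interface in the `config` module may declare different or additional methods):

```go
// Illustrative only: the real ComponentConfig interface may differ.
package config

// ComponentConfig is the kind of interface each component's configuration
// object satisfies so the central configuration file can be assembled
// from per-component sections.
type ComponentConfig interface {
	ConfigKey() string         // name of this component's section
	Default() error            // fill in sane defaults
	LoadJSON(raw []byte) error // parse this component's JSON section
	ToJSON() ([]byte, error)   // serialize the section for saving
	Validate() error           // sanity-check the loaded values
}
```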
### `ipfs-cluster-service`

This is the service application of IPFS Cluster. It brings up a cluster, connects to other peers, gets the latest consensus state and participates in the cluster. It handles clean shutdowns when receiving interrupts. On startup, it does the following:
* Initialize the PeerManager: needs to keep track of cluster peers.
* Initialize Consensus: start looking for a leader asap.
* Set up RPC in all components: allow them to communicate with different parts of the cluster.
* Bootstrap: if we are bootstrapping from another node, do the dance (contact, receive cluster peers, join consensus)
* Cluster is up, but it is only `Ready()` when consensus is ready (which in this case means it has found a leader).
* All components are doing their thing.
Consensus startup deserves a note:
* Raft is set up and wrapped with `go-libp2p-consensus`.
* Waits until there is a leader
* Waits until the state is fully synced (compare raft last log index with current)
* Triggers a `SyncState` to the `PinTracker`, which means that we have recovered the full state of the system and the pin tracker should keep tabs on the Pins in that state (we don't wait for completion of this operation).
* Consensus is then ready.
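A hedged sketch of the "wait for a leader, then wait until the state is synced" logic, using introspection methods from `hashicorp/raft` (`Leader`, `LastIndex`, `AppliedIndex`); the real consensus component also deals with timeouts, retries and shutdown:

```go
package consensus

import (
	"time"

	raft "github.com/hashicorp/raft"
)

// waitForSync is an illustrative sketch, not the real implementation.
func waitForSync(r *raft.Raft) {
	// Wait until Raft reports a leader.
	for r.Leader() == "" {
		time.Sleep(500 * time.Millisecond)
	}
	// Wait until the last log entry has been applied locally,
	// i.e. the state is fully synced.
	for r.AppliedIndex() < r.LastIndex() {
		time.Sleep(500 * time.Millisecond)
	}
	// At this point a SyncState can be triggered on the PinTracker.
}
```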
If the above procedures don't work, Cluster might just shut itself down (i.e. if a leader is not found after a while, we automatically shut down).
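For the clean-shutdown part mentioned earlier, here is a minimal sketch of the interrupt-handling pattern; names other than `Shutdown()` are assumed wiring, not the actual `ipfs-cluster-service` code:

```go
package service

import (
	"log"
	"os"
	"os/signal"
	"syscall"
)

// shutdowner stands in for the Cluster object; it only needs Shutdown().
type shutdowner interface {
	Shutdown() error
}

// handleSignals blocks until an interrupt arrives and then shuts the
// cluster down cleanly, mirroring the behaviour described above.
func handleSignals(c shutdowner) {
	signalCh := make(chan os.Signal, 1)
	signal.Notify(signalCh, syscall.SIGINT, syscall.SIGTERM)
	<-signalCh
	if err := c.Shutdown(); err != nil {
		log.Println("error during cluster shutdown:", err)
	}
}
```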
### Pinning
* If it's an API request, it involves an RPC request to the Cluster main component.
* `Cluster.Pin()` is called.
* We find some peers to pin (`Allocate`) using the metrics from the PeerMonitor and the `PinAllocator`.
* We have a CID and some allocations: time to make a log entry in the consensus log.
* The consensus component will forward this to the Raft leader, as only the leader can make log entries.
* The Raft leader makes a log entry with a `LogOp` that says "pin this CID with these allocations".
* Every peer receives the operation. Per the consensus `OpLog`, the `Apply(state)` method is triggered. Every peer modifies the `State` according to this operation. The state keeps the full map of CIDs and allocations and is only updated via log operations.
* The Apply method triggers a `Track` RPC request to the `PinTracker` in each peer.
* The pin tracker now knows that it needs to track a certain CID.
* If the CID is allocated to the peer, the `PinTracker` will trigger an RPC request to the `IPFSConnector` asking it to pin the CID.
* If the CID is allocated to different peers, the `PinTracker` will mark it as `remote`.
* Now the content is officially pinned.
Notes:
* the log operation should never fail. It has no real places to fail. The calls to the `PinTracker` are asynchronous and a side effect of the state modification.
* the `PinTracker` also should not fail to track anything, BUT
* the `IPFSConnector` might fail to pin something. In this case pins are marked as `pin_error` in the pin tracker.
* the `MapPinTracker` (current implementation of `PinTracker`) queues pin requests so they happen one by one to IPFS. In the meantime things are in `pinning` state.
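To make the flow above a bit more concrete, here is a hedged sketch of what a pin log operation might look like; the real `LogOp` type and the `go-libp2p-consensus` interfaces differ in their details:

```go
package consensus

// PinOp is an illustrative consensus log operation; the real LogOp differs.
type PinOp struct {
	Cid         string   // the CID being pinned
	Allocations []string // peer IDs chosen by the PinAllocator
}

// State is a simplified stand-in for the shared state: the full map of
// CID -> allocations, only ever updated through log operations.
type State map[string][]string

// ApplyTo updates the shared state on every peer. The state mutation has
// no real way to fail; as a side effect, an asynchronous Track RPC request
// would be sent to the local PinTracker, which may later mark the pin as
// pin_error if the IPFSConnector fails to pin it.
func (op PinOp) ApplyTo(st State) State {
	st[op.Cid] = op.Allocations
	// A Track RPC request to the local PinTracker would be issued here.
	return st
}
```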
### Adding a peer

When a new peer is added to the cluster:
* Send the new peer the list of cluster peers. This is an RPC request to the new peer's `PeerManager`, which keeps track of the current cluster peers.
* As usual, the "add peer" operation is forwarded to the Raft leader.
* The consensus component logs the operation and also uses Raft's `AddPeer` method (or equivalent).
* This results in two log entries: one internal to Raft, which updates the internal Raft peerstore in all peers, and one from Cluster, which is received by the `Apply` method. This `Apply` operation does not modify the shared `State` (as pinning does), but rather only notifies the `PeerManager` about the new peer so the nodes can be set up to talk to it.
As such, we are effectively using the consensus log to broadcast a PeerAdd operation. That is because this operation is critical and should either succeed everywhere or fail. If we performed a regular "for loop broadcast" and some peers failed, we would end up with an uncertain state and a mess that needs to be cleaned up. By using the consensus log, those peers which did not receive the operation have the opportunity to receive it later (on restart, if they were down). There are still pitfalls to this (restoring from a snapshot might result in peers with missing cluster peers), but the errors are more obvious and isolated than with other approaches.
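To illustrate the alternative rejected above, a naive "for loop broadcast" might look like the hedged sketch below (all names are hypothetical); a partial failure leaves only part of the cluster aware of the new peer, with no built-in way to replay the missed update, which is exactly what logging the operation through consensus avoids:

```go
package peers

import "fmt"

// peerAddRPC is a hypothetical helper that sends a PeerAdd RPC to one peer.
func peerAddRPC(target, newPeer string) error {
	return fmt.Errorf("peer %s unreachable", target) // pretend some calls fail
}

// naiveBroadcastPeerAdd shows the problem: each call can fail
// independently, so only some peers learn about the new peer, and
// cleanup of the resulting inconsistency is left to the caller.
func naiveBroadcastPeerAdd(clusterPeers []string, newPeer string) []error {
	var errs []error
	for _, p := range clusterPeers {
		if err := peerAddRPC(p, newPeer); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}
```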
Notes:
* The `Join()` (used for bootstrapping) is just a `PeerAdd()` operation triggered remotely by an RPC request from the joining peer.