* Go API: the public interface offered by the Cluster object in Go.
* Component: an ipfs-cluster module which performs specific functions and uses the RPC API to communicate with other parts of the system. Implements the Component interface.
`ipfs-cluster` architecture attempts to be as modular as possible, with interfaces between its modules (components) clearly defined, in a way that they can:
* Be swapped for alternative or improved implementations without affecting the rest of the system
* Be easily tested in isolation
### Components
`ipfs-cluster` consists of:
* The definitions of components and their interfaces and related types (`ipfscluster.go`)
* The **Cluster** main-component, which binds together the whole system and offers the Go API (`cluster.go`). This component takes arbitrary implementations of:
* **API**: a component which offers a public facing API. Default: `RESTAPI`
* **IPFSConnector**: a component which talks to the IPFS daemon and provides a proxy to it. Default: `ipfshttp`
* **State**: a component which keeps a list of Pins (maintained by the Consensus component). Default: `mapstate`
* **PinTracker**: a component which tracks the pin set and makes sure that it is persisted by the IPFS daemon as intended. Default: `maptracker`
* **PeerMonitor**: a component to log metrics and detect peer failures. Default: `basic`
* **PinAllocator**: a component to decide which peers should pin a CID given some metrics. Default: `descendalloc`
* **Informer**: a component to collect metrics which are used by the `PinAllocator` and the `PeerMonitor`. Default: `disk`
* The **Consensus** component. This component is separate but internal to Cluster, in the sense that it cannot be provided arbitrarily during initialization. The consensus component uses `go-libp2p-raft` via `go-libp2p-consensus`. While it attempts to remain agnostic of the underlying consensus implementation, this is not possible everywhere. Those places are, however, well marked (everything that calls `Leader()`).
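As a rough orientation, every component satisfies a small common interface so that the main Cluster component can wire it up with an RPC client and shut it down. The sketch below is hedged; the real definitions live in `ipfscluster.go` and may differ in detail:

```go
// Hedged sketch of the common Component interface; the actual definition
// in ipfscluster.go may differ in detail.
package ipfscluster

import rpc "github.com/libp2p/go-libp2p-gorpc"

// Component is implemented by every module listed above. The main Cluster
// component hands each of them an RPC client so they can talk to the rest
// of the system, and calls Shutdown() when the cluster stops.
type Component interface {
	SetClient(*rpc.Client)
	Shutdown() error
}
```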
Communication between components happens through the RPC API: a set of functions which establishes what functionality is available to components (`rpc_api.go`).
The RPC API uses `go-libp2p-gorpc`. The main Cluster component runs an RPC server. RPC Clients are provided to all components for their use. The main feature of this setup is that **Components can use `go-libp2p-gorpc` to perform operations in the local cluster and in any remote cluster node using the same API**.
This makes broadcasting operations and contacting the Cluster leader really easy. It also opens the door to a future in which components may be completely arbitrary and run from different applications. Local RPC calls, on their side, do not suffer any penalty, as execution is short-cut directly to the corresponding component of the Cluster without network intervention.
On the downside, the RPC API involves `reflect` magic, and it is not easy to verify that a call targets a method actually registered on the RPC server. Every RPC-based functionality should therefore be tested. Bad operations result in errors, so they are easy to catch in tests.
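To illustrate the register/call pattern: `go-libp2p-gorpc` follows conventions similar to Go's standard `net/rpc` package but transports calls over libp2p streams. The sketch below uses plain `net/rpc` only so it stays self-contained; the service, argument and reply types are hypothetical:

```go
package main

import (
	"fmt"
	"net"
	"net/rpc"
)

// Hypothetical argument/reply types for illustration only.
type PinArgs struct{ Cid string }
type PinReply struct{ OK bool }

// Cluster is a stand-in RPC service. Exported methods of the form
// Method(args T, reply *T) error become callable by name, which mirrors
// the convention used by the Cluster RPC server.
type Cluster struct{}

func (c *Cluster) Pin(in PinArgs, out *PinReply) error {
	out.OK = true // pretend the pin was accepted
	return nil
}

func main() {
	srv := rpc.NewServer()
	srv.Register(&Cluster{})

	// net.Pipe gives an in-memory connection, mimicking the "local call"
	// case in which execution is short-cut without network intervention.
	clientConn, serverConn := net.Pipe()
	go srv.ServeConn(serverConn)

	client := rpc.NewClient(clientConn)
	var reply PinReply
	err := client.Call("Cluster.Pin", PinArgs{Cid: "Qm..."}, &reply)
	fmt.Println(reply.OK, err)
}
```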
Components are organized in different submodules (e.g. `pintracker/maptracker` contains the `MapPinTracker` implementation of the `PinTracker` component). Interfaces for all components live in the base module. The executables (`ipfs-cluster-service` and `ipfs-cluster-ctl`) are also submodules of the base module.
A `config` module provides support for a central configuration file with sections defined by each component, via configuration objects implementing a `ComponentConfig` interface.
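A hedged sketch of what such a configuration object might satisfy (method names here are illustrative; the real `ComponentConfig` interface in the `config` module may declare different or additional methods):

```go
// Illustrative only: the real ComponentConfig interface may differ.
package config

// ComponentConfig is the kind of interface each component's configuration
// object satisfies so the central configuration file can be assembled
// from per-component sections.
type ComponentConfig interface {
	ConfigKey() string         // name of this component's section
	Default() error            // fill in sane defaults
	LoadJSON(raw []byte) error // parse this component's JSON section
	ToJSON() ([]byte, error)   // serialize the section for saving
	Validate() error           // sanity-check the loaded values
}
```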
### `ipfs-cluster-service`

This is the service application of IPFS Cluster. It brings up a cluster, connects to other peers, gets the latest consensus state and participates in the cluster. It handles clean shutdowns when receiving interrupts. On startup, it does the following:
* Initialize the PeerManager: needs to keep track of cluster peers.
* Initialize Consensus: start looking for a leader asap.
* Set up RPC in all components: allow them to communicate with different parts of the cluster.
* Bootstrap: if we are bootstrapping from another node, do the dance (contact, receive cluster peers, join consensus)
* Cluster is up, but it is only `Ready()` when consensus is ready (which in this case means it has found a leader).
* All components are doing their thing.
Consensus startup deserves a note:
* Raft is set up and wrapped with `go-libp2p-consensus`.
* Waits until there is a leader
* Waits until the state is fully synced (compare raft last log index with current)
* Triggers a `SyncState` to the `PinTracker`, which means that we have recovered the full state of the system and the pin tracker should keep tabs on the Pins in that state (we don't wait for completion of this operation).
* Consensus is then ready.
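A hedged sketch of the "wait for a leader, then wait until the state is synced" logic, using introspection methods from `hashicorp/raft` (`Leader`, `LastIndex`, `AppliedIndex`); the real consensus component also deals with timeouts, retries and shutdown:

```go
package consensus

import (
	"time"

	raft "github.com/hashicorp/raft"
)

// waitForSync is an illustrative sketch, not the real implementation.
func waitForSync(r *raft.Raft) {
	// Wait until Raft reports a leader.
	for r.Leader() == "" {
		time.Sleep(500 * time.Millisecond)
	}
	// Wait until the last log entry has been applied locally,
	// i.e. the state is fully synced.
	for r.AppliedIndex() < r.LastIndex() {
		time.Sleep(500 * time.Millisecond)
	}
	// At this point a SyncState can be triggered on the PinTracker.
}
```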
If the above procedures don't work, Cluster might just shut itself down (i.e. if a leader is not found after a while, we automatically shut down).
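For the clean-shutdown part mentioned earlier, here is a minimal sketch of the interrupt-handling pattern; names other than `Shutdown()` are assumed wiring, not the actual `ipfs-cluster-service` code:

```go
package service

import (
	"log"
	"os"
	"os/signal"
	"syscall"
)

// shutdowner stands in for the Cluster object; it only needs Shutdown().
type shutdowner interface {
	Shutdown() error
}

// handleSignals blocks until an interrupt arrives and then shuts the
// cluster down cleanly, mirroring the behaviour described above.
func handleSignals(c shutdowner) {
	signalCh := make(chan os.Signal, 1)
	signal.Notify(signalCh, syscall.SIGINT, syscall.SIGTERM)
	<-signalCh
	if err := c.Shutdown(); err != nil {
		log.Println("error during cluster shutdown:", err)
	}
}
```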
### Pinning
* If it's an API request, it involves an RPC request to the Cluster main component.
* `Cluster.Pin()` is called.
* We find some peers to pin (`Allocate`) using the metrics from the PeerMonitor and the `PinAllocator`.
* We have a CID and some allocations: time to make a log entry in the consensus log.
* The consensus component will forward this to the Raft leader, as only the leader can make log entries.
* The Raft leader makes a log entry with a `LogOp` that says "pin this CID with these allocations".
* Every peer receives the operation. Per the consensus `OpLog`, the `Apply(state)` method is triggered. Every peer modifies the `State` according to this operation. The state keeps the full map of CIDs and allocations and is only updated via log operations.
* The Apply method triggers a `Track` RPC request to the `PinTracker` in each peer.
* The pin tracker now knows that it needs to track a certain CID.
* If the CID is allocated to the peer, the `PinTracker` will trigger an RPC request to the `IPFSConnector` asking it to pin the CID.
* If the CID is allocated to different peers, the `PinTracker` will mark it as `remote`.
* Now the content is officially pinned.
Notes:
* the log operation should never fail. It has no real places to fail. The calls to the `PinTracker` are asynchronous and a side effect of the state modification.
* the `PinTracker` also should not fail to track anything, BUT
* the `IPFSConnector` might fail to pin something. In this case pins are marked as `pin_error` in the pin tracker.
* the `MapPinTracker` (current implementation of `PinTracker`) queues pin requests so they happen one by one to IPFS. In the meantime things are in `pinning` state.
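To make the flow above a bit more concrete, here is a hedged sketch of what a pin log operation might look like; the real `LogOp` type and the `go-libp2p-consensus` interfaces differ in their details:

```go
package consensus

// PinOp is an illustrative consensus log operation; the real LogOp differs.
type PinOp struct {
	Cid         string   // the CID being pinned
	Allocations []string // peer IDs chosen by the PinAllocator
}

// State is a simplified stand-in for the shared state: the full map of
// CID -> allocations, only ever updated through log operations.
type State map[string][]string

// ApplyTo updates the shared state on every peer. The state mutation has
// no real way to fail; as a side effect, an asynchronous Track RPC request
// would be sent to the local PinTracker, which may later mark the pin as
// pin_error if the IPFSConnector fails to pin it.
func (op PinOp) ApplyTo(st State) State {
	st[op.Cid] = op.Allocations
	// A Track RPC request to the local PinTracker would be issued here.
	return st
}
```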
### Adding a peer

When a new peer is added to the cluster:
* Send the new peer the list of cluster peers. This is an RPC request to the new peer's `PeerManager`, which keeps track of the current cluster peers.
* As usual, the "add peer" operation is forwarded to the Raft leader.
* The consensus component logs the operation and also uses Raft's `AddPeer` method (or equivalent).
* This results in two log entries: one internal to Raft, which updates the internal Raft peerstore in all peers, and one from Cluster, which is received by the `Apply` method. This `Apply` operation does not modify the shared `State` (as pinning does), but rather only notifies the `PeerManager` about the new peer so the nodes can be set up to talk to it.
As such, we are effectively using the consensus log to broadcast a PeerAdd operation. That is because this operation is critical and should either succeed everywhere or fail. If we performed a regular "for loop broadcast" and some peers failed, we would end up with an uncertain state and a mess that needs to be cleaned up. By using the consensus log, those peers which did not receive the operation have the opportunity to receive it later (on restart, if they were down). There are still pitfalls to this (restoring from a snapshot might result in peers with missing cluster peers), but the errors are more obvious and isolated than with other approaches.
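To illustrate the alternative rejected above, a naive "for loop broadcast" might look like the hedged sketch below (all names are hypothetical); a partial failure leaves only part of the cluster aware of the new peer, with no built-in way to replay the missed update, which is exactly what logging the operation through consensus avoids:

```go
package peers

import "fmt"

// peerAddRPC is a hypothetical helper that sends a PeerAdd RPC to one peer.
func peerAddRPC(target, newPeer string) error {
	return fmt.Errorf("peer %s unreachable", target) // pretend some calls fail
}

// naiveBroadcastPeerAdd shows the problem: each call can fail
// independently, so only some peers learn about the new peer, and
// cleanup of the resulting inconsistency is left to the caller.
func naiveBroadcastPeerAdd(clusterPeers []string, newPeer string) []error {
	var errs []error
	for _, p := range clusterPeers {
		if err := peerAddRPC(p, newPeer); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}
```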
Notes:
* The `Join()` (used for bootstrapping) is just a `PeerAdd()` operation triggered remotely by an RPC request from the joining peer.