This is what it was likely causing PeerRemove tests to fail randomly
but very often. We cancelled the Cluster context before shutting down
the Consensus component. This killed networking and aborted
the peer remove operations when the leader is removing itself.
As a result, it would error with "leadership lost", which would
trigger a retry which would set the final error to "context cancelled"
because the shutdown of the consensus component proceeds during the
retry, cancelling the consensus context.
This is not only affecting tests, it might affected operations when
running cluster.
License: MIT
Signed-off-by: Hector Sanjuan <hector@protocol.ai>
I think this will prevents some random tests failures
when we realize that we are not anymore in the peerset
and trigger a shutdown but Raft has not finished fully
committing the operation, which then triggers an error,
and a retry. But the contexts are cancelled in the retry
so it won't find a leader and will error finally error
with that message.
License: MIT
Signed-off-by: Hector Sanjuan <hector@protocol.ai>
This change removes the duplicities of the PeerManager component:
* No more commiting PeerAdd and PeerRm log entries
* The Raft peer set is the source of truth
* Basic broadcasting is used to communicate peer multiaddresses
in the cluster
* A peer can only be added in a healthy cluster
* A peer can be removed from any cluster which can still commit
* This also adds support for multiple multiaddresses per peer
License: MIT
Signed-off-by: Hector Sanjuan <hector@protocol.ai>
The main differences is that the new version of Raft is more strict
about starting raft peers which already contain configurations.
For a start, cluster will fail to start if the configured cluster
peers are different from the Raft peers. The user will have to
manually cleanup Raft (TODO: an ipfs-cluster-service command for it).
Additionally, this commit adds extra options to the consensus/raft
configuration section, adds tests and improves existing ones and
improves certain code sections.
License: MIT
Signed-off-by: Hector Sanjuan <hector@protocol.ai>
The following commit reimplements ipfs-cluster configuration under
the following premises:
* Each component is initialized with a configuration object
defined by its module
* Each component decides how the JSON representation of its
configuration looks like
* Each component parses and validates its own configuration
* Each component exposes its own defaults
* Component configurations are make the sections of a
central JSON configuration file (which replaces the current
JSON format)
* Component configurations implement a common interface
(config.ComponentConfig) with a set of common operations
* The central configuration file is managed by a
config.ConfigManager which:
* Registers ComponentConfigs
* Assigns the correspondent sections from the JSON file to each
component and delegates the parsing
* Delegates the JSON generation for each section
* Can be notified when the configuration is updated and must be
saved to disk
The new service.json would then look as follows:
```json
{
"cluster": {
"id": "QmTVW8NoRxC5wBhV7WtAYtRn7itipEESfozWN5KmXUQnk2",
"private_key": "<...>",
"secret": "00224102ae6aaf94f2606abf69a0e278251ecc1d64815b617ff19d6d2841f786",
"peers": [],
"bootstrap": [],
"leave_on_shutdown": false,
"listen_multiaddress": "/ip4/0.0.0.0/tcp/9096",
"state_sync_interval": "1m0s",
"ipfs_sync_interval": "2m10s",
"replication_factor": -1,
"monitor_ping_interval": "15s"
},
"consensus": {
"raft": {
"heartbeat_timeout": "1s",
"election_timeout": "1s",
"commit_timeout": "50ms",
"max_append_entries": 64,
"trailing_logs": 10240,
"snapshot_interval": "2m0s",
"snapshot_threshold": 8192,
"leader_lease_timeout": "500ms"
}
},
"api": {
"restapi": {
"listen_multiaddress": "/ip4/127.0.0.1/tcp/9094",
"read_timeout": "30s",
"read_header_timeout": "5s",
"write_timeout": "1m0s",
"idle_timeout": "2m0s"
}
},
"ipfs_connector": {
"ipfshttp": {
"proxy_listen_multiaddress": "/ip4/127.0.0.1/tcp/9095",
"node_multiaddress": "/ip4/127.0.0.1/tcp/5001",
"connect_swarms_delay": "7s",
"proxy_read_timeout": "10m0s",
"proxy_read_header_timeout": "5s",
"proxy_write_timeout": "10m0s",
"proxy_idle_timeout": "1m0s"
}
},
"monitor": {
"monbasic": {
"check_interval": "15s"
}
},
"informer": {
"disk": {
"metric_ttl": "30s",
"metric_type": "freespace"
},
"numpin": {
"metric_ttl": "10s"
}
}
}
```
This new format aims to be easily extensible per component. As such,
it already surfaces quite a few new options which were hardcoded
before.
Additionally, since Go API have changed, some redundant methods have been
removed and small refactoring has happened to take advantage of the new
way.
License: MIT
Signed-off-by: Hector Sanjuan <hector@protocol.ai>