Commit Graph

37 Commits

Author SHA1 Message Date
Hector Sanjuan
755cebbe0d Enable spell checking and fix spelling errors (using US locale) 2022-06-16 17:43:30 +02:00
Hector Sanjuan
508791b547 Migrate from ipfs/ipfs-cluster to ipfs-cluster/ipfs-cluster
This performs the necessary renamings.
2022-06-16 17:43:30 +02:00
Hector Sanjuan
3169fba9d1 metrics: track total pins, queued, pinning, pin error.
This fixes #1470 and #1187.
2022-04-22 15:57:48 +02:00
Hector Sanjuan
9b9d76f92d Pinset streaming and method type revamp
This commit introduces the new go-libp2p-gorpc streaming capabilities for
Cluster. The main aim is to work towards heavily reducing memory usage when
working with very large pinsets.

As a side-effect, it takes the chance to revampt all types for all public
methods so that pointers to static what should be static objects are not used
anymore. This should heavily reduce heap allocations and GC activity.

The main change is that state.List now returns a channel from which to read
the pins, rather than pins being all loaded into a huge slice.

Things reading pins have been all updated to iterate on the channel rather
than on the slice. The full pinset is no longer fully loaded onto memory for
things that run regularly like StateSync().

Additionally, the /allocations endpoint of the rest API no longer returns an
array of pins, but rather streams json-encoded pin objects directly. This
change has extended to the restapi client (which puts pins into a channel as
they arrive) and to ipfs-cluster-ctl.

There are still pending improvements like StatusAll() calls which should also
stream responses, and specially BlockPut calls which should stream blocks
directly into IPFS on a single call.

These are coming up in future commits.
2022-03-19 03:02:55 +01:00
Hector Sanjuan
000dccc1cc Monitor: do not clean up metrics immediately after an alert 2022-01-31 17:53:09 +01:00
Hector Sanjuan
d4591b8442 Monitor: remove accrual detection. Add LatestForPeer method.
Fixes #939
2022-01-31 17:53:09 +01:00
Hector Sanjuan
12756e0220 monitor: fix panic in checker
Alert might be launched when no metrics for peer are received at all.
2021-01-14 00:04:40 +01:00
Hector Sanjuan
90208b45f9 health/alerts endpoint: brush up old PR 2021-01-13 22:09:21 +01:00
Hector Sanjuan
4bcb91ee2b Merge branch 'master' into feat/alerts 2021-01-13 21:08:49 +01:00
Hector Sanjuan
f83ff9b655 staticcheck: fix all staticcheck warnings in the project 2020-04-14 20:16:10 +02:00
Kishan Mohanbhai Sagathiya
618ebd23f4 Check expiry in alert 2019-12-13 12:25:28 +05:30
Kishan Mohanbhai Sagathiya
76857112b2 Test that expired PeerMetrics gets deleted 2019-09-13 08:01:15 +07:00
Hector Sanjuan
e240c2a19f Simplify failed peer detection 2019-06-27 16:55:51 +01:00
Adrian Lanzafame
2255ba737b
fix ttl expiration check
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-06-25 12:54:41 +02:00
Hector Sanjuan
563a0da9ae Do alert for all metric types 2019-06-23 10:14:29 +01:00
Adrian Lanzafame
27295c10ac fix check failed
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-06-23 10:14:29 +01:00
Adrian Lanzafame
5e09da9d63 address pr feedback
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-06-23 10:14:29 +01:00
Adrian Lanzafame
e1b40d49c1 fix how accrual fd treats ttls
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-06-23 10:14:29 +01:00
Hector Sanjuan
b804e61ef0 Update deps along with go-libp2p-core refactor
Lots of rewrites in imports...
2019-06-14 13:10:45 +02:00
Hector Sanjuan
27368ab077 Fix: alert at most once PER METRIC
Before it would alert at most once per peer, which prevented some metrics
from alerting at all.
2019-06-11 11:44:12 +02:00
Hector Sanjuan
a0d93fc62c Change MaxAlertThreshold to 1 2019-06-11 10:54:12 +02:00
Adrian Lanzafame
14841e4e24 address pr feedback
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-06-11 10:54:12 +02:00
Adrian Lanzafame
7459917275 alerting for peers stops after one alert
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-06-11 10:54:12 +02:00
Adrian Lanzafame
a763560e0c
extend the initial size of metrics distribution to 5
This prevents accrual failure detection from kicking in too
soon after a cluster has started.

License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-05-07 19:07:11 +10:00
Adrian Lanzafame
9464759ae6
remove hard timeout limits and use only accrual failure detection
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-30 12:06:01 +10:00
Adrian Lanzafame
42693eb06d
fix passing ctx from daemon to pubsub
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-29 17:58:28 +10:00
Adrian Lanzafame
32ca9167d6
use accrual instead of metric expiration
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-26 17:58:29 +10:00
Adrian Lanzafame
3c09ebcc71
add Alerts measure
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-26 17:56:44 +10:00
Adrian Lanzafame
d5ecd9ef6a
WIP
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-23 20:30:26 +10:00
Adrian Lanzafame
eae4329cb3
address pr feedback
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-18 16:18:19 +10:00
Adrian Lanzafame
31af640e33
use allocations list to choose peer to repin
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-18 16:16:40 +10:00
Adrian Lanzafame
e187b800cf
rename TS to ReceivedAt
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-18 16:14:13 +10:00
Adrian Lanzafame
3d6eb64db6
Add accrual failure detection method
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-18 16:14:13 +10:00
Hector Sanjuan
acbd7fda60 Consensus: add new "crdt" consensus component
This adds a new "crdt" consensus component using go-ds-crdt.

This implies several refactors to fully make cluster consensus-component
independent:

* Delete mapstate and fully adopt dsstate (after people have migrated).
* Return errors from state methods rather than ignoring them.
* Add a new "datastore" modules so that we can configure datastores in the
   main configuration like other components.
* Let the consensus components fully define the "state.State". Thus, they do
not receive the state, they receive the storage where we put the state (a
go-datastore).
* Allow to customize how the monitor component obtains Peers() (the current
  peerset), including avoiding using the current peerset. At the moment the
  crdt consensus uses the monitoring component to define the current peerset.
  Therefore the monitor component cannot rely on the consensus component to
  produce a peerset.
* Re-factor/re-implementation of "ipfs-cluster-service state"
  operations. Includes the dissapearance of the "migrate" one.

The CRDT consensus component defines creates a crdt-datastore (with ipfs-lite)
and uses it to intitialize a dssate. Thus the crdt-store is elegantly
wrapped. Any modifications to the state get automatically replicated to other
peers. We store all the CRDT DAG blocks in the local datastore.

The consensus components only expose a ReadOnly state, as any modifications to
the shared state should happen through them.

DHT and PubSub facilities must now be created outside of Cluster and passed in
so they can be re-used by different components.
2019-04-17 19:14:26 +02:00
Hector Sanjuan
6447ea51d2 Remove *Serial types. Use pointers for all types.
This takes advantange of the latest features in go-cid, peer.ID and
go-multiaddr and makes the Go types serializable by default.

This means we no longer need to copy between Pin <-> PinSerial, or ID <->
IDSerial etc. We can now efficiently binary-encode these types using short
field keys and without parsing/stringifying (in many cases it just a cast).

We still get the same json output as before (with minor modifications for
Cids).

This should greatly improve Cluster performance and memory usage when dealing
with large collections of items.

License: MIT
Signed-off-by: Hector Sanjuan <hector@protocol.ai>
2019-02-27 17:04:35 +00:00
Adrian Lanzafame
3b3f786d68
add opencensus tracing and metrics
This commit adds support for OpenCensus tracing
and metrics collection. This required support for
context.Context propogation throughout the cluster
codebase, and in particular, the ipfscluster component
interfaces.

The tracing propogates across RPC and HTTP boundaries.
The current default tracing backend is Jaeger.

The metrics currently exports the metrics exposed by
the opencensus http plugin as well as the pprof metrics
to a prometheus endpoint for scraping.
The current default metrics backend is Prometheus.

Metrics are currently exposed by default due to low
overhead, can be turned off if desired, whereas tracing
is off by default as it has a much higher performance
overhead, though the extent of the performance hit can be
adjusted with smaller sampling rates.

License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-02-04 18:53:21 +10:00
Hector Sanjuan
954ede931f Monitor: more refactoring. Rename util to metrics
License: MIT
Signed-off-by: Hector Sanjuan <code@hector.link>
2018-05-09 11:01:41 +02:00