Commit Graph

50 Commits

Author SHA1 Message Date
Hector Sanjuan
12756e0220 monitor: fix panic in checker
Alert might be launched when no metrics for peer are received at all.
2021-01-14 00:04:40 +01:00
Hector Sanjuan
90208b45f9 health/alerts endpoint: brush up old PR 2021-01-13 22:09:21 +01:00
Hector Sanjuan
4bcb91ee2b Merge branch 'master' into feat/alerts 2021-01-13 21:08:49 +01:00
Hector Sanjuan
0dfa9ca185
(chore) Upgrade dependencies
Upgrade dependencies and bump to go1.15.
2020-08-27 14:10:58 +02:00
Hector Sanjuan
b513ec194d Fix some misspellings 2020-04-14 23:47:09 +02:00
Hector Sanjuan
f83ff9b655 staticcheck: fix all staticcheck warnings in the project 2020-04-14 20:16:10 +02:00
Hector Sanjuan
b3853caf36 Dependency upgrade: changes needed
* Libp2p protectors no longer needed, use PSK directly
* Generate cluster 32-byte secret here (helper gone from pnet)
* Switch to go-log/v2 in all places
* DHT bootstrapping not needed. Adjust DHT options for tests.
* Do not rely on disappeared CidToDsKey and DsKeyToCid functions from dshelp.
* Disable QUIC (does not support private networks)
* Fix tests: autodiscovery started working properly
2020-03-22 14:50:25 +01:00
Kishan Mohanbhai Sagathiya
618ebd23f4 Check expiry in alert 2019-12-13 12:25:28 +05:30
Kishan Sagathiya
31534a429b Fix #374: health metrics improvements
- Human-sizes for freespace metrics. Display when a metric
expires, in a form like "expires in 3m".
- When not passing metric name `ipfs-cluster-ctl health metrics` hits
the metrics endpoint which returns a list of available metrics and
displays it to the user
- Humanize metrics output
- Sort metrics output
2019-10-24 16:37:26 +02:00
Kishan Mohanbhai Sagathiya
76857112b2 Test that expired PeerMetrics gets deleted 2019-09-13 08:01:15 +07:00
Hector Sanjuan
e240c2a19f Simplify failed peer detection 2019-06-27 16:55:51 +01:00
Adrian Lanzafame
2255ba737b
fix ttl expiration check
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-06-25 12:54:41 +02:00
Hector Sanjuan
563a0da9ae Do alert for all metric types 2019-06-23 10:14:29 +01:00
Adrian Lanzafame
27295c10ac fix check failed
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-06-23 10:14:29 +01:00
Adrian Lanzafame
5e09da9d63 address pr feedback
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-06-23 10:14:29 +01:00
Adrian Lanzafame
e1b40d49c1 fix how accrual fd treats ttls
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-06-23 10:14:29 +01:00
Hector Sanjuan
b804e61ef0 Update deps along with go-libp2p-core refactor
Lots of rewrites in imports...
2019-06-14 13:10:45 +02:00
Hector Sanjuan
27368ab077 Fix: alert at most once PER METRIC
Before it would alert at most once per peer, which prevented some metrics
from alerting at all.
2019-06-11 11:44:12 +02:00
Hector Sanjuan
a0d93fc62c Change MaxAlertThreshold to 1 2019-06-11 10:54:12 +02:00
Adrian Lanzafame
14841e4e24 address pr feedback
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-06-11 10:54:12 +02:00
Adrian Lanzafame
7459917275 alerting for peers stops after one alert
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-06-11 10:54:12 +02:00
Adrian Lanzafame
a763560e0c
extend the initial size of metrics distribution to 5
This prevents accrual failure detection from kicking in too
soon after a cluster has started.

License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-05-07 19:07:11 +10:00
Adrian Lanzafame
9464759ae6
remove hard timeout limits and use only accrual failure detection
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-30 12:06:01 +10:00
Adrian Lanzafame
42693eb06d
fix passing ctx from daemon to pubsub
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-29 17:58:28 +10:00
Adrian Lanzafame
32ca9167d6
use accrual instead of metric expiration
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-26 17:58:29 +10:00
Adrian Lanzafame
3c09ebcc71
add Alerts measure
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-26 17:56:44 +10:00
Adrian Lanzafame
b0dbcbaa8d
add reference to original prob.go
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-26 12:20:31 +10:00
Adrian Lanzafame
d5ecd9ef6a
WIP
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-23 20:30:26 +10:00
Adrian Lanzafame
eae4329cb3
address pr feedback
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-18 16:18:19 +10:00
Adrian Lanzafame
31af640e33
use allocations list to choose peer to repin
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-18 16:16:40 +10:00
Adrian Lanzafame
1349e99c1e
fix time taken by tests
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-18 16:16:39 +10:00
Adrian Lanzafame
4338ea6905
refactor prob to use gonum and pass []float64
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-18 16:16:39 +10:00
Adrian Lanzafame
bcbe7b453f
refactor from big.Float to float64 and add prob tests
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-18 16:14:13 +10:00
Adrian Lanzafame
e187b800cf
rename TS to ReceivedAt
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-18 16:14:13 +10:00
Adrian Lanzafame
3d6eb64db6
Add accrual failure detection method
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-18 16:14:13 +10:00
Adrian Lanzafame
13ed78786c
fix distribution test and general clean up
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-18 16:09:19 +10:00
Hector Sanjuan
4e61935379
Use defer for locks. Move to Prev() in All()
License: MIT
Signed-off-by: Hector Sanjuan <code@hector.link>
2019-04-18 16:09:19 +10:00
Hector Sanjuan
da3c543ce2
Revert "attempt copying slice"
This reverts commit 0d4d40513fccd31b9cdc4db369aa87e87c529be4.
2019-04-18 16:09:19 +10:00
Adrian Lanzafame
46d6cb155d
attempt copying slice
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-18 16:09:19 +10:00
Adrian Lanzafame
2b1b8a4389
remove use of last
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-18 16:09:18 +10:00
Adrian Lanzafame
ebcf40cf7d
rename TS to ReceivedAt
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-18 16:09:18 +10:00
Hector Sanjuan
7711ab8cfd
Replace underlying slice with ring.Ring in metrics window
License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-04-18 16:09:18 +10:00
Hector Sanjuan
acbd7fda60 Consensus: add new "crdt" consensus component
This adds a new "crdt" consensus component using go-ds-crdt.

This implies several refactors to fully make cluster consensus-component
independent:

* Delete mapstate and fully adopt dsstate (after people have migrated).
* Return errors from state methods rather than ignoring them.
* Add a new "datastore" module so that we can configure datastores in the
   main configuration like other components.
* Let the consensus components fully define the "state.State". Thus, they do
not receive the state, they receive the storage where we put the state (a
go-datastore).
* Allow to customize how the monitor component obtains Peers() (the current
  peerset), including avoiding using the current peerset. At the moment the
  crdt consensus uses the monitoring component to define the current peerset.
  Therefore the monitor component cannot rely on the consensus component to
  produce a peerset.
* Re-factor/re-implementation of "ipfs-cluster-service state"
  operations. Includes the disappearance of the "migrate" one.

The CRDT consensus component creates a crdt-datastore (with ipfs-lite)
and uses it to initialize a dsstate. Thus the crdt-store is elegantly
wrapped. Any modifications to the state get automatically replicated to other
peers. We store all the CRDT DAG blocks in the local datastore.

The consensus components only expose a ReadOnly state, as any modifications to
the shared state should happen through them.

DHT and PubSub facilities must now be created outside of Cluster and passed in
so they can be re-used by different components.
2019-04-17 19:14:26 +02:00
Hector Sanjuan
ea85cf7805 Rename "test.Test*" to "test.*" (test.TestCid1 -> test.Cid1)
License: MIT
Signed-off-by: Hector Sanjuan <hector@protocol.ai>
2019-02-27 20:19:10 +00:00
Hector Sanjuan
6447ea51d2 Remove *Serial types. Use pointers for all types.
This takes advantage of the latest features in go-cid, peer.ID and
go-multiaddr and makes the Go types serializable by default.

This means we no longer need to copy between Pin <-> PinSerial, or ID <->
IDSerial etc. We can now efficiently binary-encode these types using short
field keys and without parsing/stringifying (in many cases it is just a cast).

We still get the same json output as before (with minor modifications for
Cids).

This should greatly improve Cluster performance and memory usage when dealing
with large collections of items.

License: MIT
Signed-off-by: Hector Sanjuan <hector@protocol.ai>
2019-02-27 17:04:35 +00:00
Adrian Lanzafame
3b3f786d68
add opencensus tracing and metrics
This commit adds support for OpenCensus tracing
and metrics collection. This required support for
context.Context propagation throughout the cluster
codebase, and in particular, the ipfscluster component
interfaces.

The tracing propagates across RPC and HTTP boundaries.
The current default tracing backend is Jaeger.

Metrics collection currently exports the metrics exposed by
the opencensus http plugin as well as the pprof metrics
to a prometheus endpoint for scraping.
The current default metrics backend is Prometheus.

Metrics are currently exposed by default due to low
overhead and can be turned off if desired, whereas tracing
is off by default as it has a much higher performance
overhead, though the extent of the performance hit can be
adjusted with smaller sampling rates.

License: MIT
Signed-off-by: Adrian Lanzafame <adrianlanzafame92@gmail.com>
2019-02-04 18:53:21 +10:00
Hector Sanjuan
19b1124999 Make metrics human
Issue #572 exposes metrics but they carry the peer ID in binary.

This was ok with our internal codecs but it doesn't seem to work
very well with json, and makes the output format unusable.

This makes the Metric.Peer field a string.

Additionally, fixes calling the command without arguments and displaying
the date in the right format.

License: MIT
Signed-off-by: Hector Sanjuan <code@hector.link>
2018-10-26 14:11:30 +02:00
Hector Sanjuan
5ca8ca39eb Monitor/tests: Allow to run tests using the basic monitor.
Do it in additional stage in Travis.

Also, test fixes.

License: MIT
Signed-off-by: Hector Sanjuan <code@hector.link>
2018-05-09 11:39:21 +02:00
Hector Sanjuan
69c47fe811 Monitor: remove safe parameter for metrics.Window
License: MIT
Signed-off-by: Hector Sanjuan <code@hector.link>
2018-05-09 11:01:52 +02:00
Hector Sanjuan
954ede931f Monitor: more refactoring. Rename util to metrics
License: MIT
Signed-off-by: Hector Sanjuan <code@hector.link>
2018-05-09 11:01:41 +02:00