# `putex`

Process Mutex

Manages the lock and timing components of an at-most-once execution system.

### NATS

This currently uses [NATS](https://nats.io) exclusively for locking, but it
could be extended to use other lock backends.

NATS is used via optimistic locking of a KV bucket entry. If an agent is able to
write the lock with its own value before the entry is changed, either by the
current owner renewing it or by another agent becoming the owner, that agent
becomes the owner. The agents follow a set of rules and a common execution plan
which ensure they will always elect a single leader and never accidentally
overlap execution (as long as the operator also follows the rules!).

## The Variables
###### `R` - Lock Renewal Interval
How often the active agent runs its health check and asserts that it is still
active.

###### `F` - Failure Threshold
How many `R` must pass without the lock being updated before another agent may
take the lock. `R*F` is also the maximum timeout of a health check, because
returning after this time will result in an expected failover anyway.

###### `C` - Confirmation Count
How many `R` must pass between a new agent becoming active and when it
is allowed to fence other nodes and activate its service.

## The Rules

### Operator (YOU)

* healthcheck
  * The healthcheck will be given an argument of `active` or `standby` indicating
    whether the lock is currently held by this agent. If it is `active`, a more
    thorough check should be performed, if possible. If it is `standby`, the
    healthcheck should indicate whether we believe that `healthcheck active` would
    pass after the confirmation cycles.
  * The health check should not take more than R in normal circumstances but may
    run over R in exceptional circumstances which do _not_ require a failover.
    For example, a switch power cycling somewhere causes a 2 second outage while
    you have R=1000 (milliseconds) and F=3, so you have allotted 3 seconds. The
    health check takes 2 seconds to return a result because of the packet loss.
    This is greater than R but less than R*F, so a warning will be produced but
    failover will not occur.
  * The health check MUST NOT take more than R*F to run or else the agent will
    have lost its lock to another agent. If the health check might take more
    than R*F in circumstances where failover would _not_ be appropriate, R, F,
    or both should be increased, with F increasing the overall failover window
    and R slowing the system down overall to allow the health check to keep up.
* activation
  * The activation script should first fence the other hosts, if possible and
    necessary for the use case. The activation script will have been called after
    at least C*R, so fencing should happen immediately, followed immediately by
    the service being started.
* deactivation
  * Deactivation must take place within C*R. If deactivation needs more time
    than is allotted, increase C or R. Note that R affects all aspects of the
    system and slows it down in general.

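The health check timing rules can be sketched as a small classifier. This is only an illustration using the standard library; the names `CheckTiming` and `classify_check` are hypothetical and not taken from the putex code, and treating a run time of exactly R*F as expired is an assumption.

```rust
use std::time::Duration;

/// Outcome of comparing a health check's run time against R and R*F.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq, Eq)]
pub enum CheckTiming {
    /// Finished within R: normal operation.
    OnTime,
    /// Took longer than R but less than R*F: warn, no failover.
    Slow,
    /// Took R*F or longer: the lock is considered lost.
    Expired,
}

/// Classify how long a health check took, given R and F.
pub fn classify_check(elapsed: Duration, r: Duration, f: u32) -> CheckTiming {
    let limit = r * f; // R*F, the overall failover window
    if elapsed <= r {
        CheckTiming::OnTime
    } else if elapsed < limit {
        CheckTiming::Slow
    } else {
        CheckTiming::Expired
    }
}
```

With R=1000 ms and F=3, a health check that takes 2 seconds falls into the `Slow` case, matching the switch outage example above.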
### Agent Operation Theory

Uses a specific key in a specific bucket in a NATS cluster to lock a
runner for a specific process.

    my-hostname:~$ natslock nats.cluster.local locks spof-service my-hostname

Each token should be unique; generally it is the hostname. The tokens are also
stored in the kv store and can serve to document the current state of the
system.

The implementation should ensure that time skew does not affect
synchronization.

* R = lock renewal interval
* F = lock renewal failures before taking the lock
* C = lock renewals to wait before starting after takeover
* T = lock timeout = R * F

Consider the example of F=1 to ensure the logic makes sense. If R = 1s and
F=1, then a renewal needs to happen within a 1s period from when we first check
to when we want to take the lock. With these parameters, after 1s
with no updates, we would expect to take the lock immediately, and so T=R*F.

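As a rough sketch, the derived durations can be computed from the three parameters like this. The `Timing` struct and its method names are hypothetical and chosen only for this example.

```rust
use std::time::Duration;

/// Timing parameters; the field names follow R, F, and C above.
/// (Hypothetical struct, not taken from the putex code.)
#[derive(Debug, Clone, Copy)]
pub struct Timing {
    /// R: lock renewal interval.
    pub r: Duration,
    /// F: renewal failures before another agent may take the lock.
    pub f: u32,
    /// C: renewals to wait before starting after takeover.
    pub c: u32,
}

impl Timing {
    /// T = R * F, the lock timeout.
    pub fn lock_timeout(&self) -> Duration {
        self.r * self.f
    }

    /// C * R, the wait between taking the lock and activating the service.
    pub fn confirmation_delay(&self) -> Duration {
        self.r * self.c
    }
}
```

For R = 1s and F = 1, `lock_timeout()` is 1s, which is the F=1 case worked through above.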
A client has the lock and should run the service when its token has been present
in the key for C full renewal interval(s), or immediately if this is the first
revision or if the current token is empty.

* When done after takeover, this gives the former system time to shut down, if
  necessary. C should generally be 1 and the shutdown/fencing mechanism should
  be instantaneous (iptables/kill -9).

A client obtains the lock by writing its token to the key.

A client may write to the key when there have been no writes for a period of T
or when the key does not exist.

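One way a standby agent might decide that the key has seen no writes for T, without comparing clocks across hosts, is to time revision changes against its own monotonic clock. The sketch below assumes that approach; `TakeoverWatch` and `observe` are hypothetical names, and the actual putex implementation may work differently.

```rust
use std::time::{Duration, Instant};

/// Tracks when this agent last saw the lock's revision change, using only
/// the local monotonic clock. (Hypothetical type, for illustration only.)
pub struct TakeoverWatch {
    last_revision: Option<u64>,
    last_change: Instant,
}

impl TakeoverWatch {
    pub fn new(now: Instant) -> Self {
        Self { last_revision: None, last_change: now }
    }

    /// Record one observation of the key. `revision` is `None` when the key
    /// does not exist. Returns true when this agent may attempt a write.
    pub fn observe(&mut self, revision: Option<u64>, now: Instant, t: Duration) -> bool {
        match revision {
            // The key does not exist: a write may be attempted right away.
            None => {
                self.last_revision = None;
                true
            }
            Some(rev) => {
                // A new revision means someone wrote: restart the T window.
                if self.last_revision != Some(rev) {
                    self.last_revision = Some(rev);
                    self.last_change = now;
                }
                now.duration_since(self.last_change) >= t
            }
        }
    }
}
```

Because only locally measured intervals are compared, a wall-clock difference between hosts does not change the result.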
A client may only write to the key if it does not exist or if it has not been
updated since the last read (in addition to the other rules of the protocol).

* Does not exist: use create, which will fail if the key has been created.
* Exists: use update with revision to ensure the revision that was read
  is also the one being replaced.

Both may be attempted in parallel since only one can possibly succeed.

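The create/update rules describe a compare-and-swap on the key's revision. The following in-memory stand-in illustrates the semantics only; it is not the NATS client API, and the `Bucket` and `KvError` names as well as the per-key revision counter are simplifying assumptions.

```rust
use std::collections::HashMap;

/// Reasons a write can be rejected. (Hypothetical type, for illustration.)
#[derive(Debug, PartialEq, Eq)]
pub enum KvError {
    AlreadyExists,
    WrongRevision,
    NotFound,
}

/// In-memory stand-in for a KV bucket where every entry carries a revision.
#[derive(Default)]
pub struct Bucket {
    entries: HashMap<String, (u64, String)>,
}

impl Bucket {
    /// Create the key; fails if it already exists. Returns the new revision.
    pub fn create(&mut self, key: &str, value: &str) -> Result<u64, KvError> {
        if self.entries.contains_key(key) {
            return Err(KvError::AlreadyExists);
        }
        self.entries.insert(key.to_string(), (1, value.to_string()));
        Ok(1)
    }

    /// Replace the value only if `expected` is still the current revision.
    pub fn update(&mut self, key: &str, value: &str, expected: u64) -> Result<u64, KvError> {
        let entry = match self.entries.get_mut(key) {
            Some(entry) => entry,
            None => return Err(KvError::NotFound),
        };
        if entry.0 != expected {
            // Someone else wrote since we read: the lock attempt is lost.
            return Err(KvError::WrongRevision);
        }
        entry.0 += 1;
        entry.1 = value.to_string();
        Ok(entry.0)
    }
}
```

If two agents read revision 1 and both call `update` with it, the first succeeds and moves the key to revision 2, and the second fails with `WrongRevision`. That is why only one writer can win.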
Any failure to write to the key must result in the underlying application
being killed or otherwise fenced immediately (within R from the last interval).

A client renews its lock after running its health check. The health
check should take less than R but MUST take less than T or else the client
will lose its lock. Since it _must_ lose its lock and _must_ recognize that
this has happened, the health check will always time out after T and the lock
will be given up immediately.

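A health check that is forced to time out after T might look roughly like the sketch below. It uses only the standard library (a worker thread and a channel) rather than tokio, and `check_with_timeout` is a hypothetical name. The worker thread is left to finish on its own after a timeout instead of being killed.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run `check` on a worker thread and wait at most `t` for its verdict.
/// Returns false if the check fails or if no result arrives within `t`.
pub fn check_with_timeout<F>(check: F, t: Duration) -> bool
where
    F: FnOnce() -> bool + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if we timed out; ignore that.
        let _ = tx.send(check());
    });
    // A timeout or a panicked worker both count as a failed health check.
    rx.recv_timeout(t).unwrap_or(false)
}
```

A caller would treat a `false` result as the signal to give up the lock immediately.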
#### Not init

This solution is intended to signal services to start or stop, not act as an
init system in its own right. Thus, you should be doing something like
`systemctl start database.service`, not `redis-server`. Fencing should be
handled via the systemd unit DAG as well.

#### Pump

Runs in an explicit, stateless trampoline executor (in addition to the tokio
async executor), with an action of "pump" in the code. Each run builds its
state at the start and may loop a finite number of times but _must_ return to
the outer loop occasionally to allow it to do housekeeping as well as to clear
the stack depth for when tail call optimization fails (reportedly unreliable
in Rust currently). Stated differently, the pump action is also a DAG with a
single entry and single exit.

A single pumping stroke happens from Start to End in the diagram.

Each stroke lasts _at least_ R. This is accomplished by awaiting a timer
running in parallel to pump at the end of the pump stroke (i.e., it is not
part of the pump function but is an important implementation detail of pump).

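The "at least R per stroke" rule can be sketched with a deadline fixed before the pump runs. This is a synchronous, standard-library approximation of the parallel timer described above, and `run_stroke` is a hypothetical name.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Run one stroke of `pump`, then wait out whatever is left of `r`.
/// Returns how long the whole stroke took.
pub fn run_stroke<P: FnOnce()>(pump: P, r: Duration) -> Duration {
    let started = Instant::now();
    // The deadline is fixed before pumping, like a timer started in parallel.
    let deadline = started + r;
    pump();
    // Sleep only for the remainder, so a slow pump is not delayed further.
    let remaining = deadline.saturating_duration_since(Instant::now());
    if !remaining.is_zero() {
        thread::sleep(remaining);
    }
    started.elapsed()
}
```

A pump that finishes early is padded out to R, while a pump that already took longer than R returns without any extra wait.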
### State Machine

```mermaid
flowchart TD
Start(((Start))) --> GetLock

GetLock[lk = Lock Value\nrev = Revision] --> Empty{"lk empty or mine?\n(or failed to retrieve\ntimeout = 1R)"}
Empty -- yes --> HCSunnyLoop
Empty -- no --> WaitKill
WaitKill[Kill Process] --> HCWait

HCWrite[Run Healthcheck] --> HCWriteSuccess{Success?}
HCWriteSuccess -- yes --> Write
HCWriteSuccess -- no --> WaitR
Write[Write Lock] --> WriteSuccess{Success?}
WriteSuccess -- yes --> WriteWaitCR
WriteSuccess -- no ---> WaitR
WriteWaitCR[["Wait R\n(C-1 times):\n->Write Lock\n->Wait R"]] --> WaitR
style WriteWaitCR text-align:left

HCWait[Run Healthcheck] --> HCWaitSuccess{Success?}
HCWaitSuccess -- yes --> WaitFR
HCWaitSuccess -- no --> WaitR
WaitR[Wait remainder of R, if any]

WaitFR[Wait F*R for takeover] --> HCWrite

HCSunnyLoop[Run Healthcheck] --> HCSunnyLoopSuccess{Success?}
HCSunnyLoopSuccess -- yes --> SunnyLoopWrite
SunnyLoopWrite[Renew lock\nMay attempt for\nup to F*R seconds] --> SunnyLoopWriteSuccess{Success?}
SunnyLoopWriteSuccess -- yes --> SunnyStartProcess
SunnyLoopWriteSuccess -- no --> SunnyLoopAbort
SunnyStartProcess[Unfence Self\nFence Others\nEnable Process] --> WaitR
HCSunnyLoopSuccess -- no --> SunnyLoopAbort
SunnyLoopAbort[Attempt to write blank to lock\nOne attempt only.] --> SunnyKill
SunnyKill[Kill Process] --> WaitR
WaitR --> End((End)) --> ToStart[Back to start]

```