Commit graph

42 commits

Author SHA1 Message Date
Luke Parker
593aefd229
Extend time in sync test 2024-04-18 02:51:38 -04:00
Luke Parker
fea16df567
Only reply to heartbeats after a certain distance 2024-04-18 01:39:34 -04:00
Luke Parker
4960c3222e
Ensure we don't reply to stale heartbeats 2024-04-18 01:24:38 -04:00
Luke Parker
6b4df4f2c0
Only have some nodes respond to latent heartbeats
Also only respond if they're more than 2 blocks behind to minimize redundant
sending of blocks.
2024-04-17 21:54:10 -04:00
Luke Parker
bc44fbdbac
Add TODO to coordinator P2P 2024-03-23 23:32:21 -04:00
Luke Parker
b7d49af1d5
Track total peer count in the coordinator 2024-03-23 18:02:48 -04:00
Luke Parker
4914420a37
Don't add as an explicit peer if already connected 2024-03-22 23:51:51 -04:00
Luke Parker
f11a08c436
Peer finding which won't get stuck on one specific network 2024-03-22 23:47:43 -04:00
Luke Parker
35b58a45bd
Split peer finding into a dedicated task 2024-03-22 23:40:15 -04:00
Luke Parker
b493e3e31f
Validator DHT (#494)
* Route validators for any active set through sc-authority-discovery

Additionally adds an RPC route to retrieve their P2P addresses.

* Have the coordinator get peers from substrate

* Have the RPC return one address, not up to 3

Prevents the coordinator from believing it has 3 peers when it has one.

* Add missing feature to serai-client

* Correct network argument in serai-client for p2p_validators call

* Add a test in serai-client to check DHT population with a much quicker failure than the coordinator tests

* Update to latest Substrate

Removes distinguishing BABE/AuthorityDiscovery keys which causes
sc_authority_discovery to populate as desired.

* Update to a properly tagged substrate commit

* Add all dialed to peers to GossipSub

* cargo fmt

* Reduce common code in serai-coordinator-tests with amore involved new_test

* Use a recursive async function to spawn `n` DockerTests with the necessary networking configuration

* Merge UNIQUE_ID and ONE_AT_A_TIME

* Tidy up the new recursive code in tests/coordinator

* Use a Mutex in CONTEXT to let it be set multiple times

* Make complimentary edits to full-stack tests

* Augment coordinator P2p connection logs

* Drop lock acquisitions before recursing

* Better scope lock acquisitions in full-stack, preventing a deadlock

* Ensure OUTER_OPS is reset across the test boundary

* Add cargo deny allowance for dockertest fork
2023-12-22 21:09:18 -05:00
Luke Parker
00774c29d7
Replace remaining direct uses of futures with futures_util
Slight downscope which helps combat the antipattern which is the futures glob
crate. While futures_util is still a large crate, it has better defaults and
is smaller by virtue of not pulling the executor.
2023-12-18 19:45:08 -05:00
Luke Parker
065d314e2a
Further expand clippy workspace lints
Achieves a notable amount of reduced async and clones.
2023-12-17 00:04:49 -05:00
Luke Parker
ea3af28139
Add workspace lints 2023-12-17 00:04:47 -05:00
Luke Parker
6a172825aa
Reattempts (#483)
* Schedule re-attempts and add a (not filled out) match statement to actually execute them

A comment explains the methodology. To copy it here:

"""
This is because we *always* re-attempt any protocol which had participation. That doesn't
mean we *should* re-attempt this protocol.

The alternatives were:
1) Note on-chain we completed a protocol, halting re-attempts upon 34%.
2) Vote on-chain to re-attempt a protocol.

This schema doesn't have any additional messages upon the success case (whereas
alternative #1 does) and doesn't have overhead (as alternative #2 does, sending votes and
then preprocesses. This only sends preprocesses).
"""

Any signing protocol which reaches sufficient participation will be
re-attempted until it no longer does.

* Have the Substrate scanner track DKG removals/completions for the Tributary code

* Don't keep trying to publish a participant removal if we've already set keys

* Pad out the re-attempt match a bit more

* Have CosignEvaluator reload from the DB

* Correctly schedule cosign re-attempts

* Actuall spawn new DKG removal attempts

* Use u32 for Batch ID in SubstrateSignableId, finish Batch re-attempt routing

The batch ID was an opaque [u8; 5] which also included the network, yet that's
redundant and unhelpful.

* Clarify a pair of TODOs in the coordinator

* Remove old TODO

* Final comment cleanup

* Correct usage of TARGET_BLOCK_TIME in reattempt scheduler

It's in ms and I assumed it was in s.

* Have coordinator tests drop BatchReattempts which aren't relevant yet may exist

* Bug fix and pointless oddity removal

We scheduled a re-attempt upon receiving 2/3rds of preprocesses and upon
receiving 2/3rds of shares, so any signing protocol could cause two re-attempts
(not one more).

The coordinator tests randomly generated the Batch ID since it was prior an
opaque byte array. While that didn't break the test, it was pointless and did
make the already-succeeded check before re-attempting impossible to hit.

* Add log statements, correct dead-lock in coordinator tests

* Increase pessimistic timeout on recv_message to compensate for tighter best-case timeouts

* Further bump timeout by a minute

AFAICT, GH failed by just a few seconds.

This also is worst-case in a single instance, making it fine to be decently long.

* Further further bump timeout due to lack of distinct error
2023-12-12 12:28:53 -05:00
Luke Parker
f0ff3a18d2
Use debug builds in our Dockerfiles to reduce CI times (#462)
* Use debug builds in our Dockerfiles to reduce CI times

Also enables only spawning the mdns service when debug in the coordinator.

* Correct underflow in processor

Prior undetected due to relase builds not having bounds checks enabled.

* Restore Serai release due to CI/RPC failures caused by compiling it in debug mode

This is *probably* worth an issue filed upstream, if it can be tracked down.

* Correct failing debug asserts in Monero

These debug asserts assumed there was a change address to take the remainder.
If there's no change address, the remainder is shunted to the fee, causing the
fee to be distinct from the estimate.

We presumably need to modify monero-serai such that change: None isn't valid,
and users must use Change::Fingerprintable(None).
2023-11-29 00:24:37 -05:00
Luke Parker
1822e31142
Grow the yamux buffers to exceed the maximum message size 2023-11-21 02:01:41 -05:00
Luke Parker
74a8df4c7b
Add a new primitive of a DB-backed channel
The coordinator already had one of these, albeit implemented much worse than
the one now properly introduced. It had to either be sending or receiving,
whereas the new one can do both at the same time.

This replaces said instance and enables pleasant patterns when implementing the
processor/coordinator.
2023-11-19 02:05:01 -05:00
Luke Parker
be48dcc4a4
Use multiple LibP2P topics
We still need a peering protocol... Hopefully, we can read peers off of the
Substrate node's DHT.
2023-11-18 20:37:55 -05:00
Luke Parker
ee50f584aa
Add saner log statements to the coordinator
Disables trace on every single P2P message.

Logs a short-form of each transaction.
2023-11-16 13:37:39 -05:00
Luke Parker
369af0fab5
\#339 addendum 2023-11-15 20:23:19 -05:00
Luke Parker
96f1d26f7a
Add a cosigning protocol to ensure finalizations are unique (#433)
* Add a function to deterministically decide which Serai blocks should be co-signed

Has a 5 minute latency between co-signs, also used as the maximal latency
before a co-sign is started.

* Get all active tributaries we're in at a specific block

* Add and route CosignSubstrateBlock, a new provided TX

* Split queued cosigns per network

* Rename BatchSignId to SubstrateSignId

* Add SubstrateSignableId, a meta-type for either Batch or Block, and modularize around it

* Handle the CosignSubstrateBlock provided TX

* Revert substrate_signer.rs to develop (and patch to still work)

Due to SubstrateSigner moving when the prior multisig closes, yet cosigning
occurring with the most recent key, a single SubstrateSigner can be reused.
We could manage multiple SubstrateSigners, yet considering the much lower
specifications for cosigning, I'd rather treat it distinctly.

* Route cosigning through the processor

* Add note to rename SubstrateSigner post-PR

I don't want to do so now in order to preserve the diff's clarity.

* Implement cosign evaluation into the coordinator

* Get tests to compile

* Bug fixes, mark blocks without cosigners available as cosigned

* Correct the ID Batch preprocesses are saved under, add log statements

* Create a dedicated function to handle cosigns

* Correct the flow around Batch verification/queueing

Verifying `Batch`s could stall when a `Batch` was signed before its
predecessors/before the block it's contained in was cosigned (the latter being
inevitable as we can't sign a block containing a signed batch before signing
the batch).

Now, Batch verification happens on a distinct async task in order to not block
the handling of processor messages. This task is the sole caller of verify in
order to ensure last_verified_batch isn't unexpectedly mutated.

When the processor message handler needs to access it, or needs to queue a
Batch, it associates the DB TXN with a lock preventing the other task from
doing so.

This lock, as currently implemented, is a poor and inefficient design. It
should be modified to the pattern used for cosign management. Additionally, a
new primitive of a DB-backed channel may be immensely valuable.

Fixes a standing potential deadlock and a deadlock introduced with the
cosigning protocol.

* Working full-stack tests

After the last commit, this only required extending a timeout.

* Replace "co-sign" with "cosign" to make finding text easier

* Update the coordinator tests to support cosigning

* Inline prior_batch calculation to prevent panic on rotation

Noticed when doing a final review of the branch.
2023-11-15 16:57:21 -05:00
Luke Parker
057c3b7cf1
libp2p 0.52.4
Adds several packages into tree, deprecates an API we use. This commit does
update accordingly.

While this may not be preferable, it is inevitable.
2023-10-19 00:27:21 -04:00
Luke Parker
e4adaa8947
Further tweaks re: retiry 2023-10-14 19:55:14 -04:00
Luke Parker
3b3fdd104b
Most of coordinator Tributary retiry
Adds Event::SetRetired to validator-sets.

Emit TributaryRetired.

Replaces is_active_set, which made multiple network requests, with
is_retired_tributary, a DB read.

Performs most of the removals necessary upon TributaryRetired.

Still needs to clean up the actual Tributary/Tendermint tasks.
2023-10-14 16:47:25 -04:00
Luke Parker
863a7842ca
Have every node respond to Heartbeat so they don't download the messages over the net 2023-10-14 15:27:40 -04:00
Luke Parker
f414735be5
Redo new_tributary from being over ActiveTributary to TributaryEvent
TributaryEvent also allows broadcasting a retiry event.
2023-10-14 15:27:39 -04:00
Luke Parker
80e5ca9328
Move heartbeat_tributaries and handle_p2p to p2p.rs 2023-10-13 22:40:11 -04:00
Luke Parker
77f7794452
Remove lazy_static for proper use of channels 2023-09-25 18:23:52 -04:00
Luke Parker
06a6cd29b0
Set nodelay on coordinator's P2P sockets 2023-09-06 22:57:33 -04:00
Luke Parker
3af9dc5d6f
Tweak Heartbeat configuration so LibP2P can be expected to deliver messages within latency window 2023-08-31 01:33:52 -04:00
Luke Parker
1e79de87e8
Remove contention between LibP2p spawned task and consumers via channels 2023-08-30 23:31:09 -04:00
Luke Parker
dc88b29b92
Add keep-alive timeout to coordinator
The Heartbeat was meant to serve for this, yet no Heartbeats are fired when we
don't have active tributaries.

libp2p does offer an explicit KeepAlive protocol, yet it's not recommended in
prod. While this likely has the same pit falls as LibP2p's KeepAlive protocol,
it's at least tailored to our timing.
2023-08-21 02:36:03 -04:00
Luke Parker
7e71450dc4
Bug fixes and log statements
Also shims next nonce code with a fine-for-now piece of code which is unviable
in production, yet should survive testnet.
2023-08-13 04:03:59 -04:00
Luke Parker
f6f945e747
Add a LibP2P instantiation to coordinator
It's largely unoptimized, and not yet exclusive to validators, yet has basic
sanity (using message content for ID instead of sender + index).

Fixes bugs as found. Notably, we used a time in milliseconds where the
Tributary expected  seconds.

Also has Tributary::new jump to the presumed round number. This reduces slashes
when starting new chains (whose times will be before the current time) and was
the only way I was able to observe successful confirmations given current
surrounding infrastructure.
2023-08-08 15:12:47 -04:00
Luke Parker
2feebe536e
Test handle_p2p and Tributary syncing
Includes bug fixes.
2023-04-24 03:30:19 -04:00
Luke Parker
14388e746c
Implement Tributary syncing
Also adds a forwards-lookup to the Tributary blockchain.
2023-04-24 00:53:18 -04:00
Luke Parker
c476f9b640
Break coordinator main into multiple functions
Also moves from std::sync::RwLock to tokio::sync::RwLock to prevent wasting
cycles on spinning.
2023-04-23 23:15:15 -04:00
Luke Parker
05b1fc5f05
Send a heartbeat message when a Tributary falls behind 2023-04-23 18:55:43 -04:00
Luke Parker
ad5522d854
Start handling P2P messages
This defines the tart of a very complex series of locks I'm really unhappy
with. At the same time, there's not immediately a better solution. This also
should work without issue.
2023-04-23 17:01:30 -04:00
Luke Parker
af84b7f707
Add a test for Tributary
Further fleshes out the Tributary testing code.
2023-04-22 22:28:20 -04:00
Luke Parker
8c74576cf0
Add a test to the coordinator for running a Tributary
Impls a LocalP2p for testing.

Moves rebroadcasting into Tendermint, since it's what knows if a message is
fully valid + original.

Removes TributarySpec::validators() HashMap, as its non-determinism caused
different instances to have different round robin schedules. It was already
prior moved to a Vec for this issue, so I'm unsure why this remnant existed.

Also renames the GH no-std workflow from the prior commit.
2023-04-22 10:49:52 -04:00
Luke Parker
79655672ef
Make progres on handling NewSet events
Further bones out the coordinator.
2023-04-16 00:51:56 -04:00