Commit graph

306 commits

Author SHA1 Message Date
Luke Parker
fd4f247917
Correct log which didn't work as intended 2024-04-20 19:54:16 -04:00
Luke Parker
ac9e356af4
Correct log targets in tendermint-machine 2024-04-20 19:15:15 -04:00
Luke Parker
bba7d2a356
Better logs in tendermint-machine 2024-04-20 18:13:44 -04:00
Luke Parker
4c349ae605
Redo how tendermint-machine checks if messages were prior sent
Instead of saving, for every sent message, if it was sent or not, we track the
latest block/round participated in. These two keys are comprehensive to all
prior block/rounds. We then use three keys for the latest round's
proposal/prevote/precommit, enabling tracking current state as necessary to
prevent equivocations with just 5 keys.

The storage of the latest three messages also enables proper rebroadcasting of
the current round (not implemented in this commit).
2024-04-20 18:10:51 -04:00
Luke Parker
593aefd229
Extend time in sync test 2024-04-18 02:51:38 -04:00
Luke Parker
5830c2463d
fmt 2024-04-18 02:03:28 -04:00
Luke Parker
bcc88c3e86
Don't broadcast added blocks
Online validators should inherently have them. Offline validators will receive
from the sync protocol.

This does somewhat eliminate the class of nodes who would follow the blockchain
(without validating it), yet that's fine for the performance benefit.
2024-04-18 01:48:11 -04:00
Luke Parker
fea16df567
Only reply to heartbeats after a certain distance 2024-04-18 01:39:34 -04:00
Luke Parker
4960c3222e
Ensure we don't reply to stale heartbeats 2024-04-18 01:24:38 -04:00
Luke Parker
6b4df4f2c0
Only have some nodes respond to latent heartbeats
Also only respond if they're more than 2 blocks behind to minimize redundant
sending of blocks.
2024-04-17 21:54:10 -04:00
Luke Parker
bc44fbdbac
Add TODO to coordinator P2P 2024-03-23 23:32:21 -04:00
Luke Parker
4cacce5e55
Perform key share amortization on-chain to avoid discrepancies 2024-03-23 23:32:14 -04:00
Luke Parker
b7d49af1d5
Track total peer count in the coordinator 2024-03-23 18:02:48 -04:00
Luke Parker
4914420a37
Don't add as an explicit peer if already connected 2024-03-22 23:51:51 -04:00
Luke Parker
f11a08c436
Peer finding which won't get stuck on one specific network 2024-03-22 23:47:43 -04:00
Luke Parker
35b58a45bd
Split peer finding into a dedicated task 2024-03-22 23:40:15 -04:00
Luke Parker
af9b1ad5f9
Initial pruning of backlogged consensus messages 2024-03-22 23:18:53 -04:00
Luke Parker
2f07d04d88
Extend timeout for rebroadcast of consensus messages in coordinator 2024-03-22 16:06:31 -04:00
Luke Parker
0889627e60
Typo fix for prior commit 2024-03-11 02:20:51 -04:00
Luke Parker
ace41c79fd
Tidy the BlockHasEvents cache 2024-03-11 01:44:00 -04:00
Luke Parker
f7d16b3fc5
Fix 0 - 1 which caused a panic 2024-03-09 05:37:41 -05:00
Luke Parker
6374d9987e
Correct how we save the block to scan from 2024-03-09 03:48:44 -05:00
Luke Parker
c93f6bf901
Replace yield_now with sleep 100 to prevent hammering a task, despite still being over-eager 2024-03-09 03:34:31 -05:00
Luke Parker
61a81e53e1
Further optimize cosign DB 2024-03-09 03:31:06 -05:00
Luke Parker
89b237af7e
Correct the return value of block_has_events 2024-03-09 02:44:04 -05:00
Luke Parker
2347bf5fd3
Bound cosign work and ensure it progress forward even when cosigns don't occur
Should resolve the DB load observed on testnet.
2024-03-09 02:20:23 -05:00
Luke Parker
454bebaa77
Have the TendermintMachine domain-separate by genesis
Enbables support for multiple machines over the same DB.
2024-03-08 01:22:02 -05:00
Luke Parker
e266bc2e32
Stop validators from equivocating on reboot
Part of https://github.com/serai-dex/serai/issues/345.

The lack of full DB persistence does mean enough nodes rebooting at the same
time may cause a halt. This will prevent slashes.
2024-03-07 22:56:35 -05:00
Luke Parker
f0694172ef
Fix potential generation of invalid SignData in shim 2024-02-09 02:52:08 -05:00
akildemir
347d4cf413
Fix tendermint distinct precommit bug (#517)
* fix tendermint distinct precommit bug

* remove conflicting precommit error
2024-02-08 13:47:37 -05:00
akildemir
ad0ecc5185
complete various todos in tributary (#520)
* complete various todos

* fix pr comments

* Document bounds on unique hashes in TransactionKind

---------

Co-authored-by: Luke Parker <lukeparker5132@gmail.com>
2024-02-05 03:50:55 -05:00
Luke Parker
4913873b10
Slash reports (#523)
* report_slashes plumbing in Substrate

Notably delays the SetRetired event until it provides a slash report or the set
after it becomes the set to report its slashes.

* Add dedicated AcceptedHandover event

* Add SlashReport TX to Tributary

* Create SlashReport TXs

* Handle SlashReport TXs

* Add logic to generate a SlashReport to the coordinator

* Route SlashReportSigner into the processor

* Finish routing the SlashReport signing/TX publication

* Add serai feature to processor's serai-client
2024-01-29 03:48:53 -05:00
Luke Parker
f3429ec1ef
Inside publish (for a Serai transaction from the coordinator), use RetiredDb over latest session
Not only is this more performant, the definition of retired won't be if a newer
session is active. It will be if the session has posted a slash report or the
stake for that session has unlocked.

Initial commit towards implementing SlashReports.
2024-01-05 23:40:15 -05:00
Luke Parker
7eb388e546
PR to track down CI failures (#501)
* Use an extended timeout for DKGs specifically

* Add a log statement when message-queue connection fails

* Add a 60 second keep-alive to connections

* Use zalloc for processor/message-queue/coordinator

An additional layer which protects us against edge cases with Zeroizing
(objects which don't support it or don't miss it).

* Add further logs to message-queue

* Further increase re-attempt timeouts in CI

* Remove misplaced continue inmessage-queue client

Fixes observed CI failures.

* Revert "Further increase re-attempt timeouts in CI"

This reverts commit 3723530cf6.
2024-01-04 01:08:13 -05:00
Luke Parker
02776c54a8
Increase reattempt delays in the GH CI, which is extremely latent 2023-12-30 22:11:04 -05:00
Luke Parker
ec8dfd4639
Correct SignData serialization test from creating 256 signers of data
This overflows the u8 allowed and caused a CI failure. The actual
code/assumption is fine.
2023-12-30 19:08:29 -05:00
Luke Parker
b493e3e31f
Validator DHT (#494)
* Route validators for any active set through sc-authority-discovery

Additionally adds an RPC route to retrieve their P2P addresses.

* Have the coordinator get peers from substrate

* Have the RPC return one address, not up to 3

Prevents the coordinator from believing it has 3 peers when it has one.

* Add missing feature to serai-client

* Correct network argument in serai-client for p2p_validators call

* Add a test in serai-client to check DHT population with a much quicker failure than the coordinator tests

* Update to latest Substrate

Removes distinguishing BABE/AuthorityDiscovery keys which causes
sc_authority_discovery to populate as desired.

* Update to a properly tagged substrate commit

* Add all dialed to peers to GossipSub

* cargo fmt

* Reduce common code in serai-coordinator-tests with amore involved new_test

* Use a recursive async function to spawn `n` DockerTests with the necessary networking configuration

* Merge UNIQUE_ID and ONE_AT_A_TIME

* Tidy up the new recursive code in tests/coordinator

* Use a Mutex in CONTEXT to let it be set multiple times

* Make complimentary edits to full-stack tests

* Augment coordinator P2p connection logs

* Drop lock acquisitions before recursing

* Better scope lock acquisitions in full-stack, preventing a deadlock

* Ensure OUTER_OPS is reset across the test boundary

* Add cargo deny allowance for dockertest fork
2023-12-22 21:09:18 -05:00
Luke Parker
00774c29d7
Replace remaining direct uses of futures with futures_util
Slight downscope which helps combat the antipattern which is the futures glob
crate. While futures_util is still a large crate, it has better defaults and
is smaller by virtue of not pulling the executor.
2023-12-18 19:45:08 -05:00
Luke Parker
a4c82632fb
Use pub(crate) for create_db items, not pub 2023-12-18 17:15:02 -05:00
Luke Parker
c8747e23c5 Remove offline participants from future DKG protocols so long as the threshold is met
Makes RemoveParticipantDueToDkg a voted-on event instead of a Provided.
This removes the requirement for offline parties to be able to fully validate
blame, yet unfortunately lets an dishonest supermajority have an honest node
label any arbitrary node as dishonest.

Corrects a variety of `.i(...)` calls which panicked when they shouldn't have.

Cleans up a couple no-longer-used storage values.
2023-12-18 17:14:51 -05:00
Luke Parker
c2fffb9887
Correct a couple years of accumulated typos 2023-12-17 02:06:51 -05:00
Luke Parker
065d314e2a
Further expand clippy workspace lints
Achieves a notable amount of reduced async and clones.
2023-12-17 00:04:49 -05:00
Luke Parker
ea3af28139
Add workspace lints 2023-12-17 00:04:47 -05:00
Luke Parker
2532423d42
Remove the RemoveParticipant protocol for having new DKGs specify the participants which were removed
Obvious code cleanup is obvious.
2023-12-14 23:51:57 -05:00
Luke Parker
b60e3c2524
Replace PSTTrait and PstTxType with PublishSeraiTransaction 2023-12-14 16:06:08 -05:00
Luke Parker
77edd00725
Handle the combination of DKG removals with re-attempts
With a DKG removal comes a reduction in the amount of participants which was
ignored by re-attempts.

Now, we determine n/i based on the parties removed, and deterministically
obtain the context of who was removd.
2023-12-13 14:03:07 -05:00
Luke Parker
6a172825aa
Reattempts (#483)
* Schedule re-attempts and add a (not filled out) match statement to actually execute them

A comment explains the methodology. To copy it here:

"""
This is because we *always* re-attempt any protocol which had participation. That doesn't
mean we *should* re-attempt this protocol.

The alternatives were:
1) Note on-chain we completed a protocol, halting re-attempts upon 34%.
2) Vote on-chain to re-attempt a protocol.

This schema doesn't have any additional messages upon the success case (whereas
alternative #1 does) and doesn't have overhead (as alternative #2 does, sending votes and
then preprocesses. This only sends preprocesses).
"""

Any signing protocol which reaches sufficient participation will be
re-attempted until it no longer does.

* Have the Substrate scanner track DKG removals/completions for the Tributary code

* Don't keep trying to publish a participant removal if we've already set keys

* Pad out the re-attempt match a bit more

* Have CosignEvaluator reload from the DB

* Correctly schedule cosign re-attempts

* Actuall spawn new DKG removal attempts

* Use u32 for Batch ID in SubstrateSignableId, finish Batch re-attempt routing

The batch ID was an opaque [u8; 5] which also included the network, yet that's
redundant and unhelpful.

* Clarify a pair of TODOs in the coordinator

* Remove old TODO

* Final comment cleanup

* Correct usage of TARGET_BLOCK_TIME in reattempt scheduler

It's in ms and I assumed it was in s.

* Have coordinator tests drop BatchReattempts which aren't relevant yet may exist

* Bug fix and pointless oddity removal

We scheduled a re-attempt upon receiving 2/3rds of preprocesses and upon
receiving 2/3rds of shares, so any signing protocol could cause two re-attempts
(not one more).

The coordinator tests randomly generated the Batch ID since it was prior an
opaque byte array. While that didn't break the test, it was pointless and did
make the already-succeeded check before re-attempting impossible to hit.

* Add log statements, correct dead-lock in coordinator tests

* Increase pessimistic timeout on recv_message to compensate for tighter best-case timeouts

* Further bump timeout by a minute

AFAICT, GH failed by just a few seconds.

This also is worst-case in a single instance, making it fine to be decently long.

* Further further bump timeout due to lack of distinct error
2023-12-12 12:28:53 -05:00
Luke Parker
11fdb6da1d
Coordinator Cleanup (#481)
* Move logic for evaluating if a cosign should occur to its own file

Cleans it up and makes it more robust.

* Have expected_next_batch return an error instead of retrying

While convenient to offer an error-free implementation, it potentially caused
very long lived lock acquisitions in handle_processor_message.

* Unify and clean DkgConfirmer and DkgRemoval

Does so via adding a new file for the common code, SigningProtocol.

Modifies from_cache to return the preprocess with the machine, as there's no
reason not to. Also removes an unused Result around the type.

Clarifies the security around deterministic nonces, removing them for
saved-to-disk cached preprocesses. The cached preprocesses are encrypted as the
DB is not a proper secret store.

Moves arguments always present in the protocol from function arguments into the
struct itself.

Removes the horribly ugly code in DkgRemoval, fixing multiple issues present
with it which would cause it to fail on use.

* Set SeraiBlockNumber in cosign.rs as it's used by the cosigning protocol

* Remove unnecessary Clone from lambdas in coordinator

* Remove the EventDb from Tributary scanner

We used per-Transaction DB TXNs so on error, we don't have to rescan the entire
block yet only the rest of it. We prevented scanning multiple transactions by
tracking which we already had.

This is over-engineered and not worth it.

* Implement borsh for HasEvents, removing the manual encoding

* Merge DkgConfirmer and DkgRemoval into signing_protocol.rs

Fixes a bug in DkgConfirmer which would cause it to improperly handle indexes
if any validator had multiple key shares.

* Strictly type DataSpecification's Label

* Correct threshold_i_map_to_keys_and_musig_i_map

It didn't include the participant's own index and accordingly was offset.

* Create TributaryBlockHandler

This struct contains all variables prior passed to handle_block and stops them
from being passed around again and again.

This also ensures fatal_slash is only called while handling a block, as needed
as it expects to operate under perfect consensus.

* Inline accumulate, store confirmation nonces with shares

Inlining accumulate makes sense due to the amount of data accumulate needed to
be passed.

Storing confirmation nonces with shares ensures that both are available or
neither. Prior, one could be yet the other may not have been (requiring an
assert in runtime to ensure we didn't bungle it somehow).

* Create helper functions for handling DkgRemoval/SubstrateSign/Sign Tributary TXs

* Move Label into SignData

All of our transactions which use SignData end up with the same common usage
pattern for Label, justifying this.

Removes 3 transactions, explicitly de-duplicating their handlers.

* Remove CurrentlyCompletingKeyPair for the non-contextual DkgKeyPair

* Remove the manual read/write for TributarySpec for borsh

This struct doesn't have any optimizations booned by the manual impl. Using
borsh reduces our scope.

* Use temporary variables to further minimize LoC in tributary handler

* Remove usage of tuples for non-trivial Tributary transactions

* Remove serde from dkg

serde could be used to deserialize intenrally inconsistent objects which could
lead to panics or faults.

The BorshDeserialize derives have been replaced with a manual implementation
which won't produce inconsistent objects.

* Abstract Future generics using new trait definitions in coordinator

* Move published_signed_transaction to tributary/mod.rs to reduce the size of main.rs

* Split coordinator/src/tributary/mod.rs into spec.rs and transaction.rs
2023-12-10 20:21:44 -05:00
Luke Parker
6caf45ea1d
Downscope usage of futures 2023-12-10 19:32:52 -05:00
Luke Parker
7122e0faf4
Cache the block's events within TemporalSerai
Event retrieval was prior:
- Retrieve all events in the block, which may be hundreds of KB
- Filter to just a few

Since it's frequent to want multiple sets of events, each filtered in their own
way, this caused the retrieval to happen multiple times. Now, it only will
happen once.

Also has the scoped clients take a reference, not an owned TemporalSerai.
2023-12-08 10:46:10 -05:00