mirror of https://github.com/Cuprate/cuprate.git synced 2024-09-28 20:51:03 +00:00

storage: split cuprate-blockchain <-> cuprate-database (#160)

* storage: port some code `cuprate-blockchain` -> `database`

* database: remove `Tables` references

* database: remove old `cuprate-blockchain` type references

* find/replace `cuprate_blockchain` -> `database`, add `create_db()`

* database: fix redb

* database: use readme for docs, link in `lib.rs`

* database: fix `open_db_ro`, `open_db_rw`, `create_db` behavior

* database: add open table tests

* database: fix tests, remove blockchain specific references

* database: remove `ReaderThreads`, make `db_directory` mandatory

* initial `cuprate-blockchain` split

* fix doc links

* rename, fix database config

* blockchain: create `crate::open()`, `OpenTables::create_tables()`

* more compat fixes

* fix imports

* fix conflicts

* align cargo.toml

* docs

* fixes

* add `unused_crate_dependencies` lint, fix

* blockchain: add open table tests

2024-06-26 22:51:06 +01:00


Database

FIXME: This documentation must be updated and moved to the architecture book.

Cuprate's blockchain implementation.

1. Documentation
2. File structure
3. Backends
4. Layers
5. The service
6. Syncing
7. Resizing
8. (De)serialization
9. Schema
- 9.1 Tables
- 9.2 Multimap tables
10. Known issues and tradeoffs

1. Documentation

Documentation for database/ is split into 3 locations:

Documentation location	Purpose
`database/README.md`	High level design of `cuprate-database`
`cuprate-database`	Practical usage documentation/warnings/notes/etc
Source file `// comments`	Implementation-specific details (e.g., how many reader threads to spawn?)

This README serves as the implementation design document.

For actual practical usage, cuprate-database's types and general usage are documented via standard Rust tooling.

Run:

cargo doc --package cuprate-database --open

at the root of the repo to open/read the documentation.

If this documentation is too abstract, refer to any of the source files; they are heavily commented. There are many // Regular comments that explain more implementation-specific details that aren't present here or in the docs. Use the file reference below to find what you're looking for.

The code within src/ is also littered with some grep-able comments containing some keywords:

Word	Meaning
`INVARIANT`	This code makes an assumption that must be upheld for correctness
`SAFETY`	This `unsafe` code is okay, for `x,y,z` reasons
`FIXME`	This code works but isn't ideal
`HACK`	This code is a brittle workaround
`PERF`	This code is weird for performance reasons
`TODO`	This must be implemented; There should be 0 of these in production code
`SOMEDAY`	This should be implemented... someday

2. File structure

A quick reference of the structure of the folders & files in cuprate-database.

Note that lib.rs/mod.rs files are purely for re-exporting/visibility/lints, and contain no code. Each sub-directory has a corresponding mod.rs.

2.1 `src/`

The top-level src/ files.

File	Purpose
`constants.rs`	General constants used throughout `cuprate-database`
`database.rs`	Abstracted database; `trait DatabaseR{o,w}`
`env.rs`	Abstracted database environment; `trait Env`
`error.rs`	Database error types
`free.rs`	General free functions (related to the database)
`key.rs`	Abstracted database keys; `trait Key`
`resize.rs`	Database resizing algorithms
`storable.rs`	Data (de)serialization; `trait Storable`
`table.rs`	Database table abstraction; `trait Table`
`tables.rs`	All the table definitions used by `cuprate-database`
`tests.rs`	Utilities for `cuprate_database` testing
`transaction.rs`	Database transaction abstraction; `trait TxR{o,w}`
`types.rs`	Database-specific types
`unsafe_unsendable.rs`	Marker type to impl `Send` for objects not `Send`

2.2 `src/backend/`

This folder contains the implementation for actual databases used as the backend for cuprate-database.

Each backend has its own folder.

Folder/File	Purpose
`heed/`	Backend using `heed` (LMDB)
`redb/`	Backend using `redb`
`tests.rs`	Backend-agnostic tests

All backends follow the same file structure:

File	Purpose
`database.rs`	Implementation of `trait DatabaseR{o,w}`
`env.rs`	Implementation of `trait Env`
`error.rs`	Implementation of backend's errors to `cuprate_database`'s error types
`storable.rs`	Compatibility layer between `cuprate_database::Storable` and backend-specific (de)serialization
`transaction.rs`	Implementation of `trait TxR{o,w}`
`types.rs`	Type aliases for long backend-specific types

2.3 `src/config/`

This folder contains the cuprate_database::config module; configuration options for the database.

File	Purpose
`config.rs`	Main database `Config` struct
`reader_threads.rs`	Reader thread configuration for `service` thread-pool
`sync_mode.rs`	Disk sync configuration for backends

2.4 `src/ops/`

This folder contains the cuprate_database::ops module.

These are higher-level functions abstracted over the database, that are Monero-related.

File	Purpose
`block.rs`	Block related (main functions)
`blockchain.rs`	Blockchain related (height, cumulative values, etc)
`key_image.rs`	Key image related
`macros.rs`	Macros specific to `ops/`
`output.rs`	Output related
`property.rs`	Database properties (pruned, version, etc)
`tx.rs`	Transaction related

2.5 `src/service/`

This folder contains the cuprate_database::service module.

The asynchronous request/response API other Cuprate crates use instead of managing the database directly themselves.

File	Purpose
`free.rs`	General free functions used (related to `cuprate_database::service`)
`read.rs`	Read thread-pool definitions and logic
`tests.rs`	Thread-pool tests and test helper functions
`types.rs`	`cuprate_database::service`-related type aliases
`write.rs`	Writer thread definitions and logic

3. Backends

cuprate-database's traits allow abstracting over the actual database, such that any backend in particular could be used.

Each database's implementation of those traits is located in its respective folder in src/backend/${DATABASE_NAME}/.

3.1 heed

The default database used is heed (LMDB). The upstream versions from crates.io are used. LMDB should not need to be installed as heed has a build script that pulls it in automatically.

heed's filenames inside Cuprate's database folder (~/.local/share/cuprate/database/) are:

Filename	Purpose
`data.mdb`	Main data file
`lock.mdb`	Database lock file

heed-specific notes:

There is a maximum reader limit. Other potential processes (e.g. xmrblocks) that are also reading the data.mdb file need to be accounted for.
LMDB does not work on remote filesystems.

3.2 redb

The 2nd database backend is the 100% Rust redb.

The upstream versions from crates.io are used.

redb's filenames inside Cuprate's database folder (~/.local/share/cuprate/database/) are:

Filename	Purpose
`data.redb`	Main data file

3.3 redb-memory

This backend is identical to redb, except that it uses redb::backend::InMemoryBackend, a database that resides completely in memory instead of in a file.

All other details about this should be the same as the normal redb backend.

3.4 sanakirja

sanakirja was a candidate as a backend, however there were problems with maximum value sizes.

The default maximum value size is 1012 bytes which was too small for our requirements. Using sanakirja::Slice and sanakirja::UnsizedStorage was attempted, but there were bugs found when inserting a value in-between 512..=4096 bytes.

As such, it is not implemented.

3.5 MDBX

MDBX was a candidate as a backend; however, MDBX deprecated the custom key/value comparison functions, which makes it a bit trickier to implement 9.2 Multimap tables. It is also quite similar to the main backend LMDB (of which it was originally a fork).

As such, it is not implemented (yet).

4. Layers

cuprate_database is logically abstracted into 5 layers, with each layer being built upon the last.

Starting from the lowest:

Backend
Trait
ConcreteEnv
ops
service

4.1 Backend

This is the actual database backend implementation (or a Rust shim over one).

Examples:

heed (LMDB)
redb

cuprate_database itself just uses a backend; it does not implement one.

All backends have the following attributes:

Embedded
Multiversion concurrency control
ACID
Are (key, value) oriented and have the expected API (get(), insert(), delete())
Are table oriented ("table_name" -> (key, value))
Allows concurrent readers

4.2 Trait

cuprate_database provides a set of traits that abstract over the various database backends.

This allows the function signatures and behavior to stay the same but allows for swapping out databases in an easier fashion.

All common behavior of the backends is encapsulated here and used instead of using the backend directly.

For example, instead of calling LMDB or redb's get() function directly, DatabaseRo::get() is called.
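The idea can be sketched with two toy in-memory backends behind one trait. This is only an illustration: `KvRead`, `KvWrite`, `MapBackend`, and `VecBackend` are hypothetical names and are not the real `DatabaseRo`/`DatabaseRw` traits or their signatures.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for a read-only database trait.
trait KvRead {
    fn get(&self, key: u64) -> Option<Vec<u8>>;
}

/// Hypothetical, simplified stand-in for a read/write database trait.
trait KvWrite: KvRead {
    fn put(&mut self, key: u64, value: Vec<u8>);
    fn delete(&mut self, key: u64) -> bool;
}

/// Toy backend A: a sorted map.
#[derive(Default)]
struct MapBackend {
    inner: BTreeMap<u64, Vec<u8>>,
}

impl KvRead for MapBackend {
    fn get(&self, key: u64) -> Option<Vec<u8>> {
        self.inner.get(&key).cloned()
    }
}

impl KvWrite for MapBackend {
    fn put(&mut self, key: u64, value: Vec<u8>) {
        self.inner.insert(key, value);
    }
    fn delete(&mut self, key: u64) -> bool {
        self.inner.remove(&key).is_some()
    }
}

/// Toy backend B: an unsorted list of pairs.
#[derive(Default)]
struct VecBackend {
    inner: Vec<(u64, Vec<u8>)>,
}

impl KvRead for VecBackend {
    fn get(&self, key: u64) -> Option<Vec<u8>> {
        self.inner.iter().find(|(k, _)| *k == key).map(|(_, v)| v.clone())
    }
}

impl KvWrite for VecBackend {
    fn put(&mut self, key: u64, value: Vec<u8>) {
        // Overwrite an existing entry, otherwise append a new one.
        match self.inner.iter_mut().find(|(k, _)| *k == key) {
            Some(entry) => entry.1 = value,
            None => self.inner.push((key, value)),
        }
    }
    fn delete(&mut self, key: u64) -> bool {
        match self.inner.iter().position(|(k, _)| *k == key) {
            Some(index) => {
                self.inner.remove(index);
                true
            }
            None => false,
        }
    }
}

/// Backend-agnostic code: it only knows about the trait.
fn value_len<D: KvRead>(db: &D, key: u64) -> Option<usize> {
    db.get(key).map(|value| value.len())
}
```

`value_len()` compiles against either backend unchanged, which is the property the trait layer is after.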

4.3 ConcreteEnv

This is the non-generic, concrete struct provided by cuprate_database that contains all the data necessary to operate the database. The actual database backend ConcreteEnv will use internally depends on which backend feature is used.

ConcreteEnv implements trait Env, which opens the door to all the other traits.

The equivalent objects in the backends themselves are:

heed::Env
redb::Database

This is the main object used when handling the database directly, although that is not strictly necessary as a user if the 4.5 service layer is used.

4.4 ops

These are Monero-specific functions that use the abstracted trait forms of the database.

Instead of dealing with the database directly:

get()
delete()

the ops layer provides more abstract functions that deal with commonly used Monero operations:

add_block()
pop_block()
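As a rough sketch of what such a layer looks like, the following builds `add_block()` and `pop_block()` on top of two plain maps. The `Tables` struct, the function signatures, and the use of `BTreeMap` are assumptions made for illustration; the real ops functions take different arguments and touch many more tables.

```rust
use std::collections::BTreeMap;

type BlockHeight = u64;
type BlockHash = [u8; 32];

/// Hypothetical stand-in for two of the tables described in the schema.
#[derive(Default)]
struct Tables {
    block_blobs: BTreeMap<BlockHeight, Vec<u8>>,
    block_heights: BTreeMap<BlockHash, BlockHeight>,
}

/// The chain height is the number of stored blocks.
fn chain_height(tables: &Tables) -> BlockHeight {
    tables.block_blobs.len() as BlockHeight
}

/// Append a block: one logical operation, several table writes.
fn add_block(tables: &mut Tables, hash: BlockHash, blob: Vec<u8>) -> BlockHeight {
    let height = chain_height(tables);
    tables.block_blobs.insert(height, blob);
    tables.block_heights.insert(hash, height);
    height
}

/// Remove the top block and every entry that refers to it.
fn pop_block(tables: &mut Tables) -> Option<(BlockHeight, Vec<u8>)> {
    let (height, blob) = tables.block_blobs.pop_last()?;
    tables.block_heights.retain(|_, stored| *stored != height);
    Some((height, blob))
}
```

The caller never has to remember which tables belong together; that bookkeeping lives in one place.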

4.5 service

The final layer abstracts the database completely into a Monero-specific async request/response API using tower::Service.

For more information on this layer, see the next section: 5. The service.

5. The service

The main API cuprate_database exposes for other crates to use is the cuprate_database::service module.

This module exposes an async request/response API with tower::Service, backed by a threadpool, that allows reading/writing Monero-related data from/to the database.

cuprate_database::service itself manages the database using a separate writer thread & reader thread-pool, and uses the previously mentioned 4.4 ops functions when responding to requests.

5.1 Initialization

The service is started simply by calling: cuprate_database::service::init().

This function initializes the database, spawns threads, and returns a:

Read handle to the database (cloneable)
Write handle to the database (not cloneable)

These "handles" implement the tower::Service trait, which allows sending requests and receiving responses asynchronously.

5.2 Requests

Along with the 2 handles, there are 2 types of requests:

ReadRequest is for retrieving various types of information from the database.

WriteRequest currently only has 1 variant: to write a block to the database.

5.3 Responses

After sending one of the above requests using the read/write handle, the value returned is not the response itself, but an asynchronous channel that will eventually return the response:

// Send a request.
//                                   tower::Service::call()
//                                          V
let response_channel: Channel = read_handle.call(ReadRequest::ChainHeight);

// Await the response.
let response: Response = response_channel.await?;

// Assert the response is what we expected.
assert!(matches!(response, Response::ChainHeight(_)));

After awaiting the returned channel, a Response will eventually be returned when the service threadpool has fetched the value from the database and sent it off.

Both read and write request variants match in name with Response variants, i.e.:

ReadRequest::ChainHeight leads to Response::ChainHeight
WriteRequest::WriteBlock leads to Response::WriteBlockOk
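The request/response flow can be imitated with blocking `std::sync::mpsc` channels in place of `tower::Service` and async. This is a minimal sketch under that assumption: the `ReadHandle` struct, the `Envelope` alias, and the `init(chain_height)` signature are hypothetical, and only the `ReadRequest::ChainHeight` / `Response::ChainHeight` names come from the text above.

```rust
use std::sync::mpsc;
use std::thread;

#[derive(Debug)]
enum ReadRequest {
    ChainHeight,
}

#[derive(Debug, PartialEq)]
enum Response {
    ChainHeight(u64),
}

/// A request bundled with the one-shot channel used to reply to it.
type Envelope = (ReadRequest, mpsc::Sender<Response>);

/// Hypothetical cloneable read handle.
#[derive(Clone)]
struct ReadHandle {
    tx: mpsc::Sender<Envelope>,
}

impl ReadHandle {
    /// Sends the request and returns the channel that will
    /// eventually yield the response.
    fn call(&self, request: ReadRequest) -> mpsc::Receiver<Response> {
        let (reply_tx, reply_rx) = mpsc::channel();
        self.tx
            .send((request, reply_tx))
            .expect("service thread has exited");
        reply_rx
    }
}

/// Spawns a single service thread that answers requests.
/// `chain_height` stands in for a value read from the database.
fn init(chain_height: u64) -> ReadHandle {
    let (tx, rx) = mpsc::channel::<Envelope>();
    thread::spawn(move || {
        for (request, reply) in rx {
            let response = match request {
                ReadRequest::ChainHeight => Response::ChainHeight(chain_height),
            };
            // The caller may have given up waiting; ignore that case.
            let _ = reply.send(response);
        }
    });
    ReadHandle { tx }
}
```

Calling `init(42).call(ReadRequest::ChainHeight).recv()` blocks until the service thread replies, which plays the role of `.await` in the real API.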

5.4 Thread model

As mentioned in the 4. Layers section, the base database abstractions themselves are not concerned with parallelism; they are mostly functions to be called from a single thread.

However, the cuprate_database::service API does have a thread model backing it.

When cuprate_database::service's initialization function is called, threads will be spawned and maintained until the user drops (disconnects) the returned handles.

The current behavior for thread count is:

1 writer thread
As many reader threads as there are system threads

For example, on a system with 32 threads, cuprate_database will spawn:

1 writer thread
32 reader threads

whose sole responsibility is to listen for database requests, access the database (potentially in parallel), and return a response.

Note that the 1 system thread = 1 reader thread model is only the default setting, the reader thread count can be configured by the user to be any number between 1 .. amount_of_system_threads.

The reader threads are managed by rayon.

For an example of where multiple reader threads are used: given a request that asks if any key-image within a set already exists, cuprate_database will split that work between the threads with rayon.
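A minimal sketch of that kind of work splitting is shown below, using `std::thread::scope` rather than rayon so that it stays self-contained. The function name, the `HashSet` standing in for the key image table, and the chunking strategy are all assumptions, not the actual implementation.

```rust
use std::collections::HashSet;
use std::thread;

type KeyImage = [u8; 32];

/// Returns `true` if any candidate is already in `stored`,
/// splitting the candidates across up to `reader_threads` threads.
fn any_key_image_exists(
    stored: &HashSet<KeyImage>,
    candidates: &[KeyImage],
    reader_threads: usize,
) -> bool {
    if candidates.is_empty() {
        return false;
    }
    let threads = reader_threads.max(1);
    // Ceiling division so every candidate lands in some chunk.
    let chunk_size = (candidates.len() + threads - 1) / threads;

    thread::scope(|scope| {
        // One scoped thread per chunk; each only reads shared data.
        let workers: Vec<_> = candidates
            .chunks(chunk_size)
            .map(|chunk| scope.spawn(move || chunk.iter().any(|ki| stored.contains(ki))))
            .collect();
        // Join every worker and combine the partial answers.
        workers
            .into_iter()
            .map(|worker| worker.join().expect("reader thread panicked"))
            .fold(false, |found, hit| found | hit)
    })
}
```

Because the readers only take shared references, no locking is needed, which mirrors the concurrent-reader property listed for the backends.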

5.5 Shutdown

Once the read/write handles are dropped, the backing thread(pool) will gracefully exit, automatically.

Note that the writer thread and reader thread-pool aren't connected whatsoever; dropping the write handle will make the writer thread exit, but the read handle can still be held onto and read from, and vice versa for the write handle.
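One common way to get this drop-triggered shutdown is to let the worker loop on a channel whose sending half is the handle. The sketch below assumes that design; `spawn_writer()` is a hypothetical helper and the `u64` payload stands in for a real write request.

```rust
use std::sync::mpsc;
use std::thread;

/// Spawns a writer thread and returns its handle (the sender)
/// together with a join handle that yields the number of writes.
fn spawn_writer() -> (mpsc::Sender<u64>, thread::JoinHandle<u64>) {
    let (tx, rx) = mpsc::channel::<u64>();
    let worker = thread::spawn(move || {
        let mut written = 0;
        // `recv()` returns `Err` once every sender has been dropped,
        // which ends the loop and lets the thread exit on its own.
        while let Ok(_block) = rx.recv() {
            written += 1;
        }
        written
    });
    (tx, worker)
}
```

No explicit shutdown message is needed: dropping the sender is the signal.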

6. Syncing

cuprate_database's database has 5 disk syncing modes.

FastThenSafe
Safe
Async
Threshold
Fast

The default mode is Safe.

This means that upon each transaction commit, all the data that was written will be fully synced to disk. This is the slowest, but safest mode of operation.
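In code, a default like this is typically expressed on the enum itself. The sketch below mirrors the five mode names listed above; it is an assumption-laden illustration in which any payloads the real variants may carry are omitted, and the `Config` struct is hypothetical.

```rust
/// Hypothetical mirror of the five sync modes.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum SyncMode {
    FastThenSafe,
    /// Fully sync to disk on every transaction commit.
    #[default]
    Safe,
    Async,
    Threshold,
    Fast,
}

/// Hypothetical config fragment holding the selected mode.
#[derive(Debug, Default)]
struct Config {
    sync_mode: SyncMode,
}
```

With this shape, `Config::default()` picks `SyncMode::Safe` unless the user opts into a faster mode.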

Note that upon any database Drop, whether via service or dropping the database directly, the current implementation will sync to disk regardless of any configuration.

For more information on the other modes, see the sync mode documentation in the config module (src/config/sync_mode.rs).

7. Resizing

Database backends that require manually resizing will, by default, use a similar algorithm as monerod's.

Note that this only relates to the service module, where the database is handled by cuprate_database itself, not the user. In the case of a user directly using cuprate_database, it is up to them on how to resize.

Within service, the resizing logic does the following:

If there's not enough space to fit a write request's data, start a resize
Each resize adds around 1_073_745_920 bytes to the current map size
A resize will be attempted 3 times before failing
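The three rules above can be sketched as a small pure function. This is a simplification under stated assumptions: the function name and its `(map_size, used, needed)` parameters are hypothetical, and details such as page-size rounding are ignored.

```rust
/// Bytes added to the map size per resize (figure taken from the text above).
const RESIZE_STEP: u64 = 1_073_745_920;
/// Number of resizes attempted before giving up.
const MAX_RESIZE_ATTEMPTS: u32 = 3;

/// Returns the map size after growing it enough to fit `needed`
/// more bytes, or `None` if three resizes were not enough.
fn grow_until_fits(map_size: u64, used: u64, needed: u64) -> Option<u64> {
    let mut size = map_size;
    let mut resizes = 0;
    while used + needed > size {
        if resizes == MAX_RESIZE_ATTEMPTS {
            return None;
        }
        size += RESIZE_STEP;
        resizes += 1;
    }
    Some(size)
}
```

For example, a write that already fits leaves the size untouched, while a write larger than three steps on an empty map fails.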

There are other resizing algorithms that define how the database's memory map grows, although currently the behavior of monerod is closely followed.

8. (De)serialization

All types stored inside the database are either bytes already, or are perfectly bitcast-able.

As such, they do not incur heavy (de)serialization costs when storing/fetching them from the database. The main (de)serialization used is bytemuck's traits and casting functions.

The size & layout of types is stable across compiler versions, as they are set and determined with #[repr(C)] and bytemuck's derive macros such as bytemuck::Pod.

Note that the data stored in the tables are still type-safe; we still refer to the key and values within our tables by the type.

The main deserialization trait for database storage is: cuprate_database::Storable.

Before storage, the type is simply cast into bytes
When fetching, the bytes are simply cast into the type

When a type is cast into bytes, the reference is cast, i.e. this is zero-cost serialization.

However, it is worth noting that when bytes are cast into the type, they are copied. This is due to byte alignment guarantee issues with both backends.

Without this, bytemuck will panic with TargetAlignmentGreaterAndInputNotAligned when casting.

Copying the bytes fixes this problem, although it is more costly than necessary. However, in the main use-case for cuprate_database (the service module) the bytes would need to be owned regardless as the Request/Response API uses owned data types (T, Vec<T>, HashMap<K, V>, etc).

Practically speaking, this means lower-level database functions that normally look like such:

fn get(key: &Key) -> &Value;

end up looking like this in cuprate_database:

fn get(key: &Key) -> Value;
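The owned-return pattern can be illustrated with the standard library alone. Note the swap: this sketch uses `to_ne_bytes`/`from_ne_bytes` instead of bytemuck, so both directions copy here, and `to_bytes()`/`from_bytes()` are hypothetical helpers rather than the `Storable` API. What it does show is that reading from a misaligned buffer is safe once the bytes are copied into an owned value.

```rust
use std::convert::TryInto;

#[derive(Debug, Clone, Copy, PartialEq)]
struct PreRctOutputId {
    amount: u64,
    amount_index: u64,
}

/// Native-endian byte form of the key (field by field, no padding).
fn to_bytes(id: &PreRctOutputId) -> [u8; 16] {
    let mut out = [0u8; 16];
    out[..8].copy_from_slice(&id.amount.to_ne_bytes());
    out[8..].copy_from_slice(&id.amount_index.to_ne_bytes());
    out
}

/// Copies the bytes into an owned value; works for any alignment
/// of `bytes` because nothing is reinterpreted in place.
fn from_bytes(bytes: &[u8]) -> Option<PreRctOutputId> {
    if bytes.len() != 16 {
        return None;
    }
    let amount = u64::from_ne_bytes(bytes[..8].try_into().ok()?);
    let amount_index = u64::from_ne_bytes(bytes[8..].try_into().ok()?);
    Some(PreRctOutputId { amount, amount_index })
}
```

Slicing a buffer at an odd offset (e.g. `&buf[1..]`) and passing it to `from_bytes()` still round-trips, which is the situation a reference cast could not handle.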

Since each backend has its own (de)serialization methods, our types are wrapped in compatibility types that map our Storable functions into whatever is required for the backend.

Compatibility structs also exist for Storable containers, such as StorableVec.

Again, it's unfortunate that these must be owned, although in service's use-case, they would have to be owned anyway.

9. Schema

The following section contains Cuprate's database schema. It may change throughout the development of Cuprate; as such, nothing here is final.

9.1 Tables

The CamelCase names of the table headers documented here (e.g. TxIds) are the actual type name of the table within cuprate_database.

Note that names written in code formatting are real types defined and usable within cuprate_database. Standard types like u64 and type aliases (e.g. TxId) are written normally.

Within cuprate_database::tables, the below table is essentially defined as-is with a macro.

Many of the stored data types are identical in representation but differ semantically; as such, a map of the aliases used and their real data types is also provided below.

Alias	Real Type
BlockHeight, Amount, AmountIndex, TxId, UnlockTime	u64
BlockHash, KeyImage, TxHash, PrunableHash	[u8; 32]

Table	Key	Value	Description
`BlockBlobs`	BlockHeight	`StorableVec<u8>`	Maps a block's height to a serialized byte form of a block
`BlockHeights`	BlockHash	BlockHeight	Maps a block's hash to its height
`BlockInfos`	BlockHeight	`BlockInfo`	Contains metadata of all blocks
`KeyImages`	KeyImage	()	This table is a set with no value, it stores transaction key images
`NumOutputs`	Amount	u64	Maps an output's amount to the number of outputs with that amount
`Outputs`	`PreRctOutputId`	`Output`	This table contains legacy CryptoNote outputs which have clear amounts. This table will not contain an output with 0 amount.
`PrunedTxBlobs`	TxId	`StorableVec<u8>`	Contains pruned transaction blobs (even if the database is not pruned)
`PrunableTxBlobs`	TxId	`StorableVec<u8>`	Contains the prunable part of a transaction
`PrunableHashes`	TxId	PrunableHash	Contains the hash of the prunable part of a transaction
`RctOutputs`	AmountIndex	`RctOutput`	Contains RingCT outputs mapped from their global RCT index
`TxBlobs`	TxId	`StorableVec<u8>`	Serialized transaction blobs (bytes)
`TxIds`	TxHash	TxId	Maps a transaction's hash to its index/ID
`TxHeights`	TxId	BlockHeight	Maps a transaction's ID to the height of the block it comes from
`TxOutputs`	TxId	`StorableVec<u64>`	Gives the amount indices of a transaction's outputs
`TxUnlockTime`	TxId	UnlockTime	Stores the unlock time of a transaction (only if it has a non-zero lock time)

The definitions for aliases and types (e.g. RctOutput) are within the cuprate_database::types module.

9.2 Multimap tables

When referencing outputs, Monero will use the amount and the amount index. This means 2 keys are needed to reach an output.

With LMDB you can set the DUP_SORT flag on a table and then set the key/value to:

Key = KEY_PART_1

Value = {
    KEY_PART_2,
    VALUE // The actual value we are storing.
}

Then you can set a custom value sorting function that only takes KEY_PART_2 into account; this is how monerod does it.

This requires that the underlying database supports:

multimap tables
custom sort functions on values
setting a cursor on a specific key/value

Another way to implement this is as follows:

Key = { KEY_PART_1, KEY_PART_2 }

Value = VALUE

Then the key type is simply used to look up the value; this is how cuprate_database does it.

For example, the key/value pair for outputs is:

PreRctOutputId => Output

where PreRctOutputId looks like this:

struct PreRctOutputId {
    amount: u64,
    amount_index: u64,
}
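The composite-key approach can be tried out with a sorted in-memory map. In this sketch `BTreeMap` stands in for a database table, `Output` is reduced to a placeholder `u64`, and `outputs_with_amount()` is a hypothetical helper; the derived ordering (amount first, then amount index) is what makes the range query work.

```rust
use std::collections::BTreeMap;

/// Composite key; the derived `Ord` compares `amount` first,
/// then `amount_index`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct PreRctOutputId {
    amount: u64,
    amount_index: u64,
}

/// Placeholder for the real `Output` type.
type Output = u64;

/// All outputs sharing one amount, in amount-index order.
fn outputs_with_amount(table: &BTreeMap<PreRctOutputId, Output>, amount: u64) -> Vec<Output> {
    let first = PreRctOutputId { amount, amount_index: 0 };
    let last = PreRctOutputId { amount, amount_index: u64::MAX };
    table.range(first..=last).map(|(_, output)| *output).collect()
}
```

A single lookup uses the full key, while iterating one amount is a range scan over adjacent keys. The cost is that the amount is stored again in every key.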

10. Known issues and tradeoffs

cuprate_database takes many tradeoffs, whether due to:

Prioritizing certain values over others
Not having a better solution
Being "good enough"

This is a list of the larger ones, along with issues that don't have answers yet.

10.1 Traits abstracting backends

Although all database backends used are very similar, they have some crucial differences in small implementation details that must be worked around when conforming them to cuprate_database's traits.

Put simply: using cuprate_database's traits is less efficient and more awkward than using the backend directly.

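For example, types must be wrapped in compatibility layers (see 8. (De)serialization), and values are returned as owned copies instead of references (see 10.3 Copying unaligned bytes).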

This is a tradeoff that cuprate_database takes, as:

The backend itself is usually not the source of bottlenecks in the greater system, as such, small inefficiencies are OK
None of the lost functionality is crucial for operation
The ability to use, test, and swap between multiple database backends is worth it

10.2 Hot-swappable backends

Using a different backend is really as simple as re-building cuprate_database with a different feature flag:

# Use LMDB.
cargo build --package cuprate-database --features heed

# Use redb.
cargo build --package cuprate-database --features redb

This is "good enough" for now; ideally, however, backends could be swapped at runtime.

As it is now, cuprate_database cannot compile both backends and swap based on user input at runtime; it must be compiled with a certain backend, which will produce a binary with only that backend.

This also means things like CI testing multiple backends is awkward, as we must re-compile with different feature flags instead.
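The compile-time nature of the choice can be seen in a tiny sketch. It assumes a Cargo feature named `redb` as in the commands above; `backend_name()` is a hypothetical function, and treating "no `redb` feature" as heed follows the default described in 3.1.

```rust
/// Resolved when the crate is compiled, not when it runs:
/// `cfg!` expands to a constant `true` or `false`.
fn backend_name() -> &'static str {
    if cfg!(feature = "redb") {
        "redb"
    } else {
        "heed"
    }
}
```

Since the branch is decided by the compiler, a given binary can only ever report one backend, and testing the other requires a second build.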

10.3 Copying unaligned bytes

As mentioned in 8. (De)serialization, bytes are copied when they are turned into a type T due to unaligned bytes being returned from database backends.

Using a regular reference cast results in an improperly aligned type T; such a type even existing causes undefined behavior. In our case, bytemuck saves us by panicking before this occurs.

Thus, when using cuprate_database's database traits, an owned T is returned.

This is doubly unfortunate for &[u8] as this does not even need deserialization.

For example, StorableVec could have been this:

enum StorableBytes<'a, T: Storable> {
    Owned(T),
    Ref(&'a T),
}

but this would require supporting types that must be copied regardless, alongside the occasional &[u8] that can be returned without casting. This was hard to do in a generic way, thus all [u8]'s are copied and returned as owned StorableVecs.

This is a tradeoff cuprate_database takes as:

bytemuck::pod_read_unaligned is cheap enough
The main API, service, needs to return owned values anyway
Having no references removes a lot of lifetime complexity

The alternative is either:

Using proper (de)serialization instead of casting (which comes with its own costs)
Somehow fixing the alignment issues in the backends mentioned previously

10.4 Endianness

cuprate_database's (de)serialization and storage of bytes are native-endian, as in, byte storage order will depend on the machine it is running on.

As Cuprate's build targets are all little-endian (machines that are big-endian by default barely exist), this doesn't matter much and the byte ordering can be seen as a constant.

Practically, this means cuprated's database files can be transferred across computers, as can monerod's.
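What "native-endian" means for a stored integer can be checked directly with the standard library. The helper name `stored_bytes()` is hypothetical and only illustrates the byte order of a `u64` such as a block height.

```rust
/// Byte form of a value as it would be laid out in memory
/// on the machine running the code.
fn stored_bytes(height: u64) -> [u8; 8] {
    height.to_ne_bytes()
}

/// Whether the native layout equals the little-endian layout.
fn native_is_little_endian() -> bool {
    stored_bytes(1) == 1u64.to_le_bytes()
}
```

On a little-endian target `native_is_little_endian()` returns `true`, so files written on one such machine read back identically on another.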

10.5 Extra table data

Some of cuprate_database's tables differ from monerod's tables. For example, the way 9.2 Multimap tables are done requires that the primary key is stored for all entries, whereas monerod only needs to store it once.