Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#620: Add structured logging to basic transaction flow (Authority/Client) #706

Merged
merged 12 commits into from
Mar 9, 2022
96 changes: 96 additions & 0 deletions doc/observability.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,96 @@
# Logging, Tracing, Metrics, and Observability

Good observability facilities will be key to the development and productionization of Sui. This is made
more challenging by the distributed and asynchronous nature of Sui, with multiple client and authority
processes distributed over a potentially global network.

The observability stack in Sui is based on the Tokio [tracing](https://tokio.rs/blog/2019-08-tracing) library.
The rest of this document highlights specific aspects of achieving good observability such as useful logging
and metrics in Sui.

## Contexts, Scopes, and Tracing Transaction Flow

The main idea of logging and tracing in a distributed and asynchronous context, where you cannot rely on looking
at individual logs over time in a single thread, is to assign context to logging/tracing so that we can trace,
for example, individual transactions. Context uses key-value pairs so that we can easily filter on subsets
and examine individual flows, across both thread/task and process boundaries.

Here is a table/summary of context information that we will want:

- TX Digest
- Object reference/ID, when applicable
- Address/account
- Certificate digest, if applicable
- For Client HTTP endpoint: route, method, status
- Epoch
- Host information, for both clients and authorities

Example output which shows both context (tx digests) and key-value pairs enhancing observability/filtering:

```
2022-03-03T18:44:13.579135Z DEBUG test_native_transfer:process_cert{tx_digest=t#82fd2ee00ab4d23c498369ff4d6fc0fc1a74be6f56a6d0490022e0920577f4c7}: sui_core::authority_aggregator: Received effects responses from authorities num_unique_effects=1 bad_stake=2
2022-03-03T18:44:13.579165Z DEBUG test_native_transfer:process_cert{tx_digest=t#82fd2ee00ab4d23c498369ff4d6fc0fc1a74be6f56a6d0490022e0920577f4c7}: sui_core::authority_aggregator: Found an effect with good stake over threshold good_stake=4
```

## Key-Value Pairs Schema

### Span names

Spans capture not a single event but an entire block of time, so start, end, duration, etc. can be captured
and analyzed for tracing, performance analysis, etc.

| Name | Place | Meaning |
| ----------------------- | ------------------ | ------------------------------------------------------------------------------ |
| process_tx | Gateway, Authority | Send transaction request, get back 2f+1 signatures and make certificate |
| process_cert | Gateway | Send certificate to authorities to execute transaction |
| handle_cert | Gateway, Authority | Handle certificate processing and Move execution |
| sync_cert | Gateway, Authority | Gateway-initiated sync of data to authority |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Precede this meaning with a verb for consistency with other entries here, like so:
Conduct gateway-initiated sync of data to authority

| tx_check_locks | Authority | Check locks on input objects of incoming transaction request |
| db_set_transaction_lock | Authority | Database set transaction locks on new transaction |
| db_update_state | Authority | Update the database with certificate, effects after transaction Move execution |
| | | |

### Tags - Keys

The idea is that every event and span would get tagged with key/value pairs. Events/logs that log within any context or nested contexts would also inherit the context-level tags.
These tags represent "fields" that can be analyzed and filtered by. For example, one could filter out broadcasts and see the errors for all instances where the bad stake exceeded a certain amount, but not enough for an error.

TODO: see if keys need to be scoped by contexts

| Key | Place(s) | Meaning |
| ------------------- | ------------------ | -------------------------------------------------------------------------- |
| tx_digest | Gateway, Authority | Hex digest of transaction |
| quorum_threshold | Gateway | Numeric threshold of quorum stake needed for a transaction |
| validity_threshold | Gateway | Numeric threshold of maximum "bad stake" from errors that can be tolerated |
| num_errors | Gateway | Number of errors from authorities broadcast |
| good_stake | Gateway | Total amount of good stake from authorities who answered a broadcast |
| bad_stake | Gateway | Total amount of bad stake from authorities, including errors |
| num_signatures | Gateway | Number of signatures received from authorities broadcast |
| num_unique_effects | Gateway | Number of unique effects responses from authorities |
| num_inputs | Authority | Number of inputs for transaction processing |
| num_mutable_objects | Authority | Number of mutable objects for transaction processing |
| gas_used | Authority | Amount of gas used by the transaction |
| | | |

## Logging Levels

This is always tricky, to balance the right amount of verbosity especially by default, but keeping in mind this is a high performance system.

| Level | Type of Messages |
| ----- | ---------------------------------------------------------------------------------------------------------- |
| Error | Process-level faults (not transaction-level errors, there could be a ton of those) |
| Warn | Unusual or byzantine activity |
| Info | High level aggregate stats. Major events related to data sync, epoch changes. |
| Debug | High level tracing for individual transactions. Eg Gateway/client side -> authority -> Move execution etc. |
| Trace | Extremely detailed tracing for individual transactions |
| | |

## Metrics

## Live Async Inspection / Tokio Console

## (Re)Configuration

TODO: Discuss live changing of logging levels for example

## Viewing Logs, Traces, Metrics
2 changes: 2 additions & 0 deletions doc/src/wallet.md
Original file line number Diff line number Diff line change
Expand Up @@ -118,6 +118,8 @@ or
The network config file path defaults to `./network.conf` if not
specified.

NOTE: For logs, set `RUST_LOG=debug` before invoking `./sui start`.


### Running interactive wallet

Expand Down
2 changes: 1 addition & 1 deletion sui/src/rest_server.rs
Original file line number Diff line number Diff line change
Expand Up @@ -44,7 +44,7 @@ async fn main() -> Result<(), String> {
.to_logger("rest_server")
.map_err(|error| format!("failed to create logger: {}", error))?;

tracing_subscriber::fmt().init();
tracing_subscriber::fmt::init();

let mut api = ApiDescription::new();

Expand Down
2 changes: 1 addition & 1 deletion sui/src/sui.rs
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,7 @@ struct SuiOpt {

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
tracing_subscriber::fmt().init();
tracing_subscriber::fmt::init();

let options: SuiOpt = SuiOpt::from_args();
let network_conf_path = options.config;
Expand Down
8 changes: 7 additions & 1 deletion sui/src/wallet.rs
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@ use sui::config::{Config, WalletConfig};
use sui::shell::{install_shell_plugins, AsyncHandler, CommandStructure, Shell};
use sui::wallet_commands::*;
use tracing::error;
use tracing_subscriber::EnvFilter;

const SUI: &str = " _____ _ _ __ ____ __
/ ___/__ __(_) | | / /___ _/ / /__ / /_
Expand Down Expand Up @@ -46,7 +47,12 @@ async fn main() -> Result<(), anyhow::Error> {
.with_thread_names(false)
.without_time()
.compact();
tracing_subscriber::fmt().event_format(format).init();

let env_filter = EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("info"));
tracing_subscriber::fmt()
.event_format(format)
.with_env_filter(env_filter)
.init();

let mut app: App = ClientOpt::clap();
app = app.unset_setting(AppSettings::NoBinaryName);
Expand Down
22 changes: 21 additions & 1 deletion sui_core/src/authority.rs
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ use sui_types::{
storage::Storage,
MOVE_STDLIB_ADDRESS, SUI_FRAMEWORK_ADDRESS,
};
use tracing::{field::debug, *};

use crate::authority_batch::BatchSender;

Expand Down Expand Up @@ -254,6 +255,7 @@ impl AuthorityState {

let mutable_objects: Vec<_> = self
.check_locks(&transaction)
.instrument(tracing::trace_span!("tx_check_locks"))
.await?
.into_iter()
.filter_map(|(object_kind, object)| match object_kind {
Expand All @@ -269,13 +271,19 @@ impl AuthorityState {
})
.collect();

debug!(
num_mutable_objects = mutable_objects.len(),
"Checked locks and found mutable objects"
);

let signed_transaction = SignedTransaction::new(transaction, self.name, &*self.secret);

// Check and write locks, to signed transaction, into the database
// The call to self.set_transaction_lock checks the lock is not conflicting,
// and returns ConflictingTransaction error in case there is a lock on a different
// existing transaction.
self.set_transaction_lock(&mutable_objects, signed_transaction)
.instrument(tracing::trace_span!("db_set_transaction_lock"))
.await?;

// Return the signed Transaction or maybe a cert.
Expand All @@ -296,6 +304,7 @@ impl AuthorityState {
let transaction = &certificate.transaction;
let transaction_digest = transaction.digest();
if transaction.contains_shared_object() {
debug!("Validating shared object sequence numbers from consensus...");
let mut lock_errors = Vec::new();
for object_id in transaction.shared_input_objects() {
// Check whether the shared objects have already been assigned a sequence number by
Expand All @@ -310,6 +319,9 @@ impl AuthorityState {
Some(lock) => {
if let Some(object) = self._database.get_object(object_id)? {
if object.version() != lock {
warn!(object_version =? object.version(),
locked_version =? lock,
"Unexpected version number in locked shared object");
lock_errors.push(SuiError::InvalidSequenceNumber);
}
}
Expand Down Expand Up @@ -378,6 +390,10 @@ impl AuthorityState {
inputs.push(object);
}
}
debug!(
num_inputs = inputs.len(),
"Read inputs for transaction from DB"
);

let mut transaction_dependencies: BTreeSet<_> = inputs
.iter()
Expand All @@ -390,6 +406,10 @@ impl AuthorityState {
let gas_object_id = transaction.gas_payment_object_ref().0;
let (mut temporary_store, status) =
self.execute_transaction(transaction, inputs, &mut tx_ctx)?;
debug!(
gas_used = status.gas_used(),
"Finished execution of transaction with status {:?}", status
);

// Remove from dependencies the generic hash
transaction_dependencies.remove(&TransactionDigest::genesis());
Expand All @@ -405,9 +425,9 @@ impl AuthorityState {
&gas_object_id,
);
// Update the database in an atomic manner

let (seq, resp) = self
.update_state(temporary_store, certificate, to_signed_effects)
.instrument(tracing::debug_span!("db_update_state"))
.await?; // Returns the OrderInfoResponse

// If there is a notifier registered, notify:
Expand Down
5 changes: 5 additions & 0 deletions sui_core/src/authority/authority_store.rs
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@ use std::path::Path;

use std::sync::atomic::AtomicU64;
use sui_types::base_types::SequenceNumber;
use tracing::{warn, Instrument};
use typed_store::rocks::{open_cf, DBBatch, DBMap};

use std::sync::atomic::Ordering;
Expand Down Expand Up @@ -333,6 +334,7 @@ impl AuthorityStore {
// new locks must be atomic, and not writes should happen in between.
{
// Aquire the lock to ensure no one else writes when we are in here.
// MutexGuards are unlocked on drop (ie end of this block)
let _mutexes = self.acquire_locks(mutable_input_objects);

let locks = self.transaction_lock.multi_get(mutable_input_objects)?;
Expand All @@ -347,6 +349,9 @@ impl AuthorityStore {
.get_transaction_lock(obj_ref)?
.expect("If we have a lock we should have a transaction.");

warn!(prev_tx_digest =? previous_tx_digest,
cur_tx_digest =? tx_digest,
"Conflicting transaction! Lock state changed in unexpected way");
// TODO: modify ConflictingTransaction to only return the transaction digest here.
return Err(SuiError::ConflictingTransaction {
pending_transaction: prev_transaction.transaction,
Expand Down
Loading