From dbb3ef33653497ca0cc66097ea10f4cd1e031d1b Mon Sep 17 00:00:00 2001 From: Evan Chan Date: Thu, 3 Mar 2022 10:56:09 -0800 Subject: [PATCH 01/10] Initial observability doc and example of context/structured logging in authority_aggregator --- doc/observability.md | 90 ++++++++++++++++++++++++++++ sui_core/src/authority_aggregator.rs | 63 +++++++++++++++++++ 2 files changed, 153 insertions(+) create mode 100644 doc/observability.md diff --git a/doc/observability.md b/doc/observability.md new file mode 100644 index 0000000000000..1c8528e6abcc1 --- /dev/null +++ b/doc/observability.md @@ -0,0 +1,90 @@ +# Logging, Tracing, Metrics, and Observability + +Good observability facilities will be key to the development and productionization of Sui. This is made +more challenging by the distributed and asynchronous nature of Sui, with multiple client and authority +processes distributed over a potentially global network. + +The observability stack in Sui is based on the Tokio [tracing](https://tokio.rs/blog/2019-08-tracing) library. +The rest of this document highlights specific aspects of achieving good observability such as useful logging +and metrics in Sui. + +## Contexts, Scopes, and Tracing Transaction Flow + +The main idea of logging and tracing in a distributed and asynchronous context, where you cannot rely on looking +at individual logs over time in a single thread, is to assign context to logging/tracing so that we can trace, +for example, individual transactions. Context uses key-value pairs so that we can easily filter on subsets +and examine individual flows, across both thread/task and process boundaries. + +Here is a table/summary of context information that we will want: + +- TX Digest +- Object reference/ID, when applicable +- Address/account +- Certificate digest, if applicable +- For Client HTTP endpoint: route, method, status +- Epoch +- Host information, for both clients and authorities + +Example output which shows both context (tx digests) and key-value pairs enhancing observability/filtering: + +``` +2022-03-03T18:44:13.579135Z DEBUG test_native_transfer:process_cert{tx_digest=t#82fd2ee00ab4d23c498369ff4d6fc0fc1a74be6f56a6d0490022e0920577f4c7}: sui_core::authority_aggregator: Received effects responses from authorities num_unique_effects=1 bad_stake=2 +2022-03-03T18:44:13.579165Z DEBUG test_native_transfer:process_cert{tx_digest=t#82fd2ee00ab4d23c498369ff4d6fc0fc1a74be6f56a6d0490022e0920577f4c7}: sui_core::authority_aggregator: Found an effect with good stake over threshold good_stake=4 +``` + +## Key-Value Pairs Schema + +### Span names + +Spans capture not a single event but an entire block of time, so start, end, duration, etc. can be captured +and analyzed for tracing, performance analysis, etc. + +| Name | Place | Meaning | +| ------------ | ------------------ | ----------------------------------------------------------------------- | +| process_tx | Gateway, Authority | Send transaction request, get back 2f+1 signatures and make certificate | +| process_cert | Gateway | Send certificate to authorities to execute transaction | +| handle_cert | Gateway, Authority | Handle certificate processing and Move execution | +| sync_cert | Gateway, Authority | Gateway-initiated sync of data to authority | +| | | | + +### Tags - Keys + +The idea is that every event and span would get tagged with key/value pairs. Events/logs that log within any context or nested contexts would also inherit the context-level tags. +These tags represent "fields" that can be analyzed and filtered by. For example, one could filter out broadcasts and see the errors for all instances where the bad stake exceeded a certain amount, but not enough for an error. + +TODO: see if keys need to be scoped by contexts + +| Key | Place(s) | Meaning | +| ------------------ | ------------------ | -------------------------------------------------------------------------- | +| tx_digest | Gateway, Authority | Hex digest of transaction | +| quorum_threshold | Gateway | Numeric threshold of quorum stake needed for a transaction | +| validity_threshold | Gateway | Numeric threshold of maximum "bad stake" from errors that can be tolerated | +| num_errors | Gateway | Number of errors from authorities broadcast | +| good_stake | Gateway | Total amount of good stake from authorities who answered a broadcast | +| bad_stake | Gateway | Total amount of bad stake from authorities, including errors | +| num_signatures | Gateway | Number of signatures received from authorities broadcast | +| num_unique_effects | Gateway | Number of unique effects responses from authorities | +| | | | + +## Logging Levels + +This is always tricky, to balance the right amount of verbosity especially by default, but keeping in mind this is a high performance system. + +| Level | Type of Messages | +| ----- | ---------------------------------------------------------------------------------------------------------- | +| Error | Process-level faults (not transaction-level errors, there could be a ton of those) | +| Warn | Unusual or byzantine activity | +| Info | High level aggregate stats. Major events related to data sync, epoch changes. | +| Debug | High level tracing for individual transactions. Eg Gateway/client side -> authority -> Move execution etc. | +| Trace | Extremely detailed tracing for individual transactions | +| | | + +## Metrics + +## Live Async Inspection / Tokio Console + +## (Re)Configuration + +TODO: Discuss live changing of logging levels for example + +## Viewing Logs, Traces, Metrics \ No newline at end of file diff --git a/sui_core/src/authority_aggregator.rs b/sui_core/src/authority_aggregator.rs index b7903633fe92f..6f222d0a657cd 100644 --- a/sui_core/src/authority_aggregator.rs +++ b/sui_core/src/authority_aggregator.rs @@ -15,6 +15,7 @@ use sui_types::{ error::{SuiError, SuiResult}, messages::*, }; +use tracing::{debug, info, trace, Instrument}; use std::collections::{BTreeMap, BTreeSet, HashMap, HashSet}; use std::time::Duration; @@ -740,6 +741,13 @@ where // Now broadcast the transaction to all authorities. let threshold = self.committee.quorum_threshold(); let validity = self.committee.validity_threshold(); + debug!( + quorum_threshold = threshold, + validity_threshold = validity, + ?timeout_after_quorum, + "Broadcasting transaction request to authorities" + ); + trace!("Transaction data: {:?}", transaction.data); struct ProcessTransactionState { // The list of signatures gathered at any point @@ -780,6 +788,7 @@ where certified_transaction: Some(inner_certificate), .. }) => { + trace!(?name, weight, "Received prev certificate from authority"); state.certificate = Some(inner_certificate); } @@ -813,6 +822,12 @@ where state.bad_stake += weight; // This is the bad stake counter if state.bad_stake > validity { // Too many errors + debug!( + num_errors = state.errors.len(), + bad_stake = state.bad_stake, + "Too many errors, validity threshold exceeded. Errors={:?}", + state.errors + ); return Err(SuiError::QuorumNotReached { errors: state.errors, }); @@ -824,6 +839,12 @@ where state.bad_stake += weight; // This is the bad stake counter if state.bad_stake > validity { // Too many errors + debug!( + num_errors = state.errors.len(), + bad_stake = state.bad_stake, + "Too many errors, validity threshold exceeded. Errors={:?}", + state.errors + ); return Err(SuiError::QuorumNotReached { errors: state.errors, }); @@ -844,6 +865,18 @@ where ) .await?; + debug!( + num_errors = state.errors.len(), + good_stake = state.good_stake, + bad_stake = state.bad_stake, + num_signatures = state.signatures.len(), + has_certificate = state.certificate.is_some(), + "Received signatures response from authorities for transaction req broadcast" + ); + if !state.errors.is_empty() { + trace!("Errors received: {:?}", state.errors); + } + // If we have some certificate return it, or return an error. state .certificate @@ -862,6 +895,9 @@ where timeout_after_quorum: Duration, ) -> Result { struct ProcessCertificateState { + // Different authorities could return different effects. We want at least one effect to come + // from 2f+1 authorities, which meets quorum and can be considered the approved effect. + // The map here allows us to count the stake for each unique effect. effects_map: HashMap<[u8; 32], (usize, TransactionEffects)>, bad_stake: usize, } @@ -874,6 +910,13 @@ where let cert_ref = &certificate; let threshold = self.committee.quorum_threshold(); let validity = self.committee.validity_threshold(); + debug!( + quorum_threshold = threshold, + validity_threshold = validity, + ?timeout_after_quorum, + "Broadcasting certificate to authorities" + ); + let state = self .quorum_map_then_reduce_with_timeout( state, @@ -888,6 +931,7 @@ where .handle_confirmation_transaction(ConfirmationTransaction::new( cert_ref.clone(), )) + .instrument(tracing::trace_span!("handle_cert", authority =? _name)) .await; if res.is_ok() { @@ -900,6 +944,7 @@ where // We only attempt to update authority and retry if we are seeing LockErrors. // For any other error, we stop here and return. if !matches!(res, Err(SuiError::LockErrors { .. })) { + debug!("Error from handle_confirmation_transaction(): {:?}", res); return res; } @@ -911,6 +956,7 @@ where timeout_after_quorum, DEFAULT_RETRIES, ) + .instrument(tracing::trace_span!("sync_cert", authority =? _name)) .await?; // Now try again @@ -918,6 +964,7 @@ where .handle_confirmation_transaction(ConfirmationTransaction::new( cert_ref.clone(), )) + .instrument(tracing::trace_span!("handle_cert", authority =? _name, retry = true)) .await }) }, @@ -946,9 +993,12 @@ where } } + // Evan: couldn't we get here if entry.0 above is < threshold but response was valid? wtf?? // If instead we have more than f bad responses, then we fail. state.bad_stake += weight; if state.bad_stake > validity { + debug!(bad_stake = state.bad_stake, + "Too many bad responses from cert processing, validity threshold exceeded."); return Err(SuiError::ErrorWhileRequestingCertificate); } @@ -960,10 +1010,20 @@ where ) .await?; + debug!( + num_unique_effects = state.effects_map.len(), + bad_stake = state.bad_stake, + "Received effects responses from authorities" + ); + // Check that one effects structure has more than 2f votes, // and return it. for (stake, effects) in state.effects_map.into_values() { if stake >= threshold { + debug!( + good_stake = stake, + "Found an effect with good stake over threshold" + ); return Ok(effects); } } @@ -1019,11 +1079,14 @@ where &self, transaction: &Transaction, ) -> Result<(CertifiedTransaction, TransactionEffects), anyhow::Error> { + let tx_digest = transaction.digest(); let new_certificate = self .process_transaction(transaction.clone(), Duration::from_secs(60)) + .instrument(tracing::debug_span!("process_tx", ?tx_digest)) .await?; let response = self .process_certificate(new_certificate.clone(), Duration::from_secs(60)) + .instrument(tracing::debug_span!("process_cert", ?tx_digest)) .await?; Ok((new_certificate, response)) From 6495bc6f08ac9c22ca5efe2146bf8337760a84ff Mon Sep 17 00:00:00 2001 From: Evan Chan Date: Mon, 7 Mar 2022 15:07:47 -0800 Subject: [PATCH 02/10] Start logging tx / cert processing in authority.rs --- doc/observability.md | 42 +++++++++++++---------- sui_core/src/authority.rs | 23 ++++++++++++- sui_core/src/authority/authority_store.rs | 5 +++ sui_core/src/authority_server.rs | 14 +++++--- sui_types/src/messages.rs | 8 +++++ 5 files changed, 68 insertions(+), 24 deletions(-) diff --git a/doc/observability.md b/doc/observability.md index 1c8528e6abcc1..b65061a99a084 100644 --- a/doc/observability.md +++ b/doc/observability.md @@ -39,13 +39,16 @@ Example output which shows both context (tx digests) and key-value pairs enhanci Spans capture not a single event but an entire block of time, so start, end, duration, etc. can be captured and analyzed for tracing, performance analysis, etc. -| Name | Place | Meaning | -| ------------ | ------------------ | ----------------------------------------------------------------------- | -| process_tx | Gateway, Authority | Send transaction request, get back 2f+1 signatures and make certificate | -| process_cert | Gateway | Send certificate to authorities to execute transaction | -| handle_cert | Gateway, Authority | Handle certificate processing and Move execution | -| sync_cert | Gateway, Authority | Gateway-initiated sync of data to authority | -| | | | +| Name | Place | Meaning | +| ----------------------- | ------------------ | ------------------------------------------------------------------------------ | +| process_tx | Gateway, Authority | Send transaction request, get back 2f+1 signatures and make certificate | +| process_cert | Gateway | Send certificate to authorities to execute transaction | +| handle_cert | Gateway, Authority | Handle certificate processing and Move execution | +| sync_cert | Gateway, Authority | Gateway-initiated sync of data to authority | +| tx_check_locks | Authority | Check locks on input objects of incoming transaction request | +| db_set_transaction_lock | Authority | Database set transaction locks on new transaction | +| db_update_state | Authority | Update the database with certificate, effects after transaction Move execution | +| | | | ### Tags - Keys @@ -54,17 +57,20 @@ These tags represent "fields" that can be analyzed and filtered by. For example TODO: see if keys need to be scoped by contexts -| Key | Place(s) | Meaning | -| ------------------ | ------------------ | -------------------------------------------------------------------------- | -| tx_digest | Gateway, Authority | Hex digest of transaction | -| quorum_threshold | Gateway | Numeric threshold of quorum stake needed for a transaction | -| validity_threshold | Gateway | Numeric threshold of maximum "bad stake" from errors that can be tolerated | -| num_errors | Gateway | Number of errors from authorities broadcast | -| good_stake | Gateway | Total amount of good stake from authorities who answered a broadcast | -| bad_stake | Gateway | Total amount of bad stake from authorities, including errors | -| num_signatures | Gateway | Number of signatures received from authorities broadcast | -| num_unique_effects | Gateway | Number of unique effects responses from authorities | -| | | | +| Key | Place(s) | Meaning | +| ------------------- | ------------------ | -------------------------------------------------------------------------- | +| tx_digest | Gateway, Authority | Hex digest of transaction | +| quorum_threshold | Gateway | Numeric threshold of quorum stake needed for a transaction | +| validity_threshold | Gateway | Numeric threshold of maximum "bad stake" from errors that can be tolerated | +| num_errors | Gateway | Number of errors from authorities broadcast | +| good_stake | Gateway | Total amount of good stake from authorities who answered a broadcast | +| bad_stake | Gateway | Total amount of bad stake from authorities, including errors | +| num_signatures | Gateway | Number of signatures received from authorities broadcast | +| num_unique_effects | Gateway | Number of unique effects responses from authorities | +| num_inputs | Authority | Number of inputs for transaction processing | +| num_mutable_objects | Authority | Number of mutable objects for transaction processing | +| gas_used | Authority | Amount of gas used by the transaction | +| | | | ## Logging Levels diff --git a/sui_core/src/authority.rs b/sui_core/src/authority.rs index 1c63f2c98a347..ab86f42756a8e 100644 --- a/sui_core/src/authority.rs +++ b/sui_core/src/authority.rs @@ -26,6 +26,7 @@ use sui_types::{ storage::Storage, MOVE_STDLIB_ADDRESS, SUI_FRAMEWORK_ADDRESS, }; +use tracing::{field::debug, *}; #[cfg(test)] #[path = "unit_tests/authority_tests.rs"] @@ -237,6 +238,7 @@ impl AuthorityState { let mutable_objects: Vec<_> = self .check_locks(&transaction) + .instrument(tracing::trace_span!("tx_check_locks")) .await? .into_iter() .filter_map(|(object_kind, object)| match object_kind { @@ -252,6 +254,11 @@ impl AuthorityState { }) .collect(); + debug!( + num_mutable_objects = mutable_objects.len(), + "Checked locks and found mutable objects" + ); + let signed_transaction = SignedTransaction::new(transaction, self.name, &*self.secret); // Check and write locks, to signed transaction, into the database @@ -259,6 +266,7 @@ impl AuthorityState { // and returns ConflictingTransaction error in case there is a lock on a different // existing transaction. self.set_transaction_lock(&mutable_objects, signed_transaction) + .instrument(tracing::trace_span!("db_set_transaction_lock")) .await?; // Return the signed Transaction or maybe a cert. @@ -279,6 +287,7 @@ impl AuthorityState { let transaction = &certificate.transaction; let transaction_digest = transaction.digest(); if transaction.contains_shared_object() { + debug!("Validating shared object sequence numbers from consensus..."); let mut lock_errors = Vec::new(); for object_id in transaction.shared_input_objects() { // Check whether the shared objects have already been assigned a sequence number by @@ -291,6 +300,9 @@ impl AuthorityState { Some(lock) => { if let Some(object) = self._database.get_object(object_id)? { if object.version() != lock { + warn!(object_version =? object.version(), + locked_version =? lock, + "Unexpected version number in locked shared object"); lock_errors.push(SuiError::InvalidSequenceNumber); } } @@ -311,7 +323,7 @@ impl AuthorityState { .process_certificate(confirmation_transaction.clone()) .await; - // If the execution is successfully, we cleanup some data structures. + // If the execution is successful, we cleanup some data structures. if result.is_ok() { for object_id in transaction.shared_input_objects() { self._database @@ -354,6 +366,10 @@ impl AuthorityState { inputs.push(object); } } + debug!( + num_inputs = inputs.len(), + "Read inputs for transaction from DB" + ); let mut transaction_dependencies: BTreeSet<_> = inputs .iter() @@ -366,6 +382,10 @@ impl AuthorityState { let gas_object_id = transaction.gas_payment_object_ref().0; let (mut temporary_store, status) = self.execute_transaction(transaction, inputs, &mut tx_ctx)?; + debug!( + gas_used = status.gas_used(), + "Finished execution of transaction with status {:?}", status + ); // Remove from dependencies the generic hash transaction_dependencies.remove(&TransactionDigest::genesis()); @@ -382,6 +402,7 @@ impl AuthorityState { ); // Update the database in an atomic manner self.update_state(temporary_store, certificate, to_signed_effects) + .instrument(tracing::debug_span!("db_update_state")) .await // Returns the TransactionInfoResponse } diff --git a/sui_core/src/authority/authority_store.rs b/sui_core/src/authority/authority_store.rs index c136001b64095..7c5311afbcbde 100644 --- a/sui_core/src/authority/authority_store.rs +++ b/sui_core/src/authority/authority_store.rs @@ -7,6 +7,7 @@ use std::collections::BTreeSet; use std::convert::TryInto; use std::path::Path; use sui_types::base_types::SequenceNumber; +use tracing::{warn, Instrument}; use typed_store::rocks::{open_cf, DBBatch, DBMap}; use typed_store::traits::Map; @@ -292,6 +293,7 @@ impl AuthorityStore { // new locks must be atomic, and not writes should happen in between. { // Aquire the lock to ensure no one else writes when we are in here. + // MutexGuards are unlocked on drop (ie end of this block) let _mutexes = self.acquire_locks(mutable_input_objects); let locks = self.transaction_lock.multi_get(mutable_input_objects)?; @@ -306,6 +308,9 @@ impl AuthorityStore { .get_transaction_lock(obj_ref)? .expect("If we have a lock we should have a transaction."); + warn!(prev_tx_digest =? previous_tx_digest, + cur_tx_digest =? tx_digest, + "Conflicting transaction! Lock state changed in unexpected way"); // TODO: modify ConflictingTransaction to only return the transaction digest here. return Err(SuiError::ConflictingTransaction { pending_transaction: prev_transaction.transaction, diff --git a/sui_core/src/authority_server.rs b/sui_core/src/authority_server.rs index b2c42f113fcc2..54434ef2bf504 100644 --- a/sui_core/src/authority_server.rs +++ b/sui_core/src/authority_server.rs @@ -50,11 +50,14 @@ impl MessageHandler for AuthorityServer { Err(_) => Err(SuiError::InvalidDecoding), Ok(result) => { match result { - SerializedMessage::Transaction(message) => self - .state - .handle_transaction(*message) - .await - .map(|info| Some(serialize_transaction_info(&info))), + SerializedMessage::Transaction(message) => { + let tx_digest = message.digest(); + self.state + .handle_transaction(*message) + .instrument(tracing::debug_span!("process_tx", ?tx_digest)) + .await + .map(|info| Some(serialize_transaction_info(&info))) + } SerializedMessage::Cert(message) => { let confirmation_transaction = ConfirmationTransaction { certificate: message.as_ref().clone(), @@ -62,6 +65,7 @@ impl MessageHandler for AuthorityServer { match self .state .handle_confirmation_transaction(confirmation_transaction) + .instrument(tracing::debug_span!("process_cert", tx_digest =? message.transaction.digest())) .await { Ok(info) => { diff --git a/sui_types/src/messages.rs b/sui_types/src/messages.rs index 8540bc420bc89..298199955e99f 100644 --- a/sui_types/src/messages.rs +++ b/sui_types/src/messages.rs @@ -271,6 +271,14 @@ impl ExecutionStatus { ExecutionStatus::Failure { gas_used, error } => (gas_used, *error), } } + + /// Returns the gas used from the status + pub fn gas_used(&self) -> u64 { + match &self { + ExecutionStatus::Success { gas_used } => *gas_used, + ExecutionStatus::Failure { gas_used, .. } => *gas_used, + } + } } /// The response from processing a transaction or a certified transaction From 9d07e33868439e1e905a667c66f92d4469e386ce Mon Sep 17 00:00:00 2001 From: Evan Chan Date: Mon, 7 Mar 2022 16:47:36 -0800 Subject: [PATCH 03/10] Fix tracing_subscriber initialization to respect RUST_LOG environment var --- sui/src/rest_server.rs | 2 +- sui/src/sui.rs | 2 +- sui/src/wallet.rs | 8 +++++++- 3 files changed, 9 insertions(+), 3 deletions(-) diff --git a/sui/src/rest_server.rs b/sui/src/rest_server.rs index 7fa3ea03f772a..75a166cf1d8f0 100644 --- a/sui/src/rest_server.rs +++ b/sui/src/rest_server.rs @@ -44,7 +44,7 @@ async fn main() -> Result<(), String> { .to_logger("rest_server") .map_err(|error| format!("failed to create logger: {}", error))?; - tracing_subscriber::fmt().init(); + tracing_subscriber::fmt::init(); let mut api = ApiDescription::new(); diff --git a/sui/src/sui.rs b/sui/src/sui.rs index 735525a07f0bb..f938c16130d6c 100644 --- a/sui/src/sui.rs +++ b/sui/src/sui.rs @@ -26,7 +26,7 @@ struct SuiOpt { #[tokio::main] async fn main() -> Result<(), anyhow::Error> { - tracing_subscriber::fmt().init(); + tracing_subscriber::fmt::init(); let options: SuiOpt = SuiOpt::from_args(); let network_conf_path = options.config; diff --git a/sui/src/wallet.rs b/sui/src/wallet.rs index 42fed211a4bff..9dd5ef741f385 100644 --- a/sui/src/wallet.rs +++ b/sui/src/wallet.rs @@ -10,6 +10,7 @@ use sui::config::{Config, WalletConfig}; use sui::shell::{install_shell_plugins, AsyncHandler, CommandStructure, Shell}; use sui::wallet_commands::*; use tracing::error; +use tracing_subscriber::EnvFilter; const SUI: &str = " _____ _ _ __ ____ __ / ___/__ __(_) | | / /___ _/ / /__ / /_ @@ -46,7 +47,12 @@ async fn main() -> Result<(), anyhow::Error> { .with_thread_names(false) .without_time() .compact(); - tracing_subscriber::fmt().event_format(format).init(); + + let env_filter = EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("info")); + tracing_subscriber::fmt() + .event_format(format) + .with_env_filter(env_filter) + .init(); let mut app: App = ClientOpt::clap(); app = app.unset_setting(AppSettings::NoBinaryName); From a25d6334fda12b23cc47486be9853f42cb7a6c94 Mon Sep 17 00:00:00 2001 From: Evan Chan Date: Mon, 7 Mar 2022 16:49:38 -0800 Subject: [PATCH 04/10] Add note about enabling logs --- doc/src/wallet.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/doc/src/wallet.md b/doc/src/wallet.md index 9ab5ab71d0c4f..a6aaf06bdddfa 100644 --- a/doc/src/wallet.md +++ b/doc/src/wallet.md @@ -118,6 +118,8 @@ or The network config file path defaults to `./network.conf` if not specified. +NOTE: For logs, set `RUST_LOG=debug` before invoking `./sui start`. + ### Running interactive wallet From 63e63ef09d440620f4048fc75d6d69c9bbb21454 Mon Sep 17 00:00:00 2001 From: Evan Chan Date: Tue, 8 Mar 2022 15:18:27 -0800 Subject: [PATCH 05/10] Add JSON span logging output by setting SUI_JSON_SPAN_LOGS=1 --- doc/observability.md | 53 +++++++++++++++++++++++++++++++++++++++----- sui/Cargo.toml | 3 ++- sui/src/sui.rs | 33 ++++++++++++++++++++++++++- 3 files changed, 82 insertions(+), 7 deletions(-) diff --git a/doc/observability.md b/doc/observability.md index b65061a99a084..4f4e74a6e061e 100644 --- a/doc/observability.md +++ b/doc/observability.md @@ -25,11 +25,18 @@ Here is a table/summary of context information that we will want: - Epoch - Host information, for both clients and authorities -Example output which shows both context (tx digests) and key-value pairs enhancing observability/filtering: +Example output which shows both context (tx digests) and key-value pairs enhancing observability/filtering, +and tracing a transaction across the gateway (`authority_aggregator`) as well as the authority: ``` -2022-03-03T18:44:13.579135Z DEBUG test_native_transfer:process_cert{tx_digest=t#82fd2ee00ab4d23c498369ff4d6fc0fc1a74be6f56a6d0490022e0920577f4c7}: sui_core::authority_aggregator: Received effects responses from authorities num_unique_effects=1 bad_stake=2 -2022-03-03T18:44:13.579165Z DEBUG test_native_transfer:process_cert{tx_digest=t#82fd2ee00ab4d23c498369ff4d6fc0fc1a74be6f56a6d0490022e0920577f4c7}: sui_core::authority_aggregator: Found an effect with good stake over threshold good_stake=4 +7ab7774d1f7bd40848}: sui_core::authority_aggregator: Broadcasting transaction request to authorities quorum_threshold=3 validity_threshold=2 timeout_after_quorum=60s +2022-03-05T01:35:03.383791Z TRACE test_move_call_args_linter_command:process_tx{tx_digest=t#7e5f08ab09ec80e3372c101c5858c96965a25326c21af27ab7774d1f7bd40848}: sui_core::authority_aggregator: Transaction data: TransactionData { kind: Call(MoveCall { package: (0000000000000000000000000000000000000002, SequenceNumber(1), o#3104eb8786a94f58d88564c38e22f13d79e3868c5cf81c9c9228fe91465eccad), module: Identifier("ObjectBasics"), function: Identifier("transfer"), type_arguments: [], object_arguments: [(DA40C299F382CBC3C1EBEEA97351F5F185BAD359, SequenceNumber(1), o#d299113b3b52fd1b9dc01e3ba9cf70345faed592af04a56e287057f166ed2783)], shared_object_arguments: [], pure_arguments: [[145, 123, 205, 38, 175, 158, 193, 63, 122, 56, 238, 127, 139, 117, 186, 164, 89, 46, 222, 252]], gas_budget: 1000 }), sender: k#37ebb9c16574a57bcc7b52a6312a35991748be55, gas_payment: (3EE0283D2D12D5C49D0E4E2F509D07227A64ADF2, SequenceNumber(1), o#3ad1a71ee65e8e6675e6a0fb1e893e48c1820b274d3055d75f4abb850c9663e5) } +2022-03-05T01:35:03.385294Z DEBUG test_move_call_args_linter_command:process_tx{tx_digest=t#7e5f08ab09ec80e3372c101c5858c96965a25326c21af27ab7774d1f7bd40848}: sui_core::authority: Checked locks and found mutable objects num_mutable_objects=2 +2022-03-05T01:35:03.386500Z DEBUG test_move_call_args_linter_command:process_tx{tx_digest=t#7e5f08ab09ec80e3372c101c5858c96965a25326c21af27ab7774d1f7bd40848}: sui_core::authority: Checked locks and found mutable objects num_mutable_objects=2 +2022-03-05T01:35:03.387681Z DEBUG test_move_call_args_linter_command:process_tx{tx_digest=t#7e5f08ab09ec80e3372c101c5858c96965a25326c21af27ab7774d1f7bd40848}: sui_core::authority_aggregator: Received signatures response from authorities for transaction req broadcast num_errors=0 good_stake=3 bad_stake=0 num_signatures=3 has_certificate=true +2022-03-05T01:35:03.391891Z DEBUG test_move_call_args_linter_command:process_cert{tx_digest=t#7e5f08ab09ec80e3372c101c5858c96965a25326c21af27ab7774d1f7bd40848}: sui_core::authority_aggregator: Broadcasting certificate to authorities quorum_threshold=3 validity_threshold=2 timeout_after_quorum=60s +2022-03-05T01:35:03.394529Z DEBUG test_move_call_args_linter_command:process_cert{tx_digest=t#7e5f08ab09ec80e3372c101c5858c96965a25326c21af27ab7774d1f7bd40848}: sui_core::authority: Read inputs for transaction from DB num_inputs=3 +2022-03-05T01:35:03.395917Z DEBUG test_move_call_args_linter_command:process_cert{tx_digest=t#7e5f08ab09ec80e3372c101c5858c96965a25326c21af27ab7774d1f7bd40848}: sui_core::authority: Finished execution of transaction with status Success { gas_used: 7 } gas_used=7 ``` ## Key-Value Pairs Schema @@ -85,12 +92,48 @@ This is always tricky, to balance the right amount of verbosity especially by de | Trace | Extremely detailed tracing for individual transactions | | | | +Going from info to debug results in a much larger spew of messages. + +The `RUST_LOG` environment variable can be used to set both the overall logging level as well as the level for +individual components, and even filtering down to specific spans or tags within spans are possible too. +For more details, please see the [EnvFilter](https://docs.rs/tracing-subscriber/latest/tracing_subscriber/filter/struct.EnvFilter.html) docs. + +## Tracing and Span Output + +By default, only output from logging events (eg `info!`, `debug!`, etc) are output, and not span information. +However, detailed span start and end logs can be generated by defining the `SUI_JSON_SPAN_LOGS` environment variable. Note that this causes all output to be in JSON format, which is not as human-readable, so it is not enabled by default. + +The example output below shows certificate processing in the authority with span logging. Note the START and END annotations, +and notice how DB_UPDATE_STATE which is nested is embedded within PROCESS_CERT. +Also notice `elapsed_milliseconds` which logs the duration of each span. + +``` +{"v":0,"name":"sui","msg":"[PROCESS_CERT - START]","level":20,"hostname":"Evan-MLbook.lan","pid":51425,"time":"2022-03-08T22:48:11.241421Z","target":"sui_core::authority_server","line":67,"file":"sui_core/src/authority_server.rs","tx_digest":"t#d1385064287c2ad67e4019dd118d487a39ca91a40e0fd8e678dbc32e112a1493"} +{"v":0,"name":"sui","msg":"[PROCESS_CERT - EVENT] Read inputs for transaction from DB","level":20,"hostname":"Evan-MLbook.lan","pid":51425,"time":"2022-03-08T22:48:11.246688Z","target":"sui_core::authority","line":393,"file":"sui_core/src/authority.rs","num_inputs":2,"tx_digest":"t#d1385064287c2ad67e4019dd118d487a39ca91a40e0fd8e678dbc32e112a1493"} +{"v":0,"name":"sui","msg":"[PROCESS_CERT - EVENT] Finished execution of transaction with status Success { gas_used: 18 }","level":20,"hostname":"Evan-MLbook.lan","pid":51425,"time":"2022-03-08T22:48:11.246759Z","target":"sui_core::authority","line":409,"file":"sui_core/src/authority.rs","gas_used":18,"tx_digest":"t#d1385064287c2ad67e4019dd118d487a39ca91a40e0fd8e678dbc32e112a1493"} +{"v":0,"name":"sui","msg":"[DB_UPDATE_STATE - START]","level":20,"hostname":"Evan-MLbook.lan","pid":51425,"time":"2022-03-08T22:48:11.247888Z","target":"sui_core::authority","line":430,"file":"sui_core/src/authority.rs","tx_digest":"t#d1385064287c2ad67e4019dd118d487a39ca91a40e0fd8e678dbc32e112a1493"} +{"v":0,"name":"sui","msg":"[DB_UPDATE_STATE - END]","level":20,"hostname":"Evan-MLbook.lan","pid":51425,"time":"2022-03-08T22:48:11.248114Z","target":"sui_core::authority","line":430,"file":"sui_core/src/authority.rs","tx_digest":"t#d1385064287c2ad67e4019dd118d487a39ca91a40e0fd8e678dbc32e112a1493","elapsed_milliseconds":0} +{"v":0,"name":"sui","msg":"[PROCESS_CERT - END]","level":20,"hostname":"Evan-MLbook.lan","pid":51425,"time":"2022-03-08T22:48:11.248688Z","target":"sui_core::authority_server","line":67,"file":"sui_core/src/authority_server.rs","tx_digest":"t#d1385064287c2ad67e4019dd118d487a39ca91a40e0fd8e678dbc32e112a1493","elapsed_milliseconds":2} +``` + +## Observability of MOVE programs, Events, etc. + +TODO + ## Metrics +## Ways to View Logs, Traces, Metrics + +TODO: add architecture diagram + +The idea here is to enable the viewing of logging, traces, and metrics with industry-standard open source tools. Thoughts: +- JSON logs from `SUI_JSON_SPAN_LOGS` could be fed directly to ElasticSearch or similar backends for easy indexing of fields and aggregation +- Numbers from the output could be fed into Prometheus +- OpenTelemetry output could allow the use of Yaeger and similar tracing backends + + ## Live Async Inspection / Tokio Console ## (Re)Configuration -TODO: Discuss live changing of logging levels for example - ## Viewing Logs, Traces, Metrics \ No newline at end of file diff --git a/sui/Cargo.toml b/sui/Cargo.toml index ff38798a88398..bd248621a34c3 100644 --- a/sui/Cargo.toml +++ b/sui/Cargo.toml @@ -27,7 +27,8 @@ hex = "0.4.3" async-trait = "0.1.52" serde_with = { version = "1.12.0", features = ["hex"] } tracing = { version = "0.1.31", features = ["log"] } -tracing-subscriber = { version = "0.3.9", features = ["time", "env-filter"] } +tracing-subscriber = { version = "0.3.9", features = ["time", "registry", "env-filter"] } +tracing-bunyan-formatter = "0.3" serde-value = "0.7.0" log = "0.4.14" diff --git a/sui/src/sui.rs b/sui/src/sui.rs index f938c16130d6c..859527a4e42a8 100644 --- a/sui/src/sui.rs +++ b/sui/src/sui.rs @@ -7,6 +7,11 @@ use structopt::StructOpt; use sui::config::{Config, NetworkConfig}; use sui::sui_commands::SuiCommand; +use tracing::info; +use tracing::subscriber::set_global_default; +use tracing_bunyan_formatter::{BunyanFormattingLayer, JsonStorageLayer}; +use tracing_subscriber::{layer::SubscriberExt, EnvFilter, Registry}; + #[cfg(test)] #[path = "unit_tests/cli_tests.rs"] mod cli_tests; @@ -26,7 +31,33 @@ struct SuiOpt { #[tokio::main] async fn main() -> Result<(), anyhow::Error> { - tracing_subscriber::fmt::init(); + if std::env::var("SUI_JSON_SPAN_LOGS").is_ok() { + // Code to add logging/tracing config from environment, including RUST_LOG + // See https://www.lpalmieri.com/posts/2020-09-27-zero-to-production-4-are-we-observable-yet/#5-7-tracing-bunyan-formatter + // Also Bunyan layer addes JSON logging for tracing spans with duration information + let env_filter = + EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("info")); + let formatting_layer = BunyanFormattingLayer::new( + "sui".into(), + // Output the formatted spans to stdout. + std::io::stdout, + ); + // The `with` method is provided by `SubscriberExt`, an extension + // trait for `Subscriber` exposed by `tracing_subscriber` + let subscriber = Registry::default() + .with(env_filter) + .with(JsonStorageLayer) + .with(formatting_layer); + // `set_global_default` can be used by applications to specify + // what subscriber should be used to process spans. + set_global_default(subscriber).expect("Failed to set subscriber"); + + info!("Enabling JSON and span logging"); + } else { + // Initializes default logging. Will use RUST_LOG but no JSON/span info. + tracing_subscriber::fmt::init(); + info!("Standard user-friendly logging, no spans no JSON"); + } let options: SuiOpt = SuiOpt::from_args(); let network_conf_path = options.config; From 2bf190b3c9edf1aa83a04c0cafaa3546e6d39f43 Mon Sep 17 00:00:00 2001 From: Evan Chan Date: Wed, 9 Mar 2022 11:42:19 -0800 Subject: [PATCH 06/10] Add tx_kind to both gateway side and authority side transaction tags --- doc/observability.md | 1 + sui_core/src/authority_aggregator.rs | 13 +++++++++++-- sui_core/src/authority_server.rs | 9 +++++++-- sui_types/Cargo.toml | 2 ++ sui_types/src/messages.rs | 19 ++++++++++++++++++- 5 files changed, 39 insertions(+), 5 deletions(-) diff --git a/doc/observability.md b/doc/observability.md index 4f4e74a6e061e..bae3d8163ca73 100644 --- a/doc/observability.md +++ b/doc/observability.md @@ -67,6 +67,7 @@ TODO: see if keys need to be scoped by contexts | Key | Place(s) | Meaning | | ------------------- | ------------------ | -------------------------------------------------------------------------- | | tx_digest | Gateway, Authority | Hex digest of transaction | +| tx_kind | Gateway, Authority | Kind of transaction: Transfer/Publish/Call | | quorum_threshold | Gateway | Numeric threshold of quorum stake needed for a transaction | | validity_threshold | Gateway | Numeric threshold of maximum "bad stake" from errors that can be tolerated | | num_errors | Gateway | Number of errors from authorities broadcast | diff --git a/sui_core/src/authority_aggregator.rs b/sui_core/src/authority_aggregator.rs index 807a44f49fda5..d3d02b89e73e2 100644 --- a/sui_core/src/authority_aggregator.rs +++ b/sui_core/src/authority_aggregator.rs @@ -18,6 +18,7 @@ use sui_types::{ use tracing::{debug, info, trace, Instrument}; use std::collections::{BTreeMap, BTreeSet, HashMap, HashSet}; +use std::string::ToString; use std::time::Duration; use tokio::sync::mpsc::Receiver; use tokio::time::timeout; @@ -1082,11 +1083,19 @@ where let tx_digest = transaction.digest(); let new_certificate = self .process_transaction(transaction.clone(), Duration::from_secs(60)) - .instrument(tracing::debug_span!("process_tx", ?tx_digest)) + .instrument(tracing::debug_span!( + "process_tx", + ?tx_digest, + tx_kind = transaction.data.kind_as_str() + )) .await?; let response = self .process_certificate(new_certificate.clone(), Duration::from_secs(60)) - .instrument(tracing::debug_span!("process_cert", ?tx_digest)) + .instrument(tracing::debug_span!( + "process_cert", + ?tx_digest, + tx_kind = transaction.data.kind_as_str() + )) .await?; Ok((new_certificate, response)) diff --git a/sui_core/src/authority_server.rs b/sui_core/src/authority_server.rs index cce928b4c878c..2cf46a302b25d 100644 --- a/sui_core/src/authority_server.rs +++ b/sui_core/src/authority_server.rs @@ -172,9 +172,11 @@ impl AuthorityServer { match result { SerializedMessage::Transaction(message) => { let tx_digest = message.digest(); + // No allocations: it's a 'static str! + let tx_kind = message.data.kind_as_str(); self.state .handle_transaction(*message) - .instrument(tracing::debug_span!("process_tx", ?tx_digest)) + .instrument(tracing::debug_span!("process_tx", ?tx_digest, tx_kind)) .await .map(|info| Some(serialize_transaction_info(&info))) } @@ -182,10 +184,13 @@ impl AuthorityServer { let confirmation_transaction = ConfirmationTransaction { certificate: message.as_ref().clone(), }; + let tx_kind = message.transaction.data.kind_as_str(); match self .state .handle_confirmation_transaction(confirmation_transaction) - .instrument(tracing::debug_span!("process_cert", tx_digest =? message.transaction.digest())) + .instrument(tracing::debug_span!("process_cert", + tx_digest =? message.transaction.digest(), + tx_kind)) .await { Ok(info) => { diff --git a/sui_types/Cargo.toml b/sui_types/Cargo.toml index 42da12911e690..bb354459a4a0d 100644 --- a/sui_types/Cargo.toml +++ b/sui_types/Cargo.toml @@ -24,6 +24,8 @@ serde_json = "1.0.79" serde_with = "1.12.0" signature = "1.5.0" static_assertions = "1.1.0" +strum = "0.24.0" +strum_macros = "0.24.0" typed-store = { git = "https://github.com/MystenLabs/mysten-infra", rev ="a36c852c30729a16ee3c8a6a863f654887f0c30e"} diff --git a/sui_types/src/messages.rs b/sui_types/src/messages.rs index a83ba92097730..ac2f823da8ae8 100644 --- a/sui_types/src/messages.rs +++ b/sui_types/src/messages.rs @@ -22,6 +22,7 @@ use std::{ collections::{BTreeSet, HashSet}, hash::{Hash, Hasher}, }; +use strum::VariantNames; #[derive(Debug, PartialEq, Eq, Hash, Clone, Serialize, Deserialize)] pub struct Transfer { @@ -52,7 +53,9 @@ pub struct MoveModulePublish { pub gas_budget: u64, } -#[derive(Debug, PartialEq, Eq, Hash, Clone, Serialize, Deserialize)] +#[derive( + Debug, PartialEq, Eq, Hash, Clone, Serialize, Deserialize, strum_macros::EnumVariantNames, +)] pub enum TransactionKind { /// Initiate an object transfer between addresses Transfer(Transfer), @@ -129,6 +132,20 @@ impl TransactionData { }); Self::new(kind, sender, gas_payment) } + + /// Returns the transaction kind as a &str (variant name, no fields) + pub fn kind_as_str(&self) -> &'static str { + // NOTE: Ideally we could have used something like https://docs.rs/strum/latest/strum/derive.AsRefStr.html + // The problem is that it doesn't actually return &'static ref due to &self above + // and we really want 'static for common situations, such as authority_server dispatch where + // by the time we instrument the transaction kind, the message or Transaction might have been moved + // and so the lifetime and Kind is out of scope and we cannot borrow it. + match self.kind { + TransactionKind::Transfer(_) => TransactionKind::VARIANTS[0], + TransactionKind::Publish(_) => TransactionKind::VARIANTS[1], + TransactionKind::Call(_) => TransactionKind::VARIANTS[2], + } + } } /// An transaction signed by a client. signature is applied on data. From afc32fb25aa5da5b1ed1b0bba30d3372b799f7a9 Mon Sep 17 00:00:00 2001 From: Evan Chan Date: Wed, 9 Mar 2022 12:00:13 -0800 Subject: [PATCH 07/10] Cargo clippy --- sui_core/src/authority.rs | 2 +- sui_core/src/authority/authority_store.rs | 2 +- sui_core/src/authority_aggregator.rs | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/sui_core/src/authority.rs b/sui_core/src/authority.rs index cc105175d4fac..5e19600dfed1f 100644 --- a/sui_core/src/authority.rs +++ b/sui_core/src/authority.rs @@ -27,7 +27,7 @@ use sui_types::{ storage::{DeleteKind, Storage}, MOVE_STDLIB_ADDRESS, SUI_FRAMEWORK_ADDRESS, }; -use tracing::{field::debug, *}; +use tracing::*; use crate::authority_batch::{BatchSender, BroadcastReceiver, BroadcastSender}; diff --git a/sui_core/src/authority/authority_store.rs b/sui_core/src/authority/authority_store.rs index e0106c1f13e73..5d1ee3f9efc73 100644 --- a/sui_core/src/authority/authority_store.rs +++ b/sui_core/src/authority/authority_store.rs @@ -10,7 +10,7 @@ use std::path::Path; use std::sync::atomic::AtomicU64; use sui_types::base_types::SequenceNumber; use sui_types::batch::{SignedBatch, TxSequenceNumber}; -use tracing::{warn, Instrument}; +use tracing::warn; use typed_store::rocks::{open_cf, DBBatch, DBMap}; use std::sync::atomic::Ordering; diff --git a/sui_core/src/authority_aggregator.rs b/sui_core/src/authority_aggregator.rs index d3d02b89e73e2..90ea19e3e85fd 100644 --- a/sui_core/src/authority_aggregator.rs +++ b/sui_core/src/authority_aggregator.rs @@ -15,7 +15,7 @@ use sui_types::{ error::{SuiError, SuiResult}, messages::*, }; -use tracing::{debug, info, trace, Instrument}; +use tracing::{debug, trace, Instrument}; use std::collections::{BTreeMap, BTreeSet, HashMap, HashSet}; use std::string::ToString; From 52c579501c0e0cf21441717e1d3af6551865efb6 Mon Sep 17 00:00:00 2001 From: Evan Chan Date: Wed, 9 Mar 2022 13:04:12 -0800 Subject: [PATCH 08/10] CR feedback esp on docs --- {doc => dev-docs}/observability.md | 44 +++++++++++++++++------------- sui/src/sui.rs | 1 + 2 files changed, 26 insertions(+), 19 deletions(-) rename {doc => dev-docs}/observability.md (89%) diff --git a/doc/observability.md b/dev-docs/observability.md similarity index 89% rename from doc/observability.md rename to dev-docs/observability.md index bae3d8163ca73..22f9a8cc787eb 100644 --- a/doc/observability.md +++ b/dev-docs/observability.md @@ -1,14 +1,18 @@ # Logging, Tracing, Metrics, and Observability -Good observability facilities will be key to the development and productionization of Sui. This is made +Good observability facilities are key to the development and growth of Sui. This is made more challenging by the distributed and asynchronous nature of Sui, with multiple client and authority processes distributed over a potentially global network. -The observability stack in Sui is based on the Tokio [tracing](https://tokio.rs/blog/2019-08-tracing) library. -The rest of this document highlights specific aspects of achieving good observability such as useful logging +The observability stack in Sui is based on the [Tokio tracing](https://tokio.rs/blog/2019-08-tracing) library. +The rest of this document highlights specific aspects of achieving good observability through useful logging and metrics in Sui. -## Contexts, Scopes, and Tracing Transaction Flow +NOTE: The output here is largely for the consumption of Sui operators, administrators, and developers. The +content of logs and traces do not represent the authoritative, certified output of authorities and are subject +to potentially byzantine behavior. + +## Contexts, scopes, and tracing transaction flow The main idea of logging and tracing in a distributed and asynchronous context, where you cannot rely on looking at individual logs over time in a single thread, is to assign context to logging/tracing so that we can trace, @@ -39,11 +43,11 @@ and tracing a transaction across the gateway (`authority_aggregator`) as well as 2022-03-05T01:35:03.395917Z DEBUG test_move_call_args_linter_command:process_cert{tx_digest=t#7e5f08ab09ec80e3372c101c5858c96965a25326c21af27ab7774d1f7bd40848}: sui_core::authority: Finished execution of transaction with status Success { gas_used: 7 } gas_used=7 ``` -## Key-Value Pairs Schema +## Key-value pairs schema ### Span names -Spans capture not a single event but an entire block of time, so start, end, duration, etc. can be captured +Spans capture not a single event but an entire block of time; so start, end, duration, etc. can be captured and analyzed for tracing, performance analysis, etc. | Name | Place | Meaning | @@ -57,10 +61,10 @@ and analyzed for tracing, performance analysis, etc. | db_update_state | Authority | Update the database with certificate, effects after transaction Move execution | | | | | -### Tags - Keys +### Tags - keys -The idea is that every event and span would get tagged with key/value pairs. Events/logs that log within any context or nested contexts would also inherit the context-level tags. -These tags represent "fields" that can be analyzed and filtered by. For example, one could filter out broadcasts and see the errors for all instances where the bad stake exceeded a certain amount, but not enough for an error. +The idea is that every event and span would get tagged with key-value pairs. Events that log within any context or nested contexts would also inherit the context-level tags. +These tags represent *fields* that can be analyzed and filtered by. For example, one could filter out broadcasts and see the errors for all instances where the bad stake exceeded a certain amount, but not enough for an error. TODO: see if keys need to be scoped by contexts @@ -80,16 +84,16 @@ TODO: see if keys need to be scoped by contexts | gas_used | Authority | Amount of gas used by the transaction | | | | | -## Logging Levels +## Logging levels -This is always tricky, to balance the right amount of verbosity especially by default, but keeping in mind this is a high performance system. +This is always tricky, to balance the right amount of verbosity especially by default -- while keeping in mind this is a high performance system. | Level | Type of Messages | | ----- | ---------------------------------------------------------------------------------------------------------- | | Error | Process-level faults (not transaction-level errors, there could be a ton of those) | | Warn | Unusual or byzantine activity | -| Info | High level aggregate stats. Major events related to data sync, epoch changes. | -| Debug | High level tracing for individual transactions. Eg Gateway/client side -> authority -> Move execution etc. | +| Info | High level aggregate stats, major events related to data sync, epoch changes. | +| Debug | High level tracing for individual transactions, eg Gateway/client side -> authority -> Move execution etc. | | Trace | Extremely detailed tracing for individual transactions | | | | @@ -99,7 +103,7 @@ The `RUST_LOG` environment variable can be used to set both the overall logging individual components, and even filtering down to specific spans or tags within spans are possible too. For more details, please see the [EnvFilter](https://docs.rs/tracing-subscriber/latest/tracing_subscriber/filter/struct.EnvFilter.html) docs. -## Tracing and Span Output +## Tracing and span output By default, only output from logging events (eg `info!`, `debug!`, etc) are output, and not span information. However, detailed span start and end logs can be generated by defining the `SUI_JSON_SPAN_LOGS` environment variable. Note that this causes all output to be in JSON format, which is not as human-readable, so it is not enabled by default. @@ -117,13 +121,15 @@ Also notice `elapsed_milliseconds` which logs the duration of each span. {"v":0,"name":"sui","msg":"[PROCESS_CERT - END]","level":20,"hostname":"Evan-MLbook.lan","pid":51425,"time":"2022-03-08T22:48:11.248688Z","target":"sui_core::authority_server","line":67,"file":"sui_core/src/authority_server.rs","tx_digest":"t#d1385064287c2ad67e4019dd118d487a39ca91a40e0fd8e678dbc32e112a1493","elapsed_milliseconds":2} ``` -## Observability of MOVE programs, Events, etc. +## Observability of MOVE programs, events, etc. + +TODO. -TODO +We need to provide means to authenticate the events etc (by linking them to signed effects). ## Metrics -## Ways to View Logs, Traces, Metrics +## Ways to view logs, traces, metrics TODO: add architecture diagram @@ -133,8 +139,8 @@ The idea here is to enable the viewing of logging, traces, and metrics with indu - OpenTelemetry output could allow the use of Yaeger and similar tracing backends -## Live Async Inspection / Tokio Console +## Live async inspection / Tokio Console ## (Re)Configuration -## Viewing Logs, Traces, Metrics \ No newline at end of file +## Viewing logs, traces, metrics \ No newline at end of file diff --git a/sui/src/sui.rs b/sui/src/sui.rs index 859527a4e42a8..5f948fe3de0f2 100644 --- a/sui/src/sui.rs +++ b/sui/src/sui.rs @@ -31,6 +31,7 @@ struct SuiOpt { #[tokio::main] async fn main() -> Result<(), anyhow::Error> { + // See [[dev-docs/observability.md]] for more information on span logging. if std::env::var("SUI_JSON_SPAN_LOGS").is_ok() { // Code to add logging/tracing config from environment, including RUST_LOG // See https://www.lpalmieri.com/posts/2020-09-27-zero-to-production-4-are-we-observable-yet/#5-7-tracing-bunyan-formatter From f01d23cce91bb41339dfe96bae611288470e9e61 Mon Sep 17 00:00:00 2001 From: Evan Chan Date: Wed, 9 Mar 2022 14:06:32 -0800 Subject: [PATCH 09/10] Change authors in build files --- sui/Cargo.toml | 2 +- sui_core/Cargo.toml | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/sui/Cargo.toml b/sui/Cargo.toml index 0deaf68efd9b7..1cc275fe99b12 100644 --- a/sui/Cargo.toml +++ b/sui/Cargo.toml @@ -1,7 +1,7 @@ [package] name = "sui" version = "0.1.0" -authors = ["Libra "] +authors = ["Mysten Labs "] license = "Apache-2.0" publish = false edition = "2021" diff --git a/sui_core/Cargo.toml b/sui_core/Cargo.toml index 12fcb2fd920ac..fd917b7b4eebd 100644 --- a/sui_core/Cargo.toml +++ b/sui_core/Cargo.toml @@ -1,7 +1,7 @@ [package] name = "sui_core" version = "0.1.0" -authors = ["Libra "] +authors = ["Mysten Labs "] license = "Apache-2.0" publish = false edition = "2021" From 6784274a87ba9041931c176de130210174da4e33 Mon Sep 17 00:00:00 2001 From: Evan Chan Date: Wed, 9 Mar 2022 15:05:48 -0800 Subject: [PATCH 10/10] Add to observability doc --- {dev-docs => doc/developer}/observability.md | 70 +++++++++++++------- 1 file changed, 47 insertions(+), 23 deletions(-) rename {dev-docs => doc/developer}/observability.md (85%) diff --git a/dev-docs/observability.md b/doc/developer/observability.md similarity index 85% rename from dev-docs/observability.md rename to doc/developer/observability.md index 22f9a8cc787eb..7e520c15fbc08 100644 --- a/dev-docs/observability.md +++ b/doc/developer/observability.md @@ -103,10 +103,50 @@ The `RUST_LOG` environment variable can be used to set both the overall logging individual components, and even filtering down to specific spans or tags within spans are possible too. For more details, please see the [EnvFilter](https://docs.rs/tracing-subscriber/latest/tracing_subscriber/filter/struct.EnvFilter.html) docs. -## Tracing and span output +## (Re)Configuration + +## Observability of MOVE programs, events, etc. + +TODO. + +We need to provide means to authenticate the events etc (by linking them to signed effects). + +## Viewing logs, traces, metrics + +The tracing architecture is based on the idea of [subscribers](https://github.com/tokio-rs/tracing#project-layout) which +can be plugged into the tracing library to process and forward output to different sinks for viewing. Multiple +subscribers can be active at the same time. + +```mermaid +graph TB; +Authority1 --> S1(open-telemetry) +Authority1 --> S2(stdout logging) +Authority1 --> S3(bunyan-formatter) +S3 --> Vector +Vector --> ElasticSearch +S1 --> Jaeger +Gateway --> SG1(open-telemetry) +Gateway --> SG3(bunyan-formatter) +SG1 --> Jaeger +SG3 --> Vector2 +Vector2 --> ElasticSearch +``` + +In the graph above, there are multiple subscribers, JSON logs can be for example fed via a local sidecar log forwarder such as +[Vector](https://vector.dev), and then onwards to destinations such as ElasticSearch. + +The use of a log and metrics aggregator such as Vector allows for easy reconfiguration without interrupting the authority server, +as well as offloading observability traffic. + +### Stdout (default) + +By default, logs (but not spans) are formatted for human readability and output to stdout, with key-value tags at the end of every line. +`RUST_LOG` can be configured for custom logging output, including filtering - see the logging levels section above. -By default, only output from logging events (eg `info!`, `debug!`, etc) are output, and not span information. -However, detailed span start and end logs can be generated by defining the `SUI_JSON_SPAN_LOGS` environment variable. Note that this causes all output to be in JSON format, which is not as human-readable, so it is not enabled by default. +### Tracing and span output + +Detailed span start and end logs can be generated by defining the `SUI_JSON_SPAN_LOGS` environment variable. Note that this causes all output to be in JSON format, which is not as human-readable, so it is not enabled by default. +This output can easily be fed to backends such as ElasticSearch for indexing, alerts, aggregation, and analysis. The example output below shows certificate processing in the authority with span logging. Note the START and END annotations, and notice how DB_UPDATE_STATE which is nested is embedded within PROCESS_CERT. @@ -121,26 +161,10 @@ Also notice `elapsed_milliseconds` which logs the duration of each span. {"v":0,"name":"sui","msg":"[PROCESS_CERT - END]","level":20,"hostname":"Evan-MLbook.lan","pid":51425,"time":"2022-03-08T22:48:11.248688Z","target":"sui_core::authority_server","line":67,"file":"sui_core/src/authority_server.rs","tx_digest":"t#d1385064287c2ad67e4019dd118d487a39ca91a40e0fd8e678dbc32e112a1493","elapsed_milliseconds":2} ``` -## Observability of MOVE programs, events, etc. - -TODO. - -We need to provide means to authenticate the events etc (by linking them to signed effects). - -## Metrics - -## Ways to view logs, traces, metrics - -TODO: add architecture diagram +### Live async inspection / Tokio Console -The idea here is to enable the viewing of logging, traces, and metrics with industry-standard open source tools. Thoughts: -- JSON logs from `SUI_JSON_SPAN_LOGS` could be fed directly to ElasticSearch or similar backends for easy indexing of fields and aggregation -- Numbers from the output could be fed into Prometheus -- OpenTelemetry output could allow the use of Yaeger and similar tracing backends +[Tokio-console](https://github.com/tokio-rs/console) is an awesome CLI tool designed to analyze and help debug Rust apps using Tokio, in real time! It relies on a special subscriber. +TODO: Add instructions for setting env var and building Sui specially to enable Tokio-console support. -## Live async inspection / Tokio Console - -## (Re)Configuration - -## Viewing logs, traces, metrics \ No newline at end of file +NOTE: Adding Tokio-console support may significantly slow down Sui authorities/gateways.