feat: add influxdb trace collection #970
Conversation
config/config.go (Outdated)
Storage         *StorageConfig               `mapstructure:"storage"`
TxIndex         *TxIndexConfig               `mapstructure:"tx_index"`
Instrumentation *InstrumentationConfig       `mapstructure:"instrumentation"`
EventCollector  *remote.EventCollectorConfig `mapstructure:"event_collector"`
kinda just threw these names together, so I'm definitely open to better names. I'm leaning towards just calling it an influxdb client, because that's basically what it is
defer to you b/c EventCollector or InfluxDBClient both SGTM.
The naming looks good to me, but it may be beneficial to include a brief comment about the service it provides.
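For illustration, a minimal sketch of what such a doc comment and config shape might look like; the field names and layout here are assumptions, not the PR's final API:

```go
// EventCollectorConfig configures an optional client that pushes
// arbitrary trace-level events (e.g. consensus state transitions)
// to a remote InfluxDB instance for later analysis.
type EventCollectorConfig struct {
	// URL of the InfluxDB instance to write to. Empty disables collection.
	URL string `mapstructure:"url"`
	// Token used to authenticate against the InfluxDB API.
	Token string `mapstructure:"token"`
	// Org and Bucket identify where points are written.
	Org    string `mapstructure:"org"`
	Bucket string `mapstructure:"bucket"`
}
```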
You don't think this should fall under Instrumentation?
It could fall under instrumentation, but I'm viewing trace-level data collection as much more arbitrary and specific atm. It's not really something we use to get a glance at what is happening in the network; it's something that we use in a more one-off way to debug really specific issues or measure very specific things. Perhaps in the future we have traces that we look at routinely, but if that's the case, then perhaps we should make that thing a metric instead. We can include it in that config though.
It's not really something we use to get a glance at what is happening in the network
Why not? :)
Aren't you using it now to work out how often we have multi-round heights in our testnets?
Anyway, this strikes me as Instrumentation
Anyway, this strikes me as Instrumentation
okay cool, will add it to the instrumentation config instead 👍
Why not? :)
If I'm interpreting the question correctly, I think this comes down to how much detail is being extracted or summarized. For example, if we simply wanted to view how many blocks took multiple rounds to come to consensus, then we could do that in a simple, lightweight way using the existing metrics to store a summary of the data. That summary is useful to "glance" at. However, if we wanted to examine why those heights are taking multiple rounds, then we would need something significantly less summarized that inherently takes more time to analyze. I feel like I'm not answering the question though, so pls ask slightly differently if that's the case 🙂
also, when we store the data in a detailed (as opposed to summarized) way, in order to glance at it we have to write significantly more complex queries, which we don't currently have and which can be time-consuming to write
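To illustrate the point about query complexity, here is a hedged example of the kind of one-off query this implies, written with the InfluxDB Go client; the bucket, measurement, and field names are made up for illustration and are not the PR's schema:

```go
package main

import (
	"context"
	"fmt"

	influxdb2 "github.com/influxdata/influxdb-client-go/v2"
)

func main() {
	// Hypothetical endpoint and token.
	client := influxdb2.NewClient("http://localhost:8086", "my-token")
	defer client.Close()

	// Flux query over raw per-step events: find heights whose round exceeded 0.
	query := `
from(bucket: "e2e")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "consensus" and r._field == "round")
  |> filter(fn: (r) => r._value > 0)`

	result, err := client.QueryAPI("my-org").Query(context.Background(), query)
	if err != nil {
		panic(err)
	}
	for result.Next() {
		fmt.Println(result.Record().Values())
	}
}
```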
Exceptionally clear PR! Great work 👏
Nice! Looks good to me, just added some non-blocking questions and suggestions.
Awesome work! Have written down a few comments.
consensus/state.go (Outdated)
cs.eventCollector.WritePoint(consensusTable, map[string]interface{}{
	"height": rs.Height,
	"round":  rs.Round,
	"step":   rs.Step,
})
Instead of duplicating ourselves, can we implement the metrics.Gauge and metrics.Counter interfaces using this eventCollector and then, in the metrics.go file of each reactor, include an InfluxDBMetrics function alongside the PrometheusMetrics function? That way an operator can choose to trace metrics using either Prometheus or InfluxDB (perhaps in the future we can add support for both).
would we be able to collect arbitrary trace data that way? Like, if I wanted a summary of each message sent and received over the wire, would we be able to put that in a gauge or counter?
I'm not exactly sure how we will use this yet. Initially, I assumed that we would create proprietary one-off branches that push very specific trace data to the db for us to examine some scenario or component. However, I could see having some default set of data collection that could be turned on simply by configuring a url and token also being useful.
I've since removed the above code, so by default we're not actually sending any data to the db even if we have it configured. However, we can add things or refactor in the future!
So from where I stand, the utility in this is for analysing feature developments and potentially for bug detection and introspection. I currently have use cases for the former but I can also imagine the latter being helpful on occasions.
I think you're right that for feature developments this would occur on some branch, but with any modifications it's essential that we compare it with the baseline (i.e. the current network version). Thus main should also have the same data tracked.
Secondly, we're probably not going to be looking at just one or two data points but looking out for changes across several dimensions (which metrics should be tracking), such as the number of rounds, the block time, bandwidth for certain messages, etc. We need to ensure not only that the hypothesis is correct, but also that there aren't any regressions in a different part of the system. I think that often these will be the same data points, thus by just capturing all metrics by default, we save having to rewrite this every time we want to analyse something.
As well as feature development, we can and probably should be using this tool to analyse usage patterns. If done right, it may even help us understand what areas to be focusing on/prioritising (i.e. if bandwidth is still a bottleneck, then do we need to start doing some research into a Narwhal-esque implementation?).
Lastly, I think if we come across some major problems, tracking all metrics could be helpful in any retrospective work.
So my conclusion would be to have the simple option of having all metrics persisted as data points straight out-of-the-box while allowing for developers to inject any custom tracing where they see fit.
Thus main should also have the same data tracked.
I think that often these will be the same data points, thus by just capturing all metrics by default, we save having to rewrite this every time we want to analyse something.
I agree, but wasn't sure how much data we want to collect atm and felt like that could be pushed to a different PR.
However we can add things or refactor in the future!
added #978! 🙂
For example, I think collecting the consensus info makes sense here, but what about a trace of all of the messages sent over the wire? That seems both very useful in some cases, but also very heavy, so idk.
Instead of duplicating ourselves, can we implement the metrics.Gauge and metrics.Counter interfaces using this eventCollector and then in the metrics.go file of each reactor, include a InfluxDBMetrics function alongside the PrometheusMetrics function.
also, sorry, I think I misunderstood this at first to say that we should refactor the eventCollector to use gauges and counters for collection. Being able to keep everything in a single db is a good idea 😅 #979
That seems both very useful in some cases, but also very heavy, so idk.
We could look at filtering metrics by reactor (since they're already divided that way), similar to the filtered logger Tendermint had in the past.
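As a sketch of what that filtering could look like, loosely modeled on the old filtered logger; all names here are hypothetical and build on the earlier EventCollector sketch:

```go
package remote

// FilteredCollector drops points for reactors the operator has not
// opted into, mirroring how the filtered logger allowed per-module levels.
type FilteredCollector struct {
	inner   *EventCollector
	allowed map[string]bool // e.g. {"consensus": true, "p2p": false}
}

// WritePoint forwards the point only when its reactor is enabled.
func (f *FilteredCollector) WritePoint(reactor, table string, fields map[string]interface{}) {
	if !f.allowed[reactor] {
		return // reactor not enabled; skip the write entirely
	}
	f.inner.WritePoint(table, fields)
}
```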
I've removed all default instances of data collection, since I'm not actually sure how we want to use this yet. This way, I think we can merge as is and adjust later.

@cmwaters raises a good point that we can likely do a better job of combining this with the existing metrics/instrumentation. Originally, we went with this approach because we needed a push-based approach to get past testground's limitations, but since then the devops team has implemented a telegraf scraper that can get around this problem, so perhaps we can just use that influx instance for everything. This is even more relevant if we ever get to the point where we replace our e2e test with testground. However, this approach does exactly what we need for the time being, so we might want to merge as is to better debug things like #965 (what I'm currently doing on the …)
IMO, this should be one of our milestones: converge on testground as the framework for covering all "e2e" styled tests
Looks good. I just have a couple more minor comments
🚀
Description
This PR adds an InfluxDB client to the Node struct that we can use to push arbitrary trace data to a db. It also adds this functionality to the e2e test. See README.md.
Closes: #968
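For a rough picture of the wiring the description mentions, here is a hedged sketch; the field name, constructor, and config fields mirror the hypothetical sketches above rather than the PR's exact code:

```go
// Hypothetical sketch of how the collector could hang off the Node.
type Node struct {
	// ... existing Node fields elided ...
	eventCollector *remote.EventCollector // nil when tracing is not configured
}

// wireEventCollector is illustrative only: enable tracing when a URL is set.
func wireEventCollector(n *Node, cfg *remote.EventCollectorConfig, chainID string) {
	if cfg == nil || cfg.URL == "" {
		return // collection disabled; callers must treat eventCollector as optional
	}
	n.eventCollector = remote.NewEventCollector(cfg.URL, cfg.Token, cfg.Org, cfg.Bucket, chainID)
}
```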