Monitor tool: query CIDs in BitSwap localhost #8082

Closed · 8 commits (never merged)

Conversation


@schomatis commented Apr 20, 2021

Meta

Patch to track in the BitSwap engine a failure to respond to a requested CID, see #7972 (comment).

This is a draft PR with no actual intention of ever merging. It is only submitted in this format to streamline the review process. As such, attention should not be on idiomatic Go code or alignment with go-ipfs internals, but only on what can be added to a production GW node to monitor and log activity related to this issue without disrupting normal operation.

Most of the actual logic is in ipfs/go-bitswap#475.

We hook a task on the node constructor that, every X seconds, requests a list of CIDs from our own BitSwap engine and checks that they are all accounted for within those same X seconds.

TODO:

  • Tune the interval X.
  • Tune the number of CIDs we request.
  • Confirm the log level we can use without causing any disruption. Do we rotate the logs?
  • Make all of these configuration options adjustable on the fly, without restarting the GW node.
  • For a second iteration: see if we want to replicate this test from an external process outside localhost (to better exercise the transport layer, e.g. QUIC) and refactor accordingly.
  • Pin the go-bitswap dependency replace to a commit (commenting out the branch replace).

PR description

This PR tests the BitSwap (BS) response mechanism with the objective of checking for inconsistencies such as #7972 (comment). It modifies a v0.8.0 node to start a goroutine during node startup that constantly requests a list of CIDs from its own BS component. (Analogous to a localhost connection where we bypass part of the network stack, here we connect directly to the node's own Host peer.)

This injected goroutine runs a BS test utility derived from the vole tool to request CIDs and verify that the BS response accounts for all of them (for now we don't care whether we actually have the CIDs, only that BS responds to all requests; see the cited issue for more background). Most of the BS logic, including the vole CID-request tool, lives in a separate PR in the BS repo: ipfs/go-bitswap#475.

With this architecture we no longer need two separate processes to test BS (as with the original vole), one running a node and the other sending requests to it, but only one: the node itself, functioning as usual while also running this constant background check, requesting a list of artificial CIDs from itself (again, we don't focus on actual meaningful content right now). The rationale for this design is to be able to deploy this modified version in place of a production GW node (where the original issue was first spotted) that handles a meaningful load, making the test run in a more realistic scenario with a better chance of finding anything useful.

While this was being developed, the original issue was likely fixed by ipfs/go-bitswap#477 (cherry-picked here), but we will still finish and deploy this tool in case it surfaces any other related problems.

Along with the requests, this patch adds some extra logging (LogTargetPeer) in the BS code base that specifically targets the "fake" peer (the goroutine inside the node itself) from which the requests originate, to spot whether the pipeline of task creation and dispatching ("pushing" and "popping" in BS terminology) that leads to reporting back a requested CID ever breaks. We can easily extend this log pattern to go deeper into the code flow of interest if needed. For now we err on the side of less verbosity to avoid flooding the log in this first test; the modified node should be monitored to make sure the log is not flooded and is rotated properly.

Configuration and use

There are three configuration variables added in this patch (as environment variables, for simplicity) that need to be set before running it and can also be modified on the fly without restarting the node (to minimize operational disruption in production).

  • BS_CFG_TIMEOUT: how long we wait for the response to the requested CIDs. This is probably the most sensitive setting of all, able to make the test meaningful or meaningless, and will likely need experimentation before arriving at an acceptable value. Keep in mind that the connection is in "localhost" mode with virtually zero latency, so the timeout will mainly be a function of the node's average load. (After the timeout elapses, whether all responses arrived or not, we rerun the test in an infinite loop in the goroutine until the node stops.)

  • BS_CFG_CID_REQ_NUM: the number of CIDs requested in each test. Not sure how critical this setting is; ideally we would mimic a typical client request.

  • BS_CFG_LOG: log level for the BS and related components (everything else is set to WARN). During tests it is recommended to set this to INFO, the level at which the extra logging was added. In any case, log levels can be adjusted on the fly through the standard ipfs log CLI.

(See more details in the documentation of the code itself.)

A typical configuration run during local (non-production) testing of this patch can be seen in run-bs-monitor.bash.

Apart from these extra settings, the node should be run as it is normally configured in production; the additions shouldn't interfere with the rest of the node's operation.

Log output

A typical log output of a test would be:

INFO	bitswap	connecting through TCP
INFO	engine	[TRACKING-BS-PEER] pushing task for CID [bafkqaaju]
INFO	engine	[TRACKING-BS-PEER] pushing task for CID [bafkrefclej3xpvg5d7dby34ij5egihicwtisduy]
INFO	engine	[TRACKING-BS-PEER] pushing task for CID [bafkqaajq]
[...]
INFO	engine	[TRACKING-BS-PEER] popped [10] tasks from queue
INFO	engine	[TRACKING-BS-PEER] popped task CID: [bafkrefdlq2zhh7zu7tqz224aj37vup2xi6w2j2q]
INFO	engine	[TRACKING-BS-PEER] popped task CID: [bafkrefclej3xpvg5d7dby34ij5egihicwtisduy]
[...]
INFO	engine	[TRACKING-BS-PEER] popped task CID: [bafkrefcoa5aikyv63ofwbtqfyhpm7y5nc23sema]
INFO	engine	[TRACKING-BS-PEER] popped task CID: [bafkqaajq]
INFO	engine	[TRACKING-BS-PEER] popped task CID: [bafkqaajt]
INFO	engine	[TRACKING-BS-PEER] message sent [map[blocks:[bafkqaajr bafkqaajq bafkqaajt bafkqaaju bafkqaajs] presences:[{bafkrefguonpdujs6c3xoap2zogfzwxidagoapwa DontHave} {bafkrefcoa5aikyv63ofwbtqfyhpm7y5nc23sema DontHave} {bafkrefdlq2zhh7zu7tqz224aj37vup2xi6w2j2q DontHave} {bafkrefclej3xpvg5d7dby34ij5egihicwtisduy DontHave} {bafkrefc75tvwn76in44nsutynrwws3dzyln4eoi DontHave}] wants:[]]]

What we need to look out for is an ERROR in the BS log like:

ERROR	bitswap	CheckBitswapCID: did not get HAVE/DONT-HAVE response on CIDs:

@schomatis self-assigned this Apr 20, 2021
if level := os.Getenv("BS_LOG"); level != "" {
	logging.SetAllLoggers(logging.LevelWarn)
	logging.SetLogLevel("bitswap", level)
	logging.SetLogLevel("engine", level)
}

The engine is too verbose so we might want to skip this and only track the other two.


Most of the added logs, though, are in that component at the INFO level, so we might want to move them to the more general BS logger or create a new log subsystem.

@schomatis force-pushed the monitor/bitswap-data-stall branch from df14b83 to cc6f9de on April 26, 2021 at 14:51
@schomatis mentioned this pull request May 12, 2021
@schomatis

We need to make these two configuration values part of the ipfs config CLI interface so we can change them on the fly without restarting the node. Currently they are set as environment variables, and once the node starts there is no easy way to modify them (so even though the routine here constantly checks for updates to them, the values never change).

https://github.com/ipfs/go-ipfs/pull/8082/files#diff-4140a14426af9479ef5f4cbe610e3251f8a2425b1503811955c7ea81adc049cbR128-R129

@@ -32,7 +32,7 @@ require (
 	github.com/ipfs/go-ipfs-blockstore v0.1.4
 	github.com/ipfs/go-ipfs-chunker v0.0.5
 	github.com/ipfs/go-ipfs-cmds v0.6.0
-	github.com/ipfs/go-ipfs-config v0.12.0
+	github.com/ipfs/go-ipfs-config v0.16.1-0.20210830194616-e846165ea1ca


This will need to be undone if ipfs/go-ipfs-config#144 is merged.


Will likely never merge this (at least not as an embedded process) so neither that PR.

OnStart: func(ctx context.Context) error {
	go func() {
		for cbs != nil {
			// ...

@schomatis I changed this to use the config; this works as long as the tester has Internal.Bitswap set up in the config, and does nothing otherwise.


@coryschwartz Great, thanks! Just to confirm I understand this correctly: a change to these settings through the ipfs config CLI will be reflected on the fly in the *config.InternalBitswap pointer here, without needing to restart the daemon, right?

@schomatis

We haven't used this tool in a while (nor needed it), and since it was never meant to be merged I'll close it now.

@schomatis closed this Mar 31, 2022
@schomatis deleted the monitor/bitswap-data-stall branch March 31, 2022 20:24