-
-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Monitor tool: query CIDs in BitSwap localhost #8082
Conversation
cmd/ipfs/daemon.go
Outdated
if level := os.Getenv("BS_LOG"); level != "" { | ||
logging.SetAllLoggers(logging.LevelWarn) | ||
logging.SetLogLevel("bitswap", level) | ||
logging.SetLogLevel("engine", level) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The engine is too verbose so we might want to skip this and only track the other two.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the added logs though are in that component for the INFO
level so we might want to move them to the more general BS or create a new log subsystem.
df14b83
to
cc6f9de
Compare
We need to make these two configuration values part of the |
@@ -32,7 +32,7 @@ require ( | |||
github.com/ipfs/go-ipfs-blockstore v0.1.4 | |||
github.com/ipfs/go-ipfs-chunker v0.0.5 | |||
github.com/ipfs/go-ipfs-cmds v0.6.0 | |||
github.com/ipfs/go-ipfs-config v0.12.0 | |||
github.com/ipfs/go-ipfs-config v0.16.1-0.20210830194616-e846165ea1ca |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will need to be undone if ipfs/go-ipfs-config#144 is merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will likely never merge this (at least not as an embedded process) so neither that PR.
OnStart: func(ctx context.Context) error { | ||
|
||
go func() { | ||
for cbs != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@schomatis I changed this to use the config, this works so long as the tester has Internal.Bitswap setup in the config, and does nothing otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@coryschwartz Great, thanks! Just to confirm if I understand this correctly, a change through the ipfs config
CLI to these configurations will be reflected on the fly in the *config.InternalBitswap
pointer here without needing to restart the daemon, right?
We haven't used this tool in a while (nor needed it) and since it wasn't meant to be merged I'll be closing it now. |
Meta
Patch to track in the BitSwap engine a failure to respond to a requested CID, see #7972 (comment).
This is a draft PR without actual intention of merging, ever. It is only submitted in this format to streamline the review process. As such attention should not be on proficient Go code or alignment with
go-ipfs
intrinsics but only on what can be added to a production GW node to monitor and log activity related to this issue without disrupting normal operation.Most of the actual logic is in ipfs/go-bitswap#475.
We hook a task on the node constructor to make a request every X seconds of a list of CIDs to our own BitSwap engine and see if they are all accounted for within those same X seconds.
TODO:
go-bitswap
dependency replace to a commit (commenting out the branch replace).PR description
This PR tests the BitSwap (BS) response mechanism with the objective of checking for inconsistencies such as #7972 (comment). It modifies a
v0.8.0
node to start a goroutine along the node startup process that will constantly request a list of CIDs to its own BS component. (Analogous to a localhost connection where we bypass part of the network stack, we connect here directly to the node's ownHost
peer.)This injected goroutine runs a BS test utility derived from the
vole
tool to request CIDs and monitor that the BS response accounts for all of them (we don't focus at the moment on whether we have the CIDs or not but only on BS responding to all requests, see cited issue for more background). Most of the BS logic, including thevole
request CID tool, lives in a separate PR in the BS repo: ipfs/go-bitswap#475.The previous architecture translates into the fact that we no longer have to run two separate processes to test BS (as in the original
vole
), one running a node and the other sending requests to it, but only one: the node itself functioning as usual and also running this constant check in the background requesting itself for a list of artificial CIDs (again, we don't focus on actual meaningful content right now). The rationale for this design is to be able to deploy this modified version in place of a production GW node (where the original issue was first spotted) that handles a meaningful load (making this test run in a more realistic scenario and with more chances of finding anything useful).While developing this the original issue was likely fixed by ipfs/go-bitswap#477 (cherry picked here), but we still finish and deploy it in case we can find any other related problems.
Along with the requests in this patch we include some extra logging (
LogTargetPeer
) in the BS code base targeting specifically the "fake" peer (in the goroutine inside the node itself) from where the requests originate to spot if at any point the pipeline of task creation and dispatching ("pushing" and "popping" in BS terminology) that leads to reporting back a requested CID is broken. We can very easily modify this log pattern to go deeper in the code flow of interest if needed. We err for now on the side of less verbosity to not flood the log in this first test. The modified node should be monitored to make sure this is not the case and that the logs are rotated properly.Configuration and use
There are 3 configuration variable that are added to this patch (in the form of environment variables for simplicity) that need to be set before running it and can also be modified on the fly without restarting the node (to minimize operation disruption in production).
BS_CFG_TIMEOUT
: the time we wait for the response of the requested CIDs. This is probably the most sensitive configuration of all that can make the test meaningful or meaningless, and will probably need to be experimented with before arriving at an acceptable value. It is probably important to take into account that the connection is in "localhost" mode with virtually zero latency so the timeout will mainly be derived from the node average load. (After the timeout elapses, either with all responses or not, we rerun the test in an infinite loop in the goroutine until the node stops.)BS_CFG_CID_REQ_NUM
: Number of CIDs requested on each test. Not sure how critical this configuration is. Ideally we would want to mimic a typical client request.BS_CFG_LOG
: Log level for the BS and related components. The rest is set toWARN
. During tests it is recommended to set this toINFO
(the level at which we added the extra logging). In any case the log levels can be adjusted on the fly through the standardipfs log
CLI.(See more details in the documentation of the code itself.)
A typical configuration run during local (non-production) testing of this patch can be seen in
run-bs-monitor.bash
.Apart from these extra configurations the node should be run as it is normally configured in production. The added configurations shouldn't interfere with the rest of the node's operations.
Log output
A typical log output of a test would be:
What we need to look out for is an
ERROR
in the BS log like: