
ci: explicit caching and even smaller runners #2518

Merged: 21 commits into main on Jan 30, 2024

Conversation

tychoish
Contributor

No description provided.

Comment on lines 236 to 240
      name: build cache
      with:
-       path: target/debug/glaredb
-       key: ${{ github.run_id }}
+       path: |
+         target/
+       key: ${{ runner.os }}-cargo--target-${{github.sha}}
Contributor


What is the purpose of this new cache? I have a hard time telling from the changes.

Contributor Author


There's a sort of stacked cache here (looking at the build task is really where this becomes clear):

  • cache the toolchain, keyed off the rust-toolchain file (restores before and saves after the build task). This changes rarely and is shared between branches, so we're not re-downloading big files.
  • cache the user's .cargo directory, keyed off Cargo.lock (restores before and saves after the build task). This should prevent us from re-downloading dependencies.
  • cache the workspace target/, except for the glaredb binary, keyed off Cargo.lock (restores before and saves after the build task only), hopefully with the effect of saving as much incremental build time as possible in a fairly stable way and preventing unnecessary rebuilds.
  • cache the glaredb binary explicitly, keyed off the current run id (saves after the build task, restores before any task that runs SLTs or otherwise needs glaredb, e.g. the pytests).
  • save a second cache of target/, keyed off the current commit (saves after the build task, restores before any task that needs to do its own build).

Then each task restores exactly what it needs before it runs (see the sketch at the end of this comment).

This means we end up caching many more build-specific things, so the cache actually accelerates what each task is doing, and we don't end up thrashing the cache by rebuilding with different flags as much (e.g. the unit tests shouldn't thrash the build cache).

The "workspace cache" is maybe of questionable benefit, but it seems pretty harmless.

@tychoish
Contributor Author

In the penultimate build as of this writing, we saw a very good time: 15m wall clock and 25m of paid compute (this doesn't count the "free runners," which add about 40m per build, but that's fine).

However, one of those wall-clock minutes was spent saving a per-run cache of the target directory, which is probably overkill. My goal was to have a cache per run so that per-run changes didn't blow away the cache between builds if one branch had a lot of churn, but a full minute in the build task was a lot to pay for that. I'm running a second build with a less radical caching strategy to see how it ends up.
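
For reference, the per-run target cache described above amounts to something like this (a sketch; the step name and key layout are assumptions):

    # Save the whole target/ directory once per run; this is the step that
    # cost roughly a minute of wall clock in the build task.
    - name: save per-run target cache
      uses: actions/cache/save@v4
      with:
        path: target/
        key: target-${{ github.run_id }}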

@tychoish
Contributor Author

ok, final report:

  • 20m of large-instance compute time per build (a 65% reduction)
  • 15m of wall-clock time for a CI run (best case, if everything is a cache hit); the major determinant of this is the SQL Server test

caveats:

  • the above time is really a best case, because there aren't a lot of builds going on right now, so the cache hit rate was very good
  • the caching is pretty coarse, and if we have a lot of builds going at once, compiles are going to get slower as caches are overwritten
  • the only things that use large instances now are the compile and (unsurprisingly) the unit tests, which always have to be their own compile

future improvements/observations:

  • SQL Server is slow because of the tests themselves; neither the compile, the fixture, nor the hardware is the bottleneck: it has the same runtime on large hosts as on small hosts. So either there's a problem with the configuration of the fixture that is making it slow (possible), or something in our implementation is terribly slow, in a way that might be observable for any user running a short query, and could also be impacting the server itself (which is a thing we should probably avoid). My suspicion is that connection creation is just slow.
  • I tried creating per-run caches of the target directory (the evolution of that here is caching the binaries specifically), but it added a minute to the build task and didn't save a minute elsewhere. It might be more sustainable to have a per-branch build cache, which means you pay the cost once per branch rather than on every build; the upside is that when lots of builds are running at once, they're less likely to clobber each other's caches (see the sketch after this list).
  • the python bindings seem to compile more than I think they should (I suspect maturin just compiles everything rather than only the dependencies): the SLTs, the proxies, and the RPC service are all compiled, and while it's not huge in terms of time, it seems a bit odd. I avoided letting the bindings update the cache themselves, and contemplated having separate caches for the bindings builds.
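
A per-branch build cache, as floated in the second bullet above, might look roughly like this (github.ref_name and restore-keys are standard Actions features, but this key layout is an assumption, not what the PR implements):

    # Keyed off the branch name, so a cold cache is paid for once per branch;
    # restore-keys falls back to the most recent cache from any branch.
    - name: per-branch target cache
      uses: actions/cache@v4
      with:
        path: target/
        key: ${{ runner.os }}-target-${{ github.ref_name }}
        restore-keys: |
          ${{ runner.os }}-target-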

tychoish merged commit 9c79f4c into main on Jan 30, 2024
21 checks passed
tychoish deleted the tycho/ci-caching-and-runner-size branch on January 30, 2024 at 15:00
tychoish added a commit that referenced this pull request on Feb 1, 2024