optimise packaging and calling of benchmark #140

Open
davidschlangen opened this issue Dec 11, 2024 · 5 comments

@davidschlangen (Contributor)

To enable inclusion of clembench in standard benchmark runs made by others (e.g., in Playpen evaluations, but also for people training a new model who want to evaluate it), it would be good if our packaging could be further improved:

  • Make it very clear what needs to be downloaded to run the benchmark (e.g., provide a script that does this?)
  • Make it very clear how a new model can be integrated.
  • Have a single point of entry (a script that runs the whole benchmark, in such a way that the result is comparable to other runs of the same pipeline) and exit (e.g., let the script return a single score, in addition to the detailed score breakdown in a file; see the sketch at the end of this comment).

We're not far away from this, but my understanding is that running the benchmark at the moment still requires a bit of babysitting.
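
To make the entry/exit point concrete, the desired shape could look roughly like the sketch below. Everything project-specific in it (the CLI invocation, the results file layout, the "clemscore" key) is an assumption for illustration, not the current clemgame interface:

```python
# Sketch of a single entry/exit point; the CLI arguments, results layout
# and "clemscore" key are assumptions, not the current clemgame interface.
import json
import subprocess
from pathlib import Path

def run_clembench(model: str) -> float:
    # One command runs the whole benchmark (assumed CLI shape).
    subprocess.run(["python", "cli.py", "run", "all", "-m", model], check=True)
    # The detailed breakdown stays on disk; callers get one aggregate score.
    scores = json.loads(Path("results/scores.json").read_text())  # assumed file
    return scores["clemscore"]  # assumed key
```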

@davidschlangen (Contributor, Author)

Re: installing and running the benchmark, a possible workflow (sketched as a shell session after the list):

  • clone the clembench-games repository
  • install the requirements, which include...
    • ... the "clemgame" package, which is installed via pip. Among other things, this installs cli.py so that it is executable
  • set up your own key.json inside the cloned repository
  • set up your own model_registry_custom.json, also inside the cloned repository
  • run python cli.py run all (or something to that effect)
  • find the results in results/
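
As a shell session, that could look like this (the repository URL, file names and CLI arguments are assumptions mirroring the list above, not a tested recipe):

```bash
# Illustrative only: URL, file names and arguments mirror the list above.
git clone https://github.com/clp-research/clembench-games.git
cd clembench-games
pip install -r requirements.txt       # pulls in the clemgame package, among others
$EDITOR key.json                      # add your API keys
$EDITOR model_registry_custom.json    # register your own model, if needed
python cli.py run all                 # run the full benchmark
ls results/                           # per-game results and scores
```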

(Maybe it would be easier to have cli.py in the games repository, so that it can be called from there. It would then need to be distributed there rather than in the package, which kind of makes sense.)

@davidschlangen (Contributor, Author)

(copying over from discussion on mattermost, so that it is properly archived here)

We could kind of swap things around: We have a canonical repository for the benchmark, which holds:

  • a) all the games that are officially in the benchmark,
  • b) a scripts folder that contains ba) cli.py and bb) the subfolders evaluation and other_scripts,
  • c) the appropriate game_registry.json file.

This is what people clone if they want to run the core benchmark. The requirements in that repo contain everything that the individual games need, plus of course clemgame. (Same for the multimodal benchmark: a canonical repo for the games, plus all the additional material described above.)

If students now want to create their own game, they create their own repo and model it on those repos; that is, they create their own games folder, but also copy over the scripts folder, and create whatever entries in their own game_registry.json they need. (If they want to run canonical games for comparison, they need to clone the clembench repo and either run those games from within that repo, or add entries to their registry for the games they want to be accessible, with a path pointing to that repo; see the registry sketch below.)
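
Such a registry might then look roughly like this; the field names are illustrative guesses, not necessarily the actual game_registry.json format:

```json
[
  {
    "game_name": "my_new_game",
    "game_path": "games/my_new_game"
  },
  {
    "game_name": "taboo",
    "game_path": "/path/to/cloned/clembench/games/taboo"
  }
]
```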

(I don't know what would be the best way to add backends... Either clone the clemgame repo and work there (and install via pip install -e), or maybe that could even work with the standard clemgame package reading in a local model registry that points to a local backend? A sketch of that second option follows.)
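
Technically, the second option seems feasible: Python can import a module from an arbitrary file path, so a registry entry could point at a backend file living in the user's repo, outside the directory where the built-in backends live. A minimal sketch, assuming hypothetical registry fields backend_path and backend_class (not the actual clemgame registry format):

```python
# Sketch: load a backend class from an arbitrary file path, i.e. from outside
# the directory where the built-in backends live. The registry fields
# ("backend_path", "backend_class") are hypothetical.
import importlib.util
import json
from pathlib import Path

def load_local_backend(registry_file: str):
    entry = json.loads(Path(registry_file).read_text())
    spec = importlib.util.spec_from_file_location("local_backend", entry["backend_path"])
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, entry["backend_class"])()
```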

@davidschlangen (Contributor, Author)

Does anyone have a comment on the last part (would it be possible to define a backend outside of the directory where the other backends live)? @phisad , @Gnurro , @AnneBeyer ?

@davidschlangen (Contributor, Author)

(Actually, it looks like it wouldn't be a big problem to also install scripts (which cli.py could be turned into) with pip, e.g. via a console-script entry point as sketched below. So these could still be part of the framework and wouldn't need to be copied around.)
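
With setuptools, that would be a console_scripts entry point; the mechanism is standard, but the module path clemgame.cli:main and the command name clem are assumptions here:

```python
# setup.py (excerpt) -- the entry_points mechanism is standard setuptools;
# the module path "clemgame.cli:main" and the command name "clem" are assumed.
from setuptools import setup, find_packages

setup(
    name="clemgame",
    packages=find_packages(),
    entry_points={
        "console_scripts": [
            "clem = clemgame.cli:main",  # exposes cli.py's main() as `clem`
        ],
    },
)
```

After pip install, something like `clem run all` would then presumably work from any directory, without copying cli.py around.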

@phisad self-assigned this Jan 15, 2025

@phisad (Collaborator) commented Jan 15, 2025

I assigned this to myself, so I will think about it when I have finished writing (hopefully in Feb.).
