optimise packaging and calling of benchmark #140

Open
davidschlangen opened this issue Dec 11, 2024 · 5 comments

@davidschlangen (Contributor)

To enable inclusion of clembench in standard benchmark runs made by others (e.g., in Playpen evaluations, but also for people training a new model who want to evaluate it), it would be good if our packaging could be further improved:

  • Make it very clear what needs to be downloaded to run the benchmark (e.g., provide a script that does this?)
  • Make it very clear how a new model can be integrated.
  • Have a single point of entry (a script that runs the whole benchmark, in such a way that the result is comparable to other runs of the same pipeline) and exit (e.g., let the script return a single score, in addition to the detailed score breakdown in a file; see the sketch at the end of this comment).

We're not far away from this, but my understanding is that running the benchmark at the moment still requires a bit of babysitting.
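
To make the entry/exit point concrete, the desired shape could look roughly like the sketch below. Everything project-specific in it (the CLI invocation, the results file layout, the "clemscore" key) is an assumption for illustration, not the current clemgame interface:

```python
# Sketch of a single entry/exit point; the CLI arguments, results layout
# and "clemscore" key are assumptions, not the current clemgame interface.
import json
import subprocess
from pathlib import Path

def run_clembench(model: str) -> float:
    # One command runs the whole benchmark (assumed CLI shape).
    subprocess.run(["python", "cli.py", "run", "all", "-m", model], check=True)
    # The detailed breakdown stays on disk; callers get one aggregate score.
    scores = json.loads(Path("results/scores.json").read_text())  # assumed file
    return scores["clemscore"]  # assumed key
```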

@davidschlangen (Contributor, Author)

Re: installing and running the benchmark, a possible workflow (sketched as a shell session after the list):

  • clone the clembench-games repository
  • install the requirements, which include...
    • ... the "clemgame" package, which is installed via pip. Among other things, this installs cli.py so that it is executable
  • set up your own key.json inside the cloned repository
  • set up your own model_registry_custom.json, also inside the cloned repository
  • run python cli.py run all (or something to that effect)
  • find the results in results/
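
As a shell session, that could look like this (the repository URL, file names and CLI arguments are assumptions mirroring the list above, not a tested recipe):

```bash
# Illustrative only: URL, file names and arguments mirror the list above.
git clone https://github.com/clp-research/clembench-games.git
cd clembench-games
pip install -r requirements.txt       # pulls in the clemgame package, among others
$EDITOR key.json                      # add your API keys
$EDITOR model_registry_custom.json    # register your own model, if needed
python cli.py run all                 # run the full benchmark
ls results/                           # per-game results and scores
```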

(Maybe it would be easier to have cli.py in the games repository, so that it can be called from there. It would then need to be distributed there rather than in the package, which kind of makes sense.)

@davidschlangen (Contributor, Author)

(copying over from discussion on mattermost, so that it is properly archived here)

We could kind of swap things around: We have a canonical repository for the benchmark, which holds:

  • a) all the games that are officially in the benchmark,
  • b) a scripts folder that contains ba) cli.py and bb) the subfolders evaluation and other_scripts,
  • c) the appropriate game_registry.json file.

This is what people clone if they want to run the core benchmark. The requirements in that repo contain everything that the individual games need, plus of course clemgame. (Same for the multimodal benchmark: a canonical repo for the games, plus all the additional material described above.)

If students now want to create their own game, they create their own repo and model it on those repos; that is, they create their own games folder, but also copy over the scripts folder, and create whatever entries in their own game_registry.json they need. (If they want to run canonical games for comparison, they need to clone the clembench repo and either run those games from within that repo, or add entries to their registry for the games they want to be accessible, with a path pointing to that repo; see the registry sketch below.)
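
Such a registry might then look roughly like this; the field names are illustrative guesses, not necessarily the actual game_registry.json format:

```json
[
  {
    "game_name": "my_new_game",
    "game_path": "games/my_new_game"
  },
  {
    "game_name": "taboo",
    "game_path": "/path/to/cloned/clembench/games/taboo"
  }
]
```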

(I don't know what would be the best way to add backends... Either clone the clemgame repo and work there (and install via pip install -e), or maybe that could even work with the standard clemgame package reading in a local model registry that points to a local backend? A sketch of that second option follows.)
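
Technically, the second option seems feasible: Python can import a module from an arbitrary file path, so a registry entry could point at a backend file living in the user's repo, outside the directory where the built-in backends live. A minimal sketch, assuming hypothetical registry fields backend_path and backend_class (not the actual clemgame registry format):

```python
# Sketch: load a backend class from an arbitrary file path, i.e. from outside
# the directory where the built-in backends live. The registry fields
# ("backend_path", "backend_class") are hypothetical.
import importlib.util
import json
from pathlib import Path

def load_local_backend(registry_file: str):
    entry = json.loads(Path(registry_file).read_text())
    spec = importlib.util.spec_from_file_location("local_backend", entry["backend_path"])
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, entry["backend_class"])()
```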

@davidschlangen (Contributor, Author)

Does anyone have a comment on the last part (would it be possible to define a backend outside of the directory where the other backends live)? @phisad , @Gnurro , @AnneBeyer ?

@davidschlangen (Contributor, Author)

(Actually, it looks like it wouldn't be a big problem to also install scripts (which cli.py could be turned into) with pip, e.g. via a console-script entry point as sketched below. So these could still be part of the framework and wouldn't need to be copied around.)
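
With setuptools, that would be a console_scripts entry point; the mechanism is standard, but the module path clemgame.cli:main and the command name clem are assumptions here:

```python
# setup.py (excerpt) -- the entry_points mechanism is standard setuptools;
# the module path "clemgame.cli:main" and the command name "clem" are assumed.
from setuptools import setup, find_packages

setup(
    name="clemgame",
    packages=find_packages(),
    entry_points={
        "console_scripts": [
            "clem = clemgame.cli:main",  # exposes cli.py's main() as `clem`
        ],
    },
)
```

After pip install, something like `clem run all` would then presumably work from any directory, without copying cli.py around.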

@phisad self-assigned this Jan 15, 2025

@phisad (Collaborator) commented Jan 15, 2025

I assigned this to myself, so I will think about it when I have finished writing (hopefully in Feb.).
