WebArena/VisualWebArena results (#168)
* 4o mini webarena

* where is this gone ?

* 4o webarena

* claude webarena

* o1 mini results

* 405 webarena

* vwa + 70b webarena
TLSDC authored Dec 3, 2024
1 parent 8dc809c commit 38b2c0b
Showing 2 changed files with 12 additions and 2 deletions.
10 changes: 10 additions & 0 deletions reproducibility_journal.csv
@@ -55,3 +55,13 @@ ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-70b-instruct,assistantbench,0.13.
ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-8b-instruct,assistantbench,0.13.1,2024-11-28_19-34-58,d93a2398-2b70-41ce-b989-364fed988d73,0.001,0.001,15,214/214,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.0,32865050045c8c71df35c34ff30a6b420a4e258c, M: src/agentlab/experiments/study.py,0.13.1,None,
ThibaultLSDC,GenericAgent-anthropic_claude-3.5-sonnet:beta,assistantbench,0.13.1,2024-11-28_19-34-58,d93a2398-2b70-41ce-b989-364fed988d73,0.007,0.003,1,212/214,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.0,32865050045c8c71df35c34ff30a6b420a4e258c, M: src/agentlab/experiments/study.py,0.13.1,None,
ThibaultLSDC,GenericAgent-openai_o1-mini-2024-09-12,assistantbench,0.13.1,2024-11-28_19-34-58,d93a2398-2b70-41ce-b989-364fed988d73,0.009,0.005,1,214/214,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.0,32865050045c8c71df35c34ff30a6b420a4e258c, M: src/agentlab/experiments/study.py,0.13.1,None,
+ThibaultLSDC,GenericAgent-gpt-4o-mini,webarena,0.13.3,2024-11-29_19-25-49,c6bdeb87-9879-4c06-aa70-00d895001156,0.174,0.013,1,812/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,b115b2716d8a6328824684a692ed642297f0b1dc,,0.13.3,None,
+ThibaultLSDC,GenericAgent-gpt-4o,webarena,0.13.3,2024-11-29_22-28-32,d2eed215-91bb-4603-b69c-8ef8f9d57f34,0.314,0.016,3,812/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,430fe9456ba766398380454a6335f094004607af,,0.13.3,None,
+ThibaultLSDC,GenericAgent-anthropic_claude-3.5-sonnet:beta,webarena,0.13.3,2024-11-29_22-37-46,b5fc5be7-54cc-4fc1-a9ee-73447b9c3eae,0.362,0.017,0,812/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,7b224971fb7a90fb76924ca9386a1e8bf609dd2a,,0.13.3,None,
+ThibaultLSDC,GenericAgent-openai_o1-mini-2024-09-12,webarena,0.13.3,2024-11-30_00-22-44,1827983d-5e84-4b63-ad49-bf45ec2a6348,0.286,0.016,0,812/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,3f54ef13b778e69a1706c732f776147e9523ad3d,,0.13.3,None,
+ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-405b-instruct,webarena,0.13.3,2024-12-01_00-04-43,aaeca13d-0cf5-444f-8445-590350b54746,0.24,0.015,9,812/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,5a5b94d544424517cdd11602b27100b82e35eac0,,0.13.3,None,
+ThibaultLSDC,GenericAgent-gpt-4o-mini_vision,visualwebarena,0.13.3,2024-12-02_02-54-33,8d8642d3-757a-4346-ba45-01398f85b1f4,0.169,0.012,37,909/910,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,df7bc706f3793f47a456d1bda0485b306b8cf612,,0.13.3,None,
+ThibaultLSDC,GenericAgent-gpt-4o_vision,visualwebarena,0.13.3,2024-12-02_07-17-28,7fb7eac8-4bbd-4ebe-be32-15901a7678f2,0.267,0.015,65,910/910,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,df7bc706f3793f47a456d1bda0485b306b8cf612,,0.13.3,None,
+ThibaultLSDC,GenericAgent-anthropic_claude-3.5-sonnet:beta_vision,visualwebarena,0.13.3,2024-12-02_09-11-35,22f0611d-aeea-4ee9-a533-b45442b5e080,0.21,0.013,178,910/910,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,df7bc706f3793f47a456d1bda0485b306b8cf612,,0.13.3,None,
+ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-70b-instruct,webarena,0.13.3,2024-12-02_23-18-38,fc5747bc-d998-4942-a0eb-e55a3ccc1cb3,0.184,0.014,213,811/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,df7bc706f3793f47a456d1bda0485b306b8cf612,,0.13.3,None,
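
Note on reading these rows: the header row of reproducibility_journal.csv is not shown in this hunk, so the meaning of each field is an assumption. The seventh and eighth fields look like a mean success rate and its standard error, and the tenth like completed/total task counts. Under that reading, the reported errors are consistent with a binomial standard error, sqrt(p * (1 - p) / n). A minimal Python sketch of that check, using a hypothetical helper named `binomial_stderr` and three of the WebArena rows above:

```python
import math


def binomial_stderr(success_rate: float, n_tasks: int) -> float:
    """Standard error of a mean of n_tasks independent 0/1 outcomes."""
    return math.sqrt(success_rate * (1.0 - success_rate) / n_tasks)


# (agent, assumed score field, assumed stderr field, completed tasks),
# copied from the WebArena rows in this diff.
rows = [
    ("GenericAgent-gpt-4o-mini", 0.174, 0.013, 812),
    ("GenericAgent-gpt-4o", 0.314, 0.016, 812),
    ("GenericAgent-anthropic_claude-3.5-sonnet:beta", 0.362, 0.017, 812),
]

for agent, score, reported, n_tasks in rows:
    estimate = binomial_stderr(score, n_tasks)
    print(f"{agent}: reported {reported:.3f}, binomial estimate {estimate:.3f}")
```

For these three rows the estimate matches the reported value to three decimals. The scores in the journal are themselves rounded, so a row that sits near a rounding boundary may differ in the last digit.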

4 changes: 2 additions & 2 deletions src/agentlab/analyze/inspect_results.py
@@ -496,8 +496,8 @@ def display_report(
     if rename_bool_flags:
         report = _rename_bool_flags(report)

-    if copy_to_clipboard:
-        to_clipboard(report)
+    # if copy_to_clipboard:
+    #     to_clipboard(report)

     columns = list(report.columns)
