Benchmarks you can feel
We all love benchmarks, but there's nothing like a hands-on vibe check. What if we could meet somewhere in the middle?
Enter BENCHY. A chill, live benchmark tool that lets you see the performance, price, and speed of LLMs in a side-by-side comparison for SPECIFIC use cases.
Watch the latest development video here
- Thought Bench
- Goal: Compare the thoughts of multiple reasoning models (Deepseek R1, Gemini 2.0 Flash Thinking, OpenAI o1, etc.) side by side, in parallel.
- Watch the walkthrough video here
- Front end: src/pages/ThoughtBench.vue
- BIG AI Coding Updates to Benchy
- Watch the walkthrough video here
- Iso Speed Bench
- Goal: Create a unified, config-file-based, multi-LLM-provider benchmark built on yes/no evaluations, for high-quality insights and iteration.
- Watch the o3-mini vibe check, comparison, and benchmark video here
- Watch the M4 Unboxing and benchmark video here
- Front end: src/pages/IsoSpeedBench.vue
- Long Tool Calling
- Goal: Understand the best LLMs and techniques for LONG chains of tool calls / function calls (15+).
- Watch the walkthrough video here
- Front end: src/pages/AppMultiToolCall.vue
- Multi Autocomplete
- Goal: Understand how Claude 3.5 Haiku and GPT-4o predicted outputs compare to existing models.
- Watch the walkthrough video here
- Front end: src/pages/AppMultiAutocomplete.vue
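To make the Iso Speed Bench idea (one config, many providers, yes/no grading) concrete, here is a minimal offline sketch. Everything in it is an assumption for illustration: the config shape, `run_benchmark`, `normalize_yes_no`, and the two stand-in "models" are hypothetical and are not Benchy's actual config format or server code.

```python
import time
from dataclasses import dataclass
from typing import Callable

# Hypothetical config shape; Benchy's real config files may look different.
BENCHMARK_CONFIG = {
    "name": "is-it-even",
    "prompts": [
        {"prompt": "Is 4 an even number? Answer yes or no.", "expected": "yes"},
        {"prompt": "Is 7 an even number? Answer yes or no.", "expected": "no"},
    ],
}


@dataclass
class ModelReport:
    model: str
    correct: int
    total: int
    elapsed_seconds: float

    @property
    def accuracy(self) -> float:
        return self.correct / self.total if self.total else 0.0


def normalize_yes_no(text: str) -> str:
    """Reduce a free-form reply to 'yes', 'no', or 'unknown' using its first word."""
    words = text.strip().lower().replace(".", " ").replace(",", " ").split()
    first = words[0] if words else ""
    return first if first in ("yes", "no") else "unknown"


def run_benchmark(config: dict, models: dict[str, Callable[[str], str]]) -> list[ModelReport]:
    """Run every prompt in the config against every model and grade yes/no answers."""
    reports = []
    for name, ask in models.items():
        start = time.perf_counter()
        correct = 0
        for case in config["prompts"]:
            if normalize_yes_no(ask(case["prompt"])) == case["expected"]:
                correct += 1
        elapsed = time.perf_counter() - start
        reports.append(ModelReport(name, correct, len(config["prompts"]), elapsed))
    return reports


# Stand-in "models" so the sketch runs without any API keys.
def always_yes(prompt: str) -> str:
    return "Yes."


def parity_checker(prompt: str) -> str:
    number = int(prompt.split()[1])  # the sample prompts put the number second
    return "yes" if number % 2 == 0 else "No, it is odd."


reports = run_benchmark(
    BENCHMARK_CONFIG,
    {"always-yes": always_yes, "parity-checker": parity_checker},
)
for report in reports:
    print(f"{report.model}: {report.correct}/{report.total} correct ({report.accuracy:.0%})")
```

Swapping a stand-in for a function that calls a real provider keeps the grading and timing logic unchanged, which is the appeal of a yes/no evaluation: the answer is cheap to check and comparable across models.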
- .env - Environment variables for API keys
- server/.env - Environment variables for API keys
- package.json - Front end dependencies
- server/pyproject.toml - Server dependencies
- src/store/* - Stores all front end state and prompt
- src/api/* - API layer for all requests
- src/pages/* - Front end per app pages
- src/components/* - Front end components
- server/server.py - Server routes
- server/modules/llm_models.py - All LLM models
- server/modules/openai_llm.py - OpenAI LLM
- server/modules/anthropic_llm.py - Anthropic LLM
- server/modules/gemini_llm.py - Gemini LLM
- server/modules/ollama_llm.py - Ollama LLM
- server/modules/deepseek_llm.py - Deepseek LLM
- server/benchmark_data/* - Benchmark data
- server/reports/* - Benchmark results
- Anthropic
- Google Cloud
- OpenAI
- Deepseek
- Ollama
- After installing Ollama, pull the required models:
# Pull Llama 3.2 1B model
ollama pull llama3.2:1b
# Pull Llama 3.2 latest (3B) model
ollama pull llama3.2:latest
# Pull Qwen2.5 Coder 14B model
ollama pull qwen2.5-coder:14b
# Pull Deepseek R1 1.5B, 7b, 8b, 14b, 32b, 70b models
ollama pull deepseek-r1:1.5b
ollama pull deepseek-r1:latest
ollama pull deepseek-r1:8b
ollama pull deepseek-r1:14b
ollama pull deepseek-r1:32b
ollama pull deepseek-r1:70b
# Pull mistral-small 3
ollama pull mistral-small:latest
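If you want to confirm the pulls landed, a small sketch like the one below can compare the models listed above against what your local Ollama server reports. It assumes Ollama is listening on its default address, http://localhost:11434, and reads the `name` field from the `GET /api/tags` response. The helper names `missing_models` and `fetch_local_tags` are hypothetical and not part of Benchy.

```python
import json
import urllib.request

# Model tags taken from the pull commands above.
REQUIRED_MODELS = [
    "llama3.2:1b",
    "llama3.2:latest",
    "qwen2.5-coder:14b",
    "deepseek-r1:1.5b",
    "deepseek-r1:latest",
    "deepseek-r1:8b",
    "deepseek-r1:14b",
    "deepseek-r1:32b",
    "deepseek-r1:70b",
    "mistral-small:latest",
]


def missing_models(tags_payload: dict, required: list[str]) -> list[str]:
    """Return the required model tags that do not appear in an /api/tags payload."""
    installed = {model["name"] for model in tags_payload.get("models", [])}
    return [name for name in required if name not in installed]


def fetch_local_tags(base_url: str = "http://localhost:11434") -> dict:
    """Ask a running Ollama server which models are available locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return json.load(response)


# Offline demonstration with a hand-written payload shaped like the API response.
sample_payload = {"models": [{"name": "llama3.2:1b"}, {"name": "deepseek-r1:8b"}]}
still_needed = missing_models(
    sample_payload, ["llama3.2:1b", "deepseek-r1:8b", "mistral-small:latest"]
)
print(still_needed)
```

With Ollama running, `missing_models(fetch_local_tags(), REQUIRED_MODELS)` gives the list of tags that still need an `ollama pull`.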
# Install dependencies using bun (recommended)
bun install
# Or using npm
npm install
# Or using yarn
yarn install
# Start development server
bun dev # or npm run dev / yarn dev
# Move into server directory
cd server
# Create the virtual environment and install server dependencies using uv
uv sync
# Set up environment variables (these paths are relative to the repo root)
cp .env.sample .env                # client
cp server/.env.sample server/.env  # server
# Set EVERY .env key with your API keys and settings
ANTHROPIC_API_KEY=
OPENAI_API_KEY=
GEMINI_API_KEY=
DEEPSEEK_API_KEY=
FIREWORKS_API_KEY=
# Start server
uv run python server.py
# Run tests
uv run pytest  # beware: this hits live APIs and costs money
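Since the setup asks you to set EVERY .env key, a quick check before starting the server (or running the paid tests) can save a confusing failure. The sketch below is an illustration only: it parses simple `KEY=value` lines and reports which of the five keys listed above are empty or absent. The helpers `parse_env` and `unset_keys` are hypothetical, and the parser is deliberately simpler than a full dotenv implementation.

```python
# Keys copied from the setup steps above.
REQUIRED_KEYS = [
    "ANTHROPIC_API_KEY",
    "OPENAI_API_KEY",
    "GEMINI_API_KEY",
    "DEEPSEEK_API_KEY",
    "FIREWORKS_API_KEY",
]


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def unset_keys(env_text: str, required: list[str] = REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or have an empty value."""
    values = parse_env(env_text)
    return [key for key in required if not values.get(key)]


# Offline demonstration with placeholder values (not real keys).
sample_env = """
# copied from .env.sample
ANTHROPIC_API_KEY=sk-ant-example
OPENAI_API_KEY=
GEMINI_API_KEY="example"
"""
print(unset_keys(sample_env))
```

To check a real file, pass its contents instead, for example `unset_keys(open("server/.env").read())` from the repo root. Which keys each of the two .env files actually needs is not spelled out above, so treat the shared key list as an assumption.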
- https://github.com/simonw/llm?tab=readme-ov-file
- https://github.com/openai/openai-python
- https://platform.openai.com/docs/guides/predicted-outputs
- https://community.openai.com/t/introducing-predicted-outputs/1004502
- https://unocss.dev/integrations/vite
- https://www.npmjs.com/package/vue-codemirror6
- https://vuejs.org/guide/scaling-up/state-management
- https://www.ag-grid.com/vue-data-grid/getting-started/
- https://www.ag-grid.com/vue-data-grid/value-formatters/
- https://llm.datasette.io/en/stable/index.html
- https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/get-token-count
- https://ai.google.dev/gemini-api/docs/tokens?lang=python
- https://ai.google.dev/pricing#1_5flash
- https://ai.google.dev/gemini-api/docs/structured-output?lang=python
- https://platform.openai.com/docs/guides/structured-outputs
- https://docs.anthropic.com/en/docs/build-with-claude/tool-use
- https://ai.google.dev/gemini-api/docs/models/experimental-models
- https://sqlparse.readthedocs.io/en/latest/intro.html
- mlx: https://huggingface.co/mlx-community
- ollama docs: https://github.com/ollama/ollama/blob/main/docs/api.md#examples
- deepseek docs: https://platform.deepseek.com/usage