# The Eval Index

> A living leaderboard of LLM and agent evaluation and benchmarking tools, ranked
> daily by a momentum score (stars, push recency, rising newness) computed from live GitHub signals.

Updated: 2026-06-13T11:04:15.661957+00:00
Tools indexed: 269
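
The index does not publish how momentum is computed, so the following is only a minimal sketch of how a 0-100 score could combine the three signals named above (stars, push recency, newness). The function name `momentum_score`, the weights, and the decay scales are all hypothetical and chosen purely for illustration.

```python
import math


def momentum_score(stars, days_since_push, repo_age_days,
                   w_stars=0.5, w_recency=0.3, w_newness=0.2):
    """Hypothetical 0-100 momentum score; weights and scales are assumptions."""
    # Log-scale the star count so very large repos do not dominate;
    # the contribution saturates at 100,000 stars.
    star_part = min(math.log10(max(stars, 1)) / 5.0, 1.0)
    # Recent pushes score higher: exponential decay on a 30-day scale.
    recency_part = math.exp(-max(days_since_push, 0) / 30.0)
    # Newer repositories score higher: exponential decay on a 1-year scale.
    newness_part = math.exp(-max(repo_age_days, 0) / 365.0)
    score = 100.0 * (w_stars * star_part
                     + w_recency * recency_part
                     + w_newness * newness_part)
    return round(score)
```

With these assumed weights, a repository pushed yesterday outranks an otherwise identical one that has been idle for two months, which is the behaviour a recency-sensitive ranking is meant to capture.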

## Top eval tools by momentum

- [langfuse/langfuse](https://github.com/langfuse/langfuse) — momentum 87, ⭐29005 — Observability — 🪢 Open source AI engineering platform: LLM evals, observability, metrics, prompt management, playgro
- [mlflow/mlflow](https://github.com/mlflow/mlflow) — momentum 87, ⭐26485 — Observability — The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all
- [promptfoo/promptfoo](https://github.com/promptfoo/promptfoo) — momentum 86, ⭐22162 — Red Teaming & Safety — Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare p
- [comet-ml/opik](https://github.com/comet-ml/opik) — momentum 85, ⭐19595 — Observability — Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehe
- [openobserve/openobserve](https://github.com/openobserve/openobserve) — momentum 85, ⭐19253 — Observability — Open source observability platform for logs, metrics, traces, frontend monitoring, pipelines and LLM
- [trycua/cua](https://github.com/trycua/cua) — momentum 84, ⭐17875 — Agent Eval — Open-source infrastructure for Computer-Use Agents. Sandboxes, SDKs, and benchmarks to train and eva
- [Tencent/WeKnora](https://github.com/Tencent/WeKnora) — momentum 84, ⭐16237 — RAG Eval — Open-source LLM knowledge platform: turn raw documents into a queryable RAG, an autonomous reasoning
- [confident-ai/deepeval](https://github.com/confident-ai/deepeval) — momentum 84, ⭐16133 — Eval Frameworks — The LLM Evaluation Framework
- [tensorzero/tensorzero](https://github.com/tensorzero/tensorzero) — momentum 82, ⭐11462 — Observability — TensorZero is an open-source LLMOps platform that unifies an LLM gateway, observability, evaluation,
- [dataelement/bisheng](https://github.com/dataelement/bisheng) — momentum 82, ⭐11450 — Observability — BISHENG is an open LLM devops platform for next generation Enterprise AI applications. Powerful and 
- [Arize-ai/phoenix](https://github.com/Arize-ai/phoenix) — momentum 81, ⭐10120 — Observability — AI Observability & Evaluation
- [oumi-ai/oumi](https://github.com/oumi-ai/oumi) — momentum 81, ⭐9315 — Eval Frameworks — Easily fine-tune, evaluate and deploy Gemma 4, Qwen3.5, Qwen3.6, gpt-oss, DeepSeek-R1, or any open s
- [Doorman11991/smallcode](https://github.com/Doorman11991/smallcode) — momentum 81, ⭐1845 — Agent Eval — AI coding agent optimized for small LLMs. 87% benchmark with 4B-active model.
- [NVIDIA/garak](https://github.com/NVIDIA/garak) — momentum 80, ⭐8095 — Eval Frameworks — the LLM vulnerability scanner
- [traceloop/openllmetry](https://github.com/traceloop/openllmetry) — momentum 79, ⭐7194 — Observability — Open-source observability for your GenAI or LLM application, based on OpenTelemetry
- [open-compass/opencompass](https://github.com/open-compass/opencompass) — momentum 79, ⭐7081 — Benchmarks — OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, Inter
- [katanemo/plano](https://github.com/katanemo/plano) — momentum 79, ⭐6585 — Observability — Plano is an AI-native proxy and data plane for agentic apps — with built-in orchestration, safety, o
- [GoogleCloudPlatform/agent-starter-pack](https://github.com/GoogleCloudPlatform/agent-starter-pack) — momentum 79, ⭐6469 — Observability — Ship AI Agents to Google Cloud in minutes, not months. Production-ready templates with built-in CI/C
- [Andyyyy64/whichllm](https://github.com/Andyyyy64/whichllm) — momentum 79, ⭐4661 — Benchmarks — Find the local LLM that actually runs and performs best on your hardware. Ranked by real, recency-aw
- [jeinlee1991/chinese-llm-benchmark](https://github.com/jeinlee1991/chinese-llm-benchmark) — momentum 78, ⭐6160 — Agent Eval — NoneLinear (非线智能) ReLE evaluation: capability evaluation of Chinese AI large language models (continuously updated). Currently covers 374 models, including chatgpt, gpt-5.4, Google gemini-3.1-pro, Claude-4.
- [Helicone/helicone](https://github.com/Helicone/helicone) — momentum 78, ⭐5809 — Observability — 🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC 
- [coze-dev/coze-loop](https://github.com/coze-dev/coze-loop) — momentum 78, ⭐5516 — Observability — Next-generation AI Agent Optimization Platform: Cozeloop addresses challenges in AI agent developmen
- [Giskard-AI/giskard-oss](https://github.com/Giskard-AI/giskard-oss) — momentum 78, ⭐5428 — Red Teaming & Safety — 🐢 Open-Source Evaluation & Testing library for LLM Agents
- [Kiln-AI/Kiln](https://github.com/Kiln-AI/Kiln) — momentum 77, ⭐4899 — RAG Eval — Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data g
- [Marker-Inc-Korea/AutoRAG](https://github.com/Marker-Inc-Korea/AutoRAG) — momentum 77, ⭐4829 — RAG Eval — AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization
- [pydantic/logfire](https://github.com/pydantic/logfire) — momentum 77, ⭐4299 — Observability — AI observability platform for production LLM and agent systems.
- [Agenta-AI/agenta](https://github.com/Agenta-AI/agenta) — momentum 77, ⭐4200 — Observability — The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM obser
- [future-agi/future-agi](https://github.com/future-agi/future-agi) — momentum 77, ⭐1142 — Observability — Open-source, end-to-end platform for evaluating, observing, and improving LLM and AI agent applicati
- [PaperGuru-AI/PaperGuru-Benchmark](https://github.com/PaperGuru-AI/PaperGuru-Benchmark) — momentum 77, ⭐970 — Agent Eval — Lifecycle-Aware Memory for long-horizon LLM agents — 66.05% on PaperBench, 94.66% on SurveyBench, 10
- [EvolvingLMMs-Lab/lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) — momentum 76, ⭐4224 — Benchmarks — One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
- [open-compass/VLMEvalKit](https://github.com/open-compass/VLMEvalKit) — momentum 76, ⭐4219 — Benchmarks — Open-source evaluation toolkit of large multi-modality models (LMMs), support 220+ LMMs, 80+ benchma
- [Tencent/AI-Infra-Guard](https://github.com/Tencent/AI-Infra-Guard) — momentum 76, ⭐3883 — Red Teaming & Safety — A full-stack AI Red Teaming platform securing AI ecosystems via OpenClaw Security Scan, Agent Scan, 
- [juanjuandog/FinSight-AI](https://github.com/juanjuandog/FinSight-AI) — momentum 76, ⭐1158 — Observability — AI equity research agent with resilient workflows, Redis Lua single-flight, pgvector RAG, versioned 
- [FareedKhan-dev/all-agentic-architectures](https://github.com/FareedKhan-dev/all-agentic-architectures) — momentum 75, ⭐3606 — Agent Eval — 35 production-grade agentic AI architectures (Reflexion, LATS, GraphRAG, MemGPT, Voyager, BrowserAge
- [truera/trulens](https://github.com/truera/trulens) — momentum 75, ⭐3378 — Observability — Evaluation and Tracking for LLM Experiments and AI Agents
- [langwatch/langwatch](https://github.com/langwatch/langwatch) — momentum 75, ⭐3300 — Observability — The platform for LLM evaluations and AI agent testing
- [lmnr-ai/lmnr](https://github.com/lmnr-ai/lmnr) — momentum 75, ⭐3001 — Observability — Laminar - open-source observability platform purpose-built for AI agents. YC S24.
- [modelscope/evalscope](https://github.com/modelscope/evalscope) — momentum 75, ⭐2930 — RAG Eval — A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and p
- [openai/evals](https://github.com/openai/evals) — momentum 74, ⭐18685 — Benchmarks — Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
- [ianarawjo/ChainForge](https://github.com/ianarawjo/ChainForge) — momentum 74, ⭐2996 — Coding Eval — An open-source visual programming environment for battle-testing prompts to LLMs.
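
Every entry above follows the same layout: a linked `owner/repo`, a momentum value, a star count, a category, and a description that may be cut off. For readers who want to consume the list programmatically, here is a rough sketch of a parser for that layout. The names `Entry`, `ENTRY_RE`, and `parse_entry` are hypothetical, and the pattern is inferred from the lines shown here rather than from any published schema.

```python
import re
from typing import NamedTuple, Optional

DASH = "\u2014"  # dash character used as the field separator in each entry
STAR = "\u2b50"  # star emoji that prefixes the star count


class Entry(NamedTuple):
    repo: str
    url: str
    momentum: int
    stars: int
    category: str
    description: str


# Pattern inferred from the list layout (an assumption, not an official format).
# The category match is lazy, so a dash inside the description stays there.
ENTRY_RE = re.compile(
    r"^- \[(?P<repo>[^\]]+)\]\((?P<url>[^)]+)\)"
    rf" {DASH} momentum (?P<momentum>\d+), {STAR}\ufe0f?(?P<stars>\d+)"
    rf" {DASH} (?P<category>.+?) {DASH} (?P<description>.*)$"
)


def parse_entry(line: str) -> Optional[Entry]:
    """Parse one list line into an Entry, or return None if it does not match."""
    match = ENTRY_RE.match(line.rstrip())
    if match is None:
        return None
    return Entry(
        repo=match["repo"],
        url=match["url"],
        momentum=int(match["momentum"]),
        stars=int(match["stars"]),
        category=match["category"],
        description=match["description"],
    )
```

Lines that do not match, such as headings or blank lines, return `None`, so the function can be mapped over the whole file and the results filtered.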