# The Eval Index

> The living leaderboard of LLM and agent evaluation & benchmark tooling, ranked
> daily by momentum (stars, push-recency, rising-newness) from live GitHub signals.

Updated: 2026-06-13T11:04:15.661957+00:00
Tools indexed: 269

## Top eval tools by momentum

- [langfuse/langfuse](https://github.com/langfuse/langfuse) - momentum 87, ⭐29005 - Observability - 🪢 Open source AI engineering platform: LLM evals, observability, metrics, prompt management, playgro
- [mlflow/mlflow](https://github.com/mlflow/mlflow) - momentum 87, ⭐26485 - Observability - The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all
- [promptfoo/promptfoo](https://github.com/promptfoo/promptfoo) - momentum 86, ⭐22162 - Red Teaming & Safety - Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare p
- [comet-ml/opik](https://github.com/comet-ml/opik) - momentum 85, ⭐19595 - Observability - Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehe
- [openobserve/openobserve](https://github.com/openobserve/openobserve) - momentum 85, ⭐19253 - Observability - Open source observability platform for logs, metrics, traces, frontend monitoring, pipelines and LLM
- [trycua/cua](https://github.com/trycua/cua) - momentum 84, ⭐17875 - Agent Eval - Open-source infrastructure for Computer-Use Agents. Sandboxes, SDKs, and benchmarks to train and eva
- [Tencent/WeKnora](https://github.com/Tencent/WeKnora) - momentum 84, ⭐16237 - RAG Eval - Open-source LLM knowledge platform: turn raw documents into a queryable RAG, an autonomous reasoning
- [confident-ai/deepeval](https://github.com/confident-ai/deepeval) - momentum 84, ⭐16133 - Eval Frameworks - The LLM Evaluation Framework
- [tensorzero/tensorzero](https://github.com/tensorzero/tensorzero) - momentum 82, ⭐11462 - Observability - TensorZero is an open-source LLMOps platform that unifies an LLM gateway, observability, evaluation,
- [dataelement/bisheng](https://github.com/dataelement/bisheng) - momentum 82, ⭐11450 - Observability - BISHENG is an open LLM devops platform for next generation Enterprise AI applications. Powerful and
- [Arize-ai/phoenix](https://github.com/Arize-ai/phoenix) - momentum 81, ⭐10120 - Observability - AI Observability & Evaluation
- [oumi-ai/oumi](https://github.com/oumi-ai/oumi) - momentum 81, ⭐9315 - Eval Frameworks - Easily fine-tune, evaluate and deploy Gemma 4, Qwen3.5, Qwen3.6, gpt-oss, DeepSeek-R1, or any open s
- [Doorman11991/smallcode](https://github.com/Doorman11991/smallcode) - momentum 81, ⭐1845 - Agent Eval - AI coding agent optimized for small LLMs. 87% benchmark with 4B-active model.
- [NVIDIA/garak](https://github.com/NVIDIA/garak) - momentum 80, ⭐8095 - Eval Frameworks - the LLM vulnerability scanner
- [traceloop/openllmetry](https://github.com/traceloop/openllmetry) - momentum 79, ⭐7194 - Observability - Open-source observability for your GenAI or LLM application, based on OpenTelemetry
- [open-compass/opencompass](https://github.com/open-compass/opencompass) - momentum 79, ⭐7081 - Benchmarks - OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, Inter
- [katanemo/plano](https://github.com/katanemo/plano) - momentum 79, ⭐6585 - Observability - Plano is an AI-native proxy and data plane for agentic apps, with built-in orchestration, safety, o
- [GoogleCloudPlatform/agent-starter-pack](https://github.com/GoogleCloudPlatform/agent-starter-pack) - momentum 79, ⭐6469 - Observability - Ship AI Agents to Google Cloud in minutes, not months. Production-ready templates with built-in CI/C
- [Andyyyy64/whichllm](https://github.com/Andyyyy64/whichllm) - momentum 79, ⭐4661 - Benchmarks - Find the local LLM that actually runs and performs best on your hardware. Ranked by real, recency-aw
- [jeinlee1991/chinese-llm-benchmark](https://github.com/jeinlee1991/chinese-llm-benchmark) - momentum 78, ⭐6160 - Agent Eval - NoneLinear ReLE evaluation: capability evaluation of Chinese AI large models (continuously updated). Currently covers 374 large models, including chatgpt, gpt-5.4, Google gemini-3.1-pro, Claude-4.
- [Helicone/helicone](https://github.com/Helicone/helicone) - momentum 78, ⭐5809 - Observability - 🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC
- [coze-dev/coze-loop](https://github.com/coze-dev/coze-loop) - momentum 78, ⭐5516 - Observability - Next-generation AI Agent Optimization Platform: Cozeloop addresses challenges in AI agent developmen
- [Giskard-AI/giskard-oss](https://github.com/Giskard-AI/giskard-oss) - momentum 78, ⭐5428 - Red Teaming & Safety - 🐢 Open-Source Evaluation & Testing library for LLM Agents
- [Kiln-AI/Kiln](https://github.com/Kiln-AI/Kiln) - momentum 77, ⭐4899 - RAG Eval - Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data g
- [Marker-Inc-Korea/AutoRAG](https://github.com/Marker-Inc-Korea/AutoRAG) - momentum 77, ⭐4829 - RAG Eval - AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization
- [pydantic/logfire](https://github.com/pydantic/logfire) - momentum 77, ⭐4299 - Observability - AI observability platform for production LLM and agent systems.
- [Agenta-AI/agenta](https://github.com/Agenta-AI/agenta) - momentum 77, ⭐4200 - Observability - The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM obser
- [future-agi/future-agi](https://github.com/future-agi/future-agi) - momentum 77, ⭐1142 - Observability - Open-source, end-to-end platform for evaluating, observing, and improving LLM and AI agent applicati
- [PaperGuru-AI/PaperGuru-Benchmark](https://github.com/PaperGuru-AI/PaperGuru-Benchmark) - momentum 77, ⭐970 - Agent Eval - Lifecycle-Aware Memory for long-horizon LLM agents: 66.05% on PaperBench, 94.66% on SurveyBench, 10
- [EvolvingLMMs-Lab/lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) - momentum 76, ⭐4224 - Benchmarks - One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
- [open-compass/VLMEvalKit](https://github.com/open-compass/VLMEvalKit) - momentum 76, ⭐4219 - Benchmarks - Open-source evaluation toolkit of large multi-modality models (LMMs), support 220+ LMMs, 80+ benchma
- [Tencent/AI-Infra-Guard](https://github.com/Tencent/AI-Infra-Guard) - momentum 76, ⭐3883 - Red Teaming & Safety - A full-stack AI Red Teaming platform securing AI ecosystems via OpenClaw Security Scan, Agent Scan,
- [juanjuandog/FinSight-AI](https://github.com/juanjuandog/FinSight-AI) - momentum 76, ⭐1158 - Observability - AI equity research agent with resilient workflows, Redis Lua single-flight, pgvector RAG, versioned
- [FareedKhan-dev/all-agentic-architectures](https://github.com/FareedKhan-dev/all-agentic-architectures) - momentum 75, ⭐3606 - Agent Eval - 35 production-grade agentic AI architectures (Reflexion, LATS, GraphRAG, MemGPT, Voyager, BrowserAge
- [truera/trulens](https://github.com/truera/trulens) - momentum 75, ⭐3378 - Observability - Evaluation and Tracking for LLM Experiments and AI Agents
- [langwatch/langwatch](https://github.com/langwatch/langwatch) - momentum 75, ⭐3300 - Observability - The platform for LLM evaluations and AI agent testing
- [lmnr-ai/lmnr](https://github.com/lmnr-ai/lmnr) - momentum 75, ⭐3001 - Observability - Laminar - open-source observability platform purpose-built for AI agents. YC S24.
- [modelscope/evalscope](https://github.com/modelscope/evalscope) - momentum 75, ⭐2930 - RAG Eval - A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and p
- [openai/evals](https://github.com/openai/evals) - momentum 74, ⭐18685 - Benchmarks - Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
- [ianarawjo/ChainForge](https://github.com/ianarawjo/ChainForge) - momentum 74, ⭐2996 - Coding Eval - An open-source visual programming environment for battle-testing prompts to LLMs.
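
The index ranks by momentum, which its header describes as a blend of stars, push-recency, and rising-newness, but it does not publish the formula. The sketch below is a hypothetical illustration of how such a blend could let a young, actively pushed repository outrank an older one with far more stars, as the list above suggests can happen. The function name `momentum_score`, the weights, the log scaling, and the half-lives are all assumptions made for illustration, not the index's actual method.

```python
import math
from datetime import datetime, timedelta, timezone


def momentum_score(stars, pushed_at, created_at, now,
                   w_stars=0.5, w_recency=0.3, w_newness=0.2):
    """Hypothetical 0-100 momentum score (NOT the index's real formula).

    Every weight and curve here is an illustrative assumption.
    """
    # Stars on a log scale so large repos do not dominate; saturates at 100k stars.
    star_part = min(math.log10(stars + 1) / 5.0, 1.0)

    # Push recency: halves for every 30 days since the last push (assumed half-life).
    days_since_push = max((now - pushed_at).total_seconds() / 86400.0, 0.0)
    recency_part = 0.5 ** (days_since_push / 30.0)

    # Rising-newness: halves for every 365 days of repository age (assumed half-life).
    age_days = max((now - created_at).total_seconds() / 86400.0, 0.0)
    newness_part = 0.5 ** (age_days / 365.0)

    blended = (w_stars * star_part
               + w_recency * recency_part
               + w_newness * newness_part)
    return round(100 * blended)


if __name__ == "__main__":
    now = datetime(2026, 6, 13, tzinfo=timezone.utc)
    # Made-up repositories: a young, active one and an older, quiet one.
    young = momentum_score(1800, now - timedelta(days=1),
                           now - timedelta(days=30), now)
    older = momentum_score(18000, now - timedelta(days=365),
                           now - timedelta(days=3 * 365), now)
    print("young active repo:", young)
    print("older quiet repo:", older)
```

With these assumed weights, recency and newness together carry as much weight as stars, so a recent push can offset a tenfold gap in star count. Different weights or half-lives would change the ordering.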