The Eval Index

https://eval.kymatalabs.com
The living leaderboard of LLM & agent evaluation and benchmark tooling.

langfuse/langfuse (momentum 87)
https://eval.kymatalabs.com/p/langfuse-langfuse/
🪢 Open source AI engineering platform: LLM evals, observability, metrics, prompt management, playground, datasets. Integrates with OpenTelemetry, LangChain, OpenAI SDK, LiteLLM, and more. 🍊 YC W23

mlflow/mlflow (momentum 87)
https://eval.kymatalabs.com/p/mlflow-mlflow/
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.

promptfoo/promptfoo (momentum 86)
https://eval.kymatalabs.com/p/promptfoo-promptfoo/
Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, DeepSeek, and more. Simple declarative configs with command line and CI/CD integration. Used by OpenAI and Anthropic.

comet-ml/opik (momentum 85)
https://eval.kymatalabs.com/p/comet-ml-opik/
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

openobserve/openobserve (momentum 85)
https://eval.kymatalabs.com/p/openobserve-openobserve/
Open source observability platform for logs, metrics, traces, frontend monitoring, pipelines and LLM observability. A sophisticated, simple and highly performant alternative to Datadog, Splunk, and Elasticsearch with 140x lower storage costs and single binary deployment.

trycua/cua (momentum 84)
https://eval.kymatalabs.com/p/trycua-cua/
Open-source infrastructure for Computer-Use Agents. Sandboxes, SDKs, and benchmarks to train and evaluate AI agents that can control full desktops (macOS, Linux, Windows).

Tencent/WeKnora (momentum 84)
https://eval.kymatalabs.com/p/tencent-weknora/
Open-source LLM knowledge platform: turn raw documents into a queryable RAG, an autonomous reasoning agent, and a self-maintaining Wiki.

confident-ai/deepeval (momentum 84)
https://eval.kymatalabs.com/p/confident-ai-deepeval/
The LLM Evaluation Framework

tensorzero/tensorzero (momentum 82)
https://eval.kymatalabs.com/p/tensorzero-tensorzero/
TensorZero is an open-source LLMOps platform that unifies an LLM gateway, observability, evaluation, optimization, and experimentation.

dataelement/bisheng (momentum 82)
https://eval.kymatalabs.com/p/dataelement-bisheng/
BISHENG is an open LLM devops platform for next generation Enterprise AI applications. Powerful and comprehensive features include: GenAI workflow, RAG, Agent, Unified model management, Evaluation, SFT, Dataset Management, Enterprise-level System Management, Observability and more.

Arize-ai/phoenix (momentum 81)
https://eval.kymatalabs.com/p/arize-ai-phoenix/
AI Observability & Evaluation

oumi-ai/oumi (momentum 81)
https://eval.kymatalabs.com/p/oumi-ai-oumi/
Easily fine-tune, evaluate and deploy Gemma 4, Qwen3.5, Qwen3.6, gpt-oss, DeepSeek-R1, or any open source LLM / VLM!

Doorman11991/smallcode (momentum 81)
https://eval.kymatalabs.com/p/doorman11991-smallcode/
AI coding agent optimized for small LLMs. 87% benchmark with 4B-active model.

NVIDIA/garak (momentum 80)
https://eval.kymatalabs.com/p/nvidia-garak/
The LLM vulnerability scanner

traceloop/openllmetry (momentum 79)
https://eval.kymatalabs.com/p/traceloop-openllmetry/
Open-source observability for your GenAI or LLM application, based on OpenTelemetry

open-compass/opencompass (momentum 79)
https://eval.kymatalabs.com/p/open-compass-opencompass/
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.

katanemo/plano (momentum 79)
https://eval.kymatalabs.com/p/katanemo-plano/
Plano is an AI-native proxy and data plane for agentic apps, with built-in orchestration, safety, observability, and smart LLM routing so you stay focused on your agents' core logic.

GoogleCloudPlatform/agent-starter-pack (momentum 79)
https://eval.kymatalabs.com/p/googlecloudplatform-agent-starter-pack/
Ship AI Agents to Google Cloud in minutes, not months. Production-ready templates with built-in CI/CD, evaluation, and observability.

Andyyyy64/whichllm (momentum 79)
https://eval.kymatalabs.com/p/andyyyy64-whichllm/
Find the local LLM that actually runs and performs best on your hardware. Ranked by real, recency-aware benchmarks, not parameter count. One command, run it instantly.

jeinlee1991/chinese-llm-benchmark (momentum 78)
https://eval.kymatalabs.com/p/jeinlee1991-chinese-llm-benchmark/
NoneLinear (非线智能) ReLE evaluation: a continuously updated capability benchmark for Chinese-language AI large models. It currently covers 374 models, including commercial models such as chatgpt, gpt-5.4, Google gemini-3.1-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3.6-max, qwen3.6-plus, Baichuan, iFlytek Spark, and SenseTime senseChat, as well as open-source models such as step3.5-flash, kimi-k2.6, ernie4.5, MiniMax-M2.7, deepseek-v4, Qwen3.6, llama4, Zhipu GLM-5.1, MiMo-V2, LongCat, gemma4, and mistral. Not only

Helicone/helicone (momentum 78)
https://eval.kymatalabs.com/p/helicone-helicone/
🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓

coze-dev/coze-loop (momentum 78)
https://eval.kymatalabs.com/p/coze-dev-coze-loop/
Next-generation AI Agent Optimization Platform: Cozeloop addresses challenges in AI agent development by providing full-lifecycle management capabilities from development, debugging, and evaluation to monitoring.

Giskard-AI/giskard-oss (momentum 78)
https://eval.kymatalabs.com/p/giskard-ai-giskard-oss/
🐢 Open-Source Evaluation & Testing library for LLM Agents

Kiln-AI/Kiln (momentum 77)
https://eval.kymatalabs.com/p/kiln-ai-kiln/
Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.

Marker-Inc-Korea/AutoRAG (momentum 77)
https://eval.kymatalabs.com/p/marker-inc-korea-autorag/
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

pydantic/logfire (momentum 77)
https://eval.kymatalabs.com/p/pydantic-logfire/
AI observability platform for production LLM and agent systems.

Agenta-AI/agenta (momentum 77)
https://eval.kymatalabs.com/p/agenta-ai-agenta/
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

future-agi/future-agi (momentum 77)
https://eval.kymatalabs.com/p/future-agi-future-agi/
Open-source, end-to-end platform for evaluating, observing, and improving LLM and AI agent applications. Tracing · Evals · Simulations · Datasets · Gateway · Guardrails. Self-hostable. Apache 2.0.

PaperGuru-AI/PaperGuru-Benchmark (momentum 77)
https://eval.kymatalabs.com/p/paperguru-ai-paperguru-benchmark/
Lifecycle-Aware Memory for long-horizon LLM agents: 66.05% on PaperBench, 94.66% on SurveyBench, 10 peer-reviewed acceptances at FSE/ICML/TOSEM/AEI/ICoGB

EvolvingLMMs-Lab/lmms-eval (momentum 76)
https://eval.kymatalabs.com/p/evolvinglmms-lab-lmms-eval/
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks