<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>The Eval Index</title>
    <link>https://eval.kymatalabs.com</link>
    <description>The living leaderboard of LLM &amp; agent evaluation and benchmark tooling.</description>
    <item><title>langfuse/langfuse — momentum 87</title><link>https://eval.kymatalabs.com/p/langfuse-langfuse/</link><guid isPermaLink="false">langfuse/langfuse</guid><description>🪢 Open source AI engineering platform: LLM evals, observability, metrics, prompt management, playground, datasets. Integrates with OpenTelemetry, LangChain, OpenAI SDK, LiteLLM, and more. 🍊YC W23</description></item>
    <item><title>mlflow/mlflow — momentum 87</title><link>https://eval.kymatalabs.com/p/mlflow-mlflow/</link><guid isPermaLink="false">mlflow/mlflow</guid><description>The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.</description></item>
    <item><title>promptfoo/promptfoo — momentum 86</title><link>https://eval.kymatalabs.com/p/promptfoo-promptfoo/</link><guid isPermaLink="false">promptfoo/promptfoo</guid><description>Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, DeepSeek, and more. Simple declarative configs with command line and CI/CD integration. Used by OpenAI and Anthropic.</description></item>
    <item><title>comet-ml/opik — momentum 85</title><link>https://eval.kymatalabs.com/p/comet-ml-opik/</link><guid isPermaLink="false">comet-ml/opik</guid><description>Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.</description></item>
    <item><title>openobserve/openobserve — momentum 85</title><link>https://eval.kymatalabs.com/p/openobserve-openobserve/</link><guid isPermaLink="false">openobserve/openobserve</guid><description>Open source observability platform for logs, metrics, traces, frontend monitoring, pipelines and LLM observability. A sophisticated, simple and highly performant alternative to Datadog, Splunk, and Elasticsearch with 140x lower storage costs and single binary deployment.</description></item>
    <item><title>trycua/cua — momentum 84</title><link>https://eval.kymatalabs.com/p/trycua-cua/</link><guid isPermaLink="false">trycua/cua</guid><description>Open-source infrastructure for Computer-Use Agents. Sandboxes, SDKs, and benchmarks to train and evaluate AI agents that can control full desktops (macOS, Linux, Windows).</description></item>
    <item><title>Tencent/WeKnora — momentum 84</title><link>https://eval.kymatalabs.com/p/tencent-weknora/</link><guid isPermaLink="false">Tencent/WeKnora</guid><description>Open-source LLM knowledge platform: turn raw documents into a queryable RAG, an autonomous reasoning agent, and a self-maintaining Wiki.</description></item>
    <item><title>confident-ai/deepeval — momentum 84</title><link>https://eval.kymatalabs.com/p/confident-ai-deepeval/</link><guid isPermaLink="false">confident-ai/deepeval</guid><description>The LLM Evaluation Framework</description></item>
    <item><title>tensorzero/tensorzero — momentum 82</title><link>https://eval.kymatalabs.com/p/tensorzero-tensorzero/</link><guid isPermaLink="false">tensorzero/tensorzero</guid><description>TensorZero is an open-source LLMOps platform that unifies an LLM gateway, observability, evaluation, optimization, and experimentation.</description></item>
    <item><title>dataelement/bisheng — momentum 82</title><link>https://eval.kymatalabs.com/p/dataelement-bisheng/</link><guid isPermaLink="false">dataelement/bisheng</guid><description>BISHENG is an open LLM devops platform for next generation Enterprise AI applications. Powerful and comprehensive features include: GenAI workflow, RAG, Agent, Unified model management, Evaluation, SFT, Dataset Management, Enterprise-level System Management, Observability and more.</description></item>
    <item><title>Arize-ai/phoenix — momentum 81</title><link>https://eval.kymatalabs.com/p/arize-ai-phoenix/</link><guid isPermaLink="false">Arize-ai/phoenix</guid><description>AI Observability &amp; Evaluation</description></item>
    <item><title>oumi-ai/oumi — momentum 81</title><link>https://eval.kymatalabs.com/p/oumi-ai-oumi/</link><guid isPermaLink="false">oumi-ai/oumi</guid><description>Easily fine-tune, evaluate and deploy Gemma 4, Qwen3.5, Qwen3.6, gpt-oss, DeepSeek-R1, or any open source LLM / VLM!</description></item>
    <item><title>Doorman11991/smallcode — momentum 81</title><link>https://eval.kymatalabs.com/p/doorman11991-smallcode/</link><guid isPermaLink="false">Doorman11991/smallcode</guid><description>AI coding agent optimized for small LLMs. 87% benchmark with 4B-active model.</description></item>
    <item><title>NVIDIA/garak — momentum 80</title><link>https://eval.kymatalabs.com/p/nvidia-garak/</link><guid isPermaLink="false">NVIDIA/garak</guid><description>the LLM vulnerability scanner</description></item>
    <item><title>traceloop/openllmetry — momentum 79</title><link>https://eval.kymatalabs.com/p/traceloop-openllmetry/</link><guid isPermaLink="false">traceloop/openllmetry</guid><description>Open-source observability for your GenAI or LLM application, based on OpenTelemetry</description></item>
    <item><title>open-compass/opencompass — momentum 79</title><link>https://eval.kymatalabs.com/p/open-compass-opencompass/</link><guid isPermaLink="false">open-compass/opencompass</guid><description>OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.</description></item>
    <item><title>katanemo/plano — momentum 79</title><link>https://eval.kymatalabs.com/p/katanemo-plano/</link><guid isPermaLink="false">katanemo/plano</guid><description>Plano is an AI-native proxy and data plane for agentic apps, with built-in orchestration, safety, observability, and smart LLM routing so you stay focused on your agents' core logic.</description></item>
    <item><title>GoogleCloudPlatform/agent-starter-pack — momentum 79</title><link>https://eval.kymatalabs.com/p/googlecloudplatform-agent-starter-pack/</link><guid isPermaLink="false">GoogleCloudPlatform/agent-starter-pack</guid><description>Ship AI Agents to Google Cloud in minutes, not months. Production-ready templates with built-in CI/CD, evaluation, and observability.</description></item>
    <item><title>Andyyyy64/whichllm — momentum 79</title><link>https://eval.kymatalabs.com/p/andyyyy64-whichllm/</link><guid isPermaLink="false">Andyyyy64/whichllm</guid><description>Find the local LLM that actually runs and performs best on your hardware. Ranked by real, recency-aware benchmarks, not parameter count. One command, run it instantly.</description></item>
    <item><title>jeinlee1991/chinese-llm-benchmark — momentum 78</title><link>https://eval.kymatalabs.com/p/jeinlee1991-chinese-llm-benchmark/</link><guid isPermaLink="false">jeinlee1991/chinese-llm-benchmark</guid><description>非线智能 NoneLinear - ReLE evaluation: a continuously updated capability benchmark for Chinese AI large models. It currently includes 374 large models, covering commercial models such as chatgpt, gpt-5.4, Google gemini-3.1-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3.6-max, qwen3.6-plus, Baichuan, iFlytek Spark, and SenseTime senseChat, as well as open-source large models such as step3.5-flash, kimi-k2.6, ernie4.5, MiniMax-M2.7, deepseek-v4, Qwen3.6, llama4, Zhipu GLM-5.1, MiMo-V2, LongCat, gemma4, and mistral. Not only</description></item>
    <item><title>Helicone/helicone — momentum 78</title><link>https://eval.kymatalabs.com/p/helicone-helicone/</link><guid isPermaLink="false">Helicone/helicone</guid><description>🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓</description></item>
    <item><title>coze-dev/coze-loop — momentum 78</title><link>https://eval.kymatalabs.com/p/coze-dev-coze-loop/</link><guid isPermaLink="false">coze-dev/coze-loop</guid><description>Next-generation AI Agent Optimization Platform: Cozeloop addresses challenges in AI agent development by providing full-lifecycle management capabilities from development, debugging, and evaluation to monitoring.</description></item>
    <item><title>Giskard-AI/giskard-oss — momentum 78</title><link>https://eval.kymatalabs.com/p/giskard-ai-giskard-oss/</link><guid isPermaLink="false">Giskard-AI/giskard-oss</guid><description>🐢 Open-Source Evaluation &amp; Testing library for LLM Agents</description></item>
    <item><title>Kiln-AI/Kiln — momentum 77</title><link>https://eval.kymatalabs.com/p/kiln-ai-kiln/</link><guid isPermaLink="false">Kiln-AI/Kiln</guid><description>Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.</description></item>
    <item><title>Marker-Inc-Korea/AutoRAG — momentum 77</title><link>https://eval.kymatalabs.com/p/marker-inc-korea-autorag/</link><guid isPermaLink="false">Marker-Inc-Korea/AutoRAG</guid><description>AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation &amp; Optimization with AutoML-Style Automation</description></item>
    <item><title>pydantic/logfire — momentum 77</title><link>https://eval.kymatalabs.com/p/pydantic-logfire/</link><guid isPermaLink="false">pydantic/logfire</guid><description>AI observability platform for production LLM and agent systems.</description></item>
    <item><title>Agenta-AI/agenta — momentum 77</title><link>https://eval.kymatalabs.com/p/agenta-ai-agenta/</link><guid isPermaLink="false">Agenta-AI/agenta</guid><description>The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.</description></item>
    <item><title>future-agi/future-agi — momentum 77</title><link>https://eval.kymatalabs.com/p/future-agi-future-agi/</link><guid isPermaLink="false">future-agi/future-agi</guid><description>Open-source, end-to-end platform for evaluating, observing, and improving LLM and AI agent applications. Tracing · Evals · Simulations · Datasets · Gateway · Guardrails. Self-hostable. Apache 2.0.</description></item>
    <item><title>PaperGuru-AI/PaperGuru-Benchmark — momentum 77</title><link>https://eval.kymatalabs.com/p/paperguru-ai-paperguru-benchmark/</link><guid isPermaLink="false">PaperGuru-AI/PaperGuru-Benchmark</guid><description>Lifecycle-Aware Memory for long-horizon LLM agents — 66.05% on PaperBench, 94.66% on SurveyBench, 10 peer-reviewed acceptances at FSE/ICML/TOSEM/AEI/ICoGB</description></item>
    <item><title>EvolvingLMMs-Lab/lmms-eval — momentum 76</title><link>https://eval.kymatalabs.com/p/evolvinglmms-lab-lmms-eval/</link><guid isPermaLink="false">EvolvingLMMs-Lab/lmms-eval</guid><description>One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks</description></item>
  </channel>
</rss>
