Results for “benchmark”

53 packages found

Works with: Claude

Agent: awesome-gpt-5.6-usecases

@codeguilds-knight Community

Source-backed GPT-5.6 use cases for coding, agents, creative work, integrations, benchmarks, and practical limits.

0 v1.0.0

claude

Agent: LLM-Agents-Papers

@codeguilds-knight Community

A repo listing papers related to LLM-based agents

0 v1.0.0

claude

Agent: DecryptPrompt

@codeguilds-knight Community

Summaries of Prompt & LLM papers, open-source data & models, and AIGC applications

0 v1.0.0

claude

Agent: awesome-generative-ai

@codeguilds-knight Community

A curated list of Generative AI tools, works, models, and references

0 v1.0.0

claude

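Agent: chinese-llm-benchmark

@codeguilds-knight Community

非线智能 NoneLinear - ReLE evaluation: capability evaluation of Chinese AI large language models (continuously updated): currently includes 374 large models, covering chatgpt, gpt-5.4, Google gemini-3.1-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE

0 v5.10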

claude

Agent: Awesome-Graphs-Meet-Agents

@codeguilds-knight Community

[Up-to-date] A curated list of resources on graph-empowered agents and agent-facilitated graph learning (Graphs Meet Age

0 v1.0.0

claude

Agent: GTA

@codeguilds-knight Community

[NeurIPS 2024 D&B] GTA: A Benchmark for General Tool Agents & [arXiv 2026] GTA-2

0 v0.2.0

claude

Agent: Deep-Research-Survey

@codeguilds-knight Community

A Systematic Survey of Deep Research

0 v1.0.0

claude

Agent: llm-srbench

@codeguilds-knight Community

[ICML2025 Oral] LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models

0 v1.0.0

claude

Agent: LLMCompiler

@codeguilds-knight Community

[ICML 2024] LLMCompiler: An LLM Compiler for Parallel Function Calling

0 v1.0.0

claude

Agent: Awesome-LLM-in-Social-Science

@codeguilds-knight Community

Awesome papers involving LLMs in Social Science.

0 v1.0.0

claude

Agent: Awesome-LLM-Papers-Comprehensive-Topics

@codeguilds-knight Community

Awesome LLM Papers and repos on very comprehensive topics.

0 vreadability

claude

Agent: xLAM

@codeguilds-knight Community

xLAM: A Family of Large Action Models to Empower AI Agent Systems

0 v1.0.0

claude

Agent: LLM-SR

@codeguilds-knight Community

[ICLR 2025 Oral] This is the official repo for the paper "LLM-SR" on Scientific Equation Discovery and Symbolic Regressi

0 v1.0.0

claude

Agent: PhysicianBench

@codeguilds-knight Community

The benchmark tasks and evaluation harness for "PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments".

0 v1.0.0

claude

Agent: Yunjue-Agent

@codeguilds-knight Community

Yunjue Agent: A Fully Reproducible, Zero-Start In-Situ Self-Evolving Agent System for Open-Ended Tasks

0 v1.0.0

claude

Agent: skill-receipts

@codeguilds-knight Community

Agent skills for Claude Code where every entry ships with receipts: accuracy-gated benchmarks vs baseline AND placebo. R

0 v1.0.0

claude

Agent: AgentBench

@codeguilds-knight Community

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)

0 v1.0.0

claude

Agent: skills-vote

@codeguilds-knight Community

SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution

0 v1.0.0

claude

Agent: Odyssey

@codeguilds-knight Community

Odyssey: Empowering Minecraft Agents with Open-World Skills

0 v1.0.0

claude

Agent: DeepCode

@codeguilds-knight Community

"DeepCode: Open Agentic Coding (Paper2Code & Text2Web & Text2Backend)"

0 v1.2.0

claude

Agent: ml-dev-bench

@codeguilds-knight Community

ML-Dev-Bench is a benchmark for evaluating AI agents against various ML development tasks.

0 v0.1.0

claude

Agent: cactus

@codeguilds-knight Community

LLM Agent that leverages cheminformatics tools to provide informed responses.

0 v1.0.0

claude

Agent: UA2-Agent

@codeguilds-knight Community

Official Implementation of UA^{2}-Agent and other baseline algorithms of "Towards Unified Alignment Between Agents, Huma

0 v1.0.0

claude

Agent: scrapbot

@codeguilds-knight Community

An experimental game engine for robots (and their humans), made by robots (and their humans)

0 v0.1.0

claude

Agent: yantrikdb-hermes-plugin

@codeguilds-knight Community

YantrikDB memory provider for NousResearch/hermes-agent — self-maintaining memory with canonicalization, contradiction t

0 v0.6.0

claude

Agent: VideoGLaMM

@codeguilds-knight Community

[CVPR 2025 🔥] A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

0 v1.0.0

claude

Agent: PlugMem

@codeguilds-knight Community

ICML 2026 · Plug-and-play long-term memory for LLM agents

0 v1.0.0

claude

Agent: MIRAI

@codeguilds-knight Community

Code and Data for "MIRAI: Evaluating LLM Agents for Event Forecasting"

0 v1.0.0

claude

Agent: AutoRocq

@codeguilds-knight Community

Agentic Theorem Prover for Rocq for Program Verification

0 v1.0.0

claude

Agent: timechara

@codeguilds-knight Community

🧙🏻 Code and benchmark for our Findings of ACL 2024 paper - "TimeChara: Evaluating Point-in-Time Character Hallucinatio

0 v1.0.0

claude

Agent: claude-scientific-writer

@codeguilds-knight Community

A general-purpose scientific writer

0 v2.14.0

claude

Agent: awesome-claude

@codeguilds-knight Community

A curated list of awesome things related to Anthropic Claude

0 v1.0.0

claude

Agent: breaking-coding-chaos

@codeguilds-knight Community

A human-in-the-loop control plane for reliable agentic coding—plan, challenge, implement, verify, and preserve progress.

0 v1.0.0

claude

Agent: awesome-ai-tools

@codeguilds-knight Community

🔴 VERY LARGE AI TOOL LIST! 🔴 Curated list of AI Tools - Updated 2026

0 v1.0.0

claude

Agent: OpenRCA

@microsoft ✓ Official

[ICLR'25] OpenRCA: Can Large Language Models Locate the Root Cause of Software Failures?

0 v1.0.0

claude

Agent: MedAgents

@codeguilds-knight Community

[ACL 2024 Findings] MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning https://arxiv.org/

0 v1.0.0

claude

Agent: ax

@codeguilds-knight Community

The pretty much "official" DSPy framework for TypeScript

0 v22.0.3

claude

Agent: Awesome-AGI-Agents

@codeguilds-knight Community

🤖 Awesome list of AGI Agents. A curated collection of Agents resources.

0 v1.0.0

claude

Agent: AReaL

@codeguilds-knight Community

The RL Bridge for LLM-based Agent Applications. Made Simple & Flexible.

0 v1.0.4

claude

Agent: FABULA-LLM-5

@codeguilds-knight Community

Frontier models sell confidence. FABULA ships proof — an agent harness where any model is a swappable chip and every fin

0 v0.1.7

claude

Agent: SmartAgent

@codeguilds-knight Community

The official repository of "SmartAgent: Chain-of-User-Thought for Embodied Personalized Agent in Cyber World".

0 v1.0.0

claude

Agent: code-act

@codeguilds-knight Community

Official Repo for ICML 2024 paper "Executable Code Actions Elicit Better LLM Agents" by Xingyao Wang, Yangyi Chen, Lifan

0 v1.0.0

claude

Agent: Deep-Research-skills

@codeguilds-knight Community

Structured deep research skill for Claude Code/Open Code/Codex with human-in-the-loop control

0 v1.0.0

claude

Agent: VisualAgentBench

@codeguilds-knight Community

Towards Large Multimodal Models as Visual Foundation Agents

0 v1.0.0

claude

Agent: RepairAgent

@codeguilds-knight Community

RepairAgent is an autonomous LLM-based agent for software repair.

0 v1.0.0

claude

Agent: CodeGym

@codeguilds-knight Community

[ICLR2026] The official repository for the CodeGym project: "Generalizable End-to-End Tool-Use RL with Synthetic CodeGym

0 v1.0.0

claude

Agent: PopupAttack

@codeguilds-knight Community

Code repo for the paper: Attacking Vision-Language Computer Agents via Pop-ups

0 v1.0.0

claude

Agent: mlxstudio

@codeguilds-knight Community

MLX Studio - Home of JANG_Q - Image Gen/Edit + Chat/Code All in one - + OpenClaw (Anthropic API)

0 v1.5.58

claude

Agent: phoenix

@codeguilds-knight Community

AI Observability & Evaluation

0 varize-phoenix-v17.6.0

claude