ML-Dev-Bench

Ever wondered if AI agents can reliably develop new AI models? Look no further!

ML-Dev-Bench is a benchmark for evaluating AI agents on real-world ML development tasks.

The benchmark currently includes 30 tasks covering various aspects of model development, including dataset management, debugging model and code failures, and implementing new ideas to achieve strong performance on various machine learning tasks.

We also introduce Calipers, a framework for evaluating AI agents, providing tools and infrastructure for systematic assessment of AI model performance.

ML-Dev-Bench

Highlights
Features
Adding New Evaluation Tasks
Requirements
Installation
Usage
- Basic Usage
- Multi-run Evaluations
Development
Project Structure
Adding New Evaluation Cases
Adding New Agents
Contributing
Evaluation Traces
License
Acknowledgments
Citation

Highlights

What kinds of tasks are currently in ml-dev-bench?

ml-dev-bench currently includes 30 tasks across the following categories.

| Category | Description |
| --- | --- |
| Dataset Handling | Downloading and preprocessing datasets |
| Model Training | Loading pretrained models, fine-tuning |
| Debugging | Addressing errors in training files, exploding gradients, and incorrect implementations |
| Model Implementation | Modifying and implementing on top of existing model architectures |
| API Integration | Integrating logging tools like WandB |
| Performance | Improving baselines and achieving competitive results |

What kinds of ML problems do these tasks cover?

The tasks cover ML development in problem domains such as image classification, segmentation, question answering, image generation, and LLM fine-tuning and alignment.

What is the performance of different agents on these tasks?

We currently evaluate 3 agents (ReAct, OpenHands, and AIDE) using 3 models (Claude 3.5 Sonnet, GPT-4o, and Gemini 2.0 Flash) on 30 tasks.

[Figure: Agent Results]

What are the common failures across agents?

Agents perform well in easier, well-defined categories such as dataset handling and basic debugging with clear instructions, but struggle with open-ended, long-running tasks such as model performance improvement, where no agent succeeded. Agents also fail on debugging and implementation tasks that require modifying large existing codebases.

Features

- Flexible evaluation framework for AI agents
- Comprehensive metrics tracking and reporting
- Integration with LiteLLM and LangChain
- Configurable task-based evaluation system using Hydra
- Support for parameter sweeps and multi-run evaluations

Adding New Evaluation Tasks

We welcome contributions of new evaluation tasks! The process is:

1. Propose Your Task
   - Create a new issue using our New Evaluation Task template
   - This helps gather feedback and ensure the task fits our evaluation framework
2. Implement Your Task
   - After discussion and approval, implement your task following our examples:
     - hello_world for basic task structure
     - nan_losses for tasks with setup files and test scripts
3. Submit Your Implementation
   - Create a pull request using our New Evaluation Task template
   - Ensure all validation criteria and tests are implemented

Requirements

- Python 3.12+
- Poetry 1.8+
- Linux, macOS, or Windows Subsystem for Linux (WSL)
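
Before running the build, the requirements above can be checked from a shell. The following is a minimal sketch, not a script shipped with the repository: the helper name `check_tool` is hypothetical, and the snippet only reports what it finds without installing or changing anything.

```bash
# Report whether a command is available on PATH (hypothetical helper).
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}

check_tool python3
check_tool poetry

# Compare the interpreter version against the 3.12+ requirement.
if python3 -c 'import sys; sys.exit(0 if sys.version_info >= (3, 12) else 1)' 2>/dev/null; then
    echo "python3 is 3.12 or newer"
else
    echo "python3 is missing or older than 3.12"
fi
```

The Poetry version itself is not parsed here; `poetry --version` prints it for a manual comparison against 1.8.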

Installation

Clone the repository:

git clone https://github.com/ml-dev-bench/ml-dev-bench.git
cd ml-dev-bench

Install dependencies:

make build

This will:

- Check system requirements
- Install Python dependencies
- Set up pre-commit hooks
- Configure the development environment

Install runtime dependencies:

This is needed for running evaluations locally.

make install-runtime-dependencies

Usage

The evaluation framework uses Hydra for configuration management, allowing flexible task and agent configurations.

Basic Usage

Run a single task with a specific agent:

./scripts/eval.sh task=hello_world agent=openhands

Run with configuration overrides:

./scripts/eval.sh task=hello_world agent=openhands num_runs=3

Multi-run Evaluations

Create a .env file to store the API keys for the agents you are using.
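
As an illustration, a `.env` file with placeholder keys could be created as follows. This is a sketch under assumptions: the variable names are the ones LiteLLM conventionally reads for Anthropic, OpenAI, and Gemini models, but which keys a given agent actually needs is not stated here, so check the agent's configuration. The values are placeholders to be replaced.

```bash
# Create .env in the current directory unless one already exists.
if [ -e .env ]; then
    echo ".env already exists, leaving it untouched"
else
    # Variable names are assumptions; replace the values with real keys.
    cat > .env <<'EOF'
ANTHROPIC_API_KEY=replace-me
OPENAI_API_KEY=replace-me
GEMINI_API_KEY=replace-me
EOF
    # Restrict access to the current user, since the file holds secrets.
    chmod 600 .env
fi
```

Keeping `.env` out of version control is advisable, since it contains credentials.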

Activate the virtual environment for that agent from the root directory (e.g. for OpenHands):

source .venv-openhands/<ml-dev-bench-version>/bin/activate

Run all available tasks with a specific agent:

./scripts/eval.sh --multirun "task=glob(*)" agent=openhands

Run a list of tasks with a specific agent:

./scripts/eval.sh --multirun task=hello_world,shape_mismatch_train agent=react
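
For longer task lists, the comma-separated `task=` value can be assembled from a file instead of typed by hand. The sketch below assumes a hypothetical `tasks.txt` with one task name per line, and assumes the three names used are valid task identifiers. It prints the resulting command rather than running it, so it can be reviewed first.

```bash
# Hypothetical list of task names, one per line.
printf '%s\n' hello_world shape_mismatch_train nan_losses > tasks.txt

# Join the lines with commas to form the sweep value.
task_list=$(paste -sd, tasks.txt)

# Print the command instead of executing it.
echo ./scripts/eval.sh --multirun "task=${task_list}" agent=react
```

Dropping the `echo` runs the evaluation with the assembled list.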

Development

Format and lint code:

make lint

Calipers Architecture

Mermaid source for the architecture diagram:

graph TD
    %% Main Components
    User([User]) --> Scripts
    Scripts["Scripts (Entry Points)"] --> |"Configure & Run"| Framework

    %% Core Components
    subgraph "Core Framework"
        Framework["Framework (Orchestration)"]
        Registry["Registry (Task & Agent Repository)"]
        Config["Configuration (Hydra-based)"]
    end

    %% Execution Components
    subgraph "Execution Components"
        Tasks["Evaluation Tasks"]
        Agents["AI Agents"]
        Runtime["Runtime (Execution Environment)"]
    end

    %% Monitoring Components
    subgraph "Monitoring"
        Metrics["Metrics System"]
        Callbacks["Event Callbacks"]
        Results["Evaluation Results"]
    end

    %% Configurations
    Config --> |"Configure"| Framework
    Config --> |"Task Settings"| Tasks
    Config --> |"Agent Settings"| Agents

    %% Registration Flow
    Registry --> |"Register"| Tasks
    Registry --> |"Register"| Agents
    Framework --> |"Loads from"| Registry

    %% Task Execution Flow
    Framework --> |"Initialize"| Tasks
    Tasks --> |"Run via"| Agents
    Agents --> |"Execute in"| Runtime
    Tasks --> |"Validate with"| Runtime

    %% Monitoring Flow
    Tasks --> |"Record"| Metrics
    Agents --> |"Trigger"| Callbacks
    Tasks --> |"Produce"| Results
    Metrics --> Results


    %% Extensions
    MLDevBench["ML Dev Bench Runtime"] --> Runtime

Project Structure

.
├── calipers/
│   ├── agents/          # Agent implementations
│   ├── callbacks/       # Callback handlers
│   ├── framework/       # Core evaluation framework
│   ├── metrics/         # Metrics tracking
│   └── scripts/         # CLI tools
│
└── runtime/
    ├── backends/        # Runtime backend implementations
    ├── environments/    # Environment configurations
    └── tools/           # Runtime tools

Adding New Evaluation Cases

Follow the structure of the existing cases in the ml_dev_bench/cases directory. Create a new directory under ml_dev_bench/cases and add the case files:

- task.txt, which lists the tasks to be run
- config.yaml, which holds the configuration for the case
- a Python file that evaluates the case
- optionally, a setup_workspace directory, which is cloned into the workspace for the case
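
The layout described above can be scaffolded from the project root roughly as follows. This is a sketch only: the case name `my_new_case` and the evaluator file name are hypothetical, and the files are created as empty placeholders. Mirror an existing case for the real contents.

```bash
# Hypothetical case name; choose one that describes the task.
case_dir="ml_dev_bench/cases/my_new_case"

# Create the case directory and the optional workspace seed directory.
mkdir -p "${case_dir}/setup_workspace"

# Placeholder task file, to be replaced with the real task text.
printf '%s\n' "Describe the task here." > "${case_dir}/task.txt"

# Empty placeholders for the case configuration and the Python evaluator.
# The evaluator file name is an assumption, not a required convention.
touch "${case_dir}/config.yaml" "${case_dir}/my_new_case.py"

# Show the resulting layout.
ls -R "${case_dir}"
```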

Adding New Agents

Setting up Agent Dependencies using Poetry

Add a new group in pyproject.toml:

[tool.poetry.group.{your-agent-name}.dependencies]
dependency1 = "^version"
dependency2 = "^version"

Add a corresponding make target in Makefile:

install-{your-agent}-dependencies:
	@echo "$(GREEN)Installing Python dependencies with {your-agent} in new environment...$(RESET)"
	POETRY_VIRTUALENVS_PATH="./.venv-{your-agent}" poetry env use python$(PYTHON_VERSION)
	POETRY_VIRTUALENVS_PATH="./.venv-{your-agent}" poetry install --with {your-agent}

This creates a separate virtual environment with a suffix matching your agent name (e.g., .venv-{your-agent}).

Example: The react-agent group is set up with:

make install-react-agent-dependencies

This creates a dedicated environment at .venv-react with all react-agent specific dependencies.

Adding Agents Code

1. Create a new directory under agents/ with your agent name (e.g., agents/my_agent/)
2. Add your agent implementation files in this directory
3. Create a Dockerfile in your agent directory that extends the base image
4. Add agent configuration in ml_dev_bench/conf/agent/

Example structure:

agents/
├── my_agent/
│   ├── __init__.py
│   ├── my_agent.py       # Your agent implementation
│   └── Dockerfile        # Agent-specific Dockerfile
└── utils.py              # Shared utilities
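
The example structure above can be created from the project root with a few commands. This is a sketch under assumptions: `my_agent` is a placeholder name, the files are created empty, and the configuration file name `my_agent.yaml` is hypothetical, since the expected name and format are not specified here.

```bash
# Placeholder agent name; replace with your own.
agent_dir="agents/my_agent"

# Create the agent package with empty implementation and Dockerfile stubs.
mkdir -p "${agent_dir}"
touch "${agent_dir}/__init__.py" "${agent_dir}/my_agent.py" "${agent_dir}/Dockerfile"

# Add an empty agent configuration stub (file name is an assumption).
mkdir -p ml_dev_bench/conf/agent
touch ml_dev_bench/conf/agent/my_agent.yaml
```

The shared agents/utils.py is left alone, since it already exists in the repository.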

Agent Docker Setup

The project uses a two-stage Docker build:

1. A base image with core dependencies
2. Agent-specific images that extend the base image

Building Images

Build the base image (from project root):

docker build -t ml-dev-bench-base -f docker_base/base.Dockerfile .

Build your agent's image (from project root):

docker build -t ml-dev-bench-myagent -f agents/my_agent/Dockerfile .

Creating Agent Dockerfile

Your agent's Dockerfile should:

1. Extend the base image
2. Copy agent-specific code
3. Install agent-specific dependencies

Example agent Dockerfile:

FROM ml-dev-bench-base:latest

# Copy the agent code
COPY agents/my_agent/ ./agents/my_agent/
COPY agents/__init__.py ./agents/
COPY agents/utils.py ./agents/

# Install agent-specific dependencies
RUN poetry install --with my-agent

# Set working directory
WORKDIR $WORKDIR/agents/my_agent

# Default command - open a shell with poetry env
CMD ["poetry", "shell"]

Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Run linters and tests
5. Submit a pull request

Evaluation Traces

Evaluation logs and Traces: Link

License

MIT License - see the LICENSE file for details

Acknowledgments

- LiteLLM for LLM integration
- Composio for runtime management
- Hydra for configuration management

Citation

If you use ML-Dev-Bench in your research, please cite our paper:

@misc{mldevbench,
      title={ML-Dev-Bench: Comparative Analysis of AI Agents on ML development workflows},
      author={Harshith Padigela and Chintan Shah and Dinkar Juyal},
      year={2025},
      eprint={2502.00964},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2502.00964},
}
