ARIA: Training Language Agents with Intention-Driven Reward Aggregation

Ruihan Yang¹ Yikai Zhang¹ Aili Chen¹

¹Fudan University

Introduction

Large language models (LLMs) have enabled agents to perform complex reasoning and decision-making through free-form language interactions. However, in open-ended language action environments (e.g., negotiation or question-asking games), the action space can be formulated as a joint distribution over tokens, resulting in an extremely large and combinatorial action space. Sampling actions in such a space can lead to extreme reward sparsity, which brings large reward variance, hindering effective reinforcement learning.

To address this, we propose ARIA, a method that Aggregates Rewards in Intention space to enable efficient and effective language Agents training. ARIA aims to project natural language actions from the high-dimensional joint token distribution space into a low-dimensional intention space, where semantically similar actions are clustered and assigned shared rewards. This intention-aware reward aggregation reduces reward variance by densifying reward signals, fostering efficient and effective policy optimization.

Extensive experiments demonstrate that ARIA not only significantly reduces gradient variance, but also delivers substantial performance gains of average 9.95% across four downstream tasks (e.g., negotiation and text-based games), consistently outperforming strong offline and online RL baselines.

Installation

You can install ARIA using the following steps:

hljs language-bash

# Create and activate conda environment
conda create -n aria python=3.10
conda activate aria

# Install the package
pip install -e .

Data Processing Pipeline

The complete data processing pipeline transforms raw game data into training-ready datasets through several steps:

Step 1: Generate Clustering Labels

Starting with raw game data llama3-8b_{game}_msgs.jsonl, generate clustering labels for different k values (k=2 to k=100):

hljs language-bash

# For different game environments
cd reward_aggregation/{game}_clustering
python preprocesspy
python clustering.py 
python postprocess.py

Supported games:

bargaining (multi-agent)
negotiation (multi-agent)
guess_my_city (single-agent with actions & observations)
twenty_questions (single-agent with special observation processing)

Step 2: Find Optimal K Values

Determine the optimal k values for clustering using silhouette analysis:

hljs language-bash

cd clustering

# For multi-agent games (bargaining, negotiation)
python clustering_multi.py --data_path {game}_clustering

# For single-agent games (guess_my_city, twenty_questions)
python clustering_single.py --data_path {game}_clustering

This will output the optimal k values for each agent/component.

Step 3: Process Data with Selected K Values

Use the optimal k values to generate the final labeled dataset:

hljs language-bash

cd clustering
python game_data_processor.py {environment} {input_file} {output_file}

Examples:

hljs language-bash

# Multi-agent games (bargaining, negotiation)
python game_data_processor.py bargaining \
    llama3-8b_bargaining_with_labels_k2_to_k100.jsonl \
    llama3-8b_bargaining_with_selected_labels.jsonl \
    --alice-k 20 --bob-k 20

python game_data_processor.py negotiation \
    llama3-8b_negotiation_with_labels_k2_to_k100.jsonl \
    llama3-8b_negotiation_with_selected_labels.jsonl \
    --alice-k 16 --bob-k 16

# Single-agent with action/observation clustering
python game_data_processor.py guess_my_city \
    llama3-8b_guess_my_city_with_labels_k2_to_k100.jsonl \
    llama3-8b_guess_my_city_with_selected_labels.jsonl \
    --action-k 28 --observation-k 28

# Single-agent with special observation processing
python game_data_processor.py twenty_questions \
    llama3-8b_twenty_questions_with_labels_k2_to_k100.jsonl \
    llama3-8b_twenty_questions_with_selected_labels.jsonl \
    --k 36

Step 4: Generate Final Training Dataset

Convert the processed data into the final training format:

hljs language-bash

cd clustering
python gen_reinforce_multi.py

The final dataset will be saved to /ARIA/dataset/actor_reinforce_llama3-8b_multi.json and is ready for training.

Data Processing Summary

The complete pipeline:

hljs language-vbnet

llama3-8b_{game}_msgs.jsonl
    ↓ (Step 1: preprocess.py clustering.py postprocess.py)
llama3-8b_{game}_with_labels_k2_to_k100.jsonl
    ↓ (Step 2: clustering_multi/single.py)
optimal k values
    ↓ (Step 3: game_data_processor.py)
llama3-8b_{game}_with_selected_labels.jsonl
    ↓ (Step 4: gen_reinforce_multi.py)
actor_reinforce_llama3-8b_multi.json

Actor Training

hljs language-bash

cd scripts

# Offline training
python run_offline.py --config-name reinforce_llm
accelerate launch --config_file config/accelerate_config/default_config.yaml run_offline.py --config-name reinforce_llm

# Online training
python run_online.py --config-name onlinereinforce_llm
accelerate launch --config_file config/accelerate_config/default_config.yaml run_online.py --config-name onlinereinforce_llm

RM Training

We train a reward model (RM) using past rollout results from the actor (updated every 50 steps) to ensure accurate advantage estimation for the actor.

hljs language-bash

cd scripts
bash train_rm.sh

Evaluation

To evaluate your model in single-agent environments (e.g., Twenty Questions, Guess My City), follow these steps:

Step 1: Launch the model with vLLM

First, make sure your model is served using vLLM. Here's an example command:

hljs language-bash

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m model_path --port 8036 --tensor-parallel-size 4 --gpu_memory_utilization 0.6

Replace model_path with your actual model module or entry point.
The model will be accessible at http://localhost:8036.

Step 2: Run the evaluation script

Navigate to the evaluation directory and execute the evaluation script:

hljs language-bash

cd evaluation
bash eval.sh

The script does the following:

Evaluates the model on two environments: twenty_questions and guess_my_city.
Sends requests to the vLLM server via the specified BASE_URL.
Uses the model named llama3-8B (you can change this by editing MODEL_NAME in eval.sh).
Repeats each evaluation 200 times.
Saves results to ../results/single_agent/llama3-8B.

ARIA: Training Language Agents with Intention-Driven Reward Aggregation

Ruihan Yang¹ Yikai Zhang¹ Aili Chen¹

¹Fudan University

Introduction

Installation

You can install ARIA using the following steps:

hljs language-bash

# Create and activate conda environment
conda create -n aria python=3.10
conda activate aria

# Install the package
pip install -e .

Data Processing Pipeline

The complete data processing pipeline transforms raw game data into training-ready datasets through several steps:

Step 1: Generate Clustering Labels

Starting with raw game data llama3-8b_{game}_msgs.jsonl, generate clustering labels for different k values (k=2 to k=100):

hljs language-bash

# For different game environments
cd reward_aggregation/{game}_clustering
python preprocesspy
python clustering.py 
python postprocess.py

Supported games:

bargaining (multi-agent)
negotiation (multi-agent)
guess_my_city (single-agent with actions & observations)
twenty_questions (single-agent with special observation processing)

Step 2: Find Optimal K Values

Determine the optimal k values for clustering using silhouette analysis:

hljs language-bash

cd clustering

# For multi-agent games (bargaining, negotiation)
python clustering_multi.py --data_path {game}_clustering

# For single-agent games (guess_my_city, twenty_questions)
python clustering_single.py --data_path {game}_clustering

This will output the optimal k values for each agent/component.

Step 3: Process Data with Selected K Values

Use the optimal k values to generate the final labeled dataset:

hljs language-bash

cd clustering
python game_data_processor.py {environment} {input_file} {output_file}

Examples:

hljs language-bash

# Multi-agent games (bargaining, negotiation)
python game_data_processor.py bargaining \
    llama3-8b_bargaining_with_labels_k2_to_k100.jsonl \
    llama3-8b_bargaining_with_selected_labels.jsonl \
    --alice-k 20 --bob-k 20

python game_data_processor.py negotiation \
    llama3-8b_negotiation_with_labels_k2_to_k100.jsonl \
    llama3-8b_negotiation_with_selected_labels.jsonl \
    --alice-k 16 --bob-k 16

# Single-agent with action/observation clustering
python game_data_processor.py guess_my_city \
    llama3-8b_guess_my_city_with_labels_k2_to_k100.jsonl \
    llama3-8b_guess_my_city_with_selected_labels.jsonl \
    --action-k 28 --observation-k 28

# Single-agent with special observation processing
python game_data_processor.py twenty_questions \
    llama3-8b_twenty_questions_with_labels_k2_to_k100.jsonl \
    llama3-8b_twenty_questions_with_selected_labels.jsonl \
    --k 36

Step 4: Generate Final Training Dataset

Convert the processed data into the final training format:

hljs language-bash

cd clustering
python gen_reinforce_multi.py

The final dataset will be saved to /ARIA/dataset/actor_reinforce_llama3-8b_multi.json and is ready for training.

Data Processing Summary

The complete pipeline:

hljs language-vbnet

llama3-8b_{game}_msgs.jsonl
    ↓ (Step 1: preprocess.py clustering.py postprocess.py)
llama3-8b_{game}_with_labels_k2_to_k100.jsonl
    ↓ (Step 2: clustering_multi/single.py)
optimal k values
    ↓ (Step 3: game_data_processor.py)
llama3-8b_{game}_with_selected_labels.jsonl
    ↓ (Step 4: gen_reinforce_multi.py)
actor_reinforce_llama3-8b_multi.json

Actor Training

hljs language-bash

cd scripts

# Offline training
python run_offline.py --config-name reinforce_llm
accelerate launch --config_file config/accelerate_config/default_config.yaml run_offline.py --config-name reinforce_llm

# Online training
python run_online.py --config-name onlinereinforce_llm
accelerate launch --config_file config/accelerate_config/default_config.yaml run_online.py --config-name onlinereinforce_llm

RM Training

We train a reward model (RM) using past rollout results from the actor (updated every 50 steps) to ensure accurate advantage estimation for the actor.

hljs language-bash

cd scripts
bash train_rm.sh

Evaluation

To evaluate your model in single-agent environments (e.g., Twenty Questions, Guess My City), follow these steps:

Step 1: Launch the model with vLLM

First, make sure your model is served using vLLM. Here's an example command:

hljs language-bash

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m model_path --port 8036 --tensor-parallel-size 4 --gpu_memory_utilization 0.6

Replace model_path with your actual model module or entry point.
The model will be accessible at http://localhost:8036.

Step 2: Run the evaluation script

Navigate to the evaluation directory and execute the evaluation script:

hljs language-bash

cd evaluation
bash eval.sh

The script does the following:

Evaluates the model on two environments: twenty_questions and guess_my_city.
Sends requests to the vLLM server via the specified BASE_URL.
Uses the model named llama3-8B (you can change this by editing MODEL_NAME in eval.sh).
Repeats each evaluation 200 times.
Saves results to ../results/single_agent/llama3-8B.

ARIA

ARIA: Training Language Agents with Intention-Driven Reward Aggregation

Introduction

Installation

Data Processing Pipeline

Step 1: Generate Clustering Labels

Step 2: Find Optimal K Values

Step 3: Process Data with Selected K Values

Step 4: Generate Final Training Dataset

Data Processing Summary

Actor Training

RM Training

Evaluation

Step 1: Launch the model with vLLM

Step 2: Run the evaluation script

Similar Packages

ARIA

ARIA: Training Language Agents with Intention-Driven Reward Aggregation

Introduction

Installation

Data Processing Pipeline

Step 1: Generate Clustering Labels

Step 2: Find Optimal K Values

Step 3: Process Data with Selected K Values

Step 4: Generate Final Training Dataset

Data Processing Summary

Actor Training

RM Training

Evaluation

Step 1: Launch the model with vLLM

Step 2: Run the evaluation script

Similar Packages