MirrorBench

Evaluating Realism of User-Proxy Agents

MirrorBench is an automatic, extensible framework for evaluating User-Proxy Agents for human-likeness. It provides a modular architecture for benchmarking different User-Proxy Agents against a variety of realism metrics, and it is designed to be extensible, allowing researchers and developers to bring their own agents and metrics into the framework.

⭐ Drop a star to help us grow!

Requirements and Setup

The project requires Python 3.12 or higher. It is recommended to use a virtual environment to manage dependencies. You can install the project as a dependency using pip:

pip install git+https://github.com/SAP/mirrorbench.git

Alternatively, you can install it in editable/development mode by cloning the repository and installing it locally:

git clone https://github.com/SAP/mirrorbench.git

cd mirrorbench
pip install -e ".[dev]"

Quick Start

To get started with benchmarking your User-Proxy Agents, you can use either the Python API or the CLI.

In order to run a benchmark, you need to define a job configuration in a YAML file. Below is an example of a simple job configuration:

# Job run settings (seed, sync/async, concurrency, cache, observability etc.)
run:
  name: my_run
  ...(trimmed for brevity)...

# Define User-Proxies to benchmark
user_proxies:
- name: proxy:langchain/claude-3.7-sonnet
  ...(trimmed for brevity)...

# Define datasets to use for benchmarking
datasets:
- name: dataset:jsonl/chatbot_arena_mirror
  ...(trimmed for brevity)...

# Define metrics
metrics:
- name: metric:judge/gteval
  ...(trimmed for brevity)...

task_drivers:
  dataset:jsonl/chatbot_arena_mirror:
    driver: task:mirror/conversation
    ...(trimmed for brevity)...

As shown above, the job configuration consists of several sections, including run, user_proxies, datasets, metrics, and task_drivers. Each section allows you to specify the components of your benchmark. You can find more examples of job configurations in the configs directory.

We provide a quick code snippet to run a benchmark using the above job configuration:

from mirrorbench.core.config import load_job_config
from mirrorbench.core.runner import Runner

# Load the YAML job configuration and execute the benchmark
job_cfg = load_job_config("path/to/your/job_config.yaml")
runner = Runner(job_cfg)
result_summary = runner.run()

LLM Usage

To use LLMs or any other external API services, you will most likely need to set up API keys or authentication tokens. By default, the package reads environment variables from a .env file in your working directory. Alternatively, you can set the environment variables directly in your system, for example:

import os

os.environ["OPENAI_API_KEY"] = "your_openai_api_key"
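
If you prefer to load the .env file explicitly in your own scripts, a minimal sketch using the python-dotenv package could look like the following. Note that python-dotenv is an assumption here for illustration, not a documented MirrorBench requirement:

# Minimal sketch: explicitly load API keys from a .env file.
# Assumes python-dotenv is installed (pip install python-dotenv);
# it is not required by MirrorBench itself.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

# Verify the key is now visible to downstream LLM clients
assert os.getenv("OPENAI_API_KEY") is not None, "OPENAI_API_KEY not set"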

The package has built-in support for LangChain-based LLM clients. If you would like to support other LLM clients, you can implement and register a custom LLM wrapper, as shown for LangChainChatClient.
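
As an illustration of the kind of LangChain chat client the built-in support wraps, a minimal sketch is shown below. The provider, model name, and parameters are assumptions chosen for demonstration, not MirrorBench defaults:

# Illustrative sketch: a LangChain chat model of the kind the built-in
# LangChain-based client support can wrap. Assumes the langchain-openai
# package is installed; model name and temperature are arbitrary examples.
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)
response = chat_model.invoke("Reply as a curious end user would.")
print(response.content)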

MirrorBench CLI

MirrorBench provides a command-line interface (CLI) for running benchmarks, managing runs and the cache, and validating job configurations. Below is an overview of the available commands; for detailed usage instructions, run mirrorbench --help.

mirrorbench plan

The mirrorbench plan command lets you inspect and validate your job configuration file before executing a benchmarking job. It generates a summary file, plan.json, listing the components defined in the job configuration, including User-Proxies, datasets, metrics, and task drivers.

mirrorbench plan -c path/to/your/job_config.yaml

mirrorbench dryrun

The mirrorbench dryrun command performs a dry run with credential checks and dependency validation, without actually executing the benchmarking tasks. It generates a manifest.json file containing the detailed parsed information (units and episodes) that would be executed in a real run.

mirrorbench dryrun -c path/to/your/job_config.yaml

mirrorbench run

This command executes or resumes a benchmarking job based on the provided job configuration file. It manages the execution of tasks, computes metrics, and aggregates results.

# Execute a job from scratch
mirrorbench run -c path/to/your/job_config.yaml

# Resume a previously interrupted job
mirrorbench run -c path/to/your/job_config.yaml --resume

mirrorbench report

The CLI command mirrorbench report generates a comprehensive report of the benchmarking results from a completed run.

# Currently only JSON report generation is supported
mirrorbench report json <run_id> --output path/to/output/report.json
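
Since the generated report is plain JSON, it can be post-processed with standard tooling. A minimal sketch is shown below; the file path is the one passed via --output above, and no assumptions are made about the report's internal structure beyond it being valid JSON:

# Minimal sketch: load the generated JSON report for further analysis.
import json
from pathlib import Path

report = json.loads(Path("path/to/output/report.json").read_text())
print(sorted(report.keys()))  # inspect the report's top-level sections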

mirrorbench runs

The mirrorbench runs command has multiple subcommands to manage and inspect previous benchmarking runs. You can list all runs, view details of a specific run, or delete runs.

# List all previous runs
mirrorbench runs list

# Inspect the output of a specific episode of a run
mirrorbench runs inspect <run_id> --index <episode-index> --output episode.json

# Delete an existing run
mirrorbench runs delete <run_id> --force

mirrorbench cache

This command provides subcommands to check statistics of the cache or clear the cache.

# Show cache statistics
mirrorbench cache stats

# Clear the cache
mirrorbench cache purge

By default, the cache is retained for 24 hours unless specified otherwise in the job configuration.

Support, Feedback, Contributing

This project is open to feature requests, suggestions, bug reports, etc. via GitHub issues. Contributions and feedback are encouraged and always welcome. For more information about how to contribute, the project structure, and additional contribution details, see our Contribution Guidelines.

Security / Disclosure

If you find a bug that may be a security problem, please follow the instructions in our security policy on how to report it. Please do not create GitHub issues for security-related doubts or problems.

Code of Conduct

We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone. By participating in this project, you agree to abide by its Code of Conduct at all times.

Licensing

Copyright 2025 SAP SE or an SAP affiliate company and mirrorbench contributors. Please see our LICENSE for copyright and license information. Detailed information including third-party components and their licensing/copyright information is available via the REUSE tool.

Contributors

Ashutosh Hathidara
sebastian-schreiber-sap
Vaishali Senthil
aanilbabu
Yue (Julien) Yu