MirrorBench is an automatic, extensible framework for evaluating User-Proxy Agents for human-likeness. It provides a modular architecture for benchmarking different User-Proxy Agents against a variety of realism metrics, and it is designed so that researchers and developers can bring their own agents and metrics into the framework.
⭐ Drop a star to help us grow!
The project requires Python 3.12 or higher. It is recommended to use a virtual environment to manage dependencies. You can install the project as a dependency using pip:
pip install git+https://github.com/SAP/mirrorbench.git

Alternatively, you can install it in editable/development mode by cloning the repository and installing it locally:
git clone https://github.com/SAP/mirrorbench.git
cd mirrorbench
pip install -e .[dev]

To get started with benchmarking your User-Proxy Agents, you can use either the Python API or the CLI.
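In either case, a quick way to confirm the installation succeeded is to import the package and print its installed version. This is only a minimal sanity check; the distribution name mirrorbench below is assumed to match the repository name.

import importlib.metadata

# Importing the package confirms it is installed and importable.
import mirrorbench  # noqa: F401

# The distribution name "mirrorbench" is assumed to match the repository name.
print("mirrorbench version:", importlib.metadata.version("mirrorbench"))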
In order to run a benchmark, you need to define a job configuration in a YAML file. Below is an example of a simple job configuration:
# Job run settings (seed, sync/async, concurrency, cache, observability etc.)
run:
  name: my_run
  # ...(trimmed for brevity)...

# Define User-Proxies to benchmark
user_proxies:
  - name: proxy:langchain/claude-3.7-sonnet
    # ...(trimmed for brevity)...

# Define datasets to use for benchmarking
datasets:
  - name: dataset:jsonl/chatbot_arena_mirror
    # ...(trimmed for brevity)...

# Define metrics
metrics:
  - name: metric:judge/gteval
    # ...(trimmed for brevity)...

task_drivers:
  dataset:jsonl/chatbot_arena_mirror:
    driver: task:mirror/conversation
    # ...(trimmed for brevity)...

As shown above, the job configuration consists of several sections, including run, user_proxies, datasets, metrics, and task_drivers. Each section allows you to specify the components of your benchmark. You can find more examples of job configurations in the configs directory.
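Because the job configuration is plain YAML, you can also sanity-check it yourself before handing it to MirrorBench. The snippet below is a generic check using PyYAML (not part of MirrorBench's own validation) and only verifies that the file parses and contains the top-level sections shown above.

import yaml  # PyYAML; install it separately if it is not already available

with open("path/to/your/job_config.yaml") as handle:
    config = yaml.safe_load(handle)

# These section names mirror the example configuration above.
expected = {"run", "user_proxies", "datasets", "metrics", "task_drivers"}
missing = expected - set(config)
print("missing top-level sections:", missing or "none")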
We provide a quick code snippet to run a benchmark using the above job configuration:
from mirrorbench.core.config import load_job_config
from mirrorbench.core.runner import Runner
job_cfg = load_job_config("path/to/your/job_config.yaml")
runner = Runner(job_cfg)
result_summary = runner.run()

To use LLMs or other external API services, you will most likely need to set up API keys or authentication tokens. By default, the package reads environment variables defined in a .env file in your working directory. Alternatively, you can set the environment variables directly in your code:
import os
os.environ["OPENAI_API_KEY"] = "your_openai_api_key"

The package has built-in support for LangChain-based LLM clients. If you would like to support other LLM clients, you can implement and register a custom LLM wrapper, as shown for LangChainChatClient.
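As a rough orientation, such a wrapper typically adapts the provider's SDK to the chat-client interface that MirrorBench expects. The sketch below is purely illustrative and does not use the actual MirrorBench base class or registration hook; it wraps the OpenAI Python SDK as an example of a non-LangChain client, and the model name is an arbitrary placeholder. Use the LangChainChatClient implementation in the source as the authoritative reference for the real interface and registration mechanism.

# Illustrative sketch only: in a real wrapper you would subclass MirrorBench's
# chat-client base class and register it the same way LangChainChatClient is
# registered; those details are intentionally omitted here.
from openai import OpenAI


class OpenAIChatClient:
    def __init__(self, model: str = "gpt-4o-mini"):  # model name is a placeholder
        self._client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self._model = model

    def chat(self, messages: list[dict[str, str]]) -> str:
        # Adapt the conversation (a list of {"role": ..., "content": ...} dicts)
        # to the provider's API and return the assistant reply as plain text.
        response = self._client.chat.completions.create(
            model=self._model, messages=messages
        )
        return response.choices[0].message.content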
MirrorBench provides a command-line interface (CLI) for running benchmarks, managing runs and the cache, and validating job configurations. Below is an overview of the available commands; for detailed usage instructions, run mirrorbench --help.
The mirrorbench plan command allows you to inspect and validate your job configuration file before executing a benchmarking job. It generates a summary file, plan.json, listing the components defined in the job configuration, including User-Proxies, datasets, metrics, and task drivers.
mirrorbench plan -c path/to/your/job_config.yaml

The mirrorbench dryrun command performs a dry run with credential checks and dependency validation, without actually executing any benchmarking tasks. It generates a manifest.json file containing the detailed parsed information (units and episodes) that would be executed in a real run.
mirrorbench dryrun -c path/to/your/job_config.yaml

The mirrorbench run command executes or resumes a benchmarking job based on the provided job configuration file. It manages the execution of tasks, computes metrics, and aggregates results.
# Execute a job from scratch
mirrorbench run -c path/to/your/job_config.yaml
# Resume a previously interrupted job
mirrorbench run -c path/to/your/job_config.yaml --resume

The mirrorbench report command generates a comprehensive report of the benchmarking results from a completed run.
# Currently only JSON report generation is supported
mirrorbench report json <run-id> --output path/to/output/report.json

The mirrorbench runs command provides multiple subcommands to manage and inspect previous benchmarking runs. You can list all runs, view details of a specific run, or delete runs.
# List all previous runs
mirrorbench runs list
# Inspect the output of a specific episode of a run
mirrorbench runs inspect <run_id> --index <episode-index> --output episode.json
# Delete an existing run
mirrorbench runs delete <run_id> --force

The mirrorbench cache command provides subcommands to check cache statistics or clear the cache.
# Show cache statistics
mirrorbench cache stats
# Clear the cache
mirrorbench cache purge

Cached entries are retained for 24 hours by default, unless specified otherwise in the job configuration.
This project is open to feature requests, suggestions, and bug reports via GitHub issues. Contributions and feedback are encouraged and always welcome. For more information about how to contribute, the project structure, and additional contribution details, see our Contribution Guidelines.
If you find a bug that may be a security problem, please follow the instructions in our security policy on how to report it. Please do not create GitHub issues for security-related concerns.
We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone. By participating in this project, you agree to abide by its Code of Conduct at all times.
Copyright 2025 SAP SE or an SAP affiliate company and mirrorbench contributors. Please see our LICENSE for copyright and license information. Detailed information including third-party components and their licensing/copyright information is available via the REUSE tool.
Ashutosh Hathidara
sebastian-schreiber-sap
Vaishali Senthil
aanilbabu
Yue (Julien) Yu
