MirrorBench is an automatic, extensible framework for evaluating User-Proxy Agents for human-likeness. It provides a modular architecture for benchmarking different User-Proxy Agents against a variety of realism metrics, and it is designed so that researchers and developers can bring their own agents and metrics into the framework.
⭐ Drop a star to help us grow!
The project requires Python 3.12 or higher. It is recommended to use a virtual environment to manage dependencies. You can install the project as a dependency using pip:
pip install git+https://github.com/SAP/mirrorbench.git

Alternatively, you can install it in editable/development mode by cloning the repository and installing it locally:
git clone https://github.com/SAP/mirrorbench.git
cd mirrorbench
pip install -e .[dev]

To get started with benchmarking your User-Proxy Agents, you can use either the Python API or the CLI.
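In either case, a quick way to confirm the installation succeeded is to import the package and print its installed version. This is only a minimal sanity check; the distribution name mirrorbench below is assumed to match the repository name.

import importlib.metadata

# Importing the package confirms it is installed and importable.
import mirrorbench  # noqa: F401

# The distribution name "mirrorbench" is assumed to match the repository name.
print("mirrorbench version:", importlib.metadata.version("mirrorbench"))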
In order to run a benchmark, you need to define a job configuration in a YAML file. Below is an example of a simple job configuration:
# Job run settings (seed, sync/async, concurrency, cache, observability etc.)
run:
  name: my_run
  # ...(trimmed for brevity)...

# Define User-Proxies to benchmark
user_proxies:
  - name: proxy:langchain/claude-3.7-sonnet
    # ...(trimmed for brevity)...

# Define datasets to use for benchmarking
datasets:
  - name: dataset:jsonl/chatbot_arena_mirror
    # ...(trimmed for brevity)...

# Define metrics
metrics:
  - name: metric:judge/gteval
    # ...(trimmed for brevity)...

task_drivers:
  dataset:jsonl/chatbot_arena_mirror:
    driver: task:mirror/conversation
    # ...(trimmed for brevity)...

As shown above, the job configuration consists of several sections, including run, user_proxies, datasets, metrics, and task_drivers. Each section allows you to specify the components of your benchmark. You can find more examples of job configurations in the configs directory.
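Because the job configuration is plain YAML, you can also sanity-check it yourself before handing it to MirrorBench. The snippet below is a generic check using PyYAML (not part of MirrorBench's own validation) and only verifies that the file parses and contains the top-level sections shown above.

import yaml  # PyYAML; install it separately if it is not already available

with open("path/to/your/job_config.yaml") as handle:
    config = yaml.safe_load(handle)

# These section names mirror the example configuration above.
expected = {"run", "user_proxies", "datasets", "metrics", "task_drivers"}
missing = expected - set(config)
print("missing top-level sections:", missing or "none")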
We provide a quick code snippet to run a benchmark using the above job configuration:
from mirrorbench.core.config import load_job_config
from mirrorbench.core.runner import Runner
job_cfg = load_job_config("path/to/your/job_config.yaml")
runner = Runner(job_cfg)
result_summary = runner.run()

To use LLMs or other external API services, you will most likely need to set up API keys or authentication tokens. By default, the package reads environment variables defined in a .env file in your working directory. Alternatively, you can set the environment variables directly in your code:
import os
os.environ["OPENAI_API_KEY"] = "your_openai_api_key"

The package has built-in support for LangChain-based LLM clients. If you would like to support other LLM clients, you can implement and register a custom LLM wrapper, as shown for LangChainChatClient.
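As a rough orientation, such a wrapper typically adapts the provider's SDK to the chat-client interface that MirrorBench expects. The sketch below is purely illustrative and does not use the actual MirrorBench base class or registration hook; it wraps the OpenAI Python SDK as an example of a non-LangChain client, and the model name is an arbitrary placeholder. Use the LangChainChatClient implementation in the source as the authoritative reference for the real interface and registration mechanism.

# Illustrative sketch only: in a real wrapper you would subclass MirrorBench's
# chat-client base class and register it the same way LangChainChatClient is
# registered; those details are intentionally omitted here.
from openai import OpenAI


class OpenAIChatClient:
    def __init__(self, model: str = "gpt-4o-mini"):  # model name is a placeholder
        self._client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self._model = model

    def chat(self, messages: list[dict[str, str]]) -> str:
        # Adapt the conversation (a list of {"role": ..., "content": ...} dicts)
        # to the provider's API and return the assistant reply as plain text.
        response = self._client.chat.completions.create(
            model=self._model, messages=messages
        )
        return response.choices[0].message.content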
MirrorBench provides a command-line interface (CLI) for running benchmarks, managing runs and the cache, and validating job configurations. Below is an overview of the available commands; for detailed usage instructions, run mirrorbench --help.
The mirrorbench plan command allows you to inspect and validate your job configuration file before executing a benchmarking job. It generates a summary file, plan.json, listing the components defined in the job configuration, including User-Proxies, datasets, metrics, and task drivers.
mirrorbench plan -c path/to/your/job_config.yaml

The mirrorbench dryrun command performs a dry run with credential checks and dependency validation, without actually executing any benchmarking tasks. It generates a manifest.json file containing the detailed parsed information (units and episodes) that would be executed in a real run.
mirrorbench dryrun -c path/to/your/job_config.yaml

The mirrorbench run command executes or resumes a benchmarking job based on the provided job configuration file. It manages the execution of tasks, computes metrics, and aggregates results.
# Execute a job from scratch
mirrorbench run -c path/to/your/job_config.yaml
# Resume a previously interrupted job
mirrorbench run -c path/to/your/job_config.yaml --resume

The mirrorbench report command generates a comprehensive report of the benchmarking results from a completed run.
# Currently only JSON report generation is supported
mirrorbench report json <run-id> --output path/to/output/report.json

The mirrorbench runs command provides multiple subcommands to manage and inspect previous benchmarking runs. You can list all runs, view details of a specific run, or delete runs.
# List all previous runs
mirrorbench runs list
# Inspect the output of a specific episode of a run
mirrorbench runs inspect <run_id> --index <episode-index> --output episode.json
# Delete an existing run
mirrorbench runs delete <run_id> --force

The mirrorbench cache command provides subcommands to check cache statistics or clear the cache.
# Show cache statistics
mirrorbench cache stats
# Clear the cache
mirrorbench cache purge

Cached entries are retained for 24 hours by default, unless specified otherwise in the job configuration.
This project is open to feature requests, suggestions, and bug reports via GitHub issues. Contributions and feedback are encouraged and always welcome. For more information about how to contribute, the project structure, and additional contribution details, see our Contribution Guidelines.
If you find a bug that may be a security problem, please follow the instructions in our security policy on how to report it. Please do not create GitHub issues for security-related concerns.
We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone. By participating in this project, you agree to abide by its Code of Conduct at all times.
Copyright 2025 SAP SE or an SAP affiliate company and mirrorbench contributors. Please see our LICENSE for copyright and license information. Detailed information including third-party components and their licensing/copyright information is available via the REUSE tool.
Ashutosh Hathidara
sebastian-schreiber-sap
Vaishali Senthil
aanilbabu
Yue (Julien) Yu
