mcp-host · live · arXiv 2511.17729

{ Multi-Modal · Multi-Hop · Multi-Threaded }
Tool-Using MLLM Agent Benchmark

M³-Bench is a benchmarking & analysis suite built on top of the Model Context Protocol. It spins up 27 real MCP servers, drives multi-modal LLMs through 211 image-grounded tasks, and evaluates every trajectory at three granularities: step, call, and final answer.

tasks 211 · servers 27 · tools 200+ · MLLMs 14 · eval layers 3
mcp-host · task 00000000 · gpt-5 · live trace
user> image: 00000000.png · give me the total price of the 3 drinks on the shelf.
mcp> mcp-yolo.detect-all-objects(image=00000000.png) → bottle (×3)
mcp> amazon.search_products(keywords="Pepsi 20 oz")         ┐
mcp> amazon.search_products(keywords="Pepsi Zero 20 oz")    ├─ parallel
mcp> amazon.search_products(keywords="Diet Pepsi 20 oz")    ┘
mcp> math.sum(numbers=[2.49, 2.49, 2.29]) → 7.27
assistant> Total ≈ $7.27   // 6 tool calls · 3 servers · 2 steps · ✓ reward = 1.0
01 · motivation

Real MCP Tools, Not Simulated APIs

Most tool-use benchmarks handcraft a small simulated tool set or score the agent only on text output. M³-Bench instead boots real MCP servers and lets the model actually call them: Amazon catalog, Wikipedia, Google Maps, NASA APOD, yfinance, Ultralytics YOLO detection, barcode scanning, OCR, Excel, PowerPoint, and many more.

A single task may need the agent to read a shelf photo, detect the items with YOLO, search each one on Amazon for a price, sum them with a math tool, and return the total — threaded across multiple servers and parallel calls.

02 · the three M's

Multi-Modal · Multi-Hop · Multi-Threaded

Every task stresses all three axes at once, so models cannot coast on single-tool recall or text-only reasoning.

01 Multi-Modal

Each task ships with an image. The agent must visually parse shelf photos, labels, barcodes, OCR targets, or scene content, then feed that perception into the tool chain.

02 Multi-Hop

Tool calls span several servers. Typical chains: YOLO → Amazon → Math · OCR → Wiki → Weather · Barcode → OpenLibrary → Summary.

03 Multi-Threaded

Within a single reasoning step the agent often fires several parallel calls (e.g. three Amazon queries for three items detected in one image), then synthesizes the joined result.
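For intuition, the trace at the top maps onto host-side code roughly like the sketch below. Here host.call_tool(server, tool, arguments) is a hypothetical stand-in for the benchmark's MCPHost interface, which may differ.

import asyncio

# Sketch only: "host" stands in for the benchmark's MCPHost and is assumed
# to expose call_tool(server, tool, arguments); the real interface may differ.
async def price_drinks_on_shelf(host, image="00000000.png"):
    # hop 1 (vision): detect the bottles in the shelf photo (e.g. bottle x3)
    await host.call_tool("mcp-yolo", "detect-all-objects", {"image": image})

    # hop 2 (commerce): one Amazon lookup per product the model read off the
    # image, fired in parallel within a single reasoning step
    queries = ["Pepsi 20 oz", "Pepsi Zero 20 oz", "Diet Pepsi 20 oz"]
    listings = await asyncio.gather(*(
        host.call_tool("amazon", "search_products", {"keywords": q})
        for q in queries))

    # hop 3 (math): close the loop numerically instead of in free text
    prices = [item["price"] for item in listings]   # result shape is illustrative
    return await host.call_tool("math", "sum", {"numbers": prices})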

03 · MCP server fleet

27 Real Servers  ·  200+ Tools

All servers are booted over stdio through a single MCPHost. The agent sees a unified catalog and picks relevant tools at each step. The figure counts tools per server in the current fleet.

figure :: callable tools per MCP server, colored by category · 27 servers · 232 total tool functions
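For reference, the stdio boot for a single server looks roughly like this with the official mcp Python SDK; the launch command below is illustrative, and the benchmark's MCPHost repeats the pattern for all 27 servers and merges their catalogs.

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def list_server_tools():
    # illustrative launch command -- each server in the fleet has its own
    params = StdioServerParameters(command="python",
                                   args=["servers/mcp-yolo/server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()              # MCP handshake
            catalog = await session.list_tools()    # this server's slice of the catalog
            return [tool.name for tool in catalog.tools]

print(asyncio.run(list_server_tools()))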
── category × server ─────────────────────────────────

~ knowledge

wiki · openlibrary · paper_search (arXiv / PubMed / bioRxiv / medRxiv / IACR / Semantic Scholar) · nixos · metmuseum · nationalparks.

~ vision

ocr · pyzbar (barcodes) · imagesorcery (crop/draw/blur) · mcp-yolo (Ultralytics YOLO + YOLO-World) · linkimage.

~ commerce

amazon · tmdb · hugeicons · car-price (FIPE).

~ finance

yahoo-finance · okx (crypto prices + candlesticks).

~ location

google-maps · google-air · weather · nasa-mcp.

~ health / food

healthcare-mcp (FDA / PubMed / Clinical Trials / DICOM) · food_nutrition_mcp.

~ office

excel · ppt (Office-PowerPoint-MCP-Server).

~ math

arithmetic · stats · trig — closes the numeric reasoning loop.

~ social

reddit (subreddit search, posts, comments).

04 · three-layer eval

step · call · final_answer

We do not collapse everything to one scalar. Each prediction is scored at three granularities so weaknesses stay visible: a correct plan that still yields the wrong final answer scores well at the step layer but is caught at the final-answer layer.

§1 step-level

Trajectory quality against GT steps: recall, precision, arg similarity, step coherence, order consistency, merge purity. Runs via evaluate_trajectories.py.
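As a rough illustration only (not the script's exact scoring), recall and precision at this layer can be read as a matching problem between predicted and GT tool calls:

def step_recall_precision(pred_calls, gt_calls, match):
    """Rough illustration of the step layer, not the repo's exact scoring.

    pred_calls / gt_calls: lists of (tool_name, arguments) pairs;
    match(p, g): True when a predicted call is close enough to a GT call.
    evaluate_trajectories.py additionally scores argument similarity,
    step coherence, order consistency, and merge purity.
    """
    gt_hit = sum(1 for g in gt_calls if any(match(p, g) for p in pred_calls))
    pred_hit = sum(1 for p in pred_calls if any(match(p, g) for g in gt_calls))
    recall = gt_hit / max(len(gt_calls), 1)
    precision = pred_hit / max(len(pred_calls), 1)
    return recall, precision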

§2 call-level

Per-call classification: correct / partially correct / hallucinated / redundant. Aggregated pies in evaluate_calls.py.
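One plausible rule set for the four buckets, sketched for illustration; the authoritative criteria are in evaluate_calls.py.

from enum import Enum

class CallLabel(Enum):
    CORRECT = "correct"
    PARTIAL = "partially correct"
    HALLUCINATED = "hallucinated"
    REDUNDANT = "redundant"

def classify_call(pred, gt_calls, matched):
    """Illustrative rule set only; evaluate_calls.py defines the real criteria.

    pred: (tool_name, arguments); gt_calls: list of the same;
    matched: set of GT indices already claimed by earlier predictions.
    """
    name, args = pred
    candidates = [i for i, (gt_name, _) in enumerate(gt_calls) if gt_name == name]
    if not candidates:
        return CallLabel.HALLUCINATED             # no GT call uses this tool
    fresh = [i for i in candidates if i not in matched]
    if not fresh:
        return CallLabel.REDUNDANT                # repeats an already-matched call
    i = fresh[0]
    matched.add(i)
    return CallLabel.CORRECT if args == gt_calls[i][1] else CallLabel.PARTIAL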

§3 final answer

Task completion + information grounding of the final reply vs GT, scored by evaluate_final.py.
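The grounding half can be pictured as a toy fact-matching check; illustration only, evaluate_final.py is the source of truth.

def info_grounding(final_reply, gt_facts):
    """Toy sketch: fraction of GT facts echoed verbatim in the final reply.
    evaluate_final.py scores both task completion and information grounding."""
    reply = final_reply.lower()
    return sum(str(f).lower() in reply for f in gt_facts) / max(len(gt_facts), 1)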

05 · main results

Step-Level Leaderboard

Average step-level score sorted high → low. Numbers mirror Table 1 of the paper — Avg is a composite of Recall, Precision, Arg Similarity, Step Coherence, Order Consistency, Merge Purity, Task Completion, and Info Grounding.

GPT-5                     0.482
Gemini 2.5 Pro            0.423
Grok 4 (0709)             0.411
GPT-5 Mini                0.395
Gemini 2.5 Flash          0.388
Claude 4.5 Sonnet         0.333
Grok-4 Fast               0.298
Llama-4-Scout-17B-16E     0.264
GPT-5 Nano                0.247
Claude 4.5 Haiku          0.205
Gemini 2.5 Flash Lite     0.180
InternVL 3.5              0.179
Qwen2.5-VL-72B            0.141
GLM 4.5v                  0.029

source :: paper/sec/4_Metrics.tex · Table 1 · tab:mcp_multimodal_results

06 · quick start

Four steps from clone to leaderboard

Requirements: Python 3.11, Conda, Node.js (for JS-based MCP servers), optional CUDA, and API keys in .env for commercial endpoints plus a few MCP servers (Amazon / NASA / Unsplash / Reddit).
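The exact variable names depend on each server's setup; a hypothetical .env might look like:

# hypothetical variable names; check each server's README for the exact keys
OPENAI_API_KEY=...
AMAZON_API_KEY=...
NASA_API_KEY=...
UNSPLASH_ACCESS_KEY=...
REDDIT_CLIENT_ID=...
REDDIT_CLIENT_SECRET=...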

$ 01_install

conda create -n mcp_app python=3.11 -y
conda activate mcp_app
pip install -r requirements_pip.txt

# Node-based MCP servers
for d in tmdb-mcp-server mcp-server-nationalparks \
         metmuseum-mcp okx-mcp hugeicons math-mcp; do
  (cd servers/$d && npm i && npm run build)
done
(cd servers/healthcare-mcp-public && npm i)

$ 02_smoke_test

python tools/test_mcp_servers.py
# -- initialization handshake per server

python tools/functional_test_mcp_servers.py
# -- one real tool call per server, checks
#    isError and returns a short preview

27 servers report OK end-to-end on a clean checkout.

$ 03_run_benchmark

bash scripts/benchmark_fuzzy.sh
# → results/<model>_test_mcp_fuzzy.json

All 211 tasks. Picks top-K tools per step from the unified catalog.
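The selection logic lives in the benchmark script; as a rough sketch, fuzzy top-K retrieval over the catalog can be as simple as:

from difflib import SequenceMatcher

def top_k_tools(step_text, catalog, k=8):
    """Rough sketch of fuzzy top-K retrieval, not the script's exact logic.

    catalog: list of {"server": ..., "name": ..., "description": ...} entries
    drawn from the unified MCPHost tool catalog.
    """
    def score(tool):
        doc = f"{tool['name']} {tool.get('description', '')}"
        return SequenceMatcher(None, step_text.lower(), doc.lower()).ratio()
    return sorted(catalog, key=score, reverse=True)[:k]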

$ 04_evaluate

bash scripts/evaluate_step.sh          # trajectory quality
bash scripts/evaluate_call.sh          # call classification
bash scripts/evaluate_final_answer.sh  # task completion

Three granularities, one command each. Numbers land in results/<model>/.

07 · references

paper · code · dataset · cite

$ bibtex

@misc{yang2025m3bench,
  title        = {M3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark},
  author       = {Yang, Eta and Zhou, Yang and Zhou, Mu and Metaxas, Dimitris N. and others},
  year         = {2025},
  eprint       = {2511.17729},
  archivePrefix= {arXiv},
  primaryClass = {cs.AI},
  url          = {https://arxiv.org/abs/2511.17729}
}

contact: eta.yang@rutgers.edu · dnm@rutgers.edu  ·  sibling project: SPARK