mcp-host · live · arXiv 2511.17729

{ Multi-Modal · Multi-Hop · Multi-Threaded }
Tool-Using MLLM Agent Benchmark

M³-Bench is a benchmarking & analysis suite built on top of the Model Context Protocol. It spins up 27 real MCP servers, drives multi-modal LLMs through 211 image-grounded tasks, and evaluates every trajectory at three granularities: step, call, and final answer.

tasks 211 · servers 27 · tools 200+ · MLLMs 14 · eval layers 3
mcp-host · task 00000000 · gpt-5 · live trace
user> image: 00000000.png · give me the total price of the 3 drinks on the shelf.
mcp> mcp-yolo.detect-all-objects(image=00000000.png) → bottle (×3)
mcp> amazon.search_products(keywords="Pepsi 20 oz")         ┐
mcp> amazon.search_products(keywords="Pepsi Zero 20 oz")    ├─ parallel
mcp> amazon.search_products(keywords="Diet Pepsi 20 oz")    ┘
mcp> math.sum(numbers=[2.49, 2.49, 2.29]) → 7.27
assistant> Total ≈ $7.27   // 6 tool calls · 3 servers · 2 steps · ✓ reward = 1.0
01 · motivation

Real MCP Tools, Not Simulated APIs

Most tool-use benchmarks handcraft a small simulated tool set or score the agent only on text output. M³-Bench instead boots real MCP servers and lets the model actually call them: Amazon catalog, Wikipedia, Google Maps, NASA APOD, yfinance, Ultralytics YOLO detection, barcode scanning, OCR, Excel, PowerPoint, and many more.

A single task may need the agent to read a shelf photo, detect the items with YOLO, search each one on Amazon for a price, sum them with a math tool, and return the total — threaded across multiple servers and parallel calls.

02 · the three M's

Multi-Modal · Multi-Hop · Multi-Threaded

Every task stresses all three axes at once, so models cannot coast on single-tool recall or text-only reasoning.

01 Multi-Modal

Each task ships with an image. The agent must visually parse shelf photos, labels, barcodes, OCR targets, or scene content, then feed that perception into the tool chain.

02 Multi-Hop

Tool calls span several servers. Typical chains: YOLO → Amazon → Math · OCR → Wiki → Weather · Barcode → OpenLibrary → Summary.

03 Multi-Threaded

Within a single reasoning step the agent often fires several parallel calls (e.g. three Amazon queries for three items detected in one image), then synthesizes the joined result.
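For intuition, the trace at the top maps onto host-side code roughly like the sketch below. Here host.call_tool(server, tool, arguments) is a hypothetical stand-in for the benchmark's MCPHost interface, which may differ.

import asyncio

# Sketch only: "host" stands in for the benchmark's MCPHost and is assumed
# to expose call_tool(server, tool, arguments); the real interface may differ.
async def price_drinks_on_shelf(host, image="00000000.png"):
    # hop 1 (vision): detect the bottles in the shelf photo (e.g. bottle x3)
    await host.call_tool("mcp-yolo", "detect-all-objects", {"image": image})

    # hop 2 (commerce): one Amazon lookup per product the model read off the
    # image, fired in parallel within a single reasoning step
    queries = ["Pepsi 20 oz", "Pepsi Zero 20 oz", "Diet Pepsi 20 oz"]
    listings = await asyncio.gather(*(
        host.call_tool("amazon", "search_products", {"keywords": q})
        for q in queries))

    # hop 3 (math): close the loop numerically instead of in free text
    prices = [item["price"] for item in listings]   # result shape is illustrative
    return await host.call_tool("math", "sum", {"numbers": prices})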

03 · MCP server fleet

27 Real Servers  ·  200+ Tools

All servers are booted over stdio through a single MCPHost. The agent sees a unified catalog and picks relevant tools at each step. The figure counts tools per server in the current fleet.

figure :: callable tools per MCP server, colored by category · 27 servers · 232 total tool functions
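For reference, the stdio boot for a single server looks roughly like this with the official mcp Python SDK; the launch command below is illustrative, and the benchmark's MCPHost repeats the pattern for all 27 servers and merges their catalogs.

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def list_server_tools():
    # illustrative launch command -- each server in the fleet has its own
    params = StdioServerParameters(command="python",
                                   args=["servers/mcp-yolo/server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()              # MCP handshake
            catalog = await session.list_tools()    # this server's slice of the catalog
            return [tool.name for tool in catalog.tools]

print(asyncio.run(list_server_tools()))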
── category × server ─────────────────────────────────

~ knowledge

wiki · openlibrary · paper_search (arXiv / PubMed / bioRxiv / medRxiv / IACR / Semantic Scholar) · nixos · metmuseum · nationalparks.

~ vision

ocr · pyzbar (barcodes) · imagesorcery (crop/draw/blur) · mcp-yolo (Ultralytics YOLO + YOLO-World) · linkimage.

~ commerce

amazon · tmdb · hugeicons · car-price (FIPE).

~ finance

yahoo-finance · okx (crypto prices + candlesticks).

~ location

google-maps · google-air · weather · nasa-mcp.

~ health / food

healthcare-mcp (FDA / PubMed / Clinical Trials / DICOM) · food_nutrition_mcp.

~ office

excel · ppt (Office-PowerPoint-MCP-Server).

~ math

arithmetic · stats · trig — closes the numeric reasoning loop.

~ social

reddit (subreddit search, posts, comments).

04 · three-layer eval

step · call · final_answer

We do not collapse everything to one scalar. Each prediction is scored at three granularities so weaknesses stay visible: a correct plan that still yields the wrong final answer scores well at the step layer but is caught at the final-answer layer.

§1 step-level

Trajectory quality against GT steps: recall, precision, arg similarity, step coherence, order consistency, merge purity. Runs via evaluate_trajectories.py.
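As a rough illustration only (not the script's exact scoring), recall and precision at this layer can be read as a matching problem between predicted and GT tool calls:

def step_recall_precision(pred_calls, gt_calls, match):
    """Rough illustration of the step layer, not the repo's exact scoring.

    pred_calls / gt_calls: lists of (tool_name, arguments) pairs;
    match(p, g): True when a predicted call is close enough to a GT call.
    evaluate_trajectories.py additionally scores argument similarity,
    step coherence, order consistency, and merge purity.
    """
    gt_hit = sum(1 for g in gt_calls if any(match(p, g) for p in pred_calls))
    pred_hit = sum(1 for p in pred_calls if any(match(p, g) for g in gt_calls))
    recall = gt_hit / max(len(gt_calls), 1)
    precision = pred_hit / max(len(pred_calls), 1)
    return recall, precision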

§2 call-level

Per-call classification: correct / partially correct / hallucinated / redundant. Aggregated pies in evaluate_calls.py.
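One plausible rule set for the four buckets, sketched for illustration; the authoritative criteria are in evaluate_calls.py.

from enum import Enum

class CallLabel(Enum):
    CORRECT = "correct"
    PARTIAL = "partially correct"
    HALLUCINATED = "hallucinated"
    REDUNDANT = "redundant"

def classify_call(pred, gt_calls, matched):
    """Illustrative rule set only; evaluate_calls.py defines the real criteria.

    pred: (tool_name, arguments); gt_calls: list of the same;
    matched: set of GT indices already claimed by earlier predictions.
    """
    name, args = pred
    candidates = [i for i, (gt_name, _) in enumerate(gt_calls) if gt_name == name]
    if not candidates:
        return CallLabel.HALLUCINATED             # no GT call uses this tool
    fresh = [i for i in candidates if i not in matched]
    if not fresh:
        return CallLabel.REDUNDANT                # repeats an already-matched call
    i = fresh[0]
    matched.add(i)
    return CallLabel.CORRECT if args == gt_calls[i][1] else CallLabel.PARTIAL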

§3 final answer

Task completion + information grounding of the final reply vs GT, scored by evaluate_final.py.
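The grounding half can be pictured as a toy fact-matching check; illustration only, evaluate_final.py is the source of truth.

def info_grounding(final_reply, gt_facts):
    """Toy sketch: fraction of GT facts echoed verbatim in the final reply.
    evaluate_final.py scores both task completion and information grounding."""
    reply = final_reply.lower()
    return sum(str(f).lower() in reply for f in gt_facts) / max(len(gt_facts), 1)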

05 · main results

Step-Level Leaderboard

Average step-level score sorted high → low. Numbers mirror Table 1 of the paper — Avg is a composite of Recall, Precision, Arg Similarity, Step Coherence, Order Consistency, Merge Purity, Task Completion, and Info Grounding.

GPT-5                     0.482
Gemini 2.5 Pro            0.423
Grok 4 (0709)             0.411
GPT-5 Mini                0.395
Gemini 2.5 Flash          0.388
Claude 4.5 Sonnet         0.333
Grok-4 Fast               0.298
Llama-4-Scout-17B-16E     0.264
GPT-5 Nano                0.247
Claude 4.5 Haiku          0.205
Gemini 2.5 Flash Lite     0.180
InternVL 3.5              0.179
Qwen2.5-VL-72B            0.141
GLM 4.5v                  0.029

source :: paper/sec/4_Metrics.tex · Table 1 · tab:mcp_multimodal_results

06 · quick start

Four steps from clone to leaderboard

Requirements: Python 3.11, Conda, Node.js (for JS-based MCP servers), optional CUDA, and API keys in .env for commercial endpoints plus a few MCP servers (Amazon / NASA / Unsplash / Reddit).
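The exact variable names depend on each server's setup; a hypothetical .env might look like:

# hypothetical variable names; check each server's README for the exact keys
OPENAI_API_KEY=...
AMAZON_API_KEY=...
NASA_API_KEY=...
UNSPLASH_ACCESS_KEY=...
REDDIT_CLIENT_ID=...
REDDIT_CLIENT_SECRET=...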

$ 01_install

conda create -n mcp_app python=3.11 -y
conda activate mcp_app
pip install -r requirements_pip.txt

# Node-based MCP servers
for d in tmdb-mcp-server mcp-server-nationalparks \
         metmuseum-mcp okx-mcp hugeicons math-mcp; do
  (cd servers/$d && npm i && npm run build)
done
(cd servers/healthcare-mcp-public && npm i)

$ 02_smoke_test

python tools/test_mcp_servers.py
# -- initialization handshake per server

python tools/functional_test_mcp_servers.py
# -- one real tool call per server, checks
#    isError and returns a short preview

27 servers report OK end-to-end on a clean checkout.

$ 03_run_benchmark

bash scripts/benchmark_fuzzy.sh
# → results/<model>_test_mcp_fuzzy.json

All 211 tasks. Picks top-K tools per step from the unified catalog.
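The selection logic lives in the benchmark script; as a rough sketch, fuzzy top-K retrieval over the catalog can be as simple as:

from difflib import SequenceMatcher

def top_k_tools(step_text, catalog, k=8):
    """Rough sketch of fuzzy top-K retrieval, not the script's exact logic.

    catalog: list of {"server": ..., "name": ..., "description": ...} entries
    drawn from the unified MCPHost tool catalog.
    """
    def score(tool):
        doc = f"{tool['name']} {tool.get('description', '')}"
        return SequenceMatcher(None, step_text.lower(), doc.lower()).ratio()
    return sorted(catalog, key=score, reverse=True)[:k]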

$ 04_evaluate

bash scripts/evaluate_step.sh          # trajectory quality
bash scripts/evaluate_call.sh          # call classification
bash scripts/evaluate_final_answer.sh  # task completion

Three granularities, one command each. Numbers land in results/<model>/.

07 · references

paper · code · dataset · cite

$ bibtex

@misc{yang2025m3bench,
  title        = {M3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark},
  author       = {Yang, Eta and Zhou, Yang and Zhou, Mu and Metaxas, Dimitris N. and others},
  year         = {2025},
  eprint       = {2511.17729},
  archivePrefix= {arXiv},
  primaryClass = {cs.AI},
  url          = {https://arxiv.org/abs/2511.17729}
}

contact: eta.yang@rutgers.edu · dnm@rutgers.edu  ·  sibling project: SPARK