arXiv 2503.13794 · CVPR 2026

LLM Enhanced Open-Vocabulary Object Detection without Human-Curated Data Generation

LED introduces a knowledge-fusion paradigm for open-vocabulary detection: instead of asking foundation models to generate labels or descriptions, it directly injects intermediate MLLM hidden states into a detector decoder through a lightweight zero-initialized cross-attention adapter.

+6.34 APall on OmniLabel with Qwen2.5-0.5B
+11.84 APdp for positive descriptions
8.7% extra GFLOPs in the efficient shared-encoder setting
Latent knowledge route

MLLM early decoder states (vision-rich semantics) → LED adapter (zero-gated cross-attention) → detector grounding decoder (box prediction). No synthetic labels: only the fusion path is trained, and the detector's geometry is preserved.

Use the representation, not the generated annotation.

Many OVD pipelines improve detectors by asking large foundation models to create pseudo labels, detailed descriptions, or negative examples. LED takes a different route: it treats the MLLM hidden state itself as a semantic reservoir and fuses it into the detector while keeping the detector’s localization structure intact.

Proposition

Early MLLM decoder layers contain enough spatial and semantic information to strengthen free-form grounding, while deeper layers increasingly specialize toward text-side generation. LED extracts the useful part and inserts it through a stable adapter rather than replacing the detector.

I

Bypass prompt-crafted data

LED does not depend on manually designed synthetic-data prompts, reducing pipeline-specific bias and engineering cost.

II

Keep detection geometry

The detector remains responsible for box localization; MLLM states act as adaptation prompts rather than a replacement feature map.

III

Spend computation sparingly

Most gains come from the first few LLM layers, so the added overhead stays small compared with running a full MLLM as the detector.

A quiet adapter between two strong priors.

LED is composed of an MLLM branch for semantic prompt extraction, a mainstream open-vocabulary detector, and a zero-initialized cross-attention adapter. The zero gate lets training start from the original detector behavior and gradually admit MLLM knowledge only when it becomes useful.
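As a concrete illustration of the zero gate described above, here is a minimal, self-contained PyTorch sketch (not the repository's exact adapter): cross-attention from detector queries to MLLM-derived prompts is scaled by the tanh of a zero-initialized parameter, so at initialization the output equals the detector's original queries and MLLM knowledge is admitted only as the gate learns to open.

import torch
import torch.nn as nn

class ZeroGatedCrossAttention(nn.Module):
    """Cross-attention from detector queries to MLLM-derived prompts,
    scaled by a tanh gate that is initialized to zero."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # zero init: the adapter is silent at first

    def forward(self, queries: torch.Tensor, prompts: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(queries, prompts, prompts)
        # At step 0, tanh(0) = 0, so the detector behaves exactly as before fusion.
        return queries + torch.tanh(self.gate) * attended

# Example: 900 detector queries attend to 64 adaptation prompts (256-dim features).
queries = torch.randn(2, 900, 256)
prompts = torch.randn(2, 64, 256)
fused = ZeroGatedCrossAttention(dim=256)(queries, prompts)  # equals `queries` at init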

1

Shared image-text input

The MLLM and detector observe the same image and natural-language grounding query.

2

Early hidden states

Intermediate LLM decoder states are truncated into visual and textual token segments.

3

Adaptation prompts

A lightweight convolution and cross-attention pathway projects MLLM states into detector space.

4

Grounded boxes

The detector decoder receives semantic guidance while retaining its localization bias.

Hidden-state extraction

LED extracts the hidden state from decoder layer \(\ell_{LM}\), then truncates it into visual and textual token groups.

\(H_{\ell} = \mathrm{Decoder}_{\ell}(H_{\ell-1})\)
\(E_{VL},\ E_{T} = \mathrm{Truncate}(H_{\ell_{LM}})\)
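A sketch of how an intermediate hidden state could be pulled from a Hugging Face decoder via output_hidden_states and split into token groups. The text-only Qwen2-0.5B stand-in, the layer index, and num_vision_tokens are illustrative assumptions, not the repository's extraction code.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-0.5B"   # text-only stand-in for the MLLM decoder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_lm = 2            # early decoder layer, where LED taps the hidden state
num_vision_tokens = 64  # assumed length of the visual token segment

inputs = tokenizer("a person riding a red bicycle", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; index layer_lm is that decoder layer's output.
h = out.hidden_states[layer_lm]   # (batch, seq_len, hidden_dim)

# In the real MLLM the sequence starts with projected image tokens followed by the
# grounding query, so this split is only illustrative for the text-only stand-in.
e_vl, e_t = h[:, :num_vision_tokens], h[:, num_vision_tokens:]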

Stable fusion

The adapter starts with a zero-initialized gate, preventing randomly initialized prompts from disturbing the detector at the beginning of training.

\(E_{D}^{\ell} = E_{D}^{\ell-1} + \mathrm{Linear}(\hat{S}\,V)\)
\(\hat{S} = \mathrm{concat}\big(\tanh(g)\cdot\mathrm{softmax}(S_{\mathrm{prompt}}),\ \mathrm{softmax}(S_{\mathrm{det}})\big)\)
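A self-contained sketch that mirrors the two equations above and refines the generic gated adapter shown earlier: prompt and detector attention scores are softmaxed separately, the prompt branch is scaled by tanh(g) with g initialized to zero, the concatenated scores weight the concatenated values, and a linear layer adds the result back as a residual. Shapes and projection layout are assumptions for illustration.

import torch
import torch.nn as nn

class GatedFusionLayer(nn.Module):
    """E_D^l = E_D^{l-1} + Linear(S_hat V), with
    S_hat = concat(tanh(g) * softmax(S_prompt), softmax(S_det))."""
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_prompt, self.k_det = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.v_prompt, self.v_det = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-initialized gate g
        self.scale = dim ** -0.5

    def forward(self, e_d, prompts, det_tokens):
        q = self.q_proj(e_d)                                                    # (B, N, d)
        s_prompt = (q @ self.k_prompt(prompts).transpose(-2, -1)) * self.scale  # (B, N, P)
        s_det = (q @ self.k_det(det_tokens).transpose(-2, -1)) * self.scale     # (B, N, M)
        # Gate only the prompt branch, then concatenate with the detector branch.
        s_hat = torch.cat([torch.tanh(self.gate) * s_prompt.softmax(-1),
                           s_det.softmax(-1)], dim=-1)                          # (B, N, P+M)
        v = torch.cat([self.v_prompt(prompts), self.v_det(det_tokens)], dim=1)  # (B, P+M, d)
        return e_d + self.out(s_hat @ v)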

The gain concentrates where semantics matter.

Category-style detection is already strong in specialized detectors. LED’s largest improvements appear on description-grounded and relationship-heavy queries, where semantic composition, attributes, and spatial language become the bottleneck.

OmniLabel · Qwen2.5-0.5B: 28.03 APall (+6.34 over G-DINO)
OmniLabel · Qwen2.5-0.5B: 36.45 APdp (+11.84 on positive descriptions)
OVDEval · relationship: 30.9 AP (+3.7 on relationship grounding)

Why not simply replace the detector?

The supplement shows that raw MLLM hidden states cannot serve as a drop-in visual detector feature. They are semantically rich but not geometrically aligned for localization. LED therefore inserts a controlled fusion mechanism and leaves box prediction to the detector.

Paper tables, rendered as interactive evidence.

Select a benchmark family to inspect the reported values. Tables can be sorted, and the main OmniLabel panel can be re-plotted by metric to reveal where each adaptation source helps most.

OmniLabel model comparison

GroundingDINO vs. LED adaptation prompts from different LLM families.

Referring expression comprehension

LED improves RefCOCO, RefCOCO+, and RefCOCOg splits, especially with InternLM2.5-7B hidden states.

D3 and OVDEval sub-categories

Using Qwen2-0.5B early hidden states, LED strengthens description and relationship-heavy settings.

Adapter architecture analysis

Text-free adaptation (Arch. IV) gives the best overall score in the reported OmniLabel adapter study.

Comparison with synthetic-data pipelines

On OmniLabel, LED with InternVL2-1B surpasses NEG-Text and DesCo without generating additional curated labels.

Small LLM slices can be more useful than large detectors.

LED uses the first two layers of a lightweight Qwen2-0.5B branch in its efficient setting. The adapter itself is tiny; most of the extra compute comes from the shallow LLM slice, and the reported overhead is only 8.7% relative to GroundingDINO.
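A rough way to materialize that shallow slice with Hugging Face transformers, assuming the Qwen2-0.5B checkpoint; keeping only the first two decoder layers follows the efficient setting described above, while the attribute path and parameter count below are illustrative rather than the paper's measurement protocol.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")

# Keep only the first two decoder layers, mirroring the efficient setting.
model.model.layers = model.model.layers[:2]
model.config.num_hidden_layers = 2  # keep the config consistent with the slice

n_params = sum(p.numel() for p in model.parameters())
print(f"parameters in the 2-layer slice (plus embeddings): {n_params / 1e6:.1f}M")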

Compute decomposition

Additional parameters and GFLOPs relative to G-DINO.

APd vs. GFLOPs

Bubble size encodes parameter count. Key points are annotated; hover over each bubble for exact values.

SOTA detectors and MLLMs on OmniLabel

Sortable table from the paper’s comparison with efficient detectors and large MLLM baselines.

From repository to training run.

The codebase builds on GroundingDINO and trains the LED adapter with mixed OD/VG data. The commands below mirror the repository's installation steps and keep the blog focused on reproducing the detection pipeline.

Install

cd GroundingDINO
pip install -r requirements.txt

cd models/GroundingDINO/ops
python setup.py build install   # compile the custom CUDA ops (multi-scale deformable attention)
python test.py                  # sanity-check the compiled ops

Prepare OD/VG data

# convert COCO instance annotations to the ODVG jsonl format
python tools/coco2odvg.py \
  --image-root path/coco_2017/train2017 \
  --anno-file path/coco_2017/annotations/instances_train2017.json \
  --out-jsonl path/coco_2017/annotations/coco2017_train_odvg.jsonl

Configure mixed dataset

{
  "train": [
    {"root": "path/Objects365/", "dataset_mode": "odvg"},
    {"root": "path/coco_2017/train2017/", "dataset_mode": "odvg"},
    {"root": "path/flickr30k/images/", "dataset_mode": "odvg"}
  ],
  "val": [{"root": "path/coco_2017/val2017", "dataset_mode": "coco"}]
}

Train

bash train.sh
# set --pretrained_path and --dataset_cfg
# according to your local environment

Paper, code, and citation.

@misc{zhou2025led,
  title        = {LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation},
  author       = {Zhou, Yang and Zhao, Shiyu and Chen, Yuxiao and Wang, Zhenting and Jin, Can and Zhao, Mingyu and Metaxas, Dimitris N.},
  year         = {2025},
  eprint       = {2503.13794},
  archivePrefix= {arXiv},
  primaryClass = {cs.CV},
  url          = {https://arxiv.org/abs/2503.13794}
}