arXiv 2503.13794 · CVPR 2026

LLM Enhanced Open-Vocabulary Object Detection without Human-Curated Data Generation

LED introduces a knowledge-fusion paradigm for open-vocabulary detection: instead of asking foundation models to generate labels or descriptions, it directly injects intermediate MLLM hidden states into a detector decoder through a lightweight zero-initialized cross-attention adapter.

+6.34 APall on OmniLabel with Qwen2.5-0.5B
+11.84 APdp for positive descriptions
8.7% extra GFLOPs in the efficient shared-encoder setting
Latent knowledge route

MLLM early decoder states (vision-rich semantics) → LED adapter (zero-gated cross-attention) → detector grounding decoder (box prediction). No synthetic labels: only the fusion path is trained, and the detector's geometry is preserved.

Use the representation, not the generated annotation.

Many OVD pipelines improve detectors by asking large foundation models to create pseudo labels, detailed descriptions, or negative examples. LED takes a different route: it treats the MLLM hidden state itself as a semantic reservoir and fuses it into the detector while keeping the detector’s localization structure intact.

Proposition

Early MLLM decoder layers contain enough spatial and semantic information to strengthen free-form grounding, while deeper layers increasingly specialize toward text-side generation. LED extracts the useful part and inserts it through a stable adapter rather than replacing the detector.

I

Bypass prompt-crafted data

LED does not depend on manually designed synthetic-data prompts, reducing pipeline-specific bias and engineering cost.

II

Keep detection geometry

The detector remains responsible for box localization; MLLM states act as adaptation prompts rather than a replacement feature map.

III

Spend computation sparingly

Most gains come from the first few LLM layers, so the added overhead stays small compared with running a full MLLM as the detector.

A quiet adapter between two strong priors.

LED is composed of an MLLM branch for semantic prompt extraction, a mainstream open-vocabulary detector, and a zero-initialized cross-attention adapter. The zero gate lets training start from the original detector behavior and gradually admit MLLM knowledge only when it becomes useful.
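As a concrete illustration of the zero gate described above, here is a minimal, self-contained PyTorch sketch (not the repository's exact adapter): cross-attention from detector queries to MLLM-derived prompts is scaled by the tanh of a zero-initialized parameter, so at initialization the output equals the detector's original queries and MLLM knowledge is admitted only as the gate learns to open.

import torch
import torch.nn as nn

class ZeroGatedCrossAttention(nn.Module):
    """Cross-attention from detector queries to MLLM-derived prompts,
    scaled by a tanh gate that is initialized to zero."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # zero init: the adapter is silent at first

    def forward(self, queries: torch.Tensor, prompts: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(queries, prompts, prompts)
        # At step 0, tanh(0) = 0, so the detector behaves exactly as before fusion.
        return queries + torch.tanh(self.gate) * attended

# Example: 900 detector queries attend to 64 adaptation prompts (256-dim features).
queries = torch.randn(2, 900, 256)
prompts = torch.randn(2, 64, 256)
fused = ZeroGatedCrossAttention(dim=256)(queries, prompts)  # equals `queries` at init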

1

Shared image-text input

The MLLM and detector observe the same image and natural-language grounding query.

2

Early hidden states

Intermediate LLM decoder states are truncated into visual and textual token segments.

3

Adaptation prompts

A lightweight convolution and cross-attention pathway projects MLLM states into detector space.

4

Grounded boxes

The detector decoder receives semantic guidance while retaining its localization bias.

Hidden-state extraction

LED extracts the hidden state from decoder layer \(\ell_{LM}\), then truncates it into visual and textual token groups.

\(H_{\ell} = \mathrm{Decoder}_{\ell}(H_{\ell-1})\)
\(E_{VL},\ E_{T} = \mathrm{Truncate}(H_{\ell_{LM}})\)
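A sketch of how an intermediate hidden state could be pulled from a Hugging Face decoder via output_hidden_states and split into token groups. The text-only Qwen2-0.5B stand-in, the layer index, and num_vision_tokens are illustrative assumptions, not the repository's extraction code.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-0.5B"   # text-only stand-in for the MLLM decoder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_lm = 2            # early decoder layer, where LED taps the hidden state
num_vision_tokens = 64  # assumed length of the visual token segment

inputs = tokenizer("a person riding a red bicycle", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; index layer_lm is that decoder layer's output.
h = out.hidden_states[layer_lm]   # (batch, seq_len, hidden_dim)

# In the real MLLM the sequence starts with projected image tokens followed by the
# grounding query, so this split is only illustrative for the text-only stand-in.
e_vl, e_t = h[:, :num_vision_tokens], h[:, num_vision_tokens:]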

Stable fusion

The adapter starts with a zero-initialized gate, preventing randomly initialized prompts from disturbing the detector at the beginning of training.

\(E_{D}^{\ell} = E_{D}^{\ell-1} + \mathrm{Linear}(\hat{S}\,V)\)
\(\hat{S} = \mathrm{concat}\big(\tanh(g)\cdot\mathrm{softmax}(S_{\mathrm{prompt}}),\ \mathrm{softmax}(S_{\mathrm{det}})\big)\)
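A self-contained sketch that mirrors the two equations above and refines the generic gated adapter shown earlier: prompt and detector attention scores are softmaxed separately, the prompt branch is scaled by tanh(g) with g initialized to zero, the concatenated scores weight the concatenated values, and a linear layer adds the result back as a residual. Shapes and projection layout are assumptions for illustration.

import torch
import torch.nn as nn

class GatedFusionLayer(nn.Module):
    """E_D^l = E_D^{l-1} + Linear(S_hat V), with
    S_hat = concat(tanh(g) * softmax(S_prompt), softmax(S_det))."""
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_prompt, self.k_det = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.v_prompt, self.v_det = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-initialized gate g
        self.scale = dim ** -0.5

    def forward(self, e_d, prompts, det_tokens):
        q = self.q_proj(e_d)                                                    # (B, N, d)
        s_prompt = (q @ self.k_prompt(prompts).transpose(-2, -1)) * self.scale  # (B, N, P)
        s_det = (q @ self.k_det(det_tokens).transpose(-2, -1)) * self.scale     # (B, N, M)
        # Gate only the prompt branch, then concatenate with the detector branch.
        s_hat = torch.cat([torch.tanh(self.gate) * s_prompt.softmax(-1),
                           s_det.softmax(-1)], dim=-1)                          # (B, N, P+M)
        v = torch.cat([self.v_prompt(prompts), self.v_det(det_tokens)], dim=1)  # (B, P+M, d)
        return e_d + self.out(s_hat @ v)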

The gain concentrates where semantics matter.

Category-style detection is already strong in specialized detectors. LED’s largest improvements appear on description-grounded and relationship-heavy queries, where semantic composition, attributes, and spatial language become the bottleneck.

OmniLabel · Qwen2.5-0.5B: 28.03 APall (+6.34 over G-DINO)
OmniLabel · Qwen2.5-0.5B: 36.45 APdp (+11.84 on positive descriptions)
OVDEval · relationship: 30.9 AP (+3.7 on relationship grounding)

Why not simply replace the detector?

The supplement shows that raw MLLM hidden states cannot serve as a drop-in visual detector feature. They are semantically rich but not geometrically aligned for localization. LED therefore inserts a controlled fusion mechanism and leaves box prediction to the detector.

Paper tables, rendered as interactive evidence.

Select a benchmark family to inspect the reported values. Tables can be sorted, and the main OmniLabel panel can be re-plotted by metric to reveal where each adaptation source helps most.

OmniLabel model comparison

GroundingDINO vs. LED adaptation prompts from different LLM families.

Referring expression comprehension

LED improves RefCOCO, RefCOCO+, and RefCOCOg splits, especially with InternLM2.5-7B hidden states.

D3 and OVDEval sub-categories

Using Qwen2-0.5B early hidden states, LED strengthens description and relationship-heavy settings.

Adapter architecture analysis

Text-free adaptation (Arch. IV) gives the best overall score in the reported OmniLabel adapter study.

Comparison with synthetic-data pipelines

On OmniLabel, LED with InternVL2-1B surpasses NEG-Text and DesCo without generating additional curated labels.

Small LLM slices can be more useful than large detectors.

LED uses the first two layers of a lightweight Qwen2-0.5B branch in its efficient setting. The adapter itself is tiny; most of the extra compute comes from the shallow LLM slice, and the reported overhead is only 8.7% relative to GroundingDINO.
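A rough way to materialize that shallow slice with Hugging Face transformers, assuming the Qwen2-0.5B checkpoint; keeping only the first two decoder layers follows the efficient setting described above, while the attribute path and parameter count below are illustrative rather than the paper's measurement protocol.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")

# Keep only the first two decoder layers, mirroring the efficient setting.
model.model.layers = model.model.layers[:2]
model.config.num_hidden_layers = 2  # keep the config consistent with the slice

n_params = sum(p.numel() for p in model.parameters())
print(f"parameters in the 2-layer slice (plus embeddings): {n_params / 1e6:.1f}M")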

Compute decomposition

Additional parameters and GFLOPs relative to G-DINO.

APd vs. GFLOPs

Bubble size encodes parameter count. Key points are annotated; hover over each bubble for exact values.

SOTA detectors and MLLMs on OmniLabel

Sortable table from the paper’s comparison with efficient detectors and large MLLM baselines.

From repository to training run.

The codebase builds on GroundingDINO and trains the LED adapter with mixed OD/VG data. The commands below mirror the repository's installation steps and keep the blog focused on reproducing the detection pipeline.

Install

cd GroundingDINO
pip install -r requirements.txt

cd models/GroundingDINO/ops
python setup.py build install   # compile the custom CUDA ops (multi-scale deformable attention)
python test.py                  # sanity-check the compiled ops

Prepare OD/VG data

# convert COCO instance annotations to the ODVG jsonl format
python tools/coco2odvg.py \
  --image-root path/coco_2017/train2017 \
  --anno-file path/coco_2017/annotations/instances_train2017.json \
  --out-jsonl path/coco_2017/annotations/coco2017_train_odvg.jsonl

Configure mixed dataset

{
  "train": [
    {"root": "path/Objects365/", "dataset_mode": "odvg"},
    {"root": "path/coco_2017/train2017/", "dataset_mode": "odvg"},
    {"root": "path/flickr30k/images/", "dataset_mode": "odvg"}
  ],
  "val": [{"root": "path/coco_2017/val2017", "dataset_mode": "coco"}]
}

Train

bash train.sh
# set --pretrained_path and --dataset_cfg
# according to your local environment

Paper, code, and citation.

@misc{zhou2025led,
  title        = {LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation},
  author       = {Zhou, Yang and Zhao, Shiyu and Chen, Yuxiao and Wang, Zhenting and Jin, Can and Zhao, Mingyu and Metaxas, Dimitris N.},
  year         = {2025},
  eprint       = {2503.13794},
  archivePrefix= {arXiv},
  primaryClass = {cs.CV},
  url          = {https://arxiv.org/abs/2503.13794}
}