Bypass prompt-crafted data
LED does not depend on manually designed synthetic-data prompts, reducing pipeline-specific bias and engineering cost.
LED introduces a knowledge-fusion paradigm for open-vocabulary detection: instead of asking foundation models to generate labels or descriptions, it directly injects intermediate MLLM hidden states into a detector decoder through a lightweight zero-initialized cross-attention adapter.
Many OVD pipelines improve detectors by asking large foundation models to create pseudo labels, detailed descriptions, or negative examples. LED takes a different route: it treats the MLLM hidden state itself as a semantic reservoir and fuses it into the detector while keeping the detector’s localization structure intact.
Early MLLM decoder layers contain enough spatial and semantic information to strengthen free-form grounding, while deeper layers increasingly specialize toward text-side generation. LED extracts the useful part and inserts it through a stable adapter rather than replacing the detector.
The detector remains responsible for box localization; MLLM states act as adaptation prompts rather than a replacement feature map.
Most gains come from the first few LLM layers, enabling a small overhead relative to running a full MLLM detector.
LED is composed of an MLLM branch for semantic prompt extraction, a mainstream open-vocabulary detector, and a zero-initialized cross-attention adapter. The zero gate lets training start from the original detector behavior and gradually admit MLLM knowledge only when it becomes useful.
The MLLM and detector observe the same image and natural-language grounding query.
Intermediate LLM decoder states are truncated into visual and textual token segments.
A lightweight convolution and cross-attention pathway projects MLLM states into detector space.
The detector decoder receives semantic guidance while retaining its localization bias.
LED extracts the hidden state from decoder layer \(\ell_{LM}\), then truncates it into visual and textual token groups.
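As a concrete illustration, here is a minimal sketch of that truncation step. The helper function and the assumption that visual tokens precede the text tokens in the MLLM sequence are illustrative choices, not the paper's exact configuration; only the slicing logic mirrors the description above.

```python
import torch

def split_mllm_state(hidden_state: torch.Tensor, num_visual_tokens: int):
    """Truncate an intermediate MLLM hidden state into visual/textual segments.

    Assumes the MLLM lays out projected visual tokens first, followed by the
    text tokens of the grounding query (a common MLLM convention; verify for
    the specific model).
    """
    h_visual = hidden_state[:, :num_visual_tokens]  # (B, N_v, D)
    h_text = hidden_state[:, num_visual_tokens:]    # (B, N_t, D)
    return h_visual, h_text

# Hypothetical usage with a Hugging Face causal LM:
#   out = mllm(**inputs, output_hidden_states=True)
#   h = out.hidden_states[ell_lm]   # hidden state after decoder layer ell_lm
#   h_v, h_t = split_mllm_state(h, num_visual_tokens)
```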
The adapter starts with a zero-initialized gate, preventing randomly initialized prompts from disturbing the detector at the beginning of training.
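A PyTorch sketch of such a zero-gated adapter follows. The dimensions, the tanh gating, and the single-layer projection are assumptions for illustration; the point is to show how a zero-initialized gate preserves the original detector behavior at the start of training.

```python
import torch
import torch.nn as nn

class ZeroInitCrossAttentionAdapter(nn.Module):
    """Illustrative zero-gated cross-attention adapter (dims are placeholders)."""

    def __init__(self, det_dim: int = 256, mllm_dim: int = 896, num_heads: int = 8):
        super().__init__()
        # Lightweight projection of MLLM tokens into detector space, echoing
        # the convolution + cross-attention pathway described above.
        self.proj = nn.Conv1d(mllm_dim, det_dim, kernel_size=1)
        self.attn = nn.MultiheadAttention(det_dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-initialized gate

    def forward(self, queries: torch.Tensor, mllm_tokens: torch.Tensor):
        # queries:     (B, N_q, det_dim)  detector decoder queries
        # mllm_tokens: (B, N_m, mllm_dim) truncated MLLM hidden states
        kv = self.proj(mllm_tokens.transpose(1, 2)).transpose(1, 2)
        attn_out, _ = self.attn(queries, kv, kv)
        # tanh(0) = 0 at init, so training starts from the unmodified detector.
        return queries + torch.tanh(self.gate) * attn_out
```

Because the gated branch contributes nothing at step zero, gradients grow the gate only where the MLLM prompt actually reduces the loss, which is exactly the "admit knowledge only when it becomes useful" behavior described above.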
Category-style detection is already strong in specialized detectors. LED’s largest improvements appear on description-grounded and relationship-heavy queries, where semantic composition, attributes, and spatial language become the bottleneck.
The supplement shows that raw MLLM hidden states cannot serve as a drop-in visual detector feature. They are semantically rich but not geometrically aligned for localization. LED therefore inserts a controlled fusion mechanism and leaves box prediction to the detector.
GroundingDINO vs. LED adaptation prompts from different LLM families.
LED improves RefCOCO, RefCOCO+, and RefCOCOg splits, especially with InternLM2.5-7B hidden states.
Using Qwen2-0.5B early hidden states, LED strengthens description and relationship-heavy settings.
Text-free adaptation (Arch. IV) gives the best overall score in the reported OmniLabel adapter study.
On OmniLabel, LED with InternVL2-1B surpasses NEG-Text and DesCo without generating additional curated labels.
LED uses the first two layers of a lightweight Qwen2-0.5B branch in its efficient setting. The adapter itself is tiny; most of the extra compute comes from the shallow LLM slice, and the reported overhead is only 8.7% relative to GroundingDINO.
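Reproducing that shallow slice is straightforward with Hugging Face transformers; a sketch, assuming the standard Qwen2 module layout in current transformers versions:

```python
from transformers import AutoModel

# Load the text backbone and keep only its first two decoder layers,
# mirroring LED's efficient setting. The attribute path below matches
# the usual Qwen2 layout in transformers; verify for your version.
llm = AutoModel.from_pretrained("Qwen/Qwen2-0.5B")
llm.layers = llm.layers[:2]
llm.config.num_hidden_layers = 2  # keep the config consistent with the slice
```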
Additional parameters and GFLOPs relative to G-DINO.
Bubble size encodes parameter count; key points are annotated.
Table from the paper's comparison with efficient detectors and large MLLM baselines.
The codebase builds on GroundingDINO and trains the LED adapter with mixed OD/VG data. The commands below mirror the repository’s installation path and keep the blog focused on reproducing the detection pipeline.
cd GroundingDINO
pip install -r requirements.txt
# Build and sanity-check the custom CUDA ops (multi-scale deformable attention)
cd models/GroundingDINO/ops
python setup.py build install
python test.py
# Convert COCO detection annotations into the ODVG format used for training
python tools/coco2odvg.py \
    --image-root path/coco_2017/train2017 \
    --anno-file path/coco_2017/annotations/instances_train2017.json \
    --out-jsonl path/coco_2017/annotations/coco2017_train_odvg.jsonl
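The script writes one JSON record per line. For orientation, an ODVG-style detection record has roughly the following shape (field names follow the Open-GroundingDino ODVG convention, shown pretty-printed here; check the script's actual output for your version):

```json
{
  "filename": "000000391895.jpg",
  "height": 360,
  "width": 640,
  "detection": {
    "instances": [
      {"bbox": [359.2, 146.2, 471.6, 359.7], "label": 3, "category": "motorcycle"}
    ]
  }
}
```

A dataset config then mixes the converted ODVG files with a COCO-format validation split: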
{
  "train": [
    {"root": "path/Objects365/", "dataset_mode": "odvg"},
    {"root": "path/coco_2017/train2017/", "dataset_mode": "odvg"},
    {"root": "path/flickr30k/images/", "dataset_mode": "odvg"}
  ],
  "val": [
    {"root": "path/coco_2017/val2017", "dataset_mode": "coco"}
  ]
}
# Set --pretrained_path and --dataset_cfg according to your local environment.
bash train.sh
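Depending on how train.sh forwards its arguments, a local run might look like this (both paths are placeholders for your environment):

```bash
# Hypothetical invocation; adjust to how your train.sh consumes flags.
bash train.sh \
    --pretrained_path checkpoints/groundingdino_swint_ogc.pth \
    --dataset_cfg config/datasets_mixed_odvg.json
```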
LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation.
Code: Implementation, training scripts, dataset preparation, and GroundingDINO integration.
Contact: Yang Zhou, Shiyu Zhao, Yuxiao Chen, Zhenting Wang, Can Jin, Mingyu Zhao, Dimitris N. Metaxas.
@misc{zhou2025led,
title = {LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation},
author = {Zhou, Yang and Zhao, Shiyu and Chen, Yuxiao and Wang, Zhenting and Jin, Can and Zhao, Mingyu and Metaxas, Dimitris N.},
year = {2025},
eprint = {2503.13794},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2503.13794}
}