Object detection has become the backbone of many important products: safety systems that know when hands are near dangerous machinery, retail analytics that count products and people, autonomous vehicles, warehouse robots, ergonomic assessment tools, and more. Traditionally, those systems all shared one big assumption: you decide up front which objects matter, hard-code that label set, and then spend a lot of time, money, and human effort annotating data for those classes.

Vision-language models (VLMs) and open-vocabulary object detectors (OVDs) remove this assumption. Instead of baking labels into the weights, you pass them in as prompts: “red mug”, “overhead luggage bin”, “safety helmet”, “tablet on the desk”. And surprisingly, the best of these models now match or even beat strong closed-set detectors on standard benchmarks without ever training on those benchmarks’ labels.

In my day job, I work on real-time, on-device computer vision for ergonomics and workplace safety: think iPads or iPhones checking posture, reach, and PPE in warehouses and aircraft cabins. For a long time, every new “Can we detect X?” request meant another round of data collection, labeling, and retraining. When we started experimenting with open-vocabulary detectors, the workflow flipped: we could prompt for new concepts, see if the signals looked promising in real video, and only then decide whether it was worth investing in a dedicated closed-set model.

This article walks through:

  1. why object detection matters and what closed-set detectors assume,
  2. a quick history from hand-crafted features to deep closed-set detectors,
  3. why fixed label sets hurt in production,
  4. how open-vocabulary detection works and how it compares on COCO and LVIS,
  5. its limitations, and
  6. a practical recipe: open-vocabulary models as annotators, closed-set models as the deployed workers.


1. Object detection 101: what and why?

Object detection tries to answer two questions for each image (or video frame):

  1. What is in the scene? (class labels)
  2. Where is it? (bounding boxes, sometimes masks)

Unlike plain image classification (one label per image), detection says “two people, one laptop, one chair, one cup” with coordinates. That’s what makes it useful for:

  • safety monitoring (hands near dangerous machinery, missing PPE),
  • retail and warehouse analytics (counting products and people),
  • ergonomic assessment (posture and reach at workstations), and
  • autonomous vehicles and warehouse robots.

In traditional pipelines, the object catalog (your label set) is fixed, for example, 80 COCO classes, or 1,203 LVIS classes. Adding “blue cardboard box”, “broken pallet”, or a specific SKU later is where things start to hurt.


2. A very quick history: from HOG to deep nets

2.1 Pre-deep learning: HOG, DPM, Regionlets

Before deep learning, detectors used hand-crafted features like HOG (Histograms of Oriented Gradients) and part-based models. You’d slide a window over the image, compute features, and run a classifier.

Two representative classical systems on PASCAL VOC 2007:

  • DPM (Deformable Part Models, voc-release5, no context): 33.7% mAP@0.5
  • Regionlets: 41.7% mAP@0.5

VOC 2007 has 5,011 train+val images and 4,952 test images (9,963 total).

2.2 Deep learning arrives: R-CNN, Fast/Faster R-CNN

Then came CNNs:

  • R-CNN replaced hand-crafted features with CNN features computed on region proposals.
  • Fast R-CNN (VGG16, trained on 07+12): 70.0% mAP@0.5 on VOC 2007 test.
  • Faster R-CNN (VGG16, 07+12), which adds a learned Region Proposal Network: 73.2% mAP@0.5.

The 07+12 setup uses VOC 2007 trainval (5,011 images) + VOC 2012 trainval, giving about 16.5k training images.

So on the same dataset, going from hand-crafted to CNNs roughly doubled performance:

Table 1 – Classical vs deep detectors on PASCAL VOC 2007

| Dataset | Model | # training images (VOC) | mAP @ 0.5 |
| --- | --- | --- | --- |
| VOC 2007 test | DPM voc-release5 (no context) | 5,011 (VOC07 trainval) | 33.7% |
| VOC 2007 test | Regionlets | 5,011 (VOC07 trainval) | 41.7% |
| VOC 2007 test | Fast R-CNN (VGG16, 07+12) | ≈16.5k (VOC07+12) | 70.0% |
| VOC 2007 test | Faster R-CNN (VGG16, 07+12) | ≈16.5k (VOC07+12) | 73.2% |

That’s the story we’ve been telling for a decade: deep learning crushed classical detection.

But all of these are closed-set: you pick a fixed label list, and the model can’t recognize anything outside it.


3. Why closed-set deep detectors are painful in production

Closed-set detectors (Faster R-CNN, YOLO, etc.) are great if:

  • your label set is known up front and rarely changes,
  • you can afford to collect and annotate enough data for every class, and
  • you want small, fast, predictable models for edge deployment.

In practice, especially in enterprise settings:

  • the label set keeps changing: new SKUs, new hazards, new “Can we detect X?” requests,
  • annotation is slow and expensive, especially for rare, long-tail classes, and
  • every new class restarts the collect–label–retrain cycle.

Technically, closed-set detectors are optimized for one label space: the classification head has a fixed number of outputs (80 for COCO, 1,203 for LVIS), adding a class means changing that head and retraining, and anything outside the list simply cannot be predicted.
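
To make “one label space” concrete, here is a toy sketch (not any particular detector’s code) of the kind of classification head a closed-set detector like Fast/Faster R-CNN learns; the feature dimension and class count are placeholders.

```python
import torch.nn as nn

NUM_CLASSES = 80  # e.g. the COCO label set, frozen at training time

class ClosedSetHead(nn.Module):
    """Toy head of a closed-set detector: one logit per pre-defined class
    (plus background). Adding class 81 means changing out_features and
    retraining; there is no way to ask about a new label at inference time."""

    def __init__(self, feat_dim: int = 1024):
        super().__init__()
        self.cls = nn.Linear(feat_dim, NUM_CLASSES + 1)   # +1 for background
        self.box = nn.Linear(feat_dim, 4 * NUM_CLASSES)   # per-class box regression

    def forward(self, region_feats):
        return self.cls(region_feats), self.box(region_feats)
```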

This is where open-vocabulary detectors and vision-language models become interesting.


4. Open-vocabulary object detection: prompts instead of fixed labels

Open-vocabulary detectors combine two ideas:

  1. Vision backbone – a detector/transformer that proposes regions and extracts region features.
  2. Language backbone – text encoder (often CLIP-style) that turns prompts like “red cup” or “overhead bin” into embeddings.

Instead of learning a classifier over a fixed set of one-hot labels, the detector learns to align region features and text embeddings in a shared space. At inference time, you can pass any string: “steel toe boot”, “forklift”, “wrench”, “coffee stain”, and the model scores regions against those text prompts.
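
To make the alignment idea concrete, here is a toy sketch of the scoring step, assuming you already have region features and prompt embeddings in the same space; it illustrates the idea rather than any particular model’s implementation, and the tensors below are random placeholders.

```python
import torch
import torch.nn.functional as F

def score_regions_against_prompts(region_feats: torch.Tensor,
                                  prompt_embeds: torch.Tensor,
                                  threshold: float = 0.3):
    """Toy open-vocabulary scoring: cosine similarity between
    region features (N x D) and prompt embeddings (K x D)."""
    r = F.normalize(region_feats, dim=-1)   # normalize so dot product = cosine
    t = F.normalize(prompt_embeds, dim=-1)
    sim = r @ t.T                           # (N, K) region-vs-prompt similarities
    best_score, best_prompt = sim.max(dim=1)
    keep = best_score > threshold           # drop regions that match nothing well
    return best_prompt[keep], best_score[keep]

# Random tensors standing in for a real vision backbone and text encoder:
regions = torch.randn(100, 512)   # 100 candidate regions
prompts = torch.randn(4, 512)     # "steel toe boot", "forklift", "wrench", "coffee stain"
labels, scores = score_regions_against_prompts(regions, prompts)
```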

Examples:

  • Grounding DINO – a transformer detector that grounds free-form text prompts in image regions.
  • YOLO-World – a real-time, YOLO-style open-vocabulary detector.
  • Grounding DINO 1.5 Edge – an edge-optimized variant of Grounding DINO.

These models are usually pre-trained on millions of image–text pairs from the web, then sometimes fine-tuned on detection datasets with large vocabularies.
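
As a concrete example of prompting, here is a minimal inference sketch using the Grounding DINO integration in Hugging Face transformers. The checkpoint name, image path, and prompt phrases are assumptions to adapt to your own setup, and the post-processing keys can differ slightly between transformers versions.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"     # assumed checkpoint; swap for your own
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("workstation.jpg")              # any local image
# Grounding DINO expects lowercase phrases separated by periods.
text = "safety helmet. forklift. cardboard box on the floor."

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Turn raw outputs into boxes, scores, and the matched phrases.
results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, target_sizes=[image.size[::-1]]
)[0]
phrases = results.get("text_labels", results["labels"])  # key name varies by version
for box, score, phrase in zip(results["boxes"], results["scores"], phrases):
    print(f"{phrase}: {score:.2f} at {[round(v, 1) for v in box.tolist()]}")
```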

Visual comparison: promptable Grounding DINO vs. closed-set Fast R-CNN

In the side-by-side image below, the open-vocabulary Grounding DINO model is prompted with fine-grained phrases like “armrests,” “mesh backrest,” “seat cushion,” and “chair,” and it correctly identifies each region, not just the overall object. This works because Grounding DINO connects image regions with text prompts during inference, enabling it to recognize categories that weren’t in its original training list. In contrast, the closed-set Fast R-CNN model is trained on a fixed set of categories (such as those in the PASCAL VOC or COCO label space), so it can only detect the broader “chair” class and misses the finer parts. This highlights the real-world advantage of promptable detectors: they can adapt to exactly what you ask for without retraining, while still maintaining practical performance. It also shows why open-vocabulary models are so promising for dynamic environments where new items, parts, or hazards appear regularly.

Promptable vs. closed-set detection on the same scene. Grounding DINO (left) identifies armrests, mesh backrest, seat cushion, and the overall chair; Fast R-CNN (right) detects only the chair. Photo © 2025 Balaji Sundareshan; original photo by the author.


5. Benchmarks: closed-set vs open-vocabulary on COCO

Let’s look at COCO 2017, the standard 80-class detection benchmark. COCO train2017 has about 118k training images and 5k val images.

A strong closed-set baseline is EfficientDet-D7: trained on COCO train2017 (118k images), it reaches 52.2 AP on COCO test-dev.

Now compare that to Grounding DINO: with zero COCO training images it scores 52.5 AP zero-shot, and 63.0 AP once fine-tuned on COCO train2017.

Table 2 – COCO closed-set vs open-vocabulary

| Dataset | Model | # training images from COCO | AP@[0.5:0.95] |
| --- | --- | --- | --- |
| COCO 2017 test-dev | EfficientDet-D7 (closed-set) | 118k (train2017) | 52.2 |
| COCO det. (zero-shot) | Grounding DINO (open-vocab, zero-shot) | 0 (no COCO data) | 52.5 |
| COCO det. (supervised) | Grounding DINO (fine-tuned) | 118k (train2017) | 63.0 |

You can fairly say:

An open-vocabulary detector, trained on other data, matches a COCO-specific SOTA detector on COCO, and then beats it once you fine-tune.

That’s a strong argument for reusability: with OVDs, you get decent performance on new domains without painstaking dataset-specific labeling.

In our own experiments on an office ergonomics product, we’ve seen a similar pattern: a promptable detector gets us to a usable baseline quickly, and a small fine-tuned model does the heavy lifting in production.


6. Benchmarks on LVIS: long-tail, large vocabulary

COCO has 80 classes. LVIS v1.0 is more realistic for enterprise: ~100k train images, ~20k val, and 1,203 categories with a long-tailed distribution.

6.1 Closed-set LVIS

The Copy-Paste paper benchmarks strong instance segmentation / detection models on LVIS v1.0. With EfficientNet-B7 NAS-FPN and a two-stage training scheme, they report 41.6 box AP on the LVIS v1.0 val set.

Another line of work, Detic, hits 41.7 mAP on the standard LVIS benchmark across all classes, using LVIS annotations plus additional image-level labels.

6.2 Zero-shot open-vocabulary on LVIS

Two representative OVDs:

  • YOLO-World: 35.4 AP zero-shot on LVIS, with no LVIS training images.
  • Grounding DINO 1.5 Edge: 36.2 AP zero-shot on LVIS-minival.

These models use no LVIS training images: they rely on large-scale pre-training with grounding annotations and text labels, and are then evaluated on LVIS as a new domain.

Table 3 – LVIS: closed-set vs open-vocabulary

| Dataset / split | Model | # training images from LVIS | AP (box) |
| --- | --- | --- | --- |
| LVIS v1.0 (val) | Eff-B7 NAS-FPN + Copy-Paste (closed-set) | 100k (LVIS train) | 41.6 |
| LVIS v1.0 (all classes) | Detic (open-vocab-friendly, LVIS-trained) | 100k (LVIS train) | 41.7 |
| LVIS v1.0 (zero-shot) | YOLO-World (open-vocab, zero-shot) | 0 (no LVIS data) | 35.4 |
| LVIS-minival (zero-shot) | Grounding DINO 1.5 Edge (open-vocab, edge-optimized) | 0 (no LVIS data) | 36.2 |

The takeaway:

On LVIS, the best open-vocabulary detectors reach ~35–36 AP in pure zero-shot mode, not far behind strong closed-set models in the low-40s AP that use 100k fully annotated training images.

That’s a powerful trade-off story for enterprises: ~10k+ human hours of annotation vs zero LVIS labels for a ~5–6 AP gap.

In one of our internal pilots, we used an open-vocab model to sweep through a few hundred hours of warehouse video with prompts like “forklift”, “ladder”, and “cardboard boxes on the floor.” The raw detections were noisy, but they gave our annotators a huge head start: instead of hunting for rare events manually, they were editing candidate boxes. Those curated labels were then distilled into a compact closed-set model we could actually ship on edge hardware, a model that only exists because the open-vocab detector gave us a cheap way to explore the long tail.


7. Limitations of open-vocabulary detection

Open-vocabulary detectors aren’t magic. They introduce new problems:

  1. Prompt sensitivity & hallucinations
    • “cup” vs “mug” vs “coffee cup” can change detections.
    • If you prompt with something that isn’t there (“giraffe” in an office), the model may still confidently hallucinate boxes.
  2. Calibration & thresholds
    • Scores aren’t always calibrated across arbitrary text prompts, so you may need prompt-specific thresholds or re-scoring (see the sketch after this list).
  3. Latency & compute
    • Foundation-scale models (big backbones, large text encoders) can be heavy for edge devices.
    • YOLO-World (35.4 AP at ~52 FPS) and Grounding DINO 1.5 Edge (36.2 AP at ~75 FPS) show this is improving, but you’re still in GPU/accelerator territory.
  4. Governance & safety
    • Because they’re text-driven, you have to think about who controls the prompts and how to log/approve them in safety-critical systems.
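
To illustrate point 2 above, here is a minimal sketch of per-prompt thresholding; the prompt strings and threshold values are placeholders you would tune on a small validation set, not recommended defaults.

```python
# Per-prompt score thresholds: a crude but practical workaround for the fact
# that open-vocabulary scores are not calibrated across arbitrary text prompts.
PROMPT_THRESHOLDS = {
    "forklift": 0.45,
    "safety helmet": 0.35,
    "cardboard box on the floor": 0.25,   # longer phrases often score lower
}
DEFAULT_THRESHOLD = 0.40

def filter_detections(detections):
    """Keep detections whose score clears the threshold for their prompt.

    `detections` is assumed to be a list of dicts like
    {"prompt": str, "score": float, "box": [x1, y1, x2, y2]}.
    """
    kept = []
    for det in detections:
        threshold = PROMPT_THRESHOLDS.get(det["prompt"], DEFAULT_THRESHOLD)
        if det["score"] >= threshold:
            kept.append(det)
    return kept
```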

So while OVDs are amazing for exploration, prototyping, querying, and rare-class detection, you might not always want to ship them directly to every edge device.


8. A practical recipe: OVD as annotator, closed-set as worker

A pattern that makes sense for many enterprises:

  1. Use an open-vocabulary detector as a “labeling assistant”
    • Run Grounding DINO / YOLO-World over your video/image streams with prompts like “pallet”, “fallen pallet”, “phone in hand”, “ladder”.
    • Let your annotators edit rather than draw boxes from scratch.
    • This creates a large, high-quality, task-specific labeled dataset cheaply (a minimal sketch of this step follows the list below).
  2. Train a lean closed-set detector
    • Define the final label set you actually need in production.
    • Train an EfficientDet / YOLO / RetinaNet / lightweight transformer on your auto-bootstrapped dataset.
    • You now get fast, small, hardware-friendly models that are easy to deploy on edge devices (iPads, Jetsons, on-prem boxes).
  3. Iterate by “querying” the world with prompts
    • When the product team asks, “Can we also track X?”, you don’t need to re-instrument hardware:
      • First, run an OVD with new prompts to mine candidate instances of X.
      • Curate + clean those labels.
      • Fine-tune or extend your closed-set detector with the new class.
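
Here is the sketch referenced in step 1: a hypothetical pre-annotation loop in which `detect` is a stand-in for whatever promptable detector you use (for example, the Grounding DINO snippet in section 4), and the output is a minimal COCO-style JSON that annotators edit instead of drawing boxes from scratch.

```python
import json
from pathlib import Path

PROMPTS = ["pallet", "fallen pallet", "phone in hand", "ladder"]

def detect(image_path, prompts):
    """Stand-in for your open-vocabulary detector. Expected to return dicts
    with 'prompt', 'score', and 'box' as [x, y, width, height]."""
    raise NotImplementedError("plug in your OVD of choice here")

def build_preannotations(image_dir, out_path, min_score=0.3):
    """Run the OVD over a folder of frames and dump COCO-style pre-annotations."""
    categories = [{"id": i, "name": p} for i, p in enumerate(PROMPTS)]
    images, annotations = [], []
    ann_id = 0
    for img_id, path in enumerate(sorted(Path(image_dir).glob("*.jpg"))):
        images.append({"id": img_id, "file_name": path.name})
        for det in detect(str(path), PROMPTS):
            if det["score"] < min_score:
                continue                      # low-confidence boxes only add noise
            annotations.append({
                "id": ann_id,
                "image_id": img_id,
                "category_id": PROMPTS.index(det["prompt"]),
                "bbox": det["box"],           # COCO convention: [x, y, width, height]
                "score": det["score"],        # lets annotators sort by confidence
            })
            ann_id += 1
    with open(out_path, "w") as f:
        json.dump({"images": images, "annotations": annotations,
                   "categories": categories}, f, indent=2)

# build_preannotations("frames/", "preannotations.json")
```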

This gives you the best of both worlds: the flexibility of prompts for exploration and bootstrapping, and the speed, size, and predictability of a closed-set model in production.


9. Where this leaves us

If you zoom out over the last 15 years:

  • hand-crafted detectors (HOG, DPM, Regionlets) topped out around 30–40% mAP on PASCAL VOC,
  • deep closed-set detectors roughly doubled that, but locked you into a fixed label set, and
  • open-vocabulary detectors now match strong closed-set baselines on COCO and come within a few AP on LVIS without using the target dataset’s labels.

The story for readers is simple:

If you’re building enterprise systems, it’s a good time to start treating prompts as the new label files and vision-language detectors as your first stop for exploration, before you commit to yet another closed-set training cycle.