Combining DINO with Grounded Pre-Training can improve performance in Open-Set Object Detection

Chinese researchers report that combining DINO with Grounded Pre-Training can improve performance in Open-Set Object Detection

Grounding DINO is an open-set object detector that uses language to detect arbitrary objects from human inputs such as category names or referring expressions. The model builds on DINO, a transformer-based detector, and incorporates multi-level text information through grounded pre-training. The authors introduce a tight fusion solution made up of a feature enhancer, language-guided query selection, and a cross-modality decoder for effective cross-modality fusion. They also extend the evaluation of open-set object detection to referring expression comprehension datasets.

Grounding DINO outperforms competitors on existing benchmarks for closed-set detection, open-set detection, and referring object detection. The authors highlight the advantages of transformer-based detectors for open-set detection and advocate more feature fusion in the pipeline to improve performance. The proposed model has potential applications in generative models for image editing. Researchers working on multi-modal learning, open-set object detection, and transfer learning will find this paper relevant.

“Grounded pre-training” refers to a process in which a model is trained on both visual and textual information to establish a strong connection between the two modalities. In this study, the authors train their model on a large dataset that includes both images and associated textual descriptions. This training lets the model learn representations that capture the relationships between visual features and textual information. Because its understanding of language is grounded in the visual domain, the model is better equipped to comprehend and detect objects based on textual cues.

Details of the model

The starting point is DINO, a transformer-based detector. Building on it, the authors developed Grounding DINO, an open-set object detector that integrates multi-level text information through grounded pre-training and uses language to detect arbitrary objects from human inputs such as category names or referring expressions. The model consists of several components for cross-modality fusion: a feature enhancer, language-guided query selection, and a cross-modality decoder.
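To make the idea of language-guided query selection more concrete, here is a minimal sketch under stated assumptions. It assumes that each image token is scored by its highest dot-product similarity to any text token and that the top-scoring image tokens are kept as decoder queries. The function names, the toy feature vectors, and the pure-Python implementation are all illustrative and are not taken from the authors' code.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def select_queries(image_feats, text_feats, num_queries):
    """Return indices of the image tokens most similar to any text token.

    Assumption: similarity is a plain dot product, and each image token
    is scored by its best match over all text tokens.
    """
    # Score each image token by its best match against the text tokens.
    scores = [max(dot(img, txt) for txt in text_feats) for img in image_feats]
    # Rank token indices by score, best first, and keep the top ones.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_queries]


# Toy example: four image tokens and two text tokens, all 2-dimensional.
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
text_feats = [[1.0, 0.0], [0.9, 0.1]]
selected = select_queries(image_feats, text_feats, num_queries=2)
print(selected)  # indices of the two image tokens best aligned with the text
```

In a real detector the features would be high-dimensional tensors and the selection would run in batches on an accelerator. The sketch only shows the ranking logic that lets the text steer which image regions become queries.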

To train the model, the authors used several types of data: object detection data from the COCO, O365, and OpenImage datasets; grounding data from GoldG and RefC; and caption data. To simulate different text inputs, category names were randomly sampled during training. They trained two model variants: Grounding-DINO-T, with Swin-T as the image backbone, and Grounding-DINO-L, with Swin-L. The text backbone was BERT-base from Hugging Face. More implementation details can be found in the appendix of the paper.
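The random sampling of category names can be illustrated with a short sketch. It assumes that a text prompt is built by joining the categories present in an image with a few randomly drawn absent ones, shuffled and separated by " . ". The helper name `build_prompt`, the separator, and the number of negatives are hypothetical choices for illustration, not details confirmed by this summary.

```python
import random


def build_prompt(positive, all_categories, num_negatives, rng):
    """Build a text prompt from present categories plus random distractors.

    Assumption: names are joined with " . " and the prompt ends with " .".
    """
    # Categories not present in the image are candidates for negatives.
    pool = [c for c in all_categories if c not in positive]
    negatives = rng.sample(pool, min(num_negatives, len(pool)))
    names = list(positive) + negatives
    # Shuffle so the model cannot rely on the position of a name.
    rng.shuffle(names)
    return " . ".join(names) + " ."


rng = random.Random(0)  # fixed seed so the example is reproducible
categories = ["cat", "dog", "car", "tree", "person"]
prompt = build_prompt(["cat", "person"], categories, num_negatives=2, rng=rng)
print(prompt)
```

Varying the sampled names and their order from step to step exposes the model to many different text inputs for the same image, which is the effect the training setup aims for.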

To evaluate the model, they conducted extensive experiments in three settings: a closed-set setting on the COCO detection benchmark; an open-set setting on the zero-shot COCO, LVIS, and ODinW datasets; and a referring detection setting on the RefCOCO/+/g datasets. They also performed ablation studies to verify the effectiveness of their model design. Additionally, they explored transferring a pre-trained DINO model to Grounding DINO to reduce training costs.

Performance of the model

The experiments showed that Grounding DINO outperformed competitors on existing benchmarks for closed-set detection, open-set detection, and referring object detection. The authors highlighted the advantages of transformer-based detectors for open-set detection and argued for more feature fusion in the pipeline to achieve better performance. They also mentioned that the proposed model has potential applications in generative models for image editing.