Open-Vocabulary X-ray Prohibited Item Detection via Fine-tuning CLIP

Read original: arXiv:2406.10961 - Published 6/18/2024 by Shuyang Lin, Tong Jia, Hao Wang, Bowen Ma, Mingyuan Li, Dongyue Chen

Overview

This paper explores using the CLIP vision-language model to detect prohibited items in X-ray security scans, without needing to retrain the entire model from scratch.
The researchers propose a fine-tuning approach that leverages CLIP's open-vocabulary capabilities to recognize a wide range of prohibited items, including those not seen during training.
They evaluate their method on the PIXray and PIDray datasets and report better detection of novel categories than baseline open-vocabulary object detection methods.

Plain English Explanation

The researchers in this paper wanted to find a way to automatically detect prohibited items, like weapons or explosives, in X-ray security scans. Rather than building a completely new detection system from scratch, they decided to fine-tune an existing vision-language model called CLIP.

CLIP is an AI model trained on images paired with text, so it can recognize a wide variety of objects just from their names or descriptions. The researchers thought this open-vocabulary capability could be useful for detecting prohibited items, even ones the model hasn't seen before. However, X-ray images look very different from the general-purpose images CLIP was pre-trained on, and applying CLIP to them directly leads to a sharp drop in performance. So they fine-tuned CLIP specifically for detecting prohibited items in X-ray images.

By fine-tuning CLIP, instead of building a new detection system, the researchers were able to create a system that can spot a much broader range of prohibited items compared to traditional object detection approaches. This is important because security checkpoints need to be able to catch all kinds of prohibited items, not just the ones they've been specifically trained on.

The researchers tested their fine-tuned CLIP model on two X-ray datasets, PIXray and PIDray, and found that it detected novel categories of prohibited items better than baseline open-vocabulary detection methods. This suggests that using a versatile vision-language model like CLIP could be a promising approach for improving security screening technology.

Technical Explanation

The paper, titled "Open-Vocabulary X-ray Prohibited Item Detection via Fine-tuning CLIP", introduces a model the authors call OVXD. It starts from the pre-trained CLIP model, a vision-language model that embeds images and text in a shared space and can therefore recognize a wide range of objects from their names or descriptions.
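Recognizing objects from their names can be sketched as a similarity search between an image-region embedding and one text embedding per category name. The snippet below is a minimal illustration, not the paper's implementation: the function name `classify_open_vocab`, the temperature value, and the toy 3-dimensional vectors are all assumptions standing in for real CLIP embeddings.

```python
import numpy as np

def classify_open_vocab(region_emb, text_embs, labels, temperature=0.01):
    """Score one region embedding against one text embedding per label."""
    # L2-normalize so the dot product equals cosine similarity.
    r = region_emb / np.linalg.norm(region_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (t @ r) / temperature
    # Numerically stable softmax over the label set.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return labels[int(np.argmax(probs))], probs

# Toy stand-ins for text and region embeddings (hypothetical values).
labels = ["knife", "gun", "lighter"]
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
region_emb = np.array([0.2, 0.9, 0.1])

best, probs = classify_open_vocab(region_emb, text_embs, labels)
print(best, probs.round(3))
```

Because the label set is just a list of text embeddings, a new category can be added at inference time by appending its name and embedding, without retraining the scoring step.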

To adapt CLIP for prohibited item detection in X-ray scans, the researchers fine-tune it within a distillation-based open-vocabulary object detection (OVOD) framework. Specifically, they freeze the CLIP backbone and train only a small X-ray feature adapter, made up of three bottleneck-style adapter submodules, that maps X-ray images into CLIP's visual embedding space.

This fine-tuning approach lets the model keep CLIP's open-vocabulary capabilities while learning only the mapping from X-ray images to the CLIP embedding space. The researchers expect this to let the model recognize a broader range of prohibited items than closed-set detectors, which recognize only the categories they were trained on and typically need time-consuming, labor-intensive annotation before they can learn a new category.
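The paper's abstract describes the adapter's submodules as having a bottleneck architecture. A generic residual bottleneck adapter can be sketched as follows. This is an illustrative assumption, not the paper's design: the class name `BottleneckAdapter`, the dimensions, the ReLU, the residual connection, and the zero initialization of the up-projection are common choices for adapters in general, and where the paper places its three submodules is not covered here.

```python
import numpy as np

class BottleneckAdapter:
    """Down-project, apply a nonlinearity, up-project, add back the input."""

    def __init__(self, dim, bottleneck, rng):
        # Only these two small matrices would be trained; the backbone that
        # produces the input features stays frozen (assumed setup).
        self.w_down = rng.normal(0.0, 0.02, size=(dim, bottleneck))
        # Zero init (an assumption): the adapter starts as an identity map,
        # so the pre-trained features are unchanged before training.
        self.w_up = np.zeros((bottleneck, dim))

    def __call__(self, x):
        h = np.maximum(x @ self.w_down, 0.0)  # ReLU in the narrow space
        return x + h @ self.w_up              # residual connection

rng = np.random.default_rng(0)
frozen_features = rng.normal(size=(4, 512))  # stand-in for backbone outputs
adapter = BottleneckAdapter(dim=512, bottleneck=64, rng=rng)
adapted = adapter(frozen_features)
n_trainable = adapter.w_down.size + adapter.w_up.size
print(adapted.shape, n_trainable)
```

The narrow middle layer keeps the number of trainable parameters small relative to the backbone, which is the usual motivation for this kind of module.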

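The researchers evaluate their model on the PIXray and PIDray datasets and report that it compares favorably with baseline OVOD methods at detecting novel categories. The abstract reports 21.0 AP50 on PIXray and 27.8 AP50 on PIDray, which is 15.2 and 1.5 AP50 above the previous best results. They attribute the improvement to the adapter bridging the gap between X-ray images and the general images CLIP was pre-trained on, while keeping CLIP's ability to recognize categories not seen during training.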
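The paper's abstract reports results in AP50, that is, average precision where a predicted box counts as correct only if its intersection-over-union (IoU) with a ground-truth box is at least 0.5. The matching criterion can be sketched as below; the helper names `iou` and `is_match_at_50` and the example boxes are illustrative assumptions, and the full AP computation (ranking by confidence and integrating the precision-recall curve) is omitted.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes have no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_match_at_50(pred_box, gt_box):
    """True if the prediction would count as a hit under the AP50 criterion."""
    return iou(pred_box, gt_box) >= 0.5

gt = (20, 20, 60, 60)
loose_pred = (10, 10, 50, 50)   # overlaps the ground truth only partly
tight_pred = (22, 22, 62, 62)   # nearly coincides with the ground truth

print(is_match_at_50(loose_pred, gt), is_match_at_50(tight_pred, gt))
```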

Critical Analysis

The researchers' approach of fine-tuning CLIP for prohibited item detection in X-ray scans is a clever use of a powerful vision-language model. By leveraging CLIP's open-vocabulary capabilities, they were able to create a system that can detect a much broader range of prohibited items compared to traditional object detection methods.

However, the paper does not provide much insight into the specific limitations or failure cases of their approach. For example, it would be interesting to know how the model performs on rare prohibited items, or how it handles challenging X-ray scan conditions like low image quality or cluttered backgrounds.

Additionally, the researchers mention that their method relies on the availability of a suitable prohibited item detection dataset for fine-tuning. In practice, collecting and annotating such a dataset may be a significant challenge, especially for security-sensitive applications. Further research could explore ways to reduce the data requirements, such as through few-shot learning or unsupervised domain adaptation techniques.

Overall, the researchers have presented a promising approach for improving prohibited item detection in X-ray security scans. However, additional investigation into the limitations and practical deployment challenges would help provide a more comprehensive understanding of the method's real-world applicability.

Conclusion

This paper demonstrates the potential of using a vision-language model like CLIP for open-vocabulary prohibited item detection in X-ray security scans. By fine-tuning CLIP, the researchers were able to create a system that can recognize a much broader range of prohibited items compared to traditional object detection approaches.

The key insight is that leveraging the open-vocabulary capabilities of a pre-trained model like CLIP can significantly improve the flexibility and generalization of prohibited item detection systems. This is particularly important for security applications, where the range of potential threats is constantly evolving.

While the researchers have shown promising results, further investigation into the limitations and practical deployment challenges of their approach would help solidify its real-world viability. Nonetheless, this work represents an important step towards enhancing the capabilities of X-ray security screening technology, which is crucial for maintaining public safety.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Open-Vocabulary X-ray Prohibited Item Detection via Fine-tuning CLIP

Shuyang Lin, Tong Jia, Hao Wang, Bowen Ma, Mingyuan Li, Dongyue Chen

X-ray prohibited item detection is an essential component of security checks, and the categories of prohibited items are continuously increasing in accordance with the latest laws. Previous works all focus on closed-set scenarios, which can only recognize known categories used for training and often require time-consuming and labor-intensive annotations when learning novel categories, resulting in limited real-world applications. Although the success of vision-language models (e.g. CLIP) provides a new perspective for open-set X-ray prohibited item detection, directly applying CLIP to the X-ray domain leads to a sharp performance drop due to the domain shift between X-ray data and the general data used for pre-training CLIP. To address the aforementioned challenges, in this paper we introduce the distillation-based open-vocabulary object detection (OVOD) task into the X-ray security inspection domain by extending CLIP to learn visual representations in our specific X-ray domain, aiming to detect novel prohibited item categories beyond the base categories on which the detector is trained. Specifically, we propose an X-ray feature adapter and apply it to CLIP within the OVOD framework to develop the OVXD model. The X-ray feature adapter contains three adapter submodules of bottleneck architecture, which is simple but can efficiently integrate new knowledge of the X-ray domain with original knowledge, further bridge the domain gap, and promote alignment between X-ray images and textual concepts. Extensive experiments conducted on the PIXray and PIDray datasets demonstrate that the proposed method performs favorably against other baseline OVOD methods in detecting novel categories in the X-ray scenario. It outperforms the previous best result by 15.2 AP50 and 1.5 AP50 on PIXray and PIDray, achieving 21.0 AP50 and 27.8 AP50 respectively.

6/18/2024

Toward Open Vocabulary Aerial Object Detection with CLIP-Activated Student-Teacher Learning

Yan Li, Weiwei Guo, Xue Yang, Ning Liao, Dunyun He, Jiaqi Zhou, Wenxian Yu

An increasingly massive number of remote-sensing images spurs the development of extensible object detectors that can detect objects beyond training categories without costly collecting new labeled data. In this paper, we aim to develop open-vocabulary object detection (OVD) technique in aerial images that scales up object vocabulary size beyond training data. The performance of OVD greatly relies on the quality of class-agnostic region proposals and pseudo-labels for novel object categories. To simultaneously generate high-quality proposals and pseudo-labels, we propose CastDet, a CLIP-activated student-teacher open-vocabulary object Detection framework. Our end-to-end framework following the student-teacher self-learning mechanism employs the RemoteCLIP model as an extra omniscient teacher with rich knowledge. By doing so, our approach boosts not only novel object proposals but also classification. Furthermore, we devise a dynamic label queue strategy to maintain high-quality pseudo labels during batch training. We conduct extensive experiments on multiple existing aerial object detection datasets, which are set up for the OVD task. Experimental results demonstrate our CastDet achieving superior open-vocabulary detection performance, e.g., reaching 46.5% mAP on VisDroneZSD novel categories, which outperforms the state-of-the-art open-vocabulary detectors by 21.0% mAP. To our best knowledge, this is the first work to apply and develop the open-vocabulary object detection technique for aerial images. The code is available at https://github.com/lizzy8587/CastDet.

8/13/2024

LP-OVOD: Open-Vocabulary Object Detection by Linear Probing

Chau Pham, Truong Vu, Khoi Nguyen

This paper addresses the challenging problem of open-vocabulary object detection (OVOD) where an object detector must identify both seen and unseen classes in test images without labeled examples of the unseen classes in training. A typical approach for OVOD is to use joint text-image embeddings of CLIP to assign box proposals to their closest text label. However, this method has a critical issue: many low-quality boxes, such as over- and under-covered-object boxes, have the same similarity score as high-quality boxes since CLIP is not trained on exact object location information. To address this issue, we propose a novel method, LP-OVOD, that discards low-quality boxes by training a sigmoid linear classifier on pseudo labels retrieved from the top relevant region proposals to the novel text. Experimental results on COCO affirm the superior performance of our approach over the state of the art, achieving $\textbf{40.5}$ in $\text{AP}_{novel}$ using ResNet50 as the backbone and without external datasets or knowing novel classes during training. Our code will be available at https://github.com/VinAIResearch/LP-OVOD.

6/4/2024

On the Potential of Open-Vocabulary Models for Object Detection in Unusual Street Scenes

Sadia Ilyas, Ido Freeman, Matthias Rottmann

Out-of-distribution (OOD) object detection is a critical task focused on detecting objects that originate from a data distribution different from that of the training data. In this study, we investigate to what extent state-of-the-art open-vocabulary object detectors can detect unusual objects in street scenes, which are considered as OOD or rare scenarios with respect to common street scene datasets. Specifically, we evaluate their performance on the OoDIS Benchmark, which extends RoadAnomaly21 and RoadObstacle21 from SegmentMeIfYouCan, as well as LostAndFound, which was recently extended to object level annotations. The objective of our study is to uncover short-comings of contemporary object detectors in challenging real-world, and particularly in open-world scenarios. Our experiments reveal that open vocabulary models are promising for OOD object detection scenarios, however far from perfect. Substantial improvements are required before they can be reliably deployed in real-world applications. We benchmark four state-of-the-art open-vocabulary object detection models on three different datasets. Noteworthily, Grounding DINO achieves the best results on RoadObstacle21 and LostAndFound in our study with an AP of 48.3% and 25.4% respectively. YOLO-World excels on RoadAnomaly21 with an AP of 21.2%.

8/22/2024