Replication Study and Benchmarking of Real-Time Object Detection Models

2405.06911

Published 5/14/2024 by Pierre-Luc Asselin, Vincent Coulombe, William Guimont-Martin, William Larrivée-Hardy

Abstract

This work examines the reproducibility and benchmarking of state-of-the-art real-time object detection models. As object detection models are often used in real-world contexts, such as robotics, where inference time is paramount, simply measuring models' accuracy is not enough to compare them. We thus compare a large variety of object detection models' accuracy and inference speed on multiple graphics cards. In addition to this large benchmarking attempt, we also reproduce the following models from scratch using PyTorch on the MS COCO 2017 dataset: DETR, RTMDet, ViTDet and YOLOv7. More importantly, we propose a unified training and evaluation pipeline, based on MMDetection's features, to better compare models. Our implementations of DETR and ViTDet could not achieve accuracy or speed comparable to what is reported in the original papers. On the other hand, the reproduced RTMDet and YOLOv7 could match the reported performance. The studied papers are also found to generally lack the details needed for reproducibility. As for MMDetection pretrained models, speed is severely reduced with limited computing resources (larger, more accurate models even more so). Moreover, results exhibit a strong trade-off between accuracy and speed, dominated by anchor-free models, notably RTMDet and YOLOx models. The code used in this paper and all the experiments is available in the repository at https://github.com/Don767/segdet_mlcr2024.

Overview

The research paper examines the reproducibility and benchmarking of state-of-the-art real-time object detection models.
It compares the accuracy and inference speed of various object detection models on multiple graphics cards.
The paper also reproduces several object detection models, including DETR, RTMDet, ViTDet, and YOLOv7, using a unified training and evaluation pipeline.
The paper aims to provide a comprehensive benchmark for comparing the performance of real-time object detection models.

Plain English Explanation

Accurate and fast object detection is crucial for real-world applications like robotics, where quick decision-making is essential. This research investigates how well different state-of-the-art object detection models perform in terms of both accuracy and speed.

The researchers tested a variety of object detection models on multiple types of graphics hardware, comparing their performance on the MS COCO 2017 dataset. They also reproduced several models from scratch, using a standardized training and evaluation pipeline, to better understand the models' capabilities.

The results show that some reproduced models, like RTMDet and YOLOv7, were able to match the performance claimed in the original papers. However, others, like DETR and ViTDet, could not achieve the same level of accuracy or speed.

The research also found that the original papers often lacked the necessary details for easy reproducibility. Additionally, the speed of pre-trained models from the MMDetection library was significantly reduced on less powerful hardware, with larger and more accurate models being affected the most.

Overall, the study highlights the trade-off between accuracy and speed in real-time object detection: faster models tend to be less accurate, and anchor-free models like RTMDet and YOLOx offer the best balance between the two.

Technical Explanation

The researchers conducted a comprehensive benchmark of state-of-the-art real-time object detection models, evaluating their accuracy and inference speed on multiple graphics cards. In addition to the benchmarking, they reproduced several models from scratch using PyTorch on the MS COCO 2017 dataset, including DETR, RTMDet, ViTDet, and YOLOv7.
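
To make the inference-speed measurement concrete, here is a minimal timing sketch. It is not the authors' pipeline: the function `benchmark`, its `warmup` and `runs` parameters, and the `dummy_model` workload are hypothetical names chosen for illustration, and a real benchmark would call an actual detector on images instead.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(infer: Callable[[], object], warmup: int = 5, runs: int = 20) -> Dict[str, float]:
    """Time a zero-argument inference callable; report latency and throughput."""
    # Warm-up calls are discarded, since first calls often pay one-off
    # costs (caching, lazy initialization) that would skew the average.
    for _ in range(warmup):
        infer()

    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        infer()
        # Assumption: CPU-only workload. With a PyTorch model on a GPU,
        # torch.cuda.synchronize() would be needed before reading the
        # clock, because GPU kernels run asynchronously.
        latencies.append(time.perf_counter() - start)

    mean_s = statistics.mean(latencies)
    return {
        "mean_ms": mean_s * 1000.0,
        "median_ms": statistics.median(latencies) * 1000.0,
        "fps": 1.0 / mean_s,
    }


def dummy_model() -> int:
    # Hypothetical stand-in for a detector's forward pass.
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    stats = benchmark(dummy_model)
    print(f"mean latency: {stats['mean_ms']:.3f} ms, throughput: {stats['fps']:.1f} FPS")
```

Running the same harness with the same inputs on each graphics card is what makes the resulting latency and FPS figures comparable across hardware.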

The researchers developed a unified training and evaluation pipeline, based on MMDetection's features, to standardize the comparison of the models. Their implementation of DETR and ViTDet could not match the accuracy or speed performances reported in the original papers, while the reproduced RTMDet and YOLOv7 models were able to achieve comparable results.

The paper also highlights that the original studies often lacked the necessary details for easy reproducibility. Furthermore, the researchers found that the speed performance of pre-trained models from the MMDetection library was significantly reduced when using less powerful computing resources, with larger and more accurate models being affected the most.

The results of the study demonstrate a strong trade-off between accuracy and speed, with anchor-free models, such as RTMDet and YOLOx, generally offering the best balance between the two.
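
One common way to read such a trade-off is to keep only the models that no other model beats on both accuracy and speed (the Pareto front). The sketch below illustrates that idea under stated assumptions: the paper is not known to use this exact procedure, and the model names and numbers in `RESULTS` are hypothetical placeholders, not results from the study.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    name: str
    ap: float   # accuracy (average precision), higher is better
    fps: float  # inference speed (frames per second), higher is better


def pareto_front(results: List[Result]) -> List[Result]:
    """Return the models not dominated on both accuracy and speed."""
    front = []
    for r in results:
        # r is dominated if another model is at least as good on both
        # axes and strictly better on at least one of them.
        dominated = any(
            o.ap >= r.ap and o.fps >= r.fps and (o.ap > r.ap or o.fps > r.fps)
            for o in results
        )
        if not dominated:
            front.append(r)
    # Sort from fastest to slowest for easier reading.
    return sorted(front, key=lambda r: r.fps, reverse=True)


# Hypothetical placeholder values, for illustration only.
RESULTS = [
    Result("model_a", ap=40.0, fps=120.0),
    Result("model_b", ap=45.0, fps=80.0),
    Result("model_c", ap=44.0, fps=60.0),  # slower and less accurate than model_b
    Result("model_d", ap=50.0, fps=30.0),
]

if __name__ == "__main__":
    for r in pareto_front(RESULTS):
        print(f"{r.name}: AP={r.ap}, FPS={r.fps}")
```

In this toy data, `model_c` drops out because `model_b` is both faster and more accurate. The remaining models each represent a different, defensible point on the accuracy-speed curve.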

Critical Analysis

While the paper provides a comprehensive benchmark of real-time object detection models, there are a few limitations and areas for further research that could be addressed:

The study focuses on a specific dataset (MS COCO 2017) and may not fully capture the performance of the models in other real-world scenarios or domains.
The reproducibility issues highlighted in the paper suggest that more detailed reporting and standardization of model training and evaluation procedures are needed in the field of object detection.
The trade-off between accuracy and speed observed in the results may be influenced by the specific hardware configurations used in the experiments. It would be valuable to explore the impact of different hardware setups on this trade-off.
The paper does not delve into the potential reasons why the reproduced DETR and ViTDet models could not match the original performance claims. Further investigation into the factors contributing to these discrepancies could provide valuable insights.

Overall, the research provides a useful benchmark for comparing the performance of real-time object detection models and highlights the importance of reproducibility in the field of computer vision.

Conclusion

This research paper presents a comprehensive benchmarking and reproducibility study of state-of-the-art real-time object detection models. By evaluating a wide range of models on multiple graphics cards and reproducing several models from scratch, the researchers have shed light on the critical trade-off between accuracy and inference speed in this domain.

The findings emphasize the need for more detailed reporting and standardization of model training and evaluation procedures to improve the reproducibility of object detection research. Additionally, the study underscores the significant impact of hardware resources on the performance of these models, with larger and more accurate models being particularly sensitive to computing power limitations.

The insights gained from this work can inform the development of more efficient and reliable real-time object detection systems, which are essential for applications such as robotics, autonomous vehicles, and surveillance. By highlighting the strengths and limitations of different object detection approaches, the research paves the way for further advancements in this important field of computer vision.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

DETRs Beat YOLOs on Real-time Object Detection

Yian Zhao, Wenyu Lv, Shangliang Xu, Jinman Wei, Guanzhong Wang, Qingqing Dang, Yi Liu, Jie Chen

The YOLO series has become the most popular framework for real-time object detection due to its reasonable trade-off between speed and accuracy. However, we observe that the speed and accuracy of YOLOs are negatively affected by the NMS. Recently, end-to-end Transformer-based detectors (DETRs) have provided an alternative to eliminating NMS. Nevertheless, the high computational cost limits their practicality and hinders them from fully exploiting the advantage of excluding NMS. In this paper, we propose the Real-Time DEtection TRansformer (RT-DETR), the first real-time end-to-end object detector to our best knowledge that addresses the above dilemma. We build RT-DETR in two steps, drawing on the advanced DETR: first we focus on maintaining accuracy while improving speed, followed by maintaining speed while improving accuracy. Specifically, we design an efficient hybrid encoder to expeditiously process multi-scale features by decoupling intra-scale interaction and cross-scale fusion to improve speed. Then, we propose the uncertainty-minimal query selection to provide high-quality initial queries to the decoder, thereby improving accuracy. In addition, RT-DETR supports flexible speed tuning by adjusting the number of decoder layers to adapt to various scenarios without retraining. Our RT-DETR-R50 / R101 achieves 53.1% / 54.3% AP on COCO and 108 / 74 FPS on T4 GPU, outperforming previously advanced YOLOs in both speed and accuracy. We also develop scaled RT-DETRs that outperform the lighter YOLO detectors (S and M models). Furthermore, RT-DETR-R50 outperforms DINO-R50 by 2.2% AP in accuracy and about 21 times in FPS. After pre-training with Objects365, RT-DETR-R50 / R101 achieves 55.3% / 56.2% AP. The project page: https://zhao-yian.github.io/RTDETR.

4/4/2024

cs.CV

Real-Time Detection and Analysis of Vehicles and Pedestrians using Deep Learning

Md Nahid Sadik, Tahmim Hossain, Faisal Sayeed

Computer vision, particularly vehicle and pedestrian identification is critical to the evolution of autonomous driving, artificial intelligence, and video surveillance. Current traffic monitoring systems confront major difficulty in recognizing small objects and pedestrians effectively in real-time, posing a serious risk to public safety and contributing to traffic inefficiency. Recognizing these difficulties, our project focuses on the creation and validation of an advanced deep-learning framework capable of processing complex visual input for precise, real-time recognition of cars and people in a variety of environmental situations. On a dataset representing complicated urban settings, we trained and evaluated different versions of the YOLOv8 and RT-DETR models. The YOLOv8 Large version proved to be the most effective, especially in pedestrian recognition, with great precision and robustness. The results, which include Mean Average Precision and recall rates, demonstrate the model's ability to dramatically improve traffic monitoring and safety. This study makes an important addition to real-time, reliable detection in computer vision, establishing new benchmarks for traffic management systems.

4/15/2024

cs.CV

YOLOv10: Real-Time End-to-End Object Detection

Ao Wang, Hui Chen, Lihao Liu, Kai Chen, Zijia Lin, Jungong Han, Guiguang Ding

Over the past years, YOLOs have emerged as the predominant paradigm in the field of real-time object detection owing to their effective balance between computational cost and detection performance. Researchers have explored the architectural designs, optimization objectives, data augmentation strategies, and others for YOLOs, achieving notable progress. However, the reliance on the non-maximum suppression (NMS) for post-processing hampers the end-to-end deployment of YOLOs and adversely impacts the inference latency. Besides, the design of various components in YOLOs lacks the comprehensive and thorough inspection, resulting in noticeable computational redundancy and limiting the model's capability. It renders the suboptimal efficiency, along with considerable potential for performance improvements. In this work, we aim to further advance the performance-efficiency boundary of YOLOs from both the post-processing and model architecture. To this end, we first present the consistent dual assignments for NMS-free training of YOLOs, which brings competitive performance and low inference latency simultaneously. Moreover, we introduce the holistic efficiency-accuracy driven model design strategy for YOLOs. We comprehensively optimize various components of YOLOs from both efficiency and accuracy perspectives, which greatly reduces the computational overhead and enhances the capability. The outcome of our effort is a new generation of YOLO series for real-time end-to-end object detection, dubbed YOLOv10. Extensive experiments show that YOLOv10 achieves state-of-the-art performance and efficiency across various model scales. For example, our YOLOv10-S is 1.8$\times$ faster than RT-DETR-R18 under the similar AP on COCO, meanwhile enjoying 2.8$\times$ smaller number of parameters and FLOPs. Compared with YOLOv9-C, YOLOv10-B has 46% less latency and 25% fewer parameters for the same performance.

5/24/2024

cs.CV

A Review and Implementation of Object Detection Models and Optimizations for Real-time Medical Mask Detection during the COVID-19 Pandemic

Ioanna Gogou, Dimitrios Koutsomitropoulos

Convolutional Neural Networks (CNN) are commonly used for the problem of object detection thanks to their increased accuracy. Nevertheless, the performance of CNN-based detection models is ambiguous when detection speed is considered. To the best of our knowledge, there has not been sufficient evaluation of the available methods in terms of the speed/accuracy trade-off in related literature. This work assesses the most fundamental object detection models on the Common Objects in Context (COCO) dataset with respect to this trade-off, their memory consumption, and computational and storage cost. Next, we select a highly efficient model called YOLOv5 to train on the topical and unexplored dataset of human faces with medical masks, the Properly-Wearing Masked Faces Dataset (PWMFD), and analyze the benefits of specific optimization techniques for real-time medical mask detection: transfer learning, data augmentations, and a Squeeze-and-Excitation attention mechanism. Using our findings in the context of the COVID-19 pandemic, we propose an optimized model based on YOLOv5s using transfer learning for the detection of correctly and incorrectly worn medical masks that surpassed more than two times in speed (69 frames per second) the state-of-the-art model SE-YOLOv3 on the PWMFD dataset while maintaining the same level of mean Average Precision (67%).

5/29/2024

cs.CV cs.AI