YOLOv10: Real-Time End-to-End Object Detection

2405.14458

Published 5/24/2024 by Ao Wang, Hui Chen, Lihao Liu, Kai Chen, Zijia Lin, Jungong Han, Guiguang Ding

Abstract

Over the past years, YOLOs have emerged as the predominant paradigm in the field of real-time object detection owing to their effective balance between computational cost and detection performance. Researchers have explored the architectural designs, optimization objectives, data augmentation strategies, and others for YOLOs, achieving notable progress. However, the reliance on the non-maximum suppression (NMS) for post-processing hampers the end-to-end deployment of YOLOs and adversely impacts the inference latency. Besides, the design of various components in YOLOs lacks the comprehensive and thorough inspection, resulting in noticeable computational redundancy and limiting the model's capability. It renders the suboptimal efficiency, along with considerable potential for performance improvements. In this work, we aim to further advance the performance-efficiency boundary of YOLOs from both the post-processing and model architecture. To this end, we first present the consistent dual assignments for NMS-free training of YOLOs, which brings competitive performance and low inference latency simultaneously. Moreover, we introduce the holistic efficiency-accuracy driven model design strategy for YOLOs. We comprehensively optimize various components of YOLOs from both efficiency and accuracy perspectives, which greatly reduces the computational overhead and enhances the capability. The outcome of our effort is a new generation of YOLO series for real-time end-to-end object detection, dubbed YOLOv10. Extensive experiments show that YOLOv10 achieves state-of-the-art performance and efficiency across various model scales. For example, our YOLOv10-S is 1.8$\times$ faster than RT-DETR-R18 under the similar AP on COCO, meanwhile enjoying 2.8$\times$ smaller number of parameters and FLOPs. Compared with YOLOv9-C, YOLOv10-B has 46% less latency and 25% fewer parameters for the same performance.

Overview

Real-time object detection models like YOLO have emerged as popular choices due to their balance of speed and performance.
Researchers have explored various aspects of YOLO models, including architecture, optimization, and data augmentation, leading to notable progress.
However, YOLO models still face challenges, such as the reliance on non-maximum suppression (NMS) for post-processing, which impacts inference latency, and computational redundancy in the model design.

Plain English Explanation

YOLO (You Only Look Once) models have become widely used for real-time object detection tasks, as they can quickly identify and locate objects in images or videos while maintaining good accuracy. Researchers have been working to continuously improve YOLO models, exploring different ways to design the model architecture, optimize the training process, and augment the training data.

Despite these advancements, YOLO models still have some limitations. One issue is the use of non-maximum suppression (NMS) for post-processing, which can slow down the speed of the model at inference time. Additionally, the components of YOLO models may not be optimized as thoroughly as they could be, leading to unnecessary computational overhead and limiting the model's overall capabilities.
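To make the post-processing cost concrete, here is a minimal sketch of greedy NMS in plain Python. It is a generic illustration and not code from the paper; the function names `iou` and `nms`, the `(x1, y1, x2, y2)` box layout, and the 0.5 threshold are assumptions chosen for the example. Each candidate is compared against every box already kept, and this pairwise step is the extra inference-time work that an NMS-free detector avoids.

```python
from typing import List, Tuple

# Assumed box layout for this sketch: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    # The second box overlaps the first heavily and is suppressed.
    print(nms(boxes, scores))
```

In the worst case the loop performs a number of IoU comparisons that grows quadratically with the number of candidate boxes, and its result depends on a hand-tuned threshold.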

Technical Explanation

The researchers in this work aim to further improve the performance and efficiency of YOLO models, addressing both the post-processing and model architecture aspects.

First, they present consistent dual assignments, a training scheme for YOLO models that removes the need for NMS while achieving competitive performance with low inference latency.
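As a rough illustration of what NMS-free inference can look like, the sketch below assumes a detector already trained to emit at most one confident prediction per object. Under that assumption, post-processing reduces to a score threshold and a top-k cut, with no pairwise box comparisons. The function name `select_top_k` and the default values are hypothetical and are not taken from the paper or its code.

```python
from typing import List


def select_top_k(
    scores: List[float],
    score_threshold: float = 0.25,   # assumed value, for illustration only
    max_detections: int = 300,       # assumed value, for illustration only
) -> List[int]:
    """Return indices of the highest-scoring predictions above the threshold.

    No IoU computation is involved: the sketch relies on the (assumed)
    property that the model does not produce duplicate boxes per object.
    """
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    return candidates[:max_detections]


if __name__ == "__main__":
    print(select_top_k([0.9, 0.1, 0.6, 0.3], score_threshold=0.25, max_detections=2))
```

The cost here is a filter and a sort over the scores, which is why removing the suppression step can lower end-to-end latency.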

Second, the researchers introduce a holistic efficiency-accuracy driven model design strategy that optimizes various components of YOLO models from both efficiency and accuracy perspectives. This reduces the computational overhead and enhances the capability of the models.

The outcome of this work is a new generation of YOLO models, dubbed YOLOv10, which demonstrate state-of-the-art performance and efficiency across different model scales. For example, the YOLOv10-S model is 1.8 times faster than RT-DETR-R18 at similar AP on the COCO dataset, with 2.8 times fewer parameters and FLOPs. Compared to the previous YOLOv9-C model, the YOLOv10-B model has 46% less latency and 25% fewer parameters for the same level of performance.
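The reported figures mix two conventions: "times faster" and "percent less". The short sketch below converts between them so the comparisons can be read on the same scale. It assumes that "N times faster" means the baseline latency divided by the new latency equals N; the abstract does not spell out this definition, and the helper names are hypothetical.

```python
def speedup_from_latency_reduction(reduction: float) -> float:
    """Convert a fractional latency reduction (0.46 for 46%) to a speedup factor."""
    return 1.0 / (1.0 - reduction)


def latency_reduction_from_speedup(speedup: float) -> float:
    """Convert a speedup factor (1.8 for 1.8x) to a fractional latency reduction."""
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    # YOLOv10-B vs YOLOv9-C: 46% less latency, expressed as a speedup factor.
    print(f"{speedup_from_latency_reduction(0.46):.2f}x")
    # YOLOv10-S vs RT-DETR-R18: 1.8x faster, expressed as a latency reduction.
    print(f"{latency_reduction_from_speedup(1.8):.1%}")
    # 2.8x fewer parameters and FLOPs, expressed as a fractional reduction.
    print(f"{latency_reduction_from_speedup(2.8):.1%}")
```

Under this reading, a 46% latency reduction corresponds to a speedup of roughly 1.85 times, so the two headline comparisons are of similar magnitude.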

Critical Analysis

The researchers have made notable progress in improving the performance and efficiency of YOLO models. The elimination of the NMS post-processing step and the comprehensive optimization of the model components are significant contributions that address key limitations of YOLO models.

However, the abstract summarized here does not detail the specific architectural changes or optimizations made to the various components of the YOLOv10 models. Readers should consult the full paper for the rationale behind these design choices and how they improve the efficiency and capability of the models.

Likewise, the abstract does not discuss potential limitations or drawbacks of the proposed approaches. It would be valuable to explore any trade-offs or edge cases that may arise, as well as potential areas for further research and improvement.

Conclusion

The researchers have developed a new generation of YOLO models, YOLOv10, that achieve state-of-the-art performance and efficiency in real-time object detection tasks. By addressing the limitations of NMS-based post-processing and optimizing the model architecture, the researchers have pushed the boundaries of what is possible with YOLO models.

These advancements in YOLO-based object detection have the potential to benefit a wide range of applications, from autonomous vehicles to surveillance systems, by enabling faster and more accurate object recognition in real-time. As the field of computer vision continues to evolve, the insights and techniques presented in this work may inspire further innovation and progress in the development of efficient and high-performing object detection models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

DETRs Beat YOLOs on Real-time Object Detection

Yian Zhao, Wenyu Lv, Shangliang Xu, Jinman Wei, Guanzhong Wang, Qingqing Dang, Yi Liu, Jie Chen

The YOLO series has become the most popular framework for real-time object detection due to its reasonable trade-off between speed and accuracy. However, we observe that the speed and accuracy of YOLOs are negatively affected by the NMS. Recently, end-to-end Transformer-based detectors (DETRs) have provided an alternative to eliminating NMS. Nevertheless, the high computational cost limits their practicality and hinders them from fully exploiting the advantage of excluding NMS. In this paper, we propose the Real-Time DEtection TRansformer (RT-DETR), the first real-time end-to-end object detector to our best knowledge that addresses the above dilemma. We build RT-DETR in two steps, drawing on the advanced DETR: first we focus on maintaining accuracy while improving speed, followed by maintaining speed while improving accuracy. Specifically, we design an efficient hybrid encoder to expeditiously process multi-scale features by decoupling intra-scale interaction and cross-scale fusion to improve speed. Then, we propose the uncertainty-minimal query selection to provide high-quality initial queries to the decoder, thereby improving accuracy. In addition, RT-DETR supports flexible speed tuning by adjusting the number of decoder layers to adapt to various scenarios without retraining. Our RT-DETR-R50 / R101 achieves 53.1% / 54.3% AP on COCO and 108 / 74 FPS on T4 GPU, outperforming previously advanced YOLOs in both speed and accuracy. We also develop scaled RT-DETRs that outperform the lighter YOLO detectors (S and M models). Furthermore, RT-DETR-R50 outperforms DINO-R50 by 2.2% AP in accuracy and about 21 times in FPS. After pre-training with Objects365, RT-DETR-R50 / R101 achieves 55.3% / 56.2% AP. The project page: https://zhao-yian.github.io/RTDETR.

4/4/2024

cs.CV

Real-Time Flying Object Detection with YOLOv8

Dillon Reis, Jordan Kupec, Jacqueline Hong, Ahmad Daoudi

This paper presents a generalized model for real-time detection of flying objects that can be used for transfer learning and further research, as well as a refined model that achieves state-of-the-art results for flying object detection. We achieve this by training our first (generalized) model on a data set containing 40 different classes of flying objects, forcing the model to extract abstract feature representations. We then perform transfer learning with these learned parameters on a data set more representative of real world environments (i.e. higher frequency of occlusion, very small spatial sizes, rotations, etc.) to generate our refined model. Object detection of flying objects remains challenging due to large variances of object spatial sizes/aspect ratios, rate of speed, occlusion, and clustered backgrounds. To address some of the presented challenges while simultaneously maximizing performance, we utilize the current state-of-the-art single-shot detector, YOLOv8, in an attempt to find the best trade-off between inference speed and mean average precision (mAP). While YOLOv8 is being regarded as the new state-of-the-art, an official paper has not been released as of yet. Thus, we provide an in-depth explanation of the new architecture and functionality that YOLOv8 has adapted. Our final generalized model achieves a mAP50 of 79.2%, mAP50-95 of 68.5%, and an average inference speed of 50 frames per second (fps) on 1080p videos. Our final refined model maintains this inference speed and achieves an improved mAP50 of 99.1% and mAP50-95 of 83.5%

5/24/2024

cs.CV cs.LG

You Only Look at Once for Real-time and Generic Multi-Task

Jiayuan Wang, Q. M. Jonathan Wu, Ning Zhang

High precision, lightweight, and real-time responsiveness are three essential requirements for implementing autonomous driving. In this study, we incorporate A-YOLOM, an adaptive, real-time, and lightweight multi-task model designed to concurrently address object detection, drivable area segmentation, and lane line segmentation tasks. Specifically, we develop an end-to-end multi-task model with a unified and streamlined segmentation structure. We introduce a learnable parameter that adaptively concatenates features between necks and backbone in segmentation tasks, using the same loss function for all segmentation tasks. This eliminates the need for customizations and enhances the model's generalization capabilities. We also introduce a segmentation head composed only of a series of convolutional layers, which reduces the number of parameters and inference time. We achieve competitive results on the BDD100k dataset, particularly in visualization outcomes. The performance results show a mAP50 of 81.1% for object detection, a mIoU of 91.0% for drivable area segmentation, and an IoU of 28.8% for lane line segmentation. Additionally, we introduce real-world scenarios to evaluate our model's performance in a real scene, which significantly outperforms competitors. This demonstrates that our model not only exhibits competitive performance but is also more flexible and faster than existing multi-task models. The source codes and pre-trained models are released at https://github.com/JiayuanWang-JW/YOLOv8-multi-task

4/26/2024

cs.CV

Precision and Adaptability of YOLOv5 and YOLOv8 in Dynamic Robotic Environments

Victor A. Kich, Muhammad A. Muttaqien, Junya Toyama, Ryutaro Miyoshi, Yosuke Ida, Akihisa Ohya, Hisashi Date

Recent advancements in real-time object detection frameworks have spurred extensive research into their application in robotic systems. This study provides a comparative analysis of YOLOv5 and YOLOv8 models, challenging the prevailing assumption of the latter's superiority in performance metrics. Contrary to initial expectations, YOLOv5 models demonstrated comparable, and in some cases superior, precision in object detection tasks. Our analysis delves into the underlying factors contributing to these findings, examining aspects such as model architecture complexity, training dataset variances, and real-world applicability. Through rigorous testing and an ablation study, we present a nuanced understanding of each model's capabilities, offering insights into the selection and optimization of object detection frameworks for robotic applications. Implications of this research extend to the design of more efficient and contextually adaptive systems, emphasizing the necessity for a holistic approach to evaluating model performance.

6/4/2024

cs.RO cs.CV