Methodology to Deploy CNN-Based Computer Vision Models on Immersive Wearable Devices

Read original: arXiv:2407.00233 - Published 7/2/2024 by Kaveh Malek (Department of Mechanical Engineering, University of New Mexico, New Mexico), Fernando Moreu (Department of Civil, Construction and Environmental Engineering, University of New Mexico, New Mexico)

Overview

Convolutional Neural Networks (CNNs) often lack the ability to incorporate human input, which can be addressed by Augmented Reality (AR) headsets.
Current AR headsets are limited in processing power, preventing real-time, complex image recognition tasks using CNNs.
This paper presents a method to deploy CNN models on AR headsets by training them on computers and transferring the optimized weight matrices to the headset.
The approach transforms the image data and CNN layers into a one-dimensional format suitable for the AR platform.
The method is demonstrated by training the LeNet-5 CNN model on the MNIST dataset using PyTorch and deploying it on a HoloLens AR headset.

Plain English Explanation

Convolutional Neural Networks (CNNs) are a type of artificial intelligence model that is very good at recognizing images. However, they often lack the ability to incorporate input from humans, which could be useful in many applications. Augmented Reality (AR) headsets could address this by allowing users to interact with the AI models in real time.

Unfortunately, current AR headsets are limited in their processing power, which has made it difficult to use complex CNN models on them. This paper introduces a new method to overcome this issue. The researchers train the CNN model on a powerful computer, then transfer the optimized weight matrices (the learned parameters of the model) to the AR headset.

To make this work, the researchers transform the image data and the CNN layers into a format that is more suitable for the AR platform. They demonstrate this by training a well-known CNN model called LeNet-5 on the MNIST dataset of handwritten digits, and then deploying it on a HoloLens AR headset.

The results show that the model maintains an accuracy of around 98% on the AR headset, which is similar to its performance on a computer. This integration of CNN and AR technology enables real-time image processing on AR headsets, allowing for the incorporation of human input into the AI models.

Technical Explanation

The paper presents a method to deploy Convolutional Neural Network (CNN) models on Augmented Reality (AR) headsets, which are often limited in processing power and unable to run complex CNN models in real time.

The researchers train the CNN model, in this case, the LeNet-5 architecture, on a computer using the PyTorch library and the MNIST dataset of handwritten digits. They then transfer the optimized weight matrices (the learned parameters) of the CNN model to the AR headset, in this case, a HoloLens.
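
The weight-transfer step can be sketched as follows. This is a minimal illustration, not the authors' export code: the JSON layout, the `export_weights` helper, and the toy weights are all assumptions. With PyTorch, the nested lists could plausibly be obtained from the tensors in `model.state_dict()` via `.tolist()`; here plain Python lists stand in so the sketch runs without any dependencies.

```python
import json


def flatten(nested):
    """Recursively flatten a nested list into a 1D list (row-major order)."""
    if not isinstance(nested, list):
        return [nested]
    flat = []
    for item in nested:
        flat.extend(flatten(item))
    return flat


def shape_of(nested):
    """Return the shape of a regular (non-ragged, non-empty) nested list."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return shape


def export_weights(named_weights):
    """Serialize {layer name: nested list} to JSON as flat values plus shapes.

    The file layout is a hypothetical choice for this sketch; the paper
    summary does not say which format the weights are transferred in.
    """
    payload = {
        name: {"shape": shape_of(w), "values": flatten(w)}
        for name, w in named_weights.items()
    }
    return json.dumps(payload)


# Made-up toy weights: 2 output channels, 1 input channel, 2x2 kernels, plus a bias.
toy_weights = {
    "conv1.weight": [[[[0.1, 0.2], [0.3, 0.4]]], [[[0.5, 0.6], [0.7, 0.8]]]],
    "conv1.bias": [0.0, 1.0],
}

exported = json.loads(export_weights(toy_weights))
print(exported["conv1.weight"]["shape"])        # [2, 1, 2, 2]
print(len(exported["conv1.weight"]["values"]))  # 8
```

Storing the shape next to the flat values lets the receiving side recompute any multi-dimensional index from the 1D array.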

To make the CNN model compatible with the AR platform, the researchers transform the image data and the CNN layers into a one-dimensional format. This involves flattening the 2D convolutional layers into 1D layers, and reshaping the image data to match the input requirements of the transformed model.
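
One way such a one-dimensional formulation might look is sketched below: a single-channel convolution that reads and writes only flat, row-major lists and recovers the 2D neighborhood through index arithmetic. The function name `conv2d_flat`, the row-major layout, and the no-padding, stride-1 setting are assumptions made for illustration; the paper's actual headset implementation may differ. Like PyTorch's `Conv2d`, the sketch computes a cross-correlation (the kernel is not flipped).

```python
def conv2d_flat(image, height, width, kernel, k):
    """Valid (no padding), stride-1 2D cross-correlation on row-major 1D lists.

    image  : list of height * width values
    kernel : list of k * k values
    Returns (flat output list, output height, output width).
    """
    out_h, out_w = height - k + 1, width - k + 1
    out = [0.0] * (out_h * out_w)
    for r in range(out_h):
        for c in range(out_w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    # Pixel (r + i, c + j) lives at flat index (r + i) * width + (c + j).
                    acc += image[(r + i) * width + (c + j)] * kernel[i * k + j]
            out[r * out_w + c] = acc
    return out, out_h, out_w


# 3x3 image holding 1..9 and a 2x2 kernel that adds each pixel to its
# lower-right diagonal neighbor.
image = [1, 2, 3,
         4, 5, 6,
         7, 8, 9]
kernel = [1, 0,
          0, 1]

result, h, w = conv2d_flat(image, 3, 3, kernel, 2)
print(result, h, w)  # [6.0, 8.0, 12.0, 14.0] 2 2
```

Because the index mapping is exact, each output value is the same as it would be with 2D arrays; only the storage layout changes.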

The transformed model is then deployed on the HoloLens AR headset, where it is able to perform real-time image recognition tasks with an accuracy of approximately 98%, which is similar to its performance on the computer.
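
The final classification step on the device can be sketched in the same flat style. This is an illustrative assumption rather than the paper's code: the helper names and the toy numbers are made up, and the weight layout (row-major, one row per output unit) simply mirrors the `(out_features, in_features)` shape PyTorch uses for `nn.Linear.weight`.

```python
def dense_flat(x, weights, bias, n_out):
    """Fully connected layer where weights is a row-major (n_out x n_in) 1D list."""
    n_in = len(x)
    return [
        sum(weights[o * n_in + i] * x[i] for i in range(n_in)) + bias[o]
        for o in range(n_out)
    ]


def predict(scores):
    """Return the index of the largest score (the predicted class)."""
    return max(range(len(scores)), key=lambda idx: scores[idx])


# Made-up 3-feature input and a 3-class output layer.
x = [1.0, -2.0, 0.5]
weights = [0.5, 0.0, 1.0,     # row for class 0
           -1.0, -1.0, 0.0,   # row for class 1
           0.0, 0.5, 2.0]     # row for class 2
bias = [0.0, 0.1, 0.2]

scores = dense_flat(x, weights, bias, 3)
print(predict(scores))  # 1
```

For MNIST the last layer would have ten outputs, one per digit, and the predicted digit is the index of the highest score.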

This integration of CNN and AR technology enables the incorporation of human input into the AI models, as users can interact with the system using the AR headset. This could have applications in real-time simulated avatars, human neural representation learning, and holographic overlays.

Critical Analysis

The paper presents a novel and promising approach to deploying CNN models on AR headsets, which addresses the limitations of current AR platforms in processing power and real-time performance.

One potential limitation of the approach is the need to transform the CNN architecture and data to a one-dimensional format, which may result in some loss of spatial information and could affect the model's performance on more complex image recognition tasks.

Additionally, the paper only demonstrates the method using the LeNet-5 model and the MNIST dataset, which is a relatively simple task. It would be interesting to see how the approach scales to more complex CNN architectures and larger, more diverse datasets.
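
To put the size of LeNet-5 in perspective, the arithmetic below walks through the layer shapes and counts the trainable parameters. It assumes the commonly used layout (32x32 grayscale input, two 5x5 convolutions with 6 and 16 channels, 2x2 pooling without parameters, then dense layers of 120, 84, and 10 units, all fully connected); the summary does not state which variant the authors trained, so the exact numbers are illustrative.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a square convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


# Assumed input: 32x32 grayscale (MNIST's 28x28 digits padded by 2 pixels per side).
size = 32
channels = 1
params = 0

for out_channels in (6, 16):
    size = conv_out(size, kernel=5)                           # 5x5 convolution, no padding
    params += out_channels * channels * 5 * 5 + out_channels  # weights + biases
    size = conv_out(size, kernel=2, stride=2)                 # 2x2 pooling, stride 2
    channels = out_channels

features = channels * size * size  # 16 * 5 * 5 = 400 values entering the dense layers
for n_out in (120, 84, 10):
    params += features * n_out + n_out
    features = n_out

print(size, params)   # 5 61706
print(params * 4)     # 246824 bytes if each parameter is stored as a 32-bit float
```

Under these assumptions the whole model is roughly 62 thousand parameters, which is orders of magnitude smaller than modern CNN backbones and helps explain why scaling to larger architectures remains an open question.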

Further research could also explore ways to optimize the transformed CNN model for better performance on the AR platform, or investigate methods to perform the training and inference directly on the AR headset without the need for a separate computer.

Overall, the paper provides a valuable contribution to the field of integrating AI and AR technologies, and the proposed method could have significant implications for the development of interactive, real-time AI applications in augmented reality.

Conclusion

This paper presents a novel approach to deploying Convolutional Neural Network (CNN) models on Augmented Reality (AR) headsets, which are often limited in processing power and unable to run complex AI models in real time.

The researchers demonstrate a method to train the CNN model on a computer, transfer the optimized weight matrices to the AR headset, and transform the model architecture and data to a one-dimensional format that is compatible with the AR platform.

By using this approach, the researchers were able to deploy the LeNet-5 CNN model on a HoloLens AR headset, maintaining an accuracy of approximately 98% on the MNIST handwritten digit recognition task, similar to its performance on a computer.

This integration of CNN and AR technology enables real-time image processing on AR headsets, allowing for the incorporation of human input into the AI models. This could have significant implications for the development of interactive, real-time AI applications in augmented reality, with potential use cases in areas such as simulated avatars and holographic overlays.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Smartphone-based Eye Tracking System using Edge Intelligence and Model Optimisation

Nishan Gunawardena, Gough Yumu Lui, Jeewani Anupama Ginige, Bahman Javadi

A significant limitation of current smartphone-based eye-tracking algorithms is their low accuracy when applied to video-type visual stimuli, as they are typically trained on static images. Also, the increasing demand for real-time interactive applications like games, VR, and AR on smartphones requires overcoming the limitations posed by resource constraints such as limited computational power, battery life, and network bandwidth. Therefore, we developed two new smartphone eye-tracking techniques for video-type visuals by combining Convolutional Neural Networks (CNN) with two different Recurrent Neural Networks (RNN), namely Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU). Our CNN+LSTM and CNN+GRU models achieved an average Root Mean Square Error of 0.955cm and 1.091cm, respectively. To address the computational constraints of smartphones, we developed an edge intelligence architecture to enhance the performance of smartphone-based eye tracking. We applied various optimisation methods like quantisation and pruning to deep learning models for better energy, CPU, and memory usage on edge devices, focusing on real-time processing. Using model quantisation, the model inference time in the CNN+LSTM and CNN+GRU models was reduced by 21.72% and 19.50%, respectively, on edge devices.

8/23/2024

An evaluation of CNN models and data augmentation techniques in hierarchical localization of mobile robots

J. J. Cabrera, O. J. Céspedes, S. Cebollada, O. Reinoso, L. Payá

This work presents an evaluation of CNN models and data augmentation to carry out the hierarchical localization of a mobile robot by using omnidirectional images. In this sense, an ablation study of different state-of-the-art CNN models used as backbone is presented and a variety of data augmentation visual effects are proposed for addressing the visual localization of the robot. The proposed method is based on the adaptation and re-training of a CNN with a dual purpose: (1) to perform a rough localization step in which the model is used to predict the room from which an image was captured, and (2) to address the fine localization step, which consists of retrieving the most similar image of the visual map among those contained in the previously predicted room by means of a pairwise comparison between descriptors obtained from an intermediate layer of the CNN. In this sense, we evaluate the impact of different state-of-the-art CNN models such as ConvNeXt for addressing the proposed localization. Finally, a variety of data augmentation visual effects are separately employed for training the model and their impact is assessed. The performance of the resulting CNNs is evaluated under real operation conditions, including changes in the lighting conditions. Our code is publicly available on the project website https://github.com/juanjo-cabrera/IndoorLocalizationSingleCNN.git

7/16/2024

Fast Registration of Photorealistic Avatars for VR Facial Animation

Chaitanya Patel, Shaojie Bai, Te-Li Wang, Jason Saragih, Shih-En Wei

Virtual Reality (VR) holds the promise of social interactions that can feel more immersive than other media. Key to this is the ability to accurately animate a personalized photorealistic avatar, and hence the acquisition of the labels for headset-mounted camera (HMC) images needs to be efficient and accurate, while wearing a VR headset. This is challenging due to oblique camera views and differences in image modality. In this work, we first show that the domain gap between the avatar and HMC images is one of the primary sources of difficulty, where a transformer-based architecture achieves high accuracy on domain-consistent data, but degrades when the domain gap is re-introduced. Building on this finding, we propose a system split into two parts: an iterative refinement module that takes in-domain inputs, and a generic avatar-guided image-to-image domain transfer module conditioned on current estimates. These two modules reinforce each other: domain transfer becomes easier when close-to-groundtruth examples are shown, and better domain-gap removal in turn improves the registration. Our system obviates the need for costly offline optimization, and produces online registration of higher quality than direct regression methods. We validate the accuracy and efficiency of our approach through extensive experiments on a commodity headset, demonstrating significant improvements over these baselines. To stimulate further research in this direction, we make our large-scale dataset and code publicly available.

7/22/2024