FoVNet: Configurable Field-of-View Speech Enhancement with Low Computation and Distortion for Smart Glasses

Read original: arXiv:2408.06468 - Published 8/14/2024 by Zhongweiyang Xu, Ali Aroudi, Ke Tan, Ashutosh Pandey, Jung-Suk Lee, Buye Xu, Francesco Nesta

Overview

  • FoVNet is a multi-channel speech enhancement model for smart glasses that enhances speech within a configurable field of view (FoV), so the FoV can be set for different use cases.
  • It aims to provide high-quality speech enhancement with low computational cost (the neural network component uses about 50 MMACS) and minimal distortion.
  • The model is designed to be flexible and adaptable to various scenarios, enabling users to adjust the FoV to their needs.

Plain English Explanation

The paper introduces FoVNet, a speech enhancement system designed for smart glasses. The key idea is to make the field of view (FoV) configurable so that users can adjust it to their specific needs.

For example, in a loud environment, you might want a narrower FoV to focus on the person in front of you and filter out background noise. In a quieter setting, you might prefer a wider FoV to capture more of the surrounding conversation. FoVNet supports both cases by letting the FoV itself be configured, and it enhances every talker inside that FoV rather than a single pre-selected one.
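To make the narrow-versus-wide idea concrete, here is a minimal geometric sketch. It assumes the FoV is described by a center azimuth and an angular width in degrees, which is a hypothetical parameterization chosen for illustration; the summary does not say how FoVNet represents the FoV, and the paper states that the system does not need talker directions at all. The function names are also made up for this example.

```python
def angular_offset_deg(a, b):
    """Smallest absolute difference between two angles, in degrees (0..180)."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def in_fov(source_azimuth_deg, fov_center_deg, fov_width_deg):
    """True if a source direction falls inside the FoV (hypothetical geometry)."""
    return angular_offset_deg(source_azimuth_deg, fov_center_deg) <= fov_width_deg / 2.0


# Hypothetical talker positions around the wearer (0 degrees = straight ahead).
talkers = {"front": 0.0, "front_left": 40.0, "behind": 170.0}

# A narrow FoV (60 degrees wide) versus a wide one (120 degrees wide).
narrow = [name for name, az in talkers.items() if in_fov(az, 0.0, 60.0)]
wide = [name for name, az in talkers.items() if in_fov(az, 0.0, 120.0)]

print("narrow FoV keeps:", narrow)
print("wide FoV keeps:", wide)
```

In this toy setup the narrow FoV keeps only the talker straight ahead, while the wide FoV also keeps the talker at 40 degrees; the talker behind the wearer is outside both.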

The researchers also aimed to keep the computational cost low and minimize distortion in the enhanced audio, making it suitable for real-time use on smart glasses. This is important, as you wouldn't want your speech enhancement to drain the battery or introduce noticeable artifacts.

Technical Explanation

According to the paper's abstract, FoVNet is a hybrid signal processing and deep learning system. It takes in multichannel audio from the microphone array on the smart glasses and outputs enhanced speech for all talkers inside the configured FoV, without needing the direction of any specific target talker.

The design is built around computational efficiency. The neural network component is designed for ultra-low computation (about 50 MMACS), and a multi-channel Wiener filter and a post-processing module are further used to improve perceptual quality. The FoV is a configurable input to the system, so the same model can serve a narrower or wider region depending on the user's needs.
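The paper's abstract mentions a multi-channel Wiener filter (MCWF) as one stage of the system. The sketch below shows the textbook MCWF for a single frequency bin, assuming the speech and noise spatial covariance matrices are already known. It is not FoVNet's implementation: how FoVNet obtains these statistics is not described in this summary, and the names `mcwf_weights` and `apply_filter`, the four-microphone setup, and the synthetic signals are all assumptions made for illustration.

```python
import numpy as np


def mcwf_weights(phi_s, phi_n, ref_mic=0):
    """Textbook MCWF for one frequency bin: w = (Phi_s + Phi_n)^{-1} Phi_s e_ref."""
    phi_x = phi_s + phi_n
    return np.linalg.solve(phi_x, phi_s[:, ref_mic])


def apply_filter(w, x):
    """Apply w^H to multichannel frames x of shape (mics, frames)."""
    return w.conj() @ x


# Synthetic single-bin example (all values are made up for the demo).
rng = np.random.default_rng(0)
n_mics, n_frames = 4, 2000

# Unit-magnitude steering vector for one talker, and unit-variance complex signals.
d = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, n_mics))
s = (rng.standard_normal(n_frames) + 1j * rng.standard_normal(n_frames)) / np.sqrt(2)
n = (rng.standard_normal((n_mics, n_frames))
     + 1j * rng.standard_normal((n_mics, n_frames))) / np.sqrt(2)
x = d[:, None] * s[None, :] + n

# Oracle covariances: rank-1 speech, spatially white noise.
phi_s = np.outer(d, d.conj())
phi_n = np.eye(n_mics)

w = mcwf_weights(phi_s, phi_n, ref_mic=0)
y = apply_filter(w, x)

target = d[0] * s  # clean speech as seen at the reference microphone
mse_in = np.mean(np.abs(x[0] - target) ** 2)
mse_out = np.mean(np.abs(y - target) ** 2)
print(f"error at reference mic: {mse_in:.3f}, after MCWF: {mse_out:.3f}")
```

With oracle covariances the filtered error is lower than the error at the unprocessed reference microphone. In a real system the covariances would have to be estimated, and the quality of that estimate largely determines how much distortion the filter introduces.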

The researchers evaluated FoVNet with a microphone array on smart glasses and report that it performs well in both computational efficiency and speech quality across multiple scenarios, while also offering the configurable FoV feature.
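To give a feel for what a budget of about 50 MMACS (the figure quoted in the paper's abstract) means, here is a rough back-of-the-envelope sketch. It assumes MMACS stands for millions of multiply-accumulate operations per second, and it uses hypothetical layer sizes, sample rate, and hop size; none of these numbers come from the paper.

```python
def conv1d_macs_per_frame(in_channels, out_channels, kernel_size):
    """MACs needed to produce one output frame of a 1-D convolution layer."""
    return in_channels * out_channels * kernel_size


def mmacs(layers, frames_per_second):
    """Millions of MACs per second for a stack of (in, out, kernel) layers."""
    per_frame = sum(conv1d_macs_per_frame(*layer) for layer in layers)
    return per_frame * frames_per_second / 1e6


# Hypothetical network: four 64-channel layers with kernel size 3.
hypothetical_layers = [(64, 64, 3)] * 4

# Hypothetical framing: 16 kHz audio with a hop of 256 samples.
frames_per_second = 16000 / 256

estimate = mmacs(hypothetical_layers, frames_per_second)
print(f"estimated cost: {estimate:.3f} MMACS")
```

The estimate scales linearly with the frame rate and with the product of channel counts, which is why small channel widths and larger hop sizes are the usual levers for staying inside a budget like this on energy-constrained devices.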

Critical Analysis

The paper provides a comprehensive evaluation of FoVNet, including comparisons to other speech enhancement models and ablation studies to understand the importance of the configurable FoV. However, the authors do not discuss potential limitations or areas for further research in depth.

One potential concern is the generalization of the model to diverse real-world scenarios. The evaluation was conducted primarily in simulated environments, and it would be valuable to see how FoVNet performs in more complex, real-world settings with varying noise conditions and speaker placements.

Additionally, the paper does not explore the user experience implications of the configurable FoV feature. It would be interesting to understand how users perceive and interact with this functionality in practice, and whether there are any trade-offs or unintended consequences that need to be addressed.

Conclusion

The FoVNet model presented in this paper is a promising approach to speech enhancement for smart glasses. By allowing users to configure the field of view, it provides a flexible and adaptable solution that can be tailored to different environments and user preferences. The researchers have demonstrated the model's effectiveness in terms of speech quality and computational efficiency, making it a potential candidate for real-world deployment in smart glasses and similar wearable devices.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

FoVNet: Configurable Field-of-View Speech Enhancement with Low Computation and Distortion for Smart Glasses

Zhongweiyang Xu, Ali Aroudi, Ke Tan, Ashutosh Pandey, Jung-Suk Lee, Buye Xu, Francesco Nesta

This paper presents a novel multi-channel speech enhancement approach, FoVNet, that enables highly efficient speech enhancement within a configurable field of view (FoV) of a smart-glasses user without needing specific target-talker(s) directions. It advances over prior works by enhancing all speakers within any given FoV, with a hybrid signal processing and deep learning approach designed with high computational efficiency. The neural network component is designed with ultra-low computation (about 50 MMACS). A multi-channel Wiener filter and a post-processing module are further used to improve perceptual quality. We evaluate our algorithm with a microphone array on smart glasses, providing a configurable, efficient solution for augmented hearing on energy-constrained devices. FoVNet excels in both computational efficiency and speech quality across multiple scenarios, making it a promising solution for smart glasses applications.

8/14/2024

Wide-Field, High-Resolution Reconstruction in Computational Multi-Aperture Miniscope Using a Fourier Neural Network

Qianwan Yang, Ruipeng Guo, Guorong Hu, Yujia Xue, Yunzhe Li, Lei Tian

Traditional fluorescence microscopy is constrained by inherent trade-offs among resolution, field-of-view, and system complexity. To navigate these challenges, we introduce a simple and low-cost computational multi-aperture miniature microscope, utilizing a microlens array for single-shot wide-field, high-resolution imaging. Addressing the challenges posed by extensive view multiplexing and non-local, shift-variant aberrations in this device, we present SV-FourierNet, a novel multi-channel Fourier neural network. SV-FourierNet facilitates high-resolution image reconstruction across the entire imaging field through its learned global receptive field. We establish a close relationship between the physical spatially-varying point-spread functions and the network's learned effective receptive field. This ensures that SV-FourierNet has effectively encapsulated the spatially-varying aberrations in our system, and learned a physically meaningful function for image reconstruction. Training of SV-FourierNet is conducted entirely on a physics-based simulator. We showcase wide-field, high-resolution video reconstructions on colonies of freely moving C. elegans and imaging of a mouse brain section. Our computational multi-aperture miniature microscope, augmented with SV-FourierNet, represents a major advancement in computational microscopy and may find broad applications in biomedical research and other fields requiring compact microscopy solutions.

5/31/2024

Beyond the Field-of-View: Enhancing Scene Visibility and Perception with Clip-Recurrent Transformer

Hao Shi, Qi Jiang, Kailun Yang, Xiaoting Yin, Ze Wang, Kaiwei Wang

Vision sensors are widely applied in vehicles, robots, and roadside infrastructure. However, due to limitations in hardware cost and system size, camera Field-of-View (FoV) is often restricted and may not provide sufficient coverage. Nevertheless, from a spatiotemporal perspective, it is possible to obtain information beyond the camera's physical FoV from past video streams. In this paper, we propose the concept of online video inpainting for autonomous vehicles to expand the field of view, thereby enhancing scene visibility, perception, and system safety. To achieve this, we introduce the FlowLens architecture, which explicitly employs optical flow and implicitly incorporates a novel clip-recurrent transformer for feature propagation. FlowLens offers two key features: 1) FlowLens includes a newly designed Clip-Recurrent Hub with 3D-Decoupled Cross Attention (DDCA) to progressively process global information accumulated over time. 2) It integrates a multi-branch Mix Fusion Feed Forward Network (MixF3N) to enhance the precise spatial flow of local features. To facilitate training and evaluation, we derive the KITTI360 dataset with various FoV mask, which covers both outer- and inner FoV expansion scenarios. We also conduct both quantitative assessments and qualitative comparisons of beyond-FoV semantics and beyond-FoV object detection across different models. We illustrate that employing FlowLens to reconstruct unseen scenes even enhances perception within the field of view by providing reliable semantic context. Extensive experiments and user studies involving offline and online video inpainting, as well as beyond-FoV perception tasks, demonstrate that FlowLens achieves state-of-the-art performance. The source code and dataset are made publicly available at https://github.com/MasterHow/FlowLens.

6/26/2024

M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses

Yufeng Yang, Desh Raj, Ju Lin, Niko Moritz, Junteng Jia, Gil Keren, Egor Lakomkin, Yiteng Huang, Jacob Donley, Jay Mahadeokar, Ozlem Kalinli

The growing popularity of multi-channel wearable devices, such as smart glasses, has led to a surge of applications such as targeted speech recognition and enhanced hearing. However, current approaches to solve these tasks use independently trained models, which may not benefit from large amounts of unlabeled data. In this paper, we propose M-BEST-RQ, the first multi-channel speech foundation model for smart glasses, which is designed to leverage large-scale self-supervised learning (SSL) in an array-geometry agnostic approach. While prior work on multi-channel speech SSL only evaluated on simulated settings, we curate a suite of real downstream tasks to evaluate our model, namely (i) conversational automatic speech recognition (ASR), (ii) spherical active source localization, and (iii) glasses wearer voice activity detection, which are sourced from the MMCSG and EasyCom datasets. We show that a general-purpose M-BEST-RQ encoder is able to match or surpass supervised models across all tasks. For the conversational ASR task in particular, using only 8 hours of labeled speech, our model outperforms a supervised ASR baseline that is trained on 2000 hours of labeled data, which demonstrates the effectiveness of our approach.

9/19/2024