Empowering In-Browser Deep Learning Inference on Edge Devices with Just-in-Time Kernel Optimizations

Read original: arXiv:2309.08978 - Published 7/9/2024 by Fucheng Jia, Shiqi Jiang, Ting Cao, Wei Cui, Tianrui Xia, Xu Cao, Yuanchun Li, Deyu Zhang, Ju Ren, Yunxin Liu, Lili Qiu, Mao Yang

Overview

The paper presents a new system called nnJIT that enables just-in-time (JIT) auto-generation of optimized computing kernels for running deep learning (DL) models on edge devices in web browsers.
The key innovations are techniques that significantly reduce the overhead of compiling and tuning these computing kernels, allowing nnJIT to achieve up to 8.2x faster performance compared to existing baselines.
The authors evaluate nnJIT on a range of modern DL models and edge devices, including laptops, smartphones, and different hardware architectures from ARM, Intel, AMD, and Nvidia.

Plain English Explanation

Running AI models directly in web browsers, known as in-browser deep learning inference, is becoming more common as the web becomes a primary platform for delivering AI services to edge devices. However, the varying capabilities of edge devices, combined with the underdeveloped state of web-based hardware acceleration, currently limit the performance of these in-browser AI applications.

To address this issue, the researchers developed a new system called nnJIT that can automatically generate optimized computing kernels for running DL models on different edge devices directly within web browsers. This is achieved through two key innovations:

Tensor-Web Compiling Co-Design: This technique reduces the cost of compiling these computing kernels by around 100x by eliminating redundant and inefficient compilation steps.
Web-Specific Lite Kernel Optimization Space: This reduces the time and effort required to tune the computing kernels by focusing on the specific requirements of web programming and efficient use of device resources, pruning the optimization space from millions of options down to just dozens.

By employing these techniques, nnJIT is able to achieve up to 8.2x faster performance on modern DL models like BART, T5, and Llama 2, compared to existing approaches. This allows for more powerful AI capabilities to be delivered directly within web browsers and edge devices.

Technical Explanation

The paper presents nnJIT, a novel just-in-time (JIT) system for automatically generating optimized computing kernels to run deep learning (DL) models on a variety of edge devices directly within web browsers. This is an important advance, as edge intelligence and the optimization of large language model inference on heterogeneous mobile processors are growing areas of research and application.

The key innovations in nnJIT are two novel techniques that significantly reduce the overhead of compiling and tuning these computing kernels:

Tensor-Web Compiling Co-Design: This approach lowers the compiling costs by around 100x compared to existing baselines. It does this by eliminating redundant and ineffective compilation passes that are unnecessary for the web environment.
Web-Specific Lite Kernel Optimization Space: This technique reduces the kernel tuning costs by focusing the optimization process on the specific requirements of web programming and efficient device resource utilization. This prunes the optimization search space from millions of options down to just dozens.
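To make the idea of pruning a kernel optimization space concrete, here is a minimal sketch in Python. It is not nnJIT's actual search space: the knobs (`tile_m`, `tile_n`, `tile_k`, `unroll`, `vec`), the pruning rules, and the 32 KiB working-set budget are all hypothetical choices made for illustration, assuming a matrix-multiplication kernel on a target with 128-bit SIMD (four 32-bit float lanes).

```python
from itertools import product
from typing import Dict, Iterator, List

Config = Dict[str, int]

# Hypothetical tuning knobs for a tiled float32 matmul kernel.
TILE_SIZES = [1, 2, 4, 8, 16, 32, 64, 128]
UNROLL_FACTORS = [1, 2, 4, 8]
VECTOR_WIDTHS = [1, 2, 4, 8, 16]

# Assumptions made for this sketch, not values taken from the paper.
SIMD_F32_LANES = 4                      # 128-bit SIMD register / 32-bit float
WORKING_SET_BUDGET_BYTES = 32 * 1024    # assumed per-tile memory budget
BYTES_PER_F32 = 4


def full_space() -> Iterator[Config]:
    """Enumerate every combination of the knobs (the unpruned space)."""
    for tm, tn, tk, unroll, vec in product(
        TILE_SIZES, TILE_SIZES, TILE_SIZES, UNROLL_FACTORS, VECTOR_WIDTHS
    ):
        yield {"tile_m": tm, "tile_n": tn, "tile_k": tk,
               "unroll": unroll, "vec": vec}


def working_set_bytes(cfg: Config) -> int:
    """Bytes touched by one tile step: an A tile, a B tile, and a C tile."""
    tm, tn, tk = cfg["tile_m"], cfg["tile_n"], cfg["tile_k"]
    return (tm * tk + tk * tn + tm * tn) * BYTES_PER_F32


def is_lite(cfg: Config) -> bool:
    """Hypothetical pruning rules standing in for web-specific constraints."""
    return (
        cfg["vec"] == SIMD_F32_LANES                 # match the SIMD width
        and cfg["unroll"] == SIMD_F32_LANES          # one unroll choice only
        and cfg["tile_m"] == cfg["tile_n"]           # square output tiles
        and cfg["tile_n"] % SIMD_F32_LANES == 0      # rows vectorize cleanly
        and cfg["tile_k"] % cfg["unroll"] == 0       # no unroll remainder
        and working_set_bytes(cfg) <= WORKING_SET_BUDGET_BYTES
    )


def lite_space() -> List[Config]:
    """Keep only the configurations that pass every pruning rule."""
    return [cfg for cfg in full_space() if is_lite(cfg)]


if __name__ == "__main__":
    full_count = sum(1 for _ in full_space())
    lite_count = len(lite_space())
    print(f"full space: {full_count} configs, lite space: {lite_count} configs")
```

With these toy rules the space shrinks from 10,240 candidates to 27. The paper's reported reduction (millions down to dozens) is far larger and relies on its own web-specific analysis; the sketch only shows the general shape of constraint-based pruning.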

The authors evaluate nnJIT on a range of modern DL models, including BART, T5, and Llama 2, running on edge devices such as laptops and smartphones, across different browsers and hardware from ARM, Intel, AMD, and Nvidia. The results show that nnJIT can achieve up to 8.2x faster inference than existing baselines for in-browser DL inference, reaching that speedup within 30 seconds.
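A small candidate set is what makes tuning under a short time limit practical. The following Python sketch shows a generic time-budgeted search that measures candidate kernels until a deadline passes and keeps the fastest one seen so far. It is an illustration under stated assumptions, not the algorithm from the paper: `tune_within_budget`, `measure`, and the injectable `clock` are hypothetical names, and the demo uses a made-up cost function instead of real kernel timings.

```python
import time
from typing import Callable, Iterable, Optional, Tuple, TypeVar

C = TypeVar("C")


def tune_within_budget(
    candidates: Iterable[C],
    measure: Callable[[C], float],
    budget_s: float,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[C], float]:
    """Return the fastest candidate measured before the budget runs out.

    At least one candidate is always measured (if any exist), so the caller
    gets a usable kernel even with a very small budget.
    """
    deadline = clock() + budget_s
    best: Optional[C] = None
    best_latency = float("inf")
    for cand in candidates:
        # Stop once the deadline has passed, but only after one measurement.
        if best is not None and clock() >= deadline:
            break
        latency = measure(cand)
        if latency < best_latency:
            best, best_latency = cand, latency
    return best, best_latency


if __name__ == "__main__":
    # Made-up cost model: pretend a tile size of 32 is the sweet spot.
    def fake_latency_ms(tile: int) -> float:
        return abs(tile - 32) + 1.0

    best_tile, latency = tune_within_budget([4, 8, 16, 32, 64], fake_latency_ms,
                                            budget_s=30.0)
    print(f"best tile: {best_tile}, simulated latency: {latency} ms")
```

Passing the clock as a parameter keeps the loop testable without real waiting. In a browser setting the same structure would wrap real kernel compilation and timing calls, which this sketch deliberately leaves out.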

Critical Analysis

The paper presents a compelling solution to the challenges of running high-performance DL models directly within web browsers on heterogeneous edge devices. The two key innovations of Tensor-Web Compiling Co-Design and Web-Specific Lite Kernel Optimization Space are well-designed and effectively address the main bottlenecks.

However, the paper does not fully address potential issues around the generalizability of nnJIT's techniques. While the evaluation covers a range of models and devices, it would be helpful to see how the system performs on an even broader set of DL architectures and hardware configurations, including future consumer edge AI computing scenarios.

Additionally, the paper does not discuss potential security and privacy implications of running DL models in web browsers, which should be an important consideration for real-world deployment. Further research may be needed to ensure the safety and robustness of nnJIT in diverse deployment contexts.

Overall, the nnJIT system represents a significant advancement in enabling powerful AI capabilities to be delivered directly within web-based applications on edge devices. The innovations presented in this paper lay a strong foundation for future work in this area.

Conclusion

This paper introduces nnJIT, a pioneering system that enables just-in-time generation of optimized computing kernels for running deep learning models directly within web browsers on a variety of edge devices. Through two novel techniques - Tensor-Web Compiling Co-Design and Web-Specific Lite Kernel Optimization Space - nnJIT is able to significantly reduce the overhead of compiling and tuning these kernels, leading to up to 8.2x faster performance compared to existing approaches.

The successful evaluation of nnJIT across modern DL models and diverse edge hardware demonstrates the potential for more powerful AI capabilities to be delivered directly within web-based applications, bringing the benefits of in-browser deep learning inference to a wider range of users and devices. As the web continues to evolve as a primary platform for AI services, innovations like nnJIT will play a crucial role in unlocking the full potential of edge intelligence, including adaptive device-edge collaboration for DNN inference in AIoT applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Empowering In-Browser Deep Learning Inference on Edge Devices with Just-in-Time Kernel Optimizations

Fucheng Jia, Shiqi Jiang, Ting Cao, Wei Cui, Tianrui Xia, Xu Cao, Yuanchun Li, Deyu Zhang, Ju Ren, Yunxin Liu, Lili Qiu, Mao Yang

Web is increasingly becoming the primary platform to deliver AI services onto edge devices, making in-browser deep learning (DL) inference more prominent. Nevertheless, the heterogeneity of edge devices, combined with the underdeveloped state of Web hardware acceleration practices, hinders current in-browser inference from achieving its full performance potential on target devices. To address this issue, this paper presents the pioneering in-browser inference system, nnJIT, which enables just-in-time (JIT) auto-generation of optimized computing kernels for edge devices. nnJIT is built upon two novel techniques that significantly reduce kernel search and compilation overhead while improving performance firmly: Tensor-Web Compiling Co-Design lowers compiling costs by around 100X through eliminating redundant and ineffective compiling passes; Web-Specific Lite Kernel Optimization Space reduces kernel tuning costs by focusing on Web programming requirements and efficient device resource utilization, pruning the optimization space from millions to only dozens. nnJIT is evaluated for modern models, e.g., BART, T5, and Llama 2, on a range of edge devices including laptops and smartphones using different browsers and hardware from ARM, Intel, AMD and Nvidia. The results show that nnJIT can achieve up to 8.2X faster within 30 seconds compared to the existing baselines.

7/9/2024

Anatomizing Deep Learning Inference in Web Browsers

Qipeng Wang, Shiqi Jiang, Zhenpeng Chen, Xu Cao, Yuanchun Li, Aoyu Li, Yun Ma, Ting Cao, Xuanzhe Liu

Web applications have increasingly adopted Deep Learning (DL) through in-browser inference, wherein DL inference performs directly within Web browsers. The actual performance of in-browser inference and its impacts on the quality of experience (QoE) remain unexplored, and urgently require new QoE measurements beyond traditional ones, e.g., mainly focusing on page load time. To bridge this gap, we make the first comprehensive performance measurement of in-browser inference to date. Our approach proposes new metrics to measure in-browser inference: responsiveness, smoothness, and inference accuracy. Our extensive analysis involves 9 representative DL models across Web browsers of 50 popular PC devices and 20 mobile devices. The results reveal that in-browser inference exhibits a substantial latency gap, averaging 16.9 times slower on CPU and 4.9 times slower on GPU compared to native inference on PC devices. The gap on mobile CPU and mobile GPU is 15.8 times and 7.8 times, respectively. Furthermore, we identify contributing factors to such latency gap, including underutilized hardware instruction sets, inherent overhead in the runtime environment, resource contention within the browser, and inefficiencies in software libraries and GPU abstractions. Additionally, in-browser inference imposes significant memory demands, at times exceeding 334.6 times the size of the DL models themselves, partly attributable to suboptimal memory management. We also observe that in-browser inference leads to a significant 67.2% increase in the time it takes for GUI components to render within Web browsers, significantly affecting the overall user QoE of Web applications reliant on this technology.

7/26/2024

Deep Learning Inference on Heterogeneous Mobile Processors: Potentials and Pitfalls

Sicong Liu, Wentao Zhou, Zimu Zhou, Bin Guo, Minfan Wang, Cheng Fang, Zheng Lin, Zhiwen Yu

There is a growing demand to deploy computation-intensive deep learning (DL) models on resource-constrained mobile devices for real-time intelligent applications. Equipped with a variety of processing units such as CPUs, GPUs, and NPUs, the mobile devices hold potential to accelerate DL inference via parallel execution across heterogeneous processors. Various efficient parallel methods have been explored to optimize computation distribution, achieve load balance, and minimize communication cost across processors. Yet their practical effectiveness in the dynamic and diverse real-world mobile environment is less explored. This paper presents a holistic empirical study to assess the capabilities and challenges associated with parallel DL inference on heterogeneous mobile processors. Through carefully designed experiments covering various DL models, mobile software/hardware environments, workload patterns, and resource availability, we identify limitations of existing techniques and highlight opportunities for cross-level optimization.

5/6/2024

Accelerate Intermittent Deep Inference

Ziliang Zhang

Emerging research in edge devices and micro-controller units (MCU) enables on-device computation of Deep Learning Training and Inferencing tasks. More recently, contemporary trends focus on making the Deep Neural Net (DNN) Models runnable on battery-less intermittent devices. One of the approaches is to shrink the DNN models by enabling weight sharing, pruning, and conducting Neural Architecture Search (NAS) with an optimized search space to target specific edge devices [Cai2019OnceFA, Lin2020MCUNetTD, Lin2021MCUNetV2MP, Lin2022OnDeviceTU]. Another approach analyzes the intermittent execution and designs the corresponding system by performing NAS that is aware of intermittent execution cycles and resource constraints [iNAS, HW-NAS, iLearn]. However, the optimized NAS was only considering consecutive execution with no power loss, and intermittent execution designs only focused on balancing data reuse and costs related to intermittent inference and often with low accuracy. We proposed Accelerated Intermittent Deep Inference to harness the power of optimized inferencing DNN models specifically targeting SRAM under 256KB and make it schedulable and runnable within intermittent power. Our main contributions are: (1) Schedule tasks performed by on-device inferencing into intermittent execution cycles and optimize for latency; (2) Develop a system that can satisfy the end-to-end latency while achieving a much higher accuracy compared to baseline [iNAS, HW-NAS].

7/23/2024