Bridging the Language Gap: An Empirical Study of Bindings for Open Source Machine Learning Libraries Across Software Package Ecosystems

Read original: arXiv:2201.07201 - Published 8/21/2024 by Hao Li, Cor-Paul Bezemer


Overview

Open-source machine learning (ML) libraries allow developers to integrate advanced ML capabilities into their applications.
However, popular ML libraries like TensorFlow are not natively available in all programming languages and software package ecosystems.
Developers may need to use "binding libraries" (or bindings) to reuse an ML library in a different programming language or ecosystem.
Bindings provide cross-language and cross-ecosystem support for reusing the original library, which is called the host library.

Plain English Explanation

Machine learning (ML) is a powerful tool that allows software to learn and improve from data, without being explicitly programmed. Open-source ML libraries, like TensorFlow and Keras, make it easier for developers to add advanced ML capabilities to their applications.

However, these popular ML libraries are usually written in one particular language, such as Python, and may not be readily available in every programming language and software package ecosystem that developers use. For example, the Keras library was written in Python, but the Keras .NET binding makes Keras usable in the NuGet (.NET) ecosystem.

These binding libraries act as a bridge, allowing developers to use an ML library in a different programming language or software package ecosystem than the one it was originally written in. By using a binding, developers can reuse the advanced ML functionality of a host library, even if it wasn't originally designed for their specific programming language or software package ecosystem.

Technical Explanation

This paper presents a comprehensive study of cross-ecosystem bindings for open-source ML libraries. The researchers used an approach called BindFind to automatically identify and link 2,436 bindings for 546 ML libraries across 13 different software package ecosystems.
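This summary does not describe how BindFind works internally, so the following is not its algorithm. It is only a minimal sketch of what "identify a binding and link it to its host" could mean, assuming a naive name-matching heuristic. The host list, the affix set, and the function names are all hypothetical and chosen for illustration.

```python
import re

# Hypothetical host-library names; not taken from the paper's dataset.
HOST_LIBRARIES = ["tensorflow", "keras", "pytorch"]

# Affixes that a binding package might add to a host name (an assumption).
AFFIXES = {"net", "js", "rs", "sharp", "binding", "bindings", "node", "go"}


def normalize(name):
    """Lowercase a package name and split it into alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", name.lower()) if t]


def link_to_host(package_name, hosts=HOST_LIBRARIES):
    """Return the host library a package name appears to bind, or None."""
    tokens = normalize(package_name)
    for host in hosts:
        # Require the host name plus at least one binding-style affix, so
        # the host package itself is not reported as its own binding.
        if host in tokens and any(t in AFFIXES for t in tokens):
            return host
    return None


if __name__ == "__main__":
    for candidate in ["Keras.NET", "keras", "left-pad"]:
        print(candidate, "->", link_to_host(candidate))
```

A heuristic this simple would miss bindings whose names do not mention the host and would produce false matches on unrelated packages, which is presumably part of why a dedicated approach is needed at the scale of 13 ecosystems.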

The researchers then conducted an in-depth analysis of 133 cross-ecosystem bindings for 40 popular open-source ML libraries. Their key findings include:

The majority of ML library bindings are maintained by the community, with npm being the most popular ecosystem for these bindings.
Most bindings cover only a limited range of the host library's releases and often take considerable time to support new releases.
Technical lag is widespread: bindings frequently trail the host library and do not fully support its latest features and capabilities.

These findings highlight important considerations for developers who need to integrate ML library bindings into their applications, such as the need to carefully manage versioning and feature support. The researchers also identify opportunities for further research to better understand and improve the ecosystem of ML library bindings.
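To make the notions of release coverage, support delay, and technical lag more concrete, here is a minimal sketch of how they could be computed for a single binding. The release data is made up, the function names are hypothetical, and counting "releases behind" is only one simple proxy for technical lag; the paper may define and measure these quantities differently.

```python
from datetime import date

# Made-up release dates for a host library.
host_releases = {
    "1.0": date(2023, 1, 10),
    "1.1": date(2023, 4, 5),
    "1.2": date(2023, 8, 20),
    "2.0": date(2024, 2, 1),
}
# Made-up dates on which a binding first supported a given host version.
binding_support = {
    "1.0": date(2023, 3, 1),
    "1.2": date(2023, 11, 18),
}


def release_coverage(host, binding):
    """Fraction of host releases that the binding supports."""
    return len(set(binding) & set(host)) / len(host)


def support_delays(host, binding):
    """Days between each host release and the binding release supporting it."""
    return {v: (binding[v] - host[v]).days for v in binding if v in host}


def releases_behind(host, binding):
    """Number of host releases newer than the latest one the binding supports."""
    ordered = sorted(host, key=host.get)  # oldest to newest by release date
    supported = [v for v in ordered if v in binding]
    if not supported:
        return len(ordered)
    return len(ordered) - 1 - ordered.index(supported[-1])


if __name__ == "__main__":
    print(release_coverage(host_releases, binding_support))  # 0.5
    print(support_delays(host_releases, binding_support))    # {'1.0': 50, '1.2': 90}
    print(releases_behind(host_releases, binding_support))   # 1
```

In this toy data the binding covers half of the host's releases, needs 50 to 90 days to catch up with the ones it does support, and is one release behind the newest host version.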

Critical Analysis

The paper provides a comprehensive and well-designed study of cross-ecosystem bindings for open-source ML libraries. The researchers' use of the BindFind approach to automatically identify and link a large number of bindings is a notable strength, as it allows for a more systematic and scalable analysis than manual approaches.

However, the paper does not delve into the specific reasons why many bindings only cover a limited range of host library releases or experience technical lag. Understanding the underlying causes of these issues could inform strategies to improve the development and maintenance of ML library bindings.

Additionally, the paper does not explore the potential impact of these binding-related challenges on the broader adoption and use of open-source ML libraries. Further research could investigate how these issues affect developers' experiences and the overall ecosystem of ML-enabled applications.

Conclusion

This paper offers valuable insights into the landscape of cross-ecosystem bindings for open-source ML libraries. The researchers' findings highlight the importance of carefully managing versions and feature support when integrating ML library bindings into applications.

The study also points to opportunities for improving the development and maintenance of ML library bindings, which could ultimately enhance the accessibility and usability of advanced ML capabilities for a wider range of developers and software ecosystems. As the use of ML continues to grow, understanding and addressing these binding-related challenges will be crucial for fostering a more robust and inclusive ML-powered software ecosystem.



Related Papers



Studying the Impact of TensorFlow and PyTorch Bindings on Machine Learning Software Quality

Hao Li, Gopi Krishnan Rajbahadur, Cor-Paul Bezemer

Bindings for machine learning frameworks (such as TensorFlow and PyTorch) allow developers to integrate a framework's functionality using a programming language different from the framework's default language (usually Python). In this paper, we study the impact of using TensorFlow and PyTorch bindings in C#, Rust, Python and JavaScript on the software quality in terms of correctness (training and test accuracy) and time cost (training and inference time) when training and performing inference on five widely used deep learning models. Our experiments show that a model can be trained in one binding and used for inference in another binding for the same framework without losing accuracy. Our study is the first to show that using a non-default binding can help improve machine learning software quality from the time cost perspective compared to the default Python binding while still achieving the same level of correctness.

7/9/2024

A Large-Scale Study of Model Integration in ML-Enabled Software Systems

Yorick Sens, Henriette Knopp, Sven Peldszus, Thorsten Berger

The rise of machine learning (ML) and its embedding in systems has drastically changed the engineering of software-intensive systems. Traditionally, software engineering focuses on manually created artifacts such as source code and the process of creating them, as well as best practices for integrating them, i.e., software architectures. In contrast, the development of ML artifacts, i.e. ML models, comes from data science and focuses on the ML models and their training data. However, to deliver value to end users, these ML models must be embedded in traditional software, often forming complex topologies. In fact, ML-enabled software can easily incorporate many different ML models. While the challenges and practices of building ML-enabled systems have been studied to some extent, beyond isolated examples, little is known about the characteristics of real-world ML-enabled systems. Properly embedding ML models in systems so that they can be easily maintained or reused is far from trivial. We need to improve our empirical understanding of such systems, which we address by presenting the first large-scale study of real ML-enabled software systems, covering over 2,928 open source systems on GitHub. We classified and analyzed them to determine their characteristics, as well as their practices for reusing ML models and related code, and the architecture of these systems. Our findings provide practitioners and researchers with insight into practices for embedding and integrating ML models, bringing data science and software engineering closer together.

8/13/2024

Ecosystem-level Analysis of Deployed Machine Learning Reveals Homogeneous Outcomes

Connor Toups, Rishi Bommasani, Kathleen A. Creel, Sarah H. Bana, Dan Jurafsky, Percy Liang

Machine learning is traditionally studied at the model level: researchers measure and improve the accuracy, robustness, bias, efficiency, and other dimensions of specific models. In practice, the societal impact of machine learning is determined by the surrounding context of machine learning deployments. To capture this, we introduce ecosystem-level analysis: rather than analyzing a single model, we consider the collection of models that are deployed in a given context. For example, ecosystem-level analysis in hiring recognizes that a job candidate's outcomes are not only determined by a single hiring algorithm or firm but instead by the collective decisions of all the firms they applied to. Across three modalities (text, images, speech) and 11 datasets, we establish a clear trend: deployed machine learning is prone to systemic failure, meaning some users are exclusively misclassified by all models available. Even when individual models improve at the population level over time, we find these improvements rarely reduce the prevalence of systemic failure. Instead, the benefits of these improvements predominantly accrue to individuals who are already correctly classified by other models. In light of these trends, we consider medical imaging for dermatology where the costs of systemic failure are especially high. While traditional analyses reveal racial performance disparities for both models and humans, ecosystem-level analysis reveals new forms of racial disparity in model predictions that do not present in human predictions. These examples demonstrate ecosystem-level analysis has unique strengths for characterizing the societal impact of machine learning.

4/4/2024