The Elephant in the Room: Analyzing the Presence of Big Tech in Natural Language Processing Research

Read original: arXiv:2305.02797 - Published 7/17/2024 by Mohamed Abdalla, Jan Philip Wahle, Terry Ruas, Aurélie Névéol, Fanny Ducel, Saif M. Mohammad, Karen Fort

Overview

This study examines the growing influence of industry on natural language processing (NLP) research over the past few decades.
The researchers used a large corpus of NLP publications and author resumes to quantify and characterize the industry's presence in the NLP community.
They found that industry presence among NLP authors has increased sharply in recent years. A few major companies account for most of the industry publications and also fund academic researchers.
The study calls for increased transparency around the industry's influence in the field of NLP research.

Plain English Explanation

The field of natural language processing (NLP) has seen rapid advancements in recent years, thanks in large part to the development of powerful deep learning methods. As a result, NLP has become increasingly important for businesses and industries, creating new commercial opportunities.

To better understand the industry's involvement in NLP research, this study looked at a large collection of NLP publications and the resumes of their authors. The researchers found that industry presence in the NLP research community held steady for many years and then rose steeply between 2017 and 2022. A handful of major companies now account for most of the industry's NLP publications and also fund academic researchers through grants and internships.

This shift towards greater industry involvement raises important questions about the transparency and potential biases in NLP research. As the industry's influence grows, there are concerns that the research agenda may be shaped more by commercial interests than by the goal of advancing scientific understanding. The study calls for greater openness about the industry's role in this field.

Technical Explanation

The researchers analyzed a corpus of 78,187 NLP publications along with 701 resumes of NLP authors to quantify and characterize the industry's presence in the NLP research community over time. They found that industry presence, measured as the percentage of NLP authors affiliated with industry, remained relatively steady for years and then grew by 180% from 2017 to 2022.
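To make the growth figure concrete, here is a minimal Python sketch of how an industry share per year and its relative growth could be computed. It is not the authors' code: the record format, the function names `industry_share_by_year` and `percent_growth`, and the toy numbers are all assumptions chosen for illustration, and the paper's exact measurement procedure may differ.

```python
from collections import defaultdict


def industry_share_by_year(records):
    """Return {year: percentage of author records with an industry affiliation}.

    `records` is an iterable of (year, is_industry) pairs. This format is
    hypothetical; the paper's corpus has its own metadata schema.
    """
    totals = defaultdict(int)
    industry = defaultdict(int)
    for year, is_industry in records:
        totals[year] += 1
        if is_industry:
            industry[year] += 1
    return {year: 100.0 * industry[year] / totals[year] for year in totals}


def percent_growth(start, end):
    """Relative change from `start` to `end`, expressed as a percentage."""
    if start == 0:
        raise ValueError("growth is undefined when the starting value is zero")
    return 100.0 * (end - start) / start


# Toy records, NOT data from the paper: 1 of 10 authors in industry in 2017,
# 7 of 25 in 2022.
toy_records = (
    [(2017, True)] * 1 + [(2017, False)] * 9
    + [(2022, True)] * 7 + [(2022, False)] * 18
)

shares = industry_share_by_year(toy_records)
growth = percent_growth(shares[2017], shares[2022])
print(shares, growth)
```

With these toy numbers the share moves from 10% to 28%. The point of the example is that a 180% increase means the later value is 2.8 times the earlier one, not 1.8 times.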

Further analysis revealed that a small number of companies, primarily large tech firms, account for the majority of industry-authored NLP publications. These companies also provide significant funding to academic researchers through grants and internship programs, potentially influencing the research agenda.
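The concentration claim can be illustrated in the same spirit. The sketch below computes what fraction of industry-authored papers comes from the k most frequent affiliations. The function name `top_k_share`, the input format, and the placeholder company names are assumptions for illustration only; the paper may measure concentration differently.

```python
from collections import Counter


def top_k_share(affiliations, k):
    """Fraction of papers attributed to the k most frequent affiliations.

    `affiliations` holds one (hypothetical) company label per
    industry-authored paper.
    """
    counts = Counter(affiliations)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(n for _, n in counts.most_common(k))
    return top / total


# Toy labels, NOT data from the paper.
toy_affiliations = (
    ["CompanyA"] * 5 + ["CompanyB"] * 3 + ["CompanyC"] * 1 + ["CompanyD"] * 1
)

share_top2 = top_k_share(toy_affiliations, 2)
print(share_top2)
```

In this toy set, two of the four companies account for 8 of 10 papers, which is the kind of skew the study describes.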

The study design allowed the researchers to track the growth of industry involvement in NLP over an extended period, starting in the early 1990s. By combining publication metadata with author resume data, they were able to describe the industry's footprint in the field, including the specific companies and researchers driving this trend.

Critical Analysis

While the study provides valuable insights into the increasing industry presence in NLP research, it does not delve into the potential implications or risks of this trend. The researchers acknowledge the need for greater transparency around industry influence, but do not explore the specific ways in which commercial interests may shape the research priorities or methodologies used in NLP.

Additionally, the study is limited to analyzing publication and resume data, which may not capture the full extent of industry involvement, such as funding for research projects, collaborations, or other forms of engagement that are not reflected in the authorship of academic papers. Further research is needed to examine the broader impacts of industry influence on the NLP job market and the research topics and datasets being prioritized in the field.

It is also important to consider the role that industry-academia collaboration can play in driving innovation and practical applications of NLP technology. The study does not distinguish between mutually beneficial partnerships and cases of potential "corporate capture" of the research agenda.

Conclusion

This study provides compelling evidence of the growing industry presence in the field of natural language processing research. As major tech companies increasingly invest in and influence NLP research, there is a need for greater transparency and critical examination of the potential implications for the direction and integrity of the field.

While industry involvement can contribute to the development of practical NLP applications, the study highlights the risk of commercial interests shaping the research priorities and potentially introducing biases. Ongoing monitoring and open dialogue between industry, academia, and the public are necessary to ensure that NLP research remains driven by the pursuit of scientific knowledge and societal benefit, rather than solely by profit motives.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

The Elephant in the Room: Analyzing the Presence of Big Tech in Natural Language Processing Research

Mohamed Abdalla, Jan Philip Wahle, Terry Ruas, Aurélie Névéol, Fanny Ducel, Saif M. Mohammad, Karen Fort

Recent advances in deep learning methods for natural language processing (NLP) have created new business opportunities and made NLP research critical for industry development. As one of the big players in the field of NLP, together with governments and universities, it is important to track the influence of industry on research. In this study, we seek to quantify and characterize industry presence in the NLP community over time. Using a corpus with comprehensive metadata of 78,187 NLP publications and 701 resumes of NLP publication authors, we explore the industry presence in the field since the early 90s. We find that industry presence among NLP authors has been steady before a steep increase over the past five years (180% growth from 2017 to 2022). A few companies account for most of the publications and provide funding to academic researchers through grants and internships. Our study shows that the presence and impact of the industry on natural language processing research are significant and fast-growing. This work calls for increased transparency of industry influence in the field.

7/17/2024

Collaboration or Corporate Capture? Quantifying NLP's Reliance on Industry Artifacts and Contributions

Will Aitken, Mohamed Abdalla, Karen Rudie, Catherine Stinson

Impressive performance of pre-trained models has garnered public attention and made news headlines in recent years. Almost always, these models are produced by or in collaboration with industry. Using them is critical for competing on natural language processing (NLP) benchmarks and correspondingly to stay relevant in NLP research. We surveyed 100 papers published at EMNLP 2022 to determine the degree to which researchers rely on industry models, other artifacts, and contributions to publish in prestigious NLP venues and found that the ratio of their citation is at least three times greater than what would be expected. Our work serves as a scaffold to enable future researchers to more accurately address whether: 1) Collaboration with industry is still collaboration in the absence of an alternative or 2) if NLP inquiry has been captured by the motivations and research direction of private corporations.

6/26/2024

We are Who We Cite: Bridges of Influence Between Natural Language Processing and Other Academic Fields

Jan Philip Wahle, Terry Ruas, Mohamed Abdalla, Bela Gipp, Saif M. Mohammad

Natural Language Processing (NLP) is poised to substantially influence the world. However, significant progress comes hand-in-hand with substantial risks. Addressing them requires broad engagement with various fields of study. Yet, little empirical work examines the state of such engagement (past or current). In this paper, we quantify the degree of influence between 23 fields of study and NLP (on each other). We analyzed ~77k NLP papers, ~3.1m citations from NLP papers to other papers, and ~1.8m citations from other papers to NLP papers. We show that, unlike most fields, the cross-field engagement of NLP, measured by our proposed Citation Field Diversity Index (CFDI), has declined from 0.58 in 1980 to 0.31 in 2022 (an all-time low). In addition, we find that NLP has grown more insular -- citing increasingly more NLP papers and having fewer papers that act as bridges between fields. NLP citations are dominated by computer science; Less than 8% of NLP citations are to linguistics, and less than 3% are to math and psychology. These findings underscore NLP's urgent need to reflect on its engagement with various fields.

7/17/2024

Computational Job Market Analysis with Natural Language Processing

Mike Zhang

[Abridged Abstract] Recent technological advances underscore labor market dynamics, yielding significant consequences for employment prospects and increasing job vacancy data across platforms and languages. Aggregating such data holds potential for valuable insights into labor market demands, new skills emergence, and facilitating job matching for various stakeholders. However, despite prevalent insights in the private sector, transparent language technology systems and data for this domain are lacking. This thesis investigates Natural Language Processing (NLP) technology for extracting relevant information from job descriptions, identifying challenges including scarcity of training data, lack of standardized annotation guidelines, and shortage of effective extraction methods from job ads. We frame the problem, obtaining annotated data, and introducing extraction methodologies. Our contributions include job description datasets, a de-identification dataset, and a novel active learning algorithm for efficient model training. We propose skill extraction using weak supervision, a taxonomy-aware pre-training methodology adapting multilingual language models to the job market domain, and a retrieval-augmented model leveraging multiple skill extraction datasets to enhance overall performance. Finally, we ground extracted information within a designated taxonomy.

5/1/2024