Why do you cite? An investigation on citation intents and decision-making classification processes

Read original: arXiv:2407.13329 - Published 7/19/2024 by Lorenzo Paolini (Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy), Sahar Vahdati (Nature-inspired machine intelligence group, SCaDS.AI center, Technical University of Dresden, Germany; Institute for Applied Computer Science, InfAI - Dresden, Germany) and 4 others

Overview

Examines the importance of accurately classifying the intent behind citations in scholarly work
Presents an advanced ensemble model for Citation Intent Classification (CIC) that leverages Language Models (LMs) and Explainable AI (XAI) techniques
Demonstrates the critical role of section titles in improving classification performance
Introduces a web application for classifying citation intents
Achieves state-of-the-art performance on the SciCite benchmark

Plain English Explanation

When researchers cite other scholarly works in their own papers, they are doing so for specific reasons. Understanding these citation intents is essential for gaining deeper insights into the nature of scientific contributions and assessing their true impact. Most metrics used to analyze these conceptual links are based on simple quantitative observations, but there is a rich world of meaning behind each citation that needs to be effectively revealed.

This study emphasizes the importance of accurately classifying citation intents to provide more comprehensive and insightful analyses in research assessment. The researchers present an advanced ensemble model that combines Language Models (LMs) and Explainable AI (XAI) techniques to classify citation intents with high accuracy and interpretability.

One of the key findings is that the inclusion of section titles as a feature significantly enhances the classification performance. The researchers also introduce a web application that allows users to classify citation intents, and their model achieves a new state-of-the-art result on the SciCite benchmark.

Overall, this research provides valuable insights for developing more robust datasets and methodologies, ultimately fostering a deeper understanding of scholarly communication.

Technical Explanation

The study presents two ensemble classifiers for the Citation Intent Classification (CIC) task. Both ensembles use fine-tuned SciBERT and XLNet Language Models (LMs) as baselines, and the researchers additionally evaluate section titles as an input feature.
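A minimal sketch of the ensemble idea may help here. It assumes each level-0 model outputs a probability for each of SciCite's three intent classes, and it uses a weighted average as a stand-in for the metaclassifier, whose actual form this summary does not specify. The function name, model names, probabilities, and weights are all hypothetical.

```python
# Illustrative only: a weighted average stands in for the paper's metaclassifier.
LABELS = ["background", "method", "result"]  # SciCite's three intent classes


def meta_classify(level0_probs, model_weights):
    """Combine per-model class probabilities into one prediction.

    level0_probs: dict mapping a model name to a list of class probabilities
    model_weights: dict mapping the same model names to non-negative weights
    """
    total_weight = sum(model_weights[name] for name in level0_probs)
    combined = [0.0] * len(LABELS)
    for name, probs in level0_probs.items():
        share = model_weights[name] / total_weight
        for i, p in enumerate(probs):
            combined[i] += share * p
    best = max(range(len(LABELS)), key=lambda i: combined[i])
    return LABELS[best], combined


# Hypothetical outputs of two fine-tuned level-0 models for one citation.
level0 = {"scibert": [0.6, 0.3, 0.1], "xlnet": [0.2, 0.7, 0.1]}
weights = {"scibert": 1.0, "xlnet": 1.0}
label, combined = meta_classify(level0, weights)
```

With equal weights the two models disagree on their top class, and the averaged distribution decides the final label. A trained metaclassifier would learn this combination from data instead of using fixed weights.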

The researchers demonstrate that the inclusion of section titles as a feature significantly improves the classification performance. This finding suggests that the contextual information provided by section titles is crucial for accurately determining the intent behind a citation.
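One plausible way to expose a section title to a language model is to prepend it to the citation sentence. The sketch below assumes that encoding; the summary does not say how the paper actually combines the two, and the function name and the separator string are hypothetical.

```python
# Assumption: the section title is prepended to the citation sentence.
def build_input(citation_sentence, section_title=None, sep=" [SEP] "):
    """Return the text fed to a classifier, with an optional section title."""
    sentence = " ".join(citation_sentence.split())  # collapse stray whitespace
    title = (section_title or "").strip()
    if not title:
        return sentence  # fall back to the sentence alone
    return f"{title}{sep}{sentence}"


with_title = build_input("We follow  the approach of Smith et al.", "Methods")
without_title = build_input("We follow  the approach of Smith et al.")
```

The same sentence under a "Methods" heading and under a "Related Work" heading then produces different inputs, which is the kind of contextual signal the finding points to.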

To enhance the interpretability and trustworthiness of the models' predictions, the researchers employ Explainable AI (XAI) techniques. These techniques provide insights into the decision-making processes, highlighting the contributions of individual words for level-0 classifications and the individual models for the metaclassification.
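Word-level contributions can be illustrated with occlusion: remove one word at a time and measure how much the predicted score drops. This is a generic attribution sketch, not necessarily the XAI technique used in the paper, which this summary does not name. The scoring function is a toy stand-in for a real classifier, and its cue words are hypothetical.

```python
# Toy stand-in for a classifier's "method" probability (hypothetical cue words).
METHOD_CUES = {"use", "algorithm", "method"}


def toy_method_score(words):
    """Score rises with the number of cue words and is capped at 1.0."""
    hits = sum(1 for w in words if w.lower() in METHOD_CUES)
    return min(1.0, 0.2 + 0.4 * hits)


def occlusion_attributions(words, score_fn):
    """Attribute to each word the score drop caused by removing it."""
    base = score_fn(words)
    attributions = []
    for i, word in enumerate(words):
        reduced = words[:i] + words[i + 1:]
        attributions.append((word, base - score_fn(reduced)))
    return attributions


sentence = "We use the algorithm of Smith".split()
attributions = occlusion_attributions(sentence, toy_method_score)
```

Words whose removal lowers the score receive a positive attribution, while words that leave the score unchanged receive zero. The same idea applies one level up: dropping one base model and observing the change in the ensemble output indicates that model's contribution to the metaclassification.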

One of the ensemble models sets a new state-of-the-art (SOTA) with an 89.46% Macro-F1 score on the SciCite benchmark, a widely used dataset for citation intent classification. The web application developed as part of this study allows users to classify citation intents, further demonstrating the practical applications of this research.
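For readers unfamiliar with the metric, Macro-F1 is the unweighted mean of the per-class F1 scores, so each class counts equally regardless of how many examples it has. The sketch below computes it from scratch; the labels and predictions are made-up toy data, not results from the paper.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example with SciCite's three classes (made-up data).
labels = ["background", "method", "result"]
y_true = ["background", "background", "method", "result"]
y_pred = ["background", "method", "method", "result"]
score = macro_f1(y_true, y_pred, labels)
```

In this toy case the per-class F1 scores are 2/3, 2/3, and 1, so the macro average is 7/9.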

Taken together, the results indicate that section titles carry contextual signal that the citation sentence alone lacks, while the XAI techniques make the models' decision-making easier to inspect and therefore easier to trust.

Critical Analysis

The study presents a robust and well-designed approach to citation intent classification, leveraging advanced ensemble strategies and incorporating state-of-the-art Language Models and Explainable AI techniques. The inclusion of section titles as a feature is a particularly insightful contribution, as it highlights the importance of contextual information in accurately determining citation intents.

However, the paper does not address the potential limitations of the study, such as the generalizability of the findings to other datasets or domains, or the potential biases or errors in the underlying citation datasets. Additionally, the authors do not discuss the potential ethical implications of their work, such as the use of citation data for research assessment or the potential misuse of citation intent classification.

Furthermore, while the web application is a valuable contribution, the paper does not provide details on its usage, user feedback, or plans for further development and deployment. Addressing these aspects could enhance the practical impact of the research.

Overall, the study makes a significant contribution to the field of citation analysis and research assessment, but there is room for further exploration and critical analysis of the limitations and potential implications of the work.

Conclusion

This study emphasizes the importance of accurately classifying citation intents in scholarly communication, which is essential for gaining deeper insights into the nature of scientific contributions and assessing their true impact. The researchers present an advanced ensemble model that leverages Language Models and Explainable AI techniques to achieve state-of-the-art performance on the SciCite benchmark.

One of the key findings is the critical role of section titles in improving classification performance, underscoring the importance of contextual information in accurately determining citation intents. The integration of Explainable AI techniques provides valuable insights into the models' decision-making processes, fostering a deeper understanding of the factors that influence citation practices.

The introduction of a web application for citation intent classification further demonstrates the practical applications of this research. Overall, this study offers important insights for developing more robust datasets and methodologies, ultimately contributing to a deeper understanding of scholarly communication and its impact.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Why do you cite? An investigation on citation intents and decision-making classification processes

Lorenzo Paolini (Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy), Sahar Vahdati (Nature-inspired machine intelligence group, SCaDS.AI center, Technical University of Dresden, Germany; Institute for Applied Computer Science, InfAI - Dresden, Germany), Angelo Di Iorio (Department of Computer Science and Engineering, University of Bologna, Bologna, Italy), Robert Wardenga (Institute for Applied Computer Science, InfAI - Dresden, Germany), Ivan Heibi (Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy; Digital Humanities Advanced Research Centre), Silvio Peroni (Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy; Digital Humanities Advanced Research Centre)

Identifying the reason for which an author cites another work is essential to understand the nature of scientific contributions and to assess their impact. Citations are one of the pillars of scholarly communication and most metrics employed to analyze these conceptual links are based on quantitative observations. Behind the act of referencing another scholarly work there is a whole world of meanings that needs to be proficiently and effectively revealed. This study emphasizes the importance of trustfully classifying citation intents to provide more comprehensive and insightful analyses in research assessment. We address this task by presenting a study utilizing advanced Ensemble Strategies for Citation Intent Classification (CIC) incorporating Language Models (LMs) and employing Explainable AI (XAI) techniques to enhance the interpretability and trustworthiness of models' predictions. Our approach involves two ensemble classifiers that utilize fine-tuned SciBERT and XLNet LMs as baselines. We further demonstrate the critical role of section titles as a feature in improving models' performances. The study also introduces a web application developed with Flask and currently available at http://137.204.64.4:81/cic/classifier, aimed at classifying citation intents. One of our models sets a new state-of-the-art (SOTA) with an 89.46% Macro-F1 score on the SciCite benchmark. The integration of XAI techniques provides insights into the decision-making processes, highlighting the contributions of individual words for level-0 classifications, and of individual models for the metaclassification. The findings suggest that the inclusion of section titles significantly enhances classification performances in the CIC task. Our contributions provide useful insights for developing more robust datasets and methodologies, thus fostering a deeper understanding of scholarly communication.

7/19/2024

Modeling citation worthiness by using attention-based bidirectional long short-term memory networks and interpretable models

Tong Zeng, Daniel E. Acuna

Scientists learn early on how to cite scientific sources to support their claims. Sometimes, however, scientists have challenges determining where a citation should be situated -- or, even worse, fail to cite a source altogether. Automatically detecting sentences that need a citation (i.e., citation worthiness) could solve both of these issues, leading to more robust and well-constructed scientific arguments. Previous researchers have applied machine learning to this task but have used small datasets and models that do not take advantage of recent algorithmic developments such as attention mechanisms in deep learning. We hypothesize that we can develop significantly accurate deep learning architectures that learn from large supervised datasets constructed from open access publications. In this work, we propose a Bidirectional Long Short-Term Memory (BiLSTM) network with attention mechanism and contextual information to detect sentences that need citations. We also produce a new, large dataset (PMOA-CITE) based on PubMed Open Access Subset, which is orders of magnitude larger than previous datasets. Our experiments show that our architecture achieves state of the art performance on the standard ACL-ARC dataset ($F_{1}=0.507$) and exhibits high performance ($F_{1}=0.856$) on the new PMOA-CITE. Moreover, we show that it can transfer learning across these datasets. We further use interpretable models to illuminate how specific language is used to promote and inhibit citations. We discover that sections and surrounding sentences are crucial for our improved predictions. We further examined purported mispredictions of the model, and uncovered systematic human mistakes in citation behavior and source data. This opens the door for our model to check documents during pre-submission and pre-archival procedures. We make this new dataset, the code, and a web-based tool available to the community.

5/21/2024

Hidden Citations Obscure True Impact in Science

Xiangyi Meng, Onur Varol, Albert-László Barabási

References, the mechanism scientists rely on to signal previous knowledge, lately have turned into widely used and misused measures of scientific impact. Yet, when a discovery becomes common knowledge, citations suffer from obliteration by incorporation. This leads to the concept of hidden citation, representing a clear textual credit to a discovery without a reference to the publication embodying it. Here, we rely on unsupervised interpretable machine learning applied to the full text of each paper to systematically identify hidden citations. We find that for influential discoveries hidden citations outnumber citation counts, emerging regardless of publishing venue and discipline. We show that the prevalence of hidden citations is not driven by citation counts, but rather by the degree of the discourse on the topic within the text of the manuscripts, indicating that the more discussed is a discovery, the less visible it is to standard bibliometric analysis. Hidden citations indicate that bibliometric measures offer a limited perspective on quantifying the true impact of a discovery, raising the need to extract knowledge from the full text of the scientific corpus.

5/14/2024

Past, Present, and Future of Citation Practices in HCI

Jonas Oppenlaender

Science is a complex system comprised of many scientists who individually make collective decisions that, due to the size and nature of the academic system, largely do not affect the system as a whole. However, certain decisions at the meso-level of research communities, such as the Human-Computer Interaction (HCI) community, may result in deep and long-lasting behavioral changes in scientists. In this article, we provide evidence on how a change in editorial policies introduced at the ACM CHI Conference in 2016 launched the CHI community on an expansive path, denoted by a year-by-year increase in the mean number of references included in CHI articles. If this near-linear trend continues undisrupted, an article in CHI 2030 will include on average almost 130 references. The trend towards more citations reflects a citation culture where quantity is prioritized over quality, contributing to both author and peer reviewer fatigue. This article underscores the profound impact that meso-level policy adjustments have on the evolution of scientific fields and disciplines, urging stakeholders to carefully consider the broader implications of such changes.

9/11/2024