Seed-based information retrieval in networks of research publications: Evaluation of direct citations, bibliographic coupling, co-citations and PubMed related article score

Read original: arXiv:2403.09295 - Published 6/14/2024 by Peter Sjögårde, Per Ahlgren

Overview

  • This paper explores the use of citation-based approaches for seed-based information retrieval in networks of research publications.
  • The authors compare the performance of three citation-based approaches (direct citation, co-citation, and bibliographic coupling) against the PubMed Related Article score and combinations of these methods.
  • The study uses systematic reviews as a baseline and publication data from the NIH Open Citation Collection to evaluate the approaches.
  • The results suggest that combining citation-based and textual approaches can enhance the performance of seed-based information retrieval, with co-citation outperforming bibliographic coupling and direct citation.

Plain English Explanation

When searching for relevant research papers, it's often useful to start with a "seed" or example paper that is known to be relevant. Citation-based information retrieval aims to find other papers that are closely connected to this seed paper through citations and references.

In this study, the researchers compared different citation-based approaches to see which ones work best for this task. They looked at three main methods:

  1. Direct citation: Finding papers that directly cite the seed paper, or that the seed paper itself cites.
  2. Co-citation: Finding papers that are cited together with the seed paper by the same later papers.
  3. Bibliographic coupling: Finding papers that share a lot of the same references as the seed paper.

The researchers also included the PubMed Related Article score, a textual approach that scores similarity from the words in papers' titles, abstracts, and MeSH index terms. They tested these methods using data from the NIH Open Citation Collection, with systematic reviews as a benchmark.

The results showed that combining citation-based and textual approaches outperformed using citation-based methods alone. Among the citation-based approaches, co-citation performed the best, but combining all three citation-based methods worked even better.

These findings can help guide future work that combines citation data and text to improve information retrieval in research publication networks, in particular the choice of citation similarity measure.

Technical Explanation

The paper compares the performance of three citation-based approaches for seed-based information retrieval: direct citation, co-citation, and bibliographic coupling. These approaches leverage the connections between research papers based on citations and references to identify related publications.
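The three citation relations can be illustrated on a toy citation graph. The following is a minimal sketch, not the authors' implementation: the graph data and the function names are hypothetical, and it uses raw counts, whereas the paper may use normalized similarity measures.

```python
from collections import defaultdict

# Hypothetical toy data: each paper maps to the set of papers it cites.
cites = {
    "A": {"seed", "X"},
    "B": {"seed", "X", "Y"},
    "seed": {"R1", "R2"},
    "C": {"R1", "R2", "R3"},
    "D": {"R2"},
}

def direct_citation(seed, cites):
    """Papers that cite the seed, plus papers the seed cites."""
    citing = {paper for paper, refs in cites.items() if seed in refs}
    return citing | cites.get(seed, set())

def co_citation(seed, cites):
    """For each paper, count the papers that cite it together with the seed."""
    counts = defaultdict(int)
    for refs in cites.values():
        if seed in refs:
            for other in refs - {seed}:
                counts[other] += 1
    return dict(counts)

def bibliographic_coupling(seed, cites):
    """For each paper, count the references it shares with the seed."""
    seed_refs = cites.get(seed, set())
    return {
        paper: len(refs & seed_refs)
        for paper, refs in cites.items()
        if paper != seed and refs & seed_refs
    }

print(direct_citation("seed", cites))
print(co_citation("seed", cites))
print(bibliographic_coupling("seed", cites))
```

In this toy graph, "X" is co-cited with the seed twice (by "A" and "B"), while "C" is coupled to the seed through two shared references ("R1" and "R2").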

The authors use systematic reviews as a baseline and publication data from the NIH Open Citation Collection to evaluate the recall and precision of the three citation-based methods, as well as the PubMed Related Article score (a textual approach) and combinations of these approaches.
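Recall and precision for a ranked result list can be computed as in the sketch below. It assumes, as an illustration only, that the relevant set for a seed is the set of other publications included in the same systematic review; the cutoff `k`, the function name, and the sample identifiers are hypothetical and not taken from the paper.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall of the top-k retrieved papers.

    ranked   -- list of paper ids, best match first
    relevant -- set of paper ids treated as ground truth
    k        -- number of top results to evaluate
    """
    top_k = ranked[:k]
    hits = sum(1 for paper in top_k if paper in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 2 of the top 4 results are among 4 relevant papers.
ranked = ["p1", "p2", "p3", "p4", "p5"]
relevant = {"p1", "p3", "p9", "p10"}
print(precision_recall_at_k(ranked, relevant, 4))
```

Precision measures how much of what was retrieved is relevant; recall measures how much of what is relevant was retrieved. A method can score well on one and poorly on the other, which is why the comparison reports both.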

The results show that co-citation outperforms bibliographic coupling and direct citation, but the best citation-based performance is achieved by combining the three approaches. The findings also indicate that combining citation-based and textual approaches can further enhance the performance of seed-based information retrieval.
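This summary does not specify how the approaches are combined. One simple possibility, shown here purely as an assumption, is to scale each method's scores to a common range and sum them. The function names and sample scores below are hypothetical.

```python
def normalize(scores):
    """Scale scores so the best candidate of a method gets 1.0."""
    top = max(scores.values(), default=0)
    if top <= 0:
        return {}
    return {paper: score / top for paper, score in scores.items()}

def combine(*score_maps):
    """Sum normalized scores across methods and rank the candidates.

    Ties are broken alphabetically by paper id to keep output stable.
    """
    combined = {}
    for scores in score_maps:
        for paper, score in normalize(scores).items():
            combined[paper] = combined.get(paper, 0.0) + score
    return sorted(combined, key=lambda paper: (-combined[paper], paper))

# Hypothetical scores from three methods for candidates X, Y, Z.
co_citation_scores = {"X": 2, "Y": 1}
coupling_scores = {"Y": 3, "Z": 3}
text_scores = {"X": 0.4, "Z": 0.8}

print(combine(co_citation_scores, coupling_scores, text_scores))
```

Normalizing first keeps a method with large raw counts from dominating one whose scores lie between 0 and 1. Other schemes, such as weighted sums or rank fusion, would be equally plausible.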

The authors suggest that future research should use more structured approaches to evaluate methods for seed-based retrieval of publications, including comparative studies and the development of common data sets and baselines for evaluation.

Critical Analysis

The paper provides a comprehensive and systematic comparison of citation-based approaches for seed-based information retrieval, which is a valuable contribution to the field. The use of systematic reviews as a baseline and the NIH Open Citation Collection as the data source lend credibility to the findings.

However, the paper does not provide much detail on the specific implementation of the citation-based methods or the textual approach (PubMed Related Article score). Additionally, the authors do not discuss the potential limitations or biases in the data source, which could impact the generalizability of the results.

Further research could explore the performance of these approaches on other types of publication data and in real-world retrieval settings, and investigate the impact of factors like publication date, discipline, or citation network structure on the effectiveness of the methods.

Conclusion

This study offers valuable insights into the use of citation-based approaches for seed-based information retrieval in research publication networks. The findings suggest that combining citation-based and textual methods can enhance the performance of this task, with co-citation emerging as a particularly effective citation-based approach.

The results of this work can guide future research on leveraging citation data and textual information to improve information retrieval in scholarly communication. The authors' call for more structured evaluation approaches and the development of common benchmarks is a valuable recommendation for advancing the field.



Related Papers

Seed-based information retrieval in networks of research publications: Evaluation of direct citations, bibliographic coupling, co-citations and PubMed related article score

Peter Sjögårde, Per Ahlgren

In this contribution, we deal with seed-based information retrieval in networks of research publications. Using systematic reviews as a baseline, and publication data from the NIH Open Citation Collection, we compare the performance of the three citation-based approaches direct citation, co-citation, and bibliographic coupling with respect to recall and precision measures. In addition, we include the PubMed Related Article score as well as combined approaches in the comparison. We also provide a fairly comprehensive review of earlier research in which citation relations have been used for information retrieval purposes. The results show an advantage for co-citation over bibliographic coupling and direct citation. However, combining the three approaches outperforms the exclusive use of co-citation in the study. The results further indicate, in line with previous research, that combining citation-based approaches with textual approaches enhances the performance of seed-based information retrieval. The results from the study may guide approaches combining citation-based and textual approaches in their choice of citation similarity measures. We suggest that future research use more structured approaches to evaluate methods for seed-based retrieval of publications, including comparative approaches as well as the elaboration of common data sets and baselines for evaluation.

6/14/2024

Measuring publication relatedness using controlled vocabularies

Emil Dolmer Alnor

Measuring the relatedness between scientific publications has important applications in many areas of bibliometrics and science policy. Controlled vocabularies provide a promising basis for measuring relatedness because they address issues that arise when using citation or textual similarity to measure relatedness. While several controlled-vocabulary-based relatedness measures have been developed, there exists no comprehensive and direct test of their accuracy and suitability for different types of research questions. This paper reviews existing measures, develops a new measure, and benchmarks the measures using TREC Genomics data as a ground truth of topics. The benchmark tests show that the new measure and the measure proposed by Ahlgren et al. (2020) have differing strengths and weaknesses. These results inform a discussion of which method to choose when studying interdisciplinarity, information retrieval, clustering of science, and researcher topic switching.

8/28/2024

Judgement Citation Retrieval using Contextual Similarity

Akshat Mohan Dasula, Hrushitha Tigulla, Preethika Bhukya

Traditionally in the domain of legal research, the retrieval of pertinent citations from intricate case descriptions has demanded manual effort and keyword-based search applications that mandate expertise in understanding legal jargon. Legal case descriptions hold pivotal information for legal professionals and researchers, necessitating more efficient and automated approaches. We propose a methodology that combines natural language processing (NLP) and machine learning techniques to enhance the organization and utilization of legal case descriptions. This approach revolves around the creation of textual embeddings with the help of state-of-the-art embedding models. Our methodology addresses two primary objectives: unsupervised clustering and supervised citation retrieval, both designed to automate the citation extraction process. Although the proposed methodology can be used for any dataset, we employed the Supreme Court of The United States (SCOTUS) dataset, yielding remarkable results. Our methodology achieved an impressive accuracy rate of 90.9%. By automating labor-intensive processes, we pave the way for a more efficient, time-saving, and accessible landscape in legal research, benefiting legal professionals, academics, and researchers.

8/16/2024

Exploring the Nexus Between Retrievability and Query Generation Strategies

Aman Sinha, Priyanshu Raj Mall, Dwaipayan Roy

Quantifying bias in retrieval functions through document retrievability scores is vital for assessing recall-oriented retrieval systems. However, many studies investigating retrieval model bias lack validation of their query generation methods as accurate representations of retrievability for real users and their queries. This limitation results from the absence of established criteria for query generation in retrievability assessments. Typically, researchers resort to using frequent collocations from document corpora when no query log is available. In this study, we address the issue of reproducibility and seek to validate query generation methods by comparing retrievability scores generated from artificially generated queries to those derived from query logs. Our findings demonstrate a minimal or negligible correlation between retrievability scores from artificial queries and those from query logs. This suggests that artificially generated queries may not accurately reflect retrievability scores as derived from query logs. We further explore alternative query generation techniques, uncovering a variation that exhibits the highest correlation. This alternative approach holds promise for improving reproducibility when query logs are unavailable.

4/16/2024