Text mining arXiv: a look through quantitative finance papers

Read original: arXiv:2401.01751 - Published 4/8/2024 by Michele Leonardo Bianchi

Text mining arXiv: a look through quantitative finance papers

Introduction

This paper presents an analysis of trends in financial topics discussed in news articles over time. The researchers aimed to automatically detect relevant information, predictions, and forecasts in financial news to better understand the dynamics of financial markets.

Data description

The researchers collected a dataset of over 1.5 million news articles from various financial news sources, spanning a 10-year period. The articles covered a wide range of financial topics, including stocks, bonds, commodities, and macroeconomic trends.

Text preprocessing

The researchers used natural language processing techniques to preprocess the text data, including tokenization, stopword removal, and lemmatization. This allowed them to extract the key concepts and themes discussed in the articles.

Topics trend

Algorithms performance on topic detection

The researchers tested various topic modeling algorithms, such as Latent Dirichlet Allocation (LDA) and BERTopic, to identify the main topics discussed in the financial news articles. They evaluated the performance of these algorithms in detecting temporality and sentiment at the discourse level, as described in this related work.

The researchers also explored the use of ChatGPT for sentiment analysis of the news articles, building on previous research in this area.

Critical Analysis

The paper provides a comprehensive analysis of trends in financial topics, but it is important to note that the findings may be limited by the specific dataset used and the algorithms employed. Additionally, the researchers acknowledge that further research is needed to fully understand the complex dynamics of financial markets and the role of news media in shaping investor sentiment and decision-making.

Conclusion

Overall, this research contributes to a better understanding of the evolving landscape of financial news and its potential impact on financial markets. The techniques and insights presented in the paper could be valuable for investors, policymakers, and researchers interested in monitoring and analyzing financial trends.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Text mining arXiv: a look through quantitative finance papers

Michele Leonardo Bianchi

This paper explores articles hosted on the arXiv preprint server with the aim to uncover valuable insights hidden in this vast collection of research. Employing text mining techniques and through the application of natural language processing methods, we examine the contents of quantitative finance papers posted in arXiv from 1997 to 2022. We extract and analyze crucial information from the entire documents, including the references, to understand the topics trends over time and to find out the most cited researchers and journals on this domain. Additionally, we compare numerous algorithms to perform topic modeling, including state-of-the-art approaches.

4/8/2024

Automated Text Mining of Experimental Methodologies from Biomedical Literature

Ziqing Guo

Biomedical literature is a rapidly expanding field of science and technology. Classification of biomedical texts is an essential part of biomedicine research, especially in the field of biology. This work proposes the fine-tuned DistilBERT, a methodology-specific, pre-trained generative classification language model for mining biomedicine texts. The model has proven its effectiveness in linguistic understanding capabilities and has reduced the size of BERT models by 40% but by 60% faster. The main objective of this project is to improve the model and assess the performance of the model compared to the non-fine-tuned model. We used DistilBert as a support model and pre-trained on a corpus of 32,000 abstracts and complete text articles; our results were impressive and surpassed those of traditional literature classification methods by using RNN or LSTM. Our aim is to integrate this highly specialised and specific model into different research industries.

4/23/2024

🔎

Automatic detection of relevant information, predictions and forecasts in financial news through topic modelling with Latent Dirichlet Allocation

Silvia Garc'ia-M'endez, Francisco de Arriba-P'erez, Ana Barros-Vila, Francisco J. Gonz'alez-Casta~no, Enrique Costa-Montenegro

Financial news items are unstructured sources of information that can be mined to extract knowledge for market screening applications. Manual extraction of relevant information from the continuous stream of finance-related news is cumbersome and beyond the skills of many investors, who, at most, can follow a few sources and authors. Accordingly, we focus on the analysis of financial news to identify relevant text and, within that text, forecasts and predictions. We propose a novel Natural Language Processing (NLP) system to assist investors in the detection of relevant financial events in unstructured textual sources by considering both relevance and temporality at the discursive level. Firstly, we segment the text to group together closely related text. Secondly, we apply co-reference resolution to discover internal dependencies within segments. Finally, we perform relevant topic modelling with Latent Dirichlet Allocation (LDA) to separate relevant from less relevant text and then analyse the relevant text using a Machine Learning-oriented temporal approach to identify predictions and speculative statements. We created an experimental data set composed of 2,158 financial news items that were manually labelled by NLP researchers to evaluate our solution. The ROUGE-L values for the identification of relevant text and predictions/forecasts were 0.662 and 0.982, respectively. To our knowledge, this is the first work to jointly consider relevance and temporality at the discursive level. It contributes to the transfer of human associative discourse capabilities to expert systems through the combination of multi-paragraph topic segmentation and co-reference resolution to separate author expression patterns, topic modelling with LDA to detect relevant text, and discursive temporality analysis to identify forecasts and predictions within this text.

4/3/2024

💬

Topics, Authors, and Institutions in Large Language Model Research: Trends from 17K arXiv Papers

Rajiv Movva, Sidhika Balachandar, Kenny Peng, Gabriel Agostini, Nikhil Garg, Emma Pierson

Large language models (LLMs) are dramatically influencing AI research, spurring discussions on what has changed so far and how to shape the field's future. To clarify such questions, we analyze a new dataset of 16,979 LLM-related arXiv papers, focusing on recent trends in 2023 vs. 2018-2022. First, we study disciplinary shifts: LLM research increasingly considers societal impacts, evidenced by 20x growth in LLM submissions to the Computers and Society sub-arXiv. An influx of new authors -- half of all first authors in 2023 -- are entering from non-NLP fields of CS, driving disciplinary expansion. Second, we study industry and academic publishing trends. Surprisingly, industry accounts for a smaller publication share in 2023, largely due to reduced output from Google and other Big Tech companies; universities in Asia are publishing more. Third, we study institutional collaboration: while industry-academic collaborations are common, they tend to focus on the same topics that industry focuses on rather than bridging differences. The most prolific institutions are all US- or China-based, but there is very little cross-country collaboration. We discuss implications around (1) how to support the influx of new authors, (2) how industry trends may affect academics, and (3) possible effects of (the lack of) collaboration.

4/30/2024