Word Embedding for Social Sciences: An Interdisciplinary Survey

Read original: arXiv:2207.03086 - Published 6/18/2024 by Akira Matsui, Emilio Ferrara

📊

Overview

Researchers are developing machine learning models to extract insights from complex data by learning low-dimensional representation modes.
These advancements have benefited not only computer scientists, but also social scientists studying human behavior and social phenomena.
However, this emerging trend is not well documented, as different social science fields rarely cover each other's work, resulting in fragmented knowledge.
To address this, the paper surveys recent studies that apply word embedding techniques to human behavior mining.
The survey builds a taxonomy to illustrate the methods and procedures used in the surveyed papers, aiding social science researchers in contextualizing their research.
The survey also conducts a simple experiment to warn that common similarity measurements used in the literature could yield different results, even if they return consistent results at an aggregate level.

Plain English Explanation

Researchers are using advanced machine learning techniques, such as word embeddings, to extract valuable insights from complex data related to human behavior and social phenomena. These techniques allow them to uncover hidden patterns and relationships in large, messy datasets.

This is beneficial for both computer scientists and social scientists, as it enables them to gain a better understanding of how people think, interact, and behave. However, the different fields of social science rarely share their findings with each other, so there is a lack of cohesion in the research landscape.

To address this, the researchers in this paper have surveyed a wide range of studies that have used word embedding techniques to analyze human behavior. They have created a framework or "taxonomy" to help organize and contextualize these different research efforts, making it easier for social scientists to see how their work fits into the broader picture.

The researchers also conducted a simple experiment to highlight a potential issue with the way that similarity is often measured in this type of research. They found that even if the overall results seem consistent, the specific methods used can sometimes lead to different conclusions. This is an important consideration for researchers who want to ensure the reliability and robustness of their findings.

Technical Explanation

The paper surveys recent studies that apply word embedding techniques to the mining of human behavior data. Word embedding is a machine learning technique that represents words as numerical vectors, capturing the semantic relationships between them.

The researchers built a taxonomy to illustrate the methods and procedures used in the surveyed papers. This taxonomy includes categories such as the type of data used (e.g., text, social media, or online reviews), the specific word embedding models employed (e.g., Word2Vec, GloVe, or FastText), and the downstream tasks or applications (e.g., sentiment analysis, user profiling, or community detection).

The survey also includes a simple experiment that explores the use of different similarity measures, such as cosine similarity and Pearson correlation, in the context of word embeddings. The researchers found that while these measures often return consistent results at an aggregate level, they can yield different conclusions when applied to specific cases. This highlights the importance of carefully considering the choice of similarity metric when working with word embeddings and other types of representation learning models.

Critical Analysis

The paper makes a valuable contribution by documenting the emerging trend of applying word embedding techniques to the analysis of human behavior and social phenomena. By building a comprehensive taxonomy, the researchers provide a useful framework for social science researchers to contextualize their work and identify potential areas for collaboration or cross-pollination of ideas.

However, the paper does not delve deeply into the limitations or potential biases inherent in these techniques. For example, word embeddings can reflect and amplify societal biases present in the training data, which could potentially skew the insights derived from their application to social science research.

Additionally, the survey focuses on a relatively narrow set of techniques (i.e., word embeddings) and does not consider other representation learning approaches, such as graph embeddings or language models, which may offer complementary or alternative perspectives on human behavior and social phenomena.

Future research in this area could explore the integration of multiple representation learning techniques, as well as the development of more robust and interpretable methods for measuring similarity and extracting insights from complex social data.

Conclusion

This paper provides a valuable survey of the emerging trend of applying word embedding techniques to the analysis of human behavior and social phenomena. By building a comprehensive taxonomy and highlighting the potential pitfalls of common similarity measures, the researchers have laid the groundwork for a more cohesive and rigorous approach to this interdisciplinary field of research.

As machine learning and data science continue to evolve, the integration of these techniques with the social sciences holds great promise for enhancing our understanding of human behavior and social dynamics. This survey serves as an important step in bridging the gap between computer science and the social sciences, paving the way for more fruitful collaboration and cross-pollination of ideas.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

📊

Word Embedding for Social Sciences: An Interdisciplinary Survey

Akira Matsui, Emilio Ferrara

To extract essential information from complex data, computer scientists have been developing machine learning models that learn low-dimensional representation mode. From such advances in machine learning research, not only computer scientists but also social scientists have benefited and advanced their research because human behavior or social phenomena lies in complex data. However, this emerging trend is not well documented because different social science fields rarely cover each other's work, resulting in fragmented knowledge in the literature. To document this emerging trend, we survey recent studies that apply word embedding techniques to human behavior mining. We built a taxonomy to illustrate the methods and procedures used in the surveyed papers, aiding social science researchers in contextualizing their research within the literature on word embedding applications. This survey also conducts a simple experiment to warn that common similarity measurements used in the literature could yield different results even if they return consistent results at an aggregate level.

6/18/2024

Ontology Embedding: A Survey of Methods, Applications and Resources

Jiaoyan Chen, Olga Mashkova, Fernando Zhapa-Camacho, Robert Hoehndorf, Yuan He, Ian Horrocks

Ontologies are widely used for representing domain knowledge and meta data, playing an increasingly important role in Information Systems, the Semantic Web, Bioinformatics and many other domains. However, logical reasoning that ontologies can directly support are quite limited in learning, approximation and prediction. One straightforward solution is to integrate statistical analysis and machine learning. To this end, automatically learning vector representation for knowledge of an ontology i.e., ontology embedding has been widely investigated in recent years. Numerous papers have been published on ontology embedding, but a lack of systematic reviews hinders researchers from gaining a comprehensive understanding of this field. To bridge this gap, we write this survey paper, which first introduces different kinds of semantics of ontologies, and formally defines ontology embedding from the perspectives of both mathematics and machine learning, as well as its property of faithfulness. Based on this, it systematically categorises and analyses a relatively complete set of over 80 papers, according to the ontologies and semantics that they aim at, and their technical solutions including geometric modeling, sequence modeling and graph propagation. This survey also introduces the applications of ontology embedding in ontology engineering, machine learning augmentation and life sciences, presents a new library mOWL, and discusses the challenges and future directions.

6/18/2024

GuideWalk -- Heterogeneous Data Fusion for Enhanced Learning -- A Multiclass Document Classification Case

Sarmad N. Mohammed, Semra Gunduc{c}

One of the prime problems of computer science and machine learning is to extract information efficiently from large-scale, heterogeneous data. Text data, with its syntax, semantics, and even hidden information content, possesses an exceptional place among the data types in concern. The processing of the text data requires embedding, a method of translating the content of the text to numeric vectors. A correct embedding algorithm is the starting point for obtaining the full information content of the text data. In this work, a new text embedding approach, namely the Guided Transition Probability Matrix (GTPM) model is proposed. The model uses the graph structure of sentences to capture different types of information from text data, such as syntactic, semantic, and hidden content. Using random walks on a weighted word graph, GTPM calculates transition probabilities to derive text embedding vectors. The proposed method is tested with real-world data sets and eight well-known and successful embedding algorithms. GTPM shows significantly better classification performance for binary and multi-class datasets than well-known algorithms. Additionally, the proposed method demonstrates superior robustness, maintaining performance with limited (only $10%$) training data, showing an $8%$ decline compared to $15-20%$ for baseline methods.

9/10/2024

The Echoes of the 'I': Tracing Identity with Demographically Enhanced Word Embeddings

Ivan Smirnov

Identity is one of the most commonly studied constructs in social science. However, despite extensive theoretical work on identity, there remains a need for additional empirical data to validate and refine existing theories. This paper introduces a novel approach to studying identity by enhancing word embeddings with socio-demographic information. As a proof of concept, we demonstrate that our approach successfully reproduces and extends established findings regarding gendered self-views. Our methodology can be applied in a wide variety of settings, allowing researchers to tap into a vast pool of naturally occurring data, such as social media posts. Unlike similar methods already introduced in computer science, our approach allows for the study of differences between social groups. This could be particularly appealing to social scientists and may encourage the faster adoption of computational methods in the field.

7/2/2024