A Temporal Psycholinguistics Approach to Identity Resolution of Social Media Users

Read original: arXiv:2407.19967 - Published 7/30/2024 by Md Touhidul Islam


Overview

Researchers propose an approach to match user profiles across social media platforms using post topics, sentiments, and timing.
They collected public posts from around 5,000 profiles on Disqus and Twitter, and analyzed the posts to link profiles across the two platforms.
Both temporal and non-temporal methods were explored; neither proved definitively superior, but the temporal approach generally performed better.
Sentiment analysis showed little impact, likely due to issues with data extraction.
A scoring model based on distance, rewards, and punishments achieved 24.2% accuracy.
Future work includes refining sentiment analysis, extending temporal analysis, and improving the scoring model.

Plain English Explanation

Matching User Profiles Across Social Media

Researchers wanted to find a way to automatically link someone's profiles across different social media platforms, like Disqus and Twitter. To do this, they looked at the topics, sentiments, and timing of the posts made by around 5,000 user profiles.

They tried two main approaches: one that considered the timing of posts, and one that did not. The timing-based approach generally worked a bit better. They found that the size of the time window they looked at mattered more than how much the time window shifted.

Interestingly, the sentiment analysis didn't seem to make much difference, likely because there were problems with how the sentiment data was collected.

They also tested a distance-based scoring model built around rewards and punishments. This model achieved around 24% accuracy in matching profiles.

For future work, the researchers want to improve the sentiment analysis by looking at sentiments for specific topics, expand the temporal analysis, and refine the scoring model.

Technical Explanation

The researchers' approach to identity resolution across social media platforms involved collecting public posts from around 5,000 user profiles on Disqus and Twitter. They then analyzed the topics, sentiments, and timing of these posts to try to match profiles across the two platforms.
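To make the non-temporal idea concrete, here is a minimal sketch of matching profiles by topic usage alone. It is an illustration under stated assumptions, not the thesis's actual pipeline: the topic labels, the normalized frequency vectors, the choice of cosine distance, and the names `topic_vector`, `cosine_distance`, and `best_match` are all hypothetical.

```python
import math
from collections import Counter

def topic_vector(post_topics, vocabulary):
    """Normalized topic-frequency vector for one profile (hypothetical feature choice)."""
    counts = Counter(post_topics)
    total = sum(counts.values()) or 1  # avoid division by zero for empty profiles
    return [counts[t] / total for t in vocabulary]

def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def best_match(query_vec, candidates):
    """Return the candidate id whose vector is closest to query_vec."""
    return min(candidates, key=lambda cid: cosine_distance(query_vec, candidates[cid]))

# Toy data: one Disqus profile and two candidate Twitter profiles.
VOCAB = ["politics", "sports", "tech"]
disqus_user = topic_vector(["politics", "politics", "tech"], VOCAB)
twitter_profiles = {
    "tw_a": topic_vector(["sports", "sports", "tech"], VOCAB),
    "tw_b": topic_vector(["politics", "tech", "politics", "politics"], VOCAB),
}
match = best_match(disqus_user, twitter_profiles)
```

In this toy case the Disqus profile is matched to `tw_b`, whose topic mix is closest. A real system would also need the sentiment and timing signals the summary describes.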

Both temporal and non-temporal methods were explored. The temporal approach generally performed better, with the size of the time window being more influential than the shifting amount. The sentiment analysis, on the other hand, showed little impact, likely due to issues with the data extraction process.
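The roles of window size and shifting amount can be sketched with a simple sliding-window bucketing of posts. This is a hedged illustration only: the summary does not say how the thesis defines its windows, so the half-open intervals, the integer timestamps (think of them as days), and the names `sliding_windows` and `windowed_topic_counts` are assumptions.

```python
from collections import Counter

def sliding_windows(start, end, window_size, shift):
    """Yield half-open (lo, hi) intervals of length window_size, advancing by shift."""
    if window_size <= 0 or shift <= 0:
        raise ValueError("window_size and shift must be positive")
    lo = start
    while lo < end:
        yield (lo, min(lo + window_size, end))  # clip the last window at end
        lo += shift

def windowed_topic_counts(posts, start, end, window_size, shift):
    """Count topics per window; posts is a list of (timestamp, topic) pairs."""
    return [
        Counter(topic for ts, topic in posts if lo <= ts < hi)
        for lo, hi in sliding_windows(start, end, window_size, shift)
    ]

# Toy timeline: timestamps are day offsets, topics are hypothetical labels.
posts = [(0, "tech"), (3, "politics"), (8, "tech"), (12, "sports")]

# Non-overlapping weekly windows (shift equals window size).
windows = list(sliding_windows(0, 14, window_size=7, shift=7))
counts = windowed_topic_counts(posts, 0, 14, window_size=7, shift=7)

# Overlapping windows (shift smaller than window size).
overlapping = list(sliding_windows(0, 10, window_size=7, shift=3))
```

Comparing two profiles window by window, rather than over their whole history, is one way a temporal method could separate users who discuss the same topics at different times. A larger `window_size` changes which posts are grouped together, while `shift` only changes how densely the timeline is sampled.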

The researchers also experimented with a distance-based scoring model that used rewards and punishments. This model achieved an accuracy of 24.2% and an average rank of 158.2 out of 2,525 in the collected corpus.
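The two reported figures, accuracy and average rank, can be illustrated with a small evaluation sketch. It assumes that accuracy means the true profile is ranked first and that rank is the 1-based position of the true profile when candidates are sorted by ascending distance. The summary does not confirm either definition, and the distances, profile ids, and function names below are made up for illustration.

```python
def rank_of_true_match(distances, true_id):
    """1-based rank of true_id when candidates are sorted by ascending distance."""
    ordered = sorted(distances, key=distances.get)
    return ordered.index(true_id) + 1

def evaluate(all_distances, ground_truth):
    """Return (top-1 accuracy, average rank) over all query profiles."""
    ranks = [
        rank_of_true_match(all_distances[query], ground_truth[query])
        for query in all_distances
    ]
    accuracy = sum(r == 1 for r in ranks) / len(ranks)
    average_rank = sum(ranks) / len(ranks)
    return accuracy, average_rank

# Toy data: distances from each Disqus profile to each Twitter candidate.
all_distances = {
    "dq_1": {"tw_a": 0.1, "tw_b": 0.5, "tw_c": 0.9},
    "dq_2": {"tw_a": 0.4, "tw_b": 0.7, "tw_c": 0.2},
}
ground_truth = {"dq_1": "tw_a", "dq_2": "tw_a"}

accuracy, average_rank = evaluate(all_distances, ground_truth)
```

Here the first query ranks its true match first and the second ranks it second, giving an accuracy of 0.5 and an average rank of 1.5. Under the same reading, an average rank of 158.2 out of 2,525 would mean the true profile typically lands well inside the top tenth of candidates even when it is not ranked first.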

Critical Analysis

The researchers acknowledge several limitations and areas for future work in their paper. The sentiment analysis, for example, was hampered by flaws in the data extraction methods, suggesting the need for more robust sentiment evaluation, potentially by looking at sentiments on a per-topic basis.

Additionally, the temporal analysis could be expanded with additional phases to further improve performance. The scoring model also has room for refinement, such as through weight adjustments and modified reward structures.

It would also be valuable to see the researchers address any potential biases or privacy concerns that may arise from this type of identity resolution across social media platforms. As these techniques become more advanced, it will be crucial to carefully consider the ethical implications and potential for misuse.

Conclusion

This research presents an approach to matching user profiles across social media platforms by analyzing the topics, sentiments, and timing of their posts. The temporal methods generally outperformed the non-temporal ones, but with the scoring model reaching 24.2% accuracy, there is considerable room to improve the techniques, especially in sentiment analysis and scoring model refinement.

As this field of research continues to evolve, it will be important to carefully consider the broader implications and ensure that these technologies are developed and used responsibly, with a focus on protecting user privacy and preventing potential misuse. Overall, this work represents a valuable contribution to the ongoing efforts to better understand and leverage social media data.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


A Temporal Psycholinguistics Approach to Identity Resolution of Social Media Users

Md Touhidul Islam

In this thesis, we propose an approach to identity resolution across social media platforms using the topics, sentiments, and timings of the posts on the platforms. After collecting the public posts of around 5000 profiles from Disqus and Twitter, we analyze their posts to match their profiles across the two platforms. We pursue both temporal and non-temporal methods in our analysis. While neither approach proves definitively superior, the temporal approach generally performs better. We found that the temporal window size influences results more than the shifting amount. On the other hand, our sentiment analysis shows that the inclusion of sentiment makes little difference, probably due to flawed data extraction methods. We also experimented with a distance-based reward-and-punishment-focused scoring model, which achieved an accuracy of 24.198% and an average rank of 158.217 out of 2525 in our collected corpus. Future work includes refining sentiment analysis by evaluating sentiments per topic, extending temporal analysis with additional phases, and improving the scoring model through weight adjustments and modified rewards.

7/30/2024


A Systematic Analysis on the Temporal Generalization of Language Models in Social Media

Asahi Ushio, Jose Camacho-Collados

In machine learning, temporal shifts occur when there are differences between training and test splits in terms of time. For streaming data such as news or social media, models are commonly trained on a fixed corpus from a certain period of time, and they can become obsolete due to the dynamism and evolving nature of online content. This paper focuses on temporal shifts in social media and, in particular, Twitter. We propose a unified evaluation scheme to assess the performance of language models (LMs) under temporal shift on standard social media tasks. LMs are tested on five diverse social media NLP tasks under different temporal settings, which revealed two important findings: (i) the decrease in performance under temporal shift is consistent across different models for entity-focused tasks such as named entity recognition or disambiguation, and hate speech detection, but not significant in the other tasks analysed (i.e., topic and sentiment classification); and (ii) continuous pre-training on the test period does not improve the temporal adaptability of LMs.

5/24/2024

Evaluating Short-Term Temporal Fluctuations of Social Biases in Social Media Data and Masked Language Models

Yi Zhou, Danushka Bollegala, Jose Camacho-Collados

Social biases such as gender or racial biases have been reported in language models (LMs), including Masked Language Models (MLMs). Given that MLMs are continuously trained with increasing amounts of additional data collected over time, an important yet unanswered question is how the social biases encoded with MLMs vary over time. In particular, the number of social media users continues to grow at an exponential rate, and it is a valid concern for the MLMs trained specifically on social media data whether their social biases (if any) would also amplify over time. To empirically analyse this problem, we use a series of MLMs pretrained on chronologically ordered temporal snapshots of corpora. Our analysis reveals that, although social biases are present in all MLMs, most types of social bias remain relatively stable over time (with a few exceptions). To further understand the mechanisms that influence social biases in MLMs, we analyse the temporal corpora used to train the MLMs. Our findings show that some demographic groups, such as male, obtain higher preference over the other, such as female on the training corpora constantly.

6/21/2024

Mental Disorder Classification via Temporal Representation of Text

Raja Kumar, Kishan Maharaj, Ashita Saxena, Pushpak Bhattacharyya

Mental disorders pose a global challenge, aggravated by the shortage of qualified mental health professionals. Mental disorder prediction from social media posts by current LLMs is challenging due to the complexities of sequential text data and the limited context length of language models. Current language model-based approaches split a single data instance into multiple chunks to compensate for limited context size. The predictive model is then applied to each chunk individually, and the most voted output is selected as the final prediction. This results in the loss of inter-post dependencies and important time variant information, leading to poor performance. We propose a novel framework which first compresses the large sequence of chronologically ordered social media posts into a series of numbers. We then use this time variant representation for mental disorder classification. We demonstrate the generalization capabilities of our framework by outperforming the current SOTA in three different mental conditions: depression, self-harm, and anorexia, with an absolute improvement of 5% in the F1 score. We investigate the situation where current data instances fall within the context length of language models and present empirical results highlighting the importance of temporal properties of textual data. Furthermore, we utilize the proposed framework for a cross-domain study, exploring commonalities across disorders and the possibility of inter-domain data usage.

6/26/2024