Metadata Integration for Spam Reviews Detection on Vietnamese E-commerce Websites

Read original: arXiv:2405.13292 - Published 8/2/2024 by Co Van Dinh, Son T. Luu

Overview

Researchers developed a dataset called ViSpamReviews v2 that includes metadata of reviews to help classify spam reviews on Vietnamese e-commerce websites.
They proposed a novel approach to integrate both textual and categorical attributes into the classification model.
Experiments showed that the product category feature was effective when combined with deep neural network (DNN) models, while text features performed well on both DNN models and the PhoBERT model.
The PhoBERT model achieved state-of-the-art performance when combined with product description features generated from the SPhoBert model.

Plain English Explanation

Detecting spam reviews on e-commerce websites is an important problem, as these fake reviews can mislead consumers. In this work, the researchers created a new dataset called ViSpamReviews v2 that includes not only the text of the reviews, but also additional information like the product category.

The researchers then developed a new approach that combines the text of the reviews with additional metadata, such as the product category, to improve the accuracy of spam review detection. They found that the product category information was particularly helpful when used with deep neural network models, while the text features performed well with both the deep neural network models and PhoBERT.

The best-performing model was PhoBERT, a language model pre-trained on a large amount of Vietnamese text. When PhoBERT was combined with product description features generated by another model called SPhoBert, it achieved the best results in detecting spam reviews on Vietnamese e-commerce websites.

Overall, this research shows how incorporating additional metadata, like product information, can help improve the accuracy of detecting spam and fake reviews online, which is an important problem for maintaining trust in e-commerce platforms.

Technical Explanation

The researchers introduced a new dataset called ViSpamReviews v2 that includes metadata of reviews, such as the product category, in addition to the review text. This was done to provide supplementary attributes to help classify spam reviews more accurately.

They proposed a novel approach that integrates both textual and categorical attributes into the classification model. In their experiments, they found that the product category feature was effective when combined with deep neural network (DNN) models, while text features performed well on both DNN models and the PhoBERT model.
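This summary does not spell out how the textual and categorical attributes are joined, so the following is only a minimal sketch of one common baseline: one-hot encoding the product category and concatenating it with a text vector. The function names, the category names, and the toy text vector are all hypothetical, and this should not be read as the authors' actual architecture.

```python
from typing import Dict, List, Sequence


def build_category_index(categories: Sequence[str]) -> Dict[str, int]:
    """Map each distinct category name to a column index (sorted for determinism)."""
    return {name: i for i, name in enumerate(sorted(set(categories)))}


def one_hot(category: str, index: Dict[str, int]) -> List[float]:
    """Return a one-hot vector; categories not in the index map to all zeros."""
    vec = [0.0] * len(index)
    if category in index:
        vec[index[category]] = 1.0
    return vec


def fuse(text_vector: Sequence[float], category: str, index: Dict[str, int]) -> List[float]:
    """Concatenate a text representation with the one-hot category vector."""
    return list(text_vector) + one_hot(category, index)


# Hypothetical category names, not taken from ViSpamReviews v2.
index = build_category_index(["Fashion", "Electronics", "Beauty"])

# Stand-in for a text encoder output; a real model would produce a much longer vector.
text_vec = [0.12, -0.40, 0.88]

# The fused vector is what a downstream classifier would receive.
fused = fuse(text_vec, "Electronics", index)
print(fused)
```

In a neural model the one-hot vector would more typically be replaced by a learned embedding, but the idea of feeding the classifier a single joint representation is the same.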

The PhoBERT model, which is a BERT-based language model trained on Vietnamese text, achieved the best results when combined with product description features generated from the SPhoBert model (a combination of PhoBERT and SentenceBERT). Measured by the macro-averaged F1 score, the task of classifying spam reviews reached 87.22% (an increase of 1.64% compared to the baseline), while the task of identifying the type of spam reviews reached 73.49% (an increase of 1.93% compared to the baseline).
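For readers unfamiliar with the metric, the sketch below computes a macro-averaged F1 score from scratch: F1 is calculated per class and then averaged with equal weight, so a rarely occurring class counts as much as a common one. The labels and predictions are toy data invented for illustration and are unrelated to the paper's results.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced data: 8 genuine reviews and 2 spam reviews.
y_true = ["genuine"] * 8 + ["spam"] * 2

# A degenerate classifier that always predicts the majority class.
y_pred = ["genuine"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)

print(f"accuracy = {accuracy:.3f}")
print(f"macro F1 = {score:.3f}")
```

Here the majority-class predictor looks reasonable on accuracy but scores poorly on macro F1, because it never identifies a single spam review. This is why macro F1 is a common choice when classes are unevenly represented.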

Critical Analysis

The researchers acknowledged that their dataset, ViSpamReviews v2, is focused on reviews from Vietnamese e-commerce websites, which may limit the generalizability of their findings to other languages and contexts. Additionally, they noted that the dataset is imbalanced, with more genuine reviews than spam reviews, which could affect the model's performance.

While the researchers showed that incorporating metadata, such as product category, can improve spam review detection, it's unclear how well this approach would scale to larger and more diverse review datasets. There may be other types of metadata or contextual information that could be even more useful for this task.

Furthermore, the researchers did not provide much insight into the types of spam reviews that their models were able to detect more effectively with the additional metadata. A more detailed analysis of the model's strengths and weaknesses would be helpful for understanding the practical implications of this research.

Overall, this work represents a promising step towards improving the detection of spam and fake reviews on e-commerce platforms, but further research is needed to fully understand the broader applicability and limitations of this approach.

Conclusion

This research introduces a new dataset, ViSpamReviews v2, and a novel approach for using both textual and categorical attributes to improve the detection of spam reviews on Vietnamese e-commerce websites. The key findings show that incorporating metadata, such as product category, can enhance the performance of deep learning models for this task.

The state-of-the-art results achieved by the PhoBERT model, when combined with features generated from the SPhoBert model, highlight the potential of leveraging pre-trained language models and supplementary data sources to tackle the problem of spam and fake reviews.

While the dataset and findings are specific to the Vietnamese market, this work demonstrates the value of considering additional context and metadata in AI-powered review analysis systems. As e-commerce continues to grow, addressing the challenge of spam reviews will be increasingly important for maintaining trust and transparency in online marketplaces.

Related Papers

Metadata Integration for Spam Reviews Detection on Vietnamese E-commerce Websites

Co Van Dinh, Son T. Luu

The problem of detecting spam reviews (opinions) has received significant attention in recent years, especially with the rapid development of e-commerce. Spam reviews are often classified based on comment content, but in some cases, it is insufficient for models to accurately determine the review label. In this work, we introduce the ViSpamReviews v2 dataset, which includes metadata of reviews with the objective of integrating supplementary attributes for spam review classification. We propose a novel approach to simultaneously integrate both textual and categorical attributes into the classification model. In our experiments, the product category proved effective when combined with deep neural network (DNN) models, while text features performed well on both DNN models and the model achieved state-of-the-art performance in the problem of detecting spam reviews on Vietnamese e-commerce websites, namely PhoBERT. Specifically, the PhoBERT model achieves the highest accuracy when combined with product description features generated from the SPhoBert model, which is the combination of PhoBERT and SentenceBERT. Using the macro-averaged F1 score, the task of classifying spam reviews achieved 87.22% (an increase of 1.64% compared to the baseline), while the task of identifying the type of spam reviews achieved an accuracy of 73.49% (an increase of 1.93% compared to the baseline).

8/2/2024

Online detection and infographic explanation of spam reviews with data drift adaptation

Francisco de Arriba-Pérez, Silvia García-Méndez, Fátima Leal, Benedita Malheiro, J. C. Burguillo

Spam reviews are a pervasive problem on online platforms due to their significant impact on reputation. However, research into spam detection in data streams is scarce. Another concern lies in their need for transparency. Consequently, this paper addresses those problems by proposing an online solution for identifying and explaining spam reviews, incorporating data drift adaptation. It integrates (i) incremental profiling, (ii) data drift detection & adaptation, and (iii) identification of spam reviews employing Machine Learning. The explainable mechanism displays a visual and textual prediction explanation in a dashboard. The best results obtained reached up to 87% spam F-measure.

6/24/2024

Vietnamese AI Generated Text Detection

Quang-Dan Tran, Van-Quan Nguyen, Quang-Huy Pham, K. B. Thang Nguyen, Trong-Hop Do

In recent years, Large Language Models (LLMs) have become integrated into our daily lives, serving as invaluable assistants in completing tasks. Widely embraced by users, the abuse of LLMs is inevitable, particularly in using them to generate text content for various purposes, leading to difficulties in distinguishing between text generated by LLMs and that written by humans. In this study, we present a dataset named ViDetect, comprising 6,800 samples of Vietnamese essays, with 3,400 samples authored by humans and the remainder generated by LLMs, serving the purpose of detecting text generated by AI. We conducted evaluations using state-of-the-art methods, including ViT5, BartPho, PhoBERT, mDeberta V3, and mBERT. These results contribute not only to the growing body of research on detecting text generated by AI but also demonstrate the adaptability and effectiveness of different methods in the Vietnamese language context. This research lays the foundation for future advancements in AI-generated text detection and provides valuable insights for researchers in the field of natural language processing.

5/7/2024

Exploiting Hatred by Targets for Hate Speech Detection on Vietnamese Social Media Texts

Cuong Nhat Vo, Khanh Bao Huynh, Son T. Luu, Trong-Hop Do

The growth of social networks makes toxic content spread rapidly. Hate speech detection is a task to help decrease the number of harmful comments. With the diversity in the hate speech created by users, it is necessary to interpret the hate speech besides detecting it. Hence, we propose a methodology to construct a system for targeted hate speech detection from online streaming texts from social media. We first introduce the ViTHSD - a targeted hate speech detection dataset for Vietnamese Social Media Texts. The dataset contains 10K comments, each comment is labeled to specific targets with three levels: clean, offensive, and hate. There are 5 targets in the dataset, and each target is labeled with the corresponding level manually by humans with strict annotation guidelines. The inter-annotator agreement obtained from the dataset is 0.45 by Cohen's Kappa index, which is indicated as a moderate level. Then, we construct a baseline for this task by combining the Bi-GRU-LSTM-CNN with the pre-trained language model to leverage the power of text representation of BERTology. Finally, we suggest a methodology to integrate the baseline model for targeted hate speech detection into the online streaming system for practical application in preventing hateful and offensive content on social media.

5/1/2024