Model-based Subsampling for Knowledge Graph Completion

Read original: arXiv:2309.09296 - Published 4/15/2024 by Xincan Feng, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe

Overview

Subsampling is an effective technique in Knowledge Graph Embedding (KGE) to reduce overfitting caused by the sparsity of Knowledge Graph (KG) datasets.
Current subsampling approaches only consider the frequencies of queries consisting of entities and their relations, potentially underestimating the appearance probabilities of infrequent queries with high-frequency entities or relations.
The paper proposes two new subsampling methods, Model-based Subsampling (MBS) and Mixed Subsampling (MIX), to better estimate the appearance probabilities of queries using KGE model predictions.
Evaluation on popular KGE models, including RotatE, TransE, HAKE, ComplEx, and DistMult, showed improved KG completion performance using the proposed subsampling methods.

Plain English Explanation

Knowledge graphs are structured databases that represent information as a network of interconnected entities and their relationships. When training models to understand and predict these knowledge graphs, researchers often encounter a problem called "sparsity," where the data is spread out and incomplete. This makes it challenging for the models to learn effectively.

To address this, the researchers in this paper explored a technique called "subsampling." In knowledge graph embedding, subsampling adjusts how much each training example counts, typically giving less weight to patterns that appear very often so that they do not dominate training. However, current subsampling approaches only count how often each query (an entity paired with a relation) appears in the training data, so they can underestimate a rare query even when its entity or relation is common.

To improve upon this, the researchers proposed two new subsampling methods: Model-based Subsampling (MBS) and Mixed Subsampling (MIX). These methods use the predictions of the knowledge graph embedding models themselves to better estimate the appearance probabilities of different queries, including the less common ones.

The researchers tested these new subsampling methods on several popular knowledge graph embedding models and found that they improved the models' performance at filling in missing information in the knowledge graphs. This suggests that using the model's own predictions to guide the subsampling process is an effective way to overcome the challenges posed by sparse knowledge graph data.

Technical Explanation

The paper addresses the problem of subsampling in Knowledge Graph Embedding (KGE), a technique used to represent entities and relations in a knowledge graph as vectors. Subsampling is an effective way to reduce overfitting caused by the sparsity of KG datasets, but current approaches only consider the frequencies of queries consisting of entities and relations.
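To make the frequency-based baseline concrete, here is a minimal sketch of count-based subsampling weights. The toy triples, the helper names, and the inverse-square-root weighting (in the style of word2vec subsampling) are assumptions chosen for illustration, not formulas taken from the paper.

```python
from collections import Counter
from math import sqrt

# Toy training triples (head, relation, tail); all names are made up.
triples = [
    ("tokyo", "capital_of", "japan"),
    ("tokyo", "located_in", "japan"),
    ("tokyo", "located_in", "asia"),
    ("kyoto", "located_in", "japan"),
    ("osaka", "located_in", "japan"),
]

def query_counts(triples):
    """Count how often each query (h, r, ?) and (?, r, t) occurs."""
    counts = Counter()
    for h, r, t in triples:
        counts[("head", h, r)] += 1  # given head and relation, predict tail
        counts[("tail", r, t)] += 1  # given relation and tail, predict head
    return counts

def frequency_weight(triple, counts):
    """Inverse-square-root weight from the two query counts (assumed form)."""
    h, r, t = triple
    freq = counts[("head", h, r)] + counts[("tail", r, t)]
    return 1.0 / sqrt(freq)

counts = query_counts(triples)
weights = {tr: frequency_weight(tr, counts) for tr in triples}

for tr, w in weights.items():
    print(tr, round(w, 3))
```

In this toy data the query `("tokyo", "capital_of")` has a count of 1 even though `tokyo` appears in three triples. A purely count-based estimate sees only the query count, which is the limitation the proposed methods target.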

The authors propose two new subsampling methods to better estimate the appearance probabilities of queries:

Model-based Subsampling (MBS): This method uses the predictions of a KGE model to estimate the appearance probabilities of queries. The model is first trained on the full dataset, then used to predict the probabilities of different queries. These predicted probabilities are then used to guide the subsampling process.
Mixed Subsampling (MIX): This method combines the frequency-based subsampling approach with the model-based approach from MBS. It aims to leverage the strengths of both methods to better capture the importance of queries.
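The two methods above can be sketched roughly as follows. This is a toy illustration under stated assumptions: the model scores are hard-coded stand-ins for the output of a trained KGE model, and the softmax normalization, the `temperature` parameter, the mixing weight `lam`, and the inverse-square-root weighting are hypothetical choices rather than the paper's exact formulation.

```python
from math import exp, sqrt

def softmax(scores, temperature=1.0):
    """Turn raw model scores into a probability distribution over queries."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {q: exp((s - peak) / temperature) for q, s in scores.items()}
    total = sum(exps.values())
    return {q: e / total for q, e in exps.items()}

def normalize(counts):
    """Turn raw query counts into relative frequencies."""
    total = sum(counts.values())
    return {q: c / total for q, c in counts.items()}

def mixed_probability(count_prob, model_prob, lam):
    """Interpolate count-based and model-based estimates (MIX-style)."""
    return {q: lam * model_prob[q] + (1.0 - lam) * count_prob[q]
            for q in count_prob}

def subsampling_weights(prob):
    """Inverse-square-root weights, rescaled so they average to 1."""
    raw = {q: 1.0 / sqrt(p) for q, p in prob.items()}
    mean = sum(raw.values()) / len(raw)
    return {q: w / mean for q, w in raw.items()}

# Hypothetical query counts and hypothetical scores from a trained model.
query_counts = {
    ("tokyo", "capital_of"): 1,
    ("tokyo", "located_in"): 2,
    ("kyoto", "located_in"): 1,
}
model_scores = {
    ("tokyo", "capital_of"): 2.0,
    ("tokyo", "located_in"): 2.5,
    ("kyoto", "located_in"): 0.5,
}

count_prob = normalize(query_counts)       # frequency-based estimate
mbs_prob = softmax(model_scores)           # model-based estimate (MBS-style)
mix_prob = mixed_probability(count_prob, mbs_prob, lam=0.5)

mbs_weights = subsampling_weights(mbs_prob)
mix_weights = subsampling_weights(mix_prob)
```

Here the two queries with a count of 1 receive the same frequency-based probability, while the model-based estimate separates them because the stand-in model scores them differently. Setting `lam` to 0 recovers the count-based estimate and setting it to 1 recovers the model-based one.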

The authors evaluate these subsampling methods on several popular KGE models, including RotatE, TransE, HAKE, ComplEx, and DistMult, using the FB15k-237, WN18RR, and YAGO3-10 datasets. The results show that the proposed subsampling methods can improve the KG completion performance of these models compared to the standard frequency-based subsampling approach.
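This summary does not name the evaluation metrics. KG completion is commonly scored with ranking metrics such as Mean Reciprocal Rank (MRR) and Hits@k, so the sketch below shows how those are computed from the rank of the correct entity. The example scores and ranks are invented for illustration and are not results from the paper.

```python
def rank_of_answer(scores, answer):
    """1-based rank of `answer` when candidates are sorted by descending score."""
    target = scores[answer]
    return 1 + sum(1 for s in scores.values() if s > target)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of test queries whose correct answer is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Invented candidate scores for one query such as (tokyo, capital_of, ?).
candidate_scores = {"japan": 0.9, "asia": 0.4, "kyoto": 0.1}
example_rank = rank_of_answer(candidate_scores, "japan")

# Invented ranks of the correct answer for four test queries.
ranks = [1, 3, 2, 10]
mrr = mean_reciprocal_rank(ranks)
hits1 = hits_at_k(ranks, 1)
hits3 = hits_at_k(ranks, 3)
```

Higher values are better for both metrics. MRR rewards placing the correct entity near the top of the ranking, while Hits@k only checks whether it falls within the first k candidates.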

Critical Analysis

The paper presents a novel approach to subsampling in KGE that addresses the limitations of existing methods. By incorporating the predictions of KGE models to estimate query appearance probabilities, the proposed MBS and MIX methods show promising results in improving KG completion performance.

One potential limitation is that the subsampling methods may depend on the accuracy of the underlying KGE models. If the models make poor predictions, the estimated appearance probabilities may not reflect the true importance of different queries.

Additionally, the paper does not explore the computational and memory overhead introduced by the model-based subsampling approaches. Estimating query probabilities from model predictions requires a trained model and extra computation, which could matter in practical deployments.

Further research could investigate the robustness of the proposed methods to different types of KG datasets, as well as explore ways to minimize the computational impact of the model-based subsampling approach. Incorporating other types of information, such as the semantic or structural properties of the knowledge graph, could also be an interesting direction to enhance the subsampling process.

Conclusion

This paper presents a novel approach to subsampling in Knowledge Graph Embedding that addresses the limitations of current frequency-based methods. By leveraging the predictions of KGE models to estimate the appearance probabilities of queries, the proposed Model-based Subsampling (MBS) and Mixed Subsampling (MIX) methods demonstrate improved performance in KG completion tasks across various popular KGE models.

The key contribution of this work is the insight that incorporating model-based information can help better capture the importance of less frequent queries, which are often underestimated by traditional subsampling approaches. This research highlights the potential benefits of using model-guided techniques to enhance data sampling strategies in knowledge graph learning, and could inspire further developments in this direction.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Model-based Subsampling for Knowledge Graph Completion

Xincan Feng, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe

Subsampling is effective in Knowledge Graph Embedding (KGE) for reducing overfitting caused by the sparsity in Knowledge Graph (KG) datasets. However, current subsampling approaches consider only frequencies of queries that consist of entities and their relations. Thus, the existing subsampling potentially underestimates the appearance probabilities of infrequent queries even if the frequencies of their entities or relations are high. To address this problem, we propose Model-based Subsampling (MBS) and Mixed Subsampling (MIX) to estimate their appearance probabilities through predictions of KGE models. Evaluation results on datasets FB15k-237, WN18RR, and YAGO3-10 showed that our proposed subsampling methods actually improved the KG completion performances for popular KGE models, RotatE, TransE, HAKE, ComplEx, and DistMult.

4/15/2024

Exploiting Large Language Models Capabilities for Question Answer-Driven Knowledge Graph Completion Across Static and Temporal Domains

Rui Yang, Jiahao Zhu, Jianping Man, Li Fang, Yi Zhou

Knowledge graph completion (KGC) aims to identify missing triples in a knowledge graph (KG). This is typically achieved through tasks such as link prediction and instance completion. However, these methods often focus on either static knowledge graphs (SKGs) or temporal knowledge graphs (TKGs), addressing only within-scope triples. This paper introduces a new generative completion framework called Generative Subgraph-based KGC (GS-KGC). GS-KGC employs a question-answering format to directly generate target entities, addressing the challenge of questions having multiple possible answers. We propose a strategy that extracts subgraphs centered on entities and relationships within the KG, from which negative samples and neighborhood information are separately obtained to address the one-to-many problem. Our method generates negative samples using known facts to facilitate the discovery of new information. Furthermore, we collect and refine neighborhood path data of known entities, providing contextual information to enhance reasoning in large language models (LLMs). Our experiments evaluated the proposed method on four SKGs and two TKGs, achieving state-of-the-art Hits@1 metrics on five datasets. Analysis of the results shows that GS-KGC can discover new triples within existing KGs and generate new facts beyond the closed KG, effectively bridging the gap between closed-world and open-world KGC.

8/21/2024

Subgraph-Aware Training of Text-based Methods for Knowledge Graph Completion

Youmin Ko, Hyemin Yang, Taeuk Kim, Hyunjoon Kim

Fine-tuning pre-trained language models (PLMs) has recently shown a potential to improve knowledge graph completion (KGC). However, most PLM-based methods encode only textual information, neglecting various topological structures of knowledge graphs (KGs). In this paper, we empirically validate the significant relations between the structural properties of KGs and the performance of the PLM-based methods. To leverage the structural knowledge, we propose a Subgraph-Aware Training framework for KGC (SATKGC) that combines (i) subgraph-aware mini-batching to encourage hard negative sampling, and (ii) a new contrastive learning method to focus more on harder entities and harder negative triples in terms of the structural properties. To the best of our knowledge, this is the first study to comprehensively incorporate the structural inductive bias of the subgraphs into fine-tuning PLMs. Extensive experiments on four KGC benchmarks demonstrate the superiority of SATKGC. Our code is available.

7/24/2024

Unified Interpretation of Smoothing Methods for Negative Sampling Loss Functions in Knowledge Graph Embedding

Xincan Feng, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe

Knowledge Graphs (KGs) are fundamental resources in knowledge-intensive tasks in NLP. Due to the limitation of manually creating KGs, KG Completion (KGC) has an important role in automatically completing KGs by scoring their links with KG Embedding (KGE). To handle many entities in training, KGE relies on Negative Sampling (NS) loss that can reduce the computational cost by sampling. Since the appearance frequencies for each link are at most one in KGs, sparsity is an essential and inevitable problem. The NS loss is no exception. As a solution, the NS loss in KGE relies on smoothing methods like Self-Adversarial Negative Sampling (SANS) and subsampling. However, it is uncertain what kind of smoothing method is suitable for this purpose due to the lack of theoretical understanding. This paper provides theoretical interpretations of the smoothing methods for the NS loss in KGE and induces a new NS loss, Triplet Adaptive Negative Sampling (TANS), that can cover the characteristics of the conventional smoothing methods. Experimental results of TransE, DistMult, ComplEx, RotatE, HAKE, and HousE on FB15k-237, WN18RR, and YAGO3-10 datasets and their sparser subsets show the soundness of our interpretation and performance improvement by our TANS.

7/8/2024