Smart Bilingual Focused Crawling of Parallel Documents

Read original: arXiv:2405.14779 - Published 5/24/2024 by Cristian García-Romero, Miquel Esplà-Gomis, Felipe Sánchez-Martínez

Overview

• This paper proposes a "smart crawling" method for finding parallel texts (texts that are mutual translations) on the internet more efficiently than conventional brute-force approaches.
• The method uses two models: one to infer the language of a document from its URL, and another to determine whether a pair of URLs link to parallel documents.
• The results show that combining these two models enables the early discovery of parallel content during crawling, reducing the number of downloaded documents deemed useless and yielding a greater quantity of parallel documents compared to conventional crawling approaches.

Plain English Explanation

• Crawling the internet to find texts that are mutual translations (called "parallel texts") is usually done by simply downloading a huge number of documents in an unguided way, and only a small fraction of them end up containing actual parallel content.
• The researchers in this study developed a smarter way to crawl for parallel texts. Their approach uses two different models to guide the crawl towards finding parallel content more quickly.
• One model can tell the language of a document just from its web address (URL). The other model can determine if a pair of web addresses are linking to parallel documents.
• By combining these two models, the researchers were able to find parallel texts much more efficiently than the usual brute-force approach. This led to downloading fewer useless documents and getting more actual parallel content overall.
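To make the two ideas concrete, here is a minimal rule-based sketch in Python. It is not the authors' method: the paper relies on trained models, whereas this toy only looks for language codes in the URL. The names `LANG_CODES`, `guess_url_language`, and `look_parallel`, and the example URLs, are assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit

# Hypothetical set of language markers; a learned model would not need this list.
LANG_CODES = {"de", "en", "es", "fi", "fr", "is", "mt"}

# Match a language code only when it is not embedded in a longer word.
_LANG_RE = re.compile(r"(?<![a-z])(" + "|".join(sorted(LANG_CODES)) + r")(?![a-z])")


def _url_text(url):
    """Lower-cased host, path and query of a URL (the scheme is ignored)."""
    parts = urlsplit(url.lower())
    return parts.netloc + parts.path + "?" + parts.query


def guess_url_language(url):
    """Return the first language code that appears in the URL, or None."""
    match = _LANG_RE.search(_url_text(url))
    return match.group(1) if match else None


def look_parallel(url_a, url_b):
    """Guess whether two URLs point to translations of the same page.

    Toy rule: the URLs must carry different language codes and be
    identical once those codes are masked out.
    """
    lang_a, lang_b = guess_url_language(url_a), guess_url_language(url_b)
    if lang_a is None or lang_b is None or lang_a == lang_b:
        return False
    return _LANG_RE.sub("*", _url_text(url_a)) == _LANG_RE.sub("*", _url_text(url_b))


if __name__ == "__main__":
    print(guess_url_language("https://example.com/es/acerca.html"))
    print(look_parallel("https://example.com/en/news/42",
                        "https://example.com/es/news/42"))
```

A rule like this fails whenever a site does not encode the language in its URLs, or translates the path itself (for example `/about` versus `/acerca`), which is one reason to prefer learned models over hand-written patterns.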

Technical Explanation

• The researchers developed two models to guide their parallel text crawling approach:

A language identification model that can infer the language of a document from its URL
A parallelism detection model that can determine whether a pair of URLs link to parallel documents

• They evaluated the individual effectiveness of these two models, as well as their integration into a crawling tool.
• The results showed that both models were effective on their own, and that combining them enabled the early discovery of parallel content during crawling.
• This combination led to a reduction in the number of downloaded documents deemed useless, and produced a greater quantity of parallel documents compared to conventional crawling approaches.
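One way to picture how URL-level scores could steer a crawl is a priority queue over the frontier of undiscovered links. The Python sketch below illustrates that general idea only; the summary does not describe how the authors' crawling tool schedules downloads. The function names `crawl` and `toy_score`, the `links` graph, and the scoring rule are all hypothetical, and no network access takes place.

```python
import heapq
from itertools import count


def crawl(seed, links, score, budget):
    """Visit up to `budget` URLs, always expanding the highest-scoring one next.

    links: dict mapping a URL to the URLs found on that page; it stands in
           for downloading and parsing a document.
    score: callable(url, visited) -> float, where higher means more promising.
    """
    tie = count()  # tie-breaker so equal scores are served in insertion order
    frontier = [(-score(seed, []), next(tie), seed)]
    seen = {seed}
    visited = []
    while frontier and len(visited) < budget:
        _, _, url = heapq.heappop(frontier)
        visited.append(url)
        for out in links.get(url, []):
            if out not in seen:
                seen.add(out)
                # heapq is a min-heap, so the score is negated.
                heapq.heappush(frontier, (-score(out, visited), next(tie), out))
    return visited


def toy_score(url, visited):
    # Hypothetical stand-in for the combined output of the two models:
    # favour URLs whose path carries one of the two target language codes.
    return 1.0 if "/en/" in url or "/es/" in url else 0.0


# Tiny made-up site: a blog with no translations plus English/Spanish sections.
links = {
    "site/": ["site/blog/", "site/en/", "site/es/"],
    "site/en/": ["site/en/a"],
    "site/es/": ["site/es/a"],
    "site/blog/": ["site/blog/1"],
}

if __name__ == "__main__":
    print(crawl("site/", links, toy_score, budget=4))
```

With a budget of four downloads, this ordering reaches both language sections and one translated page before touching the blog, whereas a plain breadth-first crawl would spend one of its four downloads on `site/blog/`.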

Critical Analysis

• The paper does not provide much detail on the specific architectures or training processes of the two models, limiting the ability to fully evaluate their technical merits.
• The evaluation is also confined to a single crawling scenario, and it is unclear how the approach would scale or perform in more diverse real-world settings.
• Additionally, the paper does not address potential biases or limitations in the web content that the crawlers access, which could influence the diversity and representativeness of the parallel texts discovered.
• Further research could explore cross-lingual parallel corpora and alignment of shared cross-lingual spaces to broaden the applicability of this smart crawling approach.

Conclusion

• This paper presents a novel "smart crawling" method for efficiently discovering parallel texts on the internet, using two specialized models to guide the crawling process.
• The results demonstrate the effectiveness of this approach in reducing wasted effort and increasing the yield of parallel content, compared to conventional brute-force crawling.
• While further research is needed to fully evaluate the scalability and robustness of this technique, the core ideas represent an important step forward in streamlining the collection of multilingual parallel data, which is crucial for building high-quality cross-lingual language models and other cross-lingual NLP applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Smart Bilingual Focused Crawling of Parallel Documents

Cristian García-Romero, Miquel Esplà-Gomis, Felipe Sánchez-Martínez

Crawling parallel texts (texts that are mutual translations) from the Internet is usually done following a brute-force approach: documents are massively downloaded in an unguided process, and only a fraction of them end up leading to actual parallel content. In this work we propose a smart crawling method that guides the crawl towards finding parallel content more rapidly. Our approach builds on two different models: one that infers the language of a document from its URL, and another that infers whether a pair of URLs link to parallel documents. We evaluate both models in isolation and their integration into a crawling tool. The results demonstrate the individual effectiveness of both models and highlight that their combination enables the early discovery of parallel content during crawling, leading to a reduction in the amount of downloaded documents deemed useless, and yielding a greater quantity of parallel documents compared to conventional crawling approaches.

5/24/2024

A Japanese-Chinese Parallel Corpus Using Crowdsourcing for Web Mining

Masaaki Nagata, Makoto Morishita, Katsuki Chousa, Norihito Yasuda

Using crowdsourcing, we collected more than 10,000 URL pairs (parallel top page pairs) of bilingual websites that contain parallel documents and created a Japanese-Chinese parallel corpus of 4.6M sentence pairs from these websites. We used a Japanese-Chinese bilingual dictionary of 160K word pairs for document and sentence alignment. We then used high-quality 1.2M Japanese-Chinese sentence pairs to train a parallel corpus filter based on statistical language models and word translation probabilities. We compared the translation accuracy of the model trained on these 4.6M sentence pairs with that of the model trained on Japanese-Chinese sentence pairs from CCMatrix (12.4M), a parallel corpus from global web mining. Although our corpus is only one-third the size of CCMatrix, we found that the accuracy of the two models was comparable and confirmed that it is feasible to use crowdsourcing for web mining of parallel data.

5/16/2024

A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models

Peiqin Lin, André F. T. Martins, Hinrich Schütze

Recent studies have highlighted the potential of exploiting parallel corpora to enhance multilingual large language models, improving performance in both bilingual tasks, e.g., machine translation, and general-purpose tasks, e.g., text classification. Building upon these findings, our comprehensive study aims to identify the most effective strategies for leveraging parallel corpora. We investigate the impact of parallel corpora quality and quantity, training objectives, and model size on the performance of multilingual large language models enhanced with parallel corpora across diverse languages and tasks. Our analysis reveals several key insights: (i) filtering noisy translations is essential for effectively exploiting parallel corpora, while language identification and short sentence filtering have little effect; (ii) even a corpus containing just 10K parallel sentences can yield results comparable to those obtained from much larger datasets; (iii) employing only the machine translation objective yields the best results among various training objectives and their combinations; (iv) larger multilingual language models benefit more from parallel corpora than smaller models due to their stronger capacity for cross-task transfer. Our study offers valuable insights into the optimal utilization of parallel corpora to enhance multilingual large language models, extending the generalizability of previous findings from limited languages and tasks to a broader range of scenarios.

7/2/2024

Investigating the translation capabilities of Large Language Models trained on parallel data only

Javier García Gilabert, Carlos Escolano, Aleix Sant Savall, Francesca De Luca Fornaciari, Audrey Mash, Xixian Liao, Maite Melero

In recent years, Large Language Models (LLMs) have demonstrated exceptional proficiency across a broad spectrum of Natural Language Processing (NLP) tasks, including Machine Translation. However, previous methods predominantly relied on iterative processes such as instruction fine-tuning or continual pre-training, leaving unexplored the challenges of training LLMs solely on parallel data. In this work, we introduce PLUME (Parallel Language Model), a collection of three 2B LLMs featuring varying vocabulary sizes (32k, 128k, and 256k) trained exclusively on Catalan-centric parallel examples. These models perform comparably to previous encoder-decoder architectures on 16 supervised translation directions and 56 zero-shot ones. Utilizing this set of models, we conduct a thorough investigation into the translation capabilities of LLMs, probing their performance, the impact of the different elements of the prompt, and their cross-lingual representation space.

6/14/2024