Tohoku-nlp

Models by this creator

bert-base-japanese-whole-word-masking

tohoku-nlp

Total Score: 55

bert-base-japanese-whole-word-masking is a BERT model pretrained on Japanese text. It uses word-level tokenization based on the IPA dictionary, followed by WordPiece subword tokenization. The model is trained with whole word masking, where all subwords corresponding to a single word are masked at once during the masked language modeling (MLM) objective. Similar models include the BERT large model (uncased) with whole word masking finetuned on SQuAD, the Chinese BERT models with whole word masking, and the multilingual BERT base model. These models leverage whole word masking and multilingual training to improve performance on language understanding tasks.

Model inputs and outputs

Inputs
- Japanese text as a sequence of tokens

Outputs
- Contextualized token representations that can be used for downstream natural language processing tasks

Capabilities

The bert-base-japanese-whole-word-masking model can be used for a variety of Japanese language understanding tasks, such as text classification, named entity recognition, and question answering. Its use of whole word masking during pretraining allows the model to better capture word-level semantics in Japanese.

What can I use it for?

You can use this model as a starting point for fine-tuning on your own Japanese language task. For example, you could fine-tune it on a Japanese text classification dataset to build a product categorization system, or on a Japanese question answering dataset to create a customer support chatbot.

Things to try

One interesting thing to try is to compare this model's performance on Japanese tasks against models that use character-level or subword-level tokenization, to see whether whole word masking provides a significant boost in accuracy. You could also use the model's contextualized token representations as input features for other Japanese NLP models, to see whether they improve performance.
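As a concrete starting point, here is a minimal sketch of masked-word prediction with this model through the Transformers library. The Hugging Face hub ID, the example sentence, and the fugashi/ipadic tokenizer dependencies are assumptions based on the description above, not details from the original listing.

```python
# Minimal sketch (assumptions: hub ID below; MeCab-based tokenizer deps
# installed via `pip install fugashi ipadic`).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "tohoku-nlp/bert-base-japanese-whole-word-masking"  # assumed hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# "Tokyo is the capital of [MASK]." in Japanese
text = f"東京は{tokenizer.mask_token}の首都です。"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Pick the highest-scoring token at the masked position
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # e.g. 日本
```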

Updated 5/27/2024

bert-base-japanese-v3

tohoku-nlp

Total Score: 43

The bert-base-japanese-v3 model is a Japanese language model based on the BERT architecture, developed by the tohoku-nlp team. It is trained on a large corpus of Japanese text, including the Japanese portion of the CC-100 dataset and the Japanese Wikipedia. The model uses word-level tokenization based on the Unidic 2.1.2 dictionary, followed by WordPiece subword tokenization. It is trained with whole word masking, where all subword tokens corresponding to a single word are masked at once during pretraining. This model can be compared to other Japanese BERT models like bert-base-japanese-whole-word-masking, which also uses whole word masking, and the multilingual bert-base-multilingual-uncased model, which covers 102 languages including Japanese.

Model inputs and outputs

Inputs
- Text: The bert-base-japanese-v3 model takes in Japanese text, which is first tokenized using the Unidic 2.1.2 dictionary and then split into subwords using the WordPiece algorithm.

Outputs
- Token representations: The model outputs contextual representations for each token in the input text, which can be used for a variety of downstream natural language processing tasks.

Capabilities

The bert-base-japanese-v3 model is a powerful language model that can be fine-tuned for a wide range of Japanese natural language processing tasks, such as text classification, named entity recognition, and question answering. Its whole word masking approach during pretraining allows the model to better capture the semantics of Japanese words, which are often composed of multiple characters.

What can I use it for?

The bert-base-japanese-v3 model can be used as a starting point for building Japanese language applications, such as:

- Text classification: Classify Japanese text into different categories (e.g., sentiment analysis, topic classification).
- Named entity recognition: Identify and extract named entities (e.g., people, organizations, locations) from Japanese text.
- Question answering: Build systems that can answer questions based on Japanese text passages.

To use the model, you can leverage the Hugging Face Transformers library, which provides easy-to-use APIs for fine-tuning and deploying BERT-based models (see the sketch after this description).

Things to try

One interesting thing to try with the bert-base-japanese-v3 model is to compare its performance on Japanese language tasks against other Japanese language models, such as bert-base-japanese-whole-word-masking or the multilingual bert-base-multilingual-uncased model. This can help you understand the trade-offs and advantages of the different approaches to pretraining and tokenization used by these models.
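As a companion to the Transformers usage mentioned above, here is a minimal sketch of extracting contextual token representations with this model. The hub ID, the example sentence, and the fugashi/unidic-lite tokenizer dependencies are assumptions, not details from the original listing.

```python
# Minimal sketch (assumptions: hub ID below; Unidic-based tokenizer deps
# installed via `pip install fugashi unidic-lite`).
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "tohoku-nlp/bert-base-japanese-v3"  # assumed hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

text = "仙台は杜の都と呼ばれています。"  # "Sendai is known as the city of trees."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per token (including the [CLS] and [SEP] specials),
# usable as features for a downstream classifier or tagging head.
token_embeddings = outputs.last_hidden_state  # shape: [1, seq_len, 768]
print(token_embeddings.shape)
```

For supervised fine-tuning on a labeled dataset, swapping AutoModel for AutoModelForSequenceClassification (with num_labels set for your task) would follow the same loading pattern.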

Updated 9/6/2024