starpii

Maintainer: bigcode

Total Score

104

Last updated 4/29/2024

👀

  • Run this model: Run on HuggingFace
  • API spec: View on HuggingFace
  • Github link: No Github link provided
  • Paper link: No paper link provided


Model overview

The starpii model is a Named Entity Recognition (NER) model trained to detect Personally Identifiable Information (PII) in code datasets. It was fine-tuned by bigcode on a PII dataset they annotated, which is available with gated access. The model was initially trained on a pseudo-labeled dataset to improve its performance on rare PII entities such as keys.

The model fine-tuned on the annotated dataset can detect six target classes: Names, Emails, Keys, Passwords, IP addresses, and Usernames. It uses the bigcode-encoder as its base encoder model, which was pre-trained on 88 programming languages from The Stack dataset.

Model inputs and outputs

Inputs

  • Raw text containing code snippets or documents

Outputs

  • Annotated text with PII entities highlighted and classified into one of the six target classes
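
To try the model from Python, a minimal sketch using the Hugging Face transformers token-classification pipeline might look like the following. The checkpoint id bigcode/starpii is inferred from the maintainer and model name on this page, and the weights are gated, so you may need to authenticate with Hugging Face first:

```python
from transformers import pipeline

# Minimal sketch: the checkpoint id "bigcode/starpii" is assumed from the
# maintainer and model name above; the model is gated on Hugging Face, so
# `huggingface-cli login` may be required before it will download.
pii_detector = pipeline(
    "token-classification",
    model="bigcode/starpii",
    aggregation_strategy="simple",  # merge sub-word tokens into entity spans
)

snippet = 'smtp_password = "hunter2"  # mail admin@example.com if this breaks'
for entity in pii_detector(snippet):
    print(entity["entity_group"], round(entity["score"], 3),
          repr(snippet[entity["start"]:entity["end"]]))
```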

Capabilities

The starpii model demonstrates strong performance in detecting various types of PII entities within code, including rare ones like keys and passwords. This can be useful for privacy-preserving applications that need to automatically identify and redact sensitive information.

What can I use it for?

The starpii model can be applied to a variety of use cases where identifying PII in code is important, such as:

  • Anonymizing code datasets before sharing or publishing (a redaction sketch follows this list)
  • Detecting sensitive information in internal code repositories
  • Supporting regulatory compliance by finding PII in financial or legal documents
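
For the anonymization use case, a hypothetical redaction helper could wrap the detector sketched above and replace each detected span with a placeholder naming its class. Span offsets come from the pipeline output when aggregation_strategy="simple" is used:

```python
from transformers import pipeline

# Checkpoint id "bigcode/starpii" is assumed, as in the earlier sketch.
pii_detector = pipeline("token-classification", model="bigcode/starpii",
                        aggregation_strategy="simple")

def redact(text: str) -> str:
    """Replace every detected PII span with a placeholder such as [EMAIL]."""
    entities = pii_detector(text)
    # Work right-to-left so earlier character offsets stay valid while editing.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

print(redact('token = "ghp_example123"  # ask jane.doe@example.com for a new one'))
```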

Things to try

One interesting aspect of the starpii model is its use of a pseudo-labeled dataset for initial training. This technique can be helpful for improving model performance on rare entities that are difficult to obtain labeled data for. You could experiment with applying similar approaches to other domain-specific NER tasks.
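
As a rough, hypothetical illustration of the idea (not the BigCode training recipe), a pseudo-labeling pass could run an existing detector over unlabeled snippets and keep only high-confidence predictions as extra training examples:

```python
from transformers import pipeline

# Hypothetical pseudo-labeling sketch; the data and threshold are placeholders.
detector = pipeline("token-classification", model="bigcode/starpii",
                    aggregation_strategy="simple")

unlabeled_snippets = [
    'aws_secret = "AKIAEXAMPLEKEY"',
    'print("hello world")',
]

pseudo_labeled = []
for snippet in unlabeled_snippets:
    entities = [e for e in detector(snippet) if e["score"] > 0.9]
    if entities:  # keep only snippets the model is confident about
        pseudo_labeled.append({"text": snippet, "entities": entities})
```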



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models

📊

starencoder

bigcode

Total Score

48

The starencoder model is an encoder-only Transformer model trained on over 80 programming languages from The Stack dataset. It leverages Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives, similar to the approach used in BERT, to learn representations of code, which lets it be efficiently fine-tuned for a variety of code-related tasks. Similar models include StarCoder, a 15.5B parameter causal language model trained on the same dataset, and StarCoder2, which uses improved architectural techniques like Grouped Query Attention. The StarCoderBase model is an earlier version of the StarCoder model.

Model inputs and outputs

Inputs

  • Code snippets: sequences of code up to 1024 tokens long, across 80+ programming languages

Outputs

  • Encoded representations: a sequence of encoded representations for the input code, which can be used for downstream tasks like classification, generation, or retrieval

Capabilities

The starencoder model can understand and encode code across a wide range of programming languages, which makes it useful for tasks like code search, code summarization, and anomaly detection. The model has been fine-tuned on a token classification task to detect personally identifiable information (PII) in code, resulting in the StarPII model.

What can I use it for?

The starencoder model can be used as a powerful pre-trained representation for a variety of code-related applications. Some potential use cases include:

  • Code search and retrieval: fine-tune the model on a code search task to quickly find relevant code snippets in a large codebase
  • Code summarization: use the model to generate readable summaries of code functionality
  • Code anomaly detection: leverage the model's understanding of code to identify potentially problematic or security-relevant code patterns
  • PII detection in code: the fine-tuned StarPII model can help identify and redact sensitive information in code

Things to try

One interesting aspect of the starencoder model is its ability to handle a wide range of programming languages. Try fine-tuning the model on a task that requires understanding multiple languages, like cross-language code search or translation. You can also experiment with using the model's representations as input to other ML models for novel code-related applications.
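
To get code representations out of the encoder, a minimal sketch might look like the following. The checkpoint id bigcode/starencoder is assumed from the model name above, and mean pooling is just one simple way to turn token states into a snippet embedding:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Minimal sketch; the checkpoint id "bigcode/starencoder" is assumed.
tokenizer = AutoTokenizer.from_pretrained("bigcode/starencoder")
model = AutoModel.from_pretrained("bigcode/starencoder")

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, truncation=True, max_length=1024, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_dim)

# Mean-pool the token representations into a single vector that could feed
# downstream tasks such as code search or clustering.
embedding = hidden.mean(dim=1).squeeze(0)
print(embedding.shape)
```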


🐍

starcoder

bigcode

Total Score

2.7K

The starcoder model is a 15.5B parameter language model developed by BigCode, trained on 80+ programming languages from The Stack (v1.2) dataset. It uses Multi Query Attention, a context window of 8192 tokens, and the Fill-in-the-Middle objective, and was trained on 1 trillion tokens. The model is available on the Hugging Face platform and can be used for various programming-related tasks.

The starcoder model can be compared to similar models like Magicoder-S-DS-6.7B, which is also a large language model trained on code, and WizardLM-7B-uncensored-GPTQ, a large language model focused on general text generation. These models share similar target domains and capabilities, but differ in their specific architecture, training data, and intended use cases.

Model inputs and outputs

The starcoder model is a causal language model, which means it generates text auto-regressively: it takes in a sequence of tokens as input and produces a sequence of tokens as output, where each output token is predicted from the previous tokens in the sequence.

Inputs

  • Prompt: a sequence of tokens that the model uses as the starting point for text generation

Outputs

  • Generated text: a sequence of tokens generated by the model, continuing the input prompt

Capabilities

The starcoder model is designed to excel at programming-related tasks, such as code generation, code completion, and programming language understanding. It can be used to generate code snippets, complete partially written code, and even translate between different programming languages. The model's broad training on 80+ programming languages allows it to handle a wide variety of coding tasks and contexts.

What can I use it for?

The starcoder model can be used for a variety of programming-related applications, such as:

  • Code generation: automatically generating code based on a natural language description or prompt
  • Code completion: suggesting completions for partially written code
  • Programming language translation: translating code between different programming languages
  • Documentation generation: automatically generating documentation for code
  • Programming education: assisting students in learning programming concepts and syntax

The model's capabilities can be leveraged in various industries, such as software development, programming education, and technical writing.

Things to try

One interesting aspect of the starcoder model is its use of the Fill-in-the-Middle objective during training. This approach allows the model to learn to generate text in a more holistic, contextual manner, rather than just predicting the next token in a sequence. You can experiment with this by using the `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>` tokens to guide the model's text generation. Another interesting area to explore is the model's ability to handle different programming languages. You can try providing prompts in various languages and observe how the model responds, or even attempt to translate code between languages using the model.
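
A minimal Fill-in-the-Middle sketch is shown below, assuming the checkpoint is published as bigcode/starcoder (the licence must be accepted on Hugging Face, and the 15.5B weights need a large GPU or device_map="auto" with accelerate installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch; checkpoint id and hardware setup are assumptions, and the
# <fim_*> tokens follow the usual StarCoder Fill-in-the-Middle format.
checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

# The model is asked to fill in the body between the prefix and the suffix.
prompt = "<fim_prefix>def fibonacci(n):\n    <fim_suffix>\n    return a<fim_middle>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))
```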


📊

starcoder2-15b

bigcode

Total Score

505

The starcoder2-15b model is a 15B parameter model trained on 600+ programming languages from The Stack v2 dataset, with opt-out requests excluded. The model uses Grouped Query Attention, a context window of 16,384 tokens with a sliding window attention of 4,096 tokens, and was trained using the Fill-in-the-Middle objective on 4+ trillion tokens. Training used the NVIDIA NeMo Framework on the NVIDIA Eos Supercomputer built with NVIDIA DGX H100 systems. The starcoder2-15b model is an evolution of the earlier StarCoder model, a 15.5B parameter model trained on 80+ programming languages; both models were developed by the BigCode team.

Model inputs and outputs

Inputs

  • Text prompts in any of the 600+ programming languages the model was trained on

Outputs

  • Generated code in response to the input prompt

Capabilities

The starcoder2-15b model can generate code in a wide variety of programming languages. It can be used for tasks like code completion, code generation, and even open-ended programming challenges. The model's large size and extensive training data allow it to handle complex programming concepts and idioms across many languages.

What can I use it for?

The starcoder2-15b model could be useful for a variety of applications, such as:

  • Building programming assistants to help developers write code more efficiently
  • Generating example code snippets for educational or documentation purposes
  • Prototyping new ideas and quickly iterating on code-based projects
  • Integrating code generation capabilities into no-code or low-code platforms

Things to try

One interesting aspect of the starcoder2-15b model is its ability to handle long-form context. Because it was trained with a 16,384 token context window, the model can generate code that stays coherent and consistent over a large number of lines. You could try providing the model with a partially completed function or class definition and see if it can generate the remaining implementation. Another interesting experiment would be to fine-tune the starcoder2-15b model on a specific programming language or domain-specific dataset, which could give it specialized knowledge and skills tailored to your particular use case.
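
A minimal completion sketch follows, assuming the checkpoint id bigcode/starcoder2-15b, a recent transformers release with Starcoder2 support, and bfloat16 weights to keep memory manageable:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch; checkpoint id and hardware setup are assumptions.
checkpoint = "bigcode/starcoder2-15b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

# Give the model the start of a function and let it complete the body.
prompt = "def quicksort(items):\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```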


piiranha-v1-detect-personal-information

iiiorg

Total Score

97

piiranha-v1-detect-personal-information is a fine-tuned model developed by iiiorg that is trained to detect 17 types of personally identifiable information (PII) across six languages. It achieves an overall classification accuracy of 99.44% and successfully catches 98.27% of PII tokens. The model is especially accurate at detecting passwords, emails (100%), phone numbers, and usernames. Similar models include StarPII, an NER model trained to detect PII in code datasets, and GLiNER PII, an NER model that can recognize various types of PII entities.

Model inputs and outputs

Inputs

  • Text data containing personally identifiable information

Outputs

  • Detected PII entities with their corresponding labels, such as:
      • Account Number
      • Email
      • Phone Number
      • Password
      • Social Security Number

Capabilities

piiranha-v1-detect-personal-information is highly accurate at identifying a wide range of PII entities, including sensitive information like passwords, credit card numbers, and social security numbers. This makes it a valuable tool for privacy protection and data anonymization use cases.

What can I use it for?

The piiranha-v1-detect-personal-information model can be used to automatically detect and redact or remove personally identifiable information from text data, such as customer records, support tickets, or user-generated content. This can help organizations comply with data privacy regulations and protect sensitive user information.

Things to try

You could try using the piiranha-v1-detect-personal-information model to analyze text data from your own organization and identify any PII that may need to be removed or protected. You could also experiment with fine-tuning the model on your own dataset to improve its performance for your specific use case.
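
A minimal sketch of running the detector on natural-language text, assuming the checkpoint id iiiorg/piiranha-v1-detect-personal-information from the maintainer and model name above:

```python
from transformers import pipeline

# Minimal sketch; the checkpoint id is assumed from this page.
detector = pipeline(
    "token-classification",
    model="iiiorg/piiranha-v1-detect-personal-information",
    aggregation_strategy="simple",  # group sub-word tokens into entity spans
)

text = "Hi, I'm Jane Doe, reach me at jane.doe@example.com or +1 555 0100."
for entity in detector(text):
    print(entity["entity_group"], round(entity["score"], 3), entity["word"])
```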
