BigCode

Models by this creator

🐍

starcoder

bigcode

Total Score

2.7K

The starcoder model is a 15.5B parameter language model developed by BigCode, trained on 80+ programming languages from The Stack (v1.2) dataset. It uses Multi Query Attention, a context window of 8192 tokens, and the Fill-in-the-Middle objective, and was trained on 1 trillion tokens. The model is available on the Hugging Face platform and can be used for various programming-related tasks. The starcoder model can be compared to similar models like Magicoder-S-DS-6.7B, which is also a large language model trained on code, and WizardLM-7B-uncensored-GPTQ, which is a large language model focused on general text generation. These models share similarities in their target domains and capabilities, but may differ in their specific architecture, training data, and intended use cases.

Model inputs and outputs

The starcoder model is a causal language model, which means it can be used to generate text in an auto-regressive manner. The model takes in a sequence of tokens as input and generates a sequence of tokens as output, where each output token is predicted based on the previous tokens in the sequence.

Inputs

- **Prompt**: A sequence of tokens that the model uses as the starting point for text generation.

Outputs

- **Generated text**: A sequence of tokens generated by the model, continuing the input prompt.

Capabilities

The starcoder model is designed to excel at programming-related tasks, such as code generation, code completion, and programming language understanding. It can be used to generate code snippets, complete partially written code, and even translate between different programming languages. The model's broad training on 80+ programming languages allows it to handle a wide variety of coding tasks and contexts.

What can I use it for?

The starcoder model can be used for a variety of programming-related applications, such as:

- **Code generation**: Automatically generating code based on a natural language description or prompt.
- **Code completion**: Suggesting completions for partially written code.
- **Programming language translation**: Translating code between different programming languages.
- **Documentation generation**: Automatically generating documentation for code.
- **Programming education**: Assisting students in learning programming concepts and syntax.

The model's capabilities can be leveraged in various industries, such as software development, programming education, and technical writing.

Things to try

One interesting aspect of the starcoder model is its use of the Fill-in-the-Middle objective during training. This approach allows the model to learn to generate text in a more holistic, contextual manner, rather than just predicting the next token in a sequence. You can experiment with this by using the `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>` special tokens to guide the model's text generation. Another interesting area to explore is the model's ability to handle different programming languages. You can try providing prompts in various languages and observe how the model responds, or even attempt to translate code between languages using the model.
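As a rough sketch of what that looks like in practice (not taken from the model card), the prompt below interleaves the FIM special tokens with a prefix and suffix and lets the model produce the middle. It assumes access to the gated bigcode/starcoder checkpoint and enough GPU memory; check the tokenizer's special tokens to confirm the exact FIM token names.

```python
# Minimal sketch of fill-in-the-middle prompting with StarCoder.
# Assumes access to the gated bigcode/starcoder checkpoint and a GPU with enough memory.
from transformers import pipeline

generator = pipeline("text-generation", model="bigcode/starcoder", device_map="auto")

# The model fills in the span between the prefix and the suffix.
prompt = (
    "<fim_prefix>def fibonacci(n):\n"
    '    """Return the n-th Fibonacci number."""\n'
    "<fim_suffix>\n    return b\n<fim_middle>"
)

completion = generator(prompt, max_new_tokens=64, do_sample=False)
print(completion[0]["generated_text"])
```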

Updated 4/28/2024

📊

starcoder2-15b

bigcode

Total Score

505

The starcoder2-15b model is a 15B parameter model trained on 600+ programming languages from The Stack v2 dataset, with opt-out requests excluded. The model uses Grouped Query Attention, a context window of 16,384 tokens with a sliding window attention of 4,096 tokens, and was trained using the Fill-in-the-Middle objective on 4+ trillion tokens. The model was trained using the NVIDIA NeMo Framework on the NVIDIA Eos Supercomputer built with NVIDIA DGX H100 systems. The starcoder2-15b model is an evolution of the earlier StarCoder model, which was a 15.5B parameter model trained on 80+ programming languages. Both models were developed by the BigCode team.

Model inputs and outputs

Inputs

- Text prompts in any of the 600+ programming languages the model was trained on

Outputs

- Generated code in response to the input prompt

Capabilities

The starcoder2-15b model is capable of generating code in a wide variety of programming languages. It can be used for tasks like code completion, code generation, and even open-ended programming challenges. The model's large size and extensive training data allow it to handle complex programming concepts and idioms across many languages.

What can I use it for?

The starcoder2-15b model could be useful for a variety of applications, such as:

- Building programming assistants to help developers write code more efficiently
- Generating example code snippets for educational or documentation purposes
- Prototyping new ideas and quickly iterating on code-based projects
- Integrating code generation capabilities into no-code or low-code platforms

Things to try

One interesting aspect of the starcoder2-15b model is its ability to handle long-form context. By training on a 16,384 token context window, the model can generate code that is coherent and consistent over a large number of lines. You could try providing the model with a partially completed function or class definition and see if it can generate the remaining implementation. Another interesting experiment would be to fine-tune the starcoder2-15b model on a specific programming language or domain-specific dataset. This could allow the model to develop specialized knowledge and skills tailored to your particular use case.
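As a minimal sketch (assuming a GPU with enough memory and access to the checkpoint on the Hub), here is how a partially completed function could be handed to the model with the Hugging Face transformers library:

```python
# Sketch: completing a partially written function with starcoder2-15b.
# bfloat16 and device_map="auto" are assumptions about available hardware;
# a 15B model typically needs a large GPU or quantization.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-15b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "def quicksort(items: list[int]) -> list[int]:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```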

Updated 5/28/2024

🔄

starcoderbase

bigcode

Total Score

380

The starcoderbase model is a 15.5B parameter AI model developed by the BigCode project. It was trained on over 80 programming languages from The Stack (v1.2) dataset, using techniques like Multi Query Attention, a context window of 8192 tokens, and the Fill-in-the-Middle objective. The model is available on the Hugging Face platform and can be accessed through the bigcode/starcoderbase checkpoint. The starcoderbase model is related to other BigCode models like starcoder and starcoder2-15b, which also focus on code generation and have been trained on a large corpus of programming languages.

Model inputs and outputs

Inputs

- **Text prompts**: The model takes text prompts as input, which can be code snippets, natural language instructions, or a combination of the two.

Outputs

- **Generated text**: The model outputs generated text, which can be continuations of the input prompt, such as completing a code snippet or generating a new section of code.

Capabilities

The starcoderbase model is capable of generating code in over 80 programming languages. It can be used for tasks like code completion, code generation, and even code translation between different languages. The model has demonstrated strong performance on metrics like the MultiPL-E benchmark across a variety of programming languages.

What can I use it for?

The starcoderbase model can be used as a foundation for building various AI-powered coding tools and applications. For example, it could be integrated into an IDE to provide intelligent code completion and generation features, or used to build a virtual programming assistant that can help developers with a wide range of coding tasks.

Things to try

One interesting aspect of the starcoderbase model is its Fill-in-the-Middle (FIM) capability, which allows it to generate code by filling in the middle of a code snippet while preserving the prefix and suffix. This could be useful for tasks like implementing a specific algorithm or function within a larger codebase. Additionally, the model's ability to generate code in over 80 programming languages opens up the possibility of building multilingual coding tools or applications that can seamlessly switch between different languages as needed.
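One hedged way to try the cross-language angle is to frame translation as plain completion: show the model a function in one language, then start the equivalent in another and let it continue. The prompt below is illustrative, not a documented recipe, and assumes access to the gated checkpoint:

```python
# Sketch: prompting starcoderbase to "translate" by continuing a cross-language prompt.
from transformers import pipeline

pipe = pipeline("text-generation", model="bigcode/starcoderbase", device_map="auto")

prompt = (
    "# Python\n"
    "def is_palindrome(s):\n"
    "    return s == s[::-1]\n\n"
    "// The same function in JavaScript\n"
    "function isPalindrome(s) {\n"
)
print(pipe(prompt, max_new_tokens=60, do_sample=False)[0]["generated_text"])
```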

Updated 4/28/2024

🗣️

santacoder

bigcode

Total Score

324

The santacoder models are a series of 1.1B parameter models trained on the Python, Java, and JavaScript subset of The Stack (v1.1) by the bigcode team. These models use Multi Query Attention, a context window of 2048 tokens, and were trained using near-deduplication and comment-to-code ratio as filtering criteria, as well as the Fill-in-the-Middle objective. There are several variants of the model with different filter parameters, architecture, and objective variations. Similar models from the bigcode team include the StarCoder and StarCoderBase models, which are 15.5B parameter models trained on 80+ programming languages.

Model inputs and outputs

Inputs

- Code snippets or prompts related to Python, Java, or JavaScript programming

Outputs

- Completed code that builds upon the input prompt
- Explanations or annotations about the code

Capabilities

The santacoder models are capable of generating relevant code completions based on the provided input. They can handle a variety of programming tasks, from filling in function bodies to generating entire classes or modules. The models have been trained to maintain code structure and syntax, making the generated output usable in real-world applications.

What can I use it for?

The santacoder models can be used as a starting point for building AI-powered code completion tools, intelligent code editors, or automated programming assistants. Developers can fine-tune the models for their specific use cases or integrate them into their existing workflows to boost productivity and save time.

Things to try

One interesting aspect of the santacoder models is their ability to generate code based on comments or docstrings. Try providing the model with a function signature and docstring, and see if it can complete the function body in a logical and syntactically correct way. You can also experiment with the different variants of the model to see how they perform on your specific tasks or datasets.
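To see the docstring-driven behavior described above, something like the following sketch could work. The trust_remote_code=True flag is an assumption based on the checkpoint shipping a custom model class, so verify it against the current model card before running:

```python
# Sketch: asking santacoder to complete a function body from a signature and docstring.
# trust_remote_code=True assumes the checkpoint ships a custom model class; verify first.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/santacoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)

prompt = (
    "def word_frequencies(text):\n"
    '    """Return a dict mapping each word in `text` to its count."""\n'
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=96, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```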

Updated 5/28/2024

🤔

starcoderplus

bigcode

Total Score

212

The starcoderplus model is a text-to-text AI model developed by bigcode. This model is similar to other large language models like mpt-30B-chat-GGML, stt_en_conformer_transducer_xlarge, codellama-13b, codellama-13b-instruct, and meta-llama-3-70b-instruct, which are also designed for text generation and natural language processing tasks.

Model inputs and outputs

The starcoderplus model takes text as input and generates text as output. The model is trained on a large corpus of text data, allowing it to understand and generate human-like language.

Inputs

- Text prompts

Outputs

- Generated text that continues or completes the input prompt

Capabilities

The starcoderplus model can be used for a variety of text-related tasks, such as language generation, text summarization, and question answering. It can generate coherent and contextually relevant text, making it useful for applications like content creation, chatbots, and language translation.

What can I use it for?

The starcoderplus model can be used for a range of applications, such as bigcode's own services or for building custom natural language processing solutions. For example, the model could be used to generate product descriptions, write news articles, or provide human-like responses in a conversational interface.

Things to try

Depending on your specific use case, you could experiment with providing the starcoderplus model with different types of text prompts and observe the generated outputs. This can help you understand the model's strengths and limitations, and identify ways to best leverage its capabilities for your needs.

Updated 4/29/2024

🤯

starcoder2-7b

bigcode

Total Score

138

The starcoder2-7b model is a 7B parameter AI model trained by bigcode on 17 programming languages from The Stack v2 dataset. The model uses advanced techniques like Grouped Query Attention, a context window of 16,384 tokens with a sliding window attention of 4,096 tokens, and was trained using the Fill-in-the-Middle objective on over 3.5 trillion tokens. The starcoder2-7b model is comparable to other large language models like starcoder2-15b, starcoder, and starcoderbase in terms of its scale and capabilities, but was trained on a more focused set of programming languages.

Model inputs and outputs

The starcoder2-7b model is a text-to-text transformer model, meaning it takes in text as input and generates text as output. The model can be used for a variety of text generation tasks, such as code completion, commenting, and summarization.

Inputs

- **Text prompts**: The model accepts arbitrary text prompts as input, which can be used to guide the model's generation.

Outputs

- **Generated text**: The model outputs generated text, which can be code, comments, or other forms of text.

Capabilities

The starcoder2-7b model is capable of generating high-quality code in 17 programming languages, including Python, Java, and JavaScript. The model can be used for tasks like code completion, where the model can suggest the next few lines of code based on a given prompt. The model can also be used for code summarization, where the model can generate a concise summary of a given code snippet.

What can I use it for?

The starcoder2-7b model can be used for a variety of applications in the software development and AI research domains. Some potential use cases include:

- **Code generation**: The model can be used to generate boilerplate code, implement algorithms, or complete partially written functions.
- **Code summarization**: The model can be used to generate concise summaries of code snippets, which can be useful for documentation or code review.
- **Code translation**: The model can be used to translate code between different programming languages.
- **Code refactoring**: The model can be used to suggest improvements or optimizations to existing code.

Things to try

One interesting thing to try with the starcoder2-7b model is using the Fill-in-the-Middle (FIM) technique, which allows the model to generate text by filling in the middle of a provided prefix and suffix. This can be useful for tasks like code completion, where the user provides the function signature and the model generates the function body. Another interesting thing to try is fine-tuning the model on a specific domain or task. Since the starcoder2-7b model was trained on a broad dataset, fine-tuning it on a more specialized dataset could improve its performance on certain tasks.
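Because this is a base completion model rather than an instruction-tuned one, summarization has to be framed as a continuation. The comment-style prompt below is only an illustration of that idea, and the output will need review:

```python
# Sketch: coaxing a summary out of the base model by framing it as a comment to complete.
from transformers import pipeline

pipe = pipeline("text-generation", model="bigcode/starcoder2-7b", device_map="auto")

snippet = (
    "def moving_average(xs, k):\n"
    "    return [sum(xs[i:i + k]) / k for i in range(len(xs) - k + 1)]\n"
)
prompt = snippet + "\n# The function above, in one sentence:"
print(pipe(prompt, max_new_tokens=40, do_sample=False)[0]["generated_text"])
```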

Updated 5/28/2024

starcoder2-3b

bigcode

Total Score

112

The starcoder2-3b model is a 3 billion parameter AI model developed by the BigCode project. It is trained on 17 programming languages using a large dataset of source code from GitHub, Arxiv, and Wikipedia. The model uses advanced techniques like Grouped Query Attention, a large context window of 16,384 tokens, and a sliding window attention of 4,096 tokens. It was trained using the Fill-in-the-Middle objective on over 3 trillion tokens. This makes it a capable code generation model, able to produce coherent and syntactically correct code snippets given some context.

The starcoder2-3b model is part of the StarCoder2 family of models, which also includes the larger starcoder2-7b and starcoder2-15b models. These models build upon the original StarCoder model, which was trained on a smaller dataset of 80+ programming languages. The StarCoder2 models represent the next generation of the BigCode project's AI models for code generation.

Model inputs and outputs

Inputs

- Text prompts containing context or partial code snippets

Outputs

- Continuation of the input text, generating new code based on the provided context
- The model can also be used for "Fill-in-the-Middle" tasks, where it is given a prefix and suffix and asked to generate the middle portion

Capabilities

The starcoder2-3b model is capable of generating coherent and syntactically correct code in 17 different programming languages, including popular ones like Python, Java, and JavaScript. It can continue code snippets, fill in missing parts, and even generate code from scratch given some context.

For example, if given the prompt "def print_hello_world():", the model can generate a complete function definition:

    def print_hello_world():
        print('Hello, world!')

Or, given the prefix "def fib(n):" and the suffix "else: return fib(n - 2) + fib(n - 1)", the model can fill in the missing middle part:

    def fib(n):
        if n <= 1:
            return n
        else:
            return fib(n - 2) + fib(n - 1)

What can I use it for?

The starcoder2-3b model can be used for a variety of code generation and automation tasks. Some potential use cases include:

- Generating boilerplate code or code templates
- Expanding partial code snippets
- Assisting with programming tasks by generating suggested completions
- Prototyping new software features or applications
- Enabling more efficient code reuse and collaboration

The model is particularly well-suited for tasks that require generating code in multiple programming languages, as it has been trained on a diverse set of languages.

Things to try

One interesting thing to try with the starcoder2-3b model is its "Fill-in-the-Middle" capability. By providing a prefix and suffix, the model can generate the middle portion of a code snippet. This can be useful for tasks like expanding on partially completed code or generating variations on existing code.

Another thing to explore is the model's ability to generate code in different programming languages. Try providing prompts in various languages and see how the model performs. You may find it generates more natural and idiomatic code in some languages compared to others.

Finally, consider fine-tuning the model on your own domain-specific data or tasks. The BigCode project provides a script for fine-tuning the StarCoder2 models, which could allow you to adapt the model to your particular needs and use cases. A rough, unofficial alternative is sketched below.
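As that rough alternative to the official fine-tuning script, a parameter-efficient LoRA run with the peft library might look like the sketch below. The target module names, hyperparameters, and toy dataset are assumptions to verify against the actual StarCoder2 architecture and your own data:

```python
# Sketch: unofficial LoRA fine-tuning of starcoder2-3b with peft + Trainer.
# Module names ("q_proj", "v_proj") and all hyperparameters are assumptions; verify them.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

checkpoint = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(checkpoint)

lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Toy corpus: replace with your own domain-specific code snippets.
snippets = ["def add(a, b):\n    return a + b\n"]
ds = Dataset.from_dict({"text": snippets}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="starcoder2-3b-lora",
                           per_device_train_batch_size=1,
                           num_train_epochs=1,
                           report_to="none"),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```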

Updated 5/28/2024

👀

starpii

bigcode

Total Score

104

The starpii model is a Named Entity Recognition (NER) model trained to detect Personal Identifiable Information (PII) in code datasets. It was fine-tuned by bigcode on a PII dataset they annotated, which is available with gated access. The model was initially trained on a pseudo-labeled dataset to enhance its performance on rare PII entities like keys. The model fine-tuned on the annotated dataset can detect six target classes: Names, Emails, Keys, Passwords, IP addresses, and Usernames. It uses the bigcode-encoder as its base encoder model, which was pre-trained on 88 programming languages from The Stack dataset.

Model inputs and outputs

Inputs

- Raw text containing code snippets or documents

Outputs

- Annotated text with PII entities highlighted and classified into one of the six target classes

Capabilities

The starpii model demonstrates strong performance in detecting various types of PII entities within code, including rare ones like keys and passwords. This can be useful for privacy-preserving applications that need to automatically identify and redact sensitive information.

What can I use it for?

The starpii model can be applied to a variety of use cases where identifying PII in code is important, such as:

- Anonymizing code datasets before sharing or publishing
- Detecting sensitive information in internal code repositories
- Regulatory compliance by finding PII in financial or legal documents

Things to try

One interesting aspect of the starpii model is its use of a pseudo-labeled dataset for initial training. This technique can be helpful for improving model performance on rare entities that are difficult to obtain labeled data for. You could experiment with applying similar approaches to other domain-specific NER tasks.
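A hedged sketch of running the model as a standard Hugging Face token-classification pipeline (assuming the gated checkpoint loads with the stock pipeline) could look like this:

```python
# Sketch: flagging PII in a code snippet with starpii via the token-classification pipeline.
# Assumes access to the gated checkpoint; aggregation_strategy="simple" merges
# sub-word tokens into whole entity spans.
from transformers import pipeline

pii_detector = pipeline(
    "token-classification",
    model="bigcode/starpii",
    aggregation_strategy="simple",
)

snippet = 'SMTP_USER = "jane.doe@example.com"\nSMTP_PASSWORD = "hunter2"\n'
for entity in pii_detector(snippet):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```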

Updated 4/29/2024

🛠️

starcoder2-15b-instruct-v0.1

bigcode

Total Score

90

starcoder2-15b-instruct-v0.1 is the very first entirely self-aligned code Large Language Model (LLM) trained with a fully permissive and transparent pipeline. It was developed by bigcode, an organization focused on building open-source AI models. The model was trained using an open-source pipeline that generates thousands of instruction-response pairs, which are then used to fine-tune the base starcoder2-15b model without any human annotations or distilled data from huge proprietary LLMs.

This self-alignment approach contrasts with the typical instruction-tuning process, which often relies on distilled data from large, closed-source models. By using a fully transparent and permissive pipeline, starcoder2-15b-instruct-v0.1 aims to provide a more ethical and accountable code generation model.

The starcoder2-15b model, which serves as the base for this instructed version, is a 15B parameter model trained on over 600 programming languages from The Stack v2 dataset. It uses advanced transformer techniques like Grouped Query Attention and a 16,384-token context window with sliding window attention to enable efficient and high-quality code generation.

Model inputs and outputs

Inputs

- **Instruction**: A natural language description of a task or request, such as "Write a function that computes the square root."

Outputs

- **Generated code**: The model's attempt to generate code that fulfills the given instruction, such as a function that computes the square root.

Capabilities

starcoder2-15b-instruct-v0.1 is designed to respond to coding-related instructions in a single turn. It can generate code snippets across a wide range of programming languages to help with tasks like algorithm implementation, data processing, and software development. However, the generated code is not guaranteed to be correct or efficient, as the model may introduce bugs or suboptimal solutions.

What can I use it for?

You can use starcoder2-15b-instruct-v0.1 to help with a variety of coding-related tasks, such as:

- Prototyping new algorithms or features
- Automating repetitive coding tasks
- Generating boilerplate code or scaffolding
- Exploring different programming approaches to a problem

While the model can be a useful tool, it's important to review and test any generated code before using it in a production environment. The search index provided by the BigCode project can help you identify the origin of generated code and ensure proper attribution.

Things to try

One interesting aspect of starcoder2-15b-instruct-v0.1 is its ability to generate code in a fully self-aligned and transparent manner. This approach aims to address some of the ethical and accountability concerns surrounding large language models trained on proprietary data. You could try providing the model with more complex or open-ended instructions to see how it responds, or experiment with the model's ability to generate code in different programming languages. Additionally, you could explore using the model in conjunction with other tools, such as unit testing frameworks, to validate the correctness of the generated code.
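A sketch of single-turn usage is below. It assumes the tokenizer ships a chat template; if it does not, substitute the exact prompt format documented on the model card. Hardware assumptions are the same as for the base 15B model.

```python
# Sketch: single-turn instruction prompting with starcoder2-15b-instruct-v0.1.
# Assumes the tokenizer provides a chat template; verify against the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-15b-instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Write a function that computes the square root."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```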

Updated 6/4/2024

🔎

tiny_starcoder_py

bigcode

Total Score

69

The tiny_starcoder_py model is a 164M parameter AI model with the same architecture as the larger StarCoder model. It was trained on Python data from the StarCoderData dataset for around 6 epochs, totaling 100 billion training tokens. Like StarCoder, it uses Multi-Query Attention and the Fill-in-the-Middle training objective.

Model inputs and outputs

The tiny_starcoder_py model is a text generation model, taking in a prompt as input and generating new text as output. It is designed to assist with tasks like Assisted Generation, where it can help a human author generate or complete code snippets.

Inputs

- Arbitrary text prompts, such as the start of a Python function

Outputs

- Continuations of the input text, generating new code or text

Capabilities

The tiny_starcoder_py model is capable of generating Python code to continue or complete a given prompt. It can handle a wide variety of Python programming constructs, from simple function definitions to more complex control flow and data structures. However, the generated code may not always be optimal or bug-free, as the model was trained on a large but potentially noisy dataset of Python code from the internet.

What can I use it for?

The tiny_starcoder_py model can be useful for tasks like:

- Assisted code generation: Provide a starting prompt and let the model generate the rest of the function or code snippet.
- Code completion: Use the model to suggest continuations or completions as you're writing code.
- Prototyping and experimentation: Quickly generate sample code to test ideas or explore new approaches.

However, for pure code completion tasks, the maintainer recommends using their larger StarCoder or StarCoderBase models instead.

Things to try

One interesting aspect of the tiny_starcoder_py model is its ability to perform "fill-in-the-middle" generation. By using special tokens to identify the prefix, middle, and suffix of the input, you can prompt the model to generate the missing middle part of a code snippet. This can be a useful technique for exploring different solutions or variations on a programming problem.
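Since the card positions tiny_starcoder_py for Assisted Generation, one hedged experiment is to use it as the small draft model for speculative decoding with a larger StarCoder checkpoint via the assistant_model argument of generate(). This assumes the two checkpoints share a tokenizer and vocabulary, which assisted generation requires, so verify that before relying on the sketch:

```python
# Sketch: tiny_starcoder_py as the draft model for assisted generation (speculative decoding).
# Assumes both checkpoints share the same tokenizer/vocabulary -- verify before use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

main_ckpt = "bigcode/starcoder"           # large target model (gated)
draft_ckpt = "bigcode/tiny_starcoder_py"  # small draft model

tokenizer = AutoTokenizer.from_pretrained(main_ckpt)
model = AutoModelForCausalLM.from_pretrained(main_ckpt, torch_dtype=torch.bfloat16,
                                             device_map="auto")
assistant = AutoModelForCausalLM.from_pretrained(draft_ckpt).to(model.device)

inputs = tokenizer("def remove_duplicates(items):\n", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, assistant_model=assistant,
                         max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```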

Updated 5/28/2024