Vidore

Models by this creator

colpali

vidore

Total Score: 172

colpali is a model architecture and training strategy based on Vision Language Models (VLMs) for efficiently indexing documents from their visual features. It extends the PaliGemma-3B model to generate ColBERT-style multi-vector representations of text and images. Developed by vidore, ColPali was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models.

Model inputs and outputs

Inputs

- Images and text documents

Outputs

- A ranked list of relevant documents for a given query
- Efficient document retrieval using ColBERT-style multi-vector representations

Capabilities

ColPali is designed to enable fast and accurate retrieval of documents based on their visual and textual content. By generating ColBERT-style multi-vector representations, it matches queries to relevant passages efficiently and outperforms the earlier BiPali variant, which pools each input into a single vector instead of keeping token-level representations.

What can I use it for?

The ColPali model can be used for document retrieval and search tasks such as finding relevant research papers, product information, or news articles for a user's query. Because it leverages both visual and textual content, it is particularly useful for mixed-media tasks, such as retrieving relevant documents for a given image.

Things to try

One interesting aspect of ColPali is its use of the off-the-shelf PaliGemma-3B model as a starting point. By finetuning this model and incorporating ColBERT-style multi-vector representations, the researchers built a powerful retrieval system. This suggests that similar techniques could be applied to other large language models to create specialized retrieval systems for different domains or use cases. The late-interaction scoring that underpins this approach is sketched below.
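The ColBERT-style scoring mentioned above is late interaction: every query token and every image patch keeps its own embedding, and a document's score is the sum, over query tokens, of the best match against any of the document's vectors (MaxSim). The sketch below is a minimal, self-contained illustration of that scoring in PyTorch on random embeddings; the shapes (16 query vectors, 1024 patch vectors per page, 128 dimensions) are illustrative assumptions, not values taken from the ColPali release.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late-interaction (MaxSim) score.

    query_emb: (num_query_tokens, dim) multi-vector query representation
    doc_emb:   (num_doc_vectors, dim)  multi-vector document/page representation
    """
    q = torch.nn.functional.normalize(query_emb, dim=-1)
    d = torch.nn.functional.normalize(doc_emb, dim=-1)
    sim = q @ d.T                       # (num_query_tokens, num_doc_vectors)
    # For each query token, keep its best-matching document vector, then sum.
    return sim.max(dim=-1).values.sum()

# Illustrative shapes: 16 query-token vectors, 1024 patch vectors per page, dim 128.
torch.manual_seed(0)
query = torch.randn(16, 128)
pages = [torch.randn(1024, 128) for _ in range(3)]   # three candidate pages

scores = torch.stack([maxsim_score(query, p) for p in pages])
ranking = scores.argsort(descending=True)             # ranked list of documents
print(scores.tolist(), ranking.tolist())
```

Because each page is scored independently, the page embeddings can be computed once at indexing time and only the query has to be embedded at search time.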

Updated 8/7/2024

colqwen2-v0.1

vidore

Total Score: 75

colqwen2-v0.1 is a model based on the ColPali architecture and training strategy, which is designed to efficiently index documents from their visual features. It extends the Qwen2-VL-2B model to generate ColBERT-style multi-vector representations of text and images. This version is the untrained base version, used to guarantee deterministic projection-layer initialization. The model was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in the accompanying repository. It was developed by the team at vidore.

Model inputs and outputs

Inputs

- **Images**: The model accepts images at their native (dynamic) resolutions and does not resize them, preserving their aspect ratio.
- **Text**: The model can take text inputs, such as queries, to be used alongside the image inputs.

Outputs

- Multi-vector representations of the text and images, which can be used for efficient document retrieval.

Capabilities

colqwen2-v0.1 is designed to efficiently index documents from their visual features. It generates multi-vector representations of text and images using the ColBERT strategy, which improves performance over previous models such as BiPali.

What can I use it for?

The colqwen2-v0.1 model can be used for a variety of document retrieval tasks, such as searching for relevant documents based on visual features. It could be particularly useful for applications that deal with large document repositories, such as academic paper search engines or enterprise knowledge-management systems.

Things to try

One interesting aspect of colqwen2-v0.1 is its ability to handle dynamic image resolutions without resizing, which preserves the original aspect ratio and visual information of the documents being indexed. You could experiment with different image resolutions and observe how the model's performance changes. You could also try document types beyond PDFs, such as scanned images or screenshots, to see how the model generalizes to different visual input formats. A usage sketch follows below.
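As a starting point for the experiments suggested above, the sketch below shows how page images and queries could be embedded and scored with this checkpoint. It assumes the colpali-engine Python package and the ColQwen2 / ColQwen2Processor classes described in its documentation; exact class names, method signatures, and device settings may vary between versions, so treat this as a sketch to check against the library you have installed rather than a definitive recipe.

```python
import torch
from PIL import Image
from colpali_engine.models import ColQwen2, ColQwen2Processor

model_name = "vidore/colqwen2-v0.1"
model = ColQwen2.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
).eval()
processor = ColQwen2Processor.from_pretrained(model_name)

# Placeholder inputs: in practice these would be rendered document pages.
# Note the different resolutions; the model does not resize them.
images = [
    Image.new("RGB", (640, 896), color="white"),
    Image.new("RGB", (448, 448), color="black"),
]
queries = ["What does the quarterly revenue chart show?"]

batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward passes produce multi-vector embeddings for pages and queries.
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Late-interaction (MaxSim) scoring: one score per (query, page) pair.
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
print(scores)
```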

Updated 10/4/2024