SketchGPT: Autoregressive Modeling for Sketch Generation and Recognition

Read original: arXiv:2405.03099 - Published 5/7/2024 by Adarsh Tiwari, Sanket Biswas, Josep Lladós

Overview

This paper presents SketchGPT, an autoregressive model for generating and recognizing freehand sketches.
The model is trained with a next-token prediction objective: given the earlier part of a sketch, it predicts what comes next, which lets it complete partially drawn sketches or generate new ones from scratch.
To streamline the input for autoregressive modeling, SketchGPT maps each sketch to a simplified sequence of abstract primitives. It also recognizes hand-drawn sketches by classifying them into categories.

Plain English Explanation

SketchGPT is an AI system that can generate and recognize freehand sketches. The key idea is to train the model to predict the next stroke in a sketch based on the previous strokes. Because generation is just repeated next-stroke prediction, the model can complete partially drawn sketches or generate entirely new ones from scratch.

Before modeling, each sketch is converted into a sequence of basic "sketch primitives": simple shapes and lines that can be combined to represent more complex drawings. This simplified representation makes the sequences easier to model. The model can also recognize what a freehand sketch depicts, so it interprets sketches rather than only generating them.

Overall, SketchGPT demonstrates how autoregressive models, which predict the next element in a sequence, can be applied to the domain of freehand sketching. This opens up possibilities for more intuitive and natural sketch-based interfaces and creative tools powered by AI.

Technical Explanation

SketchGPT is an autoregressive model that is trained to generate and recognize freehand sketches. The model takes in a sequence of sketch strokes as input and predicts the next stroke to be added. By iteratively predicting the next stroke, the model can either complete a partially drawn sketch or generate a new sketch from scratch.
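The iterative prediction loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the token names in VOCAB and the function toy_next_token_probs are hypothetical stand-ins for a trained model, and none of the names come from the paper or its code.

```python
import random

# Hypothetical vocabulary: a few primitive tokens plus an end-of-sketch marker.
VOCAB = ["line_h", "line_v", "diag_up", "diag_down", "<eos>"]

def toy_next_token_probs(prefix):
    """Stand-in for a trained model: one probability per token in VOCAB.

    The toy rule discourages repeating the last token and makes <eos>
    more likely as the prefix grows. A real model would compute these
    probabilities with a neural network.
    """
    weights = []
    for tok in VOCAB:
        if tok == "<eos>":
            weights.append(0.1 * len(prefix))
        elif prefix and tok == prefix[-1]:
            weights.append(0.2)
        else:
            weights.append(1.0)
    total = sum(weights)
    return [w / total for w in weights]

def generate(prefix, max_len=16, seed=0):
    """Extend `prefix` one token at a time until <eos> or `max_len` tokens."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    seq = list(prefix)
    while len(seq) < max_len:
        probs = toy_next_token_probs(seq)
        tok = rng.choices(VOCAB, weights=probs, k=1)[0]
        if tok == "<eos>":
            break
        seq.append(tok)
    return seq

# Completion starts from a partial sketch; generation starts from an empty one.
completed = generate(["line_h", "line_v"])
from_scratch = generate([])
```

Completion and generation share the same loop. The only difference is whether the starting prefix is a partial sketch or empty.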

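Rather than modeling continuous stroke data directly, SketchGPT maps each sketch to a sequence drawn from a fixed set of "sketch primitives": basic shapes and lines that can be combined to represent more complex drawings. According to the paper's abstract, this representation streamlines the input for autoregressive modeling and makes training smoother. The model is also used to recognize and classify hand-drawn sketches, not just generate them.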
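One way to picture a mapping from pen strokes to a fixed primitive set is to quantize each stroke segment by its direction. The sketch below assumes, purely for illustration, that the primitives are 8 evenly spaced directions and that the y axis points up. The summary does not specify the actual primitive set, and the function names here are hypothetical.

```python
import math

NUM_PRIMITIVES = 8  # assumed number of direction bins, for illustration only

def segment_to_primitive(dx, dy, num_primitives=NUM_PRIMITIVES):
    """Return the index of the direction bin closest to the vector (dx, dy)."""
    angle = math.atan2(dy, dx) % (2 * math.pi)  # angle in [0, 2*pi)
    step = 2 * math.pi / num_primitives
    # Round to the nearest bin and wrap angles just below 2*pi back to bin 0.
    return int(round(angle / step)) % num_primitives

def sketch_to_tokens(points):
    """Convert a polyline of (x, y) points into a list of primitive indices."""
    tokens = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        if dx == 0 and dy == 0:
            continue  # skip zero-length segments, which have no direction
        tokens.append(segment_to_primitive(dx, dy))
    return tokens

# A unit square drawn counter-clockwise: right, up, left, down.
square = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
print(sketch_to_tokens(square))  # [0, 2, 4, 6]
```

The resulting integer sequence is a discrete token stream of the kind an autoregressive model can consume. This toy version discards segment length, which a practical representation would need to keep in some form.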

The key technical components of SketchGPT include:

An autoregressive transformer architecture that predicts the next stroke given the previous strokes
A stroke-to-primitive mapping module that associates sketches with a set of predefined primitives
Training on large datasets of freehand sketches to enable both generation and recognition capabilities
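The components listed above are tied together by a next-token prediction objective. A minimal sketch of that objective follows, with a hypothetical uniform_model standing in for the transformer and an assumed vocabulary of 8 tokens. It shows only how training pairs and the average cross-entropy loss are formed, not the authors' training code.

```python
import math

VOCAB_SIZE = 8  # assumed vocabulary size, for illustration only

def next_token_pairs(tokens):
    """Build (context, target) pairs: each prefix predicts the token after it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def cross_entropy(probs, target):
    """Negative log-probability that the model assigned to the true token."""
    return -math.log(probs[target])

def sequence_loss(tokens, model):
    """Average next-token loss over one sequence of at least two tokens."""
    pairs = next_token_pairs(tokens)
    return sum(cross_entropy(model(ctx), tgt) for ctx, tgt in pairs) / len(pairs)

def uniform_model(context):
    """Hypothetical untrained model: every token is equally likely."""
    return [1.0 / VOCAB_SIZE] * VOCAB_SIZE

loss = sequence_loss([0, 2, 4, 6], uniform_model)
```

For a uniform model the loss equals log(VOCAB_SIZE), which is a useful baseline: a trained model should score lower on held-out sketches.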

The authors evaluate SketchGPT on sketch generation, completion, and recognition, using qualitative and quantitative comparisons with existing state-of-the-art methods as well as a human evaluation study. They report competitive performance, which suggests that autoregressive models are a viable basis for sketch-based applications.

Critical Analysis

One limitation of the SketchGPT approach is that it relies on a fixed set of sketch primitives to represent drawings. While this makes freehand sketches easier to model and interpret, it may not capture the full richness and diversity of human drawing. An interesting area for future research would be to explore more open-ended approaches to sketch representation and recognition that can handle a wider range of drawing styles and complexities.

Additionally, the paper does not delve deeply into the potential biases or limitations of the training data used for SketchGPT. As with many machine learning models, the performance and behavior of SketchGPT are likely influenced by the characteristics of the sketches used during training. Further investigation into dataset bias and its impact on the model's capabilities would be a valuable addition to this line of research.

Overall, SketchGPT represents an exciting advancement in the field of sketch-based AI, demonstrating the power of autoregressive modeling for both generation and recognition tasks. As the authors note, this work opens up new possibilities for more natural and intuitive sketch-based interfaces and creative tools. Continued research and development in this area could lead to significant advancements in how humans interact with and express themselves through digital media.

Conclusion

SketchGPT is an autoregressive model that can generate and recognize freehand sketches. Trained to predict the next stroke in a sketch, it can complete partially drawn sketches or generate new ones from scratch. It represents sketches as sequences of predefined primitives, and it can also recognize and classify them.

This work demonstrates the potential of autoregressive models, which are trained to predict the next element in a sequence, for applications beyond traditional text-based tasks. The ability to generate and understand freehand sketches opens up new possibilities for more intuitive and natural user interfaces, creative tools, and sketch-based applications powered by AI. As the field of sketch-based AI continues to evolve, SketchGPT represents an exciting step forward in bridging the gap between human and machine drawing and expression.

Related Papers

SketchGPT: Autoregressive Modeling for Sketch Generation and Recognition

Adarsh Tiwari, Sanket Biswas, Josep Lladós

We present SketchGPT, a flexible framework that employs a sequence-to-sequence autoregressive model for sketch generation, and completion, and an interpretation case study for sketch recognition. By mapping complex sketches into simplified sequences of abstract primitives, our approach significantly streamlines the input for autoregressive modeling. SketchGPT leverages the next token prediction objective strategy to understand sketch patterns, facilitating the creation and completion of drawings and also categorizing them accurately. This proposed sketch representation strategy aids in overcoming existing challenges of autoregressive modeling for continuous stroke data, enabling smoother model training and competitive performance. Our findings exhibit SketchGPT's capability to generate a diverse variety of drawings by adding both qualitative and quantitative comparisons with existing state-of-the-art, along with a comprehensive human evaluation study. The code and pretrained models will be released on our official GitHub.

5/7/2024

σ-GPTs: A New Approach to Autoregressive Models

Arnaud Pannatier, Evann Courdier, François Fleuret

Autoregressive models, such as the GPT family, use a fixed order, usually left-to-right, to generate sequences. However, this is not a necessity. In this paper, we challenge this assumption and show that by simply adding a positional encoding for the output, this order can be modulated on-the-fly per-sample which offers key advantageous properties. It allows for the sampling of and conditioning on arbitrary subsets of tokens, and it also allows sampling in one shot multiple tokens dynamically according to a rejection strategy, leading to a sub-linear number of model evaluations. We evaluate our method across various domains, including language modeling, path-solving, and aircraft vertical rate prediction, decreasing the number of steps required for generation by an order of magnitude.

7/2/2024

A new approach for encoding code and assisting code understanding

Mengdan Fan, Wei Zhang, Haiyan Zhao, Zhi Jin

Some companies (e.g., Microsoft Research and Google DeepMind) have identified limitations of the GPT autoregressive next-word prediction paradigm, manifested in the models' lack of planning, working memory, backtracking, and reasoning skills. GPTs rely on a local and greedy process of generating the next word, without a global understanding of the task or the output. We have confirmed the above limitations through specialized empirical studies of code comprehension. Although GPT4 is good at producing fluent and coherent text, it cannot handle complex logic or generate new code that has not been seen, and it relies too much on the formatting of the prompt to generate the correct code. We propose a new paradigm for code understanding that goes beyond next-word prediction, inspired by the successful application of diffusion techniques to image generation (DALL-E 2, Sora) and protein structure generation (AlphaFold3), which have no autoregressive constraints. Instead of encoding the code in a form that mimics natural language, we encode the code as a heterogeneous image paradigm with a memory of global information that mimics both images and protein structures. We then refer to Sora's CLIP upstream text-to-image encoder model to design a text-to-code encoder model that can be applied to various downstream code understanding tasks. The model learns a global understanding of code under the new heterogeneous image paradigm, connects the encoding spaces of text and code, and encodes input text into the code vector most similar to it. Using self-supervised comparative learning on 456,360 text-code pairs, the model achieved zero-shot prediction on new data. This work is the basis for future work on code generation using diffusion techniques under a new paradigm to avoid autoregressive limitations.

8/2/2024

G3PT: Unleash the power of Autoregressive Modeling in 3D Generation via Cross-scale Querying Transformer

Jinzhi Zhang, Feng Xiong, Mu Xu

Autoregressive transformers have revolutionized generative models in language processing and shown substantial promise in image and video generation. However, these models face significant challenges when extended to 3D generation tasks due to their reliance on next-token prediction to learn token sequences, which is incompatible with the unordered nature of 3D data. Instead of imposing an artificial order on 3D data, in this paper, we introduce G3PT, a scalable coarse-to-fine 3D generative model utilizing a cross-scale querying transformer. The key is to map point-based 3D data into discrete tokens with different levels of detail, naturally establishing a sequential relationship between different levels suitable for autoregressive modeling. Additionally, the cross-scale querying transformer connects tokens globally across different levels of detail without requiring an ordered sequence. Benefiting from this approach, G3PT features a versatile 3D generation pipeline that effortlessly supports diverse conditional structures, enabling the generation of 3D shapes from various types of conditions. Extensive experiments demonstrate that G3PT achieves superior generation quality and generalization ability compared to previous 3D generation methods. Most importantly, for the first time in 3D generation, scaling up G3PT reveals distinct power-law scaling behaviors.

9/11/2024