tokenflow

Maintainer: cjwbw

Total Score: 1

Last updated 9/18/2024

  • Run this model: Run on Replicate
  • API spec: View on Replicate
  • Github link: View on Github
  • Paper link: View on Arxiv


Model overview

TokenFlow is a framework that enables consistent video editing using a pre-trained text-to-image diffusion model, without any further training or fine-tuning. It builds on the key observation that consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space. The method propagates diffusion features based on inter-frame correspondences to preserve the spatial layout and dynamics of the input video while adhering to the target text prompt. This approach contrasts with similar models like consisti2v, which focuses on enhancing visual consistency for image-to-video (I2V) generation, and stable-video-diffusion, which generates short video clips from still images.

Model inputs and outputs

TokenFlow is designed for structure-preserving video editing. The model takes in a source video and a target text prompt, and generates a new video that adheres to the prompt while preserving the spatial layout and dynamics of the input; a sketch of a typical API call follows the lists below.

Inputs

  • Video: The input video to be edited
  • Inversion Prompt: A text description of the input video (optional)
  • Diffusion Prompt: A text description of the desired output video
  • Negative Diffusion Prompt: Words or phrases to avoid in the output video

Outputs

  • Edited Video: The output video that reflects the target text prompt while maintaining the consistency of the input video
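
For programmatic use, the model can be invoked through Replicate's API. Below is a minimal sketch using the official Python client; the input field names (video, inversion_prompt, prompt, negative_prompt) are assumptions based on the inputs listed above, so confirm them against the API spec linked at the top of this page.

```python
# Minimal sketch using the Replicate Python client (pip install replicate).
# Requires REPLICATE_API_TOKEN to be set in the environment.
# Input field names are assumed from the list above; verify against the API spec.
import replicate

output = replicate.run(
    "cjwbw/tokenflow",
    input={
        "video": open("input.mp4", "rb"),                  # source video to edit
        "inversion_prompt": "a woman jogging on a beach",  # optional description of the input video
        "prompt": "a bronze statue jogging on a beach",    # target edit prompt
        "negative_prompt": "blurry, distorted",            # content to avoid in the output
    },
)
print(output)  # typically a URL (or file handle) for the edited video
```

In practice, the inversion prompt is meant to describe the source clip, so keeping it close to the actual content of the video is a reasonable starting point.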

Capabilities

TokenFlow leverages a pre-trained text-to-image diffusion model to enable text-driven video editing without additional training. It can be used to make localized and global edits that change the texture of existing objects or augment the scene with semi-transparent effects (e.g., smoke, fire, snow).

What can I use it for?

The TokenFlow framework can be useful for a variety of video editing applications, such as:

  • Video Augmentation: Enhancing existing videos by adding new elements like visual effects or changing the appearance of objects
  • Video Retouching: Improving the quality and consistency of videos by addressing issues like lighting, texture, or composition
  • Video Personalization: Customizing videos to match a specific style or theme by aligning the content with a target text prompt

Things to try

One key aspect of TokenFlow is its ability to preserve the spatial layout and dynamics of the input video while editing. This can be particularly useful for creating seamless and natural-looking video edits. Experiment with a variety of text prompts to see how the model can transform the visual elements of a video while maintaining the overall structure and flow.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models


clip-guided-diffusion

cjwbw

Total Score: 4

clip-guided-diffusion is a Cog implementation of the CLIP Guided Diffusion model, originally developed by Katherine Crowson. This model leverages the CLIP (Contrastive Language-Image Pre-training) technique to guide the image generation process, allowing for more semantically meaningful and visually coherent outputs compared to traditional diffusion models. Unlike the Stable Diffusion model, which is trained on a large and diverse dataset, clip-guided-diffusion is focused on generating images from text prompts in a more targeted and controlled manner.

Model inputs and outputs

The clip-guided-diffusion model takes a text prompt as input and generates a set of images as output. The text prompt can be anything from a simple description to a more complex, imaginative scenario. The model then uses the CLIP technique to guide the diffusion process, resulting in images that closely match the semantic content of the input prompt.

Inputs

  • Prompt: The text prompt that describes the desired image.
  • Timesteps: The number of diffusion steps to use during the image generation process.
  • Display Frequency: The frequency at which the intermediate image outputs should be displayed.

Outputs

  • Array of Image URLs: The generated images, each represented as a URL.

Capabilities

The clip-guided-diffusion model is capable of generating a wide range of images based on text prompts, from realistic scenes to more abstract and imaginative compositions. Unlike the more general-purpose Stable Diffusion model, clip-guided-diffusion is designed to produce images that are more closely aligned with the semantic content of the input prompt, resulting in a more targeted and coherent output.

What can I use it for?

The clip-guided-diffusion model can be used for a variety of applications, including:

  • Content Generation: Create unique, custom images to use in marketing materials, social media posts, or other visual content.
  • Prototyping and Visualization: Quickly generate visual concepts and ideas based on textual descriptions, which can be useful in fields like design, product development, and architecture.
  • Creative Exploration: Experiment with different text prompts to generate unexpected and imaginative images, opening up new creative possibilities.

Things to try

One interesting aspect of the clip-guided-diffusion model is its ability to generate images that capture the nuanced semantics of the input prompt. Try experimenting with prompts that contain specific details or evocative language, and observe how the model translates these textual descriptions into visually compelling outputs. Additionally, you can explore the model's capabilities by comparing its results to those of other diffusion-based models, such as Stable Diffusion or DiffusionCLIP, to understand the unique strengths and characteristics of the clip-guided-diffusion approach.
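
To make the input/output description above concrete, here is a hedged sketch of a call via the Replicate Python client. The field names prompt, timesteps, and display_frequency, and the treatment of the output as a plain list of URLs, are assumptions inferred from the summary rather than the confirmed schema.

```python
# Hypothetical call; field names and output handling are assumptions, not the confirmed schema.
import replicate

images = replicate.run(
    "cjwbw/clip-guided-diffusion",
    input={
        "prompt": "a lighthouse on a cliff at dusk, oil painting",
        "timesteps": 500,          # number of diffusion steps
        "display_frequency": 50,   # how often intermediate images are emitted
    },
)

# The output is described as an array of image URLs, so iterate over it.
for i, url in enumerate(images):
    print(f"image {i}: {url}")
```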

Read more



lavie

cjwbw

Total Score: 749

LaVie is a high-quality video generation framework developed by cjwbw, the same creator behind similar models like tokenflow, video-retalking, kandinskyvideo, and videocrafter. LaVie uses a cascaded latent diffusion approach to generate high-quality videos from text prompts, with the ability to perform video interpolation and super-resolution.

Model inputs and outputs

LaVie takes in a text prompt and various configuration options to generate a high-quality video. The model can produce videos with resolutions up to 1280x2048 and lengths of up to 61 frames.

Inputs

  • Prompt: The text prompt that describes the desired video content.
  • Width/Height: The resolution of the output video.
  • Seed: A random seed value to control the stochastic generation process.
  • Quality: An integer value between 0-10 that controls the overall visual quality of the output.
  • Video FPS: The number of frames per second in the output video.
  • Interpolation: A boolean flag to enable video interpolation for longer videos.
  • Super Resolution: A boolean flag to enable 4x super-resolution of the output video.

Outputs

  • Output Video: A high-quality video file generated from the input prompt and configuration.

Capabilities

LaVie can generate a wide variety of video content, from realistic scenes to fantastical and imaginative scenarios. The model is capable of producing videos with a high level of visual detail and coherence, with natural camera movements and seamless transitions between frames. Some example videos generated by LaVie include:

  • A Corgi walking in a park at sunrise, with an oil painting style
  • A panda taking a selfie in high-quality 2K resolution
  • A polar bear playing a drum kit in the middle of Times Square, in high-resolution 4K

What can I use it for?

LaVie is a powerful tool for content creators, filmmakers, and artists who want to generate high-quality video content quickly and efficiently. The model can be used to create visually stunning promotional videos, short films, or even as a starting point for more complex video projects. Additionally, the ability to generate videos from text prompts opens up new possibilities for interactive storytelling, educational content, and even virtual events. By leveraging the capabilities of LaVie, creators can bring their imaginative visions to life in a way that was previously difficult or time-consuming.

Things to try

One interesting aspect of LaVie is its ability to generate videos with a diverse range of visual styles, from realistic to fantastical. Experiment with prompts that combine realistic elements (e.g., a park, a city street) with more imaginative or surreal components (e.g., a teddy bear walking, a shark swimming in a clear Caribbean ocean) to see the range of outputs the model can produce. Additionally, try using the video interpolation and super-resolution features to create longer, higher-quality videos from your initial text prompts. These advanced capabilities can help bring your video ideas to life in a more polished and visually stunning way.
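
A rough sketch of how the configuration options above might be passed via the Replicate Python client follows; all parameter names (width, height, quality, video_fps, interpolation, super_resolution) are guesses based on the input list and should be checked against the model's API page.

```python
# Sketch of a LaVie call with interpolation and 4x super-resolution enabled.
# Parameter names mirror the input list above but are not verified.
import replicate

video = replicate.run(
    "cjwbw/lavie",
    input={
        "prompt": "a corgi walking in a park at sunrise, oil painting style",
        "width": 512,
        "height": 320,
        "seed": 42,
        "quality": 9,               # 0-10, higher means better visual quality
        "video_fps": 8,
        "interpolation": True,      # interpolate to a longer, smoother clip
        "super_resolution": True,   # upscale the result 4x
    },
)
print(video)  # URL of the generated video
```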

Read more



text2video-zero

cjwbw

Total Score: 40

The text2video-zero model, developed by cjwbw from Picsart AI Research, leverages the power of existing text-to-image synthesis methods, like Stable Diffusion, to enable zero-shot video generation. This means the model can generate videos directly from text prompts without any additional training or fine-tuning. The model is capable of producing temporally consistent videos that closely follow the provided textual guidance. The text2video-zero model is related to other text-guided diffusion models like Clip-Guided Diffusion and TextDiffuser, which explore various techniques for using diffusion models as text-to-image and text-to-video generators.

Model inputs and outputs

Inputs

  • Prompt: The textual description of the desired video content.
  • Model Name: The Stable Diffusion model to use as the base for video generation.
  • Timestep T0 and T1: The range of DDPM steps to perform, controlling the level of variance between frames.
  • Motion Field Strength X and Y: Parameters that control the amount of motion applied to the generated frames.
  • Video Length: The desired duration of the output video.
  • Seed: An optional random seed to ensure reproducibility.

Outputs

  • Video: The generated video file based on the provided prompt and parameters.

Capabilities

The text2video-zero model can generate a wide variety of videos from text prompts, including scenes with animals, people, and fantastical elements. For example, it can produce videos of "a horse galloping on a street", "a panda surfing on a wakeboard", or "an astronaut dancing in outer space". The model is able to capture the movement and dynamics of the described scenes, resulting in temporally consistent and visually compelling videos.

What can I use it for?

The text2video-zero model can be useful for a variety of applications, such as:

  • Generating video content for social media, marketing, or entertainment purposes.
  • Prototyping and visualizing ideas or concepts that can be described in text form.
  • Experimenting with creative video generation and exploring the boundaries of what is possible with AI-powered video synthesis.

Things to try

One interesting aspect of the text2video-zero model is its ability to incorporate additional guidance, such as poses or edges, to further influence the generated video. By providing a reference video or image with canny edges, the model can generate videos that closely follow the visual structure of the guidance, while still adhering to the textual prompt. Another intriguing feature is the model's support for Dreambooth specialization, which allows you to fine-tune the model on a specific visual style or character. This can be used to generate videos that have a distinct artistic or stylistic flair, such as "an astronaut dancing in the style of Van Gogh's Starry Night".
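
The sketch below illustrates how the temporal controls described above (the DDPM timestep range and the motion field strengths) might be set in a call through the Replicate Python client. The exact field names are assumptions, not the verified schema.

```python
# Sketch focusing on the temporal controls; field names are assumptions.
import replicate

video = replicate.run(
    "cjwbw/text2video-zero",
    input={
        "prompt": "a panda surfing on a wakeboard",
        "model_name": "runwayml/stable-diffusion-v1-5",  # base text-to-image checkpoint
        "timestep_t0": 44,               # start of the DDPM step range
        "timestep_t1": 47,               # end of the DDPM step range
        "motion_field_strength_x": 12,   # horizontal motion between frames
        "motion_field_strength_y": 12,   # vertical motion between frames
        "video_length": 8,               # number of frames to generate
        "seed": 0,
    },
)
print(video)
```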

Read more



video-retalking

cjwbw

Total Score: 65

video-retalking is a system developed by researchers at Tencent AI Lab and Xidian University that enables audio-based lip synchronization and expression editing for talking head videos. It builds on prior work like Wav2Lip, PIRenderer, and GFP-GAN to create a pipeline for generating high-quality, lip-synced videos from talking head footage and audio. Unlike models like voicecraft, which focus on speech editing, or tokenflow, which aims for consistent video editing, video-retalking is specifically designed for synchronizing lip movements with audio.

Model inputs and outputs

video-retalking takes two main inputs: a talking head video and an audio file. The model then generates a new video with the facial expressions and lip movements synchronized to the provided audio. This allows users to edit the appearance and emotion of a talking head video while preserving the original audio.

Inputs

  • Face: Input video file of a talking-head.
  • Input Audio: Input audio file to synchronize with the video.

Outputs

  • Output: The generated video with synchronized lip movements and expressions.

Capabilities

video-retalking can generate high-quality, lip-synced videos even in the wild, meaning it can handle real-world footage without the need for extensive pre-processing or manual alignment. The model disentangles the task into three key steps: generating a canonical face expression, synchronizing the lip movements to the audio, and enhancing the photo-realism of the final output.

What can I use it for?

video-retalking can be a powerful tool for content creators, video editors, and anyone looking to edit or enhance talking head videos. Its ability to preserve the original audio while modifying the visual elements opens up possibilities for a wide range of applications, such as:

  • Dubbing or re-voicing videos in different languages
  • Adjusting the emotion or expression of a speaker
  • Repairing or improving the lip sync in existing footage
  • Creating animated avatars or virtual presenters

Things to try

One interesting aspect of video-retalking is its ability to control the expression of the upper face using pre-defined templates like "smile" or "surprise". This allows for more nuanced expression editing beyond just lip sync. Additionally, the model's sequential pipeline means each step can be examined and potentially fine-tuned for specific use cases.
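
Because this model takes two files rather than a text prompt, a call sketch looks slightly different from the other models on this page. The field names face and input_audio follow the input list above but are assumptions; verify them against the model's API spec.

```python
# Sketch: lip-sync a talking-head clip to a new audio track. Field names are assumed.
import replicate

result = replicate.run(
    "cjwbw/video-retalking",
    input={
        "face": open("talking_head.mp4", "rb"),      # input talking-head video
        "input_audio": open("new_voice.wav", "rb"),  # audio to synchronize the lips to
    },
)
print(result)  # URL of the re-synced output video
```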

Read more
