Controlling the World by Sleight of Hand

Read original: arXiv:2408.07147 - Published 8/15/2024 by Sruthi Sudhakar, Ruoshi Liu, Basile Van Hoorick, Carl Vondrick, Richard Zemel

Controlling the World by Sleight of Hand

Overview

This paper provides author guidelines for submitting papers to the European Conference on Computer Vision (ECCV).
It covers important information on formatting, structure, and content requirements for ECCV submissions.
The guidelines aim to ensure a consistent and high-quality submission process for the conference.

Plain English Explanation

The ECCV submission guidelines outline the key requirements for authors preparing papers to be considered for the European Conference on Computer Vision. This includes details on formatting the paper, such as font size, margins, and page limits, as well as the expected structure of the submission, with sections like Introduction, Related Work, and Experiments.

The guidelines also highlight important content that should be included, like a clear statement of the problem being addressed, a description of the proposed approach, and a thorough evaluation of the results. Authors are advised to carefully follow these instructions to maximize the chances of their work being accepted for presentation at the prestigious ECCV conference.

Technical Explanation

The introduction explains the purpose of the author guidelines, which is to ensure a standardized submission process for ECCV. The related work section discusses prior work on diffusion models and their applications, as well as techniques for generating and editing handheld objects.

The guidelines then provide detailed information on paper formatting, including specifications for font size, margins, column width, and page limits. The required structure of ECCV submissions is also outlined, with sections for introduction, related work, method, experiments, and conclusion.

In terms of content, authors are instructed to clearly state the problem being addressed, describe their proposed approach, and present a thorough evaluation of the results. Additional requirements cover things like citing related work, formatting equations and figures, and submitting supplementary material.

The final section emphasizes the importance of carefully following all guidelines to ensure a successful ECCV submission.

Critical Analysis

The author guidelines provide a comprehensive and well-structured set of instructions for ECCV submissions. The clear formatting and content requirements are likely to result in a consistent presentation of research across the conference, which can benefit both authors and reviewers.

However, some of the guidelines may place significant constraints on authors, such as the strict page limits. This could make it challenging for researchers to thoroughly explain their work and provide sufficient details on methodology and results.

Additionally, the guidelines do not address potential issues around ethics, bias, or broader societal implications of the research being presented. These are important considerations that could be incorporated to encourage authors to consider the responsible development and deployment of their computer vision techniques.

Conclusion

The ECCV author guidelines offer a detailed and standardized framework for preparing high-quality submissions to the prestigious computer vision conference. By following these instructions, authors can ensure that their work is presented in a clear and accessible manner, increasing the likelihood of acceptance and impact.

While the guidelines have some limitations in terms of flexibility and scope, they serve an important role in maintaining the quality and consistency of ECCV. Researchers seeking to contribute to the advancement of computer vision should carefully review and adhere to these guidelines when preparing their papers for submission.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Controlling the World by Sleight of Hand

Sruthi Sudhakar, Ruoshi Liu, Basile Van Hoorick, Carl Vondrick, Richard Zemel

Humans naturally build mental models of object interactions and dynamics, allowing them to imagine how their surroundings will change if they take a certain action. While generative models today have shown impressive results on generating/editing images unconditionally or conditioned on text, current methods do not provide the ability to perform object manipulation conditioned on actions, an important tool for world modeling and action planning. Therefore, we propose to learn an action-conditional generative models by learning from unlabeled videos of human hands interacting with objects. The vast quantity of such data on the internet allows for efficient scaling which can enable high-performing action-conditional models. Given an image, and the shape/location of a desired hand interaction, CosHand, synthesizes an image of a future after the interaction has occurred. Experiments show that the resulting model can predict the effects of hand-object interactions well, with strong generalization particularly to translation, stretching, and squeezing interactions of unseen objects in unseen environments. Further, CosHand can be sampled many times to predict multiple possible effects, modeling the uncertainty of forces in the interaction/environment. Finally, method generalizes to different embodiments, including non-human hands, i.e. robot hands, suggesting that generative video models can be powerful models for robotics.

8/15/2024

Hand-Object Interaction Pretraining from Videos

Himanshu Gaurav Singh, Antonio Loquercio, Carmelo Sferrazza, Jane Wu, Haozhi Qi, Pieter Abbeel, Jitendra Malik

We present an approach to learn general robot manipulation priors from 3D hand-object interaction trajectories. We build a framework to use in-the-wild videos to generate sensorimotor robot trajectories. We do so by lifting both the human hand and the manipulated object in a shared 3D space and retargeting human motions to robot actions. Generative modeling on this data gives us a task-agnostic base policy. This policy captures a general yet flexible manipulation prior. We empirically demonstrate that finetuning this policy, with both reinforcement learning (RL) and behavior cloning (BC), enables sample-efficient adaptation to downstream tasks and simultaneously improves robustness and generalizability compared to prior approaches. Qualitative experiments are available at: url{https://hgaurav2k.github.io/hop/}.

9/14/2024

Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, Sean Kirmani

How can robot manipulation policies generalize to novel tasks involving unseen object types and new motions? In this paper, we provide a solution in terms of predicting motion information from web data through human video generation and conditioning a robot policy on the generated video. Instead of attempting to scale robot data collection which is expensive, we show how we can leverage video generation models trained on easily available web data, for enabling generalization. Our approach Gen2Act casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video. To train the policy, we use an order of magnitude less robot interaction data compared to what the video prediction model was trained on. Gen2Act doesn't require fine-tuning the video model at all and we directly use a pre-trained model for generating human videos. Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data. Videos are at https://homangab.github.io/gen2act/

9/25/2024

Dreamitate: Real-World Visuomotor Policy Learning via Video Generation

Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, Carl Vondrick

A key challenge in manipulation is learning a policy that can robustly generalize to diverse visual environments. A promising mechanism for learning robust policies is to leverage video generative models, which are pretrained on large-scale datasets of internet videos. In this paper, we propose a visuomotor policy learning framework that fine-tunes a video diffusion model on human demonstrations of a given task. At test time, we generate an example of an execution of the task conditioned on images of a novel scene, and use this synthesized execution directly to control the robot. Our key insight is that using common tools allows us to effortlessly bridge the embodiment gap between the human hand and the robot manipulator. We evaluate our approach on four tasks of increasing complexity and demonstrate that harnessing internet-scale generative models allows the learned policy to achieve a significantly higher degree of generalization than existing behavior cloning approaches.

6/26/2024