VISION2UI: A Real-World Dataset with Layout for Code Generation from UI Designs

Read original: arXiv:2404.06369 - Published 4/10/2024 by Yi Gui, Zhen Li, Yao Wan, Yemin Shi, Hongyu Zhang, Yi Su, Shaoling Dong, Xing Zhou, Wenbin Jiang

Overview

• This paper introduces a new dataset called Vision2UI, which contains real-world web user interface (UI) designs and their corresponding HTML/CSS code.
• The dataset is designed to help train machine learning models that can automatically translate UI designs into functional web code, a task known as "design-to-code."
• The authors conduct an empirical analysis of existing design-to-code datasets and identify key limitations, which the Vision2UI dataset aims to address.

Plain English Explanation

The paper presents a new dataset called Vision2UI that could help make it easier to turn web design mockups into working web pages. Currently, this process of "design-to-code" is often done manually by web developers, which can be time-consuming and expensive.

The Vision2UI dataset contains real-world examples of web UI designs, along with the HTML and CSS code that was used to build those designs. By training machine learning models on this dataset, the authors hope to enable AI systems that can automatically translate design mockups into functional web code.

This could be helpful for web designers, who often have to work closely with developers to bring their designs to life. An AI-powered "design-to-code" tool could streamline this process and allow designers to be more self-sufficient.

The authors also analyze some of the limitations of existing datasets for this task, such as not having enough real-world examples or not including the full code implementation. The Vision2UI dataset aims to address these shortcomings and provide a more comprehensive resource for training and evaluating design-to-code models.

Technical Explanation

The paper begins with an empirical analysis of existing design-to-code datasets, identifying key limitations such as a lack of real-world examples and incomplete code implementations. The authors then introduce the Vision2UI dataset, which is extracted from real websites in the open-source Common Crawl corpus and pairs each web UI design with the HTML and CSS code used to build it. According to the paper's abstract, the initial release comprises 2,000 parallel samples, with more to follow.

The dataset is structured to include not just the UI screenshots, but also the full DOM structure, CSS styles, and additional metadata such as the viewport size. This allows for a more comprehensive training and evaluation of design-to-code models, compared to existing datasets that may only provide partial information.
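To make the record structure described above concrete, here is a minimal Python sketch of what one sample might look like. The class and field names (`UISample`, `Viewport`, `screenshot_path`, and so on) are assumptions chosen for illustration; they are not the published schema of the dataset.

```python
from dataclasses import dataclass


@dataclass
class Viewport:
    """Viewport size (in pixels) at which the page was rendered."""
    width: int
    height: int


@dataclass
class UISample:
    """One hypothetical design-to-code pair; field names are illustrative."""
    screenshot_path: str  # rendered image of the page (the "design")
    html: str             # DOM structure serialized as HTML
    css: str              # styles that apply to the page
    viewport: Viewport    # metadata recorded at capture time

    def is_complete(self) -> bool:
        """True when the screenshot, HTML, and CSS are all present."""
        return bool(self.screenshot_path and self.html.strip() and self.css.strip())


sample = UISample(
    screenshot_path="shots/0001.png",  # hypothetical path
    html="<main><h1>Hello</h1></main>",
    css="h1 { color: navy; }",
    viewport=Viewport(width=1280, height=720),
)
print(sample.is_complete())
```

Keeping the viewport alongside the markup matters because the same HTML and CSS can render quite differently at different sizes, so a model needs it to relate a screenshot to its code.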

The authors conduct several experiments to analyze the quality and diversity of the Vision2UI dataset. They find that it covers a wide range of web UI styles and layouts, and that the code implementations closely match the visual designs, making it a valuable resource for training and evaluating image-to-code models on design-to-code tasks.
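The paper's abstract (reproduced under Related Papers) describes the dataset as the result of collecting, cleaning, and filtering Common Crawl pages, followed by a neural scorer that retains higher-quality instances. The sketch below shows the general shape of such a pipeline in Python. Everything in it is an assumption for illustration: the cleaning step, the rule thresholds, the score cutoff, and especially `toy_scorer`, which is a trivial stand-in for the authors' trained scorer.

```python
from typing import Callable, Iterable, Iterator


def clean(html: str) -> str:
    """Collapse runs of whitespace; a stand-in for real cleaning steps."""
    return " ".join(html.split())


def passes_rules(html: str, min_chars: int = 40, max_chars: int = 100_000) -> bool:
    """Cheap rule-based filter: plausible length and a <body> element."""
    return min_chars <= len(html) <= max_chars and "<body" in html.lower()


def toy_scorer(html: str) -> float:
    """Hypothetical quality score: share of a few structural tags present."""
    tags = ("<header", "<nav", "<main", "<footer")
    lowered = html.lower()
    return sum(tag in lowered for tag in tags) / len(tags)


def curate(
    pages: Iterable[str],
    scorer: Callable[[str], float],
    threshold: float = 0.5,
) -> Iterator[str]:
    """Yield cleaned pages that pass the rules and score at least `threshold`."""
    for raw in pages:
        html = clean(raw)
        if passes_rules(html) and scorer(html) >= threshold:
            yield html


pages = [
    "<html><body><header>A</header>\n<nav>B</nav>  <main>C</main>"
    "<footer>D</footer></body></html>",
    "<html><body><p>" + "text " * 20 + "</p></body></html>",  # no structure
    "<body></body>",                                          # too short
]
kept = list(curate(pages, toy_scorer))
print(len(kept))
```

Running the cheap rule-based filter before the scorer is a common design choice, since a learned scorer is usually the most expensive step and should see as few obviously bad pages as possible.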

Critical Analysis

The Vision2UI dataset appears to be a well-designed and potentially useful resource for advancing the field of design-to-code automation. The authors have made a concerted effort to address the limitations of existing datasets, and the dataset seems to offer a more comprehensive and realistic set of examples for training AI models.

However, the paper does not delve into some potential caveats or limitations of the dataset. For example, it's unclear how the authors selected and filtered the web UI examples, and whether there are any biases or skews in the types of designs included. Additionally, the paper does not discuss the potential challenges of training models on this dataset, such as handling the diversity of web technologies and design patterns.

Furthermore, the paper does not provide a critical analysis of how large language models or other advanced AI techniques might fare on the design-to-code task, or how the Vision2UI dataset could be used to enhance the robustness of such models in this context.

Conclusion

The Vision2UI dataset introduced in this paper represents a valuable contribution to the field of design-to-code automation. By providing a large, diverse, and comprehensive collection of real-world web UI examples and their corresponding code implementations, the dataset has the potential to significantly advance the development of AI-powered tools that can translate design mockups into functional web pages.

The dataset's emphasis on capturing the full complexity of web UI design, including the DOM structure, CSS styles, and metadata, sets it apart from previous efforts and could lead to more robust and capable design-to-code models. While the paper does not address all potential limitations, the Vision2UI dataset is a promising step forward in automating this important aspect of web development.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

VISION2UI: A Real-World Dataset with Layout for Code Generation from UI Designs

Yi Gui, Zhen Li, Yao Wan, Yemin Shi, Hongyu Zhang, Yi Su, Shaoling Dong, Xing Zhou, Wenbin Jiang

Automatically generating UI code from webpage design visions can significantly alleviate the burden of developers, enabling beginner developers or designers to directly generate Web pages from design diagrams. Currently, prior research has accomplished the objective of generating UI code from rudimentary design visions or sketches through designing deep neural networks. Inspired by the groundbreaking advancements achieved by Multimodal Large Language Models (MLLMs), the automatic generation of UI code from high-fidelity design images is now emerging as a viable possibility. Nevertheless, our investigation reveals that existing MLLMs are hampered by the scarcity of authentic, high-quality, and large-scale datasets, leading to unsatisfactory performance in automated UI code generation. To mitigate this gap, we present a novel dataset, termed VISION2UI, extracted from real-world scenarios, augmented with comprehensive layout information, tailored specifically for finetuning MLLMs in UI code generation. Specifically, this dataset is derived through a series of operations, encompassing collecting, cleaning, and filtering of the open-source Common Crawl dataset. In order to uphold its quality, a neural scorer trained on labeled samples is utilized to refine the data, retaining higher-quality instances. Ultimately, this process yields a dataset comprising 2,000 (Much more is coming soon) parallel samples encompassing design visions and UI code. The dataset is available at https://huggingface.co/datasets/xcodemind/vision2ui.

4/10/2024

Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach

Yuxuan Wan, Chaozheng Wang, Yi Dong, Wenxuan Wang, Shuqing Li, Yintong Huo, Michael R. Lyu

Websites are critical in today's digital world, with over 1.11 billion currently active and approximately 252,000 new sites launched daily. Converting website layout design into functional UI code is a time-consuming yet indispensable step of website development. Manual methods of converting visual designs into functional code present significant challenges, especially for non-experts. To explore automatic design-to-code solutions, we first conduct a motivating study on GPT-4o and identify three types of issues in generating UI code: element omission, element distortion, and element misarrangement. We further reveal that a focus on smaller visual segments can help multimodal large language models (MLLMs) mitigate these failures in the generation process. In this paper, we propose DCGen, a divide-and-conquer-based approach to automate the translation of webpage design to UI code. DCGen starts by dividing screenshots into manageable segments, generating descriptions for each segment, and then reassembling them into complete UI code for the entire screenshot. We conduct extensive testing with a dataset comprised of real-world websites and various MLLMs and demonstrate that DCGen achieves up to a 14% improvement in visual similarity over competing methods. To the best of our knowledge, DCGen is the first segment-aware prompt-based approach for generating UI code directly from screenshots.

6/26/2024
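The divide-and-conquer idea in the DCGen abstract above (split the screenshot into segments, handle each segment separately, then reassemble) can be illustrated with a small Python sketch. This is not the authors' implementation: the toy image format, the blank-row splitting rule, and the `segment_to_html` callback (a stand-in for a call to a multimodal model) are all assumptions, and the split here is a single flat pass rather than anything hierarchical.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values


def split_rows(image: Image, background: int = 255) -> List[Tuple[int, int]]:
    """Return (start, end) row ranges of content separated by blank rows."""
    segments: List[Tuple[int, int]] = []
    start = None
    for y, row in enumerate(image):
        blank = all(pixel == background for pixel in row)
        if not blank and start is None:
            start = y                      # a content segment begins
        elif blank and start is not None:
            segments.append((start, y))    # the segment ends before row y
            start = None
    if start is not None:
        segments.append((start, len(image)))
    return segments


def generate_page(image: Image, segment_to_html: Callable[[Image], str]) -> str:
    """Generate code per segment, then reassemble the parts in page order."""
    parts = [segment_to_html(image[a:b]) for a, b in split_rows(image)]
    return "<body>\n" + "\n".join(parts) + "\n</body>"


def stub_model(segment: Image) -> str:
    """Hypothetical model call; here it only records the segment height."""
    return f'<section data-rows="{len(segment)}"></section>'


image = [
    [255, 255],
    [0, 255],
    [0, 0],
    [255, 255],
    [255, 255],
    [10, 255],
    [255, 255],
]
page = generate_page(image, stub_model)
print(page)
```

In a real system the callback would send each cropped segment to a model, and the reassembly step would also need to restore the segments' relative positions and shared styles.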

UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback

Jason Wu, Eldon Schoop, Alan Leung, Titus Barik, Jeffrey P. Bigham, Jeffrey Nichols

Large language models (LLMs) struggle to consistently generate UI code that compiles and produces visually relevant designs. Existing approaches to improve generation rely on expensive human feedback or distilling a proprietary model. In this paper, we explore the use of automated feedback (compilers and multi-modal models) to guide LLMs to generate high-quality UI code. Our method starts with an existing LLM and iteratively produces improved models by self-generating a large synthetic dataset using an original model, applying automated tools to aggressively filter, score, and de-duplicate the data into a refined higher quality dataset. The original LLM is improved by finetuning on this refined dataset. We applied our approach to several open-source LLMs and compared the resulting performance to baseline models with both automated metrics and human preferences. Our evaluation shows the resulting models outperform all other downloadable baselines and approach the performance of larger proprietary models.

6/13/2024

Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs

Sukmin Yun, Haokun Lin, Rusiru Thushara, Mohammad Qazim Bhat, Yongxin Wang, Zutao Jiang, Mingkai Deng, Jinhong Wang, Tianhua Tao, Junbo Li, Haonan Li, Preslav Nakov, Timothy Baldwin, Zhengzhong Liu, Eric P. Xing, Xiaodan Liang, Zhiqiang Shen

Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio in a variety of understanding and generation tasks. However, current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code. To address this problem, we propose Web2Code, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning and an evaluation framework for the webpage understanding and HTML code translation abilities of MLLMs. For dataset construction, we leverage pretrained LLMs to enhance existing webpage-to-code datasets as well as generate a diverse pool of new webpages rendered into images. Specifically, the inputs are webpage images and instructions, while the responses are the webpage's HTML code. We further include diverse natural language QA pairs about the webpage content in the responses to enable a more comprehensive understanding of the web content. To evaluate model performance in these tasks, we develop an evaluation framework for testing MLLMs' abilities in webpage understanding and web-to-code generation. Extensive experiments show that our proposed dataset is beneficial not only to our proposed tasks but also in the general visual domain, while previous datasets result in worse performance. We hope our work will contribute to the development of general MLLMs suitable for web-based content generation and task automation. Our data and code will be available at https://github.com/MBZUAI-LLM/web2code.

7/1/2024