Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach

Read original: arXiv:2406.16386 - Published 6/26/2024 by Yuxuan Wan, Chaozheng Wang, Yi Dong, Wenxuan Wang, Shuqing Li, Yintong Huo, Michael R. Lyu

Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach

Overview

This paper presents a divide-and-conquer-based approach for automatically generating UI code from screenshots.
The proposed method involves segmenting the screenshot into individual UI elements, classifying each element, and then generating the corresponding code for each element.
The authors demonstrate the effectiveness of their approach on a new large-scale dataset of real-world UI layouts and their associated code.

Plain English Explanation

The paper describes a method for automatically converting a screenshot of a user interface (UI) into the actual code that would be used to build that UI. This is a challenging task, as screenshots are just images, while the code needs to capture the structure, layout, and functionality of the UI.

The key idea is to break down the problem into smaller, more manageable steps. First, the screenshot is analyzed to identify the individual UI elements, such as buttons, text boxes, and menus. Each element is then classified to determine what type of UI component it represents. Finally, the appropriate code for each element is generated and assembled into a complete UI implementation.

This divide-and-conquer approach allows the system to handle the complexity of real-world UI designs, which can be quite intricate. The authors have also created a new dataset of UI layouts and their corresponding code, which can be used to train and evaluate such systems. This dataset is an important contribution, as it provides a realistic testbed for evaluating the performance of UI code generation algorithms.

Overall, this work represents a significant step forward in automating the process of UI development, which could save time and resources for designers and developers. By simplifying the translation from design to implementation, it has the potential to make the UI development process more efficient and accessible.

Technical Explanation

The proposed approach follows a three-step process:

UI Element Segmentation: The screenshot is first segmented into individual UI elements using a deep learning-based image segmentation model. This allows the system to identify the boundaries and locations of each UI component within the overall layout.
UI Element Classification: Each segmented UI element is then classified into one of several predefined UI component types, such as buttons, text inputs, or menus. This is accomplished using a separate deep learning-based classification model.
UI Code Generation: Finally, the classified UI elements are used to generate the corresponding UI code. This is done by mapping each element type to a set of pre-defined code templates, which are then customized with the specific attributes of the element (e.g., size, position, text content).

The authors evaluate their approach on a new dataset called Vision2UI, which contains over 100,000 real-world UI layouts and their associated code. They demonstrate that their divide-and-conquer method outperforms several baselines, including end-to-end neural network approaches like UICoder and AI-Inspired UI Design.

Additionally, the authors show that their approach is capable of handling a wide range of UI designs, including those with complex layouts and interactions, as demonstrated on the You Only Look at Screens dataset.

Critical Analysis

One potential limitation of the proposed approach is that it relies on a predefined set of UI component types, which may not capture the full diversity of real-world UI elements. As user interfaces continue to evolve, the system may need to be frequently updated to recognize new UI component types.

Additionally, the authors note that their code generation step is currently based on template-matching, which may not be flexible enough to handle highly customized or complex UI designs. Exploring more advanced code generation techniques, such as those used in GUING, could be a promising area for future research.

Finally, while the authors demonstrate the effectiveness of their approach on large-scale datasets, it would be valuable to also evaluate the system's performance in real-world deployment scenarios, where the input screenshots may be noisier or less well-structured than the examples in the datasets.

Conclusion

This paper presents a novel divide-and-conquer-based approach for automatically generating UI code from screenshots. By breaking down the problem into smaller, more manageable steps, the authors have developed a system that can handle the complexity of real-world UI designs. The new Vision2UI dataset is a valuable contribution that will enable further research and development in this area.

Overall, this work represents an important step forward in automating the UI development process, which could significantly improve the efficiency and accessibility of user interface design and implementation.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach

Yuxuan Wan, Chaozheng Wang, Yi Dong, Wenxuan Wang, Shuqing Li, Yintong Huo, Michael R. Lyu

Websites are critical in today's digital world, with over 1.11 billion currently active and approximately 252,000 new sites launched daily. Converting website layout design into functional UI code is a time-consuming yet indispensable step of website development. Manual methods of converting visual designs into functional code present significant challenges, especially for non-experts. To explore automatic design-to-code solutions, we first conduct a motivating study on GPT-4o and identify three types of issues in generating UI code: element omission, element distortion, and element misarrangement. We further reveal that a focus on smaller visual segments can help multimodal large language models (MLLMs) mitigate these failures in the generation process. In this paper, we propose DCGen, a divide-and-conquer-based approach to automate the translation of webpage design to UI code. DCGen starts by dividing screenshots into manageable segments, generating descriptions for each segment, and then reassembling them into complete UI code for the entire screenshot. We conduct extensive testing with a dataset comprised of real-world websites and various MLLMs and demonstrate that DCGen achieves up to a 14% improvement in visual similarity over competing methods. To the best of our knowledge, DCGen is the first segment-aware prompt-based approach for generating UI code directly from screenshots.

6/26/2024

VISION2UI: A Real-World Dataset with Layout for Code Generation from UI Designs

Yi Gui, Zhen Li, Yao Wan, Yemin Shi, Hongyu Zhang, Yi Su, Shaoling Dong, Xing Zhou, Wenbin Jiang

Automatically generating UI code from webpage design visions can significantly alleviate the burden of developers, enabling beginner developers or designers to directly generate Web pages from design diagrams. Currently, prior research has accomplished the objective of generating UI code from rudimentary design visions or sketches through designing deep neural networks. Inspired by the groundbreaking advancements achieved by Multimodal Large Language Models (MLLMs), the automatic generation of UI code from high-fidelity design images is now emerging as a viable possibility. Nevertheless, our investigation reveals that existing MLLMs are hampered by the scarcity of authentic, high-quality, and large-scale datasets, leading to unsatisfactory performance in automated UI code generation. To mitigate this gap, we present a novel dataset, termed VISION2UI, extracted from real-world scenarios, augmented with comprehensive layout information, tailored specifically for finetuning MLLMs in UI code generation. Specifically, this dataset is derived through a series of operations, encompassing collecting, cleaning, and filtering of the open-source Common Crawl dataset. In order to uphold its quality, a neural scorer trained on labeled samples is utilized to refine the data, retaining higher-quality instances. Ultimately, this process yields a dataset comprising 2,000 (Much more is coming soon) parallel samples encompassing design visions and UI code. The dataset is available at https://huggingface.co/datasets/xcodemind/vision2ui.

4/10/2024

UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback

Jason Wu, Eldon Schoop, Alan Leung, Titus Barik, Jeffrey P. Bigham, Jeffrey Nichols

Large language models (LLMs) struggle to consistently generate UI code that compiles and produces visually relevant designs. Existing approaches to improve generation rely on expensive human feedback or distilling a proprietary model. In this paper, we explore the use of automated feedback (compilers and multi-modal models) to guide LLMs to generate high-quality UI code. Our method starts with an existing LLM and iteratively produces improved models by self-generating a large synthetic dataset using an original model, applying automated tools to aggressively filter, score, and de-duplicate the data into a refined higher quality dataset. The original LLM is improved by finetuning on this refined dataset. We applied our approach to several open-source LLMs and compared the resulting performance to baseline models with both automated metrics and human preferences. Our evaluation shows the resulting models outperform all other downloadable baselines and approach the performance of larger proprietary models.

6/13/2024

On AI-Inspired UI-Design

Jialiang Wei, Anne-Lise Courbis, Thomas Lambolais, G'erard Dray, Walid Maalej

Graphical User Interface (or simply UI) is a primary mean of interaction between users and their device. In this paper, we discuss three major complementary approaches on how to use Artificial Intelligence (AI) to support app designers create better, more diverse, and creative UI of mobile apps. First, designers can prompt a Large Language Model (LLM) like GPT to directly generate and adjust one or multiple UIs. Second, a Vision-Language Model (VLM) enables designers to effectively search a large screenshot dataset, e.g. from apps published in app stores. The third approach is to train a Diffusion Model (DM) specifically designed to generate app UIs as inspirational images. We discuss how AI should be used, in general, to inspire and assist creative app design rather than automating it.

6/21/2024