ExeDec: Execution Decomposition for Compositional Generalization in Neural Program Synthesis

2307.13883

Published 5/7/2024 by Kensen Shi, Joey Hong, Yinlin Deng, Pengcheng Yin, Manzil Zaheer, Charles Sutton

🧠

Abstract

When writing programs, people have the ability to tackle a new complex task by decomposing it into smaller and more familiar subtasks. While it is difficult to measure whether neural program synthesis methods have similar capabilities, we can measure whether they compositionally generalize, that is, whether a model that has been trained on the simpler subtasks is subsequently able to solve more complex tasks. In this paper, we characterize several different forms of compositional generalization that are desirable in program synthesis, forming a meta-benchmark which we use to create generalization tasks for two popular datasets, RobustFill and DeepCoder. We then propose ExeDec, a novel decomposition-based synthesis strategy that predicts execution subgoals to solve problems step-by-step informed by program execution at each step. When used with Transformer models trained from scratch, ExeDec has better synthesis performance and greatly improved compositional generalization ability compared to baselines. Finally, we use our benchmarks to demonstrate that LLMs struggle to compositionally generalize when asked to do programming-by-example in a few-shot setting, but an ExeDec-style prompting approach can improve the generalization ability and overall performance.

Create account to get full access

Overview

The paper explores the ability of neural program synthesis models to compositionally generalize, meaning whether they can solve more complex tasks by building on their knowledge of simpler subtasks.
The researchers propose a novel decomposition-based synthesis strategy called ExeDec that predicts execution subgoals to solve problems step-by-step.
The paper evaluates the compositional generalization abilities of ExeDec and large language models (LLMs) on programming-by-example tasks.

Plain English Explanation

When writing computer programs, people often break down complex tasks into smaller, more manageable pieces. This ability to decompose and recombine subtasks is an important aspect of human intelligence. Language Models as Compilers: Simulating Pseudocode Execution and Towards Compositionally Generalizable Semantic Parsing have explored whether neural program synthesis models can demonstrate similar capabilities.

The researchers in this paper wanted to see if neural models could also learn to solve more complex programming tasks by building on their knowledge of simpler subtasks. They call this "compositional generalization," and they developed a set of benchmarks to test for different forms of this ability.

The researchers then proposed a new strategy called ExeDec, which predicts execution subgoals to solve problems step-by-step. When combined with Transformer models, ExeDec showed improved synthesis performance and better compositional generalization compared to baseline approaches.

The paper also found that large language models (LLMs) struggled to compositionally generalize when asked to do programming-by-example in a few-shot setting. However, the ExeDec-style prompting approach was able to improve the LLMs' generalization and overall performance on these tasks.

Technical Explanation

The paper begins by characterizing several desirable forms of compositional generalization for program synthesis, such as the ability to combine learned subprograms in novel ways. The researchers then used these insights to create generalization tasks for two popular programming-by-example datasets: RobustFill and DeepCoder.

Next, the paper introduces ExeDec, a novel decomposition-based synthesis strategy. ExeDec predicts execution subgoals that can be used to solve problems step-by-step, informed by the program's execution at each step. Sequential Compositional Generalization in Multimodal Models and Next: Teaching Large Language Models to Reason have explored similar approaches for improving compositional generalization.

When used with Transformer models trained from scratch, ExeDec demonstrated better synthesis performance and greater compositional generalization abilities compared to baseline methods. The paper also found that LLMs struggled to compositionally generalize on programming-by-example tasks, but the ExeDec-style prompting approach could improve their generalization and overall performance.

Critical Analysis

The paper provides a thorough evaluation of compositional generalization for program synthesis, which is an important but understudied aspect of neural program generation. The proposed ExeDec strategy represents a novel and promising approach, but its effectiveness may be limited to certain types of programming tasks.

The paper also raises the concern that LLMs, despite their impressive language capabilities, may not inherently possess the same level of compositional and hierarchical reasoning abilities as humans. Comparing Decision-Making Mechanisms by Transformers and CNNs has explored some of the limitations of Transformer-based models in this regard.

Further research is needed to better understand the factors that enable compositional generalization in neural program synthesis, and to develop more robust and generally applicable approaches. Additionally, the paper's benchmarks could be expanded to cover a wider range of programming tasks and more diverse forms of compositional generalization.

Conclusion

This paper makes an important contribution to the field of neural program synthesis by introducing a novel decomposition-based strategy, ExeDec, and using it to characterize and evaluate the compositional generalization abilities of various models. The findings suggest that while LLMs have impressive language skills, they may struggle with the type of compositional reasoning required for complex programming tasks. The ExeDec approach offers a promising direction for improving the generalization capabilities of neural program synthesis models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Data-Efficient Learning with Neural Programs

Alaia Solko-Breslin, Seewon Choi, Ziyang Li, Neelay Velingker, Rajeev Alur, Mayur Naik, Eric Wong

Many computational tasks can be naturally expressed as a composition of a DNN followed by a program written in a traditional programming language or an API call to an LLM. We call such composites neural programs and focus on the problem of learning the DNN parameters when the training data consist of end-to-end input-output labels for the composite. When the program is written in a differentiable logic programming language, techniques from neurosymbolic learning are applicable, but in general, the learning for neural programs requires estimating the gradients of black-box components. We present an algorithm for learning neural programs, called ISED, that only relies on input-output samples of black-box components. For evaluation, we introduce new benchmarks that involve calls to modern LLMs such as GPT-4 and also consider benchmarks from the neurosymolic learning literature. Our evaluation shows that for the latter benchmarks, ISED has comparable performance to state-of-the-art neurosymbolic frameworks. For the former, we use adaptations of prior work on gradient approximations of black-box components as a baseline, and show that ISED achieves comparable accuracy but in a more data- and sample-efficient manner.

6/11/2024

cs.LG

Learning to Reason via Program Generation, Emulation, and Search

Nathaniel Weir, Muhammad Khalifa, Linlu Qiu, Orion Weller, Peter Clark

Program synthesis with language models (LMs) has unlocked a large set of reasoning abilities; code-tuned LMs have proven adept at generating programs that solve a wide variety of algorithmic symbolic manipulation tasks (e.g. word concatenation). However, not all reasoning tasks are easily expressible as code, e.g. tasks involving commonsense reasoning, moral decision-making, and sarcasm understanding. Our goal is to extend an LM's program synthesis skills to such tasks and evaluate the results via pseudo-programs, namely Python programs where some leaf function calls are left undefined. To that end, we propose, Code Generation and Emulated EXecution (CoGEX). CoGEX works by (1) training LMs to generate their own pseudo-programs, (2) teaching them to emulate their generated program's execution, including those leaf functions, allowing the LM's knowledge to fill in the execution gaps; and (3) using them to search over many programs to find an optimal one. To adapt the CoGEX model to a new task, we introduce a method for performing program search to find a single program whose pseudo-execution yields optimal performance when applied to all the instances of a given dataset. We show that our approach yields large improvements compared to standard in-context learning approaches on a battery of tasks, both algorithmic and soft reasoning. This result thus demonstrates that code synthesis can be applied to a much broader class of problems than previously considered. Our released dataset, fine-tuned models, and implementation can be found at url{https://github.com/nweir127/CoGEX}.

5/30/2024

cs.CL cs.AI

💬

Language Models as Compilers: Simulating Pseudocode Execution Improves Algorithmic Reasoning in Language Models

Hyungjoo Chae, Yeonghyeon Kim, Seungone Kim, Kai Tzu-iunn Ong, Beong-woo Kwak, Moohyeon Kim, Seonghwan Kim, Taeyoon Kwon, Jiwan Chung, Youngjae Yu, Jinyoung Yeo

Algorithmic reasoning refers to the ability to understand the complex patterns behind the problem and decompose them into a sequence of reasoning steps towards the solution. Such nature of algorithmic reasoning makes it a challenge for large language models (LLMs), even though they have demonstrated promising performance in other reasoning tasks. Within this context, some recent studies use programming languages (e.g., Python) to express the necessary logic for solving a given instance/question (e.g., Program-of-Thought) as inspired by their strict and precise syntaxes. However, it is non-trivial to write an executable code that expresses the correct logic on the fly within a single inference call. Also, the code generated specifically for an instance cannot be reused for others, even if they are from the same task and might require identical logic to solve. This paper presents Think-and-Execute, a novel framework that decomposes the reasoning process of language models into two steps. (1) In Think, we discover a task-level logic that is shared across all instances for solving a given task and then express the logic with pseudocode; (2) In Execute, we further tailor the generated pseudocode to each instance and simulate the execution of the code. With extensive experiments on seven algorithmic reasoning tasks, we demonstrate the effectiveness of Think-and-Execute. Our approach better improves LMs' reasoning compared to several strong baselines performing instance-specific reasoning (e.g., CoT and PoT), suggesting the helpfulness of discovering task-level logic. Also, we show that compared to natural language, pseudocode can better guide the reasoning of LMs, even though they are trained to follow natural language instructions.

4/4/2024

cs.CL

Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers

Aleksandar Stani'c, Sergi Caelles, Michael Tschannen

Visual reasoning is dominated by end-to-end neural networks scaled to billions of model parameters and training examples. However, even the largest models struggle with compositional reasoning, generalization, fine-grained spatial and temporal reasoning, and counting. Visual reasoning with large language models (LLMs) as controllers can, in principle, address these limitations by decomposing the task and solving subtasks by orchestrating a set of (visual) tools. Recently, these models achieved great performance on tasks such as compositional visual question answering, visual grounding, and video temporal reasoning. Nevertheless, in their current form, these models heavily rely on human engineering of in-context examples in the prompt, which are often dataset- and task-specific and require significant labor by highly skilled programmers. In this work, we present a framework that mitigates these issues by introducing spatially and temporally abstract routines and by leveraging a small number of labeled examples to automatically generate in-context examples, thereby avoiding human-created in-context examples. On a number of visual reasoning tasks, we show that our framework leads to consistent gains in performance, makes LLMs as controllers setup more robust, and removes the need for human engineering of in-context examples.

5/16/2024

cs.CV cs.AI cs.LG