Plan with Code: Comparing approaches for robust NL to DSL generation

Read original: arXiv:2408.08335 - Published 8/19/2024 by Nastaran Bassamzadeh, Chhaya Methani

Overview

The paper compares different approaches for generating robust natural language to domain-specific language (DSL) translations.
It explores techniques like fine-tuning, prompting, and using language models as planning domain generators.
The goal is to improve the reliability and accuracy of translating natural language instructions into executable code or plans.

Plain English Explanation

The research paper is looking at different ways to take natural language instructions, like step-by-step directions, and turn them into a programming language or specialized format that a computer can understand and execute. This is a challenging task because natural language can be ambiguous or vague, while computer languages need to be very precise.

The researchers tested several techniques, including:

Fine-tuning: Taking a large language model and training it specifically on the task of translating natural language to code.
Prompting: Giving the language model carefully crafted instructions to guide it in generating the correct code.
Using language models as planning domain generators: Having the language model first create an internal representation of the task, which can then be converted to executable code.

The goal is to find approaches that are more reliable and produce higher-quality translations from natural language to the specialized code or planning format. This could be useful for all sorts of applications, like automating workflow instructions, programming by voice, or even having AI agents follow natural language commands.

Technical Explanation

The paper compares several approaches for translating natural language (NL) instructions into executable domain-specific languages (DSLs) or planning representations.

One approach is fine-tuning large language models on NL-to-DSL translation datasets. This aims to specialize the model for the task, improving performance compared to zero-shot generation.
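
To make the fine-tuning setup more concrete, the sketch below serializes NL-to-DSL pairs as JSON Lines, a common layout for supervised fine-tuning data. The field names (`prompt`, `completion`), the sample requests, and the DSL calls are illustrative assumptions, not the format used in the paper.

```python
import json


def to_jsonl(pairs: list[tuple[str, str]]) -> str:
    """Serialize (request, plan) pairs, one JSON object per line."""
    records = (
        {"prompt": request, "completion": plan}  # hypothetical field names
        for request, plan in pairs
    )
    return "\n".join(json.dumps(record) for record in records)


# Hypothetical training pairs; the DSL syntax is invented for illustration.
pairs = [
    ("Email me today's weather in Oslo",
     "w = get_weather(city='Oslo'); send_email(to='me', body=w)"),
    ("Save the attachment to the shared drive",
     "f = get_attachment(); upload_file(file=f, target='shared')"),
]

print(to_jsonl(pairs))
```

Each line of the output is an independent JSON object, so the dataset can be streamed or split into training and test portions without parsing the whole file.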

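Another approach is prompting, where the model is given carefully constructed prompts to guide generation of the DSL code or plan. According to the paper's abstract, the prompting approach studied is Retrieval Augmented Generation (RAG), with optimizations specific to DSL generation.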
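
As a rough illustration of prompting with retrieved context (the paper's abstract describes its approach as Retrieval Augmented Generation), the sketch below picks the API descriptions most relevant to a request and places them in the prompt. The word-overlap ranking is a toy stand-in for a real retriever, and the API names, descriptions, and prompt wording are all hypothetical rather than taken from the paper.

```python
def retrieve_apis(request: str, api_docs: dict[str, str], k: int = 2) -> list[str]:
    """Rank APIs by word overlap with the request (toy stand-in for a retriever)."""
    words = set(request.lower().split())

    def overlap(name: str) -> int:
        return len(words & set(api_docs[name].lower().split()))

    # Highest overlap first; ties are broken alphabetically for determinism.
    ranked = sorted(api_docs, key=lambda name: (-overlap(name), name))
    return ranked[:k]


def build_prompt(request: str, api_docs: dict[str, str], k: int = 2) -> str:
    """Assemble a prompt that restricts the model to the retrieved functions."""
    names = retrieve_apis(request, api_docs, k)
    lines = ["Translate the request into a DSL plan.", "Use only these functions:"]
    lines += [f"- {name}: {api_docs[name]}" for name in names]
    lines += ["", f"Request: {request}", "Plan:"]
    return "\n".join(lines)


# Hypothetical API catalogue; a real one could hold hundreds of entries.
API_DOCS = {
    "get_weather": "return the weather forecast for a city",
    "send_email": "send an email message to a recipient",
    "upload_file": "upload a file to cloud storage",
}

print(build_prompt("send an email with the weather forecast", API_DOCS))
```

Because the function list is rebuilt on every request, newly added API names can reach the model without retraining, which is the kind of advantage the abstract reports for unseen APIs.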

The researchers also explore using language models as planning domain generators. Here, the model first generates an abstract representation of the task, which is then translated into an executable planning domain.

Through a series of experiments, including an ablation study, the paper compares these techniques on NL-to-DSL translation for workflow automation in the Robotic Process Automation (RPA) domain. According to the abstract, the fine-tuned model scored best on the code similarity metric, while the optimized RAG approach matched its quality for in-domain API names and outperformed it by 7 points on out-of-domain (unseen) API names.
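
The kinds of measurements involved can be pictured with a small sketch: a check for function names that do not exist in the API catalogue, and a string-level similarity score against a reference plan. The regex-based call extraction, the use of `difflib` as the similarity score, and the sample plans are assumptions made for illustration; the paper's actual metric definitions are not given in this summary.

```python
import re
from difflib import SequenceMatcher

# Matches an identifier immediately followed by "(", i.e. a call site.
CALL_PATTERN = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\s*\(")


def called_functions(plan: str) -> set[str]:
    """Return the set of function names invoked in a plan."""
    return set(CALL_PATTERN.findall(plan))


def hallucinated_functions(plan: str, known_apis: set[str]) -> set[str]:
    """Return called names that are absent from the known API set."""
    return called_functions(plan) - known_apis


def similarity(generated: str, reference: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a code similarity metric."""
    return SequenceMatcher(None, generated, reference).ratio()


# Hypothetical plans in an invented DSL syntax.
known = {"get_weather", "send_email"}
reference = "w = get_weather(city='Oslo'); send_email(to='me', body=w)"
generated = "w = get_weather(city='Oslo'); send_mail(to='me', body=w)"

print(hallucinated_functions(generated, known))  # {'send_mail'}
print(round(similarity(generated, reference), 3))
```

The example shows why the two measurements are reported separately: the generated plan is nearly identical to the reference as text, yet it calls a function that does not exist and would fail at execution time.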

Critical Analysis

The paper provides a comprehensive evaluation of several techniques for improving the reliability and accuracy of natural language to domain-specific language translation. The researchers acknowledge the limitations of their work, such as the need to further explore the generalization capabilities of the approaches and their performance on more diverse datasets.

One potential area for further research could be investigating ways to better integrate the model's understanding of the underlying task or planning domain into the generation process. The current approaches rely heavily on learning from translation examples, but incorporating more explicit task modeling could potentially lead to more robust and generalizable results.

Additionally, the paper does not deeply explore the tradeoffs between the different techniques in terms of factors like computational cost, training data requirements, and ease of deployment. These practical considerations may be important for real-world applications of the proposed methods.

Overall, the paper presents a valuable contribution to the field of natural language-to-code generation, providing a comparative analysis of several promising approaches and highlighting directions for future research.

Conclusion

This research paper compares various techniques for translating natural language instructions into executable domain-specific languages or planning representations. The explored approaches include fine-tuning large language models, using prompting strategies, and leveraging language models as planning domain generators.

The findings suggest that these methods can improve the reliability and accuracy of NL-to-DSL translation, which has important applications in areas like workflow automation, programming by voice, and enabling AI agents to follow natural language commands.

The paper provides a solid foundation for further research in this domain, highlighting the strengths and limitations of the different approaches and identifying potential avenues for improving the robustness and generalization capabilities of these systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Plan with Code: Comparing approaches for robust NL to DSL generation

Nastaran Bassamzadeh, Chhaya Methani

Planning in code is considered a more reliable approach for many orchestration tasks. This is because code is more tractable than steps generated via natural language and makes it easy to support more complex sequences by abstracting deterministic logic into functions. It also allows spotting issues with incorrect function names with the help of parsing checks that can be run on code. Progress in code generation methodologies, however, remains limited to general-purpose languages like C, C++, and Python. LLMs continue to face challenges with custom function names in Domain Specific Languages (DSLs), leading to higher hallucination rates and syntax errors. This is more common for custom function names, which are typically part of the plan. Moreover, keeping LLMs up to date with newer function names is an issue. This poses a challenge for scenarios like task planning over a large number of APIs, since the plan is represented as a DSL with custom API names. In this paper, we focus on workflow automation in the RPA (Robotic Process Automation) domain as a special case of task planning. We present optimizations for using Retrieval Augmented Generation (RAG) with LLMs for DSL generation, along with an ablation study comparing these strategies with a fine-tuned model. Our results showed that the fine-tuned model scored the best on the code similarity metric. However, with our optimizations, the RAG approach is able to match the quality for in-domain API names in the test set. Additionally, it offers a significant advantage for out-of-domain or unseen API names, outperforming the fine-tuned model on the similarity metric by 7 pts.

8/19/2024

A Comparative Study of DSL Code Generation: Fine-Tuning vs. Optimized Retrieval Augmentation

Nastaran Bassamzadeh, Chhaya Methani

Natural Language to Code Generation has made significant progress in recent years with the advent of Large Language Models (LLMs). While generation for general-purpose languages like C, C++, and Python has improved significantly, LLMs struggle with custom function names in Domain Specific Languages (DSLs). This leads to higher hallucination rates and syntax errors, especially for DSLs with a high number of custom function names. Additionally, constant updates to function names add to the challenge, as LLMs need to stay up to date. In this paper, we present optimizations for using Retrieval Augmented Generation (RAG) with LLMs for DSL generation, along with an ablation study comparing these strategies. We generated training and test datasets with a DSL to represent automation tasks across roughly 700 APIs in the public domain. We used the training dataset to fine-tune a Codex model for this DSL. Our results showed that the fine-tuned model scored the best on the code similarity metric. With our RAG optimizations, we achieved parity on the similarity metric. The compilation rate, however, showed that both models still got the syntax wrong many times, with the RAG-based method being 2 pts better. Conversely, the hallucination rate for the RAG model lagged by 1 pt for API names and by 2 pts for API parameter keys. We conclude that an optimized RAG model can match the quality of fine-tuned models and offer advantages for new, unseen APIs.

7/4/2024

Large Language Models as Planning Domain Generators

James Oswald, Kavitha Srinivas, Harsha Kokel, Junkyu Lee, Michael Katz, Shirin Sohrabi

Developing domain models is one of the few remaining places that require manual human labor in AI planning. Thus, in order to make planning more accessible, it is desirable to automate the process of domain model generation. To this end, we investigate if large language models (LLMs) can be used to generate planning domain models from simple textual descriptions. Specifically, we introduce a framework for automated evaluation of LLM-generated domains by comparing the sets of plans for domain instances. Finally, we perform an empirical analysis of 7 large language models, including coding and chat models across 9 different planning domains, and under three classes of natural language domain descriptions. Our results indicate that LLMs, particularly those with high parameter counts, exhibit a moderate level of proficiency in generating correct planning domains from natural language descriptions. Our code is available at https://github.com/IBM/NL2PDDL.

5/14/2024

Planning In Natural Language Improves LLM Search For Code Generation

Evan Wang, Federico Cassano, Catherine Wu, Yunfeng Bai, Will Song, Vaskar Nath, Ziwen Han, Sean Hendryx, Summer Yue, Hugh Zhang

While scaling training compute has led to remarkable improvements in large language models (LLMs), scaling inference compute has not yet yielded analogous gains. We hypothesize that a core missing component is a lack of diverse LLM outputs, leading to inefficient search due to models repeatedly sampling highly similar, yet incorrect generations. We empirically demonstrate that this lack of diversity can be mitigated by searching over candidate plans for solving a problem in natural language. Based on this insight, we propose PLANSEARCH, a novel search algorithm which shows strong results across HumanEval+, MBPP+, and LiveCodeBench (a contamination-free benchmark for competitive coding). PLANSEARCH generates a diverse set of observations about the problem and then uses these observations to construct plans for solving the problem. By searching over plans in natural language rather than directly over code solutions, PLANSEARCH explores a significantly more diverse range of potential solutions compared to baseline search methods. Using PLANSEARCH on top of Claude 3.5 Sonnet achieves a state-of-the-art pass@200 of 77.0% on LiveCodeBench, outperforming both the best score achieved without search (pass@1 = 41.4%) and using standard repeated sampling (pass@200 = 60.6%). Finally, we show that, across all models, search algorithms, and benchmarks analyzed, we can accurately predict performance gains due to search as a direct function of the diversity over generated ideas.

9/6/2024