uTeBC-NLP at SemEval-2024 Task 9: Can LLMs be Lateral Thinkers?

2404.02474

Published 4/4/2024 by Pouya Sadeghi, Amirhossein Abaskohi, Yadollah Yaghoobzadeh

Abstract

Inspired by human cognition, Jiang et al. (2023c) create a benchmark for assessing LLMs' lateral thinking, that is, thinking outside the box. Building upon this benchmark, we investigate how different prompting methods enhance LLMs' performance on this task to reveal their inherent power for outside-the-box thinking ability. Through participating in SemEval-2024 Task 9, Sentence Puzzle sub-task, we explore prompt engineering methods: chain of thoughts (CoT) and direct prompting, enhancing with informative descriptions, and employing contextualizing prompts using a retrieval augmented generation (RAG) pipeline. Our experiments involve three LLMs including GPT-3.5, GPT-4, and Zephyr-7B-beta. We generate a dataset of thinking paths between riddles and options using GPT-4, validated by humans for quality. Findings indicate that compressed informative prompts enhance performance. Dynamic in-context learning enhances model performance significantly. Furthermore, fine-tuning Zephyr on our dataset enhances performance across other commonsense datasets, underscoring the value of innovative thinking.

Overview

This paper explores whether large language models (LLMs) can engage in lateral thinking, that is, thinking outside the box to reach unconventional solutions.
The researchers took part in the Sentence Puzzle sub-task of SemEval-2024 Task 9, which builds on the lateral thinking benchmark of Jiang et al. (2023c), and compared prompting methods on GPT-3.5, GPT-4, and Zephyr-7B-beta.
The study aimed to show how far prompt design, retrieved in-context examples, and fine-tuning can draw out the outside-the-box thinking ability of these models.

Plain English Explanation

Large language models (LLMs) like GPT-4 can generate human-like text on a wide range of topics. However, it is unclear whether these models can think creatively and come up with novel, out-of-the-box solutions to problems.

The researchers in this paper wanted to put LLMs' lateral thinking skills to the test. Lateral thinking involves finding unexpected connections and unconventional approaches to solve problems. It's a deeper type of reasoning that goes beyond just reciting information.

To assess the lateral thinking capabilities of LLMs, the researchers used the Sentence Puzzle sub-task of SemEval-2024 Task 9. Each puzzle is a riddle with a set of answer options, and the puzzles are designed to defy commonsense associations. Picking the right option requires the model to question its default assumptions rather than recall facts.

By comparing prompting strategies on this task, the researchers aimed to reveal how much outside-the-box thinking the models already possess. The results shed light on which kinds of prompts and training data help, and point the way towards more flexible, laterally-thinking AI in the future.

Technical Explanation

The paper describes the team's entry in the Sentence Puzzle sub-task of SemEval-2024 Task 9, which evaluates the lateral thinking capabilities of large language models (LLMs). Each item is a riddle with candidate answer options, and the system must select the correct one.

The researchers compared chain of thoughts (CoT) prompting, in which the model is asked to write out intermediate reasoning steps before committing to an answer, with direct prompting, in which it answers immediately. They also enriched prompts with informative descriptions and contextualized them using a retrieval augmented generation (RAG) pipeline.
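
To make the contrast between chain of thoughts and direct prompting concrete, here is a minimal Python sketch for a multiple-choice riddle. The function name `build_prompt`, the instruction wording, and the example riddle are illustrative assumptions, not the prompts or data used in the paper.

```python
from typing import List


def build_prompt(riddle: str, options: List[str], chain_of_thought: bool = True) -> str:
    """Format a multiple-choice riddle as a CoT prompt or a direct prompt.

    Hypothetical helper: the instruction wording is an assumption.
    """
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    lines = [f"Riddle: {riddle}", "Options:"]
    for letter, option in zip(letters, options):
        lines.append(f"{letter}. {option}")
    if chain_of_thought:
        # CoT: ask for visible reasoning before the final choice.
        lines.append(
            "Let's think step by step, questioning the obvious reading of the "
            "riddle, then finish with 'Answer: <letter>'."
        )
    else:
        # Direct: ask for the choice only, with no reasoning.
        lines.append("Reply with only the letter of the correct option.")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative riddle, not taken from the task data.
    riddle = "A man shaves several times a day, yet he still has a beard. How?"
    options = ["He is a barber.", "His beard grows very fast.", "None of the above."]
    print(build_prompt(riddle, options, chain_of_thought=True))
    print()
    print(build_prompt(riddle, options, chain_of_thought=False))
```

The resulting string would be sent to whichever model is under test; only the final instruction line differs between the two modes, which keeps the comparison controlled.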

Three LLMs were evaluated: GPT-3.5, GPT-4, and Zephyr-7B-beta. The researchers also used GPT-4 to generate a dataset of thinking paths between riddles and their options, had humans validate it for quality, and fine-tuned Zephyr on it.

The findings indicate that compressed informative prompts improve performance and that dynamic in-context learning improves it significantly. Fine-tuning Zephyr on the thinking-path dataset also improved its performance on other commonsense datasets.
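
The abstract mentions contextualizing prompts through a retrieval augmented generation (RAG) pipeline and reports gains from dynamic in-context learning. One way to picture this is to retrieve the solved riddles most similar to the new one and place them in the prompt as examples. The sketch below is a rough illustration under stated assumptions: it uses bag-of-words cosine similarity as a stand-in retriever (the paper's actual retriever is not described here), and the names `retrieve_examples` and `build_few_shot_prompt`, the dictionary keys, and the sample riddles are all hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_examples(query: str, pool: List[Dict[str, str]], k: int = 2) -> List[Dict[str, str]]:
    """Return the k solved riddles from the pool most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        pool,
        key=lambda ex: cosine(q, Counter(tokenize(ex["riddle"]))),
        reverse=True,
    )
    return ranked[:k]


def build_few_shot_prompt(query: str, pool: List[Dict[str, str]], k: int = 2) -> str:
    """Prepend the retrieved examples (with reasoning) to the new riddle."""
    blocks = [
        f"Riddle: {ex['riddle']}\nReasoning: {ex['reasoning']}\nAnswer: {ex['answer']}"
        for ex in retrieve_examples(query, pool, k)
    ]
    blocks.append(f"Riddle: {query}\nReasoning:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Illustrative pool, not taken from the task data.
    pool = [
        {"riddle": "A man shaves several times a day but still has a beard. How?",
         "reasoning": "He may be shaving other people.", "answer": "He is a barber."},
        {"riddle": "What has keys but cannot open locks?",
         "reasoning": "Keys need not be door keys.", "answer": "A piano."},
    ]
    print(build_few_shot_prompt("A barber shaves many times each day. Why?", pool, k=1))
```

Because the examples are chosen per query rather than fixed in advance, each riddle sees demonstrations close to its own wording, which is the sense of "dynamic" assumed here.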

Critical Analysis

Evaluating the lateral thinking capabilities of LLMs is a complex and challenging task. The Sentence Puzzle sub-task, in which the model chooses among given options, may not fully capture the nuances of lateral thinking, and additional tasks and evaluation metrics may be needed.

Furthermore, the chain of thoughts prompting approach, while helpful, may still constrain the LLMs' reasoning to some degree. There is a need to explore more open-ended and unconstrained task formulations to assess the models' ability to think creatively and unconventionally.

The researchers also point out that the performance of LLMs on this task may be influenced by biases in the training data and the potential for the models to simply mimic patterns in the data, rather than engaging in true lateral thinking. Addressing these potential biases and limitations will be crucial for developing LLMs that can reliably demonstrate lateral thinking abilities.

Conclusion

This paper takes a step towards understanding the lateral thinking capabilities of large language models. The results suggest that prompt design matters: compressed informative prompts and dynamic in-context learning improve performance on the Sentence Puzzle sub-task, and fine-tuning Zephyr on GPT-4-generated thinking paths carries over to other commonsense datasets.

The insights gained from this research could inform the development of more flexible and laterally-thinking AI systems in the future. By continuing to explore the boundaries of LLM reasoning and creativity, researchers can work towards creating AI assistants that can truly think outside the box and provide innovative solutions to complex problems.

Related Papers

Mothman at SemEval-2024 Task 9: An Iterative System for Chain-of-Thought Prompt Optimization

Alvin Po-Chun Chen, Ray Groshan, Sean von Bayern

Extensive research exists on the performance of large language models on logic-based tasks, whereas relatively little has been done on their ability to generate creative solutions on lateral thinking tasks. The BrainTeaser shared task tests lateral thinking and uses adversarial datasets to prevent memorization, resulting in poor performance for out-of-the-box models. We propose a system for iterative, chain-of-thought prompt engineering which optimizes prompts using human evaluation. Using this shared task, we demonstrate our system's ability to significantly improve model performance by optimizing prompts and evaluate the input dataset.

5/7/2024

cs.CL

SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense

Yifan Jiang, Filip Ilievski, Kaixin Ma

While vertical thinking relies on logical and commonsense reasoning, lateral thinking requires systems to defy commonsense associations and overwrite them through unconventional thinking. Lateral thinking has been shown to be challenging for current models but has received little attention. A recent benchmark, BRAINTEASER, aims to evaluate current models' lateral thinking ability in a zero-shot setting. In this paper, we split the original benchmark to also support fine-tuning setting and present SemEval Task 9: BRAIN-TEASER(S), the first task at this competition designed to test the system's reasoning and lateral thinking ability. As a popular task, BRAINTEASER(S)'s two subtasks receive 483 team submissions from 182 participants during the competition. This paper provides a fine-grained system analysis of the competition results, together with a reflection on what this means for the ability of the systems to reason laterally. We hope that the BRAINTEASER(S) subtasks and findings in this paper can stimulate future work on lateral thinking and robust reasoning by computational models.

4/26/2024

cs.AI cs.CL cs.LG

Why Can Large Language Models Generate Correct Chain-of-Thoughts?

Rasul Tutunov, Antoine Grosnit, Juliusz Ziomek, Jun Wang, Haitham Bou-Ammar

This paper delves into the capabilities of large language models (LLMs), specifically focusing on advancing the theoretical comprehension of chain-of-thought prompting. We investigate how LLMs can be effectively induced to generate a coherent chain of thoughts. To achieve this, we introduce a two-level hierarchical graphical model tailored for natural language generation. Within this framework, we establish a compelling geometrical convergence rate that gauges the likelihood of an LLM-generated chain of thoughts compared to those originating from the true language. Our findings provide a theoretical justification for the ability of LLMs to produce the correct sequence of thoughts (potentially) explaining performance gains in tasks demanding reasoning skills.

6/7/2024

cs.CL

DaVinci at SemEval-2024 Task 9: Few-shot prompting GPT-3.5 for Unconventional Reasoning

Suyash Vardhan Mathur, Akshett Rai Jindal, Manish Shrivastava

While significant work has been done in the field of NLP on vertical thinking, which involves primarily logical thinking, little work has been done towards lateral thinking, which involves looking at problems from an unconventional perspective and defying existing conceptions and notions. Towards this direction, SemEval 2024 introduces the task of BRAINTEASER, which involves two types of questions -- Sentence Puzzles and Word Puzzles that defy conventional common-sense reasoning and constraints. In this paper, we tackle both types of questions using few-shot prompting on GPT-3.5 and gain insights regarding the difference in the nature of the two types. Our prompting strategy placed us 26th on the leaderboard for the Sentence Puzzle and 15th on the Word Puzzle task.

5/21/2024

cs.CL cs.AI