SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense

2404.16068

Published 4/26/2024 by Yifan Jiang, Filip Ilievski, Kaixin Ma


Abstract

While vertical thinking relies on logical and commonsense reasoning, lateral thinking requires systems to defy commonsense associations and overwrite them through unconventional thinking. Lateral thinking has been shown to be challenging for current models but has received little attention. A recent benchmark, BRAINTEASER, aims to evaluate current models' lateral thinking ability in a zero-shot setting. In this paper, we split the original benchmark to also support a fine-tuning setting and present SemEval Task 9: BRAIN-TEASER(S), the first task at this competition designed to test systems' reasoning and lateral thinking ability. As a popular task, BRAINTEASER(S)'s two subtasks received 483 team submissions from 182 participants during the competition. This paper provides a fine-grained system analysis of the competition results, together with a reflection on what this means for the ability of the systems to reason laterally. We hope that the BRAINTEASER(S) subtasks and findings in this paper can stimulate future work on lateral thinking and robust reasoning by computational models.


Overview

This paper discusses the concept of lateral thinking, which involves defying commonsense associations and using unconventional reasoning to solve problems.
The paper builds on BRAINTEASER, a recent benchmark that evaluates current models' lateral thinking abilities in a zero-shot setting, and splits it so that it also supports fine-tuning.
The paper introduces SemEval Task 9: BRAIN-TEASER(S), the first task at the SemEval competition designed to test systems' reasoning and lateral thinking abilities.
The paper provides a detailed analysis of the competition results and reflects on the implications for the ability of current systems to reason laterally.

Plain English Explanation

Vertical thinking is based on logical, commonsense reasoning, but lateral thinking requires systems to break away from these typical associations and think in more unconventional ways. This type of lateral thinking has proven challenging for current AI models, but it hasn't received much attention.

BRAINTEASER is a recent benchmark that evaluates models' lateral thinking skills in a zero-shot setting, meaning models are tested without being trained on examples from the task itself. For this competition, the researchers split the original benchmark so that it also supports fine-tuning, and released it as SemEval Task 9: BRAIN-TEASER(S), the first task at SemEval designed to test systems' reasoning and lateral thinking abilities.

The task proved popular: its two subtasks received 483 submissions from 182 participants. The paper analyzes the competition results in detail, shedding light on the current capabilities and limitations of AI systems when it comes to lateral thinking and robust reasoning.

The researchers hope that the BRAINTEASER(S) subtasks and the findings in this paper will inspire future work on improving lateral thinking and robust reasoning in AI models.

Technical Explanation

The paper builds on BRAINTEASER, a recent multiple-choice benchmark that evaluates the lateral thinking abilities of AI models in a zero-shot setting. The researchers split the original benchmark so that it also supports a fine-tuning setting and released it as SemEval Task 9: BRAIN-TEASER(S), a competition task with two subtasks (Sentence Puzzle and Word Puzzle) designed to test systems' reasoning and lateral thinking skills.
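To make multiple-choice evaluation concrete, here is a minimal Python sketch of scoring system predictions. It assumes each question carries a hypothetical `group_id` linking a puzzle to its rephrased variants, and reports both per-question accuracy and a stricter group accuracy that credits a group only when every question in it is answered correctly. The `Item` record, its field names, and the grouping itself are illustrative assumptions, not the official task format or scorer.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Item:
    group_id: str  # hypothetical: links a puzzle to its rephrased variants
    gold: int      # index of the correct choice
    pred: int      # index chosen by the system


def instance_accuracy(items):
    """Fraction of individual questions answered correctly."""
    return sum(i.pred == i.gold for i in items) / len(items)


def group_accuracy(items):
    """Fraction of groups in which every question is answered correctly."""
    groups = defaultdict(list)
    for i in items:
        groups[i.group_id].append(i.pred == i.gold)
    return sum(all(results) for results in groups.values()) / len(groups)


# Hypothetical predictions for two puzzle groups.
items = [
    Item("q1", gold=2, pred=2),
    Item("q1", gold=0, pred=0),
    Item("q2", gold=1, pred=1),
    Item("q2", gold=3, pred=0),
]
print(instance_accuracy(items))  # 0.75
print(group_accuracy(items))     # 0.5
```

The group metric is deliberately harsher: a system that guesses its way to one variant of a puzzle but misses another gets no credit for that group.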

The SemEval Task 9: BRAIN-TEASER(S) competition received a total of 483 team submissions from 182 participants across its two subtasks. The paper provides a detailed analysis of the competition results, including insights into the performance of the participating systems and the challenges they faced in demonstrating lateral thinking abilities.

The paper contrasts vertical thinking, which relies on logical and commonsense reasoning, with lateral thinking, which requires defying commonsense associations and using unconventional approaches; prior work has shown the latter to be challenging for current models. The paper reflects on what the competition results mean for the ability of AI systems to reason laterally and highlights the need for further research and development in this area.

Critical Analysis

The paper provides valuable insights into the current limitations of AI systems when it comes to lateral thinking. The researchers acknowledge that lateral thinking has received little attention in the field, and the BRAINTEASER(S) benchmark and competition represent an important step in addressing this gap.

However, the paper does not delve into the specific reasons why lateral thinking is challenging for current models. It would be helpful to have a more detailed discussion of the underlying factors, such as the cognitive biases or architectural limitations that hinder models' ability to engage in unconventional reasoning.

Additionally, the paper could have explored potential approaches or architectural modifications that could help improve lateral thinking capabilities in AI systems. While the researchers express hope that the BRAINTEASER(S) findings will inspire future work in this area, they do not provide any concrete suggestions or directions for further research.

Overall, the paper makes a valuable contribution by highlighting the importance of lateral thinking and the need for more research in this domain. However, a deeper exploration of the underlying challenges and potential solutions could have strengthened the paper's impact and usefulness for the research community.

Conclusion

This paper sheds light on the current limitations of AI models when it comes to lateral thinking, a type of reasoning that requires defying commonsense associations and using unconventional approaches. The researchers adapt the BRAINTEASER benchmark into the SemEval Task 9: BRAIN-TEASER(S) competition as a tool to evaluate and improve lateral thinking abilities in computational systems.

The detailed analysis of the competition results reflects on how well current systems can reason laterally, a kind of reasoning that prior work has shown to be challenging for current models. This highlights the need for further research and development in this area to advance robust and flexible reasoning in AI.

The researchers hope that the BRAINTEASER(S) benchmark and the insights from this paper will stimulate future work on improving lateral thinking and reasoning capabilities in AI systems, which could have significant implications for their real-world applicability and problem-solving abilities.

This summary was produced with help from an AI and may contain inaccuracies; consult the original paper for authoritative details.

Related Papers


DaVinci at SemEval-2024 Task 9: Few-shot prompting GPT-3.5 for Unconventional Reasoning

Suyash Vardhan Mathur, Akshett Rai Jindal, Manish Shrivastava

While significant work has been done in the field of NLP on vertical thinking, which involves primarily logical thinking, little work has been done towards lateral thinking, which involves looking at problems from an unconventional perspective and defying existing conceptions and notions. Towards this direction, SemEval 2024 introduces the task of BRAINTEASER, which involves two types of questions -- Sentence Puzzles and Word Puzzles that defy conventional common-sense reasoning and constraints. In this paper, we tackle both types of questions using few-shot prompting on GPT-3.5 and gain insights regarding the difference in the nature of the two types. Our prompting strategy placed us 26th on the leaderboard for the Sentence Puzzle and 15th on the Word Puzzle task.
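Few-shot prompting for multiple-choice puzzles amounts to placing a handful of solved examples before the unsolved question. The Python sketch below shows one way such a prompt might be assembled; the "Question/Answer" template, the letter labels, the helper names `format_question` and `build_prompt`, and the example puzzles are hypothetical and are not DaVinci's actual prompt. No model call is shown.

```python
from string import ascii_uppercase


def format_question(question, choices, answer=None):
    """Render one multiple-choice item; include the answer letter for demonstrations."""
    lines = [f"Question: {question}"]
    lines += [f"{label}. {choice}" for label, choice in zip(ascii_uppercase, choices)]
    lines.append("Answer:" if answer is None else f"Answer: {ascii_uppercase[answer]}")
    return "\n".join(lines)


def build_prompt(demonstrations, question, choices):
    """Solved demonstrations first, then the target question with an open answer slot."""
    parts = [format_question(q, c, a) for q, c, a in demonstrations]
    parts.append(format_question(question, choices))
    return "\n\n".join(parts)


# Hypothetical, hand-written example (not taken from the task data).
demos = [("What has keys but cannot open locks?", ["A door", "A piano", "A map"], 1)]
prompt = build_prompt(demos, "What has hands but cannot clap?", ["A clock", "A glove", "A tree"])
print(prompt)
```

Ending the prompt with an empty "Answer:" slot nudges the model to reply with a single choice letter, which is easy to parse.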

5/21/2024

cs.CL cs.AI

BAMO at SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense

Baktash Ansari, Mohammadmostafa Rostamkhani, Sauleh Eetemadi

This paper outlines our approach to SemEval 2024 Task 9, BRAINTEASER: A Novel Task Defying Common Sense. The task aims to evaluate the ability of language models to think creatively. The dataset comprises multi-choice questions that challenge models to think outside of the box. We fine-tune 2 models, BERT and RoBERTa Large. Next, we employ a Chain of Thought (CoT) zero-shot prompting approach with 6 large language models, such as GPT-3.5, Mixtral, and Llama2. Finally, we utilize ReConcile, a technique that employs a round table conference approach with multiple agents for zero-shot learning, to generate consensus answers among 3 selected language models. Our best method achieves an overall accuracy of 85 percent on the sentence puzzles subtask.
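ReConcile has several models discuss over multiple rounds before reaching a consensus. The Python sketch below shows only a much simpler final step that such an ensemble might end with: a confidence-weighted vote over the models' answers. It does not implement ReConcile's discussion rounds, and the `weighted_vote` helper and the example confidences are hypothetical.

```python
from collections import defaultdict


def weighted_vote(votes):
    """votes: (answer_label, confidence) pairs, one per model.

    Returns the label with the highest summed confidence. On a tie, the label
    seen first wins, because dicts keep insertion order and max() returns the
    first maximal item it encounters.
    """
    totals = defaultdict(float)
    for label, confidence in votes:
        totals[label] += confidence
    return max(totals, key=totals.get)


# Hypothetical outputs from three models for one puzzle.
print(weighted_vote([("B", 0.9), ("A", 0.6), ("A", 0.5)]))  # A
```

Two moderately confident models can outvote one highly confident model here, which is the main difference from simply trusting the most confident answer.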

6/10/2024

cs.CL


AmazUtah_NLP at SemEval-2024 Task 9: A MultiChoice Question Answering System for Commonsense Defying Reasoning

Mina Ghashami, Soumya Smruti Mishra

The SemEval 2024 BRAINTEASER task represents a pioneering venture in Natural Language Processing (NLP) by focusing on lateral thinking, a dimension of cognitive reasoning that is often overlooked in traditional linguistic analyses. This challenge comprises Sentence Puzzle and Word Puzzle subtasks and aims to test language models' capacity for divergent thinking. In this paper, we present our approach to the BRAINTEASER task. We employ a holistic strategy by leveraging cutting-edge pre-trained models in a multiple-choice architecture, and diversify the training data with Sentence and Word Puzzle datasets. To gain further improvement, we fine-tune the model with a synthetic humor or jokes dataset and the RiddleSense dataset, which helped augment the model's lateral thinking abilities. Empirical results show that our approach achieves 92.5% accuracy in the Sentence Puzzle subtask and 80.2% accuracy in the Word Puzzle subtask.

5/21/2024

cs.CL cs.AI cs.IR cs.LG

iREL at SemEval-2024 Task 9: Improving Conventional Prompting Methods for Brain Teasers

Harshit Gupta, Manav Chaudhary, Tathagata Raha, Shivansh Subramanian, Vasudeva Varma

This paper describes our approach for SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense. The BRAINTEASER task comprises multiple-choice Question Answering designed to evaluate the models' lateral thinking capabilities. It consists of Sentence Puzzle and Word Puzzle subtasks that require models to defy default common-sense associations and exhibit unconventional thinking. We propose a unique strategy to improve the performance of pre-trained language models, notably the Gemini 1.0 Pro Model, in both subtasks. We employ static and dynamic few-shot prompting techniques and introduce a model-generated reasoning strategy that utilizes the LLM's reasoning capabilities to improve performance. Our approach demonstrated significant improvements, showing that it performed better than the baseline models by a considerable margin but fell short of performing as well as the human annotators, thus highlighting the efficacy of the proposed strategies.
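The abstract does not say how the dynamic few-shot examples are chosen. A common approach, assumed here, is to retrieve the training puzzles most similar to the target question for each query, rather than reusing one fixed set. The Python sketch below uses bag-of-words cosine similarity as a stand-in for whatever embedding a real system would use; the `cosine` and `select_examples` helpers and the example puzzles are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Bag-of-words cosine similarity between two strings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def select_examples(pool, query, k=2):
    """Dynamic few-shot: return the k pool questions most similar to the query."""
    return sorted(pool, key=lambda q: cosine(q, query), reverse=True)[:k]


# Hypothetical demonstration pool and target question.
pool = [
    "what has keys but cannot open locks",
    "what gets wetter the more it dries",
    "what has a neck but no head",
]
print(select_examples(pool, "what has hands but cannot clap"))
```

A static strategy would pass the same demonstrations for every question; the dynamic version trades a retrieval step per query for demonstrations that share more surface structure with the target puzzle.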

5/28/2024

cs.CL