BAMO at SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense

2406.04947

Published 6/10/2024 by Baktash Ansari, Mohammadmostafa Rostamkhani, Sauleh Eetemadi

Abstract

This paper outlines our approach to SemEval 2024 Task 9, BRAINTEASER: A Novel Task Defying Common Sense. The task aims to evaluate the ability of language models to think creatively. The dataset comprises multi-choice questions that challenge models to think outside of the box. We fine-tune 2 models, BERT and RoBERTa Large. Next, we employ a Chain of Thought (CoT) zero-shot prompting approach with 6 large language models, such as GPT-3.5, Mixtral, and Llama2. Finally, we utilize ReConcile, a technique that employs a round table conference approach with multiple agents for zero-shot learning, to generate consensus answers among 3 selected language models. Our best method achieves an overall accuracy of 85 percent on the sentence puzzles subtask.
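The zero-shot Chain of Thought prompting and consensus steps mentioned in the abstract can be illustrated with a small Python sketch. This is not the authors' code, and the consensus function is a simplified confidence-weighted vote rather than the full ReConcile procedure. The function names, the prompt wording, the `Answer: <letter>` convention, and the example riddle are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from typing import List, Optional, Tuple

LETTERS = "ABCDEFGH"

def build_cot_prompt(question: str, choices: List[str]) -> str:
    """Format a multiple-choice puzzle as a zero-shot chain-of-thought prompt.

    The prompt wording is hypothetical, not the one used in the paper.
    """
    options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(choices))
    return (
        f"Question: {question}\n{options}\n"
        "Let's think step by step, then finish with 'Answer: <letter>'."
    )

def parse_answer(reply: str) -> Optional[str]:
    """Return the last 'Answer: X' letter found in a model reply, or None."""
    matches = re.findall(r"Answer:\s*([A-H])", reply)
    return matches[-1] if matches else None

def consensus(votes: List[Tuple[str, float]]) -> str:
    """Pick the option with the highest summed confidence across agents.

    A simplified stand-in for a round-table consensus step (assumption).
    """
    totals = defaultdict(float)
    for answer, confidence in votes:
        totals[answer] += confidence
    return max(totals, key=totals.get)

# Made-up riddle for illustration; it is not taken from the task dataset.
prompt = build_cot_prompt(
    "What has hands but cannot clap?", ["A clock", "A monkey", "A child"]
)
print(prompt)

# Pretend replies from three agents, each paired with a self-reported confidence.
replies = [
    ("Hands can mean clock hands. Answer: A", 0.6),
    ("Monkeys have hands. Answer: B", 0.9),
    ("A clock has hands. Answer: A", 0.5),
]
votes = [(parse_answer(text), conf) for text, conf in replies]
print(consensus(votes))
```

In this sketch, two moderately confident agents outweigh one highly confident agent because confidences are summed per option. A real multi-agent setup would also let the agents see each other's reasoning and revise their answers over several rounds.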

Overview

• This paper describes the BAMO team's approach to SemEval-2024 Task 9, "BRAINTEASER: A Novel Task Defying Common Sense".
• The task evaluates the ability of language models to think creatively, using multiple-choice questions that require thinking outside the box.
• The team fine-tunes BERT and RoBERTa Large, applies zero-shot Chain of Thought (CoT) prompting to 6 large language models, and uses ReConcile to generate consensus answers among 3 selected models.
• Other teams' papers on the same task include iREL at SemEval-2024 Task 9: Improving, AmAzUtahNLP at SemEval-2024 Task 9: Multichoice, and DaVinci at SemEval-2024 Task 9: Few.

Plain English Explanation

The paper describes one team's entry to "BRAINTEASER", a shared task in the SemEval-2024 competition. The task tests whether AI language models can think creatively by giving them multiple-choice puzzles whose correct answers defy the default common-sense reading.

Rather than judging whether a single statement is true or false, a model must pick one answer from several options, and the choice suggested by everyday associations is often the wrong one. The task pushes models to go beyond the literal meaning of the words and to consider less obvious readings of a puzzle, the way a person solving a riddle would.

The paper walks through three approaches: fine-tuning BERT and RoBERTa Large, zero-shot Chain of Thought prompting with 6 large language models such as GPT-3.5, Mixtral, and Llama2, and ReConcile, which has several models discuss a question in a round-table format to reach a consensus answer. Other teams, such as iREL, AmAzUtahNLP, and DaVinci, describe their own systems for the same task in separate papers.

Technical Explanation

The BAMO at SemEval-2024 Task 9: BRAINTEASER paper presents the BAMO team's system for the "BRAINTEASER" task of the SemEval-2024 competition. The task evaluates the ability of natural language processing (NLP) models to think creatively, using puzzles that defy default common-sense associations.

The dataset for this task comprises multiple-choice questions, split into Sentence Puzzle and Word Puzzle subtasks. For each question, a model must select the correct option. This challenges models to go beyond the literal meanings of the words and to consider unconventional interpretations, which the task organizers call lateral thinking.

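The approach has three parts. First, two models, BERT and RoBERTa Large, are fine-tuned on the task. Second, 6 large language models, including GPT-3.5, Mixtral, and Llama2, are prompted zero-shot with Chain of Thought (CoT). Third, ReConcile, a round-table conference technique with multiple agents, is used to generate consensus answers among 3 selected language models. The best method achieves an overall accuracy of 85 percent on the sentence puzzles subtask. Other teams took different routes: iREL used static and dynamic few-shot prompting with Gemini 1.0 Pro, AmAzUtahNLP fine-tuned pre-trained models in a multiple-choice architecture, and DaVinci used few-shot prompting on GPT-3.5.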

Critical Analysis

The BRAINTEASER task addressed in this paper is an interesting step in pushing the boundaries of natural language understanding. By challenging models with puzzles that defy common sense, the task goes beyond the recognition of literal word meanings and tests whether models can reach the unconventional answers that human solvers find.

However, the paper does not delve into the potential limitations or caveats of this approach. For example, it is unclear how the puzzles in the dataset were curated and whether there are biases or inconsistencies in how they were constructed. Additionally, the paper does not discuss what strong results on this task would imply: meaningful progress in creative reasoning, or simply an exposure of the shortcomings of current NLP models.

Furthermore, the paper could have explored alternative ways of evaluating this kind of reasoning, such as real-world scenarios or open-ended questions that require models to draw upon their understanding of the world. Relying on multiple-choice questions may not fully capture the nuances of how humans reason creatively.

Overall, the BRAINTEASER task represents an intriguing and novel direction for NLP research, but the paper could have provided a more comprehensive discussion of the potential limitations, challenges, and broader implications of this approach.

Conclusion

The BAMO at SemEval-2024 Task 9: BRAINTEASER paper presents one team's approach to a SemEval-2024 task that aims to push the boundaries of natural language understanding. The BRAINTEASER task challenges AI models with multiple-choice puzzles that defy common sense, forcing them to go beyond literal word meanings and think outside the box.

The paper describes fine-tuning BERT and RoBERTa Large, zero-shot Chain of Thought prompting with 6 large language models, and ReConcile-based consensus among 3 selected models, with the best method reaching an overall accuracy of 85 percent on the sentence puzzles subtask. While the BRAINTEASER task represents an interesting step in advancing reasoning in NLP, the paper could have provided a more comprehensive discussion of the potential limitations, challenges, and broader implications of this approach.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

iREL at SemEval-2024 Task 9: Improving Conventional Prompting Methods for Brain Teasers

Harshit Gupta, Manav Chaudhary, Tathagata Raha, Shivansh Subramanian, Vasudeva Varma

This paper describes our approach for SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense. The BRAINTEASER task comprises multiple-choice Question Answering designed to evaluate the models' lateral thinking capabilities. It consists of Sentence Puzzle and Word Puzzle subtasks that require models to defy default common-sense associations and exhibit unconventional thinking. We propose a unique strategy to improve the performance of pre-trained language models, notably the Gemini 1.0 Pro Model, in both subtasks. We employ static and dynamic few-shot prompting techniques and introduce a model-generated reasoning strategy that utilizes the LLM's reasoning capabilities to improve performance. Our approach demonstrated significant improvements, showing that it performed better than the baseline models by a considerable margin but fell short of performing as well as the human annotators, thus highlighting the efficacy of the proposed strategies.

5/28/2024

cs.CL

AmazUtah_NLP at SemEval-2024 Task 9: A MultiChoice Question Answering System for Commonsense Defying Reasoning

Mina Ghashami, Soumya Smruti Mishra

The SemEval 2024 BRAINTEASER task represents a pioneering venture in Natural Language Processing (NLP) by focusing on lateral thinking, a dimension of cognitive reasoning that is often overlooked in traditional linguistic analyses. This challenge comprises Sentence Puzzle and Word Puzzle subtasks and aims to test language models' capacity for divergent thinking. In this paper, we present our approach to the BRAINTEASER task. We employ a holistic strategy by leveraging cutting-edge pre-trained models in multiple choice architecture, and diversify the training data with Sentence and Word Puzzle datasets. To gain further improvement, we fine-tuned the model with a synthetic humor or jokes dataset and the RiddleSense dataset, which helped augment the model's lateral thinking abilities. Empirical results show that our approach achieves 92.5% accuracy in the Sentence Puzzle subtask and 80.2% accuracy in the Word Puzzle subtask.

5/21/2024

cs.CL cs.AI cs.IR cs.LG

SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense

Yifan Jiang, Filip Ilievski, Kaixin Ma

While vertical thinking relies on logical and commonsense reasoning, lateral thinking requires systems to defy commonsense associations and overwrite them through unconventional thinking. Lateral thinking has been shown to be challenging for current models but has received little attention. A recent benchmark, BRAINTEASER, aims to evaluate current models' lateral thinking ability in a zero-shot setting. In this paper, we split the original benchmark to also support fine-tuning setting and present SemEval Task 9: BRAINTEASER(S), the first task at this competition designed to test the system's reasoning and lateral thinking ability. As a popular task, BRAINTEASER(S)'s two subtasks receive 483 team submissions from 182 participants during the competition. This paper provides a fine-grained system analysis of the competition results, together with a reflection on what this means for the ability of the systems to reason laterally. We hope that the BRAINTEASER(S) subtasks and findings in this paper can stimulate future work on lateral thinking and robust reasoning by computational models.

4/26/2024

cs.AI cs.CL cs.LG

DaVinci at SemEval-2024 Task 9: Few-shot prompting GPT-3.5 for Unconventional Reasoning

Suyash Vardhan Mathur, Akshett Rai Jindal, Manish Shrivastava

While significant work has been done in the field of NLP on vertical thinking, which involves primarily logical thinking, little work has been done towards lateral thinking, which involves looking at problems from an unconventional perspective and defying existing conceptions and notions. Towards this direction, SemEval 2024 introduces the task of BRAINTEASER, which involves two types of questions -- Sentence Puzzles and Word Puzzles that defy conventional common-sense reasoning and constraints. In this paper, we tackle both types of questions using few-shot prompting on GPT-3.5 and gain insights regarding the difference in the nature of the two types. Our prompting strategy placed us 26th on the leaderboard for the Sentence Puzzle and 15th on the Word Puzzle task.

5/21/2024

cs.CL cs.AI