1. NAACL
    Accelerating LLM Inference by Enabling Intermediate Layer Decoding
    Neeraj Varshney, Agneet Chatterjee, Mihir Parmar, and Chitta Baral

    Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks; however, their large size makes their inference slow and computationally expensive which poses a practical challenge for resource constrained real-world applications. Focusing on this problem, we propose to instruction tune LLMs in a way that enables intermediate layer decoding for efficiently generating text, but importantly without compromising the quality of the generation. Specifically, we instruction tune LLMs with additional explicit Losses from the InTermediate layErs (LITE) and show that it enables these layers to acquire 'good' generation ability without affecting the generation ability of the final layer. We perform 'dynamic confidence-based early exiting' at token level from the intermediate layers which improves the efficiency of inference while maintaining the generation quality. We conduct comprehensive experiments by instruction tuning LLaMA-2 models on the widely used Alpaca dataset and holistically evaluate on four different human-instruction test sets: Vicuna, WizardLM, Koala, and Self-Instruct. We show that 'dynamic early exiting' achieves consistent and considerable cost improvements (37.86% on average) while maintaining the generation quality of the responses. We further conduct a thorough analysis of the results over several important aspects, such as comparing the semantic similarity of the outputs and dissecting the efficiency improvements by comparing the number of tokens generated in the output. In summary, our work contributes to improving the efficiency of LLM inference while maintaining the generation quality, a crucial step en route to enabling their widespread adoption.
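
    The token-level 'dynamic early exiting' described above can be summarized in a short sketch (a minimal illustration under assumed interfaces, not the paper's released LITE code): at each decoding step, the decoder walks up the layers and emits a token from the first layer whose confidence clears a threshold, falling back to the final layer otherwise. Here, layer_token_probs is a hypothetical stand-in for running the model up to a given layer and projecting its hidden state through the LM head.

        from typing import Callable, Dict, List

        def decode_with_early_exit(
            prefix: List[str],
            layer_token_probs: Callable[[List[str], int], Dict[str, float]],
            num_layers: int,
            threshold: float = 0.9,
            max_new_tokens: int = 32,
            eos_token: str = "</s>",
        ) -> List[str]:
            generated: List[str] = []
            for _ in range(max_new_tokens):
                token = None
                for layer in range(1, num_layers + 1):
                    probs = layer_token_probs(prefix + generated, layer)
                    best_token, best_prob = max(probs.items(), key=lambda kv: kv[1])
                    # Exit from this intermediate layer if it is confident enough;
                    # otherwise fall through to the final layer.
                    if best_prob >= threshold or layer == num_layers:
                        token = best_token
                        break
                generated.append(token)
                if token == eos_token:
                    break
            return generated
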
              @article{varshney2023accelerating,
      title={Accelerating LLM Inference by Enabling Intermediate Layer Decoding},
      author={Varshney, Neeraj and Chatterjee, Agneet and Parmar, Mihir and Baral, Chitta},
      journal={arXiv preprint arXiv:2310.18581},
      year={2023}
    }

2024

  1. Preprint
    A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation
    Neeraj Varshney, Wenlin Yao, Hongming Zhang, Jianshu Chen, and Dong Yu

    Recently developed large language models have achieved remarkable success in generating fluent and coherent text. However, these models often tend to 'hallucinate' which critically hampers their reliability. In this work, we address this crucial problem and propose an approach that actively detects and mitigates hallucinations during the generation process. Specifically, we first identify the candidates of potential hallucination leveraging the model's logit output values, check their correctness through a validation procedure, mitigate the detected hallucinations, and then continue with the generation process. Through extensive experiments with GPT-3.5 (text-davinci-003) on the 'article generation task', we first demonstrate the individual efficacy of our detection and mitigation techniques. Specifically, the detection technique achieves a recall of ~88% and the mitigation technique successfully mitigates 57.6% of the correctly detected hallucinations. Importantly, our mitigation technique does not introduce new hallucinations even in the case of incorrectly detected hallucinations, i.e., false positives. Then, we show that the proposed active detection and mitigation approach successfully reduces the hallucinations of the GPT-3.5 model from 47.5% to 14.5% on average. We further demonstrate the effectiveness and wide applicability of our approach through additional studies including performance on different types of questions (multi-hop and false premise questions) and with another LLM from a different model family (Vicuna). In summary, our work contributes to improving the reliability and trustworthiness of large language models, a crucial step en route to enabling their widespread adoption in real-world applications.
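
    The detect-validate-mitigate loop described above can be sketched as follows (assumed interfaces, not the released implementation; generate_sentence, validate, and repair are hypothetical stand-ins): concepts whose token-level probability falls below a threshold are flagged as hallucination candidates, checked against external evidence, and repaired before generation continues.

        from typing import Callable, List, Tuple

        def generate_with_validation(
            prompt: str,
            generate_sentence: Callable[[str], Tuple[str, List[Tuple[str, float]]]],
            validate: Callable[[str, str], bool],     # (concept, sentence) -> supported?
            repair: Callable[[str, List[str]], str],  # rewrite sentence given bad concepts
            num_sentences: int = 5,
            confidence_threshold: float = 0.5,
        ) -> str:
            text = prompt
            for _ in range(num_sentences):
                sentence, concept_probs = generate_sentence(text)
                # Detection: low-probability concepts are hallucination candidates.
                candidates = [c for c, p in concept_probs if p < confidence_threshold]
                # Validation and mitigation before continuing the generation.
                unsupported = [c for c in candidates if not validate(c, sentence)]
                if unsupported:
                    sentence = repair(sentence, unsupported)
                text = text + " " + sentence
            return text
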
              @article{varshney2023stitch,
      title={A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation},
      author={Varshney, Neeraj and Yao, Wenlin and Zhang, Hongming and Chen, Jianshu and Yu, Dong},
      journal={arXiv preprint arXiv:2307.03987},
      year={2023}
    }

2023

  1. Preprint
    The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness
    Neeraj Varshney, Pavel Dolin, Agastya Seth, and Chitta Baral

    As Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications, their safety concerns become critical areas of NLP research. This paper presents Safety and Over-Defensiveness Evaluation (SODE) benchmark: a collection of diverse safe and unsafe prompts with carefully designed evaluation methods that facilitate systematic evaluation, comparison, and analysis over 'safety' and 'over-defensiveness.' With SODE, we study a variety of LLM defense strategies over multiple state-of-the-art LLMs, which reveals several interesting and important findings, such as (a) the widely popular 'self-checking' techniques indeed improve the safety against unsafe inputs, but this comes at the cost of extreme over-defensiveness on the safe inputs, (b) providing a safety instruction along with in-context exemplars (of both safe and unsafe inputs) consistently improves safety and also mitigates undue over-defensiveness of the models, (c) providing contextual knowledge easily breaks the safety guardrails and makes the models more vulnerable to generating unsafe responses. Overall, our work reveals numerous such critical findings that we believe will pave the way and facilitate further research in improving the safety of LLMs.
              @article{varshney2023art,
      title={The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness},
      author={Varshney, Neeraj and Dolin, Pavel and Seth, Agastya and Baral, Chitta},
      journal={arXiv preprint},
      year={2023}
    }

2023

  1. EMNLP
    LogicAttack: Adversarial Attacks for Evaluating Logical Consistency of Natural Language Inference
    Mutsumi Nakamura, Santosh Mashetty, Mihir Parmar, Neeraj Varshney and Chitta Baral

    Findings of the Association for Computational Linguistics: EMNLP 2023

    Recently Large Language Models (LLMs) such as GPT-3, ChatGPT, and FLAN have led to impressive progress in Natural Language Inference (NLI) tasks. However, these models may rely on simple heuristics or artifacts in the evaluation data to achieve their high performance, which suggests that they still suffer from logical inconsistency. To assess the logical consistency of these models, we propose a LogicAttack, a method to attack NLI models using diverse logical forms of premise and hypothesis, providing a more robust evaluation of their performance. Our approach leverages a range of inference rules from propositional logic, such as Modus Tollens and Bidirectional Dilemma, to generate effective adversarial attacks and identify common vulnerabilities across multiple NLI models. We achieve an average ~53% Attack Success Rate (ASR) across multiple logic-based attacks. Moreover, we demonstrate that incorporating generated attack samples into training enhances the logical reasoning ability of the target model and decreases its vulnerability to logic-based attacks. Data and source code are available at https://github.com/msantoshmadhav/LogicAttack.
              
    @inproceedings{nakamura-etal-2023-logicattack,
        title = "{L}ogic{A}ttack: Adversarial Attacks for Evaluating Logical Consistency of Natural Language Inference",
        author = "Nakamura, Mutsumi  and
          Mashetty, Santosh  and
          Parmar, Mihir  and
          Varshney, Neeraj  and
          Baral, Chitta",
        editor = "Bouamor, Houda  and
          Pino, Juan  and
          Bali, Kalika",
        booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
        month = dec,
        year = "2023",
        address = "Singapore",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2023.findings-emnlp.889",
        pages = "13322--13334",
        abstract = "Recently Large Language Models (LLMs) such as GPT-3, ChatGPT, and FLAN have led to impressive progress in Natural Language Inference (NLI) tasks. However, these models may rely on simple heuristics or artifacts in the evaluation data to achieve their high performance, which suggests that they still suffer from logical inconsistency. To assess the logical consistency of these models, we propose a LogicAttack, a method to attack NLI models using diverse logical forms of premise and hypothesis, providing a more robust evaluation of their performance. Our approach leverages a range of inference rules from propositional logic, such as Modus Tollens and Bidirectional Dilemma, to generate effective adversarial attacks and identify common vulnerabilities across multiple NLI models. We achieve an average {\textasciitilde}53{\%} Attack Success Rate (ASR) across multiple logic-based attacks. Moreover, we demonstrate that incorporating generated attack samples into training enhances the logical reasoning ability of the target model and decreases its vulnerability to logic-based attacks. Data and source code are available at https://github.com/msantoshmadhav/LogicAttack.",
    }

2023

  1. ACL
    Post-Abstention: Towards Reliably Re-Attempting the Abstained Instances in QA
    Neeraj Varshney and Chitta Baral

    Association for Computational Linguistics 2023

    Despite remarkable progress made in natural language processing, even the state-of-the-art models often make incorrect predictions. Such predictions hamper the reliability of systems and limit their widespread adoption in real-world applications. 'Selective prediction' partly addresses the above concern by enabling models to abstain from answering when their predictions are likely to be incorrect. While selective prediction is advantageous, it leaves us with a pertinent question 'what to do after abstention'. To this end, we present an explorative study on 'Post-Abstention', a task that allows re-attempting the abstained instances with the aim of increasing 'coverage' of the system without significantly sacrificing its 'accuracy'. We first provide mathematical formulation of this task and then explore several methods to solve it. Comprehensive experiments on 11 QA datasets show that these methods lead to considerable risk improvements --performance metric of the Post-Abstention task-- both in the in-domain and the out-of-domain settings. We also conduct a thorough analysis of these results which further leads to several interesting findings. Finally, we believe that our work will encourage and facilitate further research in this important area of addressing the reliability of NLP systems.
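
    The Post-Abstention setting can be illustrated with a small sketch (hypothetical interfaces, not the paper's code): the base system abstains on low-confidence instances, and a post-abstention method re-attempts only those instances, increasing coverage while keeping risk in check.

        from typing import Callable, Optional, Tuple

        def answer_with_post_abstention(
            question: str,
            base: Callable[[str], Tuple[str, float]],       # returns (answer, confidence)
            reattempt: Callable[[str], Tuple[str, float]],  # stronger / differently resourced method
            base_threshold: float = 0.8,
            reattempt_threshold: float = 0.9,
        ) -> Optional[str]:
            answer, confidence = base(question)
            if confidence >= base_threshold:
                return answer
            # Re-attempt only the abstained instance; abstain again if still unconfident.
            answer, confidence = reattempt(question)
            return answer if confidence >= reattempt_threshold else None
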
              @inproceedings{varshney-baral-2023-post,
        title = "Post-Abstention: Towards Reliably Re-Attempting the Abstained Instances in {QA}",
        author = "Varshney, Neeraj  and
          Baral, Chitta",
        booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
        month = jul,
        year = "2023",
        address = "Toronto, Canada",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2023.acl-long.55",
        pages = "967--982",
        abstract = "Despite remarkable progress made in natural language processing, even the state-of-the-art models often make incorrect predictions. Such predictions hamper the reliability of systems and limit their widespread adoption in real-world applications. {`}Selective prediction{'} partly addresses the above concern by enabling models to abstain from answering when their predictions are likely to be incorrect. While selective prediction is advantageous, it leaves us with a pertinent question {`}what to do after abstention{'}. To this end, we present an explorative study on {`}Post-Abstention{'}, a task that allows re-attempting the abstained instances with the aim of increasing **coverage** of the system without significantly sacrificing its **accuracy**. We first provide mathematical formulation of this task and then explore several methods to solve it. Comprehensive experiments on 11 QA datasets show that these methods lead to considerable risk improvements {--}performance metric of the Post-Abstention task{--} both in the in-domain and the out-of-domain settings. We also conduct a thorough analysis of these results which further leads to several interesting findings. Finally, we believe that our work will encourage and facilitate further research in this important area of addressing the reliability of NLP systems.",
    }

2023

  1. ACL
    A Unified Evaluation Framework for Novelty Detection and Accommodation in NLP with an Instantiation in Authorship Attribution
    Neeraj Varshney, Himanshu Gupta, Eric Robertson, Bing Liu, and Chitta Baral

    Findings of Association for Computational Linguistics 2023

    State-of-the-art natural language processing models have been shown to achieve remarkable performance in 'closed-world' settings where all the labels in the evaluation set are known at training time. However, in real-world settings, `novel' instances that do not belong to any known class are often observed. This renders the ability to deal with novelties crucial. To initiate a systematic research in this important area of 'dealing with novelties', we introduce 'NoveltyTask', a multi-stage task to evaluate a system's performance on pipelined novelty 'detection' and 'accommodation' tasks. We provide mathematical formulation of NoveltyTask and instantiate it with the authorship attribution task that pertains to identifying the correct author of a given text. We use Amazon reviews corpus and compile a large dataset (consisting of 250k instances across 200 authors/labels) for NoveltyTask. We conduct comprehensive experiments and explore several baseline methods for the task. Our results show that the methods achieve considerably low performance making the task challenging and leaving sufficient room for improvement. Finally, we believe our work will encourage research in this underexplored area of dealing with novelties, an important step en route to developing robust systems.
              @inproceedings{varshney-etal-2023-unified,
        title = "A Unified Evaluation Framework for Novelty Detection and Accommodation in {NLP} with an Instantiation in Authorship Attribution",
        author = "Varshney, Neeraj  and
          Gupta, Himanshu  and
          Robertson, Eric  and
          Liu, Bing  and
          Baral, Chitta",
        booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
        month = jul,
        year = "2023",
        address = "Toronto, Canada",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2023.findings-acl.113",
        pages = "1794--1818",
        abstract = "State-of-the-art natural language processing models have been shown to achieve remarkable performance in {`}closed-world{'} settings where all the labels in the evaluation set are known at training time. However, in real-world settings, {`}novel{'} instances that do not belong to any known class are often observed. This renders the ability to deal with novelties crucial. To initiate a systematic research in this important area of {`}dealing with novelties{'}, we introduce NoveltyTask, a multi-stage task to evaluate a system{'}s performance on pipelined novelty {`}detection{'} and {`}accommodation{'} tasks. We provide mathematical formulation of NoveltyTask and instantiate it with the authorship attribution task that pertains to identifying the correct author of a given text. We use amazon reviews corpus and compile a large dataset (consisting of 250k instances across 200 authors/labels) for NoveltyTask. We conduct comprehensive experiments and explore several baseline methods for the task. Our results show that the methods achieve considerably low performance making the task challenging and leaving sufficient room for improvement. Finally, we believe our work will encourage research in this underexplored area of dealing with novelties, an important step en route to developing robust systems.",
    }

2023

  1. AAAI
    Can Open-Domain QA Reader Utilize External Knowledge Efficiently like Humans?
    Neeraj Varshney, Man Luo, and Chitta Baral

    AAAI'23 Workshop on Knowledge Augmented Methods for NLP

    Recent state-of-the-art open-domain QA models are typically based on a two stage retriever-reader approach in which the retriever first finds the relevant knowledge/passages and the reader then leverages that to predict the answer. Prior work has shown that the performance of the reader usually tends to improve with the increase in the number of these passages. Thus, state-of-the-art models use a large number of passages (e.g. 100) for inference. While the reader in this approach achieves high prediction performance, its inference is computationally very expensive. We humans, on the other hand, use a more efficient strategy while answering: firstly, if we can confidently answer the question using our already acquired knowledge then we do not even use the external knowledge, and in the case when we do require external knowledge, we don't read the entire knowledge at once, instead, we only read that much knowledge that is sufficient to find the answer. Motivated by this procedure, we ask a research question "Can the open-domain QA reader utilize external knowledge efficiently like humans without sacrificing the prediction performance?"
    Driven by this question, we explore an approach that utilizes both 'closed-book' (leveraging knowledge already present in the model parameters) and 'open-book' inference (leveraging external knowledge). Furthermore, instead of using a large fixed number of passages for open-book inference, we dynamically read the external knowledge in multiple 'knowledge iterations'. Through comprehensive experiments on NQ and TriviaQA datasets, we demonstrate that this dynamic reading approach improves both the 'inference efficiency' and the 'prediction accuracy' of the reader. Comparing with the FiD reader, this approach matches its accuracy by utilizing just 18.32% of its reader inference cost and also outperforms it by achieving up to 55.10% accuracy on NQ Open.
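
    This reading strategy can be sketched in a few lines (assumed interfaces, not the released system): answer closed-book if confident, otherwise read retrieved passages a few at a time ('knowledge iterations') and stop as soon as the reader's confidence clears a threshold.

        from typing import Callable, List, Tuple

        def answer_with_dynamic_reading(
            question: str,
            passages: List[str],                                   # retrieved, ranked passages
            closed_book: Callable[[str], Tuple[str, float]],       # parametric knowledge only
            open_book: Callable[[str, List[str]], Tuple[str, float]],
            threshold: float = 0.8,
            passages_per_iteration: int = 10,
        ) -> str:
            answer, confidence = closed_book(question)
            if confidence >= threshold:
                return answer  # parametric knowledge alone suffices
            read: List[str] = []
            for start in range(0, len(passages), passages_per_iteration):
                read.extend(passages[start:start + passages_per_iteration])
                answer, confidence = open_book(question, read)
                if confidence >= threshold:
                    break  # stop reading once sufficiently confident
            return answer
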
    @article{varshney2022can,
      title={Can Open-Domain QA Reader Utilize External Knowledge Efficiently like Humans?},
      author={Varshney, Neeraj and Luo, Man and Baral, Chitta},
      journal={arXiv preprint arXiv:2211.12707},
      year={2022}
    }

2023

  1. EACL
    "John is 50 years old, can his son be 65?" Evaluating NLP Models' Understanding of Feasibility

    European Chapter of the Association for Computational Linguistics 2023

    In current NLP research, large-scale language models and their abilities are widely being discussed. Some recent works have also found notable failures of these models. Often these failure examples involve complex reasoning abilities. This work focuses on a simple commonsense ability, reasoning about when an action (or its effect) is feasible. We introduce FeasibilityQA, a question-answering dataset involving binary classification (BCQ) and multi-choice multi-correct questions (MCQ) that test understanding of feasibility. We show that even state-of-the-art models such as GPT-3 struggle to answer the feasibility questions correctly. Specifically, on (MCQ, BCQ) questions, GPT-3 achieves accuracy of just (19%, 62%) and (25%, 64%) in zero-shot and few-shot settings, respectively. We also evaluate models by providing relevant knowledge statements required to answer the question and find that the additional knowledge leads to a 7% gain in performance, but the overall performance still remains low. These results make one wonder how much commonsense knowledge about action feasibility is encoded in GPT-3 and how well the model can reason about it.
              @inproceedings{gupta-etal-2023-john,
        title = "{``}John is 50 years old, can his son be 65?{''} Evaluating {NLP} Models{'} Understanding of Feasibility",
        author = "Gupta, Himanshu  and
          Varshney, Neeraj  and
          Mishra, Swaroop  and
          Pal, Kuntal Kumar  and
          Sawant, Saurabh Arjun  and
          Scaria, Kevin  and
          Goyal, Siddharth  and
          Baral, Chitta",
        booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
        month = may,
        year = "2023",
        address = "Dubrovnik, Croatia",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2023.eacl-main.30",
        pages = "407--417",
        abstract = "In current NLP research, large-scale language models and their abilities are widely being discussed. Some recent works have also found notable failures of these models. Often these failure examples involve complex reasoning abilities. This work focuses on a simple commonsense ability, reasoning about when an action (or its effect) is feasible. To this end, we introduce FeasibilityQA, a question-answering dataset involving binary classification (BCQ) and multi-choice multi-correct questions (MCQ) that test understanding of feasibility. We show that even state-of-the-art models such as GPT-3, GPT-2, and T5 struggle to answer the feasibility questions correctly. Specifically, on (MCQ, BCQ) questions, GPT-3 achieves accuracy of just (19{\%}, 62{\%}) and (25{\%}, 64{\%}) in zero-shot and few-shot settings, respectively. We also evaluate models by providing relevant knowledge statements required to answer the question and find that the additional knowledge leads to a 7{\%} gain in performance, but the overall performance still remains low. These results make one wonder how much commonsense knowledge about action feasibility is encoded in state-of-the-art models and how well they can reason about it.",
    }

2023

  1. ACL
    On Dealing with Questions that Don't have Definitive Answers
    Neeraj Varshney*, Ayushi Agarwal*, Nisarg Patel*, Mihir Parmar, Pavan Mallina, Aryan Shah, Srihari Raju Sangaraju, Tirth Patel, Nihar Thakkar, and Chitta Baral

    TrustNLP @ Association for Computational Linguistics 2023

    Though state-of-the-art (SOTA) NLP systems have achieved remarkable performance on a variety of language understanding tasks, they primarily focus on questions that have a correct and a definitive answer. However, in real-world applications, users often ask questions that don't have a definitive answer. Incorrectly answering such questions certainly hampers a system's reliability and trustworthiness. Can SOTA models accurately identify such questions and provide a reasonable response? To investigate the above question, we introduce QnotA, a dataset consisting of five different categories of questions that don't have definitive answers. Furthermore, for each QnotA instance, we also provide a corresponding QA instance i.e., an alternate question that 'can be' answered. With this data, we formulate three evaluation tasks that test a system's ability to 'identify', 'distinguish', and 'justify' QnotA questions. Through comprehensive experiments, we show that even SOTA models including GPT-3 and Flan T5 do not fare well on these tasks and lag considerably behind the human performance baseline. We conduct a thorough analysis which further leads to several interesting findings. Overall, we believe our work and findings will encourage and facilitate further research in this important area and help develop more robust models.
              @article{agarwal2023can,
      title={Can NLP Models 'Identify', 'Distinguish', and 'Justify' Questions that Don't have a Definitive Answer?},
      author={Agarwal, Ayushi and Patel, Nisarg and Varshney, Neeraj and Parmar, Mihir and Mallina, Pavan and Shah, Aryan Bhavin and Sangaraju, Srihari Raju and Patel, Tirth and Thakkar, Nihar and Baral, Chitta},
      journal={arXiv preprint arXiv:2309.04635},
      year={2023}
    }

2023

  1. AAMAS
    Methods and Mechanisms for Interactive Novelty Handling in Adversarial Environments
    Tung Thai, Ming Shen, Mayang Garg, Ayush Kalani, Nakul Vaidya, Utkarsh Soni, Mudit Verma, Sriram Gopalakrishnan, Neeraj Varshney, Chitta Baral, Subbarao Kambhampati, Jivko Sinapov, Matthias Scheutz

    Extended Abstract at the Conference on Autonomous Agents and Multiagent Systems

2023

  1. EMNLP
    Model Cascading: Towards Jointly Improving Efficiency and Accuracy of NLP Systems
    Neeraj Varshney and Chitta Baral

    Conference on Empirical Methods in Natural Language Processing

    Do all instances need inference through the big models for a correct prediction?
    Perhaps not; some instances are easy and can be answered correctly by even small capacity models. This provides opportunities for improving the computational efficiency of systems. In this work, we present an explorative study on 'model cascading', a simple technique that utilizes a collection of models of varying capacities to accurately yet efficiently output predictions. Through comprehensive experiments in multiple task settings that differ in the number of models available for cascading (K value), we show that cascading improves both the computational efficiency and the prediction accuracy. For instance, in K=3 setting, cascading saves up to 88.93% computation cost and consistently achieves superior prediction accuracy with an improvement of up to 2.18%. We also study the impact of introducing additional models in the cascade and show that it further increases the efficiency improvements. Finally, we hope that our work will facilitate development of efficient NLP systems making their widespread adoption in real-world applications possible.
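
    The cascading mechanism can be summarized with a small sketch (illustrative only; the models list and thresholds are hypothetical): each instance is first routed to the smallest model and escalated to larger models only when the MaxProb confidence falls below that model's threshold.

        from typing import Callable, Dict, List

        def cascade_predict(
            x: str,
            models: List[Callable[[str], Dict[str, float]]],  # ordered small -> large
            thresholds: List[float],                          # one per model except the last
        ) -> str:
            for model, threshold in zip(models[:-1], thresholds):
                probs = model(x)
                label, maxprob = max(probs.items(), key=lambda kv: kv[1])
                if maxprob >= threshold:
                    return label  # confident enough; skip the larger models
            # The largest model's prediction is always accepted.
            probs = models[-1](x)
            return max(probs, key=probs.get)
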
    @inproceedings{varshney-baral-2022-model,
        title = "Model Cascading: Towards Jointly Improving Efficiency and Accuracy of {NLP} Systems",
        author = "Varshney, Neeraj  and
          Baral, Chitta",
        booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
        month = dec,
        year = "2022",
        address = "Abu Dhabi, United Arab Emirates",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2022.emnlp-main.756",
        pages = "11007--11021",
        abstract = "Do all instances need inference through the big models for a correct prediction? Perhaps not; some instances are easy and can be answered correctly by even small capacity models. This provides opportunities for improving the computational efficiency of systems. In this work, we present an explorative study on {`}model cascading{'}, a simple technique that utilizes a collection of models of varying capacities to accurately yet efficiently output predictions. Through comprehensive experiments in multiple task settings that differ in the number of models available for cascading (K value), we show that cascading improves both the computational efficiency and the prediction accuracy. For instance, in K=3 setting, cascading saves up to 88.93{\%} computation cost and consistently achieves superior prediction accuracy with an improvement of up to 2.18{\%}. We also study the impact of introducing additional models in the cascade and show that it further increases the efficiency improvements. Finally, we hope that our work will facilitate development of efficient NLP systems making their widespread adoption in real-world applications possible.",
    }

2022

  1. EMNLP
    Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks

    Conference on Empirical Methods in Natural Language Processing

    How well can NLP models generalize to a variety of unseen tasks when provided with task instructions? To address this question, we first introduce Super-NaturalInstructions, a benchmark of 1,616 diverse NLP tasks and their expert-written instructions. Our collection covers 76 distinct task types, including but not limited to classification, extraction, infilling, sequence tagging, text rewriting, and text composition. This large and diverse collection of tasks enables rigorous benchmarking of cross-task generalization under instructions—training models to follow instructions on a subset of tasks and evaluating them on the remaining unseen ones. Furthermore, we build Tk-Instruct, a transformer model trained to follow a variety of in-context instructions (plain language task definitions or k-shot examples). Our experiments show that Tk-Instruct outperforms existing instruction-following models such as InstructGPT by over 9% on our benchmark despite being an order of magnitude smaller. We further analyze generalization as a function of various scaling parameters, such as the number of observed tasks, the number of instances per task, and model sizes. We hope our dataset and model facilitate future progress towards more general-purpose NLP models.
              @inproceedings{wang-etal-2022-super,
        title = "Super-{N}atural{I}nstructions: Generalization via Declarative Instructions on 1600+ {NLP} Tasks",
        author = "Wang, Yizhong  and
          Mishra, Swaroop  and
          Alipoormolabashi, Pegah  and
          Kordi, Yeganeh  and
          Mirzaei, Amirreza  and
          Naik, Atharva  and
          Ashok, Arjun  and
          Dhanasekaran, Arut Selvan  and
          Arunkumar, Anjana  and
          Stap, David  and
          Pathak, Eshaan  and
          Karamanolakis, Giannis  and
          Lai, Haizhi  and
          Purohit, Ishan  and
          Mondal, Ishani  and
          Anderson, Jacob  and
          Kuznia, Kirby  and
          Doshi, Krima  and
          Pal, Kuntal Kumar  and
          Patel, Maitreya  and
          Moradshahi, Mehrad  and
          Parmar, Mihir  and
          Purohit, Mirali  and
          Varshney, Neeraj  and
          Kaza, Phani Rohitha  and
          Verma, Pulkit  and
          Puri, Ravsehaj Singh  and
          Karia, Rushang  and
          Doshi, Savan  and
          Sampat, Shailaja Keyur  and
          Mishra, Siddhartha  and
          Reddy A, Sujan  and
          Patro, Sumanta  and
          Dixit, Tanay  and
          Shen, Xudong",
        booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
        month = dec,
        year = "2022",
        address = "Abu Dhabi, United Arab Emirates",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2022.emnlp-main.340",
        pages = "5085--5109",
        abstract = "How well can NLP models generalize to a variety of unseen tasks when provided with task instructions? To address this question, we first introduce Super-NaturalInstructions, a benchmark of 1,616 diverse NLP tasks and their expert-written instructions. Our collection covers 76 distinct task types, including but not limited to classification, extraction, infilling, sequence tagging, text rewriting, and text composition. This large and diverse collection of tasks enables rigorous benchmarking of cross-task generalization under instructions{---}training models to follow instructions on a subset of tasks and evaluating them on the remaining unseen ones.Furthermore, we build Tk-Instruct, a transformer model trained to follow a variety of in-context instructions (plain language task definitions or k-shot examples). Our experiments show that Tk-Instruct outperforms existing instruction-following models such as InstructGPT by over 9{\%} on our benchmark despite being an order of magnitude smaller. We further analyze generalization as a function of various scaling parameters, such as the number of observed tasks, the number of instances per task, and model sizes. We hope our dataset and model facilitate future progress towards more general-purpose NLP models.",
    }

2022

  1. ACL
    Unsupervised Natural Language Inference Using PHL Triplet Generation
    Neeraj Varshney, Pratyay Banerjee, Tejas Gokhale, and Chitta Baral

    Findings of Association for Computational Linguistics

    We explore three unsupervised settings for NLI and propose a procedural data generation approach that outperforms the existing approaches by ~13% and raises the state-of-the-art unsupervised performance on SNLI to 66.75%.

    Transformer-based models achieve impressive performance on numerous Natural Language Inference (NLI) benchmarks when trained on respective training datasets. However, in certain cases, training samples may not be available or collecting them could be time-consuming and resource-intensive. In this work, we address the above challenge and present an explorative study on unsupervised NLI, a paradigm in which no human-annotated training samples are available. We investigate it under three settings: PH, P, and NPH that differ in the extent of unlabeled data available for learning. As a solution, we propose a procedural data generation approach that leverages a set of sentence transformations to collect PHL (Premise, Hypothesis, Label) triplets for training NLI models, bypassing the need for human-annotated training data. Comprehensive experiments with several NLI datasets show that the proposed approach results in accuracies of up to 66.75%, 65.9%, 65.39% in PH, P, and NPH settings respectively, outperforming all existing unsupervised baselines. Furthermore, fine-tuning our model with as little as ~0.1% of the human-annotated training dataset (500 instances) leads to 12.2% higher accuracy than the model trained from scratch on the same 500 instances. Supported by this superior performance, we conclude with a recommendation for collecting high-quality task-specific data.
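
    The procedural generation idea can be illustrated with two toy transformations (hypothetical stand-ins for the paper's full transformation set): truncating a premise yields an entailed hypothesis, while negating its verb yields a contradiction.

        import random
        from typing import List, Tuple

        NEGATIONS = {"is": "is not", "are": "are not", "can": "cannot"}

        def entailment_by_truncation(premise: str) -> Tuple[str, str, str]:
            # Dropping trailing words keeps the hypothesis entailed by the premise.
            words = premise.rstrip(".").split()
            hypothesis = " ".join(words[: max(2, len(words) - 2)]) + "."
            return premise, hypothesis, "entailment"

        def contradiction_by_negation(premise: str) -> Tuple[str, str, str]:
            words = [NEGATIONS.get(w, w) for w in premise.rstrip(".").split()]
            return premise, " ".join(words) + ".", "contradiction"

        def generate_phl(premises: List[str]) -> List[Tuple[str, str, str]]:
            triplets = []
            for p in premises:
                triplets.append(entailment_by_truncation(p))
                if any(w in NEGATIONS for w in p.split()):
                    triplets.append(contradiction_by_negation(p))
            random.shuffle(triplets)
            return triplets

        print(generate_phl(["A man is playing a guitar on stage."]))
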
    @inproceedings{varshney-etal-2022-unsupervised,
        title = "Unsupervised Natural Language Inference Using {PHL} Triplet Generation",
        author = "Varshney, Neeraj  and
          Banerjee, Pratyay  and
          Gokhale, Tejas  and
          Baral, Chitta",
        booktitle = "Findings of the Association for Computational Linguistics: ACL 2022",
        month = may,
        year = "2022",
        address = "Dublin, Ireland",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2022.findings-acl.159",
        pages = "2003--2016",
        abstract = "Transformer-based models achieve impressive performance on numerous Natural Language Inference (NLI) benchmarks when trained on respective training datasets. However, in certain cases, training samples may not be available or collecting them could be time-consuming and resource-intensive. In this work, we address the above challenge and present an explorative study on unsupervised NLI, a paradigm in which no human-annotated training samples are available. We investigate it under three settings: PH, P, and NPH that differ in the extent of unlabeled data available for learning. As a solution, we propose a procedural data generation approach that leverages a set of sentence transformations to collect PHL (Premise, Hypothesis, Label) triplets for training NLI models, bypassing the need for human-annotated training data. Comprehensive experiments with several NLI datasets show that the proposed approach results in accuracies of up to 66.75{\%}, 65.9{\%}, 65.39{\%} in PH, P, and NPH settings respectively, outperforming all existing unsupervised baselines. Furthermore, fine-tuning our model with as little as {\textasciitilde}0.1{\%} of the human-annotated training dataset (500 instances) leads to 12.2{\%} higher accuracy than the model trained from scratch on the same 500 instances. Supported by this superior performance, we conclude with a recommendation for collecting high-quality task-specific data.",
    }

2022

  1. ACL
    Investigating Selective Prediction Approaches Across Several Tasks in IID, OOD, and Adversarial Settings
    Neeraj Varshney, Swaroop Mishra, Chitta Baral

    Findings of Association for Computational Linguistics

    Selective Prediction enables systems to abstain from making predictions when they are likely to be incorrect. In this work, we systematically study 'selective prediction' in a large-scale setup of 17 datasets across several NLP tasks. We conduct experiments in in-domain, out-of-domain, and adversarial settings and evaluate several selective prediction approaches such as MaxProb, Monte-Carlo Dropout, Label Smoothing, and Calibration (C, R, and T). Our investigation results in numerous interesting findings.

    In order to equip NLP systems with selective prediction capability, several task-specific approaches have been proposed. However, which approaches work best across tasks or even if they consistently outperform the simplest baseline 'MaxProb' remains to be explored. To this end, we systematically study 'selective prediction' in a large-scale setup of 17 datasets across several NLP tasks. Through comprehensive experiments under in-domain (IID), out-of-domain (OOD), and adversarial (ADV) settings, we show that despite leveraging additional resources (held-out data/computation), none of the existing approaches consistently and considerably outperforms MaxProb in all three settings. Furthermore, their performance does not translate well across tasks. For instance, Monte-Carlo Dropout outperforms all other approaches on Duplicate Detection datasets but does not fare well on NLI datasets, especially in the OOD setting. Thus, we recommend that future selective prediction approaches should be evaluated across tasks and settings for reliable estimation of their capabilities.
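
    The MaxProb baseline referenced above, and the coverage/risk trade-off used to compare approaches, can be sketched as follows (toy numbers, for illustration only): abstain whenever the maximum softmax probability is below a threshold, then report coverage (fraction answered) and risk (error rate on the answered instances).

        from typing import List, Tuple

        def risk_coverage(
            predictions: List[Tuple[float, bool]],  # (MaxProb, is_correct) per instance
            threshold: float,
        ) -> Tuple[float, float]:
            answered = [(p, ok) for p, ok in predictions if p >= threshold]
            coverage = len(answered) / len(predictions) if predictions else 0.0
            risk = (sum(1 for _, ok in answered if not ok) / len(answered)
                    if answered else 0.0)
            return coverage, risk

        # Three of four instances are answered at this threshold; one of them is wrong.
        print(risk_coverage([(0.95, True), (0.40, False), (0.80, False), (0.99, True)], 0.5))
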
    @inproceedings{varshney-etal-2022-investigating,
        title = "Investigating Selective Prediction Approaches Across Several Tasks in {IID}, {OOD}, and Adversarial Settings",
        author = "Varshney, Neeraj  and
          Mishra, Swaroop  and
          Baral, Chitta",
        booktitle = "Findings of the Association for Computational Linguistics: ACL 2022",
        month = may,
        year = "2022",
        address = "Dublin, Ireland",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2022.findings-acl.158",
        pages = "1995--2002",
        abstract = "In order to equip NLP systems with {`}selective prediction{'} capability, several task-specific approaches have been proposed. However, which approaches work best across tasks or even if they consistently outperform the simplest baseline MaxProb remains to be explored. To this end, we systematically study selective prediction in a large-scale setup of 17 datasets across several NLP tasks. Through comprehensive experiments under in-domain (IID), out-of-domain (OOD), and adversarial (ADV) settings, we show that despite leveraging additional resources (held-out data/computation), none of the existing approaches consistently and considerably outperforms MaxProb in all three settings. Furthermore, their performance does not translate well across tasks. For instance, Monte-Carlo Dropout outperforms all other approaches on Duplicate Detection datasets but does not fare well on NLI datasets, especially in the OOD setting. Thus, we recommend that future selective prediction approaches should be evaluated across tasks and settings for reliable estimation of their capabilities.",
    }

2022

  1. ACL
    ILDAE: Instance-Level Difficulty Analysis of Evaluation Data
    Neeraj Varshney, Swaroop Mishra, Chitta Baral

    Association for Computational Linguistics

    We conduct Instance-Level Difficulty Analysis of Evaluation data (ILDAE) in a large-scale setup of 23 datasets and demonstrate its five novel applications such as efficient evaluations, improving quality of evaluation datasets, dataset analysis to guide future data creation, etc.

    Knowledge of questions' difficulty level helps a teacher in several ways, such as estimating students' potential quickly by asking carefully selected questions and improving quality of examination by modifying trivial and hard questions. Can we extract such benefits of instance difficulty in NLP? To this end, we conduct Instance-Level Difficulty Analysis of Evaluation data (ILDAE) in a large-scale setup of 23 datasets and demonstrate its five novel applications: 1) conducting efficient-yet-accurate evaluations with fewer instances saving computational cost and time, 2) improving quality of existing evaluation datasets by repairing erroneous and trivial instances, 3) selecting the best model based on application requirements, 4) analyzing dataset characteristics for guiding future data creation, 5) estimating Out-of-Domain performance reliably. Comprehensive experiments for these applications result in several interesting findings, such as evaluation using just 5% instances (selected via ILDAE) achieves as high as 0.93 Kendall correlation with evaluation using complete dataset and computing weighted accuracy using difficulty scores leads to 5.2% higher correlation with Out-of-Domain performance. We release the difficulty scores and hope our analyses and findings will bring more attention to this important yet understudied field of leveraging instance difficulty in evaluations.
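
    One of the applications above, computing a difficulty-weighted accuracy, can be sketched as follows (toy difficulty scores, for illustration only): correctly answering a hard instance contributes more than correctly answering an easy one.

        from typing import List, Tuple

        def weighted_accuracy(results: List[Tuple[bool, float]]) -> float:
            # results: (is_correct, difficulty score in [0, 1]) per evaluation instance.
            total_weight = sum(difficulty for _, difficulty in results)
            if total_weight == 0:
                return 0.0
            return sum(difficulty for correct, difficulty in results if correct) / total_weight

        print(weighted_accuracy([(True, 0.9), (True, 0.2), (False, 0.7), (True, 0.5)]))
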
    @inproceedings{varshney-etal-2022-ildae,
        title = "{ILDAE}: Instance-Level Difficulty Analysis of Evaluation Data",
        author = "Varshney, Neeraj  and
          Mishra, Swaroop  and
          Baral, Chitta",
        booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
        month = may,
        year = "2022",
        address = "Dublin, Ireland",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2022.acl-long.240",
        pages = "3412--3425",
        abstract = "Knowledge of difficulty level of questions helps a teacher in several ways, such as estimating students{'} potential quickly by asking carefully selected questions and improving quality of examination by modifying trivial and hard questions. Can we extract such benefits of instance difficulty in Natural Language Processing? To this end, we conduct Instance-Level Difficulty Analysis of Evaluation data (ILDAE) in a large-scale setup of 23 datasets and demonstrate its five novel applications: 1) conducting efficient-yet-accurate evaluations with fewer instances saving computational cost and time, 2) improving quality of existing evaluation datasets by repairing erroneous and trivial instances, 3) selecting the best model based on application requirements, 4) analyzing dataset characteristics for guiding future data creation, 5) estimating Out-of-Domain performance reliably. Comprehensive experiments for these applications lead to several interesting results, such as evaluation using just 5{\%} instances (selected via ILDAE) achieves as high as 0.93 Kendall correlation with evaluation using complete dataset and computing weighted accuracy using difficulty scores leads to 5.2{\%} higher correlation with Out-of-Domain performance. We release the difficulty scores and hope our work will encourage research in this important yet understudied field of leveraging instance difficulty in evaluations.",
    }

2022

  1. ACL
    NumGLUE: A Suite of Mathematical Reasoning Tasks
    Swaroop Mishra, Arindam Mitra, Neeraj Varshney, Bhavdeep Sachdeva, Peter Clark, Chitta Baral, and Ashwin Kalyan

    Association for Computational Linguistics

    We propose a multi-task benchmark that evaluates AI systems on eight different numerical understanding tasks and show that it is far from being solved, with neural models (including large language models) performing significantly worse than humans (lower by 46.4%). We also propose a knowledge-retrieval-based MTL method that outperforms existing models.

    Given the ubiquitous nature of numbers in text, reasoning with numbers to perform simple calculations is an important skill of AI systems. While many datasets and models have been developed to this end, state-of-the-art AI systems are brittle; failing to perform the underlying mathematical reasoning when they appear in a slightly different scenario. Drawing inspiration from GLUE that was proposed in the context of natural language understanding, we propose NumGLUE, a multi-task benchmark that evaluates the performance of AI systems on eight different tasks, that at their core require simple arithmetic understanding. We show that this benchmark is far from being solved with neural models including state-of-the-art large-scale language models performing significantly worse than humans (lower by 46.4%). Further, NumGLUE promotes sharing knowledge across tasks, especially those with limited training data as evidenced by the superior performance (average gain of 3.4% on each task) when a model is jointly trained on all the tasks as opposed to task-specific modeling. Finally, we hope that NumGLUE will encourage systems that perform robust and general arithmetic reasoning within language, a first step towards being able to perform more complex mathematical reasoning.
    @inproceedings{mishra-etal-2022-numglue,
        title = "{N}um{GLUE}: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks",
        author = "Mishra, Swaroop  and
          Mitra, Arindam  and
          Varshney, Neeraj  and
          Sachdeva, Bhavdeep  and
          Clark, Peter  and
          Baral, Chitta  and
          Kalyan, Ashwin",
        booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
        month = may,
        year = "2022",
        address = "Dublin, Ireland",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2022.acl-long.246",
        pages = "3505--3523",
        abstract = "Given the ubiquitous nature of numbers in text, reasoning with numbers to perform simple calculations is an important skill of AI systems. While many datasets and models have been developed to this end, state-of-the-art AI systems are brittle; failing to perform the underlying mathematical reasoning when they appear in a slightly different scenario. Drawing inspiration from GLUE that was proposed in the context of natural language understanding, we propose NumGLUE, a multi-task benchmark that evaluates the performance of AI systems on eight different tasks, that at their core require simple arithmetic understanding. We show that this benchmark is far from being solved with neural models including state-of-the-art large-scale language models performing significantly worse than humans (lower by 46.4 {\%}). Further, NumGLUE promotes sharing knowledge across tasks, especially those with limited training data as evidenced by the superior performance (average gain of 3.4 {\%} on each task) when a model is jointly trained on all the tasks as opposed to task-specific modeling. Finally, we hope that NumGLUE will encourage systems that perform robust and general arithmetic reasoning within language, a first step towards being able to perform more complex mathematical reasoning.",
    }

2022

  1. ACL
    Towards Improving Selective Prediction Ability of NLP Systems
    Neeraj Varshney, Swaroop Mishra, Chitta Baral

    Repl4NLP @ Association for Computational Linguistics

    Prior work has shown that existing 'selective prediction' techniques fail to perform well, especially in the out-of-domain setting. In this work, we propose a method that improves probability estimates of models by calibrating them using prediction confidence and difficulty score of instances. Using these two signals, we first annotate held-out instances and then train a calibrator to predict the likelihood of correctness of the model's prediction. We instantiate our method with Natural Language Inference (NLI) and Duplicate Detection (DD) tasks and evaluate it in both In-Domain (IID) and Out-of-Domain (OOD) settings. In (IID, OOD) settings, we show that the representations learned by our calibrator result in an improvement of (15.81%, 5.64%) and (6.19%, 13.9%) over MaxProb --a selective prediction baseline-- on NLI and DD tasks respectively.

    It's better to say "I can't answer" than to answer incorrectly. This selective prediction ability is crucial for NLP systems to be reliably deployed in real-world applications. Prior work has shown that existing selective prediction techniques fail to perform well, especially in the out-of-domain setting. In this work, we propose a method that improves probability estimates of models by calibrating them using prediction confidence and difficulty score of instances. Using these two signals, we first annotate held-out instances and then train a calibrator to predict the likelihood of correctness of the model's prediction. We instantiate our method with Natural Language Inference (NLI) and Duplicate Detection (DD) tasks and evaluate it in both In-Domain (IID) and Out-of-Domain (OOD) settings. In (IID, OOD) settings, we show that the representations learned by our calibrator result in an improvement of (15.81%, 5.64%) and (6.19%, 13.9%) over 'MaxProb' -- a selective prediction baseline -- on NLI and DD tasks respectively.
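
    The calibration idea can be sketched with scikit-learn (a minimal illustration with made-up numbers, not the paper's released code or learned representations): held-out instances are annotated with the model's confidence and a difficulty score, labeled by whether the prediction was correct, and a calibrator is trained to predict that likelihood, which then replaces raw MaxProb as the selective-prediction score.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        # Hypothetical held-out annotations: [MaxProb confidence, difficulty score].
        features = np.array([
            [0.95, 0.10], [0.60, 0.80], [0.85, 0.30], [0.55, 0.90],
            [0.90, 0.20], [0.45, 0.70], [0.75, 0.40], [0.50, 0.60],
        ])
        correct = np.array([1, 0, 1, 0, 1, 0, 1, 0])  # was the model's prediction correct?

        calibrator = LogisticRegression().fit(features, correct)

        # At test time, the calibrated likelihood of correctness is used as the
        # selective-prediction score instead of raw MaxProb.
        print(calibrator.predict_proba(np.array([[0.70, 0.25]]))[:, 1])
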
    @inproceedings{varshney-etal-2022-towards,
        title = "Towards Improving Selective Prediction Ability of {NLP} Systems",
        author = "Varshney, Neeraj  and
          Mishra, Swaroop  and
          Baral, Chitta",
        booktitle = "Proceedings of the 7th Workshop on Representation Learning for NLP",
        month = may,
        year = "2022",
        address = "Dublin, Ireland",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2022.repl4nlp-1.23",
        pages = "221--226",
        abstract = "It{'}s better to say {``}I can{'}t answer{''} than to answer incorrectly. This selective prediction ability is crucial for NLP systems to be reliably deployed in real-world applications. Prior work has shown that existing selective prediction techniques fail to perform well, especially in the out-of-domain setting. In this work, we propose a method that improves probability estimates of models by calibrating them using prediction confidence and difficulty score of instances. Using these two signals, we first annotate held-out instances and then train a calibrator to predict the likelihood of correctness of the model{'}s prediction. We instantiate our method with Natural Language Inference (NLI) and Duplicate Detection (DD) tasks and evaluate it in both In-Domain (IID) and Out-of-Domain (OOD) settings. In (IID, OOD) settings, we show that the representations learned by our calibrator result in an improvement of (15.81{\%}, 5.64{\%}) and (6.19{\%}, 13.9{\%}) over {`}MaxProb{'} -a selective prediction baseline- on NLI and DD tasks respectively.",
    }

2022

  1. NAACL
    Let the Model Decide its Curriculum for Multitask Learning
    Neeraj Varshney, Swaroop Mishra, Chitta Baral

    DeepLo @ North American Chapter of the Association for Computational Linguistics

    We propose two classes of techniques to arrange training instances into a learning curriculum based on difficulty scores computed via model-based approaches. The two classes, i.e., Dataset-level and Instance-level, differ in granularity of arrangement. Through comprehensive experiments with 12 datasets, we show that instance-level and dataset-level techniques result in strong representations as they lead to an average performance improvement of 4.17% and 3.15% over their respective baselines. Furthermore, we find that most of this improvement comes from correctly answering the difficult instances, implying a greater efficacy of our techniques on difficult tasks.

    Curriculum learning strategies in prior multi-task learning approaches arrange datasets in a difficulty hierarchy either based on human perception or by exhaustively searching the optimal arrangement. However, human perception of difficulty may not always correlate well with machine interpretation, leading to poor performance, and exhaustive search is computationally expensive. Addressing these concerns, we propose two classes of techniques to arrange training instances into a learning curriculum based on difficulty scores computed via model-based approaches. The two classes, i.e., Dataset-level and Instance-level, differ in granularity of arrangement. Through comprehensive experiments with 12 datasets, we show that instance-level and dataset-level techniques result in strong representations as they lead to an average performance improvement of 4.17% and 3.15% over their respective baselines. Furthermore, we find that most of this improvement comes from correctly answering the difficult instances, implying a greater efficacy of our techniques on difficult tasks.
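
    The instance-level arrangement can be sketched in a few lines (hypothetical difficulty scores, for illustration only): each training instance is scored by a model-based difficulty signal and the multi-task training data is ordered from easy to hard.

        from typing import List, Tuple

        def easy_to_hard_curriculum(
            instances: List[Tuple[str, float]],  # (example, model-based difficulty score)
        ) -> List[str]:
            return [example for example, _ in sorted(instances, key=lambda pair: pair[1])]

        mixed_task_data = [
            ("task A: easy example", 0.12),
            ("task B: hard example", 0.91),
            ("task A: medium example", 0.47),
            ("task B: easy example", 0.08),
        ]
        print(easy_to_hard_curriculum(mixed_task_data))
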
              @inproceedings{varshney-mishra-and-chitta-baral-2022-model,
        title = "Let the Model Decide its Curriculum for Multitask Learning",
        author = "Varshney, Neeraj  and
          Mishra, Swaroop  and
          Baral, Chitta",
        booktitle = "Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing",
        month = jul,
        year = "2022",
        address = "Hybrid",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2022.deeplo-1.13",
        pages = "117--125",
        abstract = "Curriculum learning strategies in prior multi-task learning approaches arrange datasets in a difficulty hierarchy either based on human perception or by exhaustively searching the optimal arrangement. However, human perception of difficulty may not always correlate well with machine interpretation, leading to poor performance, and exhaustive search is computationally expensive. Addressing these concerns, we propose two classes of techniques to arrange training instances into a learning curriculum based on difficulty scores computed via model-based approaches. The two classes, i.e., Dataset-level and Instance-level, differ in granularity of arrangement. Through comprehensive experiments with 12 datasets, we show that instance-level and dataset-level techniques result in strong representations as they lead to an average performance improvement of 4.17% and 3.15% over their respective baselines. Furthermore, we find that most of this improvement comes from correctly answering the difficult instances, implying a greater efficacy of our techniques on difficult tasks.",
    }

2022

  1. AAAI
    An Architecture for Novelty Handling in a Multi-Agent Stochastic Environment: Case Study in Open-World Monopoly

    AAAI Spring Symposium 2022

    We introduce an architecture that allows agents to detect novelties, characterize those novelties, and build an appropriate adaptive model to accommodate them.

    The ability of AI agents and architectures to detect and adapt to sudden changes in their environments remains an outstanding challenge. In the context of multi-agent games, the agent may face novel situations where the rules of the game, the available actions, the environment dynamics, the behavior of other agents, as well as the agent’s goals suddenly change. In this paper, we introduce an architecture that allows agents to detect novelties, characterize those novelties, and build an appropriate adaptive model to accommodate them. Our agent utilizes logic and reasoning (specifically, Answer Set Programming) to characterize novelties into different categories, as to enable the agent to adapt to the novelty while maintaining high performance in the game. We demonstrate the effectiveness of the proposed agent architecture in a multi-agent imperfect information board game, Monopoly. We measure the success of the architecture by comparing our method to heuristics, and vanilla Monte-Carlo Tree Search approaches. Our results indicate precise novelty detection, and significant improvements in the performance of agents utilizing the novelty handling architecture.

2022

  1. arXiv
    Can Transformers Reason About Effects of Actions?
    Pratyay Banerjee, Chitta Baral, Man Luo, Arindam Mitra, Kuntal Pal, Tran C. Son, and Neeraj Varshney

    arXiv

    Reasoning about action and change has been a top focus in the knowledge representation subfield of AI from the early days of AI and more recently it has been a highlight aspect in common sense question answering. We consider four action domains (Blocks World, Logistics, Dock-Worker-Robots and a Generic Domain) in natural language and create QA datasets that involve reasoning about the effects of actions in these domains. We investigate the ability of transformers to (a) learn to reason in these domains and (b) transfer that learning from the generic domains to the other domains.

    A recent work has shown that transformers are able to "reason" with facts and rules in a limited setting where the rules are natural language expressions of conjunctions of conditions implying a conclusion. Since this suggests that transformers may be used for reasoning with knowledge given in natural language, we do a rigorous evaluation of this with respect to a common form of knowledge and its corresponding reasoning -- the reasoning about effects of actions. Reasoning about action and change has been a top focus in the knowledge representation subfield of AI from the early days of AI and more recently it has been a highlight aspect in common sense question answering. We consider four action domains (Blocks World, Logistics, Dock-Worker-Robots and a Generic Domain) in natural language and create QA datasets that involve reasoning about the effects of actions in these domains. We investigate the ability of transformers to (a) learn to reason in these domains and (b) transfer that learning from the generic domains to the other domains.
              @article{banerjee2020can,
      title={Can Transformers Reason About Effects of Actions?},
      author={Banerjee, Pratyay and Baral, Chitta and Luo, Man and Mitra, Arindam and Pal, Kuntal and Son, Tran C and Varshney, Neeraj},
      journal={arXiv preprint arXiv:2012.09938},
      year={2020}
    }

2020