AITopics | Gunasekara, Chulaka

Collaborating Authors

Gunasekara, Chulaka

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems

Katsis, Yannis, Rosenthal, Sara, Fadnis, Kshitij, Gunasekara, Chulaka, Lee, Young-Suk, Popa, Lucian, Shah, Vraj, Zhu, Huaiyu, Contractor, Danish, Danilevsky, Marina

arXiv.org Artificial IntelligenceJan-6-2025

Retrieval-augmented generation (RAG) has recently become a very popular task for Large Language Models (LLMs). Evaluating them on multi-turn RAG conversations, where the system is asked to generate a response to a question in the context of a preceding conversation is an important and often overlooked task with several additional challenges. We present MTRAG: an end-to-end human-generated multi-turn RAG benchmark that reflects several real-world properties across diverse dimensions for evaluating the full RAG pipeline. MTRAG contains 110 conversations averaging 7.7 turns each across four domains for a total of 842 tasks. We also explore automation paths via synthetic data and LLM-as-a-Judge evaluation. Our human and automatic evaluations show that even state-of-the-art LLM RAG systems struggle on MTRAG. We demonstrate the need for strong retrieval and generation systems that can handle later turns, unanswerable questions, non-standalone questions, and multiple domains. MTRAG is available at https://github.com/ibm/mt-rag-benchmark.

benchmark, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2501.03468

Country:

Europe (0.67)
North America > United States > Texas (0.14)
North America > United States > New York (0.14)
Asia > Middle East > UAE (0.14)

Genre: Research Report > New Finding (0.46)

Industry:

Media > Film (1.00)
Leisure & Entertainment (1.00)
Government (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Reducing the Scope of Language Models with Circuit Breakers

Yunis, David, Huo, Siyu, Gunasekara, Chulaka, Contractor, Danish

arXiv.org Artificial IntelligenceOct-28-2024

Language models are now deployed in a wide variety of user-facing applications, often for specific purposes like answering questions about documentation or acting as coding assistants. As these models are intended for particular purposes, they should not be able to answer irrelevant queries like requests for poetry or questions about physics, or even worse, queries that can only be answered by humans like sensitive company policies. Instead we would like them to only answer queries corresponding to desired behavior and refuse all other requests, which we refer to as scoping. We find that, despite the use of system prompts, two representative language models can be poorly scoped and respond to queries they should not be addressing. We then conduct a comprehensive empirical evaluation of methods which could be used for scoping the behavior of language models. Among many other results, we show that a recently-proposed method for general alignment, Circuit Breakers (CB), can be adapted to scope language models to very specific tasks like sentiment analysis or summarization or even tasks with finer-grained scoping (e.g. When compared to standard methods like fine-tuning or preference learning, CB is more robust both for out of distribution tasks, and to adversarial prompting techniques. We also show that layering SFT and CB together often results in the best of both worlds: improved performance only on relevant queries, while rejecting irrelevant ones. In the past few years Large Language Models have exploded into the popular conscience. One major recent addition is the "alignment" process through Reinforcement Learning with Human Feedback (RLHF) (Christiano et al., 2017; Ouyang et al., 2022), which has made the current generation of language models much less likely to emit toxic content than previous generations (Wolf et al., 2017), and thus much more acceptable for general use. As a result, many businesses and individuals feel more comfortable using these technologies than they would be in the past. As a result, we have generally capable language models which refuse to answer toxic or dangerous queries, but it is still difficult to deploy these language models. Even though they may not emit toxic content as often, they still will happily answer any question, irrelevant or not. This becomes a problem when we wish to deploy language models as products in specific contexts: e.g. While language models have general language capability, there is still a need to scope them for specific uses. David Yunis is a PhD student at the Toyota Technological Institute at Chicago. Work was performed during an internship at IBM. Arrows indicate the direction of best performance.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2410.21597

Country: North America > United States > Illinois > Cook County > Chicago (0.24)

Genre:

Research Report (0.81)
Overview (0.65)

Industry: Information Technology (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks

Abdelaziz, Ibrahim, Basu, Kinjal, Agarwal, Mayank, Kumaravel, Sadhana, Stallone, Matthew, Panda, Rameswar, Rizk, Yara, Bhargav, GP, Crouse, Maxwell, Gunasekara, Chulaka, Ikbal, Shajith, Joshi, Sachin, Karanam, Hima, Kumar, Vineet, Munawar, Asim, Neelam, Sumit, Raghu, Dinesh, Sharma, Udit, Soria, Adriana Meza, Sreedhar, Dheeraj, Venkateswaran, Praveen, Unuvar, Merve, Cox, David, Roukos, Salim, Lastras, Luis, Kapanipathi, Pavan

arXiv.org Artificial IntelligenceJun-27-2024

Large language models (LLMs) have recently shown tremendous promise in serving as the backbone to agentic systems, as demonstrated by their performance in multi-faceted, challenging benchmarks like SWE-Bench and Agent-Bench. However, to realize the true potential of LLMs as autonomous agents, they must learn to identify, call, and interact with external tools and application program interfaces (APIs) to complete complex tasks. These tasks together are termed function calling. Endowing LLMs with function calling abilities leads to a myriad of advantages, such as access to current and domain-specific information in databases and knowledge sources, and the ability to outsource tasks that can be reliably performed by tools, e.g., a Python interpreter or calculator. While there has been significant progress in function calling with LLMs, there is still a dearth of open models that perform on par with proprietary LLMs like GPT, Claude, and Gemini. Therefore, in this work, we introduce the GRANITE-20B-FUNCTIONCALLING model under an Apache 2.0 license. The model is trained using a multi-task training approach on seven fundamental tasks encompassed in function calling, those being Nested Function Calling, Function Chaining, Parallel Functions, Function Name Detection, Parameter-Value Pair Detection, Next-Best Function, and Response Generation. We present a comprehensive evaluation on multiple out-of-domain datasets comparing GRANITE-20B-FUNCTIONCALLING to more than 15 other best proprietary and open models. GRANITE-20B-FUNCTIONCALLING provides the best performance among all open models on the Berkeley Function Calling Leaderboard and fourth overall. As a result of the diverse tasks and datasets used for training our model, we show that GRANITE-20B-FUNCTIONCALLING has better generalizability on multiple tasks in seven different evaluation datasets.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2407.00121

Country:

North America > United States (0.14)
North America > Canada (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Evaluating Robustness of Dialogue Summarization Models in the Presence of Naturally Occurring Variations

Gupta, Ankita, Gunasekara, Chulaka, Wan, Hui, Ganhotra, Jatin, Joshi, Sachindra, Danilevsky, Marina

arXiv.org Artificial IntelligenceNov-15-2023

Dialogue summarization task involves summarizing long conversations while preserving the most salient information. Real-life dialogues often involve naturally occurring variations (e.g., repetitions, hesitations) and existing dialogue summarization models suffer from performance drop on such conversations. In this study, we systematically investigate the impact of such variations on state-of-the-art dialogue summarization models using publicly available datasets. To simulate real-life variations, we introduce two types of perturbations: utterance-level perturbations that modify individual utterances with errors and language variations, and dialogue-level perturbations that add non-informative exchanges (e.g., repetitions, greetings). We conduct our analysis along three dimensions of robustness: consistency, saliency, and faithfulness, which capture different aspects of the summarization model's performance. We find that both fine-tuned and instruction-tuned models are affected by input variations, with the latter being more susceptible, particularly to dialogue-level perturbations. We also validate our findings via human evaluation. Finally, we investigate if the robustness of fine-tuned models can be improved by training them with a fraction of perturbed data and observe that this approach is insufficient to address robustness challenges with current models and thus warrants a more thorough investigation to identify better solutions. Overall, our work highlights robustness challenges in dialogue summarization and provides insights for future research.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2311.08705

Country:

Asia (0.93)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

MISMATCH: Fine-grained Evaluation of Machine-generated Text with Mismatch Error Types

Murugesan, Keerthiram, Swaminathan, Sarathkrishna, Dan, Soham, Chaudhury, Subhajit, Gunasekara, Chulaka, Crouse, Maxwell, Mahajan, Diwakar, Abdelaziz, Ibrahim, Fokoue, Achille, Kapanipathi, Pavan, Roukos, Salim, Gray, Alexander

arXiv.org Artificial IntelligenceJun-17-2023

With the growing interest in large language models, the need for evaluating the quality of machine text compared to reference (typically human-generated) text has become focal attention. Most recent works focus either on task-specific evaluation metrics or study the properties of machine-generated text captured by the existing metrics. In this work, we propose a new evaluation scheme to model human judgments in 7 NLP tasks, based on the fine-grained mismatches between a pair of texts. Inspired by the recent efforts in several NLP tasks for fine-grained evaluation, we introduce a set of 13 mismatch error types such as spatial/geographic errors, entity errors, etc, to guide the model for better prediction of human judgments. We propose a neural framework for evaluating machine texts that uses these mismatch error types as auxiliary tasks and re-purposes the existing single-number evaluation metrics as additional scalar features, in addition to textual features extracted from the machine and reference texts. Our experiments reveal key insights about the existing metrics via the mismatch errors. We show that the mismatch errors between the sentence pairs on the held-out datasets from 7 NLP tasks align well with the human evaluation.

artificial intelligence, machine translation, natural language, (16 more...)

arXiv.org Artificial Intelligence

2306.10452

Country: North America > United States > New Mexico (0.14)

Genre: Research Report (0.40)

Industry: Energy > Oil & Gas > Upstream (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.50)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.34)

Add feedback

Semi-Structured Object Sequence Encoders

Murthy, Rudra V, Bhat, Riyaz, Gunasekara, Chulaka, Patel, Siva Sankalp, Wan, Hui, Dhamecha, Tejas Indulal, Contractor, Danish, Danilevsky, Marina

arXiv.org Artificial IntelligenceMay-22-2023

In this paper we explore the task of modeling semi-structured object sequences; in particular, we focus our attention on the problem of developing a structure-aware input representation for such sequences. Examples of such data include user activity on websites, machine logs, and many others. This type of data is often represented as a sequence of sets of key-value pairs over time and can present modeling challenges due to an ever-increasing sequence length. We propose a two-part approach, which first considers each key independently and encodes a representation of its values over time; we then self-attend over these value-aware key representations to accomplish a downstream task. This allows us to operate on longer object sequences than existing methods. We introduce a novel shared-attention-head architecture between the two modules and present an innovative training schedule that interleaves the training of both modules with shared weights for some attention heads. Our experiments on multiple prediction tasks using real-world data demonstrate that our approach outperforms a unified network with hierarchical encoding, as well as other methods including a record-centric representation and a flattened representation of the sequence.

data mining, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2301.01015

Country: North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology > Services (0.48)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Scripts & Frames (0.64)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

The Benefits of Bad Advice: Autocontrastive Decoding across Model Layers

Gera, Ariel, Friedman, Roni, Arviv, Ofir, Gunasekara, Chulaka, Sznajder, Benjamin, Slonim, Noam, Shnarch, Eyal

arXiv.org Artificial IntelligenceMay-2-2023

Applying language models to natural language processing tasks typically relies on the representations in the final model layer, as intermediate hidden layer representations are presumed to be less informative. In this work, we argue that due to the gradual improvement across model layers, additional information can be gleaned from the contrast between higher and lower layers during inference. Specifically, in choosing between the probable next token predictions of a generative model, the predictions of lower layers can be used to highlight which candidates are best avoided. We propose a novel approach that utilizes the contrast between layers to improve text generation outputs, and show that it mitigates degenerative behaviors of the model in open-ended generation, significantly improving the quality of generated texts. Furthermore, our results indicate that contrasting between model layers at inference time can yield substantial benefits to certain aspects of general language model capabilities, more effectively extracting knowledge during inference from a given set of model parameters.

machine learning, natural language, prediction, (18 more...)

arXiv.org Artificial Intelligence

2305.01628

Country:

Oceania > Australia (0.14)
Europe > Germany (0.14)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback