albany
The Order Effect: Investigating Prompt Sensitivity in Closed-Source LLMs
Guan, Bryan, Roosta, Tanya, Passban, Peyman, Rezagholizadeh, Mehdi
As large language models (LLMs) become integral to diverse applications, ensuring their reliability under varying input conditions is crucial. One key issue affecting this reliability is order sensitivity, wherein slight variations in input arrangement can lead to inconsistent or biased outputs. Although recent advances have reduced this sensitivity, the problem remains unresolved. This paper investigates the extent of order sensitivity in closed-source LLMs by conducting experiments across multiple tasks, including paraphrasing, relevance judgment, and multiple-choice questions. Our results show that input order significantly affects performance across tasks, with shuffled inputs leading to measurable declines in output accuracy. Few-shot prompting demonstrates mixed effectiveness: it offers partial mitigation but fails to fully resolve the problem. These findings highlight persistent risks, particularly in high-stakes applications, and point to the need for more robust LLMs or improved input-handling techniques in future development.
In recent years, large language models (LLMs) have become essential across various applications, helping users complete tasks in diverse domains, thanks to their remarkable abilities in understanding, analyzing, and generating text (Shen et al., 2023a; Yu et al., 2023). However, LLMs are not without their problems and risks. Many of these issues, such as bias (Talat et al., 2022; Motoki et al., 2023), hallucination (Chen et al., 2023; Sadat et al., 2023), consistency (Tam et al., 2023; Ye et al., 2023), and reliability (Shen et al., 2023b), have been extensively discussed in the literature. A more fundamental challenge to the long-term success of LLMs, however, is their ability to reason: the distinguishing factor between probabilistic pattern matching and logical understanding. This distinction has significant implications for the future of LLMs and how we employ these models in decision-making.
One necessary requirement for reasoning is order independence.
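One way to make the idea of order independence concrete is to present the same multiple-choice question under every ordering of its options and check how often the answer stays the same. The sketch below is only an illustration under stated assumptions: `first_option_model` and `alphabetical_model` are hypothetical stand-ins for a real LLM call, and `order_consistency` is an illustrative metric, not the measure used in the paper.

```python
import itertools
from collections import Counter


def first_option_model(question, options):
    """Hypothetical stand-in for an LLM: always picks whichever option is listed first."""
    return options[0]


def alphabetical_model(question, options):
    """Hypothetical order-independent stand-in: picks the alphabetically first option."""
    return min(options)


def order_consistency(model, question, options):
    """Fraction of option orderings on which the model returns its most common answer.

    A value of 1.0 means the answer never changes when the options are shuffled.
    """
    answers = [
        model(question, list(perm))
        for perm in itertools.permutations(options)
    ]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)


if __name__ == "__main__":
    question = "Which city is the capital of France?"
    options = ["Paris", "Rome", "Madrid"]
    print(order_consistency(first_option_model, question, options))
    print(order_consistency(alphabetical_model, question, options))
```

With three options there are six orderings, so the position-biased stand-in agrees with itself on only two of them, while the order-independent stand-in agrees on all six. For a real model, the stand-in functions would be replaced by an API call, and enumerating all permutations would be replaced by sampling when the option list is long.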
Evolving Deeper LLM Thinking
Lee, Kuang-Huei, Fischer, Ian, Wu, Yueh-Hua, Marwood, Dave, Baluja, Shumeet, Schuurmans, Dale, Chen, Xinyun
We explore an evolutionary search strategy for scaling inference time compute in Large Language Models. The proposed approach, Mind Evolution, uses a language model to generate, recombine and refine candidate responses. The proposed approach avoids the need to formalize the underlying inference problem whenever a solution evaluator is available. Controlling for inference cost, we find that Mind Evolution significantly outperforms other inference strategies such as Best-of-N and Sequential Revision in natural language planning tasks. In the TravelPlanner and Natural Plan benchmarks, Mind Evolution solves more than 98% of the problem instances using Gemini 1.5 Pro without the use of a formal solver.
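The generate, recombine, and refine loop described in the abstract follows the general shape of an evolutionary search guided by a solution evaluator. The toy sketch below shows that generic loop only; it is not the authors' Mind Evolution implementation. Random string operations stand in for the language model, and `TARGET`, `evaluate`, `generate`, `recombine`, `refine`, and `evolve` are all hypothetical names chosen for illustration.

```python
import random

TARGET = "plan"  # hypothetical goal that the evaluator scores candidates against
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def evaluate(candidate):
    """Solution evaluator: number of positions that match the target."""
    return sum(c == t for c, t in zip(candidate, TARGET))


def generate(rng):
    """Stand-in for an LLM proposing a fresh candidate."""
    return "".join(rng.choice(ALPHABET) for _ in TARGET)


def recombine(a, b, rng):
    """Stand-in for an LLM merging two parent candidates (one-point crossover)."""
    cut = rng.randrange(1, len(TARGET))
    return a[:cut] + b[cut:]


def refine(candidate, rng):
    """Stand-in for an LLM revising a candidate (single-character change)."""
    i = rng.randrange(len(candidate))
    return candidate[:i] + rng.choice(ALPHABET) + candidate[i + 1:]


def evolve(generations=50, population_size=20, seed=0):
    """Run the loop and return the best candidate plus the best score per generation."""
    rng = random.Random(seed)
    population = [generate(rng) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        population.sort(key=evaluate, reverse=True)
        history.append(evaluate(population[0]))
        # Keep the top half unchanged (elitism), then refill with refined offspring.
        parents = population[: population_size // 2]
        children = [
            refine(recombine(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return max(population, key=evaluate), history


if __name__ == "__main__":
    best, history = evolve()
    print(best, evaluate(best), history[0], history[-1])
```

Because the top half of each generation is carried over unchanged, the best score can never decrease from one generation to the next. The only problem-specific piece is `evaluate`, which mirrors the abstract's point that the search needs a solution evaluator rather than a formalization of the underlying problem.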
Stochastic Online AUC Maximization
Department of Mathematics and Statistics, SUNY at Albany, Albany, NY, 12222, USA; Department of Computer Science, SUNY at Albany, Albany, NY, 12222, USA
Area under the ROC curve (AUC) is a metric widely used for measuring classification performance on imbalanced data. It is of theoretical and practical interest to develop online learning algorithms that maximize AUC for large-scale data. A specific challenge in developing online AUC maximization algorithms is that the learning objective function is usually defined over a pair of training examples of opposite classes, and existing methods achieve online processing with higher space and time complexity. In this work, we propose a new stochastic online algorithm for AUC maximization. In particular, we show that AUC optimization can be equivalently formulated as a convex-concave saddle point problem. From this saddle representation, a stochastic online algorithm (SOLAM) is proposed which has time and space complexity of one datum. We establish theoretical convergence of SOLAM with high probability and demonstrate its effectiveness on standard benchmark datasets.
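To see why the objective is naturally defined over pairs of examples from opposite classes, it helps to compute AUC directly as the fraction of positive-negative pairs that a scoring function ranks correctly. The sketch below shows that batch, pairwise computation only; it is not SOLAM, and `pairwise_auc` is a hypothetical helper name. It assumes positives are labeled `1` and every other label counts as negative.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half correct.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")

    correct = 0.0
    for sp in pos:          # every positive example ...
        for sn in neg:      # ... is compared with every negative example
            if sp > sn:
                correct += 1.0
            elif sp == sn:
                correct += 0.5
    return correct / (len(pos) * len(neg))


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.3, 0.1]
    labels = [1, -1, 1, -1]
    print(pairwise_auc(scores, labels))  # 3 of the 4 pairs are ordered correctly: 0.75
```

This direct computation has to keep all examples around and touches every opposite-class pair, which is the kind of space and time cost that an online method with per-datum complexity is meant to avoid.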
Towards Uncertainty-Aware Language Agent
Han, Jiuzhou, Buntine, Wray, Shareghi, Ehsan
While Language Agents have achieved promising success by placing Large Language Models at the core of a more versatile design that dynamically interacts with the external world, the existing approaches neglect the notion of uncertainty during these interactions. We present the Uncertainty-Aware Language Agent (UALA), a framework that orchestrates the interaction between the agent and the external world using uncertainty quantification. Compared with other well-known counterparts like ReAct, our extensive experiments across 3 representative tasks (HotpotQA, StrategyQA, MMLU) and various LLM sizes demonstrate that UALA brings a significant improvement of performance, while having a substantially lower reliance on the external world (i.e., reduced number of tool calls and tokens). Our analyses provide various insights including the great potential of UALA compared with agent fine-tuning, and underscore the unreliability of verbalised confidence of LLMs as a proxy for uncertainty.
FireAct: Toward Language Agent Fine-tuning
Chen, Baian, Shu, Chang, Shareghi, Ehsan, Collier, Nigel, Narasimhan, Karthik, Yao, Shunyu
Recent efforts have augmented language models (LMs) with external tools or environments, leading to the development of language agents that can reason and act. However, most of these agents rely on few-shot prompting techniques with off-the-shelf LMs. In this paper, we investigate and argue for the overlooked direction of fine-tuning LMs to obtain language agents. Using a setup of question answering (QA) with a Google search API, we explore a variety of base LMs, prompting methods, fine-tuning data, and QA tasks, and find language agents are consistently improved after fine-tuning their backbone LMs. For example, fine-tuning Llama2-7B with 500 agent trajectories generated by GPT-4 leads to a 77% HotpotQA performance increase. Furthermore, we propose FireAct, a novel approach to fine-tuning LMs with trajectories from multiple tasks and prompting methods, and show having more diverse fine-tuning data can further improve agents. Along with other findings regarding scaling effects, robustness, generalization, efficiency and cost, our work establishes comprehensive benefits of fine-tuning LMs for agents, and provides an initial set of experimental designs, insights, as well as open questions toward language agent fine-tuning.
Learning to Decompose: Hypothetical Question Decomposition Based on Comparable Texts
Zhou, Ben, Richardson, Kyle, Yu, Xiaodong, Roth, Dan
Explicit decomposition modeling, which involves breaking down complex tasks into more straightforward and often more interpretable sub-tasks, has long been a central theme in developing robust and interpretable NLU systems. However, despite the many datasets and resources built as part of this effort, the majority have small-scale annotations and limited scope, which is insufficient to solve general decomposition tasks. In this paper, we look at large-scale intermediate pre-training of decomposition-based transformers using distant supervision from comparable texts, particularly large-scale parallel news. We show that with such intermediate pre-training, developing robust decomposition-based models for a diverse range of tasks becomes more feasible. For example, on semantic parsing, our model, DecompT5, improves 20% to 30% on two datasets, Overnight and TORQUE, over the baseline language model. We further use DecompT5 to build a novel decomposition-based QA system named DecompEntail, improving over state-of-the-art models, including GPT-3, on both HotpotQA and StrategyQA by 8% and 4%, respectively.
Classification of Misinformation in New Articles using Natural Language Processing and a Recurrent Neural Network
Cunha, Brendan, Manikonda, Lydia
Misinformation in news articles has been one of the main topics for discussion over the past few years. There have been several organizations that developed methods for assessing reliability and personal bias of news coverage. In today's day and age, it is unnatural to arbitrarily trust the news outlets that claim to be truly objective and unbiased because the term "bias" is relative. What one person perceives as
One of the first issues to address with these labels is the inconsistency of scales used. For example, some labels are scaled from 0-3 in terms of level of misinformation, others are scaled in a binary manner with 0 and 1, and some have 4 categorical values based on levels of media bias. So there is quite a bit of processing that needed to be done to normalize everything and transform the qualitative variables into quantitative variables.
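The label normalization described above (a 0-3 scale, a binary scale, and a four-level categorical scale) could be handled by mapping every scheme onto a common [0, 1] range. The sketch below is a minimal illustration, assuming min-max rescaling and an ordinal encoding of the categories; the category names in `BIAS_LEVELS` and the scheme names are hypothetical, not taken from the article's datasets.

```python
def min_max(value, low, high):
    """Rescale a numeric label from [low, high] to [0, 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return (value - low) / (high - low)


# Hypothetical ordinal encoding for a four-level media-bias label.
BIAS_LEVELS = {"low": 0, "mixed": 1, "high": 2, "extreme": 3}


def normalize_label(value, scheme):
    """Map a label from one of three hypothetical schemes onto [0, 1]."""
    if scheme == "scale_0_3":
        return min_max(value, 0, 3)
    if scheme == "binary":
        return float(value)
    if scheme == "bias_category":
        # Turn the qualitative category into a number, then rescale it.
        return min_max(BIAS_LEVELS[value], 0, len(BIAS_LEVELS) - 1)
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    print(normalize_label(3, "scale_0_3"))
    print(normalize_label(1, "binary"))
    print(normalize_label("mixed", "bias_category"))
```

Treating the bias categories as evenly spaced ordinal levels is itself a modeling choice; a different encoding may be more appropriate if the categories are not ordered in this way.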
Artificial Intelligence Coming to University at Albany
In a press release on Tuesday, Governor Kathy Hochul announced that the University at Albany will become the home of a new artificial intelligence supercomputing initiative. The $200 million project will turn the former Albany High School building into an engineering college capable of housing a supercomputer that can reach a quintillion computations per second. It would be the first university-based supercomputer capable of reaching that kind of production. In the press release, Governor Hochul said, "My administration is steadfast in its commitment to transform SUNY into a globally renowned, 21st century education leader. This funding will help drive economic revenue by attracting companies to New York's emerging advanced research centers, creating jobs and strengthening communities for decades to come."
Machine Learning Models Predict COVID-19 Impact in Smaller Cities
According to a robust machine learning model that can predict pandemic impact even in smaller cities, with 75% of the population in the Capital Region in New York remaining at home, the COVID-19 pandemic will peak locally in the second half of May. If the rate of people staying home drops to 50%, it will peak in early June. Rensselaer Polytechnic Institute researcher Malik Magdon-Ismail tailored the models he is developing to work with sparse data points, like those available during the early phase in a pandemic or in smaller cities, which ordinarily make trend-spotting difficult. "There are no simple, robust, general tools that, for example, officials in Albany could use to make projections," said Magdon-Ismail, a professor of computer science, and expert in machine learning, data mining, and pattern recognition. "These models show that the projections vary enormously from one city to another. This knowledge could relieve some of the uncertainty that is around in developing policy."
Capitol Watch: New York to take on artificial intelligence
In New York government news, state officials are examining the opportunities -- and risks -- posed by artificial intelligence. Gov. Andrew Cuomo, a Democrat, signed legislation this month that creates a 13-member commission tasked with reviewing the emerging technology and what it will mean for New Yorkers. Meanwhile, the ongoing scourge of opioid abuse is getting some attention with lawmakers announcing a series of public hearings to identify ways the state could do a better job of addressing the problem. While no one is predicting a robot uprising any time soon, state officials say they are concerned by how the rise of artificial intelligence and robotics could affect jobs, the delivery of government services and personal privacy. The New York State Artificial Intelligence, Robotics and Automation Commission, approved by lawmakers earlier this year, will also look at how A.I. could be used "in unlawful or unsafe ways."