Large Language Model
Legal Syllogism Prompting: Teaching Large Language Models for Legal Judgment Prediction
Legal syllogism is a form of deductive reasoning commonly used by legal professionals to analyze cases. In this paper, we propose legal syllogism prompting (LoT), a simple prompting method to teach large language models (LLMs) for legal judgment prediction. LoT teaches only that in the legal syllogism the major premise is law, the minor premise is the fact, and the conclusion is judgment. Then the models can produce a syllogism reasoning of the case and give the judgment without any learning, fine-tuning, or examples. On CAIL2018, a Chinese criminal case dataset, we performed zero-shot judgment prediction experiments with GPT-3 models. Our results show that LLMs with LoT achieve better performance than the baseline and chain of thought prompting, the state-of-art prompting method on diverse reasoning tasks. LoT enables the model to concentrate on the key information relevant to the judgment and to correctly understand the legal meaning of acts, as compared to other methods. Our method enables LLMs to predict judgment along with law articles and justification, which significantly enhances the explainability of models.
LogPr\'ecis: Unleashing Language Models for Automated Shell Log Analysis
Boffa, Matteo, Valentim, Rodolfo Vieira, Vassio, Luca, Giordano, Danilo, Drago, Idilio, Mellia, Marco, Houidi, Zied Ben
The collection of security-related logs holds the key to understanding attack behaviors and diagnosing vulnerabilities. Still, their analysis remains a daunting challenge. Recently, Language Models (LMs) have demonstrated unmatched potential in understanding natural and programming languages. The question arises whether and how LMs could be also useful for security experts since their logs contain intrinsically confused and obfuscated information. In this paper, we systematically study how to benefit from the state-of-the-art in LM to automatically analyze text-like Unix shell attack logs. We present a thorough design methodology that leads to LogPr\'ecis. It receives as input raw shell sessions and automatically identifies and assigns the attacker tactic to each portion of the session, i.e., unveiling the sequence of the attacker's goals. We demonstrate LogPr\'ecis capability to support the analysis of two large datasets containing about 400,000 unique Unix shell attacks. LogPr\'ecis reduces them into about 3,000 fingerprints, each grouping sessions with the same sequence of tactics. The abstraction it provides lets the analyst better understand attacks, identify fingerprints, detect novelty, link similar attacks, and track families and mutations. Overall, LogPr\'ecis, released as open source, paves the way for better and more responsive defense against cyberattacks.
Extending the Frontier of ChatGPT: Code Generation and Debugging
Sakib, Fardin Ahsan, Khan, Saadat Hasan, Karim, A. H. M. Rezaul
Large-scale language models (LLMs) have emerged as a groundbreaking innovation in the realm of question-answering and conversational agents. These models, leveraging different deep learning architectures such as Transformers, are trained on vast corpora to predict sentences based on given queries. Among these LLMs, ChatGPT, developed by OpenAI, has ushered in a new era by utilizing artificial intelligence (AI) to tackle diverse problem domains, ranging from composing essays and biographies to solving intricate mathematical integrals. The versatile applications enabled by ChatGPT offer immense value to users. However, assessing the performance of ChatGPT's output poses a challenge, particularly in scenarios where queries lack clear objective criteria for correctness. For instance, evaluating the quality of generated essays becomes arduous and relies heavily on manual labor, in stark contrast to evaluating solutions to well-defined, closed-ended questions such as mathematical problems. This research paper delves into the efficacy of ChatGPT in solving programming problems, examining both the correctness and the efficiency of its solution in terms of time and memory complexity. The research reveals a commendable overall success rate of 71.875\%, denoting the proportion of problems for which ChatGPT was able to provide correct solutions that successfully satisfied all the test cases present in Leetcode. It exhibits strengths in structured problems and shows a linear correlation between its success rate and problem acceptance rates. However, it struggles to improve solutions based on feedback, pointing to potential shortcomings in debugging tasks. These findings provide a compact yet insightful glimpse into ChatGPT's capabilities and areas for improvement.
Harnessing Scalable Transactional Stream Processing for Managing Large Language Models [Vision]
Zhang, Shuhao, Zeng, Xianzhi, Wu, Yuhao, Yang, Zhonghao
Large Language Models (LLMs) have demonstrated extraordinary performance across a broad array of applications, from traditional language processing tasks to interpreting structured sequences like time-series data. Yet, their effectiveness in fast-paced, online decision-making environments requiring swift, accurate, and concurrent responses poses a significant challenge. This paper introduces TStreamLLM, a revolutionary framework integrating Transactional Stream Processing (TSP) with LLM management to achieve remarkable scalability and low latency. By harnessing the scalability, consistency, and fault tolerance inherent in TSP, TStreamLLM aims to manage continuous & concurrent LLM updates and usages efficiently. We showcase its potential through practical use cases like real-time patient monitoring and intelligent traffic management. The exploration of synergies between TSP and LLM management can stimulate groundbreaking developments in AI and database research. This paper provides a comprehensive overview of challenges and opportunities in this emerging field, setting forth a roadmap for future exploration and development.
Knowledge Boosting: Rethinking Medical Contrastive Vision-Language Pre-Training
Chen, Xiaofei, He, Yuting, Xue, Cheng, Ge, Rongjun, Li, Shuo, Yang, Guanyu
The foundation models based on pre-training technology have significantly advanced artificial intelligence from theoretical to practical applications. These models have facilitated the feasibility of computer-aided diagnosis for widespread use. Medical contrastive vision-language pre-training, which does not require human annotations, is an effective approach for guiding representation learning using description information in diagnostic reports. However, the effectiveness of pre-training is limited by the large-scale semantic overlap and shifting problems in medical field. To address these issues, we propose the Knowledge-Boosting Contrastive Vision-Language Pre-training framework (KoBo), which integrates clinical knowledge into the learning of vision-language semantic consistency. The framework uses an unbiased, open-set sample-wise knowledge representation to measure negative sample noise and supplement the correspondence between vision-language mutual information and clinical knowledge. Extensive experiments validate the effect of our framework on eight tasks including classification, segmentation, retrieval, and semantic relatedness, achieving comparable or better performance with the zero-shot or few-shot settings. Our code is open on https://github.com/ChenXiaoFei-CS/KoBo.
Jointly Extracting Interventions, Outcomes, and Findings from RCT Reports with LLMs
Wadhwa, Somin, DeYoung, Jay, Nye, Benjamin, Amir, Silvio, Wallace, Byron C.
Results from Randomized Controlled Trials (RCTs) establish the comparative effectiveness of interventions, and are in turn critical inputs for evidence-based care. However, results from RCTs are presented in (often unstructured) natural language articles describing the design, execution, and outcomes of trials; clinicians must manually extract findings pertaining to interventions and outcomes of interest from such articles. This onerous manual process has motivated work on (semi-)automating extraction of structured evidence from trial reports. In this work we propose and evaluate a text-to-text model built on instruction-tuned Large Language Models (LLMs) to jointly extract Interventions, Outcomes, and Comparators (ICO elements) from clinical abstracts, and infer the associated results reported. Manual (expert) and automated evaluations indicate that framing evidence extraction as a conditional generation task and fine-tuning LLMs for this purpose realizes considerable ($\sim$20 point absolute F1 score) gains over the previous SOTA. We perform ablations and error analyses to assess aspects that contribute to model performance, and to highlight potential directions for further improvements. We apply our model to a collection of published RCTs through mid-2022, and release a searchable database of structured findings: http://ico-relations.ebm-nlp.com
Underspecification in Language Modeling Tasks: A Causality-Informed Study of Gendered Pronoun Resolution
Modern language modeling tasks are often underspecified: for a given token prediction, many words may satisfy the user's intent of producing natural language at inference time, however only one word would minimize the task's loss function at training time. We provide a simple yet plausible causal mechanism describing the role underspecification plays in the generation of spurious correlations. Despite its simplicity, our causal model directly informs the development of two lightweight black-box evaluation methods, that we apply to gendered pronoun resolution tasks on a wide range of LLMs to 1) aid in the detection of inference-time task underspecification by exploiting 2) previously unreported gender vs. time and gender vs. location spurious correlations on LLMs with a range of A) sizes: from BERT-base to GPT 3.5, B) pre-training objectives: from masked & autoregressive language modeling to a mixture of these objectives, and C) training stages: from pre-training only to reinforcement learning from human feedback (RLHF). Code and open-source demos available at https: //github.com/2dot71mily/sib_paper.
All Those Professors Warning About ChatGPT? My Class Is Their Worst Nightmare.
At the end of June, I started teaching a course that was designed by ChatGPT, uses ChatGPT, and is even assessed by ChatGPT. The course itself is also about ChatGPT. The idea for it came to me earlier this year when I saw news about job opportunities for "prompt engineers," who could purportedly make more than $300,000 per year. As a professor who teaches about advanced technology transitions, my academic ears pricked up. Was there a skillset here we should be teaching our students? And if so, how should we go about it?
AI learned from their work. Now they want compensation.
This past week, comedian Sarah Silverman filed a lawsuit against OpenAI and Facebook parent company Meta, alleging they used a pirated copy of her book in training data because the companies' chatbots can summarize her book accurately. Novelists Mona Awad and Paul Tremblay filed a similar lawsuit against OpenAI. And more than 5,000 authors, including Jodi Picoult, Margaret Atwood and Viet Thanh Nguyen, have signed a petition asking tech companies to get consent from and give credit and compensation to writers whose books were used in training data.
Measuring Faithfulness in Chain-of-Thought Reasoning
Lanham, Tamera, Chen, Anna, Radhakrishnan, Ansh, Steiner, Benoit, Denison, Carson, Hernandez, Danny, Li, Dustin, Durmus, Esin, Hubinger, Evan, Kernion, Jackson, Lukošiūtė, Kamilė, Nguyen, Karina, Cheng, Newton, Joseph, Nicholas, Schiefer, Nicholas, Rausch, Oliver, Larson, Robin, McCandlish, Sam, Kundu, Sandipan, Kadavath, Saurav, Yang, Shannon, Henighan, Thomas, Maxwell, Timothy, Telleen-Lawton, Timothy, Hume, Tristan, Hatfield-Dodds, Zac, Kaplan, Jared, Brauner, Jan, Bowman, Samuel R., Perez, Ethan
Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its process for answering the question). We investigate hypotheses for how CoT reasoning may be unfaithful, by examining how the model predictions change when we intervene on the CoT (e.g., by adding mistakes or paraphrasing it). Models show large variation across tasks in how strongly they condition on the CoT when predicting their answer, sometimes relying heavily on the CoT and other times primarily ignoring it. CoT's performance boost does not seem to come from CoT's added test-time compute alone or from information encoded via the particular phrasing of the CoT. As models become larger and more capable, they produce less faithful reasoning on most tasks we study. Overall, our results suggest that CoT can be faithful if the circumstances such as the model size and task are carefully chosen.