Bao, Forrest Sheng
FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs
Bao, Forrest Sheng, Li, Miaoran, Qu, Renyi, Luo, Ge, Wan, Erana, Tang, Yujia, Fan, Weisi, Tamber, Manveer Singh, Kazi, Suleman, Sourabh, Vivek, Qi, Mike, Tu, Ruixuan, Xu, Chenyu, Gonzales, Matthew, Mendelevitch, Ofer, Ahmad, Amin
Summarization is one of the most common tasks performed by large language models (LLMs), especially in applications like Retrieval-Augmented Generation (RAG). However, existing evaluations of hallucinations in LLM-generated summaries, and evaluations of hallucination detection models both suffer from a lack of diversity and recency in the LLM and LLM families considered. This paper introduces FaithBench, a summarization hallucination benchmark comprising challenging hallucinations made by 10 modern LLMs from 8 different families, with ground truth annotations by human experts. ``Challenging'' here means summaries on which popular, state-of-the-art hallucination detection models, including GPT-4o-as-a-judge, disagreed on. Our results show GPT-4o and GPT-3.5-Turbo produce the least hallucinations. However, even the best hallucination detection models have near 50\% accuracies on FaithBench, indicating lots of room for future improvement. The repo is https://github.com/vectara/FaithBench
DocAsRef: An Empirical Study on Repurposing Reference-Based Summary Quality Metrics Reference-Freely
Bao, Forrest Sheng, Tu, Ruixuan, Luo, Ge, Yang, Yinfei, Li, Hebi, Qiu, Minghui, He, Youbiao, Chen, Cen
Automated summary quality assessment falls into two categories: reference-based and reference-free. Reference-based metrics, historically deemed more accurate due to the additional information provided by human-written references, are limited by their reliance on human input. In this paper, we hypothesize that the comparison methodologies used by some reference-based metrics to evaluate a system summary against its corresponding reference can be effectively adapted to assess it against its source document, thereby transforming these metrics into reference-free ones. Experimental results support this hypothesis. After being repurposed reference-freely, the zero-shot BERTScore using the pretrained DeBERTa-large-MNLI model of <0.5B parameters consistently outperforms its original reference-based version across various aspects on the SummEval and Newsroom datasets. It also excels in comparison to most existing reference-free metrics and closely competes with zero-shot summary evaluators based on GPT-3.5.
Circuit Routing Using Monte Carlo Tree Search and Deep Neural Networks
He, Youbiao, Bao, Forrest Sheng
Circuit routing is a fundamental problem in designing electronic systems such as integrated circuits (ICs) and printed circuit boards (PCBs) which form the hardware of electronics and computers. Like finding paths between pairs of locations, circuit routing generates traces of wires to connect contacts or leads of circuit components. It is challenging because finding paths between dense and massive electronic components involves a very large search space. Existing solutions are either manually designed with domain knowledge or tailored to specific design rules, hence, difficult to adapt to new problems or design needs. Therefore, a general routing approach is highly desired. In this paper, we model the circuit routing as a sequential decision-making problem, and solve it by Monte Carlo tree search (MCTS) with deep neural network (DNN) guided rollout. It could be easily extended to routing cases with more routing constraints and optimization goals. Experiments on randomly generated single-layer circuits show the potential to route complex circuits. The proposed approach can solve the problems that benchmark methods such as sequential A* method and Lee's algorithm cannot solve, and can also outperform the vanilla MCTS approach.
Triaging moderate COVID-19 and other viral pneumonias from routine blood tests
Bao, Forrest Sheng, He, Youbiao, Liu, Jie, Chen, Yuanfang, Li, Qian, Zhang, Christina R., Han, Lei, Zhu, Baoli, Ge, Yaorong, Chen, Shi, Xu, Ming, Ouyang, Liu
The COVID-19 is sweeping the world with deadly consequences. Its contagious nature and clinical similarity to other pneumonias make separating subjects contracted with COVID-19 and non-COVID-19 viral pneumonia a priority and a challenge. However, COVID-19 testing has been greatly limited by the availability and cost of existing methods, even in developed countries like the US. Intrigued by the wide availability of routine blood tests, we propose to leverage them for COVID-19 testing using the power of machine learning. Two proven-robust machine learning model families, random forests (RFs) and support vector machines (SVMs), are employed to tackle the challenge. Trained on blood data from 208 moderate COVID-19 subjects and 86 subjects with non-COVID-19 moderate viral pneumonia, the best result is obtained in an SVM-based classifier with an accuracy of 84%, a sensitivity of 88%, a specificity of 80%, and a precision of 92%. The results are found explainable from both machine learning and medical perspectives. A privacy-protected web portal is set up to help medical personnel in their practice and the trained models are released for developers to further build other applications. We hope our results can help the world fight this pandemic and welcome clinical verification of our approach on larger populations.
RLScheduler: Learn to Schedule HPC Batch Jobs Using Deep Reinforcement Learning
Zhang, Di, Dai, Dong, He, Youbiao, Bao, Forrest Sheng
We present RLScheduler, a deep reinforcement learning based job scheduler for scheduling independent batch jobs in high-performance computing (HPC) environment. From knowing nothing about scheduling at beginning, RLScheduler is able to autonomously learn how to effectively schedule HPC batch jobs, targeting a given optimization goal. This is achieved by deep reinforcement learning with the help of specially designed neural network structures and various optimizations to stabilize and accelerate the learning. Our results show that RLScheduler can outperform existing heuristic scheduling algorithms, including a manually fine-tuned machine learning-based scheduler on the same workload. More importantly, we show that RLScheduler does not blindly over-fit the given workload to achieve such optimization, instead, it learns general rules for scheduling batch jobs which can be further applied to different workloads and systems to achieve similarly optimized performance. We also demonstrate that RLScheduler is capable of adjusting itself along with changing goals and workloads, making it an attractive solution for the future autonomous HPC management.
Accelerating SAT Solving by Common Subclause Elimination
Yan, Yaowei (University of Akron) | Gutierrez, Chris E. (Texas Tech University) | Jn-Charles, Jeriah (Texas Tech University) | Bao, Forrest Sheng (University of Akron) | Zhang, Yuanlin (Texas Tech University)
Boolean SATisfiability (SAT) is an important problem in AI. SAT solvers have been effectively used in important industrial applications including automated planning and verification. In this paper, we present novel algorithms for fast SAT solving by employing two common subclause elimination (CSE) approaches. Our motivation is that modern SAT solving techniques can be more efficient on CSE-processed instances. Empirical study shows that CSE can significantly speed up SAT solving.
Temporally Expressive Planning Based on Answer Set Programming with Constraints
Bao, Forrest Sheng (Texas Tech University) | Zhang, Yuanlin (Texas Tech University)
Recently, a new language AC(C) was proposed to integrate answer set programming (ASP) and constraint logic programming (CLP). In this paper, we show that temporally expressive planning problems in PDDL2.1 can be translated into AC(C) and solved using AC(C) solvers. Compared with existing approaches, the new approach puts less restrictions on the planning problems and is easy to extend with new features like PDDL axioms. It can also leverage the inference engine for AC(C) which has the potential to exploit the best reasoning mechanisms developed in the ASP, SAT and CP communities.
Combining Probabilistic Planning and Logic Programming on Mobile Robots
Zhang, Shiqi (Texas Tech University) | Bao, Forrest Sheng (Texas Tech University) | Sridharan, Mohan (Texas Tech University)
Key challenges to widespread deployment of mobile robots to interact with humans in real-world domains include the ability to: (a) robustly represent and revise domain knowledge; (b) autonomously adapt sensing and processing to the task at hand; and (c) learn from unreliable high-level human feedback. Partially observable Markov decision processes (POMDPs) have been used to plan sensing and navigation in different application domains. It is however a challenge to include common sense knowledge obtained from sensory or human inputs in POMDPs. In addition, information extracted from sensory and human inputs may have varying levels of relevance to current and future tasks. On the other hand, although a non-monotonic logic programming paradigm such as Answer Set Programming (ASP) is wellsuited for common sense reasoning, it is unable to model the uncertainty in real-world sensing and navigation (Gelfond 2008). This paper presents a hybrid framework that integrates ASP, hierarchical POMDPs (Zhang and Sridharan 2012) and psychophysics principles to address the challenges stated above. Experimental results in simulation and on mobile robots deployed in indoor domains show that the framework results in reliable and efficient operation.
Medical Treatment Conflict Resolving in Answer Set Programming
Bao, Forrest Sheng (Texas Tech University) | Zhang, Zhizheng (Southeast University) | Zhang, Yuanlin (Texas Tech University)
Medical treatment decision making is a good application of knowledge representation and reasoning. We are particularly interested in using it to resolve treatment conflicts, a complicated condition when two treatments cannot be given simultaneously to a patient of multiple symptoms. The logic system is required to reason on cases with and without treatment conflicts. Thanks to the nonmonotonicity of Answer Set Programming (ASP), we elegantly automate medical treatment conflict resolving on an example problem and show the importance of nonmonotonicity in medical reasoning.
The AC(C) Language: Integrating Answer Set Programming and Constraint Logic Programming
Bao, Forrest Sheng (Texas Tech University)
Combining Answer Set Programming (ASP) and Constraint Logic Programming (CLP) can create a more powerful language for knowledge representation and reasoning. The language AC(C) is designed to integrate ASP and CLP. Compared with existing integration of ASP and CSP, AC(C) allows representing user-defined constraints. Such integration provides great power for applications requiring logical reasoning involving constraints, e.g., temporal planning. In AC(C), user-defined and primitive constraints can be solved by a CLP inference engine while the logical reasoning over those constraints and regular logic literals is solved by an ASP inference engine (i.e., solver). My PhD work includes improving the language AC(C), implementing its faster inference engine and investigating how effective the new system can be used to solve a challenging application, temporal planning.