ACORN: Aspect-wise Commonsense Reasoning Explanation Evaluation

Brassard, Ana, Heinzerling, Benjamin, Kudo, Keito, Sakaguchi, Keisuke, Inui, Kentaro

arXiv.org Artificial Intelligence

Evaluating free-text explanations is a multifaceted, subjective, and labor-intensive task. Large language models (LLMs) present an appealing alternative due to their potential for consistency, scalability, and cost-efficiency. In this work, we present ACORN, a new dataset of 3,500 free-text explanations with aspect-wise quality ratings, and use it to gain insights into how LLMs evaluate explanations. We observed that replacing one of the human raters with an LLM sometimes maintained, but more often lowered, inter-annotator agreement across different settings and quality aspects, suggesting that LLM judgments are not always consistent with those of human raters. We further quantified this difference by measuring the correlation between LLM-generated ratings and majority-voted human ratings across quality aspects. With the best system, Spearman's rank correlation ranged from 0.53 to 0.95, averaging 0.72 across aspects, indicating moderately high but imperfect alignment. Finally, we considered using an LLM as an additional rater when human raters are scarce: we measured how well majority-voted labels from a limited human pool plus an LLM rater correlated with the original gold labels. While GPT-4 improved the outcome when there were only two human raters, in all other observed cases LLMs were neutral to detrimental once there were three or more human raters. We publicly release the dataset to support future improvements in LLM-in-the-loop evaluation here: https://github.com/a-brassard/ACORN.
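The evaluation described above can be sketched in a few lines: majority-vote the human ratings per item, then compute Spearman's rank correlation between the LLM's ratings and the voted labels. The toy ratings below are illustrative only, not ACORN's actual rating schema.

```python
from collections import Counter

def majority_vote(ratings):
    # Majority-voted label among human raters; ties broken by smallest label.
    counts = Counter(ratings)
    top = max(counts.values())
    return min(r for r, c in counts.items() if c == top)

def rank(xs):
    # Average ranks with tie handling, as used by Spearman's rho.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    # Spearman's rho = Pearson correlation of the two rank vectors.
    ra, rb = rank(a), rank(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

# Toy data: per-item human rating triples and one LLM rating per item.
human = [[3, 3, 2], [1, 2, 1], [4, 4, 5], [2, 3, 3], [5, 5, 4]]
llm = [3, 1, 5, 2, 4]
gold = [majority_vote(h) for h in human]
rho = spearman(llm, gold)
```

Higher rho means the LLM's ranking of explanation quality agrees more closely with the majority-voted human ranking.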


FACTUAL: A Novel Framework for Contrastive Learning Based Robust SAR Image Classification

Wang, Xu, Ye, Tian, Kannan, Rajgopal, Prasanna, Viktor

arXiv.org Artificial Intelligence

Deep Learning (DL) models for Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR), while delivering improved performance, have been shown to be quite vulnerable to adversarial attacks. Existing works improve robustness by training models on adversarial samples; however, by focusing mostly on attacks that manipulate images randomly, they neglect the real-world feasibility of such attacks. In this paper, we propose FACTUAL, a novel contrastive learning framework for adversarial training and robust SAR classification. FACTUAL consists of two components: (1) a novel perturbation scheme that, unlike existing works, incorporates realistic physical adversarial attacks (such as OTSA) to build a supervised adversarial pre-training network, which uses class labels to cluster clean and perturbed images together into a more informative feature space; and (2) a linear classifier cascaded after the encoder that uses the computed representations to predict the target labels. By pre-training and fine-tuning our model on both clean and adversarial samples, we show that it achieves high prediction accuracy in both cases: 99.7% on clean samples and 89.6% on perturbed samples, both outperforming previous state-of-the-art methods.
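The pre-training step in component (1) belongs to the family of supervised contrastive objectives. The sketch below is a generic supervised contrastive loss, not FACTUAL's exact formulation: embeddings sharing a class label (e.g. a clean image and its perturbed counterpart) are pulled together, everything else is pushed apart.

```python
import math

def supcon_loss(embeddings, labels, tau=0.1):
    # Generic supervised contrastive loss: for each anchor, positives are
    # all other samples with the same label; all non-anchor samples appear
    # in the softmax denominator. tau is the temperature.
    n = len(embeddings)

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def normalize(v):
        s = math.sqrt(dot(v, v))
        return [x / s for x in v]

    z = [normalize(e) for e in embeddings]
    total, count = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue
        denom = sum(math.exp(dot(z[i], z[a]) / tau) for a in range(n) if a != i)
        for p in positives:
            total += -math.log(math.exp(dot(z[i], z[p]) / tau) / denom)
            count += 1
    return total / count

# Same labels, two geometries: same-class embeddings aligned vs. scattered.
labels = [0, 0, 1, 1]
tight = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loose = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
```

A feature space where same-class (clean and perturbed) samples cluster yields a lower loss, which is what makes the representations useful to the cascaded linear classifier.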


Test-time Augmentation for Factual Probing

Kamoda, Go, Heinzerling, Benjamin, Sakaguchi, Keisuke, Inui, Kentaro

arXiv.org Artificial Intelligence

Factual probing is a method that uses prompts to test if a language model "knows" certain world knowledge facts. A problem in factual probing is that small changes to the prompt can lead to large changes in model output. Previous work aimed to alleviate this problem by optimizing prompts via text mining or fine-tuning. However, such approaches are relation-specific and do not generalize to unseen relation types. Here, we propose to use test-time augmentation (TTA) as a relation-agnostic method for reducing sensitivity to prompt variations by automatically augmenting and ensembling prompts at test time. Experiments show improved model calibration, i.e., with TTA, model confidence better reflects prediction accuracy. Improvements in prediction accuracy are observed for some models, but for other models, TTA leads to degradation. Error analysis identifies the difficulty of producing high-quality prompt variations as the main challenge for TTA.
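The TTA procedure can be sketched as: generate prompt variants, query the model with each, and average the per-answer probabilities across variants. Here `paraphrase_fn` and `model_fn` are hypothetical stand-ins for a prompt paraphraser and a probed language model, with toy stubs for illustration.

```python
from collections import defaultdict

def ensemble_predict(prompt, paraphrase_fn, model_fn, n_variants=4):
    # Test-time augmentation: query the model with the original prompt plus
    # automatically generated paraphrases, then average the per-answer
    # probabilities over all variants.
    variants = [prompt] + [paraphrase_fn(prompt, i) for i in range(n_variants)]
    totals = defaultdict(float)
    for v in variants:
        for answer, prob in model_fn(v).items():
            totals[answer] += prob
    avg = {a: s / len(variants) for a, s in totals.items()}
    return max(avg, key=avg.get), avg

def toy_paraphrase(prompt, i):
    # Stand-in for an automatic paraphraser.
    return f"{prompt} (variant {i})"

def toy_model(variant):
    # Stand-in for a brittle LM: one phrasing flips the answer,
    # but the ensemble recovers the majority prediction.
    if "variant 1" in variant:
        return {"Paris": 0.2, "Lyon": 0.8}
    return {"Paris": 0.8, "Lyon": 0.2}

pred, probs = ensemble_predict("The capital of France is", toy_paraphrase, toy_model)
```

Averaging also smooths the model's confidence, which is the calibration effect the abstract reports; the downside, as the error analysis notes, is that low-quality paraphrases can drag the ensemble down.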


Forgetful Large Language Models: Lessons Learned from Using LLMs in Robot Programming

Chen, Juo-Tung, Huang, Chien-Ming

arXiv.org Artificial Intelligence

Large language models offer new ways of empowering people to program robot applications, namely code generation via prompting. However, the code generated by LLMs is susceptible to errors. This work reports a preliminary exploration that empirically characterizes common errors produced by LLMs in robot programming. We categorize these errors into two phases: interpretation and execution. In this work, we focus on errors in execution and observe that they are caused by LLMs being "forgetful" of key information provided in user prompts. Based on this observation, we propose prompt engineering tactics designed to reduce errors in execution. We then demonstrate the effectiveness of these tactics with three language models: ChatGPT, Bard, and LLaMA-2. Finally, we discuss lessons learned from using LLMs in robot programming and call for the benchmarking of LLM-powered end-user development of robot applications.
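The abstract does not spell out the specific tactics, but one plausible countermeasure to "forgetfulness", sketched here purely as a hypothetical illustration, is to restate the user's key constraints verbatim in every follow-up prompt rather than relying on the model to retain them from earlier turns:

```python
def build_prompt(task, constraints, history):
    # Hypothetical tactic: re-inject every constraint from the original
    # request into each follow-up, so it is always in the model's context.
    restated = "\n".join(f"- {c}" for c in constraints)
    return (
        "You are generating robot control code.\n"
        f"Constraints (always apply):\n{restated}\n\n"
        f"Conversation so far:\n{history}\n\n"
        f"Next request: {task}"
    )

prompt = build_prompt(
    "Move the arm to the bin.",
    ["max speed 0.1 m/s", "never exceed joint limits"],
    "User asked to pick up the red block; code was generated.",
)
```

The trade-off is longer prompts on every turn in exchange for not depending on the model's memory of earlier context.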


Factual, a Location Data Company Leverages Machine Learning to Update Its Data Insights Solution

#artificialintelligence

Factual, the location data company, today announced a significant update to its Audience product, adding Predictive and Loyalty audiences built using machine-learned predictive insights to its roster of targeting solutions for marketers. Beginning today, marketers will have access to new Predictive Audiences and Loyalty Audiences, both built on sophisticated visitation-pattern analysis. These will further enable marketers to construct highly scalable and accurate audience segments based on real-world consumer behavior and designed for ROI. The company has also added more than 100 ready-to-use audience segments in every vertical, including auto, retail, and quick-service restaurants (QSR). Factual builds its Predictive Audiences by developing an understanding of visitors to a place category and mapping their visitation patterns beforehand. Using Factual's Observation Graph, consumers most likely to visit a category based on these patterns can be segmented into audiences, giving marketers the ability to connect with consumers before they set foot in a brand's retail location.


SemEval-2019 Task 8: Fact Checking in Community Question Answering Forums

Mihaylova, Tsvetomila, Karadjov, Georgi, Atanasova, Pepa, Baly, Ramy, Mohtarami, Mitra, Nakov, Preslav

arXiv.org Machine Learning

We present SemEval-2019 Task 8 on Fact Checking in Community Question Answering Forums, which features two subtasks. Subtask A is about deciding whether a question asks for factual information vs. an opinion/advice vs. just socializing. Subtask B asks to predict whether an answer to a factual question is true, false, or not a proper answer. We received 17 official submissions for Subtask A and 11 official submissions for Subtask B. For Subtask A, all systems improved over the majority-class baseline. For Subtask B, all systems were below a majority-class baseline, but several systems were very close to it. The leaderboard and the data from the competition can be found at http://competitions.codalab.org/competitions/20022
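The majority-class baseline that both subtasks are measured against is simple: always predict the most frequent label in the training data. A minimal sketch with toy labels (the label names are illustrative):

```python
from collections import Counter

def majority_baseline(train_labels, test_labels):
    # Predict the most frequent training label for every test instance
    # and report its accuracy; this is the bar systems are compared to.
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for y in test_labels if y == majority)
    return majority, correct / len(test_labels)

train = ["factual", "opinion", "factual", "socializing", "factual"]
test = ["factual", "opinion", "factual", "factual"]
label, acc = majority_baseline(train, test)
```

That all Subtask B systems fell below this trivial baseline shows how hard answer-veracity prediction is relative to question classification.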