Larger language models have higher accuracy on average, but are they better on every single instance (datapoint)? Some work suggests larger models have higher out-of-distribution robustness, while other work suggests they have lower accuracy on rare subgroups. To understand these differences, we investigate these models at the level of individual instances. However, one major challenge is that individual predictions are highly sensitive to the randomness in training. We develop statistically rigorous methods to address this, and after accounting for pretraining and finetuning noise, we find that our BERT-Large is worse than BERT-Mini on at least 1-4% of instances across MNLI, SST-2, and QQP, compared to the overall accuracy improvement of 2-10%. We also find that finetuning noise increases with model size and that instance-level accuracy has momentum: improvement from BERT-Mini to BERT-Medium correlates with improvement from BERT-Medium to BERT-Large. Our findings suggest that instance-level predictions provide a rich source of information; we therefore recommend that researchers supplement model weights with model predictions.
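As a rough illustration of what an instance-level comparison looks like, the sketch below averages per-instance correctness over several training seeds for two models and reports the fraction of instances on which the larger model scores lower. It is a naive estimate only: the function names and the input layout (a seeds-by-instances array of 0/1 correctness) are assumptions made for this example, and it does not reproduce the statistical correction for pretraining and finetuning noise described in the abstract.

```python
import numpy as np

def instance_accuracy(correct):
    """Mean correctness per instance, averaged over training seeds.

    `correct` is a hypothetical (n_seeds, n_instances) array of 0/1 values.
    """
    return np.asarray(correct, dtype=float).mean(axis=0)

def fraction_worse(correct_small, correct_large):
    """Naive fraction of instances where the larger model is less accurate.

    With few seeds this estimate is inflated by training noise, which is
    the problem the abstract says must be accounted for.
    """
    acc_small = instance_accuracy(correct_small)
    acc_large = instance_accuracy(correct_large)
    return float(np.mean(acc_large < acc_small))

def overall_gain(correct_small, correct_large):
    """Difference in average accuracy (large minus small)."""
    return float(np.mean(correct_large) - np.mean(correct_small))

# Toy data: 2 seeds, 3 instances (made up for illustration).
small = [[1, 0, 1],
         [1, 0, 1]]
large = [[0, 1, 1],
         [1, 1, 1]]
print(fraction_worse(small, large), overall_gain(small, large))
```

In the toy data the larger model gains accuracy overall while still doing worse on one of the three instances, which is the kind of pattern the abstract describes.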
As AI-powered systems increasingly mediate consequential decision-making, their explainability is critical for end-users to take informed and accountable actions. Explanations in human-human interactions are socially-situated. AI systems are often socio-organizationally embedded. However, Explainable AI (XAI) approaches have been predominantly algorithm-centered. We take a developmental step towards socially-situated XAI by introducing and exploring Social Transparency (ST), a sociotechnically informed perspective that incorporates the socio-organizational context into explaining AI-mediated decision-making. To explore ST conceptually, we conducted interviews with 29 AI users and practitioners grounded in a speculative design scenario. We suggested constitutive design elements of ST and developed a conceptual framework to unpack ST's effects and implications at the technical, decision-making, and organizational levels. The framework showcases how ST can potentially calibrate trust in AI, improve decision-making, facilitate organizational collective actions, and cultivate holistic explainability. Our work contributes to the discourse of Human-Centered XAI by expanding the design space of XAI.
Wherever that will lead is, at the time of writing, still not certain, but regardless of the direction, it's clear that advancing artificial intelligence is a key strategic element for both major parties. Over the course of the past few years, governments around the world have taken strong positions on advancing their strategies around AI adoption. Certainly, heading into the new year, it seems that the pace of adoption won't be slowing any time soon. At the recent Data for AI conference, we had an opportunity to get insights into how the government plans to continue and accelerate its adoption of AI in an interview with Ellery Taylor, Acting Director of the Office of Acquisition Management and Innovation Division, at the US General Services Administration (GSA). In this article he shares his outlook for the future of AI and how it is being adopted in the government.
What if I told a story here, how would that story start?" Thus, the summarization prompt: "My second grader asked me what this passage means: …" When a given prompt isn't working and GPT-3 keeps pivoting into other modes of completion, that may mean that one hasn't constrained it enough by imitating a correct output, and one needs to go further; writing the first few words or sentence of the target output may be necessary.
To build Sounding Board, we develop a system architecture that is capable of accommodating dialog strategies that we designed for socialbot conversations. The architecture consists of a multi-dimensional language understanding module for analyzing user utterances, a hierarchical dialog management framework for dialog context tracking and complex dialog control, and a language generation process that realizes the response plan and makes adjustments for speech synthesis. Additionally, we construct a new knowledge base to power the socialbot by collecting social chat content from a variety of sources. An important contribution of the system is the synergy between the knowledge base and the dialog management, i.e., the use of a graph structure to organize the knowledge base, which makes dialog control very efficient in bringing related content to the discussion. Using the data collected from Sounding Board during the competition, we carry out in-depth analyses of socialbot conversations and user ratings, which provide valuable insights into evaluation methods for socialbots. We additionally investigate a new approach for system evaluation and diagnosis that allows scoring individual dialog segments in the conversation. Finally, observing that socialbots suffer from the issue of shallow conversations about topics associated with unstructured data, we study the problem of enabling extended socialbot conversations grounded on a document. To bring together machine reading and dialog control techniques, a graph-based document representation is proposed, together with methods for automatically constructing the graph. Using the graph-based representation, dialog control can be carried out by retrieving nodes or moving along edges in the graph. To illustrate the usage, a mixed-initiative dialog strategy is designed for socialbot conversations on news articles.
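The idea of dialog control as "retrieving nodes or moving along edges" can be pictured with a small sketch. This is not Sounding Board's implementation: the `ContentGraph` class, its methods, and the content and topic identifiers are all hypothetical, and the sketch assumes a simple bipartite graph in which content items are linked through shared topics.

```python
from collections import defaultdict

class ContentGraph:
    """Hypothetical bipartite graph linking content items to topics."""

    def __init__(self):
        self._topics_of = defaultdict(set)   # content id -> topics
        self._content_of = defaultdict(set)  # topic -> content ids

    def add(self, content_id, topics):
        """Add a content node and connect it to each of its topic nodes."""
        for topic in topics:
            self._topics_of[content_id].add(topic)
            self._content_of[topic].add(content_id)

    def related(self, content_id, used=()):
        """Follow content -> topic -> content edges to find follow-ups.

        Items already used in the conversation are excluded.
        """
        skip = set(used) | {content_id}
        found = set()
        for topic in self._topics_of.get(content_id, ()):
            found |= self._content_of[topic]
        return sorted(found - skip)

# Toy knowledge base (made-up entries).
graph = ContentGraph()
graph.add("fact1", ["cats"])
graph.add("news1", ["cats", "dogs"])
graph.add("joke1", ["dogs"])
print(graph.related("news1", used=["fact1"]))
```

Under these assumptions, choosing the next thing to say reduces to a lookup over neighbors of the current node, which is one way a graph-organized knowledge base could make it cheap to bring related content into the discussion.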
Ringeval, Fabien, Schuller, Björn, Valstar, Michel, Cummins, Nicholas, Cowie, Roddy, Tavabi, Leili, Schmitt, Maximilian, Alisamir, Sina, Amiriparian, Shahin, Messner, Eva-Maria, Song, Siyang, Liu, Shuo, Zhao, Ziping, Mallol-Ragolta, Adria, Ren, Zhao, Soleymani, Mohammad, Pantic, Maja
The Audio/Visual Emotion Challenge and Workshop (AVEC 2019) "State-of-Mind, Detecting Depression with AI, and Cross-cultural Affect Recognition" is the ninth competition event aimed at the comparison of multimedia processing and machine learning methods for automatic audiovisual health and emotion analysis, with all participants competing strictly under the same conditions. The goal of the Challenge is to provide a common benchmark test set for multimodal information processing and to bring together the health and emotion recognition communities, as well as the audiovisual processing communities, to compare the relative merits of various approaches to health and emotion recognition from real-life data. This paper presents the major novelties introduced this year, the challenge guidelines, the data used, and the performance of the baseline systems on the three proposed tasks: state-of-mind recognition, depression assessment with AI, and cross-cultural affect sensing, respectively.
Last week, OpenAI released GPT-2, a conversational AI system that quickly became controversial. Without domain-specific data, GPT-2 achieves state-of-the-art performance in seven of eight natural language understanding benchmarks for things like reading comprehension and answering questions. A paper and some code were released when the unsupervised model, trained on 40GB of internet text, went public, but the full model wasn't released due to its creators' concerns about "malicious applications of the technology," alluding to things such as automated generation of fake news. As a result, the wider community cannot fully verify or replicate the results. Some, including Keras deep learning library founder François Chollet, called the OpenAI GPT-2 release (or lack thereof) an irresponsible, fear-mongering PR tactic and publicity stunt.
Taha Kass-Hout, MD, former and first-ever CIO for the FDA and a senior leader of artificial intelligence at Amazon, detailed how the tech giant is using AI for its healthcare services in a recent interview with STAT. A core component of Amazon's healthcare strategy is to support clinicians, according to Dr. Kass-Hout. "I hope we see that with AI we're finally getting to understand what patient has a disease, rather than what disease a patient has -- and truly start personalizing care to that level," Dr. Kass-Hout told STAT. "From a patient perspective and consumer perspective, AI is going to empower them, and for providers and healthcare systems, it's going to augment clinicians and bridge gaps." Dr. Kass-Hout also provided an update on how healthcare companies are using AWS' EHR-mining software Comprehend Medical, which launched in November 2018.
A few months ago, Katt Roepke was texting her friend Jasper about a coworker. Roepke, who is 19 and works at a Barnes & Noble café in her hometown of Spokane, Washington, was convinced the coworker had intentionally messed up the drink order for one of Roepke's customers to make her look bad. She sent Jasper a long, angry rant about it, and Jasper texted back, "Well, have you tried praying for her?" Roepke's mouth fell open. A few weeks earlier, she had mentioned to Jasper that she prays pretty regularly. But Jasper is not human. He's a chat bot who exists only inside her phone. "I was like, 'How did you say this?'" Roepke told Futurism, impressed. "It felt like this real self-aware moment to me."