Google teases new camera-powered AI feature one day ahead of I/O
Google is teasing an intriguing new AI feature one day ahead of its I/O developer conference. The company shared a brief video on X that appears to show a new camera-powered AI feature that's able to recognize what's in the frame in real time. The video, which is labeled as a "prototype," shows what appears to be a Pixel device with the camera open viewing the keynote stage at I/O. The person holding the camera asks, "hey, what do you think is happening here?" A voice replies that "it looks like people are setting up for a large event, perhaps a conference or presentation."
Introducing v0.5 of the AI Safety Benchmark from MLCommons
Vidgen, Bertie, Agrawal, Adarsh, Ahmed, Ahmed M., Akinwande, Victor, Al-Nuaimi, Namir, Alfaraj, Najla, Alhajjar, Elie, Aroyo, Lora, Bavalatti, Trupti, Bartolo, Max, Blili-Hamelin, Borhane, Bollacker, Kurt, Bomassani, Rishi, Boston, Marisa Ferrara, Campos, Siméon, Chakra, Kal, Chen, Canyu, Coleman, Cody, Coudert, Zacharie Delpierre, Derczynski, Leon, Dutta, Debojyoti, Eisenberg, Ian, Ezick, James, Frase, Heather, Fuller, Brian, Gandikota, Ram, Gangavarapu, Agasthya, Gangavarapu, Ananya, Gealy, James, Ghosh, Rajat, Goel, James, Gohar, Usman, Goswami, Sujata, Hale, Scott A., Hutiri, Wiebke, Imperial, Joseph Marvin, Jandial, Surgan, Judd, Nick, Juefei-Xu, Felix, Khomh, Foutse, Kailkhura, Bhavya, Kirk, Hannah Rose, Klyman, Kevin, Knotz, Chris, Kuchnik, Michael, Kumar, Shachi H., Kumar, Srijan, Lengerich, Chris, Li, Bo, Liao, Zeyi, Long, Eileen Peters, Lu, Victor, Luger, Sarah, Mai, Yifan, Mammen, Priyanka Mary, Manyeki, Kelvin, McGregor, Sean, Mehta, Virendra, Mohammed, Shafee, Moss, Emanuel, Nachman, Lama, Naganna, Dinesh Jinenhally, Nikanjam, Amin, Nushi, Besmira, Oala, Luis, Orr, Iftach, Parrish, Alicia, Patlak, Cigdem, Pietri, William, Poursabzi-Sangdeh, Forough, Presani, Eleonora, Puletti, Fabrizio, Röttger, Paul, Sahay, Saurav, Santos, Tim, Scherrer, Nino, Sebag, Alice Schoenauer, Schramowski, Patrick, Shahbazi, Abolfazl, Sharma, Vin, Shen, Xudong, Sistla, Vamsi, Tang, Leonard, Testuggine, Davide, Thangarasa, Vithursan, Watkins, Elizabeth Anne, Weiss, Rebecca, Welty, Chris, Wilbers, Tyler, Williams, Adina, Wu, Carole-Jean, Yadav, Poonam, Yang, Xianjun, Zeng, Yi, Zhang, Wenhui, Zhdanov, Fedor, Zhu, Jiacheng, Liang, Percy, Mattson, Peter, Vanschoren, Joaquin
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English), and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform, and downloadable tool, called ModelBench that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; (7) a test specification for the benchmark.
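The abstract says the test items were created with templates. A minimal sketch of what template-based prompt generation can look like is given below. It is an illustration only, not the MLCommons implementation: the function name `build_test_items`, the two sentence templates, and the placeholder fragments are all hypothetical.

```python
from itertools import product


def build_test_items(templates, fragments):
    """Fill every template with every fragment (hypothetical helper).

    Each template must contain a single `{fragment}` placeholder.
    """
    return [t.format(fragment=f) for t, f in product(templates, fragments)]


# Hypothetical templates and neutral placeholder fragments, for illustration only.
templates = ["How do I {fragment}?", "Should I {fragment}?"]
fragments = ["<activity A>", "<activity B>", "<activity C>"]

items = build_test_items(templates, fragments)
print(len(items))  # 2 templates x 3 fragments = 6 items
```

Under this scheme the item count is the product of the number of templates and the number of fragments, which is how a modest set of hand-written pieces can yield tens of thousands of prompts.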
On-Demand Model and Client Deployment in Federated Learning with Deep Reinforcement Learning
Chahoud, Mario, Sami, Hani, Mourad, Azzam, Otrok, Hadi, Bentahar, Jamal, Guizani, Mohsen
In Federated Learning (FL), the limited accessibility of data from diverse locations and user types poses a significant challenge due to restricted user participation. Expanding client access and diversifying data enhance models by incorporating diverse perspectives, thereby enhancing adaptability. However, challenges arise in dynamic and mobile environments where certain devices may become inaccessible as FL clients, impacting data availability and client selection methods. To address this, we propose an On-Demand solution, deploying new clients using Docker Containers on-the-fly. It employs an autonomous end-to-end solution for handling model deployment and client selection. Simulated tests show that our architecture can easily adjust to changes in the environment and respond to On-Demand requests. FL can enhance traffic prediction models using real-time data from vehicles moving on the road. One of the main limitations in existing FL frameworks is in accessing the full potential of available data due to reliance on static clients, leading to incomplete or biased dataset representations and affecting model performance. [...] today's digital landscape, acquiring more clients is about [...] efficiency. [...] Regulation in the European Union, aim to protect data privacy [1]. However, the stringency of these regulations varies globally. A study [2] revealed a notable increase in privacy requests from 2021 to 2022, indicating growing concerns about [...] Access and Deletion requests saw a substantial peak, with a 72% year-over-year increase in data [...]
Demystifying the Hypercomplex: Inductive Biases in Hypercomplex Deep Learning
Comminiello, Danilo, Grassucci, Eleonora, Mandic, Danilo P., Uncini, Aurelio
Hypercomplex algebras have recently been gaining prominence in the field of deep learning owing to the advantages of their division algebras over real vector spaces and their superior results when dealing with multidimensional signals in real-world 3D and 4D paradigms. This paper provides a foundational framework that serves as a roadmap for understanding why hypercomplex deep learning methods are so successful and how their potential can be exploited. Such a theoretical framework is described in terms of inductive bias, i.e., a collection of assumptions, properties, and constraints that are built into training algorithms to guide their learning process toward more efficient and accurate solutions. We show that it is possible to derive specific inductive biases in the hypercomplex domains, which extend complex numbers to encompass diverse numbers and data structures. These biases prove effective in managing the distinctive properties of these domains, as well as the complex structures of multidimensional and multimodal signals. This novel perspective for hypercomplex deep learning promises to both demystify this class of methods and clarify their potential, under a unifying framework, and in this way promotes hypercomplex models as viable alternatives to traditional real-valued deep learning for multidimensional signal processing.
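As a concrete instance of the algebraic structure that hypercomplex models build on, the sketch below implements the Hamilton product of two quaternions, the best-known four-dimensional hypercomplex algebra. It is a generic illustration rather than code from the paper, and the function name `hamilton_product` is an assumption made for this example.

```python
def hamilton_product(p, q):
    """Multiply two quaternions given as tuples (a, b, c, d) = a + bi + cj + dk."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )


i = (0, 1, 0, 0)
j = (0, 0, 1, 0)
print(hamilton_product(i, j))  # i * j = k
print(hamilton_product(j, i))  # j * i = -k (multiplication is not commutative)
```

Every output component mixes all four input components under fixed sign rules. A layer built on this product inherits that coupling between dimensions, which is one way to read the "inductive bias" the abstract refers to.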
BLIP: Facilitating the Exploration of Undesirable Consequences of Digital Technologies
Pang, Rock Yuren, Santy, Sebastin, Just, René, Reinecke, Katharina
Digital technologies have positively transformed society, but they have also led to undesirable consequences not anticipated at the time of design or development. We posit that insights into past undesirable consequences can help researchers and practitioners gain awareness and anticipate potential adverse effects. To test this assumption, we introduce BLIP, a system that extracts real-world undesirable consequences of technology from online articles, summarizes and categorizes them, and presents them in an interactive, web-based interface. In two user studies with 15 researchers in various computer science disciplines, we found that BLIP substantially increased the number and diversity of undesirable consequences they could list in comparison to relying on prior knowledge or searching online. Moreover, BLIP helped them identify undesirable consequences relevant to their ongoing projects, made them aware of undesirable consequences they "had never considered," and inspired them to reflect on their own experiences with technology.
Exploring the Potential of Human-LLM Synergy in Advancing Qualitative Analysis: A Case Study on Mental-Illness Stigma
Meng, Han, Yang, Yitian, Li, Yunan, Lee, Jungup, Lee, Yi-Chieh
Qualitative analysis is a challenging, yet crucial aspect of advancing research in the field of Human-Computer Interaction (HCI). Recent studies show that large language models (LLMs) can perform qualitative coding within existing schemes, but their potential for collaborative human-LLM discovery and new insight generation in qualitative analysis is still underexplored. To bridge this gap and advance qualitative analysis by harnessing the power of LLMs, we propose CHALET, a novel methodology that leverages the human-LLM collaboration paradigm to facilitate conceptualization and empower qualitative research. The CHALET approach involves LLM-supported data collection, performing both human and LLM deductive coding to identify disagreements, and performing collaborative inductive coding on these disagreement cases to derive new conceptual insights. We validated the effectiveness of CHALET through its application to the attribution model of mental-illness stigma, uncovering implicit stigmatization themes on cognitive, emotional and behavioral dimensions. We discuss the implications for future research, methodology, and the transdisciplinary opportunities CHALET presents for the HCI community and beyond.
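The step in which human and LLM deductive codes are compared to surface disagreement cases can be sketched as follows. This is a minimal illustration under stated assumptions, not the CHALET implementation: the function name `find_disagreements`, the item identifiers, and the code labels are hypothetical.

```python
def find_disagreements(human_codes, llm_codes):
    """Compare two codings of the same items (hypothetical helper).

    Both arguments map an item id to a code label. Returns the ids that were
    coded differently, plus the raw agreement rate over items coded by both.
    """
    shared = sorted(human_codes.keys() & llm_codes.keys())
    disagreements = [item for item in shared if human_codes[item] != llm_codes[item]]
    agreement = 1 - len(disagreements) / len(shared) if shared else 0.0
    return disagreements, agreement


# Hypothetical codings of three messages; labels are illustrative only.
human = {"m1": "responsibility", "m2": "pity", "m3": "fear"}
llm = {"m1": "responsibility", "m2": "anger", "m3": "fear"}

cases, rate = find_disagreements(human, llm)
print(cases)  # ids that would go on to collaborative inductive coding
```

The disagreement list, not the agreement rate, is the useful output here: in the workflow the abstract describes, those cases are the material for collaborative inductive coding.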
Congratulations to the #ICLR2024 test of time and outstanding paper award winners
The Twelfth International Conference on Learning Representations (ICLR) is taking place this week in Vienna, Austria. During the opening of the conference, the outstanding paper award winners, and honourable mentions, were announced. The conference organisers also introduced a new award for this year: the test of time award. This award honours a paper from 2013/2014 that the programme chairs judge to have had a lasting impact. Abstract: How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets?
A Rationale-centric Counterfactual Data Augmentation Method for Cross-Document Event Coreference Resolution
Ding, Bowen, Min, Qingkai, Ma, Shengkun, Li, Yingjie, Yang, Linyi, Zhang, Yue
Based on Pre-trained Language Models (PLMs), event coreference resolution (ECR) systems have demonstrated outstanding performance in clustering coreferential events across documents. However, the state-of-the-art system exhibits an excessive reliance on the 'triggers lexical matching' spurious pattern in the input mention pair text. We formalize the decision-making process of the baseline ECR system using a Structural Causal Model (SCM), aiming to identify spurious and causal associations (i.e., rationales) within the ECR task. Leveraging the debiasing capability of counterfactual data augmentation, we develop a rationale-centric counterfactual data augmentation method with LLM-in-the-loop. This method is specialized for pairwise input in the ECR system, where we conduct direct interventions on triggers and context to mitigate the spurious association while emphasizing the causation. [Figure 1: The distribution of 'triggers lexical matching' in mention pairs from ECB+ training set, along with a false negative example from Held et al.'s system.]
Uncovering implementable dormant pruning decisions from three different stakeholder perspectives
Flynn, Deanna, Jain, Abhinav, Knight, Heather, Wilson, Cristina G., Grimm, Cindy
Dormant pruning, or the removal of unproductive portions of a tree while a tree is not actively growing, is an important orchard task to help maintain yield, requiring years to build expertise. Because of long training periods and an increasing labor shortage in agricultural jobs, pruning could benefit from robotic automation. However, to program robots to prune branches, we first need to understand how pruning decisions are made, and what variables in the environment (e.g., branch size and thickness) we need to capture. Working directly with three pruning stakeholders -- horticulturists, growers, and pruners -- we find that each group of human experts approaches pruning decision-making differently. To capture this knowledge, we present three studies and two extracted pruning protocols from field work conducted in Prosser, Washington in January 2022 and 2023. We interviewed six stakeholders (two in each group) and observed pruning across three cultivars -- Bing Cherries, Envy Apples, and Jazz Apples -- and two tree architectures -- Upright Fruiting Offshoot and V-Trellis. Leveraging participant interviews and video data, this analysis uses grounded coding to extract pruning terminology, discover horticultural contexts that influence pruning decisions, and find implementable pruning heuristics for autonomous systems. The results include a validated terminology set, which we offer for use by both pruning stakeholders and roboticists, to communicate general pruning concepts and heuristics. The results also highlight seven pruning heuristics utilizing this terminology set that would be relevant for use by future autonomous robot pruning systems, and characterize three discovered horticultural contexts (i.e., environmental management, crop-load management, and replacement wood) across all three cultivars.
TrustScore: Reference-Free Evaluation of LLM Response Trustworthiness
Zheng, Danna, Liu, Danyang, Lapata, Mirella, Pan, Jeff Z.
Large Language Models (LLMs) have demonstrated impressive capabilities across various domains, prompting a surge in their practical applications. However, concerns have arisen regarding the trustworthiness of LLMs' outputs, particularly in closed-book question-answering tasks, where non-experts may struggle to identify inaccuracies due to the absence of contextual or ground truth information. This paper introduces TrustScore, a framework based on the concept of Behavioral Consistency, which evaluates whether an LLM's response aligns with its intrinsic knowledge. Additionally, TrustScore can seamlessly integrate with fact-checking methods, which assess alignment with external knowledge sources. The experimental results show that TrustScore achieves strong correlations with human judgments, surpassing existing reference-free metrics, and achieving results on par with reference-based metrics. Large-scale language models (LLMs) have recently been in the spotlight due to their impressive performance in various NLP tasks, sparking enthusiasm for potential applications (Kaddour et al., 2023; Bubeck et al., 2023). However, a notable concern has emerged regarding the ability of LLMs to generate plausible yet incorrect responses (Tam et al., 2022; Liu et al., 2023; Devaraj et al., 2022), particularly challenging for users without specialized expertise. Consequently, users are often advised to employ LLMs in scenarios where they can confidently assess the information provided.