Generative AI
SoftBank's Son envisions AI running households in the next few years
SoftBank Group founder Masayoshi Son sketched out one of the most aggressive timelines for the adoption of artificial intelligence yet, envisioning a near future where the technology would run entire households. AI will soon be able to monitor the health of family members, call the doctor when needed, do grocery shopping, make reservations, judge optimal investments and tutor young children, Son said in a speech at an annual forum for enterprise clients on Thursday. He moved up his expectation for when artificial general intelligence -- the long-term goal for developers from OpenAI to Meta Platforms and Alphabet's Google -- would arrive to within the next two to three years.
Uncovering Regional Defaults from Photorealistic Forests in Text-to-Image Generation with DALL-E 2
Liu, Zilong, Janowicz, Krzysztof, Currier, Kitty, Shi, Meilin
Regional defaults describe the emerging phenomenon that text-to-image (T2I) foundation models used in generative AI are prone to over-proportionally depicting certain geographic regions to the exclusion of others. In this work, we introduce a scalable evaluation for uncovering such regional defaults. The evaluation consists of region hierarchy--based image generation and cross-level similarity comparisons. We carry out an experiment by prompting DALL-E 2, a state-of-the-art T2I generation model capable of generating photorealistic images, to depict a forest. We select forest as an object class that displays regional variation and can be characterized using spatial statistics. For a region in the hierarchy, our experiment reveals the regional defaults implicit in DALL-E 2, along with their scale-dependent nature and spatial relationships. In addition, we discover that the implicit defaults do not necessarily correspond to the most widely forested regions in reality. Our findings underscore a need for further investigation into the geography of T2I generation and other forms of generative AI.
Tutor CoPilot: A Human-AI Approach for Scaling Real-Time Expertise
Wang, Rose E., Ribeiro, Ana T., Robinson, Carly D., Loeb, Susanna, Demszky, Dora
Generative AI, particularly Language Models (LMs), has the potential to transform real-world domains with societal impact, particularly where access to experts is limited. For example, in education, training novice educators with expert guidance is important for effectiveness but expensive, creating significant barriers to improving education quality at scale. This challenge disproportionately harms students from under-served communities, who stand to gain the most from high-quality education. We introduce Tutor CoPilot, a novel Human-AI approach that leverages a model of expert thinking to provide expert-like guidance to tutors as they tutor. This study is the first randomized controlled trial of a Human-AI system in live tutoring, involving 900 tutors and 1,800 K-12 students from historically under-served communities. Following a preregistered analysis plan, we find that students working with tutors that have access to Tutor CoPilot are 4 percentage points (p.p.) more likely to master topics (p<0.01). Notably, students of lower-rated tutors experienced the greatest benefit, improving mastery by 9 p.p. We find that Tutor CoPilot costs only $20 per-tutor annually. We analyze 550,000+ messages using classifiers to identify pedagogical strategies, and find that tutors with access to Tutor CoPilot are more likely to use high-quality strategies to foster student understanding (e.g., asking guiding questions) and less likely to give away the answer to the student. Tutor interviews highlight how Tutor CoPilot's guidance helps tutors to respond to student needs, though they flag issues in Tutor CoPilot, such as generating suggestions that are not grade-level appropriate. Altogether, our study of Tutor CoPilot demonstrates how Human-AI systems can scale expertise in real-world domains, bridge gaps in skills and create a future where high-quality education is accessible to all students.
Transforming Teachers' Roles and Agencies in the Era of Generative AI: Perceptions, Acceptance, Knowledge, and Practices
This paper explores the transformative impact of Generative Artificial Intelligence (GenAI) on teachers' roles and agencies in education, presenting a comprehensive framework that addresses teachers' perceptions, knowledge, acceptance, and practices of GenAI. As GenAI technologies, such as ChatGPT, become increasingly integrated into educational settings, teachers are required to adapt to evolving classroom dynamics, where AI plays a significant role in content creation, personalized learning, and student engagement. However, existing literature often treats these factors in isolation, overlooking how they collectively influence teachers' ability to effectively integrate GenAI into their pedagogical practices. This paper fills this gap by proposing a framework that categorizes teachers into four roles -- Observer, Adopter, Collaborator, and Innovator -- each representing different levels of GenAI engagement, outlining teachers' agencies in GenAI classrooms. By highlighting the need for continuous professional development and institutional support, we demonstrate how teachers can evolve from basic GenAI users to co-creators of knowledge alongside GenAI systems. The findings emphasize that for GenAI to reach its full educational potential, teachers must not only accept and understand its capabilities but also integrate it deeply into their teaching strategies. This study contributes to the growing literature on GenAI in education, offering practical implications for supporting teachers in navigating the complexities of GenAI adoption.
When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1
McCoy, R. Thomas, Yao, Shunyu, Friedman, Dan, Hardy, Mathew D., Griffiths, Thomas L.
In "Embers of Autoregression" (McCoy et al., 2023), we showed that several large language models (LLMs) have some important limitations that are attributable to their origins in next-word prediction. Here we investigate whether these issues persist with o1, a new system from OpenAI that differs from previous LLMs in that it is optimized for reasoning. We find that o1 substantially outperforms previous LLMs in many cases, with particularly large improvements on rare variants of common tasks (e.g., forming acronyms from the second letter of each word in a list, rather than the first letter). Despite these quantitative improvements, however, o1 still displays the same qualitative trends that we observed in previous systems. Specifically, o1 -- like previous LLMs -- is sensitive to the probability of examples and tasks, performing better and requiring fewer "thinking tokens" in high-probability settings than in low-probability ones. These results show that optimizing a language model for reasoning can mitigate but might not fully overcome the language model's probability sensitivity.
Integrating Natural Language Prompting Tasks in Introductory Programming Courses
Kerslake, Chris, Denny, Paul, Smith, David H IV, Prather, James, Leinonen, Juho, Luxton-Reilly, Andrew, MacNeil, Stephen
Introductory programming courses often emphasize mastering syntax and basic constructs before progressing to more complex and interesting programs. This bottom-up approach can be frustrating for novices, shifting the focus away from problem solving and potentially making computing less appealing to a broad range of students. The rise of generative AI for code production could partially address these issues by fostering new skills via interaction with AI models, including constructing high-level prompts and evaluating code that is automatically generated. In this experience report, we explore the inclusion of two prompt-focused activities in an introductory course, implemented across four labs in a six-week module. The first requires students to solve computational problems by writing natural language prompts, emphasizing problem-solving over syntax. The second involves students crafting prompts to generate code equivalent to provided fragments, to foster an understanding of the relationship between prompts and code. Most of the students in the course had reported finding programming difficult to learn, often citing frustrations with syntax and debugging. We found that self-reported difficulty with learning programming had a strong inverse relationship with performance on traditional programming assessments such as tests and projects, as expected. However, performance on the natural language tasks was less strongly related to self-reported difficulty, suggesting they may target different skills. Learning how to communicate with AI coding models is becoming an important skill, and natural language prompting tasks may appeal to a broad range of students.
Fake It Until You Break It: On the Adversarial Robustness of AI-generated Image Detectors
Mavali, Sina, Ricker, Jonas, Pape, David, Sharma, Yash, Fischer, Asja, Schรถnherr, Lea
While generative AI (GenAI) offers countless possibilities for creative and productive tasks, artificially generated media can be misused for fraud, manipulation, scams, misinformation campaigns, and more. To mitigate the risks associated with maliciously generated media, forensic classifiers are employed to identify AI-generated content. However, current forensic classifiers are often not evaluated in practically relevant scenarios, such as the presence of an attacker or when real-world artifacts like social media degradations affect images. In this paper, we evaluate state-of-the-art AI-generated image (AIGI) detectors under different attack scenarios. We demonstrate that forensic classifiers can be effectively attacked in realistic settings, even when the attacker does not have access to the target model and post-processing occurs after the adversarial examples are created, which is standard on social media platforms. These attacks can significantly reduce detection accuracy to the extent that the risks of relying on detectors outweigh their benefits. Finally, we propose a simple defense mechanism to make CLIP-based detectors, which are currently the best-performing detectors, robust against these attacks.
LLaVA-Critic: Learning to Evaluate Multimodal Models
Xiong, Tianyi, Wang, Xiyao, Guo, Dong, Ye, Qinghao, Fan, Haoqi, Gu, Quanquan, Huang, Heng, Li, Chunyuan
We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator to assess performance across a wide range of multimodal tasks. LLaVA-Critic is trained using a high-quality critic instructionfollowing dataset that incorporates diverse evaluation criteria and scenarios. Our experiments demonstrate the model's effectiveness in two key areas: (i) LMMas-a-Judge, where LLaVA-Critic provides reliable evaluation scores, performing on par with or surpassing GPT models on multiple evaluation benchmarks; and (ii) Preference Learning, where it generates reward signals for preference learning, enhancing model alignment capabilities. This work underscores the potential of open-source LMMs in self-critique and evaluation, setting the stage for future research into scalable, superhuman alignment feedback mechanisms for LMMs. The ability of learning to evaluate is increasingly taking on a pivotal role in the development of modern large multimodal models (LMMs), as pre-training on existing web data reaches maturity and the focus is shifting towards post-training with AI-enhanced synthetic data, which shows growing potential. Reliable AI evaluation is essential, not only for offering scalable solutions to reduce human labor in complex task assessments, but also for generating effective reward signals in reinforcement learning and guiding inference-time search (Ouyang et al., 2022; OpenAI, 2024a; Snell et al., 2024). It remains unexplored to develop open LMMs to play the role of a judge and evaluate the performance of multimodal models. For instance, a model can follow a well-designed, itemized evaluation criterion to provide a score between 1 and 10 for rating different model responses in a visual chat task (Liu et al., 2023b). Along with the score, it would also offer the associated reasoning behind the evaluation, ensuring transparency and consistency in assessing model performance. In this paper, we present the first attempt to curate the instruction-following data particularly for evaluation, based on which we develop a LMM, LLaVA-Critic. Two primary scenarios/goals of building LLaVA-Critic are highlighted: Scenario 1: LMM-as-a-Judge. Open-source LMMs that can deliver reliable evaluation scores, comparable to or surpassing proprietary models like GPT-4V (OpenAI, 2023)/GPT-4o (OpenAI, 2024b). These models can serve as a free alternative to replace commercial GPT models in various evaluation benchmarks. This approach enhances preference alignment with AI-generated feedback. In summary, our contributions are as follows: Critic Instruction-Following Data: We present a high-quality dataset tailored to follow instructions in complex evaluation setting to provide quantitative judgment and the corresponding reasoning process.
Attention in Large Language Models Yields Efficient Zero-Shot Re-Rankers
Chen, Shijie, Gutiรฉrrez, Bernal Jimรฉnez, Su, Yu
Information retrieval (IR) systems have played a vital role in modern digital life and have cemented their continued usefulness in this new era of generative AI via retrieval-augmented generation. With strong language processing capabilities and remarkable versatility, large language models (LLMs) have become popular choices for zero-shot re-ranking in IR systems. So far, LLM-based re-ranking methods rely on strong generative capabilities, which restricts their use to either specialized or powerful proprietary models. Given these restrictions, we ask: is autoregressive generation necessary and optimal for LLMs to perform re-ranking? We hypothesize that there are abundant signals relevant to re-ranking within LLMs that might not be used to their full potential via generation. To more directly leverage such signals, we propose in-context re-ranking (ICR), a novel method that leverages the change in attention pattern caused by the search query for accurate and efficient re-ranking. To mitigate the intrinsic biases in LLMs, we propose a calibration method using a content-free query. Due to the absence of generation, ICR only requires two ($O(1)$) forward passes to re-rank $N$ documents, making it substantially more efficient than generative re-ranking methods that require at least $O(N)$ forward passes. Our novel design also enables ICR to be applied to any LLM without specialized training while guaranteeing a well-formed ranking. Extensive experiments with two popular open-weight LLMs on standard single-hop and multi-hop information retrieval benchmarks show that ICR outperforms RankGPT while cutting the latency by more than 60% in practice. Through detailed analyses, we show that ICR's performance is specially strong on tasks that require more complex re-ranking signals. Our findings call for further exploration on novel ways of utilizing open-weight LLMs beyond text generation.
Personalized Federated Learning for Generative AI-Assisted Semantic Communications
Peng, Yubo, Jiang, Feibo, Dong, Li, Wang, Kezhi, Yang, Kun
Semantic Communication (SC) focuses on transmitting only the semantic information rather than the raw data. This approach offers an efficient solution to the issue of spectrum resource utilization caused by the various intelligent applications on Mobile Users (MUs). Generative Artificial Intelligence (GAI) models have recently exhibited remarkable content generation and signal processing capabilities, presenting new opportunities for enhancing SC. Therefore, we propose a GAI-assisted SC (GSC) model deployed between MUs and the Base Station (BS). Then, to train the GSC model using the local data of MUs while ensuring privacy and accommodating heterogeneous requirements of MUs, we introduce Personalized Semantic Federated Learning (PSFL). This approach incorporates a novel Personalized Local Distillation (PLD) and Adaptive Global Pruning (AGP). In PLD, each MU selects a personalized GSC model as a mentor tailored to its local resources and a unified Convolutional Neural Networks (CNN)-based SC (CSC) model as a student. This mentor model is then distilled into the student model for global aggregation. In AGP, we perform network pruning on the aggregated global model according to real-time communication environments, reducing communication energy. Finally, numerical results demonstrate the feasibility and efficiency of the proposed PSFL scheme.