Media
AndroidGen: Building an Android Language Agent under Data Scarcity
Lai, Hanyu, Gao, Junjie, Liu, Xiao, Xu, Yifan, Zhang, Shudan, Dong, Yuxiao, Tang, Jie
Large language models have opened up a world of possibilities for various NLP tasks, sparking optimism for the future. Despite their potential, LLMs have yet to be widely used as agents on real mobile devices. The main challenge is the need for high-quality data sources. Time constraints and labor intensity often hinder human annotation. On the other hand, existing LLMs exhibit inadequate completion rates and need a robust data filtration strategy. Given these challenges, we develop a framework called AndroidGen to enhance the capabilities of LLM-based agents under data scarcity. In addition, we leverage AndroidGen to collect trajectories given human tasks and train open-source LLMs on these trajectories to develop an open-source mobile agent without manually labeled trajectories. We extensively evaluate AndroidGen with AndroidWorld, AitW, and various popular applications, demonstrating its improvements and revealing potential areas for future improvement. Code, model, and data are available at https://github.com/THUDM/AndroidGen.
Balancing Creativity and Automation: The Influence of AI on Modern Film Production and Dissemination
The integration of Artificial Intelligence (AI) into film production has revolutionized efficiency and creativity, yet it simultaneously raises critical ethical and practical challenges. This study explores the dual impact of AI on modern cinema through three objectives: defining the optimal human-AI relationship, balancing creativity with automation, and developing ethical guidelines. By employing a mixed-method approach combining theoretical frameworks (auteur theory, human-technology relations) and case studies (The Safe Zone, Fast & Furious 7, The Brutalist), the research reveals that positioning AI as an "embodiment tool" rather than an independent "alterity partner" preserves human authorship and artistic integrity. Key findings highlight the risks of surveillance capitalism in AI-driven markets and the ethical dilemmas of deepfake technology. The study concludes with actionable recommendations, including international regulatory frameworks and a Human Control Index (HCI) to quantify AI involvement. These insights aim to guide filmmakers, policymakers, and scholars in navigating the evolving AI-cinema landscape while safeguarding cultural diversity and ethical standards.
CapsFake: A Multimodal Capsule Network for Detecting Instruction-Guided Deepfakes
Nguyen, Tuan, Khan, Naseem, Khalil, Issa
Unlike traditional text-to-image generation, where the entire image is synthesized from scratch, instruction-guided editing targets real images and modifies specific semantic attributes (such as object identity, background context, or visual style) while preserving global visual coherence. These manipulations are particularly concerning from a cybersecurity standpoint because they maintain the illusion of authenticity while enabling adversaries to alter identity, fabricate visual evidence, or inject misinformation into trusted media pipelines. As illustrated in Figure 2, the instruction-guided image editing pipeline comprises three key AI components, each playing a distinct role in enabling semantically precise and visually coherent manipulations. First, an image translation model is used to convert the raw source image into a descriptive textual caption that semantically captures its visual content. This step, commonly implemented with models like CLIP [22] or BLIP-2 [23], provides a language-based anchor that enables subsequent manipulation. For example, a facial image may be described as "a girl wearing a blue and white striped shirt", forming the basis for meaningful transformation prompts.
Figure 2: Malicious Image Manipulation Pipeline. A threat actor uses generative AI tools to manipulate specific elements of an image, leveraging image translation and understanding models to guide semantic edits. These capabilities facilitate identity obfuscation, impersonation, and disinformation.
Generative AI for Character Animation: A Comprehensive Survey of Techniques, Applications, and Future Directions
Abootorabi, Mohammad Mahdi, Ghahroodi, Omid, Zahraei, Pardis Sadat, Behzadasl, Hossein, Mirrokni, Alireza, Salimipanah, Mobina, Rasouli, Arash, Behzadipour, Bahar, Azarnoush, Sara, Maleki, Benyamin, Sadraiye, Erfan, Feriz, Kiarash Kiani, Nahad, Mahdi Teymouri, Moghadasi, Ali, Abianeh, Abolfazl Eshagh, Nazar, Nizi, Rabiee, Hamid R., Baghshah, Mahdieh Soleymani, Ahmadi, Meisam, Asgari, Ehsaneddin
Generative AI is reshaping art, gaming, and most notably animation. Recent breakthroughs in foundation and diffusion models have reduced the time and cost of producing animated content. Characters are central animation components, involving motion, emotions, gestures, and facial expressions. The pace and breadth of advances in recent months make it difficult to maintain a coherent view of the field, motivating the need for an integrative review. Unlike earlier overviews that treat avatars, gestures, or facial animation in isolation, this survey offers a single, comprehensive perspective on all the main generative AI applications for character animation. We begin by examining the state-of-the-art in facial animation, expression rendering, image synthesis, avatar creation, gesture modeling, motion synthesis, object generation, and texture synthesis. We highlight leading research, practical deployments, commonly used datasets, and emerging trends for each area. To support newcomers, we also provide a comprehensive background section that introduces foundational models and evaluation metrics, equipping readers with the knowledge needed to enter the field. We discuss open challenges and map future research directions, providing a roadmap to advance AI-driven character-animation technologies. This survey is intended as a resource for researchers and developers entering the field of generative AI animation or adjacent fields. Resources are available at: https://github.com/llm-lab-org/Generative-AI-for-Character-Animation-Survey.
REED-VAE: RE-Encode Decode Training for Iterative Image Editing with Diffusion Models
Almog, Gal, Shamir, Ariel, Fried, Ohad
While latent diffusion models achieve impressive image editing results, their application to iterative editing of the same image is severely restricted. When trying to apply consecutive edit operations using current models, they accumulate artifacts and noise due to repeated transitions between pixel and latent spaces. Some methods have attempted to address this limitation by performing the entire edit chain within the latent space, sacrificing flexibility by supporting only a limited, predetermined set of diffusion editing operations. We present a RE-encode decode (REED) training scheme for variational autoencoders (VAEs), which promotes image quality preservation even after many iterations. Our work enables multi-method iterative image editing: users can perform a variety of iterative edit operations, with each operation building on the output of the previous one using both diffusion-based operations and conventional editing techniques. We demonstrate the advantage of REED-VAE across a range of image editing scenarios, including text-based and mask-based editing frameworks. In addition, we show how REED-VAE enhances the overall editability of images, increasing the likelihood of successful and precise edit operations. We hope that this work will serve as a benchmark for the newly introduced task of multi-method image editing. Our code and models will be available at https://github.com/galmog/REED-VAE
Unsupervised outlier detection to improve bird audio dataset labels
The Xeno-Canto bird audio repository is an invaluable resource for those interested in vocalizations and other sounds made by birds around the world. This is particularly the case for machine learning researchers attempting to improve on the bird species recognition accuracy of classification models. However, the task of extracting labeled datasets from the recordings found in this crowd-sourced repository faces several challenges. One challenge of particular significance to machine learning practitioners is that one bird species label is applied to each audio recording, but frequently other sounds are also captured, including other bird species, other animal sounds, and anthropogenic and other ambient sounds. These non-target sounds can result in dataset labeling discrepancies referred to as label noise. In this work we present a cleaning process consisting of audio preprocessing followed by dimensionality reduction and unsupervised outlier detection (UOD) to reduce the label noise in a dataset derived from Xeno-Canto recordings. We investigate three neural network dimensionality reduction techniques: two flavors of convolutional autoencoders and variational deep embedding (VaDE; Jiang, 2017). While both methods show some degree of effectiveness at detecting outliers for most bird species datasets, we found significant variation in the performance of the methods from one species to the next. We believe that the results of this investigation demonstrate that the application of our cleaning process can meaningfully reduce the label noise of bird species datasets derived from the Xeno-Canto audio repository, but results vary across species.
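The outlier-detection stage described above can be illustrated with a minimal sketch. It assumes that each recording has already been reduced to a low-dimensional embedding (the abstract uses convolutional autoencoders and VaDE for that step) and swaps in a simple stand-in detector: distance to a per-dimension median centroid, thresholded with a robust median/MAD score. The abstract does not specify the detector the authors use, so the function name `flag_outliers`, the threshold of 3.5, and the toy embeddings are all assumptions made for illustration.

```python
import math
from statistics import median


def flag_outliers(embeddings, threshold=3.5):
    """Return indices of embeddings that lie unusually far from the centroid.

    Stand-in detector (an assumption, not the paper's method): a one-sided
    robust z-score on each embedding's distance to the per-dimension median.
    """
    if not embeddings:
        return []
    dim = len(embeddings[0])
    # Per-dimension median is less sensitive to the outliers we want to find.
    centroid = [median(vec[d] for vec in embeddings) for d in range(dim)]
    dists = [math.dist(vec, centroid) for vec in embeddings]
    med = median(dists)
    mad = median(abs(d - med) for d in dists)
    if mad == 0:
        # All distances are (nearly) identical, so nothing stands out.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when distances are roughly normally distributed.
    return [i for i, d in enumerate(dists)
            if 0.6745 * (d - med) / mad > threshold]


# Toy embeddings for one species: five similar clips and one stray clip.
clips = [[0.0, 0.1], [0.1, 0.0], [-0.1, 0.05],
         [0.05, -0.1], [0.0, 0.0], [5.0, 5.0]]
print(flag_outliers(clips))  # [5]
```

Flagged recordings would then be reviewed or dropped before training a classifier. Because the score is computed per species dataset, the same threshold can behave very differently from one species to the next, which is consistent with the variation the abstract reports.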
Speaker Diarization for Low-Resource Languages Through Wav2vec Fine-Tuning
Abdullah, Abdulhady Abas, Karim, Sarkhel H. Taher, Ahmed, Sara Azad, Tariq, Kanar R., Rashid, Tarik A.
Speaker diarization, a core problem in speech processing, entails partitioning a given audio stream according to the speakers. Although progress has been made on models for high-resource languages, low-resource languages such as Kurdish pose specific difficulties: very few annotated datasets are available, the language has several dialects, and speakers frequently code-switch. This study addresses these challenges by training the Wav2Vec 2.0 self-supervised learning (SSL) model on a Kurdish dataset prepared for this purpose. Thanks to transfer learning, it was possible to transfer multilingual representations learnt in other languages to the phonetic and acoustic features of Kurdish speech. The overall Diarization Error Rate (DER) was reduced by 7.2%, and cluster purity increased by 13%, compared to the baseline algorithm. These results show that adapting a state-of-the-art model can improve performance on under-resourced languages. Applications of this work include transcription services for Kurdish-language media programs, as well as speaker segmentation in multilingual call centers, teleconferencing, and videoconferencing systems. This work therefore demonstrates that self-supervised and transfer-learning techniques can improve speaker diarization for Kurdish and other low-resource languages with diverse features. The approach provides a base for building effective diarization systems in other understudied languages, which remains essential for equity in speech technology.
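For readers unfamiliar with the two metrics reported above, here is a minimal sketch of how DER and cluster purity are commonly computed. It is not the authors' evaluation code. It makes several simplifying assumptions: audio is represented as one label per fixed-length frame, `None` marks non-speech, there is no overlapping speech, and hypothesis labels have already been mapped to reference speakers for the DER calculation. The function names and toy labels are hypothetical.

```python
from collections import Counter

SILENCE = None  # frame with no active speaker


def diarization_error_rate(reference, hypothesis):
    """Frame-level DER: (missed + false alarm + confusion) / reference speech."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    missed = false_alarm = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not SILENCE:
            speech += 1
            if hyp is SILENCE:
                missed += 1        # speech present, system says silence
            elif hyp != ref:
                confusion += 1     # speech attributed to the wrong speaker
        elif hyp is not SILENCE:
            false_alarm += 1       # system reports speech during silence
    if speech == 0:
        raise ValueError("reference contains no speech frames")
    return (missed + false_alarm + confusion) / speech


def cluster_purity(reference, clusters):
    """Share of frames that belong to their cluster's majority speaker."""
    per_cluster = {}
    for ref, cluster in zip(reference, clusters):
        if ref is SILENCE or cluster is SILENCE:
            continue  # purity is measured on frames labeled in both
        per_cluster.setdefault(cluster, Counter())[ref] += 1
    total = sum(sum(counts.values()) for counts in per_cluster.values())
    if total == 0:
        raise ValueError("no frames are labeled in both sequences")
    return sum(max(counts.values()) for counts in per_cluster.values()) / total


# Toy example: 8 frames, two speakers.
reference = ["A", "A", "A", "B", "B", None, "B", "A"]
hypothesis = ["A", "A", "B", "B", "B", "B", None, "A"]
print(diarization_error_rate(reference, hypothesis))
print(cluster_purity(reference, hypothesis))
```

In the toy example one frame is missed, one is a false alarm, and one is confused, against seven reference speech frames. Standard toolkits additionally handle overlapping speech, scoring collars, and the optimal mapping between hypothesis and reference speakers, all of which this sketch omits.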
Multi-view autoencoders for Fake News Detection
Pereira, Ingryd V. S. T., Cavalcanti, George D. C., Cruz, Rafael M. O.
Given the volume and speed at which fake news spreads across social media, automatic fake news detection has become a highly important task. However, this task presents several challenges, including extracting textual features that contain relevant information about fake news. Research about fake news detection shows that no single feature extraction technique consistently outperforms the others across all scenarios. Nevertheless, different feature extraction techniques can provide complementary information about the textual data and enable a more comprehensive representation of the content. This paper proposes using multi-view autoencoders to generate a joint feature representation for fake news detection by integrating several feature extraction techniques commonly used in the literature. Experiments on fake news datasets show a significant improvement in classification performance compared to individual views (feature representations). We also observed that selecting a subset of the views instead of composing a latent space with all the views can be advantageous in terms of accuracy and computational effort. For further details, including source codes, figures, and datasets, please refer to the project's repository: https://github.com/ingrydpereira/multiview-fake-news.
Application and Optimization of Large Models Based on Prompt Tuning for Fact-Check-Worthiness Estimation
Yu, Yinglong, Shen, Hao, Lyu, Zhengyi, He, Qi
Communication University of China, Beijing, China (yuyingling@cuc.edu.cn, shenhao@cuc.edu.cn, lyuzhengyi@cuc.edu.cn, heqi654321@126.com)
Abstract: In response to the growing problem of misinformation in the context of globalization and informatization, this paper proposes a classification method for fact-check-worthiness estimation based on prompt tuning. We construct a model for fact-check-worthiness estimation at the methodological level using prompt tuning. By applying designed prompt templates to large language models, we establish in-context learning and leverage prompt tuning to improve the accuracy of determining whether claims are worth fact-checking, particularly when dealing with limited or unlabeled data. Through extensive experiments on public datasets, we demonstrate that the proposed method surpasses or matches multiple baseline methods in the classification task of fact-check-worthiness estimation, including classical pre-trained models such as BERT, as well as recent popular large models like GPT-3.5 and GPT-4. Experiments show that the prompt tuning-based method proposed in this study exhibits certain advantages in evaluation metrics such as F1 score and accuracy, thereby validating its effectiveness in the task of fact-check-worthiness estimation.
Introduction: In today's interconnected world characterized by globalization and informatization, the complexity of multilingual environments and the challenges posed by misinformation have become increasingly severe. With the deepening of international exchanges and the expanding influence of social media, rumors and false information spread rapidly across cyberspace, exacerbating the uncertainty in the global public discourse.
AI Ethics and Social Norms: Exploring ChatGPT's Capabilities From What to How
Veisi, Omid, Bahrami, Sasan, Englert, Roman, Müller, Claudia
Using LLMs in healthcare, Computer-Supported Cooperative Work, and Social Computing requires the examination of ethical and social norms to ensure safe incorporation into human life. We conducted a mixed-method study, including an online survey with 111 participants and an interview study with 38 experts, to investigate AI ethics and social norms in ChatGPT as an everyday tool. This study aims to evaluate whether ChatGPT, in an empirical context, operates in accordance with ethical and social norms, which is critical for understanding actions in industrial and academic research and achieving machine ethics. The findings of this study provide initial insights into six important aspects of AI ethics, including bias, trustworthiness, security, toxicology, social norms, and ethical data. Significant obstacles related to transparency and bias in unsupervised data collection methods are identified as ChatGPT's ethical concerns.