Generative AI
Large Generative Model-assisted Talking-face Semantic Communication System
Jiang, Feibo, Tu, Siwei, Dong, Li, Pan, Cunhua, Wang, Jiangzhou, You, Xiaohu
The rapid development of generative Artificial Intelligence (AI) continually unveils the potential of Semantic Communication (SemCom). However, current talking-face SemCom systems still encounter challenges such as low bandwidth utilization, semantic ambiguity, and diminished Quality of Experience (QoE). This study introduces a Large Generative Model-assisted Talking-face Semantic Communication (LGM-TSC) system tailored for talking-face video communication. Firstly, we introduce a Generative Semantic Extractor (GSE) at the transmitter, based on the FunASR model, to convert semantically sparse talking-face videos into texts with high information density. Secondly, we establish a private Knowledge Base (KB) based on the Large Language Model (LLM) for semantic disambiguation and correction, complemented by a joint knowledge base-semantic-channel coding scheme. Finally, at the receiver, we propose a Generative Semantic Reconstructor (GSR) that utilizes BERT-VITS2 and SadTalker models to transform text back into a high-QoE talking-face video matching the user's timbre. Simulation results demonstrate the feasibility and effectiveness of the proposed LGM-TSC system.
Understanding the Effects of Human-written Paraphrases in LLM-generated Text Detection
Lau, Hiu Ting, Zubiaga, Arkaitz
Natural Language Generation has been rapidly developing with the advent of large language models (LLMs). While their usage has sparked significant attention from the general public, it is important for readers to be aware when a piece of text is LLM-generated. This has brought about the need for building models that enable automated LLM-generated text detection, with the aim of mitigating potential negative outcomes of such content. Existing LLM-generated text detectors show competitive performance in telling apart LLM-generated and human-written text, but this performance is likely to deteriorate when paraphrased texts are considered. In this study, we devise a new data collection strategy to build the Human & LLM Paraphrase Collection (HLPC), a first-of-its-kind dataset that incorporates human-written texts and paraphrases, as well as LLM-generated texts and paraphrases. With the aim of understanding the effects of human-written paraphrases on the performance of state-of-the-art LLM-generated text detectors, namely OpenAI RoBERTa and watermark detectors, we perform classification experiments that incorporate human-written paraphrases, watermarked and non-watermarked LLM-generated documents from GPT and OPT, and LLM-generated paraphrases from DIPPER and BART. The results show that the inclusion of human-written paraphrases has a significant impact on LLM-generated text detector performance, improving TPR@1%FPR with a possible trade-off in AUROC and accuracy.
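The TPR@1%FPR metric named above is the true-positive rate obtained when the decision threshold is set so that at most 1% of human-written texts are wrongly flagged. A minimal sketch of that computation follows, assuming detector scores where a higher value means "more likely LLM-generated". The function name `tpr_at_fpr` and the toy scores are illustrative assumptions, not taken from the paper.

```python
def tpr_at_fpr(scores, labels, max_fpr=0.01):
    """Highest TPR over all thresholds whose FPR is <= max_fpr.

    scores: detector outputs, higher = more likely LLM-generated (assumption).
    labels: 1 for LLM-generated text, 0 for human-written text.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    best_tpr = 0.0  # a threshold above every score gives FPR = 0 and TPR = 0
    # Try each observed score as a threshold; predict positive when score >= threshold.
    for thr in sorted(set(scores)):
        fpr = sum(s >= thr for s in neg) / len(neg)
        if fpr <= max_fpr:
            tpr = sum(s >= thr for s in pos) / len(pos)
            best_tpr = max(best_tpr, tpr)
    return best_tpr


# Toy data: 100 human-written texts with scores 0..99, four LLM-generated texts.
human_scores = list(range(100))
llm_scores = [99.5, 150, 50, 200]
scores = human_scores + llm_scores
labels = [0] * len(human_scores) + [1] * len(llm_scores)

# One false positive out of 100 is allowed, so three of four LLM texts are caught.
result = tpr_at_fpr(scores, labels, max_fpr=0.01)  # 0.75
```

Unlike AUROC, which averages over all thresholds, this metric looks only at the low-false-positive region, which is why the two can move in opposite directions.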
Debiasing Synthetic Data Generated by Deep Generative Models
Decruyenaere, Alexander, Dehaene, Heidelinde, Rabaey, Paloma, Polet, Christiaan, Decruyenaere, Johan, Demeester, Thomas, Vansteelandt, Stijn
While synthetic data hold great promise for privacy protection, their statistical analysis poses significant challenges that necessitate innovative solutions. The use of deep generative models (DGMs) for synthetic data generation is known to induce considerable bias and imprecision into synthetic data analyses, compromising their inferential utility as opposed to original data analyses. This bias and uncertainty can be substantial enough to impede statistical convergence rates, even in seemingly straightforward analyses like mean calculation. The standard errors of such estimators then exhibit slower shrinkage with sample size than the typical 1 over root-$n$ rate. This complicates fundamental calculations like p-values and confidence intervals, with no straightforward remedy currently available. In response to these challenges, we propose a new strategy that targets synthetic data created by DGMs for specific data analyses. Drawing insights from debiased and targeted machine learning, our approach accounts for biases, enhances convergence rates, and facilitates the calculation of estimators with easily approximated large sample variances. We exemplify our proposal through a simulation study on toy data and two case studies on real-world data, highlighting the importance of tailoring DGMs for targeted data analysis. This debiasing strategy contributes to advancing the reliability and applicability of synthetic data in statistical inference.
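For orientation, the two rates contrasted above can be written out as follows. The symbols $\hat{\theta}_n$, $\tilde{\theta}_n$, $\sigma$, and the exponent $a$ are notation introduced here for illustration and are assumptions, not the paper's own; the second line is only one way to express "slower than root-$n$".

```latex
% Typical root-n behaviour: sample mean of n i.i.d. observations with variance \sigma^2
\mathrm{SE}(\hat{\theta}_n) = \frac{\sigma}{\sqrt{n}} = O\!\left(n^{-1/2}\right)

% Slower shrinkage, as described for estimators computed on DGM-generated synthetic data
\mathrm{SE}(\tilde{\theta}_n) = O\!\left(n^{-a}\right), \qquad 0 < a < \tfrac{1}{2}
```

Under the first rate, quadrupling the sample size halves the standard error; under the second, the same increase yields a smaller reduction, so intervals built on the root-$n$ assumption would be too narrow.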
The Gap Between Open and Closed AI Models Might Be Shrinking. Here's Why That Matters
Today's best AI models, like OpenAI's ChatGPT and Anthropic's Claude, come with conditions: their creators control the terms on which they are accessed to prevent them being used in harmful ways. This is in contrast with 'open' models, which can be downloaded, modified, and used by anyone for almost any purpose. A new report by non-profit research organization Epoch AI found that open models available today are about a year behind the top closed models. "The best open model today is on par with closed models in performance, but with a lag of about one year," says Ben Cottier, lead researcher on the report. Meta's Llama 3.1 405B, an open model released in July, took about 16 months to match the capabilities of the first version of GPT-4.
How ChatGPT search paves the way for AI agents
It's been a busy few weeks for the company. In London, OpenAI announced updates to its new Realtime API platform, which allows developers to build voice features into their applications. The company is rolling out new voices and a function that lets developers generate prompts, which will allow them to build apps and more helpful voice assistants more quickly. Meanwhile for consumers, OpenAI announced it was launching ChatGPT search, which allows users to search the internet using the chatbot. Both developments pave the way for the next big thing in AI: agents.
AI Horizon Scanning -- White Paper p3395, IEEE-SA. Part III: Technology Watch: a selection of key developments, emerging technologies, and industry trends in Artificial Intelligence
Tambouratzis, George, Cortês, Marina, Liddle, Andrew R.
Generative Artificial Intelligence (AI) technologies are in a phase of unprecedented rapid development following the landmark release of Chat-GPT, which brought the phenomenon to wide public attention. As the deployment of AI products rises geometrically, considerable attention is being given to the threats and opportunities that AI technologies offer, and to the need for regulatory and standards initiatives to ensure that use of the technology aligns with societal needs and generates broad benefits while mitigating risks and threats. This manuscript is the third of a series of White Papers informing the development of IEEE-SA's p3395 'Standard for the Implementation of Safeguards, Controls, and Preventive Techniques for Artificial Intelligence Models' (Chair: Marina Cortês). This part focuses on assessing calmly and objectively, as far as is possible, the current state of Artificial Intelligence (AI) technology development and identifying predominant trends, prospects, and ensuing risks. It necessarily forms a snapshot of the current instant of a rapidly-evolving landscape, with new products and innovations emerging continuously. While our main focus is on software and hardware developments and their corporate context, we also briefly review progress on robotics within the AI context and describe some implications of the substantial and growing AI energy demand.
StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding
Lin, Junming, Fang, Zheng, Chen, Chi, Wan, Zihao, Luo, Fuwen, Li, Peng, Liu, Yang, Sun, Maosong
The rapid development of Multimodal Large Language Models (MLLMs) has expanded their capabilities from image comprehension to video understanding. However, most of these MLLMs focus primarily on offline video comprehension, necessitating extensive processing of all video frames before any queries can be made. This presents a significant gap compared to the human ability to watch, listen, think, and respond to streaming inputs in real time, highlighting the limitations of current MLLMs. In this paper, we introduce StreamingBench, the first comprehensive benchmark designed to evaluate the streaming video understanding capabilities of MLLMs. The benchmark consists of 18 tasks, featuring 900 videos and 4,500 human-curated QA pairs. Each video features five questions presented at different time points to simulate a continuous streaming scenario. We conduct experiments on StreamingBench with 13 open-source and proprietary MLLMs and find that even the most advanced proprietary MLLMs like Gemini 1.5 Pro and GPT-4o perform significantly below human-level streaming video understanding capabilities. We hope our work can facilitate further advancements for MLLMs, empowering them to approach human-level video comprehension and interaction in more realistic scenarios.
The rapid evolution of Multimodal Large Language Models (MLLMs) has significantly reshaped the field of Artificial Intelligence (Yang et al., 2023; Reid et al., 2024; Liu et al., 2024c;a). Current advanced MLLMs (Reid et al., 2024; Wang et al., 2024a; Yao et al., 2024) have already demonstrated exceptional performance in video understanding tasks, excelling on existing video benchmarks (Fu et al., 2024; Wang et al., 2024b; Zhou et al., 2024; Ataallah et al., 2024).
Moreover, several pioneering studies (Chen et al., 2024a; Zhang et al., 2024a; Wu et al., 2024) have focused on improving the ability of MLLMs to comprehend real-time online video streams, pushing the boundaries of their applicability and efficiency in dynamic environments. In the industry, streaming video understanding has also attracted significant attention, with OpenAI's GPT-4o (OpenAI, 2024) as a prominent example that demonstrates human-like perception and understanding of streaming inputs. Despite the recognition of the importance of streaming video understanding for MLLMs, most existing video understanding benchmarks (Fu et al., 2024; Wang et al., 2024b; Zhou et al., 2024) are designed for offline settings. In offline video benchmarks, questions are designed based on the entire video being visible. In contrast, StreamingBench presents questions at specific moments, with three main task categories specifically designed to evaluate fundamental capabilities in streaming video understanding.
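The difference between offline and streaming evaluation described above can be sketched as follows: each question carries the moment at which it is asked, and the model is shown only the part of the stream up to that moment. This is a minimal illustration under stated assumptions, not StreamingBench's actual harness; `StreamingQuestion`, `visible_frames`, `evaluate_streaming`, and the toy model are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class StreamingQuestion:
    ask_time_s: float  # moment in the stream at which the question is posed
    prompt: str
    answer: str


def visible_frames(frame_times_s: Sequence[float], ask_time_s: float) -> List[int]:
    """Indices of frames the model may see: only those at or before ask_time_s."""
    return [i for i, t in enumerate(frame_times_s) if t <= ask_time_s]


def evaluate_streaming(
    frame_times_s: Sequence[float],
    questions: Sequence[StreamingQuestion],
    model: Callable[[List[int], str], str],
) -> float:
    """Accuracy when every question sees only the stream up to its ask time."""
    if not questions:
        raise ValueError("no questions to evaluate")
    correct = 0
    for q in sorted(questions, key=lambda q: q.ask_time_s):
        context = visible_frames(frame_times_s, q.ask_time_s)
        if model(context, q.prompt) == q.answer:
            correct += 1
    return correct / len(questions)


# Toy stream: one frame per second for ten seconds.
frame_times = [float(i) for i in range(10)]
questions = [
    StreamingQuestion(2.5, "How many frames have been shown so far?", "3"),
    StreamingQuestion(7.0, "How many frames have been shown so far?", "8"),
]


def toy_model(frames: List[int], prompt: str) -> str:
    # Stand-in for an MLLM: answers with the number of frames it was given.
    return str(len(frames))


accuracy = evaluate_streaming(frame_times, questions, toy_model)  # 1.0
```

In an offline setup the same toy model would receive all ten frames for both questions and answer "10" each time, which is the mismatch the streaming protocol is meant to expose.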
[Vision Paper] PRObot: Enhancing Patient-Reported Outcome Measures for Diabetic Retinopathy using Chatbots and Generative AI
Pielka, Maren, Schneider, Tobias, Terheyden, Jan, Sifa, Rafet
We present an outline of the first large language model (LLM) based chatbot application in the context of patient-reported outcome measures (PROMs) for diabetic retinopathy. By utilizing the capabilities of current LLMs, we enable patients to provide feedback about their quality of life and treatment progress via an interactive application. The proposed framework offers significant advantages over the current approach, which encompasses only qualitative collection of survey data or a static survey with limited answer options. Using the PRObot LLM-PROM application, patients will be asked tailored questions about their individual challenges, and can give more detailed feedback on the progress of their treatment. Based on this input, we will use machine learning to infer conventional PROM scores, which can be used by clinicians to evaluate the treatment status. The goal of the application is to improve adherence to the healthcare system and treatments, and thus ultimately reduce cases of subsequent vision impairment. The approach needs to be further validated using a survey and a clinical study.
The Impact of Generative Artificial Intelligence on Ideation and the performance of Innovation Teams (Preprint)
Gindert, Michael, Müller, Marvin Lutz
This study investigates the impact of Generative Artificial Intelligence (GenAI) on the dynamics and performance of innovation teams during the idea generation phase of the innovation process. Utilizing a custom AI-augmented ideation tool, the study applies the Knowledge Spillover Theory of Entrepreneurship to understand the effects of AI on knowledge spillover, generation and application. Through a framed field experiment with participants divided into experimental and control groups, findings indicate that AI-augmented teams generated higher quality ideas in less time. GenAI application led to improved efficiency, knowledge exchange, increased satisfaction and engagement as well as enhanced idea diversity. These results highlight the transformative role of the field of AI within the innovation management domain and show that GenAI has a positive impact on important elements of the Knowledge Spillover Theory of Entrepreneurship, emphasizing its potential impact on innovation, entrepreneurship, and economic growth. Future research should further explore the dynamic interaction between GenAI and creative processes.
FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models
Xu, Zhipei, Zhang, Xuanyu, Li, Runyi, Tang, Zecheng, Huang, Qing, Zhang, Jian
The rapid development of generative AI is a double-edged sword, which not only facilitates content creation but also makes image manipulation easier and more difficult to detect. Although current image forgery detection and localization (IFDL) methods are generally effective, they tend to face two challenges: 1) black-box nature with unknown detection principle, 2) limited generalization across diverse tampering methods (e.g., Photoshop, DeepFake, AIGC-Editing). To address these issues, we propose the explainable IFDL task and design FakeShield, a multi-modal framework capable of evaluating image authenticity, generating tampered region masks, and providing a judgment basis based on pixel-level and image-level tampering clues. Additionally, we leverage GPT-4o to enhance existing IFDL datasets, creating the Multi-Modal Tamper Description dataSet (MMTD-Set) for training FakeShield's tampering analysis capabilities. Meanwhile, we incorporate a Domain Tag-guided Explainable Forgery Detection Module (DTE-FDM) and a Multi-modal Forgery Localization Module (MFLM) to address various types of tamper detection interpretation and achieve forgery localization guided by detailed textual descriptions. Extensive experiments demonstrate that FakeShield effectively detects and localizes various tampering techniques, offering an explainable and superior solution compared to previous IFDL methods.