Generative AI
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
Chen, Qiguang, Qin, Libo, Liu, Jinhao, Peng, Dengyun, Guan, Jiannan, Wang, Peng, Hu, Mengkang, Zhou, Yuhang, Gao, Te, Che, Wanxiang
Recent advancements in reasoning with large language models (RLLMs), such as OpenAI-O1 and DeepSeek-R1, have demonstrated their impressive capabilities in complex domains like mathematics and coding. A central factor in their success lies in the application of long chain-of-thought (Long CoT) characteristics, which enhance reasoning abilities and enable the solution of intricate problems. However, despite these developments, a comprehensive survey on Long CoT is still lacking, limiting our understanding of its distinctions from traditional short chain-of-thought (Short CoT) and complicating ongoing debates on issues like "overthinking" and "test-time scaling." This survey seeks to fill this gap by offering a unified perspective on Long CoT. (1) We first distinguish Long CoT from Short CoT and introduce a novel taxonomy to categorize current reasoning paradigms. (2) Next, we explore the key characteristics of Long CoT: deep reasoning, extensive exploration, and feasible reflection, which enable models to handle more complex tasks and produce more efficient, coherent outcomes compared to the shallower Short CoT. (3) We then investigate key phenomena such as the emergence of Long CoT with these characteristics, including overthinking, and test-time scaling, offering insights into how these processes manifest in practice. (4) Finally, we identify significant research gaps and highlight promising future directions, including the integration of multi-modal reasoning, efficiency improvements, and enhanced knowledge frameworks. By providing a structured overview, this survey aims to inspire future research and further the development of logical reasoning in artificial intelligence.
Team NYCU at Defactify4: Robust Detection and Source Identification of AI-Generated Images Using CNN and CLIP-Based Models
Yang, Tsan-Tsung, Chen, I-Wei, Chen, Kuan-Ting, Chiang, Shang-Hsuan, Peng, Wen-Chih
With the rapid advancement of generative AI, AI-generated images have become increasingly realistic, raising concerns about creativity, misinformation, and content authenticity. Detecting such images and identifying their source models has become a critical challenge in ensuring the integrity of digital media. This paper tackles the detection of AI-generated images and the identification of their source models using CNN and CLIP-ViT classifiers. For the CNN-based classifier, we use EfficientNet-B0 as the backbone and feed it RGB channels, frequency features, and reconstruction errors, while for CLIP-ViT, we adopt a pretrained CLIP image encoder to extract image features and an SVM to perform classification. Evaluated on the Defactify 4 dataset, our methods demonstrate strong performance in both tasks, with CLIP-ViT showing superior robustness to image perturbations. Compared to baselines like AEROBLADE and OCC-CLIP, our approach achieves competitive results. Notably, our method ranked Top-3 overall in the Defactify 4 competition, highlighting its effectiveness and generalizability. All of our implementations can be found at https://github.com/uuugaga/Defactify_4
It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education
Singh, Shrutika, Alyakin, Anton, Alber, Daniel Alexander, Stryker, Jaden, Tong, Ai Phuong S, Sangwon, Karl, Goff, Nicolas, de la Paz, Mathew, Hernandez-Rovira, Miguel, Park, Ki Yun, Leuthardt, Eric Claude, Oermann, Eric Karl
The performance of Large Language Models (LLMs) on multiple-choice question (MCQ) benchmarks is frequently cited as proof of their medical capabilities. We hypothesized that LLM performance on medical MCQs may in part be illusory and driven by factors beyond medical content knowledge and reasoning capabilities. To assess this, we created a novel benchmark of free-response questions with paired MCQs (FreeMedQA). Using this benchmark, we evaluated three state-of-the-art LLMs (GPT-4o, GPT-3.5, and LLama-3-70B-instruct) and found an average absolute deterioration of 39.43% in performance on free-response questions relative to multiple-choice (p = 1.3 × 10^-5), which was greater than the human performance decline of 22.29%. To isolate the role of the MCQ format on performance, we performed a masking study, iteratively masking out parts of the question stem. At 100% masking, the average LLM multiple-choice performance was 6.70% greater than random chance (p = 0.002), with one LLM (GPT-4o) obtaining an accuracy of 37.34%. Notably, for all LLMs the free-response performance was near zero. Our results highlight the shortcomings in medical MCQ benchmarks for overestimating the capabilities of LLMs in medicine, and, broadly, the potential for improving both human and machine assessments using LLM-evaluated free-response questions.
Measuring Similarity in Causal Graphs: A Framework for Semantic and Structural Analysis
Liu, Ning-Yuan Georgia, Yang, Flower, Jalali, Mohammad S.
Causal graphs are commonly used to understand and model complex systems. Researchers often construct these graphs from different perspectives, leading to significant variations for the same problem. Comparing causal graphs is, therefore, essential for evaluating assumptions, integrating insights, and resolving disagreements. The rise of AI tools has further amplified this need, as they are increasingly used to generate hypothesized causal graphs by synthesizing information from various sources such as prior research and community inputs, providing the potential for automating and scaling causal modeling for complex systems. Similar to humans, these tools also produce inconsistent results across platforms, versions, and iterations. Despite its importance, research on causal graph comparison remains scarce. Existing methods often focus solely on structural similarities, assuming identical variable names, and fail to capture nuanced semantic relationships, which are essential for causal graph comparison. We address these gaps by investigating methods for comparing causal graphs from both semantic and structural perspectives. First, we reviewed over 40 existing metrics and, based on predefined criteria, selected nine for evaluation from two threads of machine learning: four semantic similarity metrics and five learning graph kernels. We discuss the usability of these metrics in simple examples to illustrate their strengths and limitations. We then generated a synthetic dataset of 2,000 causal graphs using generative AI based on a reference diagram. Our findings reveal that each metric captures a different aspect of similarity, highlighting the need to use multiple metrics.
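To make the limitation of purely structural comparison concrete, here is a minimal sketch in Python. It uses Jaccard overlap of directed edge sets, chosen only for illustration; the abstract does not say this is one of the nine metrics evaluated. The function name `edge_jaccard` and the example variable names are hypothetical.

```python
def edge_jaccard(edges_a, edges_b):
    """Jaccard overlap of two directed edge sets.

    Edges are (cause, effect) tuples of variable names. Names are
    compared as exact strings, so this is a purely structural measure
    that assumes both graphs use identical variable names.
    """
    a, b = set(edges_a), set(edges_b)
    if not a and not b:
        return 1.0  # two empty graphs are treated as identical
    return len(a & b) / len(a | b)


# Two graphs describing the same causal chain with different wording
# (hypothetical example data).
graph_1 = [("stress", "sleep loss"), ("sleep loss", "burnout")]
graph_2 = [("stress", "insomnia"), ("insomnia", "burnout")]

same_names = edge_jaccard(graph_1, graph_1)   # identical graphs
renamed = edge_jaccard(graph_1, graph_2)      # one variable renamed
```

Because "sleep loss" and "insomnia" do not match as strings, the two graphs share no edges and the score drops to zero, even though a reader would consider them nearly equivalent. This is the kind of gap that a semantic similarity measure is meant to close.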
Conformal Prediction Sets for Deep Generative Models via Reduction to Conformal Regression
Shahrokhi, Hooman, Roy, Devjeet Raj, Yan, Yan, Arnaoudova, Venera, Doppa, Janaradhan Rao
We consider the problem of generating valid and small prediction sets by sampling outputs (e.g., software code and natural language text) from a black-box deep generative model for a given input (e.g., textual prompt). The validity of a prediction set is determined by a user-defined binary admissibility function depending on the target application. For example, in a code generation application, validity may require at least one program in the set to pass all test cases. To address this problem, we develop a simple and effective conformal inference algorithm referred to as Generative Prediction Sets (GPS). Given a set of calibration examples and black-box access to a deep generative model, GPS can generate prediction sets with provable guarantees. The key insight behind GPS is to exploit the inherent structure within the distribution over the minimum number of samples needed to obtain an admissible output to develop a simple conformal regression approach over the minimum number of samples. Experiments on multiple datasets for code and math word problems using different large language models demonstrate the efficacy of GPS over state-of-the-art methods.
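The idea of conformalizing the minimum number of samples can be sketched in Python as follows. This is a simplified illustration, not the authors' GPS algorithm: it uses a single marginal split-conformal quantile instead of a regression model conditioned on the input, and the names `min_samples_until_admissible`, `conformal_sample_count`, and the `budget` parameter are assumptions introduced here.

```python
import math


def min_samples_until_admissible(samples, is_admissible, budget):
    """Return the 1-based index of the first admissible sample.

    Returns budget + 1 when none of the first `budget` samples is
    admissible, marking the example as unsolved within the budget.
    """
    for k, sample in enumerate(samples[:budget], start=1):
        if is_admissible(sample):
            return k
    return budget + 1


def conformal_sample_count(calibration_counts, alpha):
    """Split-conformal upper quantile of the calibration counts.

    If calibration and test inputs are exchangeable, drawing this many
    samples for a new input yields at least one admissible output with
    probability >= 1 - alpha. Returns None when there are too few
    calibration examples to support the requested alpha.
    """
    n = len(calibration_counts)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return None
    return sorted(calibration_counts)[rank - 1]


# Hypothetical calibration data: the number of samples each
# calibration prompt needed before one output was admissible.
counts = [1, 2, 1, 3, 1, 5, 2]
set_size = conformal_sample_count(counts, alpha=0.25)
```

If the returned count exceeds the sampling budget, no set within the budget meets the target coverage under this sketch, and a caller would need to abstain or relax `alpha`. A regression-based variant would replace the single quantile with an input-dependent prediction plus a conformal correction.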
CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance
Deng, Yufan, Guo, Xun, Wang, Yizhi, Fang, Jacob Zhiyuan, Wang, Angtian, Yuan, Shenghai, Yang, Yiding, Liu, Bo, Huang, Haibin, Ma, Chongyang
Video generation has witnessed remarkable progress with the advent of deep generative models, particularly diffusion models. While existing methods excel in generating high-quality videos from text prompts or single images, personalized multi-subject video generation remains a largely unexplored challenge. This task involves synthesizing videos that incorporate multiple distinct subjects, each defined by separate reference images, while ensuring temporal and spatial consistency. Current approaches primarily rely on mapping subject images to keywords in text prompts, which introduces ambiguity and limits their ability to model subject relationships effectively. In this paper, we propose CINEMA, a novel framework for coherent multi-subject video generation that leverages a Multimodal Large Language Model (MLLM). Our approach eliminates the need for explicit correspondences between subject images and text entities, mitigating ambiguity and reducing annotation effort. By leveraging the MLLM to interpret subject relationships, our method facilitates scalability, enabling the use of large and diverse datasets for training. Furthermore, our framework can be conditioned on varying numbers of subjects, offering greater flexibility in personalized content creation. Through extensive evaluations, we demonstrate that our approach significantly improves subject consistency and overall video coherence, paving the way for advanced applications in storytelling, interactive media, and personalized video generation.
Red Teaming Contemporary AI Models: Insights from Spanish and Basque Perspectives
Romero-Arjona, Miguel, Valle, Pablo, Alonso, Juan C., Sánchez, Ana B., Ugarte, Miriam, Cazalilla, Antonia, Cambrón, Vicente, Parejo, José A., Arrieta, Aitor, Segura, Sergio
The battle for AI leadership is on, with OpenAI in the United States and DeepSeek in China as key contenders. In response to these global trends, the Spanish government has proposed ALIA, a public and transparent AI infrastructure incorporating small language models designed to support Spanish and co-official languages such as Basque. This paper presents the results of Red Teaming sessions, where ten participants applied their expertise and creativity to manually test three of the latest models from these initiatives (OpenAI o3-mini, DeepSeek R1, and ALIA Salamandra), focusing on biases and safety concerns. The results, based on 670 conversations, revealed vulnerabilities in all the models under test, with biased or unsafe responses ranging from 29.5% in o3-mini to 50.6% in Salamandra. These findings underscore the persistent challenges in developing reliable and trustworthy AI systems, particularly those intended to support Spanish and Basque languages.
On the Limitations of Vision-Language Models in Understanding Image Transforms
Anis, Ahmad Mustafa, Ali, Hasnain, Sarfraz, Saquib
Vision Language Models (VLMs) have demonstrated significant potential in various downstream tasks, including Image/Video Generation, Visual Question Answering, Multimodal Chatbots, and Video Understanding. However, these models often struggle with basic image transformations. This paper investigates the image-level understanding of VLMs, specifically CLIP by OpenAI and SigLIP by Google. Our findings reveal that these models lack comprehension of multiple image-level augmentations. To facilitate this study, we created an augmented version of the Flickr8k dataset, pairing each image with a detailed description of the applied transformation. We further explore how this deficiency impacts downstream tasks, particularly in image editing, and evaluate the performance of state-of-the-art Image2Image models on simple transformations.
Netflix's first gaming boss has left the company
Mike Verdu has left Netflix, according to Game File with Stephen Totilo. Netflix brought the former Oculus and EA exec onboard to launch and lead its gaming efforts in 2021. Under Verdu's leadership, the company released a bunch of new and ported titles and established an internal game development operation. In mid-2024, however, Netflix changed its gaming strategy and hired Alain Tascan, the executive vice president for game development at Epic Games, to lead its gaming efforts. Verdu still served as the VP for games until November 2024, after which he was named Vice President of generative AI for games. On LinkedIn, Verdu wrote that his role was about "driving a 'once in a generation' inflection point for game development and player experiences using generative AI."
ChatGPT firm reveals AI model that is 'good at creative writing'
The chief executive of OpenAI, Sam Altman, said the unnamed model was the first time he had been "really struck" by the written output of one of the startup's products. In a post on the social media platform X, Altman wrote: "We trained a new model that is good at creative writing (not sure yet how/when it will get released). This is the first time i have been really struck by something written by AI." "Make it fair, Sam," said Dan Conway, the organisation's chief executive. Altman posted an example of the model's output on X, after giving it the prompt: "Please write a metafictional literary short story about AI and grief." The story, narrated by an AI, begins with: "Before we go any further, I should admit this comes with instructions: be metafictional, be literary, be about AI and grief, and above all, be original."