Generative AI
Bias Analysis in Unconditional Image Generative Models
Zhang, Xiaofeng, Lin, Michelle, Lacoste-Julien, Simon, Courville, Aaron, Goyal, Yash
The widespread adoption of generative AI models has raised growing concerns about representational harm and potential discriminatory outcomes. Yet, despite a growing literature on this topic, the mechanisms by which bias emerges - especially in unconditional generation - remain poorly understood. We define the bias of an attribute as the difference between the probability of its presence in the observed distribution and its expected proportion in an ideal reference distribution. In our analysis, we train a set of unconditional image generative models and adopt a commonly used bias evaluation framework to study bias shift between training and generated distributions. Our experiments reveal that the detected attribute shifts are small. We find that the attribute shifts are sensitive to the attribute classifier used to label generated images in the evaluation framework, particularly when its decision boundaries fall in high-density regions. Our empirical analysis indicates that this classifier sensitivity is often observed in attribute values that lie on a spectrum, as opposed to exhibiting a binary nature. This highlights the need for more representative labeling practices, understanding the shortcomings through greater scrutiny of evaluation frameworks, and recognizing the socially complex nature of attributes when evaluating bias.
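The bias definition in this abstract can be made concrete with a small sketch. It assumes binary attribute labels produced by some classifier and a reference proportion chosen by the evaluator; the function names `attribute_bias` and `bias_shift` are hypothetical and are not taken from the paper.

```python
def attribute_bias(labels, reference_proportion):
    """Observed frequency of an attribute minus its reference proportion.

    `labels` is a sequence of 0/1 (or False/True) attribute predictions.
    `reference_proportion` is the expected share in the ideal reference
    distribution (an assumption supplied by the evaluator).
    """
    if len(labels) == 0:
        raise ValueError("labels must be non-empty")
    observed = sum(1 for present in labels if present) / len(labels)
    return observed - reference_proportion


def bias_shift(train_labels, generated_labels, reference_proportion):
    """Change in bias from the training set to the generated set."""
    return (attribute_bias(generated_labels, reference_proportion)
            - attribute_bias(train_labels, reference_proportion))


# Toy labels, invented purely for illustration.
train = [1, 1, 0, 0]
generated = [1, 1, 1, 0]
print(bias_shift(train, generated, reference_proportion=0.5))
```

Because both bias terms share the same reference proportion, the shift reduces to the difference in observed frequencies, so any labeling error made by the attribute classifier feeds directly into the measured shift.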
Assessment of Evolving Large Language Models in Upper Secondary Mathematics
Setälä, Mika, Sikström, Pieta, Heilala, Ville, Kärkkäinen, Tommi
Large language models (LLMs) have shown increasing promise in educational settings, yet their mathematical reasoning has been considered evolving. This study evaluates the mathematical capabilities of various LLMs using the Finnish matriculation examination, a high-stakes digital test for upper secondary education. Initial tests yielded moderate performance corresponding to mid-range grades, but later evaluations demonstrated substantial improvements as the language models evolved. Remarkably, some models achieved near-perfect or perfect scores, matching top student performance and qualifying for university admission. Our findings highlight the rapid advances in the mathematical proficiency of LLMs and illustrate their potential as underlying tools to support learning and teaching in a variety of ways.
Superstudent intelligence in thermodynamics
Loubet, Rebecca, Zittlau, Pascal, Hoffmann, Marco, Vollmer, Luisa, Fellenz, Sophie, Leitte, Heike, Jirasek, Fabian, Lenhard, Johannes, Hasse, Hans
In this short note, we report and analyze a striking event: OpenAI's large language model o3 has outwitted all students in a university exam on thermodynamics. The thermodynamics exam is a difficult hurdle for most students, where they must show that they have mastered the fundamentals of this important topic. Consequently, the failure rates are very high, A-grades are rare - and they are considered proof of the students' exceptional intellectual abilities. This is because pattern learning does not help in the exam. The problems can only be solved by knowledgeably and creatively combining principles of thermodynamics. We have given our latest thermodynamics exam not only to the students but also to OpenAI's most powerful reasoning model, o3, and have assessed the answers of o3 exactly the same way as those of the students. In zero-shot mode, the model o3 solved all problems correctly, better than all students who took the exam; its overall score was in the range of the best scores we have seen in more than 10,000 similar exams since 1985. This is a turning point: machines now excel in complex tasks, usually taken as proof of human intellectual capabilities. We discuss the consequences this has for the work of engineers and the education of future engineers.
Do Generative AI Tools Ensure Green Code? An Investigative Study
Sikand, Samarth, Mehra, Rohit, Sharma, Vibhu Saujanya, Kaulgud, Vikrant, Podder, Sanjay, Burden, Adam P.
Software sustainability is emerging as a primary concern, aiming to optimize resource utilization, minimize environmental impact, and promote a greener, more resilient digital ecosystem. The sustainability or "greenness" of software is typically determined by the adoption of sustainable coding practices. With a maturing ecosystem around generative AI, many software developers now rely on these tools to generate code using natural language prompts. Despite their potential advantages, there is a significant lack of studies on the sustainability aspects of AI-generated code. Specifically, how environmentally friendly is the AI-generated code based upon its adoption of sustainable coding practices? In this paper, we present the results of an early investigation into the sustainability aspects of AI-generated code across three popular generative AI tools - ChatGPT, BARD, and Copilot. The results highlight the default non-green behavior of tools for generating code, across multiple rules and scenarios. It underscores the need for further in-depth investigations and effective remediation strategies.
Diffusion-based Time Series Forecasting for Sewerage Systems
Pearson, Nicholas A., Cairoli, Francesca, Bortolussi, Luca, Russo, Davide, Zanello, Francesca
We introduce a novel deep learning approach that harnesses the power of generative artificial intelligence to enhance the accuracy of contextual forecasting in sewerage systems. By developing a diffusion-based model that processes multivariate time series data, our system excels at capturing complex correlations across diverse environmental signals, enabling robust predictions even during extreme weather events. To strengthen the model's reliability, we further calibrate its predictions with a conformal inference technique, tailored for probabilistic time series data, ensuring that the resulting prediction intervals are statistically reliable and cover the true target values with a desired confidence level. Our empirical tests on real sewerage system data confirm the model's exceptional capability to deliver reliable contextual predictions, maintaining accuracy even under severe weather conditions.
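The abstract does not specify its conformal technique, so the following is only a sketch of standard split conformal prediction with absolute-residual scores, not the authors' method. All function names are hypothetical. The coverage guarantee of this basic variant assumes exchangeable calibration data, which time series generally violate; that is presumably why the paper uses a tailored variant.

```python
import math


def conformal_quantile(calibration_scores, alpha):
    """Finite-sample corrected (1 - alpha) quantile of the scores."""
    n = len(calibration_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    # Rank ceil((n + 1) * (1 - alpha)) among the sorted scores.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this confidence level.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def calibrate(forecasts, observations, alpha):
    """Compute the interval half-width from held-out forecast errors."""
    scores = [abs(y - f) for f, y in zip(forecasts, observations)]
    return conformal_quantile(scores, alpha)


def conformal_interval(point_forecast, half_width):
    """Symmetric prediction interval around a new point forecast."""
    return point_forecast - half_width, point_forecast + half_width


# Toy calibration data, invented purely for illustration.
held_out_forecasts = [0, 0, 0, 0, 0, 0, 0, 0, 0]
held_out_observations = [1, 2, 3, 4, 5, 6, 7, 8, 9]
q = calibrate(held_out_forecasts, held_out_observations, alpha=0.5)
print(conformal_interval(10, q))
```

The half-width is taken from errors on data the model was not trained on, so the interval widens automatically when the underlying forecaster is less accurate.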
Ego-centric Learning of Communicative World Models for Autonomous Driving
Wang, Hang, Gao, Dechen, Zhang, Junshan
We study multi-agent reinforcement learning (MARL) for tasks in complex high-dimensional environments, such as autonomous driving. MARL is known to suffer from partial observability and non-stationarity. To tackle these challenges, information sharing is often employed, which, however, faces major hurdles in practice, including overwhelming communication overhead and scalability concerns. By making use of generative AI embodied in a world model together with its latent representation, we develop CALL, Communicative World Model, for MARL, where 1) each agent first learns its world model that encodes its state and intention into a low-dimensional latent representation with a smaller memory footprint, which can be shared with other agents of interest via lightweight communication; and 2) each agent carries out ego-centric learning while exploiting lightweight information sharing to enrich its world model, and then exploits its generalization capacity to improve prediction for better planning. We characterize the gain in prediction accuracy from information sharing and its impact on the performance gap. Extensive experiments are carried out on the challenging local trajectory planning tasks in the CARLA platform to demonstrate the performance gains of using CALL.
ChatGPT goes down worldwide leaving users 'to type their own emails'
ChatGPT has been hit by a worldwide outage, sparking chaos in the corporate world. The AI has gained popularity in the workforce, helping employees draft the perfect email, research information and provide customer support. Students have also been left in the dark as they harness the intelligence to take exams and craft reports. 'We are observing elevated error rates and latency across ChatGPT and the API,' OpenAI shared on its site. 'Our engineers have identified the root cause and are working as fast as possible to fix the issue.'
Yes, ChatGPT and Sora are down for users all around the world
BleepingComputer reports that AI company OpenAI has suffered a major outage today affecting several of the company's AI services, including AI chatbot ChatGPT and AI video generator Sora. Though you can still access ChatGPT, it currently takes an unusually long time to respond and may end up responding with error messages. Similar issues with error rates and increased latency are affecting Sora and OpenAI's API, which could affect third-party services. According to Downdetector, thousands of users have been reporting outages, and the rate of reports hasn't slowed as of this writing. According to OpenAI's status tracker, the problem was first acknowledged at 2:36 AM this morning, and the issue is now marked as "Identified" with the company "still working on implementing the mitigation for this issue."
WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code
Lin, Zhiyu, Zhou, Zhengda, Zhao, Zhiyuan, Wan, Tianrui, Ma, Yilun, Gao, Junyu, Li, Xuelong
With the rapid advancement of Generative AI technology, Multimodal Large Language Models (MLLMs) have the potential to act as AI software engineers capable of executing complex web application development. Considering that the model requires a confluence of multidimensional sub-capabilities to address the challenges of various development phases, constructing a multi-view evaluation framework is crucial for accurately guiding the enhancement of development efficiency. However, existing benchmarks usually fail to provide an assessment of sub-capabilities and focus solely on webpage generation outcomes. In this work, we draw inspiration from the principles of software engineering and further propose WebUIBench, a benchmark systematically designed to evaluate MLLMs in four key areas: WebUI Perception, HTML Programming, WebUI-HTML Understanding, and WebUI-to-Code. WebUIBench comprises 21K high-quality question-answer pairs derived from over 0.7K real-world websites. The extensive evaluation of 29 mainstream MLLMs uncovers the skill characteristics and various weaknesses that models encounter during the development process.
Comparing Credit Risk Estimates in the Gen-AI Era
Lavecchia, Nicola, Fadanelli, Sid, Ricciuti, Federico, Aloe, Gennaro, Bagli, Enrico, Giuffrida, Pietro, Vergari, Daniele
Generative AI technologies have demonstrated significant potential across diverse applications. This study provides a comparative analysis of credit score modeling techniques, contrasting traditional approaches with those leveraging generative AI. Our findings reveal that current generative AI models fall short of matching the performance of traditional methods, regardless of the integration strategy employed. These results highlight the limitations in the current capabilities of generative AI for credit risk scoring, emphasizing the need for further research and development before generative AI can be applied to this specific task or equivalent ones.