Oceania
Are Large Vision Language Models Good Game Players?
Wang, Xinyu, Zhuang, Bohan, Wu, Qi
Large Vision Language Models (LVLMs) have demonstrated remarkable abilities in understanding and reasoning about both visual and textual information. However, existing evaluation methods for LVLMs, primarily based on benchmarks like Visual Question Answering and image captioning, often fail to capture the full scope of LVLMs' capabilities. These benchmarks are limited by issues such as inadequate assessment of detailed visual perception, data contamination, and a lack of focus on multi-turn reasoning. To address these challenges, we propose LVLM-Playground, a game-based evaluation framework designed to provide a comprehensive assessment of LVLMs' cognitive and reasoning skills in structured environments. LVLM-Playground uses a set of games to evaluate LVLMs on four core tasks: Perceiving, Question Answering, Rule Following, and End-to-End Playing, with each target task designed to assess specific abilities, including visual perception, reasoning, decision-making, etc.
Annotating and Inferring Compositional Structures in Numeral Systems Across Languages
Rubehn, Arne, Rzymski, Christoph, Ciucci, Luca, van Dam, Kellen Parker, Kučerová, Alžběta, Bocklage, Katja, Snee, David, Stephen, Abishek, List, Johann-Mattis
Numeral systems across the world's languages vary in fascinating ways, both regarding their synchronic structure and the diachronic processes that determined how they evolved into their current shape. For a proper comparison of numeral systems across different languages, however, it is important to code them in a standardized form that allows for the comparison of basic properties. Here, we present a simple but effective coding scheme for numeral annotation, along with a workflow that helps to code numeral systems in a computer-assisted manner, providing sample data for numerals from 1 to 40 in 25 typologically diverse languages. We perform a thorough analysis of the sample, focusing on the systematic comparison between the underlying and the surface morphological structure. We further experiment with automated models for morpheme segmentation, where we find allomorphy to be the major reason for segmentation errors. Finally, we show that subword tokenization algorithms are not viable for discovering morphemes in low-resource scenarios.
AI Literacy in K-12 and Higher Education in the Wake of Generative AI: An Integrative Review
Gu, Xingjian, Ericson, Barbara J.
Accordingly, education researchers and practitioners have increasingly turned to AI literacy as an important learning objective. However, the definition of AI literacy remains vague. Researchers have used the term to describe learning interventions that differ in school contexts, learning objectives, and the types of AI technologies they use. Furthermore, research on AI literacy is shifting significantly in the wake of generative AI. Thus, it is crucial to review the field and develop a conceptual framework that captures the diverse conceptualizations of AI literacy. The concept of AI literacy and recognition of its potential significance are well-established [75, 127]. One of the pioneering works by Touretzky et al. in 2019 laid out "five big ideas" for the AI4K12 initiative: "computers perceive the world using sensors", "agents maintain models/representations of the world and use them for reasoning", "computers can learn from data", "making agents interact with humans is a substantial challenge for AI developers", and "AI applications can impact society in both positive and negative ways" [127]. This paper had a major influence on subsequent AI literacy curriculum design. The next year, another prominent work by Long and Magerko defined AI literacy as "a set
Assistance or Disruption? Exploring and Evaluating the Design and Trade-offs of Proactive AI Programming Support
Pu, Kevin, Lazaro, Daniel, Arawjo, Ian, Xia, Haijun, Xiao, Ziang, Grossman, Tovi, Chen, Yan
AI programming tools enable powerful code generation, and recent prototypes attempt to reduce user effort with proactive AI agents, but their impact on programming workflows remains unexplored. We introduce and evaluate Codellaborator, a design probe LLM agent that initiates programming assistance based on editor activities and task context. We explored three interface variants to assess trade-offs between increasingly salient AI support: prompt-only, proactive agent, and proactive agent with presence and context (Codellaborator). In a within-subject study (N=18), we find that proactive agents increase efficiency compared to the prompt-only paradigm, but also incur workflow disruptions. However, presence indicators and interaction context support alleviated disruptions and improved users' awareness of AI processes. We underscore trade-offs of Codellaborator on user control, ownership, and code understanding, emphasizing the need to adapt proactivity to programming processes. Our research contributes to the design exploration and evaluation of proactive AI systems, presenting design implications for AI-integrated programming workflows.
Unposed Sparse Views Room Layout Reconstruction in the Age of Pretrain Model
Huang, Yaxuan, Dai, Xili, Wang, Jianan, Qi, Xianbiao, Yuan, Yixing, Yue, Xiangyu
Room layout estimation from multiple-perspective images is poorly investigated due to the complexities that emerge from multi-view geometry, which requires multi-step solutions such as camera intrinsic and extrinsic estimation, image matching, and triangulation. However, in 3D reconstruction, the advancement of recent 3D foundation models such as DUSt3R has shifted the paradigm from the traditional multi-step structure-from-motion process to an end-to-end single-step approach. To this end, we introduce Plane-DUSt3R, a novel method for multi-view room layout estimation leveraging the 3D foundation model DUSt3R. Plane-DUSt3R incorporates the DUSt3R framework and fine-tunes on a room layout dataset (Structure3D) with a modified objective to estimate structural planes. By generating uniform and parsimonious results, Plane-DUSt3R enables room layout estimation with only a single post-processing step and 2D detection results. Unlike previous methods that rely on single-perspective or panorama images, Plane-DUSt3R extends the setting to handle multiple-perspective images. Moreover, it offers a streamlined, end-to-end solution that simplifies the process and reduces error accumulation. Experimental results demonstrate that Plane-DUSt3R not only outperforms state-of-the-art methods on the synthetic dataset but also proves robust and effective on in-the-wild data with different image styles such as cartoon. Our code is available at: https://github.com/justacar/Plane-DUSt3R
AI Governance InternationaL Evaluation Index (AGILE Index)
Zeng, Yi, Lu, Enmeng, Guan, Xin, Huangfu, Cunqing, Ruan, Zizhe, Younas, Ammar, Sun, Kang, Tang, Xuan, Wang, Yuwei, Suo, Hongjie, Liang, Dongqi, Han, Zhengqiang, Bao, Aorigele, Guo, Xiaoyang, Wang, Jin, Xie, Jiawei, Liang, Yao
The rapid advancement of Artificial Intelligence (AI) technology is profoundly transforming human society and concurrently presenting a series of ethical, legal, and social issues. The effective governance of AI has become a crucial global concern. Since 2022, the extensive deployment of generative AI, particularly large language models, marked a new phase in AI governance. Continuous efforts are being made by the international community in actively addressing the novel challenges posed by these AI developments. As consensus on international governance continues to be established and put into action, the practical importance of conducting a global assessment of the state of AI governance is progressively coming to light. In this context, we initiated the development of the AI Governance InternationaL Evaluation Index (AGILE Index). Adhering to the design principle, "the level of governance should match the level of development," the inaugural evaluation of the AGILE Index commences with an exploration of four foundational pillars: the development level of AI, the AI governance environment, the AI governance instruments, and the AI governance effectiveness. It covers 39 indicators across 18 dimensions to comprehensively assess the AI governance level of 14 representative countries globally. The index is utilized to delve into the status of AI governance to date in 14 countries for the first batch of evaluation. The aim is to depict the current state of AI governance in these countries through data scoring, assist them in identifying their governance stage and uncovering governance issues, and ultimately offer insights for the enhancement of their AI governance systems.
Unsupervised Attributed Dynamic Network Embedding with Stability Guarantees
Ceccherini, Emma, Gallagher, Ian, Jones, Andrew, Lawson, Daniel
While most existing network embedding techniques focus solely on the network features, nodes in real-world networks are associated with a rich set of attributes. For example, in a social network, the user's posts are significantly correlated with trust and following relationships, and it has been shown that jointly exploiting both information sources improves learning performance [Tang et al., 2013]. Network embeddings for static attributed networks include frameworks based on matrix factorisation [Yang et al., 2015], or deep learning [Gao and Huang, 2018, Tu et al., 2017, Tan et al., 2023, Sun et al., 2016, Zhang et al., 2018, Li et al., 2021]. Some existing dynamic network embeddings leverage node attributes, but their exploitation of node attributes is rather limited, as they are usually solely used to initialise the first layer [Sankar et al., 2020, Dwivedi et al., 2023, Liu et al., 2021, Xu et al., 2020b,a]. Approaches that purposefully exploit node attributes include frameworks based on matrix factorisation [Liu et al., 2020, Li et al., 2017], deep learning [Tang et al., 2022, Ahmed et al., 2024, Wei et al., 2019], or Bayesian modelling [Luodi et al., 2024]. However, to the best of our knowledge, none of these methods have stability guarantees, which ensure that if two node/time pairs "behave the same" in the network, their representation is the same up to noise. Stability allows for the comparison of embeddings over time because the embedding space has a consistent interpretation. Attributed unfolded adjacency spectral embedding (AUASE) is a framework for unsupervised dynamic attributed network embedding with stability guarantees.
Position: Don't use the CLT in LLM evals with fewer than a few hundred datapoints
Bowyer, Sam, Aitchison, Laurence, Ivanova, Desi R.
Rigorous statistical evaluations of large language models (LLMs), including valid error bars and significance testing, are essential for meaningful and reliable performance assessment. Currently, when such statistical measures are reported, they typically rely on the Central Limit Theorem (CLT). In this position paper, we argue that while CLT-based methods for uncertainty quantification are appropriate when benchmarks consist of thousands of examples, they fail to provide adequate uncertainty estimates for LLM evaluations that rely on smaller, highly specialized benchmarks. In these small-data settings, we demonstrate that CLT-based methods perform very poorly, usually dramatically underestimating uncertainty (i.e. producing error bars that are too small). We give recommendations for alternative frequentist and Bayesian methods that are both easy to implement and more appropriate in these increasingly common scenarios. We provide a simple Python library for these Bayesian methods at https://github.com/sambowyer/bayes_evals .
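To illustrate the small-sample failure mode described above, here is a minimal sketch that compares a CLT (normal-approximation) interval with two alternatives on a hypothetical 20-question benchmark where the model answers every question correctly. The Wilson score interval and the Beta posterior under a uniform Beta(1, 1) prior are assumptions chosen for illustration; they are not necessarily the exact methods the authors recommend, and none of the function names below come from the `bayes_evals` library.

```python
import math
import random
from statistics import NormalDist


def clt_interval(successes, n, level=0.95):
    """Normal-approximation (CLT) interval for an accuracy estimate."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson_interval(successes, n, level=0.95):
    """Wilson score interval, a frequentist alternative (assumed here)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def beta_credible_interval(successes, n, level=0.95, draws=100_000, seed=0):
    """Equal-tailed credible interval under a uniform Beta(1, 1) prior.

    Quantiles are estimated by Monte Carlo so that only the standard
    library is needed; the prior is an illustrative assumption.
    """
    rng = random.Random(seed)
    alpha, beta = 1 + successes, 1 + n - successes
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(draws))
    lower = samples[int((1 - level) / 2 * draws)]
    upper = samples[int((1 + level) / 2 * draws) - 1]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical result: 20 correct answers out of 20 questions.
    for name, fn in [("CLT", clt_interval),
                     ("Wilson", wilson_interval),
                     ("Beta posterior", beta_credible_interval)]:
        lower, upper = fn(20, 20)
        print(f"{name:>14}: [{lower:.3f}, {upper:.3f}]")
```

With a perfect score the CLT interval collapses to zero width, because the estimated variance p(1 - p) is zero, while the other two intervals still admit that the true accuracy could be well below 100%.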
Forthcoming machine learning and AI seminars: March 2025 edition
This post contains a list of the AI-related seminars that are scheduled to take place between 3 March and 30 April 2025. All events detailed here are free and open for anyone to attend virtually.

Pareto sensitivity, most-changing sub-fronts, and optimal knee solutions
Speaker: Luis Nunes Vicente (Lehigh University)
Organised by: Association of European Operational Research Societies
To receive the seminar link, sign up to the mailing list.

Title to be confirmed
Speaker: Maximilian Nickel (Meta AI)
Organised by: Vanderbilt University
Check the Google group for Zoom instructions.

Unsupervised Discovery of Interpretable Structure in Complex Systems
Speaker: Mark Hamilton (MIT/Microsoft)
Organised by: EPFL
Zoom link is here.
MoCFL: Mobile Cluster Federated Learning Framework for Highly Dynamic Network
Fang, Kai, Deng, Jiangtao, Dong, Chengzu, Naseem, Usman, Liu, Tongcun, Feng, Hailin, Wang, Wei
Frequent fluctuations of client nodes in highly dynamic mobile clusters can lead to significant changes in feature space distribution and data drift, posing substantial challenges to the robustness of existing federated learning (FL) strategies. To address these issues, we propose a mobile cluster federated learning framework (MoCFL). MoCFL enhances feature aggregation by introducing an affinity matrix that quantifies the similarity between local feature extractors from different clients, addressing dynamic data distribution changes caused by frequent client churn and topology changes. Additionally, MoCFL integrates historical and current feature information when training the global classifier, effectively mitigating the catastrophic forgetting problem frequently encountered in mobile scenarios. This synergistic combination ensures that MoCFL maintains high performance and stability in dynamically changing mobile environments. Experimental results on the UNSW-NB15 dataset show that MoCFL excels in dynamic environments, demonstrating superior robustness and accuracy while maintaining reasonable training costs.
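The affinity-matrix idea can be pictured with a small sketch. The abstract does not say how similarity is measured or how the matrix is used during aggregation, so everything below is an assumption made for illustration: each client's feature extractor is flattened into a parameter vector, similarity is cosine similarity, and each client receives an average of all extractors weighted by its (non-negative) affinities. The function names are hypothetical and do not come from the MoCFL code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two parameter vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def affinity_matrix(extractors):
    """Pairwise similarity between clients' flattened feature extractors."""
    return [[cosine(u, v) for v in extractors] for u in extractors]


def aggregate_for_client(i, extractors, affinity):
    """Affinity-weighted average of all extractors, from client i's viewpoint.

    Negative similarities are clipped to zero and the remaining weights are
    normalised to sum to one (an illustrative choice, not the paper's rule).
    """
    weights = [max(a, 0.0) for a in affinity[i]]
    total = sum(weights)
    if total == 0.0:
        return list(extractors[i])  # nothing similar: keep the local extractor
    weights = [w / total for w in weights]
    dim = len(extractors[i])
    return [sum(w * ext[d] for w, ext in zip(weights, extractors))
            for d in range(dim)]


if __name__ == "__main__":
    # Three hypothetical clients; the first two have drifted in the same direction.
    clients = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    A = affinity_matrix(clients)
    print(A)
    print(aggregate_for_client(0, clients, A))
```

Under this scheme a client that has just joined or drifted only borrows from peers whose extractors resemble its own, which is one plausible way an affinity matrix could dampen the effect of client churn.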