Overview
Arabic Multimodal Machine Learning: Datasets, Applications, Approaches, and Challenges
Haouhat, Abdelhamid, Bellaouar, Slimane, Nehar, Attia, Cherroun, Hadda, Abdelali, Ahmed
Multimodal Machine Learning (MML) aims to integrate and analyze information from diverse modalities, such as text, audio, and visuals, enabling machines to address complex tasks like sentiment analysis, emotion recognition, and multimedia retrieval. Recently, Arabic MML has reached a certain level of maturity in its foundational development, making it time to conduct a comprehensive survey. This paper explores Arabic MML by categorizing efforts through a novel taxonomy and analyzing existing research. Our taxonomy organizes these efforts into four key topics: datasets, applications, approaches, and challenges. By providing a structured overview, this survey offers insights into the current state of Arabic MML, highlighting areas that have not been investigated and critical research gaps. Researchers will be empowered to build upon the identified opportunities and address challenges to advance the field.
Edge-Cloud Collaborative Computing on Distributed Intelligence and Model Optimization: A Survey
Liu, Jing, Du, Yao, Yang, Kun, Wu, Jiaqi, Wang, Yan, Hu, Xiping, Wang, Zehua, Liu, Yang, Sun, Peng, Boukerche, Azzedine, Leung, Victor C. M.
Edge-cloud collaborative computing (ECCC) has emerged as a pivotal paradigm for addressing the computational demands of modern intelligent applications, integrating cloud resources with edge devices to enable efficient, low-latency processing. Recent advancements in AI, particularly deep learning and large language models (LLMs), have dramatically enhanced the capabilities of these distributed systems, yet introduce significant challenges in model deployment and resource management. In this survey, we comprehensive examine the intersection of distributed intelligence and model optimization within edge-cloud environments, providing a structured tutorial on fundamental architectures, enabling technologies, and emerging applications. Additionally, we systematically analyze model optimization approaches, including compression, adaptation, and neural architecture search, alongside AI-driven resource management strategies that balance performance, energy efficiency, and latency requirements. We further explore critical aspects of privacy protection and security enhancement within ECCC systems and examines practical deployments through diverse applications, spanning autonomous driving, healthcare, and industrial automation. Performance analysis and benchmarking techniques are also thoroughly explored to establish evaluation standards for these complex systems. Furthermore, the review identifies critical research directions including LLMs deployment, 6G integration, neuromorphic computing, and quantum computing, offering a roadmap for addressing persistent challenges in heterogeneity management, real-time processing, and scalability. By bridging theoretical advancements and practical deployments, this survey offers researchers and practitioners a holistic perspective on leveraging AI to optimize distributed computing environments, fostering innovation in next-generation intelligent systems.
Cequel: Cost-Effective Querying of Large Language Models for Text Clustering
Wang, Hongtao, Zhang, Taiyan, Yang, Renchi, Xu, Jianliang
Text clustering aims to automatically partition a collection of documents into coherent groups based on their linguistic features. In the literature, this task is formulated either as metric clustering over pre-trained text embeddings or as graph clustering based on pairwise similarities derived from an oracle, e.g., a large machine learning model. Recent advances in large language models (LLMs) have significantly improved this field by providing high-quality contextualized embeddings and accurate semantic similarity estimates. However, leveraging LLMs at scale introduces substantial computational and financial costs due to the large number of required API queries or inference calls. To address this issue, we propose Cequel, a cost-effective framework that achieves accurate text clustering under a limited budget of LLM queries. At its core, Cequel constructs must-link and cannot-link constraints by selectively querying LLMs on informative text pairs or triplets, identified via our proposed algorithms, EdgeLLM and TriangleLLM. These constraints are then utilized in a weighted constrained clustering algorithm to form high-quality clusters. Specifically, EdgeLLM and TriangleLLM employ carefully designed greedy selection strategies and prompting techniques to identify and extract informative constraints efficiently. Experiments on multiple benchmark datasets demonstrate that Cequel consistently outperforms existing methods in unsupervised text clustering under the same query budget.
Tutorial on the Probabilistic Unification of Estimation Theory, Machine Learning, and Generative AI
Extracting meaning from uncertain, noisy data is a fundamental problem across time series analysis, pattern recognition, and language modeling. This survey presents a unified mathematical framework that connects classical estimation theory, statistical inference, and modern machine learning, including deep learning and large language models. By analyzing how techniques such as maximum likelihood estimation, Bayesian inference, and attention mechanisms address uncertainty, the paper illustrates that many AI methods are rooted in shared probabilistic principles. Through illustrative scenarios including system identification, image classification, and language generation, we show how increasingly complex models build upon these foundations to tackle practical challenges like overfitting, data sparsity, and interpretability. In other words, the work demonstrates that maximum likelihood, MAP estimation, Bayesian classification, and deep learning all represent different facets of a shared goal: inferring hidden causes from noisy and/or biased observations. It serves as both a theoretical synthesis and a practical guide for students and researchers navigating the evolving landscape of machine learning.
MMiC: Mitigating Modality Incompleteness in Clustered Federated Learning
Yang, Lishan, Zhang, Wei Emma, Sheng, Quan Z., Yao, Lina, Chen, Weitong, Shakeri, Ali
In the era of big data, data mining has become indispensable for uncovering hidden patterns and insights from vast and complex datasets. The integration of multimodal data sources further enhances its potential. Multimodal Federated Learning (MFL) is a distributed approach that enhances the efficiency and quality of multimodal learning, ensuring collaborative work and privacy protection. However, missing modalities pose a significant challenge in MFL, often due to data quality issues or privacy policies across the clients. In this work, we present MMiC, a framework for Mitigating Modality incompleteness in MFL within the Clusters. MMiC replaces partial parameters within client models inside clusters to mitigate the impact of missing modalities. Furthermore, it leverages the Banzhaf Power Index to optimize client selection under these conditions. Finally, MMiC employs an innovative approach to dynamically control global aggregation by utilizing Markovitz Portfolio Optimization. Extensive experiments demonstrate that MMiC consistently outperforms existing federated learning architectures in both global and personalized performance on multimodal datasets with missing modalities, confirming the effectiveness of our proposed solution. Our code is available at https://github.com/gotobcn8/MMiC.
Sensing, Social, and Motion Intelligence in Embodied Navigation: A Comprehensive Survey
Xiong, Chaoran, Huang, Yulong, Yu, Fangwen, Chen, Changhao, Wang, Yue, Xia, Songpengchen, Pei, Ling
Embodied navigation (EN) advances traditional navigation by enabling robots to perform complex egocentric tasks through sensing, social, and motion intelligence. In contrast to classic methodologies that rely on explicit localization and pre-defined maps, EN leverages egocentric perception and human-like interaction strategies. This survey introduces a comprehensive EN formulation structured into five stages: Transition, Observation, Fusion, Reward-policy construction, and Action (TOFRA). The TOFRA framework serves to synthesize the current state of the art, provide a critical review of relevant platforms and evaluation metrics, and identify critical open research challenges. A list of studies is available at https://github.com/Franky-X/Awesome-Embodied-Navigation.
ContextualLVLM-Agent: A Holistic Framework for Multi-Turn Visually-Grounded Dialogue and Complex Instruction Following
Han, Seungmin, Kwon, Haeun, Park, Ji-jun, Yoon, Taeyang
--Despite significant advancements in Large Language Models (LLMs) and Large Vision-Language Models (L VLMs), current models still face substantial challenges in handling complex, multi-turn, and visually-grounded tasks that demand deep reasoning, sustained contextual understanding, entity tracking, and multi-step instruction following. Existing benchmarks often fall short in capturing the dynamism and intricacies of real-world multi-modal interactions, leading to issues such as context loss and visual hallucinations. T o address these limitations, we introduce MMDR-Bench (Multi-Modal Dialogue Reasoning Benchmark), a novel dataset comprising 300 meticulously designed complex multi-turn dialogue scenarios, each averaging 5-7 turns and evaluated across six core dimensions including visual entity tracking and reasoning depth. Our extensive experiments on MMDR-Bench demonstrate that CoL VLM Agent consistently achieves superior performance, attaining an average human evaluation score of 4.03, notably surpassing state-of-the-art commercial models like GPT -4o (3.92) and Gemini 1.5 Pro (3.85). The framework exhibits significant advantages in reasoning depth, instruction adherence, and error suppression, and maintains robust performance over extended dialogue turns, validating the effectiveness of its modular design and iterative approach for complex multi-modal interactions. The rapid advancements in Large Language Models (LLMs) [1], [2] and Large Vision-Language Models (L VLMs) [3] have revolutionized our ability to understand and generate text and images, respectively. These models have demonstrated remarkable capabilities across a wide spectrum of tasks, from natural language understanding to complex image captioning and visual question answering. As these models become increasingly sophisticated, there is a growing demand for intelligent systems that can engage in more dynamic, nuanced, and context-aware interactions, mirroring human cognitive processes in real-world scenarios. This necessitates models that can not only comprehend textual instructions but also seamlessly integrate visual information, perform deep reasoning within evolving dialogue contexts, track entities, understand spatial relationships, and ultimately execute multi-step complex instructions [4], [5].
LLMs and Agentic AI in Insurance Decision-Making: Opportunities and Challenges For Africa
Hill, Graham, Gong, JingYuan, Babeli, Thulani, Mots'oehli, Moseli, Wanjiku, James Gachomo
In this work, we highlight the transformative potential of Artificial Intelligence (AI), particularly Large Language Models (LLMs) and agentic AI, in the insurance sector. We consider and emphasize the unique opportunities, challenges, and potential pathways in insurance amid rapid performance improvements, increased open-source access, decreasing deployment costs, and the complexity of LLM or agentic AI frameworks. To bring it closer to home, we identify critical gaps in the African insurance market and highlight key local efforts, players, and partnership opportunities. Finally, we call upon actuaries, insurers, regulators, and tech leaders to a collaborative effort aimed at creating inclusive, sustainable, and equitable AI strategies and solutions: by and for Africans.
Safe and Transparent Robots for Human-in-the-Loop Meat Processing
Parekh, Sagar, Grothoff, Casey, Wright, Ryan, White, Robin, Losey, Dylan P.
Labor shortages have severely affected the meat processing sector. Automated technology has the potential to support the meat industry, assist workers, and enhance job quality. However, existing automation in meat processing is highly specialized, inflexible, and cost intensive. Instead of forcing manufacturers to buy a separate device for each step of the process, our objective is to develop general-purpose robotic systems that work alongside humans to perform multiple meat processing tasks. Through a recently conducted survey of industry experts, we identified two main challenges associated with integrating these collaborative robots alongside human workers. First, there must be measures to ensure the safety of human coworkers; second, the coworkers need to understand what the robot is doing. This paper addresses both challenges by introducing a safety and transparency framework for general-purpose meat processing robots. For safety, we implement a hand-detection system that continuously monitors nearby humans. This system can halt the robot in situations where the human comes into close proximity of the operating robot. We also develop an instrumented knife equipped with a force sensor that can differentiate contact between objects such as meat, bone, or fixtures. For transparency, we introduce a method that detects the robot's uncertainty about its performance and uses an LED interface to communicate that uncertainty to the human. Additionally, we design a graphical interface that displays the robot's plans and allows the human to provide feedback on the planned cut. Overall, our framework can ensure safe operation while keeping human workers in-the-loop about the robot's actions which we validate through a user study.
CaTE Data Curation for Trustworthy AI
Clemens-Sewall, Mary Versa, Cervantes, Christopher, Rafkin, Emma, Otte, J. Neil, Magelinski, Tom, Lewis, Libby, Liu, Michelle, Udwin, Dana, Kirkman-Bey, Monique
This report provides practical guidance to teams designing or developing AI-enabled systems for how to promote trustworthiness during the data curation phase of development. In this report, the authors first define data, the data curation phase, and trustworthiness. We then describe a series of steps that the development team, especially data scientists, can take to build a trustworthy AI-enabled system. We enumerate the sequence of core steps and trace parallel paths where alternatives exist. The descriptions of these steps include strengths, weaknesses, preconditions, outcomes, and relevant open-source software tool implementations. In total, this report is a synthesis of data curation tools and approaches from relevant academic literature, and our goal is to equip readers with a diverse yet coherent set of practices for improving AI trustworthiness.