GenDDS: Generating Diverse Driving Video Scenarios with Prompt-to-Video Generative Model

arXiv.org Artificial Intelligence

Autonomous driving training requires a diverse range of datasets encompassing various traffic conditions, weather scenarios, and road types. Traditional data augmentation methods often struggle to generate datasets that represent rare occurrences. To address this challenge, we propose GenDDS, a novel approach for generating driving scenarios by leveraging the capabilities of Stable Diffusion XL (SDXL), an advanced latent diffusion model. Our methodology uses descriptive prompts to guide the synthesis process, with the aim of producing realistic and diverse driving scenarios. Combining SDXL with recent computer vision techniques such as ControlNet and Hotshot-XL, we have built a complete pipeline for video generation. We employ the KITTI dataset, which includes real-world driving videos, to train the model. Through a series of experiments, we demonstrate that our model can generate high-quality driving videos that closely replicate the complexity and variability of real-world driving scenarios. This research contributes to the development of sophisticated training data for autonomous driving systems and opens new avenues for creating virtual environments for simulation and validation purposes.


Short-Term Electricity-Load Forecasting by Deep Learning: A Comprehensive Survey

arXiv.org Artificial Intelligence

Short-Term Electricity-Load Forecasting (STELF) refers to the prediction of the immediate demand (in the next few hours to several days) for the power system. Various external factors, such as weather changes and the emergence of new electricity consumption scenarios, can impact electricity demand, causing load data to fluctuate and become non-linear, which increases the complexity and difficulty of STELF. In the past decade, deep learning has been applied to STELF, modeling and predicting electricity demand with high accuracy, and contributing significantly to the development of STELF. This paper provides a comprehensive survey on deep-learning-based STELF over the past ten years. It examines the entire forecasting process, including data pre-processing, feature extraction, deep-learning modeling and optimization, and results evaluation. This paper also identifies some research challenges and potential research directions to be further investigated in future work.


Articulation Work and Tinkering for Fairness in Machine Learning

arXiv.org Artificial Intelligence

The field of fair AI aims to counter biased algorithms through computational modelling. However, it faces increasing criticism for perpetuating the use of overly technical and reductionist methods. As a result, novel approaches appear in the field to address more socially-oriented and interdisciplinary (SOI) perspectives on fair AI. In this paper, we take this dynamic as the starting point to study the tension between computer science (CS) and SOI research. By drawing on STS and CSCW theory, we position fair AI research as a matter of 'organizational alignment': what makes research 'doable' is the successful alignment of three levels of work organization (the social world, the laboratory, and the experiment). Based on qualitative interviews with CS researchers, we analyze the tasks, resources, and actors required for doable research in the case of fair AI. We find that CS researchers engage with SOI research to some extent, but organizational conditions, articulation work, and ambiguities of the social world constrain the doability of SOI research for them. Based on our findings, we identify and discuss problems for aligning CS and SOI as fair AI continues to evolve.


Harnessing the Intrinsic Knowledge of Pretrained Language Models for Challenging Text Classification Settings

arXiv.org Artificial Intelligence

Text classification, a classic task in natural language processing (NLP), involves assigning predefined categories to textual data and is crucial for applications ranging from sentiment analysis to spam detection. This thesis advances text classification by harnessing the intrinsic knowledge of Pretrained Language Models (PLMs) to address three challenging scenarios: distractor selection for multiple-choice cloze questions, improving robustness for prompt-based zero-shot text classification, and demonstration selection for retrieval-based in-context learning. First, we focus on selecting distractors for multiple-choice cloze questions, ensuring that they are misleading yet incorrect. We assess the relationship between human experts' annotations (accept/reject) and various features, including context-free features (e.g., word frequency) and context-sensitive features (e.g., conditional probabilities of fill-in-the-blank words). We utilize pretrained embeddings and follow annotation instructions for context-free feature design, and we find that using contextualized word representations from PLMs as features drastically improves performance over traditional feature-based models, even rivaling human performance (Chapter 3).


A Survey on Evaluation of Multimodal Large Language Models

arXiv.org Artificial Intelligence

Multimodal Large Language Models (MLLMs) mimic the human perception and reasoning system by integrating powerful Large Language Models (LLMs) with various modality encoders (e.g., vision, audio), positioning LLMs as the "brain" and the modality encoders as sensory organs. This framework endows MLLMs with human-like capabilities and suggests a potential pathway towards achieving artificial general intelligence (AGI). With the emergence of all-round MLLMs like GPT-4V and Gemini, a multitude of evaluation methods have been developed to assess their capabilities across different dimensions. This paper presents a systematic and comprehensive review of MLLM evaluation methods, covering the following key aspects: (1) the background of MLLMs and their evaluation; (2) "what to evaluate", which reviews and categorizes existing MLLM evaluation tasks based on the capabilities assessed, including general multimodal recognition, perception, reasoning and trustworthiness, and domain-specific applications such as socioeconomic, natural sciences and engineering, medical usage, AI agent, remote sensing, video and audio processing, 3D point cloud analysis, and others; (3) "where to evaluate", which summarizes MLLM evaluation benchmarks into general and specific benchmarks; and (4) "how to evaluate", which reviews and illustrates MLLM evaluation steps and metrics. Our overarching goal is to provide valuable insights for researchers in the field of MLLM evaluation, thereby facilitating the development of more capable and reliable MLLMs. We emphasize that evaluation should be regarded as a critical discipline, essential for advancing the field of MLLMs.


Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention

arXiv.org Artificial Intelligence

The Transformer architecture has revolutionized deep learning through its Self-Attention mechanism, which effectively captures contextual information. However, the memory footprint of Self-Attention presents significant challenges for long-sequence tasks. Grouped Query Attention (GQA) addresses this issue by grouping queries and mean-pooling the corresponding key-value heads - reducing the number of overall parameters and memory requirements in a flexible manner without adversely compromising model accuracy. In this work, we introduce enhancements to GQA, focusing on two novel approaches that deviate from the static nature of grouping: Key-Distributed GQA (KDGQA) and Dynamic Key-Distributed GQA (DGQA), which leverage information from the norms of the key heads to inform query allocation. Specifically, KDGQA looks at the ratios of the norms of the key heads during each forward pass, while DGQA examines the ratios of the norms as they evolve through training. Additionally, we present Perturbed GQA (PGQA) as a case-study, which introduces variability in (static) group formation via subtracting noise from the attention maps. Our experiments with up-trained Vision Transformers, for Image Classification on datasets such as CIFAR-10, CIFAR-100, Food101, and Tiny ImageNet, demonstrate the promise of these variants in improving upon the original GQA through more informed and adaptive grouping mechanisms: specifically ViT-L experiences accuracy gains of up to 8% when utilizing DGQA in comparison to GQA and other variants. We further analyze the impact of the number of Key-Value Heads on performance, underscoring the importance of utilizing query-key affinities. Code is available on GitHub.
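The grouping step described in this abstract can be illustrated with a minimal, dependency-free Python sketch. It is not taken from the paper's code: the function names `mean_pool_heads`, `uniform_group_of_query`, and `allocate_queries_by_norm` are hypothetical, and the proportional allocation rule (largest-remainder rounding of norm ratios) is only an assumption about what "using key-head norms to inform query allocation" could look like, not the authors' exact KDGQA or DGQA procedure.

```python
from typing import List

Matrix = List[List[float]]  # one head: rows = tokens, columns = head dimension


def mean_pool_heads(heads: List[Matrix], num_groups: int) -> List[Matrix]:
    """Average consecutive key (or value) heads into `num_groups` shared heads."""
    if num_groups <= 0 or len(heads) % num_groups != 0:
        raise ValueError("number of heads must be divisible by num_groups")
    group_size = len(heads) // num_groups
    pooled = []
    for g in range(num_groups):
        group = heads[g * group_size:(g + 1) * group_size]
        rows, cols = len(group[0]), len(group[0][0])
        pooled.append([
            [sum(h[r][c] for h in group) / group_size for c in range(cols)]
            for r in range(rows)
        ])
    return pooled


def uniform_group_of_query(query_head: int, num_query_heads: int,
                           num_groups: int) -> int:
    """Static GQA-style mapping: equal-sized contiguous blocks of query heads."""
    return query_head // (num_query_heads // num_groups)


def allocate_queries_by_norm(key_norms: List[float],
                             num_query_heads: int) -> List[int]:
    """ASSUMED rule: give each group a query count proportional to its key norm.

    Uses largest-remainder rounding so the counts sum to `num_query_heads`.
    """
    total = sum(key_norms)
    shares = [n / total * num_query_heads for n in key_norms]
    counts = [int(s) for s in shares]
    leftover = num_query_heads - sum(counts)
    # Hand remaining heads to the groups with the largest fractional parts.
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Four key heads (one token, dimension two) pooled into two shared heads.
keys = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]], [[7.0, 8.0]]]
pooled_keys = mean_pool_heads(keys, num_groups=2)
```

Under the uniform mapping every group serves the same number of query heads, whereas the norm-based sketch lets groups with larger key norms serve more of them; this contrast is the only point the example is meant to convey.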


Pixels to Prose: Understanding the art of Image Captioning

arXiv.org Artificial Intelligence

In the era of evolving artificial intelligence, machines are increasingly emulating human-like capabilities, including visual perception and linguistic expression. Image captioning stands at the intersection of these domains, enabling machines to interpret visual content and generate descriptive text. This paper provides a thorough review of image captioning techniques, catering to individuals entering the field of machine learning who seek a comprehensive understanding of available options, from foundational methods to state-of-the-art approaches. Beginning with an exploration of primitive architectures, the review traces the evolution of image captioning models to the latest cutting-edge solutions. By dissecting the components of these architectures, readers gain insights into the underlying mechanisms and can select suitable approaches tailored to specific problem requirements without duplicating efforts. The paper also delves into the application of image captioning in the medical domain, illuminating its significance in various real-world scenarios. Furthermore, the review offers guidance on evaluating the performance of image captioning systems, highlighting key metrics for assessment. By synthesizing theoretical concepts with practical application, this paper equips readers with the knowledge needed to navigate the complex landscape of image captioning and harness its potential for diverse applications in machine learning and beyond.


Adversarial Network Optimization under Bandit Feedback: Maximizing Utility in Non-Stationary Multi-Hop Networks

arXiv.org Artificial Intelligence

Stochastic Network Optimization (SNO) concerns scheduling in stochastic queueing systems. It has been widely studied in network theory. Classical SNO algorithms require network conditions to be stationary over time, which fails to capture the non-stationary components in many real-world scenarios. Many existing algorithms also assume knowledge of network conditions before each decision, which rules out applications where unpredictability is present. Motivated by these issues, we consider Adversarial Network Optimization (ANO) under bandit feedback. Specifically, we consider the task of *i)* maximizing some unknown and time-varying utility function associated with the scheduler's actions, where *ii)* the underlying network is a non-stationary multi-hop one whose conditions change arbitrarily with time, and *iii)* only bandit feedback (the effect of actually deployed actions) is revealed after decisions. Our proposed `UMO2` algorithm ensures network stability and also matches the utility maximization performance of any "mildly varying" reference policy up to a polynomially decaying gap. To our knowledge, no previous ANO algorithm handled multi-hop networks or achieved utility guarantees under bandit feedback, whereas ours can do both. Technically, our method builds upon a novel integration of online learning into Lyapunov analyses: To handle complex inter-dependencies among queues in multi-hop networks, we propose meticulous techniques to balance online learning and Lyapunov arguments. To tackle the learning obstacles due to potentially unbounded queue sizes, we design a new online linear optimization algorithm that automatically adapts to loss magnitudes. To maximize utility, we propose a bandit convex optimization algorithm with novel queue-dependent learning rate scheduling that suits drastically varying queue lengths. Our new insights in online learning can be of independent interest.


Evaluating the Energy Consumption of Machine Learning: Systematic Literature Review and Experiments

arXiv.org Artificial Intelligence

Monitoring, understanding, and optimizing the energy consumption of Machine Learning (ML) are among the reasons why it is necessary to evaluate the energy usage of ML. However, there exists no universal tool that can meet this need for all use cases, and there may even be disagreement on how to evaluate energy consumption for a specific use case. Tools and methods are based on different approaches, each with their own advantages and drawbacks, and they need to be mapped out and explained in order to select the most suitable one for a given situation. We address this challenge through two approaches. First, we conduct a systematic literature review of all tools and methods that make it possible to evaluate the energy consumption of ML (both at training and at inference), irrespective of whether they were originally designed for machine learning or general software. Second, we develop and use an experimental protocol to compare a selection of these tools and methods. The comparison is both qualitative and quantitative, on a range of ML tasks of differing nature (vision, language) and computational complexity. The systematic literature review serves as a comprehensive guide for understanding the array of tools and methods used in evaluating the energy consumption of ML, for various use cases ranging from basic energy monitoring to consumption optimization. Two open-source repositories are provided for further exploration. The first one contains tools that can be used to replicate this work or extend the current review. The second repository houses the experimental protocol, allowing users to augment the protocol with new ML computing tasks and additional energy evaluation tools.


Optimizing Lung Cancer Detection in CT Imaging: A Wavelet Multi-Layer Perceptron (WMLP) Approach Enhanced by Dragonfly Algorithm (DA)

arXiv.org Artificial Intelligence

Early-stage detection is critical, as it significantly improves the five-year survival rate from a dismal 5% in late-stage diagnoses to over 50% [2]. The advent of advanced screening technologies promises to substantially improve patient prognoses. The field of medical imaging has been revolutionized by recent strides in deep learning, yielding significant enhancements in the detection and classification of lung cancer from CT images. Innovations such as the 3D Convolutional Neural Network (CNN) approach by Diviya et al. (2024) and the LCD-Capsule Network by Bushara et al. (2023) have demonstrated the potential of these models to transform early detection and diagnosis [3, 4]. X-ray and computed tomography (CT) scans are pivotal in lung cancer diagnostics, offering high-resolution imagery that outperforms traditional radiography in detecting small and low-contrast pulmonary nodules [5, 6, 7].