Goto

Collaborating Authors

 South America


Configura\c{c}\~ao e opera\c{c}\~ao da plataforma Clearpath Husky A200 e manipulador Cobot UR5 2-finger gripper

arXiv.org Artificial Intelligence

This article presents initial configuration work and use of the robotic platform and manipulator in question. The development of the ideal configuration for using this robot serves as a guide for new users and also validates its functionality for use in projects. Husky is a large payload capacity and power systems robotics development platform that accommodates a wide variety of payloads, customized to meet research needs. Together with the Cobot UR5 Manipulator attached to its base, it expands the application area of its capacity in projects. Advances in robots and mobile manipulators have revolutionized industries by automating tasks that previously required human intervention. These innovations alone increase productivity but also reduce operating costs, which makes the company more competitive in an evolving global market. Therefore, this article investigates the functionalities of this robot to validate its execution in robotics projects.


Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders

arXiv.org Artificial Intelligence

One of the major challenges in training SAEs is the substantial storage and throughput required for latent activations. While text data requires only 2 bytes per token, latent activations occupy 8K bytes per token--resulting in a 4,096x increase in both storage needs and disk throughput. This, combined with the relatively fast training steps of shallow SAEs, means that data loading quickly becomes the main bottleneck in the training process. Due to these infrastructure constraints, we do not save activations in advance but instead generate them on-the-fly. This contrasts with the approach taken by Lieberum et al. (2024); Templeton et al. (2024b), where activations are pre-saved and a high-speed dataloading pipeline is built to keep up with training. To manage this, we adopt a producer-consumer model. Language Models (LMs) generate activations and store them in an activation buffer, while the SAEs consume the activations in random order. The process is serialized: once the buffer is full, SAE training begins, and when half the buffer is consumed, the LMs refill it. Each time the buffer is refilled, we shuffle it to introduce randomness into the training data without needing to save and shuffle all activations at once.


Exploring Welfare Maximization and Fairness in Participatory Budgeting

arXiv.org Artificial Intelligence

Participatory budgeting (PB) is a voting paradigm for distributing a divisible resource, usually called a budget, among a set of projects by aggregating the preferences of individuals over these projects. It is implemented quite extensively for purposes such as government allocating funds to public projects and funding agencies selecting research proposals to support. This PhD dissertation studies the welfare-related and fairness-related objectives for different PB models. Our contribution lies in proposing and exploring novel PB rules that maximize welfare and promote fairness, as well as, in introducing and investigating a range of novel utility notions, axiomatic properties, and fairness notions, effectively filling the gaps in the existing literature for each PB model. The thesis is divided into two main parts, the first focusing on dichotomous and the second focusing on ordinal preferences. Each part considers two cases: (i) the cost of each project is restricted to a single value and partial funding is not permitted and (ii) the cost of each project is flexible and may assume multiple values.


Your Image is Secretly the Last Frame of a Pseudo Video

arXiv.org Artificial Intelligence

Diffusion models, which can be viewed as a special case of hierarchical variational autoencoders (HVAEs), have shown profound success in generating photo-realistic images. In contrast, standard HVAEs often produce images of inferior quality compared to diffusion models. In this paper, we hypothesize that the success of diffusion models can be partly attributed to the additional self-supervision information for their intermediate latent states provided by corrupted images, which along with the original image form a pseudo video. Based on this hypothesis, we explore the possibility of improving other types of generative models with such pseudo videos. Specifically, we first extend a given image generative model to their video generative model counterpart, and then train the video generative model on pseudo videos constructed by applying data augmentation to the original images. Furthermore, we analyze the potential issues of first-order Markov data augmentation methods, which are typically used in diffusion models, and propose to use more expressive data augmentation to construct more useful information in pseudo videos. Our empirical results on the CIFAR10 and CelebA datasets demonstrate that improved image generation quality can be achieved with additional self-supervised information from pseudo videos.


GFlowNet Fine-tuning for Diverse Correct Solutions in Mathematical Reasoning Tasks

arXiv.org Artificial Intelligence

Mathematical reasoning problems are among the most challenging, as they typically require an understanding of fundamental laws to solve. The laws are universal, but the derivation of the final answer changes depending on how a problem is approached. When training large language models (LLMs), learning the capability of generating such multiple solutions is essential to accelerate their use in mathematical education. To this end, we train LLMs using generative flow network (GFlowNet). Different from reward-maximizing reinforcement learning (RL), GFlowNet fine-tuning seeks to find diverse solutions by training the LLM whose distribution is proportional to a reward function. In numerical experiments, we evaluate GFlowNet fine-tuning and reward-maximizing RL in terms of accuracy and diversity. The results show that GFlowNet fine-tuning derives correct final answers from diverse intermediate reasoning steps, indicating the improvement of the capability of alternative solution generation.


A Survey of Large Language Models for Arabic Language and its Dialects

arXiv.org Artificial Intelligence

This survey offers a comprehensive overview of Large Language Models (LLMs) designed for Arabic language and its dialects. It covers key architectures, including encoder-only, decoder-only, and encoder-decoder models, along with the datasets used for pre-training, spanning Classical Arabic, Modern Standard Arabic, and Dialectal Arabic. The study also explores monolingual, bilingual, and multilingual LLMs, analyzing their architectures and performance across downstream tasks, such as sentiment analysis, named entity recognition, and question answering. Furthermore, it assesses the openness of Arabic LLMs based on factors, such as source code availability, training data, model weights, and documentation. The survey highlights the need for more diverse dialectal datasets and attributes the importance of openness for research reproducibility and transparency. It concludes by identifying key challenges and opportunities for future research and stressing the need for more inclusive and representative models.


NeuroPath: A Neural Pathway Transformer for Joining the Dots of Human Connectomes

arXiv.org Artificial Intelligence

Although modern imaging technologies allow us to study connectivity between two distinct brain regions in-vivo, an in-depth understanding of how anatomical structure supports brain function and how spontaneous functional fluctuations emerge remarkable cognition is still elusive. Meanwhile, tremendous efforts have been made in the realm of machine learning to establish the nonlinear mapping between neuroimaging data and phenotypic traits. However, the absence of neuroscience insight in the current approaches poses significant challenges in understanding cognitive behavior from transient neural activities. To address this challenge, we put the spotlight on the coupling mechanism of structural connectivity (SC) and functional connectivity (FC) by formulating such network neuroscience question into an expressive graph representation learning problem for high-order topology. Specifically, we introduce the concept of topological detour to characterize how a ubiquitous instance of FC (direct link) is supported by neural pathways (detour) physically wired by SC, which forms a cyclic loop interacted by brain structure and function. In the cliché of machine learning, the multi-hop detour pathway underlying SC-FC coupling allows us to devise a novel multi-head self-attention mechanism within Transformer to capture multi-modal feature representation from paired graphs of SC and FC. Taken together, we propose a biological-inspired deep model, coined as NeuroPath, to find putative connectomic feature representations from the unprecedented amount of neuroimages, which can be plugged into various downstream applications such as task recognition and disease diagnosis. We have evaluated NeuroPath on large-scale public datasets including Human Connectome Project (HCP) and UK Biobank (UKB) under different experiment settings of supervised and zero-shot learning, where the state-of-the-art performance by our NeuroPath indicates great potential in network neuroscience.


Chemical Language Model Linker: blending text and molecules with modular adapters

arXiv.org Artificial Intelligence

The development of large language models and multi-modal models has enabled the appealing idea of generating novel molecules from text descriptions. Generative modeling would shift the paradigm from relying on large-scale chemical screening to find molecules with desired properties to directly generating those molecules. However, multi-modal models combining text and molecules are often trained from scratch, without leveraging existing high-quality pretrained models. That approach consumes more computational resources and prohibits model scaling. In contrast, we propose a lightweight adapter-based strategy named Chemical Language Model Linker (ChemLML). ChemLML blends the two single domain models and obtains conditional molecular generation from text descriptions while still operating in the specialized embedding spaces of the molecular domain. ChemLML can tailor diverse pretrained text models for molecule generation by training relatively few adapter parameters. We find that the choice of molecular representation used within ChemLML, SMILES versus SELFIES, has a strong influence on conditional molecular generation performance. SMILES is often preferable despite not guaranteeing valid molecules. We raise issues in using the large PubChem dataset of molecules and their associated descriptions for evaluating molecule generation and provide a filtered version of the dataset as a generation test set. To demonstrate how ChemLML could be used in practice, we generate candidate protein inhibitors and use docking to assess their quality.


Grounded GUI Understanding for Vision Based Spatial Intelligent Agent: Exemplified by Virtual Reality Apps

arXiv.org Artificial Intelligence

In recent years, spatial computing Virtual Reality (VR) has emerged as a transformative technology, offering users immersive and interactive experiences across diversified virtual environments. Users can interact with VR apps through interactable GUI elements (IGEs) on the stereoscopic three-dimensional (3D) graphical user interface (GUI). The accurate recognition of these IGEs is instrumental, serving as the foundation of many software engineering tasks, including automated testing and effective GUI search. The most recent IGE detection approaches for 2D mobile apps typically train a supervised object detection model based on a large-scale manually-labeled GUI dataset, usually with a pre-defined set of clickable GUI element categories like buttons and spinners. Such approaches can hardly be applied to IGE detection in VR apps, due to a multitude of challenges including complexities posed by open-vocabulary and heterogeneous IGE categories, intricacies of context-sensitive interactability, and the necessities of precise spatial perception and visual-semantic alignment for accurate IGE detection results. Thus, it is necessary to embark on the IGE research tailored to VR apps. In this paper, we propose the first zero-shot cOntext-sensitive inteRactable GUI ElemeNT dEtection framework for virtual Reality apps, named Orienter. By imitating human behaviors, Orienter observes and understands the semantic contexts of VR app scenes first, before performing the detection. The detection process is iterated within a feedback-directed validation and reflection loop. Specifically, Orienter contains three components, including (1) Semantic context comprehension, (2) Reflection-directed IGE candidate detection, and (3) Context-sensitive interactability classification. Extensive experiments demonstrate that Orienter is more effective than the state-of-the-art GUI element detection approaches.


Tweet round up from #ECAI2024: part 2

AIHub

The 27th European Conference on Artificial Intelligence (ECAI-2024) took place from 19-24 October. Held in Santiago de Compostela, Spain, the event featured a full programme of technical papers, keynote and invited talks, workshops and tutorials, and panels. We took a look at what participants got up to over the second half of the event. AI Regulation: The European Scenario shed light on policy shifts, and The Economic Impact of AI discussed the challenges and opportunities ahead. Huge thanks to everyone who came to my #ecai2024 presentation and made it such a rewarding experience with your insightful questions and discussions!