South America
Pushing the Boundaries of State Space Models for Image and Video Generation
Hong, Yicong, Mai, Long, Yao, Yuan, Liu, Feng
While Transformers have become the dominant architecture for visual generation, linear attention models, such as the state-space models (SSM), are increasingly recognized for their efficiency in processing long visual sequences. However, the essential efficiency of these models comes from formulating a limited recurrent state, enforcing causality among tokens that are prone to inconsistent modeling of N-dimensional visual data, leaving questions on their capacity to generate long non-causal sequences. In this paper, we explore the boundary of SSM on image and video generation by building the largest-scale diffusion SSM-Transformer hybrid model to date (5B parameters) based on the sub-quadratic bi-directional Hydra and self-attention, and generate up to 2K images and 360p 8 seconds (16 FPS) videos. Our results demonstrate that the model can produce faithful results aligned with complex text prompts and temporal consistent videos with high dynamics, suggesting the great potential of using SSMs for visual generation tasks.
BrainOOD: Out-of-distribution Generalizable Brain Network Analysis
Xu, Jiaxing, Chen, Yongqiang, Dong, Xia, Lan, Mengcheng, Huang, Tiancheng, Bian, Qingtian, Cheng, James, Ke, Yiping
In neuroscience, identifying distinct patterns linked to neurological disorders, such as Alzheimer's and Autism, is critical for early diagnosis and effective intervention. Graph Neural Networks (GNNs) have shown promising in analyzing brain networks, but there are two major challenges in using GNNs: (1) distribution shifts in multi-site brain network data, leading to poor Out-of-Distribution (OOD) generalization, and (2) limited interpretability in identifying key brain regions critical to neurological disorders. To bridge these gaps, we introduce BrainOOD, a novel framework tailored for brain networks that enhances GNNs' OOD generalization and interpretability. BrainOOD framework consists of a feature selector and a structure extractor, which incorporates various auxiliary losses including an improved Graph Information Bottleneck (GIB) objective to recover causal subgraphs. By aligning structure selection across brain networks and filtering noisy features, BrainOOD offers reliable interpretations of critical brain regions. Our approach outperforms 16 existing methods and improves generalization to OOD subjects by up to 8.5%. Case studies highlight the scientific validity of the patterns extracted, which aligns with the findings in known neuroscience literature. We also propose the first OOD brain network benchmark, which provides a foundation for future research in this field. In neuroscience, a major goal is to identify distinct patterns linked to neurological disorders, such as Alzheimer's and Autism, by examining brain data of both healthy individuals and patients with these disorders (Poldrack et al., 2009). Among the neuroimaging techniques, resting-state functional magnetic resonance imaging (fMRI) is widely used to capture the functional connectivity between different brain regions (Worsley et al., 2002). These connections provide insights into how different brain regions co-activate or show correlated activities, offering a framework to study neurological systems through graphbased methods (Kawahara et al., 2017; Lanciano et al., 2020; Wang et al., 2023; Xu et al., 2024c). The most prevalent brain network analysis model is based on Graph Neural Networks (GNNs), which have recently shown promising results (Li et al., 2019; 2021; Xu et al., 2024a).
Secure & Personalized Music-to-Video Generation via CHARCHA
Agarwal, Mehul, Agarwal, Gauri, Benoit, Santiago, Lippman, Andrew, Oh, Jean
Music is a deeply personal experience and our aim is to enhance this with a fullyautomated pipeline for personalized music video generation. Our work allows listeners to not just be consumers but co-creators in the music video generation process by creating personalized, consistent and context-driven visuals based on lyrics, rhythm and emotion in the music. The pipeline combines multimodal translation and generation techniques and utilizes low-rank adaptation on listeners' images to create immersive music videos that reflect both the music and the individual. To ensure the ethical use of users' identity, we also introduce CHARCHA, a facial identity verification protocol that protects people against unauthorized use of their face while at the same time collecting authorized images from users for personalizing their videos. This paper thus provides a secure and innovative framework for creating deeply personalized music videos. Figure 1: Image stills and lyrics from generated music videos for Rick Astley's "Never Gonna Give You Up," with character reference from CHARCHA. The videos use Queratogray Sketch[1], Western Animation Diffusion[2], and Realistic Vision V5.1[3] checkpoint models .
Predicting potentially unfair clauses in Chilean terms of services with natural language processing
Loeffler, Christoffer, Freile, Andrea Martínez, Pizarro, Tomás Rey
This study addresses the growing concern of information asymmetry in consumer contracts, exacerbated by the proliferation of online services with complex Terms of Service that are rarely even read. Even though research on automatic analysis methods is conducted, the problem is aggravated by the general focus on English-language Machine Learning approaches and on major jurisdictions, such as the European Union. We introduce a new methodology and a substantial dataset addressing this gap. We propose a novel annotation scheme with four categories and a total of 20 classes, and apply it on 50 online Terms of Service used in Chile. Our evaluation of transformer-based models highlights how factors like language- and/or domain-specific pre-training, few-shot sample size, and model architecture affect the detection and classification of potentially abusive clauses. Results show a large variability in performance for the different tasks and models, with the highest macro-F1 scores for the detection task ranging from 79% to 89% and micro-F1 scores up to 96%, while macro-F1 scores for the classification task range from 60% to 70% and micro-F1 scores from 64% to 80%. Notably, this is the first Spanish-language multi-label classification dataset for legal clauses, applying Chilean law and offering a comprehensive evaluation of Spanish-language models in the legal domain. Our work lays the ground for future research in method development for rarely considered legal analysis and potentially leads to practical applications to support consumers in Chile and Latin America as a whole.
Fruit Fly Classification (Diptera: Tephritidae) in Images, Applying Transfer Learning
Flores, Erick Andrew Bustamante, Olivera, Harley Vera, Valencia, Ivan Cesar Medrano, Cubas, Carlos Fernando Montoya
This study develops a transfer learning model for the automated classification of two species of fruit flies, Anastrepha fraterculus and Ceratitis capitata, in a controlled laboratory environment. The research addresses the need to optimize identification and classification, which are currently performed manually by experts, being affected by human factors and facing time challenges. The methodological process of this study includes the capture of high-quality images using a mobile phone camera and a stereo microscope, followed by segmentation to reduce size and focus on relevant morphological areas. The images were carefully labeled and preprocessed to ensure the quality and consistency of the dataset used to train the pre-trained convolutional neural network models VGG16, VGG19, and Inception-v3. The results were evaluated using the F1-score, achieving 82% for VGG16 and VGG19, while Inception-v3 reached an F1-score of 93%. Inception-v3's reliability was verified through model testing in uncontrolled environments, with positive results, complemented by the Grad-CAM technique, demonstrating its ability to capture essential morphological features. These findings indicate that Inception-v3 is an effective and replicable approach for classifying Anastrepha fraterculus and Ceratitis capitata, with potential for implementation in automated monitoring systems.
Disentangling Length Bias In Preference Learning Via Response-Conditioned Modeling
Cai, Jianfeng, Zhu, Jinhua, Sun, Ruopei, Wang, Yue, Li, Li, Zhou, Wengang, Li, Houqiang
Reinforcement Learning from Human Feedback (RLHF) has achieved considerable success in aligning large language models (LLMs) by modeling human preferences with a learnable reward model and employing a reinforcement learning algorithm to maximize the reward model's scores. However, these reward models are susceptible to exploitation through various superficial confounding factors, with length bias emerging as a particularly significant concern. Moreover, while the pronounced impact of length bias on preference modeling suggests that LLMs possess an inherent sensitivity to length perception, our preliminary investigations reveal that fine-tuned LLMs consistently struggle to adhere to explicit length instructions. To address these two limitations, we propose a novel framework wherein the reward model explicitly differentiates between human semantic preferences and response length requirements. Specifically, we introduce a Response-conditioned Bradley-Terry (Rc-BT) model that enhances the reward model's capability in length bias mitigating and length instruction following, through training on our augmented dataset. Furthermore, we propose the Rc-DPO algorithm to leverage the Rc-BT model for direct policy optimization (DPO) of LLMs, simultaneously mitigating length bias and promoting adherence to length instructions. Extensive evaluations demonstrate that our approach substantially improves both preference modeling and length instruction compliance, with its effectiveness validated across various foundational models and preference datasets.
RichSpace: Enriching Text-to-Video Prompt Space via Text Embedding Interpolation
Cao, Yuefan, Gong, Chengyue, Li, Xiaoyu, Liang, Yingyu, Sha, Zhizhou, Shi, Zhenmei, Song, Zhao
Text-to-video generation models have made impressive progress, but they still struggle with generating videos with complex features. This limitation often arises from the inability of the text encoder to produce accurate embeddings, which hinders the video generation model. In this work, we propose a novel approach to overcome this challenge by selecting the optimal text embedding through interpolation in the embedding space. We demonstrate that this method enables the video generation model to produce the desired videos. Additionally, we introduce a simple algorithm using perpendicular foot embeddings and cosine similarity to identify the optimal interpolation embedding. Our findings highlight the importance of accurate text embeddings and offer a pathway for improving text-to-video generation performance.
Compositional Concept-Based Neuron-Level Interpretability for Deep Reinforcement Learning
Jiang, Zeyu, Huang, Hai, Zuo, Xingquan
Deep reinforcement learning (DRL), through learning policies or values represented by neural networks, has successfully addressed many complex control problems. However, the neural networks introduced by DRL lack interpretability and transparency. Current DRL interpretability methods largely treat neural networks as black boxes, with few approaches delving into the internal mechanisms of policy/value networks. This limitation undermines trust in both the neural network models that represent policies and the explanations derived from them. In this work, we propose a novel concept-based interpretability method that provides fine-grained explanations of DRL models at the neuron level. Our method formalizes atomic concepts as binary functions over the state space and constructs complex concepts through logical operations. By analyzing the correspondence between neuron activations and concept functions, we establish interpretable explanations for individual neurons in policy/value networks. Experimental results on both continuous control tasks and discrete decision-making environments demonstrate that our method can effectively identify meaningful concepts that align with human understanding while faithfully reflecting the network's decision-making logic.
Explainability in Practice: A Survey of Explainable NLP Across Various Domains
Mohammadi, Hadi, Bagheri, Ayoub, Giachanou, Anastasia, Oberski, Daniel L.
Natural Language Processing (NLP) has become a cornerstone in many critical sectors, including healthcare, finance, and customer relationship management. This is especially true with the development and use of advanced models such as GPT-based architectures and BERT, which are widely used in decision-making processes. However, the black-box nature of these advanced NLP models has created an urgent need for transparency and explainability. This review explores explainable NLP (XNLP) with a focus on its practical deployment and real-world applications, examining its implementation and the challenges faced in domain-specific contexts. The paper underscores the importance of explainability in NLP and provides a comprehensive perspective on how XNLP can be designed to meet the unique demands of various sectors, from healthcare's need for clear insights to finance's emphasis on fraud detection and risk assessment. Additionally, this review aims to bridge the knowledge gap in XNLP literature by offering a domain-specific exploration and discussing underrepresented areas such as real-world applicability, metric evaluation, and the role of human interaction in model assessment. The paper concludes by suggesting future research directions that could enhance the understanding and broader application of XNLP.
Universal Abstraction: Harnessing Frontier Models to Structure Real-World Data at Scale
Wong, Cliff, Preston, Sam, Liu, Qianchu, Gero, Zelalem, Bagga, Jass, Zhang, Sheng, Jain, Shrey, Zhao, Theodore, Gu, Yu, Xu, Yanbo, Kiblawi, Sid, Weerasinghe, Roshanthi, Leidner, Rom, Young, Kristina, Piening, Brian, Bifulco, Carlo, Naumann, Tristan, Wei, Mu, Poon, Hoifung
The vast majority of real-world patient information resides in unstructured clinical text, and the process of medical abstraction seeks to extract and normalize structured information from this unstructured input. However, traditional medical abstraction methods can require significant manual efforts that can include crafting rules or annotating training labels, limiting scalability. In this paper, we propose UniMedAbstractor (UMA), a zero-shot medical abstraction framework leveraging Large Language Models (LLMs) through a modular and customizable prompt template. We refer to our approach as universal abstraction as it can quickly scale to new attributes through its universal prompt template without curating attribute-specific training labels or rules. We evaluate UMA for oncology applications, focusing on fifteen key attributes representing the cancer patient journey, from short-context attributes (e.g., performance status, treatment) to complex long-context attributes requiring longitudinal reasoning (e.g., tumor site, histology, TNM staging). Experiments on real-world data show UMA's strong performance and generalizability. Compared to supervised and heuristic baselines, UMA with GPT-4o achieves on average an absolute 2-point F1/accuracy improvement for both short-context and long-context attribute abstraction. For pathologic T staging, UMA even outperforms the supervised model by 20 points in accuracy.