 masc


MASC: Boosting Autoregressive Image Generation with a Manifold-Aligned Semantic Clustering

He, Lixuan, Zheng, Shikang, Zhang, Linfeng

arXiv.org Artificial Intelligence

Autoregressive (AR) models have shown great promise in image generation, yet they face a fundamental inefficiency stemming from their core component: a vast, unstructured vocabulary of visual tokens. This conventional approach treats tokens as a flat vocabulary, disregarding the intrinsic structure of the token embedding space where proximity often correlates with semantic similarity. This oversight results in a highly complex prediction task, which hinders training efficiency and limits final generation quality. To resolve this, we propose Manifold-Aligned Semantic Clustering (MASC), a principled framework that constructs a hierarchical semantic tree directly from the codebook's intrinsic structure. MASC employs a novel geometry-aware distance metric and a density-driven agglomerative construction to model the underlying manifold of the token embeddings. By transforming the flat, high-dimensional prediction task into a structured, hierarchical one, MASC introduces a beneficial inductive bias that significantly simplifies the learning problem for the AR model. MASC is designed as a plug-and-play module, and our extensive experiments validate its effectiveness: it accelerates training by up to 57% and significantly improves generation quality, reducing the FID of LlamaGen-XL from 2.87 to 2.58. MASC elevates existing AR frameworks to be highly competitive with state-of-the-art methods, establishing that structuring the prediction space is as crucial as architectural innovation for scalable generative modeling.
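The idea of replacing a flat token vocabulary with a hierarchy can be illustrated with a small sketch. The code below is not the MASC construction: the abstract does not specify its geometry-aware metric or density-driven procedure, so plain average-linkage agglomerative clustering with Euclidean distance is swapped in as a stand-in. The function names `agglomerate` and `token_to_path` and the toy codebook are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def agglomerate(codebook: Sequence[Sequence[float]], n_clusters: int) -> List[List[int]]:
    """Greedy average-linkage clustering of codebook rows (stand-in, not MASC's metric)."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    # Start with every token in its own cluster.
    clusters: List[List[int]] = [[i] for i in range(len(codebook))]

    def linkage(c1: List[int], c2: List[int]) -> float:
        # Mean pairwise Euclidean distance between two clusters.
        total = sum(math.dist(codebook[i], codebook[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair and drop the absorbed cluster.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def token_to_path(clusters: List[List[int]]) -> Dict[int, Tuple[int, int]]:
    """Map each token id to (cluster id, position inside the cluster)."""
    return {
        tok: (cid, pos)
        for cid, members in enumerate(clusters)
        for pos, tok in enumerate(members)
    }


if __name__ == "__main__":
    # Toy 2-D "codebook" with two well-separated groups of embeddings.
    toy_codebook = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
    groups = agglomerate(toy_codebook, n_clusters=2)
    print(groups)
    print(token_to_path(groups))
```

Under this sketch, an AR model would first predict the cluster id and then the token within that cluster, so each step chooses among far fewer options than the full vocabulary.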


A signal separation view of classification

Mhaskar, H. N., O'Dowd, Ryan

arXiv.org Machine Learning

The problem of classification in machine learning has often been approached in terms of function approximation. In this paper, we propose an alternative approach for classification in arbitrary compact metric spaces which, in theory, yields both the number of classes, and a perfect classification using a minimal number of queried labels. Our approach uses localized trigonometric polynomial kernels initially developed for the point source signal separation problem in signal processing. Rather than point sources, we argue that the various classes come from different probability distributions. The localized kernel technique developed for separating point sources is then shown to separate the supports of these distributions. This is done in a hierarchical manner in our MASC algorithm to accommodate touching/overlapping class boundaries. We illustrate our theory on several simulated and real life datasets, including the Salinas and Indian Pines hyperspectral datasets and a document dataset.


Open Automatic Speech Recognition Models for Classical and Modern Standard Arabic

Grigoryan, Lilit, Karpov, Nikolay, Albasiri, Enas, Lavrukhin, Vitaly, Ginsburg, Boris

arXiv.org Artificial Intelligence

Despite Arabic being one of the most widely spoken languages, the development of Arabic Automatic Speech Recognition (ASR) systems faces significant challenges due to the language's complexity, and only a limited number of public Arabic ASR models exist. While much of the focus has been on Modern Standard Arabic (MSA), there is considerably less attention given to the variations within the language. This paper introduces a universal methodology for Arabic speech and text processing designed to address unique challenges of the language. Using this methodology, we train two novel models based on the FastConformer architecture: one designed specifically for MSA and the other, the first unified public model for both MSA and Classical Arabic (CA). The MSA model sets a new benchmark with state-of-the-art (SOTA) performance on related datasets, while the unified model achieves SOTA accuracy with diacritics for CA while maintaining strong performance for MSA. To promote reproducibility, we open-source the models and their training recipes.


Dynamic Residual Safe Reinforcement Learning for Multi-Agent Safety-Critical Scenarios Decision-Making

Wang, Kaifeng, Chen, Yinsong, Liu, Qi, Li, Xueyuan, Gao, Xin

arXiv.org Artificial Intelligence

Their interactions are characterized by significant dynamism and heterogeneity. To address these challenges, we propose a MADCZ modeling approach. By constructing dynamic topological structures and spatiotemporal conflict zones, the model attains precise conflict identification and delivers interpretable decision support. First, a joint state space is established, defined as

$$S = S_{AVs} \times S_{BVs} \times S_{Peds} \times S_{Road}, \tag{2}$$

where $S_{AVs}$, $S_{BVs}$, $S_{Peds}$ and $S_{Road}$ represent the state subspaces of AVs, BVs, Peds, and road network, respectively. Each subspace is specifically defined as

$$\begin{aligned} S_{Vehs} &= [x, y, \theta, v, l, c, p] \in \mathbb{R}^{22}, \\ S_{Peds} &= [x, y, \theta, v, l, c] \in \mathbb{R}^{10}, \\ S_{Road} &= \left\{ G(V, E) \mid V \in \mathbb{R}^{n \times 22},\ E \in \{0, 1\}^{n \times n} \right\}, \end{aligned} \tag{3}$$

where $x$ and $y$ denote the horizontal and vertical coordinates of the traffic participants, $\theta \in [0, 360)$ is the heading angle, $v$ represents the longitudinal velocity, and $l$ and $c$ represent the lane position and traffic participant type, respectively, each encoded as a three-dimensional one-hot vector. $G$ represents the road network topology, where each traffic participant is modeled as a node $v_i \in V$, and $E$ represents the connections among participants, representing sensor perception or vehicle-to-vehicle (V2V) communication relationships. Additionally, for vehicles, $p$ denotes the relative motion information with respect to surrounding vehicles, defined as

$$p = [d_j, v_j], \quad j = \{f, r, lf, lr, rf, rr\}, \tag{4}$$

where $d_j$ and $v_j$ denote the relative longitudinal distance and the relative velocity between vehicles, and $f$, $r$, $lf$, $lr$, $rf$, $rr$ represent the neighboring vehicles at the front, rear, left front, left rear, right front, and right rear, respectively. If no neighboring vehicle is detected in a given direction, the relative longitudinal distance is assigned the maximum perception range and the relative velocity is set to zero.
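The dimensions quoted above can be checked with a short sketch: four scalars plus two three-dimensional one-hot vectors give 10 entries, and six neighbor slots with a distance and a velocity each add 12 more, for 22 in total. The helper names, the value of `MAX_RANGE`, and the interleaved ordering of the neighbor entries are assumptions made for illustration; the excerpt does not specify them.

```python
from typing import Dict, List, Tuple

# Neighbor slots named in the excerpt: front, rear, left/right front and rear.
NEIGHBOR_SLOTS = ("f", "r", "lf", "lr", "rf", "rr")
MAX_RANGE = 100.0  # hypothetical maximum perception range; not given in the excerpt


def one_hot(index: int, size: int = 3) -> List[float]:
    """Three-dimensional one-hot encoding used for lane position and participant type."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} out of range for one-hot of size {size}")
    return [1.0 if i == index else 0.0 for i in range(size)]


def participant_state(x: float, y: float, theta: float, v: float,
                      lane: int, kind: int) -> List[float]:
    """10-dimensional state [x, y, theta, v, l, c], as described for pedestrians."""
    # Wrap the heading angle into [0, 360).
    return [x, y, theta % 360.0, v] + one_hot(lane) + one_hot(kind)


def vehicle_state(x: float, y: float, theta: float, v: float, lane: int, kind: int,
                  neighbors: Dict[str, Tuple[float, float]]) -> List[float]:
    """22-dimensional state [x, y, theta, v, l, c, p] for a vehicle."""
    state = participant_state(x, y, theta, v, lane, kind)
    for slot in NEIGHBOR_SLOTS:
        # Missing neighbor: distance defaults to the perception range, velocity to zero.
        d_j, v_j = neighbors.get(slot, (MAX_RANGE, 0.0))
        state += [d_j, v_j]
    return state


if __name__ == "__main__":
    s = vehicle_state(10.0, 2.5, 370.0, 8.0, lane=1, kind=0,
                      neighbors={"f": (12.0, -1.5)})
    print(len(s), s)
```

Stacking one such 22-dimensional row per participant would produce the node matrix V described for the road graph, with pedestrian rows needing some padding convention that the excerpt does not state.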


PTA: Enhancing Multimodal Sentiment Analysis through Pipelined Prediction and Translation-based Alignment

Song, Shezheng, Li, Shasha, Zhao, Shan, Wang, Chengyu, Li, Xiaopeng, Yu, Jie, Wan, Qian, Ma, Jun, Yan, Tianwei, Ma, Wentao, Mao, Xiaoguang

arXiv.org Artificial Intelligence

Multimodal aspect-based sentiment analysis (MABSA) aims to understand opinions in a granular manner, advancing human-computer interaction and other fields. Traditionally, MABSA methods use a joint prediction approach to identify aspects and sentiments simultaneously. However, we argue that joint models are not always superior. Our analysis shows that joint models struggle to align relevant text tokens with image patches, leading to misalignment and ineffective image utilization. In contrast, a pipeline framework first identifies aspects through MATE (Multimodal Aspect Term Extraction) and then aligns these aspects with image patches for sentiment classification (MASC: Multimodal Aspect-Oriented Sentiment Classification). This method is better suited for multimodal scenarios where effective image use is crucial. We present three key observations: (a) MATE and MASC have different feature requirements, with MATE focusing on token-level features and MASC on sequence-level features; (b) the aspect identified by MATE is crucial for effective image utilization; and (c) images play a trivial role in previous MABSA methods due to high noise. Based on these observations, we propose a pipeline framework that first predicts the aspect and then uses translation-based alignment (TBA) to enhance multimodal semantic consistency for better image utilization. Our method achieves state-of-the-art (SOTA) performance on widely used MABSA datasets Twitter-15 and Twitter-17. This demonstrates the effectiveness of the pipeline approach and its potential to provide valuable insights for future MABSA research. For reproducibility, the code and checkpoint will be released.


Morphology Without Borders: Clause-Level Morphology

Goldman, Omer, Tsarfaty, Reut

arXiv.org Artificial Intelligence

Morphological tasks use large multi-lingual datasets that organize words into inflection tables, which then serve as training and evaluation data for various tasks. However, a closer inspection of these data reveals profound cross-linguistic inconsistencies that arise from the lack of a clear linguistic and operational definition of what a word is, and that severely impair the universality of the derived tasks. To overcome this deficiency, we propose to view morphology as a clause-level phenomenon rather than a word-level one. It is anchored in a fixed yet inclusive set of features that encapsulates all functions realized in a saturated clause. We deliver MightyMorph, a novel dataset for clause-level morphology covering 4 typologically different languages: English, German, Turkish and Hebrew. We use this dataset to derive 3 clause-level morphological tasks: inflection, reinflection and analysis. Our experiments show that the clause-level tasks are substantially harder than the respective word-level tasks, while having comparable complexity across languages. Furthermore, redefining morphology at the clause level provides a neat interface with contextualized language models (LMs) and allows assessing the morphological knowledge encoded in these models and their usability for morphological tasks. Taken together, this work opens up new horizons in the study of computational morphology, leaving ample space for studying neural morphology cross-linguistically.


Machine-Assisted Script Curation

Ciosici, Manuel R., Cummings, Joseph, DeHaven, Mitchell, Hedges, Alex, Kankanampati, Yash, Lee, Dong-Ho, Weischedel, Ralph, Freedman, Marjorie

arXiv.org Artificial Intelligence

We describe Machine-Aided Script Curator (MASC), a system for human-machine collaborative script authoring. Scripts produced with MASC include (1) English descriptions of sub-events that comprise a larger, complex event; (2) event types for each of those events; (3) a record of entities expected to participate in multiple sub-events; and (4) temporal sequencing between the sub-events. MASC automates portions of the script creation process with suggestions for event types, links to Wikidata, and sub-events that may have been forgotten. We illustrate how these automations are useful to the script writer with a few case-study scripts.
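The four script components listed above map naturally onto a small data structure. The sketch below is only an illustration of that shape, not the system's actual schema: the class names, field names, and the example event types are hypothetical, and temporal sequencing is modeled as simple "before" pairs resolved with a topological sort.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SubEvent:
    description: str                 # (1) English description of the sub-event
    event_type: str                  # (2) event type label (placeholder strings here)
    participants: List[str] = field(default_factory=list)


@dataclass
class Script:
    name: str
    sub_events: Dict[str, SubEvent] = field(default_factory=dict)
    before: List[Tuple[str, str]] = field(default_factory=list)  # (4) (earlier, later) pairs

    def shared_entities(self) -> List[str]:
        """(3) Entities that take part in more than one sub-event."""
        counts: Dict[str, int] = {}
        for event in self.sub_events.values():
            for entity in set(event.participants):
                counts[entity] = counts.get(entity, 0) + 1
        return sorted(e for e, n in counts.items() if n > 1)

    def ordered(self) -> List[str]:
        """Sub-event ids in an order consistent with the temporal constraints."""
        indegree = {key: 0 for key in self.sub_events}
        for _, later in self.before:
            indegree[later] += 1
        ready = sorted(k for k, d in indegree.items() if d == 0)
        order: List[str] = []
        while ready:
            current = ready.pop(0)
            order.append(current)
            for earlier, later in self.before:
                if earlier == current:
                    indegree[later] -= 1
                    if indegree[later] == 0:
                        ready.append(later)
        if len(order) != len(self.sub_events):
            raise ValueError("temporal constraints contain a cycle")
        return order


if __name__ == "__main__":
    dinner = Script(
        name="prepare dinner",
        sub_events={
            "buy": SubEvent("The chef buys ingredients.", "purchase", ["chef", "ingredients"]),
            "cook": SubEvent("The chef cooks the meal.", "cooking", ["chef", "ingredients"]),
            "eat": SubEvent("The guest eats the meal.", "eating", ["guest"]),
        },
        before=[("buy", "cook"), ("cook", "eat")],
    )
    print(dinner.ordered())
    print(dinner.shared_entities())
```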