mantis
Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
Yang, Yi, Li, Xueqi, Chen, Yiyang, Song, Jin, Wang, Yihan, Xiao, Zipeng, Su, Jiadi, Qiaoben, You, Liu, Pengfei, Deng, Zhijie
Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervisions. However, letting VLA directly predict high-dimensional visual states can distribute model capacity and incur prohibitive training cost, while compressing visual states into more compact supervisory signals inevitably incurs information bottlenecks. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities due to the neglect of language supervision. This paper introduces Mantis, a novel framework featuring a Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone with the combination of meta queries and a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state prediction objective enables the meta queries to automatically capture the latent actions that delineate the visual trajectory, and hence boost the learning of explicit actions. The disentanglement reduces the burden of the VLA backbone, enabling it to maintain comprehension and reasoning capabilities through language supervision. Empirically, pretrained on human manipulation videos, robot demonstrations, and image-text pairs, Mantis achieves a 96.7% success rate on LIBERO benchmark after fine-tuning, surpassing powerful baselines while exhibiting high convergence speed. Real-world evaluations show that Mantis outperforms $π_{0.5}$, a leading open-source VLA model, particularly in instruction-following capability, generalization to unseen instructions, and reasoning ability. Code and weights are released to support the open-source community.
Leveraging Generic Time Series Foundation Models for EEG Classification
Gnassounou, Théo, Moakher, Yessin, Xie, Shifeng, Feofanov, Vasilii, Redko, Ievgen
Foundation models for time series are emerging as powerful general-purpose backbones, yet their potential for domain-specific biomedical signals such as electroencephalography (EEG) remains rather unexplored. In this work, we investigate the applicability a recently proposed time series classification foundation model, to a different EEG tasks such as motor imagery classification and sleep stage prediction. We test two pretraining regimes: (a) pretraining on heterogeneous real-world time series from multiple domains, and (b) pretraining on purely synthetic data. We find that both variants yield strong performance, consistently outperforming EEGNet, a widely used convolutional baseline, and CBraMod, the most recent EEG-specific foundation model. These results suggest that generalist time series foundation models, even when pretrained on data of non-neural origin or on synthetic signals, can transfer effectively to EEG. Our findings highlight the promise of leveraging cross-domain pretrained models for brain signal analysis, suggesting that EEG may benefit from advances in the broader time series literature.
- Asia > Middle East > Israel (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Health & Medicine > Health Care Technology (1.00)
- Health & Medicine > Therapeutic Area > Neurology (0.89)
ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents
Li, Zechen, Chen, Baiyu, Xue, Hao, Salim, Flora D.
Motion sensor time-series are central to human activity recognition (HAR), with applications in health, sports, and smart devices. However, existing methods are trained for fixed activity sets and require costly retraining when new behaviours or sensor setups appear. Recent attempts to use large language models (LLMs) for HAR, typically by converting signals into text or images, suffer from limited accuracy and lack verifiable interpretability. We propose ZARA, the first agent-based framework for zero-shot, explainable HAR directly from raw motion time-series. ZARA integrates an automatically derived pair-wise feature knowledge base that captures discriminative statistics for every activity pair, a multi-sensor retrieval module that surfaces relevant evidence, and a hierarchical agent pipeline that guides the LLM to iteratively select features, draw on this evidence, and produce both activity predictions and natural-language explanations. ZARA enables flexible and interpretable HAR without any fine-tuning or task-specific classifiers. Extensive experiments on 8 HAR benchmarks show that ZARA achieves SOTA zero-shot performance, delivering clear reasoning while exceeding the strongest baselines by 2.53x in macro F1. Ablation studies further confirm the necessity of each module, marking ZARA as a promising step toward trustworthy, plug-and-play motion time-series analysis. Our codes are available at https://github.com/zechenli03/ZARA.
- Oceania > Australia > New South Wales (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Asia > Singapore (0.04)
- Asia > Indonesia > Bali (0.04)
CauKer: classification time series foundation models can be pretrained on synthetic data only
Xie, Shifeng, Feofanov, Vasilii, Alonso, Marius, Odonnat, Ambroise, Zhang, Jianfeng, Palpanas, Themis, Redko, Ievgen
Time series foundation models (TSFMs) have recently gained significant attention due to their strong zero-shot capabilities and widespread real-world applications. Such models typically require a computationally costly pretraining on large-scale, carefully curated collections of real-world sequences. To allow for a sample-efficient pretraining of TSFMs, we propose CauKer, a novel algorithm designed to generate diverse, causally coherent synthetic time series with realistic trends, seasonality, and nonlinear interactions. CauKer combines Gaussian Process (GP) kernel composition with Structural Causal Models (SCM) to produce data for sample-efficient pretraining of state-of-the-art classification TSFMs having different architectures and following different pretraining approaches. Additionally, our experiments reveal that CauKer-generated datasets exhibit clear scaling laws for both dataset size (10K to 10M samples) and model capacity (1M to 783M parameters), unlike real-world datasets, which display irregular scaling behavior.
- Information Technology > Data Science > Data Mining (0.94)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.88)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Time Series Representations for Classification Lie Hidden in Pretrained Vision Transformers
Roschmann, Simon, Bouniot, Quentin, Feofanov, Vasilii, Redko, Ievgen, Akata, Zeynep
Time series classification is a fundamental task in healthcare and industry, yet the development of time series foundation models (TSFMs) remains limited by the scarcity of publicly available time series datasets. In this work, we propose Time Vision Transformer (TiViT), a framework that converts time series into images to leverage the representational power of frozen Vision Transformers (ViTs) pretrained on large-scale image datasets. First, we theoretically motivate our approach by analyzing the 2D patching of ViTs for time series, showing that it can increase the number of label-relevant tokens and reduce the sample complexity. Second, we empirically demonstrate that TiViT achieves state-of-the-art performance on standard time series classification benchmarks by utilizing the hidden representations of large OpenCLIP models. We explore the structure of TiViT representations and find that intermediate layers with high intrinsic dimension are the most effective for time series classification. Finally, we assess the alignment between TiViT and TSFM representation spaces and identify a strong complementarity, with further performance gains achieved by combining their features. Our findings reveal a new direction for reusing vision representations in a non-visual domain. Code is available at https://github.com/ExplainableML/TiViT.
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Mantis: Lightweight Calibrated Foundation Model for User-Friendly Time Series Classification
Feofanov, Vasilii, Wen, Songkang, Alonso, Marius, Ilbert, Romain, Guo, Hongbo, Tiomoko, Malik, Pan, Lujia, Zhang, Jianfeng, Redko, Ievgen
In recent years, there has been increasing interest in developing foundation models for time series data that can generalize across diverse downstream tasks. While numerous forecasting-oriented foundation models have been introduced, there is a notable scarcity of models tailored for time series classification. To address this gap, we present Mantis, a new open-source foundation model for time series classification based on the Vision Transformer (ViT) architecture that has been pre-trained using a contrastive learning approach. Our experimental results show that Mantis outperforms existing foundation models both when the backbone is frozen and when fine-tuned, while achieving the lowest calibration error. In addition, we propose several adapters to handle the multivariate setting, reducing memory requirements and modeling channel interdependence.
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Data Science > Data Mining (0.94)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Redefining in Dictionary: Towards an Enhanced Semantic Understanding of Creative Generation
Feng, Fu, Xie, Yucheng, Yang, Xu, Wang, Jing, Geng, Xin
Given the challenge atively generated using
- Research Report > Promising Solution (0.46)
- Research Report > New Finding (0.46)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Hacking Back the AI-Hacker: Prompt Injection as a Defense Against LLM-driven Cyberattacks
Pasquini, Dario, Kornaropoulos, Evgenios M., Ateniese, Giuseppe
Large language models (LLMs) are increasingly being harnessed to automate cyberattacks, making sophisticated exploits more accessible and scalable. In response, we propose a new defense strategy tailored to counter LLM-driven cyberattacks. We introduce Mantis, a defensive framework that exploits LLMs' susceptibility to adversarial inputs to undermine malicious operations. Upon detecting an automated cyberattack, Mantis plants carefully crafted inputs into system responses, leading the attacker's LLM to disrupt their own operations (passive defense) or even compromise the attacker's machine (active defense). By deploying purposefully vulnerable decoy services to attract the attacker and using dynamic prompt injections for the attacker's LLM, Mantis can autonomously hack back the attacker. In our experiments, Mantis consistently achieved over 95% effectiveness against automated LLM-driven attacks. To foster further research and collaboration, Mantis is available as an open-source tool: https://github.com/pasquini-dario/project_mantis
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > Mexico > Gulf of Mexico (0.04)
- Asia (0.04)
- Information Technology > Security & Privacy (1.00)
- Government > Military > Cyberwarfare (1.00)
MANTIS: Interleaved Multi-Image Instruction Tuning
Jiang, Dongfu, He, Xuan, Zeng, Huaye, Wei, Cong, Ku, Max, Liu, Qian, Chen, Wenhu
Large multimodal models (LMMs) have shown great results in single-image vision language tasks. However, their abilities to solve multi-image visual language tasks is yet to be improved. The existing LMMs like OpenFlamingo, Emu2, Idefics gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text data from the web, which is neither efficient nor effective. In this paper, we aim to build strong multi-image LMMs via instruction tuning with academic-level resources. Therefore, we meticulously construct Mantis-Instruct containing 721K multi-image instruction data to train a family of models Mantis. The instruction tuning empowers Mantis with different multi-image skills like co-reference, comparison, reasoning, and temporal understanding. We evaluate Mantis on five multi-image benchmarks and seven single-image benchmarks. Mantis-SigLIP can achieve SoTA results on all the multi-image benchmarks and beat the strongest multi-image baseline, Idefics2-8B by an average of 11 absolute points. Notably, Idefics2-8B was pre-trained on 140M interleaved multi-image data, which is 200x larger than Mantis-Instruct. We observe that Mantis performs equivalently well on the held-in and held-out benchmarks, which shows its generalization ability. Notably, we found that Mantis can even match the performance of GPT-4V on multi-image benchmarks. We further evaluate Mantis on single-image benchmarks and demonstrate that Mantis also maintains a strong single-image performance on par with CogVLM and Emu2. Our results show that multi-image abilities are not necessarily gained through massive pre-training, instead, it can be gained by the low-cost instruction tuning. Our work provides new perspectives on how to improve LMMs' multi-image abilities.
- Europe > Monaco (0.04)
- Asia > Middle East > Jordan (0.04)
MANTIS at #SMM4H 2023: Leveraging Hybrid and Ensemble Models for Detection of Social Anxiety Disorder on Reddit
Zanwar, Sourabh, Wiechmann, Daniel, Qiao, Yu, Kerz, Elma
This paper presents our system employed for the Social Media Mining for Health 2023 Shared Task 4: Binary classification of English Reddit posts self-reporting a social anxiety disorder diagnosis. We systematically investigate and contrast the efficacy of hybrid and ensemble models that harness specialized medical domain-adapted transformers in conjunction with BiLSTM neural networks. The evaluation results outline that our best performing model obtained 89.31% F1 on the validation set and 83.76% F1 on the test set.
- North America > United States (0.05)
- Europe > Netherlands > North Holland > Amsterdam (0.05)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.05)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.49)