AITopics | mantis

Collaborating Authors

mantis

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight

Yang, Yi, Li, Xueqi, Chen, Yiyang, Song, Jin, Wang, Yihan, Xiao, Zipeng, Su, Jiadi, Qiaoben, You, Liu, Pengfei, Deng, Zhijie

arXiv.org Artificial IntelligenceNov-21-2025

Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervisions. However, letting VLA directly predict high-dimensional visual states can distribute model capacity and incur prohibitive training cost, while compressing visual states into more compact supervisory signals inevitably incurs information bottlenecks. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities due to the neglect of language supervision. This paper introduces Mantis, a novel framework featuring a Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone with the combination of meta queries and a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state prediction objective enables the meta queries to automatically capture the latent actions that delineate the visual trajectory, and hence boost the learning of explicit actions. The disentanglement reduces the burden of the VLA backbone, enabling it to maintain comprehension and reasoning capabilities through language supervision. Empirically, pretrained on human manipulation videos, robot demonstrations, and image-text pairs, Mantis achieves a 96.7% success rate on LIBERO benchmark after fine-tuning, surpassing powerful baselines while exhibiting high convergence speed. Real-world evaluations show that Mantis outperforms $π_{0.5}$, a leading open-source VLA model, particularly in instruction-following capability, generalization to unseen instructions, and reasoning ability. Code and weights are released to support the open-source community.

arxiv preprint arxiv, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2511.16175

Genre: Research Report (0.64)

Industry: Leisure & Entertainment (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(4 more...)

Add feedback

Leveraging Generic Time Series Foundation Models for EEG Classification

Gnassounou, Théo, Moakher, Yessin, Xie, Shifeng, Feofanov, Vasilii, Redko, Ievgen

arXiv.org Artificial IntelligenceNov-3-2025

Foundation models for time series are emerging as powerful general-purpose backbones, yet their potential for domain-specific biomedical signals such as electroencephalography (EEG) remains rather unexplored. In this work, we investigate the applicability a recently proposed time series classification foundation model, to a different EEG tasks such as motor imagery classification and sleep stage prediction. We test two pretraining regimes: (a) pretraining on heterogeneous real-world time series from multiple domains, and (b) pretraining on purely synthetic data. We find that both variants yield strong performance, consistently outperforming EEGNet, a widely used convolutional baseline, and CBraMod, the most recent EEG-specific foundation model. These results suggest that generalist time series foundation models, even when pretrained on data of non-neural origin or on synthetic signals, can transfer effectively to EEG. Our findings highlight the promise of leveraging cross-domain pretrained models for brain signal analysis, suggesting that EEG may benefit from advances in the broader time series literature.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2510.27522

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Health Care Technology (1.00)
Health & Medicine > Therapeutic Area > Neurology (0.89)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents

Li, Zechen, Chen, Baiyu, Xue, Hao, Salim, Flora D.

arXiv.org Artificial IntelligenceAug-7-2025

Motion sensor time-series are central to human activity recognition (HAR), with applications in health, sports, and smart devices. However, existing methods are trained for fixed activity sets and require costly retraining when new behaviours or sensor setups appear. Recent attempts to use large language models (LLMs) for HAR, typically by converting signals into text or images, suffer from limited accuracy and lack verifiable interpretability. We propose ZARA, the first agent-based framework for zero-shot, explainable HAR directly from raw motion time-series. ZARA integrates an automatically derived pair-wise feature knowledge base that captures discriminative statistics for every activity pair, a multi-sensor retrieval module that surfaces relevant evidence, and a hierarchical agent pipeline that guides the LLM to iteratively select features, draw on this evidence, and produce both activity predictions and natural-language explanations. ZARA enables flexible and interpretable HAR without any fine-tuning or task-specific classifiers. Extensive experiments on 8 HAR benchmarks show that ZARA achieves SOTA zero-shot performance, delivering clear reasoning while exceeding the strongest baselines by 2.53x in macro F1. Ablation studies further confirm the necessity of each module, marking ZARA as a promising step toward trustworthy, plug-and-play motion time-series analysis. Our codes are available at https://github.com/zechenli03/ZARA.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2508.04038

Country: North America > United States (0.14)

Genre: Research Report (1.00)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

CauKer: classification time series foundation models can be pretrained on synthetic data only

Xie, Shifeng, Feofanov, Vasilii, Alonso, Marius, Odonnat, Ambroise, Zhang, Jianfeng, Palpanas, Themis, Redko, Ievgen

arXiv.org Artificial IntelligenceAug-7-2025

Time series foundation models (TSFMs) have recently gained significant attention due to their strong zero-shot capabilities and widespread real-world applications. Such models typically require a computationally costly pretraining on large-scale, carefully curated collections of real-world sequences. To allow for a sample-efficient pretraining of TSFMs, we propose CauKer, a novel algorithm designed to generate diverse, causally coherent synthetic time series with realistic trends, seasonality, and nonlinear interactions. CauKer combines Gaussian Process (GP) kernel composition with Structural Causal Models (SCM) to produce data for sample-efficient pretraining of state-of-the-art classification TSFMs having different architectures and following different pretraining approaches. Additionally, our experiments reveal that CauKer-generated datasets exhibit clear scaling laws for both dataset size (10K to 10M samples) and model capacity (1M to 783M parameters), unlike real-world datasets, which display irregular scaling behavior.

data mining, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2508.02879

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Data Science > Data Mining (0.94)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Time Series Representations for Classification Lie Hidden in Pretrained Vision Transformers

Roschmann, Simon, Bouniot, Quentin, Feofanov, Vasilii, Redko, Ievgen, Akata, Zeynep

arXiv.org Artificial IntelligenceJul-3-2025

Time series classification is a fundamental task in healthcare and industry, yet the development of time series foundation models (TSFMs) remains limited by the scarcity of publicly available time series datasets. In this work, we propose Time Vision Transformer (TiViT), a framework that converts time series into images to leverage the representational power of frozen Vision Transformers (ViTs) pretrained on large-scale image datasets. First, we theoretically motivate our approach by analyzing the 2D patching of ViTs for time series, showing that it can increase the number of label-relevant tokens and reduce the sample complexity. Second, we empirically demonstrate that TiViT achieves state-of-the-art performance on standard time series classification benchmarks by utilizing the hidden representations of large OpenCLIP models. We explore the structure of TiViT representations and find that intermediate layers with high intrinsic dimension are the most effective for time series classification. Finally, we assess the alignment between TiViT and TSFM representation spaces and identify a strong complementarity, with further performance gains achieved by combining their features. Our findings reveal a new direction for reusing vision representations in a non-visual domain. Code is available at https://github.com/ExplainableML/TiViT.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2506.08641

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Mantis: Lightweight Calibrated Foundation Model for User-Friendly Time Series Classification

Feofanov, Vasilii, Wen, Songkang, Alonso, Marius, Ilbert, Romain, Guo, Hongbo, Tiomoko, Malik, Pan, Lujia, Zhang, Jianfeng, Redko, Ievgen

arXiv.org Machine LearningFeb-21-2025

In recent years, there has been increasing interest in developing foundation models for time series data that can generalize across diverse downstream tasks. While numerous forecasting-oriented foundation models have been introduced, there is a notable scarcity of models tailored for time series classification. To address this gap, we present Mantis, a new open-source foundation model for time series classification based on the Vision Transformer (ViT) architecture that has been pre-trained using a contrastive learning approach. Our experimental results show that Mantis outperforms existing foundation models both when the backbone is frozen and when fine-tuned, while achieving the lowest calibration error. In addition, we propose several adapters to handle the multivariate setting, reducing memory requirements and modeling channel interdependence.

dataset, foundation model, mantis, (12 more...)

arXiv.org Machine Learning

2502.15637

Country: South America > Ecuador (0.04)

Genre: Research Report > New Finding (0.66)

Industry: Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Data Science > Data Mining (0.94)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Redefining in Dictionary: Towards an Enhanced Semantic Understanding of Creative Generation

Feng, Fu, Xie, Yucheng, Yang, Xu, Wang, Jing, Geng, Xin

arXiv.org Artificial IntelligenceNov-20-2024

Given the challenge atively generated using . Furthermore, this that diffusion models face in directly generating creativity, meta-creativity enables direct concept combinations without existing methods typically rely on synthesizing reference requiring additional training, much like generating "a prompts or images to achieve creative effects. This significantly reduces both time and computational instance, to combine "Lettuce" and "Mantis" creatively, complexity compared to state-of-the-art (SOTA) ConceptLab [43] merges tokens representing these concepts creative generation methods, such as ConceptLab [43] (4s into a new composite token, while BASS [22] uses predefined vs. 120s per image, 30 speedup) and BASS [22] (4s vs. sampling rules to search for creative outcomes from a 2400s per image, 600 speedup), while maintaining linguistic large pool of candidate images. Further each generation, which leads to high computational costs evaluations using GPT-4o [1] and user studies indicate superior and limited practicality for online applications. In contrast, performance of CreTok in terms of integration, originality, "a blue banana" can be generated directly without additional and aesthetics, underscoring its effectiveness in fostering training, due to its clear and concrete semantics, especially combinatorial creativity. Inspired by this, we may Our contributions are as follows: (1) We propose Cre-ask: Can we awaken the creativity of diffusion models by Tok, a method designed to enhance models' meta-ability enhancing their semantic understanding of "creative"? To by enabling a enhanced understanding of abstract and ambiguous achieve this, we propose CreTok, which redefines "creative" adjectives (e.g., "creative" or "beautiful") through as a new specialized token, , allowing it their redefinition as new tokens. This redefinition we redefine the abstract term "creative" within our proposed enhances the model's semantic understanding for CangJie dataset for the TP2O task, and introduce combinatorial creativity, as shown in Figure 1c. Specifically, text-to-image (T2I) models and creative generation methods CreTok builds on the definition of "creativity" from in terms of computational complexity, human preference the TP2O task [22] for combinatorial object generation, ratings, text-image alignment, and other key metrics. ") and human-like creativity, a critical yet underexplored aspect an adaptive prompt (e.g., "A photo of a mixture"). of AI research [28, 29].

creativity, diffusion model, proceedings, (15 more...)

arXiv.org Artificial Intelligence

2410.2416

Country: Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre:

Research Report > Promising Solution (0.46)
Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)

Add feedback

Hacking Back the AI-Hacker: Prompt Injection as a Defense Against LLM-driven Cyberattacks

Pasquini, Dario, Kornaropoulos, Evgenios M., Ateniese, Giuseppe

arXiv.org Artificial IntelligenceNov-18-2024

Large language models (LLMs) are increasingly being harnessed to automate cyberattacks, making sophisticated exploits more accessible and scalable. In response, we propose a new defense strategy tailored to counter LLM-driven cyberattacks. We introduce Mantis, a defensive framework that exploits LLMs' susceptibility to adversarial inputs to undermine malicious operations. Upon detecting an automated cyberattack, Mantis plants carefully crafted inputs into system responses, leading the attacker's LLM to disrupt their own operations (passive defense) or even compromise the attacker's machine (active defense). By deploying purposefully vulnerable decoy services to attract the attacker and using dynamic prompt injections for the attacker's LLM, Mantis can autonomously hack back the attacker. In our experiments, Mantis consistently achieved over 95% effectiveness against automated LLM-driven attacks. To foster further research and collaboration, Mantis is available as an open-source tool: https://github.com/pasquini-dario/project_mantis

attacker, llm-agent, mantis, (13 more...)

arXiv.org Artificial Intelligence

2410.20911

Country:

North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Asia (0.04)

Genre: Research Report > New Finding (0.87)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military > Cyberwarfare (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

MANTIS: Interleaved Multi-Image Instruction Tuning

Jiang, Dongfu, He, Xuan, Zeng, Huaye, Wei, Cong, Ku, Max, Liu, Qian, Chen, Wenhu

arXiv.org Artificial IntelligenceMay-23-2024

Large multimodal models (LMMs) have shown great results in single-image vision language tasks. However, their abilities to solve multi-image visual language tasks is yet to be improved. The existing LMMs like OpenFlamingo, Emu2, Idefics gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text data from the web, which is neither efficient nor effective. In this paper, we aim to build strong multi-image LMMs via instruction tuning with academic-level resources. Therefore, we meticulously construct Mantis-Instruct containing 721K multi-image instruction data to train a family of models Mantis. The instruction tuning empowers Mantis with different multi-image skills like co-reference, comparison, reasoning, and temporal understanding. We evaluate Mantis on five multi-image benchmarks and seven single-image benchmarks. Mantis-SigLIP can achieve SoTA results on all the multi-image benchmarks and beat the strongest multi-image baseline, Idefics2-8B by an average of 11 absolute points. Notably, Idefics2-8B was pre-trained on 140M interleaved multi-image data, which is 200x larger than Mantis-Instruct. We observe that Mantis performs equivalently well on the held-in and held-out benchmarks, which shows its generalization ability. Notably, we found that Mantis can even match the performance of GPT-4V on multi-image benchmarks. We further evaluate Mantis on single-image benchmarks and demonstrate that Mantis also maintains a strong single-image performance on par with CogVLM and Emu2. Our results show that multi-image abilities are not necessarily gained through massive pre-training, instead, it can be gained by the low-cost instruction tuning. Our work provides new perspectives on how to improve LMMs' multi-image abilities.

arxiv, benchmark, dataset, (15 more...)

arXiv.org Artificial Intelligence

2405.01483

Country:

Europe > Monaco (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

MANTIS at #SMM4H 2023: Leveraging Hybrid and Ensemble Models for Detection of Social Anxiety Disorder on Reddit

Zanwar, Sourabh, Wiechmann, Daniel, Qiao, Yu, Kerz, Elma

arXiv.org Artificial IntelligenceNov-28-2023

This paper presents our system employed for the Social Media Mining for Health 2023 Shared Task 4: Binary classification of English Reddit posts self-reporting a social anxiety disorder diagnosis. We systematically investigate and contrast the efficacy of hybrid and ensemble models that harness specialized medical domain-adapted transformers in conjunction with BiLSTM neural networks. The evaluation results outline that our best performing model obtained 89.31% F1 on the validation set and 83.76% F1 on the test set.

leveraging hybrid and ensemble model, proceedings, social anxiety disorder, (10 more...)

arXiv.org Artificial Intelligence

2312.09451

Country:

North America > United States (0.05)
Europe > Netherlands > North Holland > Amsterdam (0.05)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.05)

Genre: Research Report (0.51)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Anxiety Disorder (0.63)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.49)

Add feedback