Tschannen, Michael
Quantization-Free Autoregressive Action Transformer
Sheebaelhamd, Ziyad, Tschannen, Michael, Muehlebach, Michael, Vernade, Claire
Current transformer-based imitation learning approaches introduce discrete action representations and train an autoregressive transformer decoder on the resulting latent code. However, the initial quantization breaks the continuous structure of the action space, thereby limiting the capabilities of the generative model. We propose a quantization-free method instead that leverages Generative Infinite-Vocabulary Transformers (GIVT) as a direct, continuous policy parametrization for autoregressive transformers. Existing autoregressive policies, on the one hand, sidestep the challenge of learning in a continuous domain by discretizing the actions (Lee et al., 2024; Shafiullah et al., 2022). This discretization can introduce several drawbacks: it discards the inherent structure of the continuous space, increases complexity by adding a separate quantization step, and may limit expressiveness or accuracy when fine-grained control is required.
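As a rough illustration of the quantization-free idea, the NumPy sketch below shows the kind of continuous mixture-density head a GIVT-style decoder can use in place of a softmax over discretized action bins. The layer sizes, the single linear projection, and the Gaussian-mixture parametrization are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, ACT_DIM, N_COMP = 256, 7, 8   # hypothetical sizes, not the paper's

# Random matrix standing in for the trained output projection of the decoder.
W = rng.normal(0.0, 0.02, size=(D_MODEL, N_COMP * (1 + 2 * ACT_DIM)))

def continuous_action_head(h):
    """Map one decoder hidden state to Gaussian-mixture parameters over the
    continuous action space and sample an action, instead of taking a
    softmax over quantized action bins."""
    params = h @ W
    logits = params[:N_COMP]
    means = params[N_COMP:N_COMP * (1 + ACT_DIM)].reshape(N_COMP, ACT_DIM)
    log_scales = params[N_COMP * (1 + ACT_DIM):].reshape(N_COMP, ACT_DIM)

    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    k = rng.choice(N_COMP, p=weights)                    # pick a mixture component
    return rng.normal(means[k], np.exp(log_scales[k]))   # continuous action vector

h = rng.normal(size=D_MODEL)        # stand-in for one decoder hidden state
print(continuous_action_head(h))    # a 7-dimensional continuous action
```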
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
Tschannen, Michael, Gritsenko, Alexey, Wang, Xiao, Naeem, Muhammad Ferjad, Alabdulmohsin, Ibrahim, Parthasarathy, Nikhil, Evans, Talfan, Beyer, Lucas, Xia, Ye, Mustafa, Basil, Hénaff, Olivier, Harmsen, Jeremiah, Steiner, Andreas, Zhai, Xiaohua
We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input's native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To allow users to trade off inference cost with performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).
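For reference, the "original image-text training objective" mentioned above is SigLIP's pairwise sigmoid loss, of which a minimal NumPy sketch is given below. The batch-level normalization of the loss and the temperature/bias values are simplified assumptions; in practice both are learnable and the loss is computed over large, sharded batches.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid_image_text_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid image-text loss: every image-text pair in the batch
    is an independent binary classification problem, with matching pairs as
    positives and all other pairs as negatives. t and b are the temperature
    and bias."""
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = t * img @ txt.T + b
    labels = 2.0 * np.eye(len(img)) - 1.0          # +1 on the diagonal, -1 elsewhere
    return np.mean(np.logaddexp(0.0, -labels * logits))   # mean of -log sigmoid

img_emb = rng.normal(size=(4, 32))                 # toy batch of 4 embedding pairs
txt_emb = rng.normal(size=(4, 32))
print(sigmoid_image_text_loss(img_emb, txt_emb))
```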
Jet: A Modern Transformer-Based Normalizing Flow
Kolesnikov, Alexander, Pinto, André Susano, Tschannen, Michael
In the past, normalizing generative flows have emerged as a promising class of generative models for natural images. This type of model has many modeling advantages: the ability to efficiently compute the log-likelihood of the input data, fast generation, and a simple overall structure. Normalizing flows remained a topic of active research but later fell out of favor, as the visual quality of their samples was not competitive with other model classes, such as GANs, VQ-VAE-based approaches, or diffusion models. In this paper we revisit the design of coupling-based normalizing flow models by carefully ablating prior design choices and using computational blocks based on the Vision Transformer architecture rather than convolutional neural networks. As a result, we achieve state-of-the-art quantitative and qualitative performance with a much simpler architecture. While the overall visual quality is still behind the current state-of-the-art models, we argue that strong normalizing flow models can help advance the research frontier by serving as building components of more powerful generative models.
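To make the coupling-based construction concrete, here is a minimal NumPy sketch of one affine coupling step with its exact log-determinant and inverse. A tiny tanh MLP stands in for the ViT-based computational blocks that Jet actually uses, and all sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                # toy dimensionality of a flattened input
W1 = rng.normal(0, 0.1, (D // 2, 32))
W2 = rng.normal(0, 0.1, (32, D))     # produces scale and shift for the other half

def coupling_forward(x):
    """One affine coupling step: half the dimensions parameterize an
    elementwise affine map of the other half."""
    xa, xb = x[: D // 2], x[D // 2:]
    h = np.tanh(xa @ W1) @ W2
    log_s, t = h[: D // 2], h[D // 2:]
    yb = xb * np.exp(log_s) + t
    logdet = log_s.sum()             # exact log-determinant of the Jacobian
    return np.concatenate([xa, yb]), logdet

def coupling_inverse(y):
    """Exact inverse: xa passes through unchanged, so the same MLP output
    can be recomputed and undone."""
    ya, yb = y[: D // 2], y[D // 2:]
    h = np.tanh(ya @ W1) @ W2
    log_s, t = h[: D // 2], h[D // 2:]
    return np.concatenate([ya, (yb - t) * np.exp(-log_s)])

x = rng.normal(size=D)
y, logdet = coupling_forward(x)
assert np.allclose(coupling_inverse(y), x)   # invertible by construction
print(logdet)
```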
JetFormer: An Autoregressive Generative Model of Raw Images and Text
Tschannen, Michael, Pinto, André Susano, Kolesnikov, Alexander
Removing modeling constraints and unifying architectures across domains has been a key driver of the recent progress in training large multimodal models. However, most of these models still rely on many separately trained components such as modality-specific encoders and decoders. In this work, we further streamline joint generative modeling of images and text. We propose an autoregressive decoder-only transformer, JetFormer, which is trained to directly maximize the likelihood of raw data, without relying on any separately pretrained components, and can understand and generate both text and images. Specifically, we leverage a normalizing flow model to obtain a soft-token image representation that is jointly trained with an autoregressive multimodal transformer. The normalizing flow model serves as both an image encoder for perception tasks and an image decoder for image generation tasks during inference. JetFormer achieves text-to-image generation quality competitive with recent VQ-VAE- and VAE-based baselines. These baselines rely on pretrained image autoencoders, which are trained with a complex mixture of losses, including perceptual ones. At the same time, JetFormer demonstrates robust image understanding capabilities. To the best of our knowledge, JetFormer is the first model that is capable of generating high-fidelity images and producing strong log-likelihood bounds. The "Bitter lesson" (Sutton, 2019) has been the prime force behind the recent progress in machine learning and artificial intelligence research. It suggests that general-purpose methods which effectively leverage large amounts of compute and data prevail over specialized techniques designed by domain experts.
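A minimal sketch of the resulting training objective follows, under simplifying assumptions: an elementwise affine map stands in for the learned normalizing flow, and a unit Gaussian stands in for the autoregressive transformer's likelihood over soft tokens. The point being illustrated is only the change-of-variables decomposition, i.e. log p(x) = log p_AR(z) + log|det dz/dx|, which is what lets the model report likelihood bounds on raw images.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_forward(x):
    """Stand-in invertible map (elementwise affine) returning soft tokens
    and the log-determinant of its Jacobian; JetFormer uses a learned
    normalizing flow here."""
    scale = 0.5
    return scale * x, x.size * np.log(scale)

def ar_log_prob(z):
    """Stand-in autoregressive likelihood: a unit Gaussian per soft-token
    dimension; JetFormer uses a multimodal transformer instead."""
    return -0.5 * np.sum(z**2 + np.log(2 * np.pi))

def image_nll(x):
    # log p(x) = log p_AR(z) + log|det dz/dx|; return the negative log-likelihood.
    z, logdet = flow_forward(x)
    return -(ar_log_prob(z) + logdet)

print(image_nll(rng.normal(size=16)))
```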
PaliGemma: A versatile 3B VLM for transfer
Beyer, Lucas, Steiner, Andreas, Pinto, André Susano, Kolesnikov, Alexander, Wang, Xiao, Salz, Daniel, Neumann, Maxim, Alabdulmohsin, Ibrahim, Tschannen, Michael, Bugliarello, Emanuele, Unterthiner, Thomas, Keysers, Daniel, Koppula, Skanda, Liu, Fangyu, Grycner, Adam, Gritsenko, Alexey, Houlsby, Neil, Kumar, Manoj, Rong, Keran, Eisenschlos, Julian, Kabra, Rishabh, Bauer, Matthias, Bošnjak, Matko, Chen, Xi, Minderer, Matthias, Voigtlaender, Paul, Bica, Ioana, Balazevic, Ivana, Puigcerver, Joan, Papalampidi, Pinelopi, Henaff, Olivier, Xiong, Xi, Soricut, Radu, Harmsen, Jeremiah, Zhai, Xiaohua
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.
Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers
Stanić, Aleksandar, Caelles, Sergi, Tschannen, Michael
Visual reasoning is dominated by end-to-end neural networks scaled to billions of model parameters and training examples. However, even the largest models struggle with compositional reasoning, generalization, fine-grained spatial and temporal reasoning, and counting. Visual reasoning with large language models (LLMs) as controllers can, in principle, address these limitations by decomposing the task and solving subtasks by orchestrating a set of (visual) tools. Recently, these models achieved great performance on tasks such as compositional visual question answering, visual grounding, and video temporal reasoning. Nevertheless, in their current form, these models heavily rely on human engineering of in-context examples in the prompt, which are often dataset- and task-specific and require significant labor by highly skilled programmers. In this work, we present a framework that mitigates these issues by introducing spatially and temporally abstract routines and by leveraging a small number of labeled examples to automatically generate in-context examples, thereby avoiding human-created in-context examples. On a number of visual reasoning tasks, we show that our framework leads to consistent gains in performance, makes the LLM-as-controller setup more robust, and removes the need for human engineering of in-context examples.
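The sketch below illustrates the LLM-as-programmer pattern with stub components: a hypothetical tool API (find, left_of), a hand-written stand-in for a program the LLM controller might generate, and plain exec to run it. None of the names are the paper's actual API; the real framework prompts an LLM with such a tool API plus automatically generated in-context examples and executes the returned code.

```python
def find(image, name):
    """Stub detector: returns bounding boxes (x0, y0, x1, y1) for `name`."""
    fake_detections = {"cat": [(10, 40, 60, 90)], "dog": [(70, 35, 120, 95)]}
    return fake_detections.get(name, [])

def left_of(box_a, box_b):
    """Spatially abstract routine: is box_a entirely to the left of box_b?"""
    return box_a[2] < box_b[0]

# A program of the kind the LLM controller would generate for the question
# "Is the cat to the left of the dog?".
generated_program = """
cats = find(image, "cat")
dogs = find(image, "dog")
answer = bool(cats and dogs and left_of(cats[0], dogs[0]))
"""

namespace = {"find": find, "left_of": left_of, "image": None}
exec(generated_program, namespace)
print(namespace["answer"])   # True for the stub detections above
```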
Finite Scalar Quantization: VQ-VAE Made Simple
Mentzer, Fabian, Minnen, David, Agustsson, Eirikur, Tschannen, Michael
We propose to replace vector quantization (VQ) in the latent representation of VQ-VAEs with a simple scheme termed finite scalar quantization (FSQ), where we project the VAE representation down to a few dimensions (typically less than 10). Each dimension is quantized to a small set of fixed values, leading to an (implicit) codebook given by the product of these sets. By appropriately choosing the number of dimensions and values each dimension can take, we obtain the same codebook size as in VQ. On top of such discrete representations, we can train the same models that have been trained on VQ-VAE representations, for example autoregressive and masked transformer models for image generation, multimodal generation, and dense prediction computer vision tasks. Concretely, we employ FSQ with MaskGIT for image generation, and with UViM for depth estimation, colorization, and panoptic segmentation. Despite the much simpler design of FSQ, we obtain competitive performance in all these tasks. We emphasize that FSQ does not suffer from codebook collapse and does not need the complex machinery employed in VQ (commitment losses, codebook reseeding, code splitting, entropy penalties, etc.) to learn expressive discrete representations. Vector quantization (VQ), initially introduced by Gray (1984), has recently seen a renaissance in the context of learning discrete representations with neural networks. Spurred by the success of VQ-VAE (Van Den Oord et al., 2017), Esser et al. (2020) and Villegas et al. (2022) showed that training an autoregressive transformer on the representations of a VQ-VAE trained with a GAN loss enables powerful image and video generation models, respectively.
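Since the quantizer itself is essentially the whole method, a short NumPy sketch may help. It implements the FSQ forward pass for odd per-dimension level counts; the paper additionally handles even counts via a half-step offset, and during training gradients are passed through the rounding with a straight-through estimator.

```python
import numpy as np

def fsq_quantize(z, levels=(7, 5, 5)):
    """Finite scalar quantization of a low-dimensional latent vector z.

    Each dimension i is bounded with tanh and rounded to levels[i] fixed
    values, so the implicit codebook is the product of the per-dimension
    grids (7 * 5 * 5 = 175 codes here). Simplified sketch: odd level counts
    only, forward pass only."""
    half = (np.asarray(levels) - 1) / 2.0
    bounded = np.tanh(z) * half           # squash each dim into [-half_i, half_i]
    return np.round(bounded) / half       # grid values, renormalized to [-1, 1]

z = np.array([0.3, -1.7, 4.0])            # latent with d = 3 dimensions (d < 10)
print(fsq_quantize(z))
```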
PaLI-X: On Scaling up a Multilingual Vision and Language Model
Chen, Xi, Djolonga, Josip, Padlewski, Piotr, Mustafa, Basil, Changpinyo, Soravit, Wu, Jialin, Ruiz, Carlos Riquelme, Goodman, Sebastian, Wang, Xiao, Tay, Yi, Shakeri, Siamak, Dehghani, Mostafa, Salz, Daniel, Lucic, Mario, Tschannen, Michael, Nagrani, Arsha, Hu, Hexiang, Joshi, Mandar, Pang, Bo, Montgomery, Ceslee, Pietrzyk, Paulina, Ritter, Marvin, Piergiovanni, AJ, Minderer, Matthias, Pavetic, Filip, Waters, Austin, Li, Gang, Alabdulmohsin, Ibrahim, Beyer, Lucas, Amelot, Julien, Lee, Kenton, Steiner, Andreas Peter, Li, Yang, Keysers, Daniel, Arnab, Anurag, Xu, Yuanzhong, Rong, Keran, Kolesnikov, Alexander, Seyedhosseini, Mojtaba, Angelova, Anelia, Zhai, Xiaohua, Houlsby, Neil, Soricut, Radu
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. PaLI-X advances the state-of-the-art on most vision-and-language benchmarks considered (25+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
M2T: Masking Transformers Twice for Faster Decoding
Mentzer, Fabian, Agustsson, Eirikur, Tschannen, Michael
In MaskGIT [11], the authors (see Figure 1) use a VQ-GAN [16] to map images to vector-quantized tokens, and learn a transformer to predict the distribution of these tokens. The key novelty of the approach was to use BERT-like [13] random masks during training to then predict tokens in groups during inference, sampling tokens in the same group in parallel at each inference step. Thereby, each inference step is conditioned on the tokens generated in previous steps. A big advantage of BERT-like training with grouped inference versus prior state-of-the-art is that considerably fewer steps are required to produce realistic images (typically 10-20, rather than one per token). Motivated by this, we aim to employ masked transformers for neural image compression. Previous work has used masked and unmasked transformers in the entropy model for video compression [37, 25] and image compression [29, 22, 15]. However, these models are often either prohibitively slow [22], or lag in rate-distortion performance [29, 15]. In this paper, we show a conceptually simple transformer-based approach that is state-of-the-art in neural image compression, at practical runtimes. The model is using off-the-shelf transformers.
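A toy NumPy sketch of the grouped masked decoding loop described above: start from a fully masked token grid and fill it in a small number of steps, each step conditioned on everything generated so far. The random stub model and the fixed random group schedule are placeholders for the trained transformer and for the confidence-based or learned schedules used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, N_TOKENS, N_STEPS = 16, 64, 8     # toy sizes
MASK = -1

def predict_logits(tokens):
    """Stub for the masked transformer: returns per-position logits.
    A real model would condition on the currently unmasked tokens."""
    return rng.normal(size=(N_TOKENS, VOCAB))

def grouped_decode():
    """Start fully masked, then fill the token grid in N_STEPS groups,
    sampling every token of a group in parallel and conditioning each
    step on all previously generated tokens."""
    tokens = np.full(N_TOKENS, MASK)
    order = rng.permutation(N_TOKENS)         # fixed random group schedule
    for group in np.array_split(order, N_STEPS):
        logits = predict_logits(tokens)
        probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
        for pos in group:                     # sampled in parallel in practice
            tokens[pos] = rng.choice(VOCAB, p=probs[pos])
    return tokens

print(grouped_decode())
```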
FlexiViT: One Model for All Patch Sizes
Beyer, Lucas, Izmailov, Pavel, Kolesnikov, Alexander, Caron, Mathilde, Kornblith, Simon, Zhai, Xiaohua, Minderer, Matthias, Tschannen, Michael, Alabdulmohsin, Ibrahim, Pavetic, Filip
Vision Transformers convert images to sequences by slicing them into patches. The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost, but changing the patch size typically requires retraining the model. In this paper, we demonstrate that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes, making it possible to tailor the model to different compute budgets at deployment time. We extensively evaluate the resulting model, which we call FlexiViT, on a wide range of tasks, including classification, image-text retrieval, open-world detection, panoptic segmentation, and semantic segmentation, concluding that it usually matches, and sometimes outperforms, standard ViT models trained at a single patch size in an otherwise identical setup. Hence, FlexiViT training is a simple drop-in improvement for ViT that makes it easy to add compute-adaptive capabilities to most models relying on a ViT backbone architecture. Code and pre-trained models are available at https://github.com/google-research/big_vision
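A small NumPy sketch of the core trick: sample a patch size per training step and reuse one set of patch-embedding weights by resizing the embedding kernel to match. Block averaging over integer factors stands in here for the pseudo-inverse-based PI-resize used by FlexiViT, and the single-channel toy image and sizes are illustrative, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
IMG, D_MODEL = 240, 64                       # toy image side length and embedding width
BASE_PATCH = 48                              # patch size the weights are stored at
W_base = rng.normal(0, 0.02, (BASE_PATCH, BASE_PATCH, D_MODEL))

def resize_kernel(w, p):
    """Resize the patch-embedding kernel from BASE_PATCH to p by block
    averaging (integer factors only); a simplified stand-in for PI-resize."""
    f = BASE_PATCH // p
    return w.reshape(p, f, p, f, D_MODEL).mean(axis=(1, 3)) * f * f

def embed(image, p):
    """Patchify at patch size p and embed every patch with the resized kernel."""
    w = resize_kernel(W_base, p)
    n = IMG // p
    patches = image.reshape(n, p, n, p).transpose(0, 2, 1, 3)
    return np.einsum('abij,ijd->abd', patches, w).reshape(n * n, D_MODEL)

image = rng.normal(size=(IMG, IMG))          # single-channel toy image
for p in (48, 24, 16, 12):                   # patch size sampled per training step
    print(p, embed(image, p).shape)          # token count changes, weights are shared
```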