AITopics | Virmaux, Aladin

Unlocking the Potential of Transformers in Time Series Forecasting with Sharpness-Aware Minimization and Channel-Wise Attention

Ilbert, Romain, Odonnat, Ambroise, Feofanov, Vasilii, Virmaux, Aladin, Paolo, Giuseppe, Palpanas, Themis, Redko, Ievgen

arXiv.org Artificial IntelligenceFeb-19-2024

Transformer-based architectures achieved breakthrough performance in natural language processing and computer vision, yet they remain inferior to simpler linear baselines in multivariate long-term forecasting. To better understand this phenomenon, we start by studying a toy linear forecasting problem for which we show that transformers are incapable of converging to their true solution despite their high expressive power. We further identify the attention of transformers as being responsible for this low generalization capacity. Building upon this insight, we propose a shallow lightweight transformer model that successfully escapes bad local minima when optimized with sharpness-aware optimization. We empirically demonstrate that this result extends to all commonly used real-world multivariate time series datasets. In particular, SAMformer surpasses the current state-of-the-art model TSMixer by 14.33% on average, while having ~4 times fewer parameters. The code is available at https://github.com/romilbert/samformer.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2402.10198

Country:

Europe (0.45)
North America > United States (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.67)

Industry: Banking & Finance (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)

Add feedback

Random Matrix Analysis to Balance between Supervised and Unsupervised Learning under the Low Density Separation Assumption

Feofanov, Vasilii, Tiomoko, Malik, Virmaux, Aladin

arXiv.org Machine LearningOct-20-2023

We propose a theoretical framework to analyze semi-supervised classification under the low density separation assumption in a high-dimensional regime. In particular, we introduce QLDS, a linear classification model, where the low density separation assumption is implemented via quadratic margin maximization. The algorithm has an explicit solution with rich theoretical properties, and we show that particular cases of our algorithm are the least-square support vector machine in the supervised case, the spectral clustering in the fully unsupervised regime, and a class of semi-supervised graph-based approaches. As such, QLDS establishes a smooth bridge between these supervised and unsupervised learning methods. Using recent advances in the random matrix theory, we formally derive a theoretical evaluation of the classification error in the asymptotic regime. As an application, we derive a hyperparameter selection policy that finds the best balance between the supervised and the unsupervised terms of our learning criterion. Finally, we provide extensive illustrations of our framework, as well as an experimental study on several benchmarks to demonstrate that QLDS, while being computationally more efficient, improves over cross-validation for hyperparameter selection, indicating a high promise of the usage of random matrix theory for semi-supervised model selection.

artificial intelligence, machine learning, qld, (14 more...)

arXiv.org Machine Learning

2310.13434

Country:

North America > United States > New York (0.14)
North America > United States > Hawaii (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.55)

Add feedback

Lipschitz Normalization for Self-Attention Layers with Application to Graph Neural Networks

Dasoulas, George, Scaman, Kevin, Virmaux, Aladin

arXiv.org Machine LearningMar-8-2021

Attention based neural networks are state of the art in a large range of applications. However, their performance tends to degrade when the number of layers increases. In this work, we show that enforcing Lipschitz continuity by normalizing the attention scores can significantly improve the performance of deep attention models. First, we show that, for deep graph attention networks (GAT), gradient explosion appears during training, leading to poor performance of gradient-based training algorithms. To address this issue, we derive a theoretical analysis of the Lipschitz continuity of attention modules and introduce LipschitzNorm, a simple and parameter-free normalization for self-attention mechanisms that enforces the model to be Lipschitz continuous. We then apply LipschitzNorm to GAT and Graph Transformers and show that their performance is substantially improved in the deep setting (10 to 30 layers). More specifically, we show that a deep GAT model with LipschitzNorm achieves state of the art results for node label prediction tasks that exhibit long-range dependencies, while showing consistent improvements over their unnormalized counterparts in benchmark node classification tasks.

deep learning, lipschitz normalization, neural network, (15 more...)

arXiv.org Machine Learning

2103.04886

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Ego-based Entropy Measures for Structural Representations

Dasoulas, George, Nikolentzos, Giannis, Scaman, Kevin, Virmaux, Aladin, Vazirgiannis, Michalis

arXiv.org Machine LearningMar-1-2020

In complex networks, nodes that share similar structural characteristics often exhibit similar roles (e.g type of users in a social network or the hierarchical position of employees in a company). In order to leverage this relationship, a growing literature proposed latent representations that identify structurally equivalent nodes. However, most of the existing methods require high time and space complexity. In this paper, we propose VNEstruct, a simple approach for generating low-dimensional structural node embeddings, that is both time efficient and robust to perturbations of the graph structure. The proposed approach focuses on the local neighborhood of each node and employs the Von Neumann entropy, an information-theoretic tool, to extract features that capture the neighborhood's topology. Moreover, on graph classification tasks, we suggest the utilization of the generated structural embeddings for the transformation of an attributed graph structure into a set of augmented node attributes. Empirically, we observe that the proposed approach exhibits robustness on structural role identification tasks and state-of-the-art performance on graph classification tasks, while maintaining very high computational speed.

artificial intelligence, neural network, representation, (19 more...)

arXiv.org Machine Learning

2003.00553

Country: North America > United States (0.30)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.95)

Add feedback

Coloring graph neural networks for node disambiguation

Dasoulas, George, Santos, Ludovic Dos, Scaman, Kevin, Virmaux, Aladin

arXiv.org Machine LearningDec-12-2019

Learning good representations is seen by many machine learning researchers as the main reason behind the tremendous successes of the field in recent years (Bengio et al., 2013). In image analysis (Krizhevsky et al., 2012), natural language processing (V aswani et al., 2017) or reinforcement learning (Mnih et al., 2015), groundbreaking results rely on efficient and flexible deep learning Despite a large literature and state-of-the-art performance on benchmark graph classification datasets, graph neural networks yet lack a similar theoretical foundation (Xu et al., 2019). Defferrard et al., 2016; Kipf and Welling, 2017) that perform convolution on the Fourier domain of Recently, (Xu et al., 2019) showed that MPNNs were, at most, as expressive as the Weisfeiler-Lehman (WL) test for graph isomorphism (Weisfeiler and Lehman, 1968). Other recent approaches (Maron et al., 2019c) implies quadratic order of tensors in the size of In this section we present the theoretical tools used to design our universal graph representation. This assumption is rather weak (e.g. Figure 2: Universal representations can easily be created by combining a separable representation with an MLP .

deep learning, neural network, representation, (20 more...)

arXiv.org Machine Learning

1912.06058

Country: North America > United States (0.28)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

Lipschitz regularity of deep neural networks: analysis and efficient estimation

Virmaux, Aladin, Scaman, Kevin

Neural Information Processing SystemsDec-31-2018

Deep neural networks are notorious for being sensitive to small well-chosen perturbations, and estimating the regularity of such architectures is of utmost importance for safe and robust practical applications. In this paper, we investigate one of the key characteristics to assess the regularity of such methods: the Lipschitz constant of deep learning architectures. First, we show that, even for two layer neural networks, the exact computation of this quantity is NP-hard and state-of-art methods may significantly overestimate it. Then, we both extend and improve previous estimation methods by providing AutoLip, the first generic algorithm for upper bounding the Lipschitz constant of any automatically differentiable function. We provide a power method algorithm working with automatic differentiation, allowing efficient computations even on large convolutions. Second, for sequential neural networks, we propose an improved algorithm named SeqLip that takes advantage of the linear computation graph to split the computation per pair of consecutive layers. Third we propose heuristics on SeqLip in order to tackle very large networks. Our experiments show that SeqLip can significantly improve on the existing upper bounds. Finally, we provide an implementation of AutoLip in the PyTorch environment that may be used to better estimate the robustness of a given neural network to small perturbations or regularize it using more precise Lipschitz estimations. These results also hint at the difficulty to estimate the Lipschitz constant of deep networks.

deep learning, lipschitz constant, neural network, (19 more...)

Neural Information Processing Systems

Country:

North America > United States (0.14)
North America > Canada (0.14)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Lipschitz regularity of deep neural networks: analysis and efficient estimation

Virmaux, Aladin, Scaman, Kevin

Neural Information Processing SystemsDec-31-2018

Deep neural networks are notorious for being sensitive to small well-chosen perturbations, and estimating the regularity of such architectures is of utmost importance for safe and robust practical applications. In this paper, we investigate one of the key characteristics to assess the regularity of such methods: the Lipschitz constant of deep learning architectures. First, we show that, even for two layer neural networks, the exact computation of this quantity is NP-hard and state-of-art methods may significantly overestimate it. Then, we both extend and improve previous estimation methods by providing AutoLip, the first generic algorithm for upper bounding the Lipschitz constant of any automatically differentiable function. We provide a power method algorithm working with automatic differentiation, allowing efficient computations even on large convolutions. Second, for sequential neural networks, we propose an improved algorithm named SeqLip that takes advantage of the linear computation graph to split the computation per pair of consecutive layers. Third we propose heuristics on SeqLip in order to tackle very large networks. Our experiments show that SeqLip can significantly improve on the existing upper bounds. Finally, we provide an implementation of AutoLip in the PyTorch environment that may be used to better estimate the robustness of a given neural network to small perturbations or regularize it using more precise Lipschitz estimations. These results also hint at the difficulty to estimate the Lipschitz constant of deep networks.

deep learning, lipschitz constant, neural network, (19 more...)

Neural Information Processing Systems

Country: North America > United States (0.28)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Lipschitz regularity of deep neural networks: analysis and efficient estimation

Scaman, Kevin, Virmaux, Aladin

arXiv.org Machine LearningMay-28-2018

Deep neural networks are notorious for being sensitive to small well-chosen perturbations, and estimating the regularity of such architectures is of utmost importance for safe and robust practical applications. In this paper, we investigate one of the key characteristics to assess the regularity of such methods: the Lipschitz constant of deep learning architectures. First, we show that, even for two layer neural networks, the exact computation of this quantity is NP-hard and state-of-art methods may significantly overestimate it. Then, we both extend and improve previous estimation methods by providing AutoLip, the first generic algorithm for upper bounding the Lipschitz constant of any automatically differentiable function. We provide a power method algorithm working with automatic differentiation, allowing efficient computations even on large convolutions. Second, for sequential neural networks, we propose an improved algorithm named SeqLip that takes advantage of the linear computation graph to split the computation per pair of consecutive layers. Third we propose heuristics on SeqLip in order to tackle very large networks. Our experiments show that SeqLip can significantly improve on the existing upper bounds.

deep learning, lipschitz constant, neural network, (18 more...)

arXiv.org Machine Learning

1805.10965

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Filters

Virmaux, Aladin

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

Unlocking the Potential of Transformers in Time Series Forecasting with Sharpness-Aware Minimization and Channel-Wise Attention

Random Matrix Analysis to Balance between Supervised and Unsupervised Learning under the Low Density Separation Assumption

Lipschitz Normalization for Self-Attention Layers with Application to Graph Neural Networks

Ego-based Entropy Measures for Structural Representations

Coloring graph neural networks for node disambiguation

Lipschitz regularity of deep neural networks: analysis and efficient estimation

Lipschitz regularity of deep neural networks: analysis and efficient estimation

Lipschitz regularity of deep neural networks: analysis and efficient estimation