Kopp, Michael
xLSTM: Extended Long Short-Term Memory
Beck, Maximilian, Pöppel, Korbinian, Spanring, Markus, Auer, Andreas, Prudnikova, Oleksandra, Kopp, Michael, Klambauer, Günter, Brandstetter, Johannes, Hochreiter, Sepp
In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and contributed to numerous deep learning success stories, in particular they constituted the first Large Language Models (LLMs). However, the advent of the Transformer technology with parallelizable self-attention at its core marked the dawn of a new era, outpacing LSTMs at scale. We now raise a simple question: How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs? Firstly, we introduce exponential gating with appropriate normalization and stabilization techniques. Secondly, we modify the LSTM memory structure, obtaining: (i) sLSTM with a scalar memory, a scalar update, and new memory mixing, (ii) mLSTM that is fully parallelizable with a matrix memory and a covariance update rule. Integrating these LSTM extensions into residual block backbones yields xLSTM blocks that are then residually stacked into xLSTM architectures. Exponential gating and modified memory structures boost xLSTM capabilities to perform favorably when compared to state-of-the-art Transformers and State Space Models, both in performance and scaling.
The impact of the AI revolution on asset management
Kopp, Michael
Recent progress in deep learning, a special form of machine learning, has led to remarkable capabilities machines can now be endowed with: they can read and understand free flowing text, reason and bargain with human counterparts, translate texts between languages, learn how to take decisions to maximize certain outcomes, etc. Today, machines have revolutionized the detection of cancer, the prediction of protein structures, the design of drugs, the control of nuclear fusion reactors etc. Although these capabilities are still in their infancy, it seems clear that their continued refinement and application will result in a technological impact on nearly all social and economic areas of human activity, the likes of which we have not seen before. In this article, I will share my view as to how AI will likely impact asset management in general and I will provide a mental framework that will equip readers with a simple criterion to assess whether and to what degree a given fund really exploits deep learning and whether a large disruption risk from deep learning exist.
Traffic4cast at NeurIPS 2022 -- Predict Dynamics along Graph Edges from Sparse Node Data: Whole City Traffic and ETA from Stationary Vehicle Detectors
Neun, Moritz, Eichenberger, Christian, Martin, Henry, Spanring, Markus, Siripurapu, Rahul, Springer, Daniel, Deng, Leyan, Wu, Chenwang, Lian, Defu, Zhou, Min, Lumiste, Martin, Ilie, Andrei, Wu, Xinhua, Lyu, Cheng, Lu, Qing-Long, Mahajan, Vishal, Lu, Yichao, Li, Jiezhang, Li, Junjun, Gong, Yue-Jiao, Grötschla, Florian, Mathys, Joël, Wei, Ye, Haitao, He, Fang, Hui, Malm, Kevin, Tang, Fei, Kopp, Michael, Kreil, David, Hochreiter, Sepp
The global trends of urbanization and increased personal mobility force us to rethink the way we live and use urban space. The Traffic4cast competition series tackles this problem in a data-driven way, advancing the latest methods in machine learning for modeling complex spatial systems over time. In this edition, our dynamic road graph data combine information from road maps, $10^{12}$ probe data points, and stationary vehicle detectors in three cities over the span of two years. While stationary vehicle detectors are the most accurate way to capture traffic volume, they are only available in few locations. Traffic4cast 2022 explores models that have the ability to generalize loosely related temporal vertex data on just a few nodes to predict dynamic future traffic states on the edges of the entire road graph. In the core challenge, participants are invited to predict the likelihoods of three congestion classes derived from the speed levels in the GPS data for the entire road graph in three cities 15 min into the future. We only provide vehicle count data from spatially sparse stationary vehicle detectors in these three cities as model input for this task. The data are aggregated in 15 min time bins for one hour prior to the prediction time. For the extended challenge, participants are tasked to predict the average travel times on super-segments 15 min into the future - super-segments are longer sequences of road segments in the graph. The competition results provide an important advance in the prediction of complex city-wide traffic states just from publicly available sparse vehicle data and without the need for large amounts of real-time floating vehicle data.
Profiling and Improving the PyTorch Dataloader for high-latency Storage: A Technical Report
Svogor, Ivan, Eichenberger, Christian, Spanring, Markus, Neun, Moritz, Kopp, Michael
A growing number of Machine Learning Frameworks recently made Deep Learning accessible to a wider audience of engineers, scientists, and practitioners, by allowing straightforward use of complex neural network architectures and algorithms. However, since deep learning is rapidly evolving, not only through theoretical advancements but also with respect to hardware and software engineering, ML frameworks often lose backward compatibility and introduce technical debt that can lead to bottlenecks and sub-optimal resource utilization. Moreover, the focus is in most cases not on deep learning engineering, but rather on new models and theoretical advancements. In this work, however, we focus on engineering, more specifically on the data loading pipeline in the PyTorch Framework. We designed a series of benchmarks that outline performance issues of certain steps in the data loading process. Our findings show that for classification tasks that involve loading many files, like images, the training wall-time can be significantly improved. With our new, modified ConcurrentDataloader we can reach improvements in GPU utilization and significantly reduce batch loading time, up to 12X. This allows for the use of the cloud-based, S3-like object storage for datasets, and have comparable training time as if datasets are stored on local drives.
Predicting the Future of AI with AI: High-quality link prediction in an exponentially growing knowledge network
Krenn, Mario, Buffoni, Lorenzo, Coutinho, Bruno, Eppel, Sagi, Foster, Jacob Gates, Gritsevskiy, Andrew, Lee, Harlin, Lu, Yichao, Moutinho, Joao P., Sanjabi, Nima, Sonthalia, Rishi, Tran, Ngoc Mai, Valente, Francisco, Xie, Yangxinyu, Yu, Rose, Kopp, Michael
A tool that could suggest new personalized research directions and ideas by taking insights from the scientific literature could significantly accelerate the progress of science. A field that might benefit from such an approach is artificial intelligence (AI) research, where the number of scientific publications has been growing exponentially over the last years, making it challenging for human researchers to keep track of the progress. Here, we use AI techniques to predict the future research directions of AI itself. We develop a new graph-based benchmark based on real-world data -- the Science4Cast benchmark, which aims to predict the future state of an evolving semantic network of AI. For that, we use more than 100,000 research papers and build up a knowledge network with more than 64,000 concept nodes. We then present ten diverse methods to tackle this task, ranging from pure statistical to pure learning methods. Surprisingly, the most powerful methods use a carefully curated set of network features, rather than an end-to-end AI approach. It indicates a great potential that can be unleashed for purely ML approaches without human knowledge. Ultimately, better predictions of new future research directions will be a crucial component of more advanced research suggestion tools.
A remark on a paper of Krotov and Hopfield [arXiv:2008.06996]
Tang, Fei, Kopp, Michael
In their recent paper titled "Large Associative Memory Problem in Neurobiology and Machine Learning" [arXiv:2008.06996] the authors gave a biologically plausible microscopic theory from which one can recover many dense associative memory models discussed in the literature. We show that the layers of the recent "MLP-mixer" [arXiv:2105.01601] as well as the essentially equivalent model in [arXiv:2105.02723] are amongst them.
Hopfield Networks is All You Need
Ramsauer, Hubert, Schäfl, Bernhard, Lehner, Johannes, Seidl, Philipp, Widrich, Michael, Gruber, Lukas, Holzleitner, Markus, Pavlović, Milena, Sandve, Geir Kjetil, Greiff, Victor, Kreil, David, Kopp, Michael, Klambauer, Günter, Brandstetter, Johannes, Hochreiter, Sepp
We show that the transformer attention mechanism is the update rule of a modern Hopfield network with continuous states. This new Hopfield network can store exponentially (with the dimension) many patterns, converges with one update, and has exponentially small retrieval errors. The number of stored patterns is traded off against convergence speed and retrieval error. The new Hopfield network has three types of energy minima (fixed points of the update): (1) global fixed point averaging over all patterns, (2) metastable states averaging over a subset of patterns, and (3) fixed points which store a single pattern. Transformer and BERT models operate in their first layers preferably in the global averaging regime, while they operate in higher layers in metastable states. The gradient in transformers is maximal for metastable states, is uniformly distributed for global averaging, and vanishes for a fixed point near a stored pattern. Using the Hopfield network interpretation, we analyzed learning of transformer and BERT models. Learning starts with attention heads that average and then most of them switch to metastable states. However, the majority of heads in the first layers still averages and can be replaced by averaging, e.g. our proposed Gaussian weighting. In contrast, heads in the last layers steadily learn and seem to use metastable states to collect information created in lower layers. These heads seem to be a promising target for improving transformers. Neural networks with Hopfield networks outperform other methods on immune repertoire classification, where the Hopfield net stores several hundreds of thousands of patterns. We provide a new PyTorch layer called "Hopfield", which allows to equip deep learning architectures with modern Hopfield networks as a new powerful concept comprising pooling, memory, and attention. GitHub: https://github.com/ml-jku/hopfield-layers