### Injecting Hierarchy with U-Net Transformers

The Transformer architecture has become increasingly popular over the past couple of years, owing to its impressive performance on a number of natural language processing (NLP) tasks. However, it may be argued that the Transformer architecture lacks an explicit hierarchical representation, as all computations occur on word-level representations alone, and therefore, learning structure poses a challenge for Transformer models. In the present work, we introduce hierarchical processing into the Transformer model, taking inspiration from the U-Net architecture, popular in computer vision for its hierarchical view of natural images. We propose a novel architecture that combines ideas from Transformer and U-Net models to incorporate hierarchy at multiple levels of abstraction. We empirically demonstrate that the proposed architecture outperforms the vanilla Transformer and strong baselines in the chit-chat dialogue and machine translation domains.

### Complex Transformer: A Framework for Modeling Complex-Valued Sequence

ABSTRACT While deep learning has received a surge of interest in a variety of fields in recent years, major deep learning models barely use complex numbers. However, speech, signal and audio data are naturally complex-valued after Fourier Transform, and studies have shown a potentially richer representation of complex nets. In this paper, we propose a Complex Transformer, which incorporates the transformer model as a backbone for sequence modeling; we also develop attention and encoder-decoder network operating for complex input. The model achieves state-of-the-art performance on the MusicNet dataset and an In-phase Quadrature (IQ) signal dataset. The GitHub implementation to reproduce the experimental results is available at https://github.com/

### You May Not Need Order in Time Series Forecasting

Time series forecasting with limited data is a challenging yet critical task. While transformers have achieved outstanding performances in time series forecasting, they often require many training samples due to the large number of trainable parameters. In this paper, we propose a training technique for transformers that prepares the training windows through random sampling. As input time steps need not be consecutive, the number of distinct samples increases from linearly to combinatorially many. By breaking the temporal order, this technique also helps transformers to capture dependencies among time steps in finer granularity. We achieve competitive results compared to the state-of-the-art on real-world datasets.

### Learning Set-equivariant Functions with SWARM Mappings

In this work we propose a new neural network architecture that efficiently implements and learns general purpose set-equivariant functions. Such a function $f$ maps a set of entities $x=\left\{ x_{1},\ldots,x_{n}\right\}$ from one domain to a set of same cardinality $y=f\left(x\right)=\left\{ y_{1},\ldots,y_{n}\right\}$ in another domain regardless of the ordering of the entities. The architecture is based on a gated recurrent network which is iteratively applied to all entities individually and at the same time syncs with the progression of the whole population. In reminiscence to this pattern, which can be frequently observable in nature, we call our approach SWARM mapping. Set-equivariant and generally permutation invariant functions are important building blocks for many state of the art machine learning approaches. Even in application where the permutation invariance is not of primary interest, as to be seen in the recent success of attention based transformer models (Vaswani et. al. 2017). Accordingly, we demonstrate the power and usefulness of SWARM mappings in different applications. We compare the performance of our approach with another recently proposed set-equivariant function, the SetTransformer (Lee et.al. 2018) and we demonstrate that transformer solely based on SWARM layers gives state of the art results.