



Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios

Neural Information Processing Systems

Meanwhile, its "parallel" computation allows for the simultaneous exploration of different reasoning paths and enables more robust and efficient execution of operations that are mutually independent (e.g., when counting individual colors for the query "determine the maximum occurring color amongst all t-shirts"). We design IPRM as a lightweight and fully-differentiable neural module that can be conveniently applied to both transformer and non-transformer vision-language backbones.
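The abstract does not come with code, but the iterative-and-parallel idea can be illustrated with a small, hypothetical sketch: a handful of parallel operation slots are read against the visual features and updated jointly at every iterative step. The module name, dimensions, slot/step counts, and the GRU-based update below are assumptions for illustration, not the authors' actual IPRM implementation.

```python
import torch
import torch.nn as nn

class IterativeParallelReasoner(nn.Module):
    """Hypothetical sketch: K parallel operation slots refined over T iterative steps."""
    def __init__(self, dim=256, num_slots=4, num_steps=3, num_heads=4):
        super().__init__()
        self.num_steps = num_steps
        self.slots = nn.Parameter(torch.randn(num_slots, dim) * 0.02)  # learned initial slot states
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.slot_mix = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, vis_feats, lang_feat):
        # vis_feats: (B, N, dim) visual tokens; lang_feat: (B, dim) pooled query embedding
        B, _, dim = vis_feats.shape
        state = self.slots.unsqueeze(0).expand(B, -1, -1) + lang_feat.unsqueeze(1)
        for _ in range(self.num_steps):                              # iterative computation
            read, _ = self.cross_attn(state, vis_feats, vis_feats)   # all slots read in parallel
            mixed, _ = self.slot_mix(read, read, read)               # exchange across slots
            state = self.update(mixed.reshape(B * state.size(1), dim),
                                state.reshape(B * state.size(1), dim)).view_as(state)
        return state.mean(dim=1)                                     # pooled reasoning result


if __name__ == "__main__":
    module = IterativeParallelReasoner()
    out = module(torch.randn(2, 49, 256), torch.randn(2, 256))
    print(out.shape)  # torch.Size([2, 256])
```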







Delayed Propagation Transformer: A Universal Computation Engine towards Practical Control in Cyber-Physical Systems

Neural Information Processing Systems

DePT induces a cone-shaped spatial-temporal attention prior, which injects the information propagation and aggregation principles and enables a global view. With physical constraint inductive bias baked into its design, our DePT is ready to plug and play for a broad class of multi-agent systems. The experimental results on one of the most challenging CPS settings, network-scale traffic signal control in the open world, show that our model outperformed state-of-the-art expert methods on synthetic and real-world datasets.
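One plausible reading of the cone-shaped spatial-temporal prior (an assumption on my part, not DePT's published construction) is a mask in which an agent may only attend to another agent's past state once enough time has elapsed for information to propagate across their graph distance, so the attention radius grows with the time lag. The function name and the propagation-speed parameter below are illustrative.

```python
import numpy as np

def cone_attention_mask(hop_dist, num_steps, speed=1.0):
    """Hypothetical cone-shaped spatio-temporal mask (assumed rule, not DePT's exact one).

    hop_dist: (A, A) graph hop distances between agents.
    Entry [(i, t), (j, s)] is True iff s <= t and hop_dist[i, j] <= speed * (t - s),
    i.e. information from agent j at time s has had time to reach agent i at time t.
    """
    A = hop_dist.shape[0]
    t = np.arange(num_steps)
    lag = t[:, None] - t[None, :]                                        # (T, T) time lags t - s
    causal = lag >= 0                                                    # attend only to past/present
    reach = hop_dist[None, None, :, :] <= speed * lag[:, :, None, None]  # (T, T, A, A) cone radius
    mask = causal[:, :, None, None] & reach
    # reorder to (A*T, A*T), flattening index (i, t) as i * num_steps + t
    return mask.transpose(2, 0, 3, 1).reshape(A * num_steps, A * num_steps)


if __name__ == "__main__":
    hops = np.array([[0, 1, 2], [1, 0, 1], [2, 1, 0]], dtype=float)
    m = cone_attention_mask(hops, num_steps=4, speed=1.0)
    print(m.shape, int(m.sum()))
```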


3ca6d336ddaa316a6ae953a20b9477cf-Supplemental-Conference.pdf

Neural Information Processing Systems

To tackle a range of noise levels, the training images are corrupted by Gaussian noise with σ randomly chosen from [0, 50]. Swin Transformer: Hierarchical vision transformer using shifted windows.
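That corruption step is straightforward to sketch; the snippet below assumes images with pixel values in [0, 255] and clips after adding noise, both of which are choices not stated in the excerpt.

```python
import numpy as np

def add_random_gaussian_noise(image, sigma_max=50.0, rng=None):
    """Corrupt an image with Gaussian noise whose sigma is drawn uniformly from [0, sigma_max].

    Assumes pixel values in [0, 255]; the clipping choice is an assumption.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = rng.uniform(0.0, sigma_max)                  # random noise level per training image
    noisy = image.astype(np.float32) + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 255.0), sigma


if __name__ == "__main__":
    clean = np.full((64, 64, 3), 128.0, dtype=np.float32)
    noisy, sigma = add_random_gaussian_noise(clean)
    print(sigma, float(np.abs(noisy - clean).mean()))
```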


Transformers as Measure-Theoretic Associative Memory: A Statistical Perspective and Minimax Optimality

Kawata, Ryotaro, Suzuki, Taiji

arXiv.org Machine Learning

Transformers excel through content-addressable retrieval and the ability to exploit contexts of, in principle, unbounded length. We recast associative memory at the level of probability measures, treating a context as a distribution over tokens and viewing attention as an integral operator on measures. Concretely, for mixture contexts $\nu = I^{-1} \sum_{i=1}^{I} \mu^{(i)}$ and a query $x_{\mathrm{q}}$ associated with a component $i^*$, the task decomposes into (i) recall of the relevant component $\mu^{(i^*)}$ and (ii) prediction from $(\mu^{(i^*)}, x_{\mathrm{q}})$. We study learned softmax attention (not a frozen kernel) trained by empirical risk minimization and show that a shallow measure-theoretic Transformer composed with an MLP learns the recall-and-predict map under a spectral assumption on the input densities. We further establish a matching minimax lower bound with the same rate exponent (up to multiplicative constants), proving sharpness of the convergence order. The framework offers a principled recipe for designing and analyzing Transformers that recall from arbitrarily long, distributional contexts with provable generalization guarantees.
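As a toy numeric illustration of the recall step (not the paper's construction or proof setup), one can sample a context from a mixture of component distributions and observe that softmax attention with a component-aligned query concentrates its mass on tokens drawn from the relevant component before producing a readout. The Gaussian components, dimensions, and dot-product score below are all assumptions made for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Assumed toy setup: I component distributions mu^{(i)}, each contributing tokens
# to one context that empirically approximates nu = (1/I) * sum_i mu^{(i)}.
rng = np.random.default_rng(0)
I, tokens_per_comp, d = 3, 32, 8
means = rng.normal(size=(I, d)) * 3.0                        # component means of mu^{(i)}
context = np.concatenate(
    [means[i] + rng.normal(size=(tokens_per_comp, d)) for i in range(I)], axis=0
)
labels = np.repeat(np.arange(I), tokens_per_comp)            # which component each token came from

i_star = 1
x_q = means[i_star] + 0.1 * rng.normal(size=d)               # query tied to component i*

# Softmax attention acting on the (empirical) context measure: weights concentrate on
# tokens drawn from mu^{(i*)} ("recall"), and the weighted readout supports "prediction".
attn = softmax(context @ x_q)
mass_per_component = np.array([attn[labels == i].sum() for i in range(I)])
readout = attn @ context

print("attention mass per component:", np.round(mass_per_component, 3))
print("distance of readout to mean of mu^{(i*)}:", round(float(np.linalg.norm(readout - means[i_star])), 3))
```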