Shehu, Amarda
Birdie: Advancing State Space Models with Reward-Driven Objectives and Curricula
Blouir, Sam, Smith, Jimmy T. H., Anastasopoulos, Antonios, Shehu, Amarda
Efficient state space models (SSMs), such as linear recurrent neural networks and linear attention variants, offer computational advantages over Transformers but struggle with tasks requiring long-range in-context retrieval, such as text copying, associative recall, and question answering over long contexts. Previous efforts to address these challenges have focused on architectural modifications, often reintroducing computational inefficiencies. In this paper, we propose a novel training procedure, Birdie, that significantly enhances the in-context retrieval capabilities of SSMs without altering their architecture. Our approach combines bidirectional input processing with dynamic mixtures of specialized pre-training objectives, optimized via reinforcement learning. We introduce a new bidirectional SSM architecture that seamlessly transitions from bidirectional context processing to causal generation. Experimental evaluations demonstrate that Birdie markedly improves performance on retrieval-intensive tasks such as multi-number phone book lookup, long paragraph question-answering, and infilling. This narrows the performance gap with Transformers, while retaining computational efficiency. Our findings highlight the importance of training procedures in leveraging the fixed-state capacity of SSMs, offering a new direction to advance their capabilities. All code and pre-trained models are available at https://www.github.com/samblouir/birdie, with support for JAX and PyTorch.
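The idea of a dynamic, reward-driven mixture of training objectives can be illustrated with a small bandit-style sampler. This is a minimal sketch under stated assumptions, not the Birdie implementation: the class `ObjectiveMixer`, the objective names, the softmax temperature, and the simulated rewards are all hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(scores, temperature=1.0):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ObjectiveMixer:
    """Hypothetical sampler that favors objectives with higher recent reward."""

    def __init__(self, names, step_size=0.1, temperature=0.5, seed=0):
        self.names = list(names)
        self.values = [0.0] * len(self.names)  # running reward estimates
        self.step_size = step_size
        self.temperature = temperature
        self.rng = random.Random(seed)

    def probabilities(self):
        return softmax(self.values, self.temperature)

    def sample(self):
        # Draw the index of the objective to train on next.
        indices = range(len(self.names))
        return self.rng.choices(indices, weights=self.probabilities())[0]

    def update(self, index, reward):
        # Exponential moving average of the reward seen for this objective.
        self.values[index] += self.step_size * (reward - self.values[index])


# Made-up mean rewards per objective; a real reward might be a loss improvement.
TRUE_REWARD = {"next_token": 0.1, "infilling": 0.3, "copying": 0.6}

mixer = ObjectiveMixer(TRUE_REWARD)
noise_rng = random.Random(1)
for _ in range(500):
    i = mixer.sample()
    reward = TRUE_REWARD[mixer.names[i]] + noise_rng.gauss(0.0, 0.05)
    mixer.update(i, reward)

probs = dict(zip(mixer.names, mixer.probabilities()))
print(probs)
```

In this toy setting the sampling probabilities drift toward the objective with the highest simulated reward while the others keep a nonzero share, which is the general behavior a dynamic mixture is meant to have.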
Accounting for Work Zone Disruptions in Traffic Flow Forecasting
Lu, Yuanjie, Shehu, Amarda, Lattanzi, David
Traffic speed forecasting is an important task in intelligent transportation system management. The objective of much of the current computational research is to minimize the difference between predicted and actual speeds, but information modalities other than speed priors are largely not taken into account. In particular, though state-of-the-art performance is achieved on speed forecasting with graph neural network methods, these methods do not incorporate information on roadway maintenance work zones and their impacts on predicted traffic flows; yet, the impacts of construction work zones are of significant interest to roadway management agencies, because they translate to impacts on the local economy and public well-being. In this paper, we build on the convolutional graph neural network architecture and present a novel "Graph Convolutional Network for Roadway Work Zones" model that includes a novel data fusion mechanism and a new heterogeneous graph aggregation methodology to accommodate work zone information in spatio-temporal dependencies among traffic states. The model is evaluated on two data sets that capture traffic flows in the presence of work zones in the Commonwealth of Virginia. Extensive comparative evaluation and ablation studies show that the proposed model can capture complex and nonlinear spatio-temporal relationships across a transportation corridor, outperforming baseline models, particularly when predicting traffic flow during a work zone event.
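To make the underlying building block concrete, the following is a minimal sketch of one standard graph-convolution step, ReLU(Â X W) with Â the symmetrically normalized adjacency with self-loops, applied to node features that concatenate a speed reading with a work zone flag. It is an assumption-laden illustration, not the paper's model: the three-segment road graph, the feature scaling, and the weight values are hypothetical, and the paper's data fusion and heterogeneous aggregation mechanisms are not reproduced here.

```python
import math


def normalized_adjacency(adj):
    # A_hat = D^(-1/2) (A + I) D^(-1/2), the usual GCN normalization.
    n = len(adj)
    a_self = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
              for i in range(n)]
    deg = [sum(row) for row in a_self]
    return [[a_self[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    # Plain-Python matrix product for small illustrative matrices.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(adj, features, weights):
    # One propagation step: aggregate neighbors, project, apply ReLU.
    propagated = matmul(normalized_adjacency(adj), features)
    projected = matmul(propagated, weights)
    return [[max(0.0, v) for v in row] for row in projected]


# Hypothetical corridor of three road segments in a line: 0 - 1 - 2.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
speeds_mph = [60.0, 55.0, 30.0]   # made-up speed readings
work_zone = [0.0, 0.0, 1.0]       # made-up flag: segment 2 has a work zone

# Simple fusion by concatenation: [scaled speed, work zone flag] per node.
features = [[s / 100.0, w] for s, w in zip(speeds_mph, work_zone)]
weights = [[1.0], [-0.5]]         # hypothetical 2x1 projection

out = gcn_layer(adj, features, weights)
print(out)
```

With these made-up weights the work zone flag lowers the output of the flagged segment and of its neighbor, which shows how a non-speed modality can propagate along the road graph.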
Beyond Single-Model Views for Deep Learning: Optimization versus Generalizability of Stochastic Optimization Algorithms
Inan, Toki Tahmid, Liu, Mingrui, Shehu, Amarda
Despite an extensive body of literature on deep learning optimization, our current understanding of what makes an optimization algorithm effective is fragmented. In particular, we do not understand well whether enhanced optimization translates to improved generalizability. Current research overlooks the inherent stochastic nature of stochastic gradient descent (SGD) and its variants, resulting in a lack of comprehensive benchmarking and insight into their statistical performance. This paper aims to address this gap by adopting a novel approach. Rather than solely evaluating the endpoint of individual optimization trajectories, we draw from an ensemble of trajectories to estimate the stationary distribution of stochastic optimizers. Our investigation encompasses a wide array of techniques, including SGD and its variants, flat-minima optimizers, and new algorithms we propose under the Basin Hopping framework. Through our evaluation, which encompasses synthetic functions with known minima and real-world problems in computer vision and natural language processing, we emphasize fair benchmarking under a statistical framework, comparing stationary distributions and establishing statistical significance. Our study uncovers several key findings regarding the relationship between training loss and hold-out accuracy, as well as the comparable performance of SGD, noise-enabled variants, and novel optimizers utilizing the BH framework. Notably, these algorithms demonstrate performance on par with flat-minima optimizers like SAM, albeit with half the gradient evaluations. We anticipate that our work will catalyze further exploration in deep learning optimization, encouraging a shift away from single-model approaches towards methodologies that acknowledge and leverage the stochastic nature of optimizers.
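The ensemble view described above can be sketched on a synthetic function with known minima. The example below is a toy illustration under stated assumptions, not the paper's benchmark: the double-well function, the Gaussian gradient noise standing in for minibatch noise, and all hyperparameters are hypothetical. It runs many independent noisy gradient-descent trajectories and summarizes the distribution of their endpoints instead of reporting a single run.

```python
import random
import statistics


def loss(x):
    # Synthetic double-well with known minima at x = -1 and x = +1.
    return (x * x - 1.0) ** 2


def grad(x):
    # Analytic gradient of the double-well loss.
    return 4.0 * x * (x * x - 1.0)


def noisy_gd(x0, rng, lr=0.01, noise=0.5, steps=2000):
    # Gradient descent with additive Gaussian noise on each gradient.
    x = x0
    for _ in range(steps):
        x -= lr * (grad(x) + noise * rng.gauss(0.0, 1.0))
    return x


def ensemble_endpoints(n_runs=200, seed=0):
    # Independent trajectories from random starting points in [-2, 2].
    rng = random.Random(seed)
    return [noisy_gd(rng.uniform(-2.0, 2.0), rng) for _ in range(n_runs)]


endpoints = ensemble_endpoints()
final_losses = [loss(x) for x in endpoints]

# Summaries of the endpoint distribution rather than of one trajectory.
mean_loss = statistics.mean(final_losses)
std_loss = statistics.stdev(final_losses)
share_right_basin = sum(x > 0 for x in endpoints) / len(endpoints)
print(mean_loss, std_loss, share_right_basin)
```

Comparing two optimizers would then mean comparing two such endpoint samples with a statistical test, instead of comparing two individual final losses.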
Traffic Flow Forecasting with Maintenance Downtime via Multi-Channel Attention-Based Spatio-Temporal Graph Convolutional Networks
Lu, Yuanjie, Kamranfar, Parastoo, Lattanzi, David, Shehu, Amarda
Forecasting traffic flows is a central task in intelligent transportation system management. Graph structures have shown promise as a modeling framework, with recent advances in spatio-temporal modeling via graph convolution neural networks improving the performance or extending the prediction horizon on traffic flows. However, a key shortcoming of state-of-the-art methods is their inability to take into account information of various modalities, for instance the impact of maintenance downtime on traffic flows. This is the issue we address in this paper. Specifically, we propose a novel model to predict traffic speed under the impact of construction work. The model is based on the powerful attention-based spatio-temporal graph convolution architecture but utilizes various channels to integrate different sources of information, explicitly builds spatio-temporal dependencies among traffic states, captures the relationships between heterogeneous roadway networks, and then predicts changes in traffic flow resulting from maintenance downtime events. The model is evaluated on two benchmark datasets and a novel dataset we have collected over the bustling Tysons Corner region in Northern Virginia. Extensive comparative experiments and ablation studies show that the proposed model can capture complex and nonlinear spatio-temporal relationships across a transportation corridor, outperforming baseline models.
Generating Tertiary Protein Structures via an Interpretative Variational Autoencoder
Guo, Xiaojie, Du, Yuanqi, Tadepalli, Sivani, Zhao, Liang, Shehu, Amarda
Much scientific enquiry across disciplines is founded upon a mechanistic treatment of dynamic systems that ties form to function. A highly visible instance of this is in molecular biology, where an important goal is to determine functionally-relevant forms/structures that a protein molecule employs to interact with molecular partners in the living cell. This goal is typically pursued under the umbrella of stochastic optimization with algorithms that optimize a scoring function. Research repeatedly shows that current scoring functions, though steadily improving, correlate weakly with molecular activity. Inspired by recent momentum in generative deep learning, this paper proposes and evaluates an alternative approach to generating functionally-relevant three-dimensional structures of a protein. Though deep generative models typically struggle with highly-structured data, the work presented here circumvents this challenge via graph-generative models. A comprehensive evaluation of several deep architectures shows the promise of generative models in directly revealing the latent space for sampling novel tertiary structures, as well as in highlighting axes/factors that carry structural meaning and open the black box often associated with deep models. The work presented here is a first step towards interpretative, deep generative models becoming viable and informative complementary approaches to protein structure prediction.
Interpretable Deep Graph Generation with Node-Edge Co-Disentanglement
Guo, Xiaojie, Zhao, Liang, Qin, Zhao, Wu, Lingfei, Shehu, Amarda, Ye, Yanfang
Disentangled representation learning has recently attracted a significant amount of attention, particularly in the field of image representation learning. However, learning the disentangled representations behind a graph remains largely unexplored, especially for attributed graphs with both node and edge features. Disentanglement learning for graph generation poses substantial new challenges, including 1) the lack of graph deconvolution operations to jointly decode node and edge attributes; and 2) the difficulty in enforcing the disentanglement among latent factors that respectively influence: i) only nodes, ii) only edges, and iii) joint patterns between them. To address these challenges, we propose a new disentanglement enhancement framework for deep generative models for attributed graphs. In particular, a novel variational objective is proposed to disentangle the above three types of latent factors, with a novel architecture for node and edge deconvolutions. Moreover, within each type, individual-factor-wise disentanglement is further enhanced, which is shown to be a generalization of the existing framework for images. Qualitative and quantitative experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed model and its extensions.
The AAAI-13 Conference Workshops
Agrawal, Vikas (IBM Research-India) | Archibald, Christopher (Mississippi State University) | Bhatt, Mehul (University of Bremen) | Bui, Hung (Nuance) | Cook, Diane J. (Washington State University) | Cortés, Juan (University of Toulouse) | Geib, Christopher (Drexel University) | Gogate, Vibhav (University of Texas at Dallas) | Guesgen, Hans W. (Massey University) | Jannach, Dietmar (TU Dortmund) | Johanson, Michael (University of Alberta) | Kersting, Kristian (University of Bonn) | Konidaris, George (Massachusetts Institute of Technology) | Kotthoff, Lars (University College Cork) | Michalowski, Martin (Adventium Labs) | Natarajan, Sriraam (Indiana University) | O'Sullivan, Barry (University College Cork) | Pickett, Marc (Naval Research Laboratory) | Podobnik, Vedran (University of Zagreb) | Poole, David (University of British Columbia) | Shastri, Lokendra (GM Research, India) | Shehu, Amarda (George Mason University) | Sukthankar, Gita (University of Central Florida)