Banff
Sharp bounds for the number of regions of maxout networks and vertices of Minkowski sums
Montúfar, Guido, Ren, Yue, Zhang, Leon
We present results on the number of linear regions of the functions that can be represented by artificial feedforward neural networks with maxout units. A rank-$k$ maxout unit is a function computing the maximum of $k$ linear functions. For networks with a single layer of maxout units, the linear regions correspond to the upper vertices of a Minkowski sum of polytopes. We obtain face counting formulas in terms of the intersection posets of tropical hypersurfaces or the number of upper faces of partial Minkowski sums, along with explicit sharp upper bounds for the number of regions for any input dimension, any number of units, and any ranks, in the cases with and without biases. Based on these results we also obtain asymptotically sharp upper bounds for networks with multiple layers.
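As a rough illustration of the objects in this abstract, the sketch below evaluates a single layer of rank-3 maxout units on a 1D input and counts the distinct activation patterns (which linear function attains the maximum in each unit) seen on a dense grid. The function names, weights, and grid are illustrative assumptions and are not taken from the paper. A grid count is only a numerical estimate of the number of linear regions, not one of the paper's counting formulas.

```python
import numpy as np

def maxout_layer_patterns(W, b, X):
    """Return, for each input, the index of the maximizing linear piece per unit.

    W: (units, k, d) weights, b: (units, k) biases, X: (n, d) inputs.
    """
    pre = np.einsum("ukd,nd->nuk", W, X) + b[None, :, :]
    return pre.argmax(axis=2)

def count_regions_on_grid(W, b, X):
    """Count distinct activation patterns observed on the sample points X."""
    patterns = maxout_layer_patterns(W, b, X)
    return len({tuple(p) for p in patterns})

# Hypothetical example: two rank-3 maxout units on a 1D input.
# Unit 1 computes max(-x, 0.5, x); unit 2 computes max(-x - 3, 0, x - 2).
W = np.array([[[-1.0], [0.0], [1.0]],
              [[-1.0], [0.0], [1.0]]])
b = np.array([[0.0, 0.5, 0.0],
              [-3.0, 0.0, -2.0]])
X = np.linspace(-5.0, 5.0, 10001).reshape(-1, 1)

n_regions = count_regions_on_grid(W, b, X)
```

In this toy case the breakpoints lie at -3, -0.5, 0.5, and 2, so the grid sees five activation patterns.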
Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild
Hegde, Sindhu B, Prajwal, K R, Mukhopadhyay, Rudrabha, Namboodiri, Vinay P, Jawahar, C. V.
In this work, we address the problem of generating speech from silent lip videos for any speaker in the wild. In stark contrast to previous works, our method (i) is not restricted to a fixed number of speakers, (ii) does not explicitly impose constraints on the domain or the vocabulary and (iii) deals with videos that are recorded in the wild as opposed to within laboratory settings. The task presents a host of challenges, with the key one being that many features of the desired target speech, like voice, pitch and linguistic content, cannot be entirely inferred from the silent face video. In order to handle these stochastic variations, we propose a new VAE-GAN architecture that learns to associate the lip and speech sequences amidst the variations. With the help of multiple powerful discriminators that guide the training process, our generator learns to synthesize speech sequences in any voice for the lip movements of any person. Extensive experiments on multiple datasets show that we outperform all baselines by a large margin. Further, our network can be fine-tuned on videos of specific identities to achieve a performance comparable to single-speaker models that are trained on $4\times$ more data. We conduct numerous ablation studies to analyze the effect of different modules of our architecture. We also provide a demo video that demonstrates several qualitative results along with the code and trained models on our website: \url{http://cvit.iiit.ac.in/research/projects/cvit-projects/lip-to-speech-synthesis}
Large-Scale Auto-Regressive Modeling Of Street Networks
Birsak, Michael, Kelly, Tom, Para, Wamiq, Wonka, Peter
We present a novel generative method for the creation of city-scale road layouts. While the output of recent methods is limited in both size of the covered area and diversity, our framework produces large traversable graphs of high quality consisting of vertices and edges representing complete street networks covering 400 square kilometers or more. While our framework can process general 2D embedded graphs, we focus on street networks due to the wide availability of training data. Our generative framework consists of a transformer decoder that is used in a sliding window manner to predict a field of indices, with each index encoding a representation of the local neighborhood. The semantics of each index is determined by a dictionary of context vectors. The index field is then input to a decoder to compute the street graph. Using data from OpenStreetMap, we train our system on whole cities and even across large countries such as the US, and finally compare it to the state of the art.
A Deep Neural Networks ensemble workflow from hyperparameter search to inference leveraging GPU clusters
Pochelu, Pierrick, Petiton, Serge G., Conche, Bruno
Automated Machine Learning with ensembling (or AutoML with ensembling) seeks to automatically build ensembles of Deep Neural Networks (DNNs) to achieve high-quality predictions. Ensembles of DNNs are well known to avoid over-fitting, but they are memory- and time-consuming approaches. Therefore, an ideal AutoML system would produce, in a single run, different ensembles that trade off accuracy and inference speed. While previous works on AutoML focus on searching for the best model to maximize its generalization ability, we instead propose a new AutoML approach that builds a larger library of accurate and diverse individual models from which ensembles are then constructed. First, our extensive benchmarks show that asynchronous Hyperband is an efficient and robust way to build a large number of diverse models to combine. Then, a new ensemble selection method based on a multi-objective greedy algorithm is proposed to generate accurate ensembles while controlling their computing cost. Finally, we propose a novel algorithm to optimize the inference of the DNN ensemble in a GPU cluster based on allocation optimization. The resulting AutoML with ensembling method shows robust results on two datasets while using GPU clusters efficiently during both the training phase and the inference phase. Deep Neural Networks (DNNs) are notoriously difficult to tune, train, and ensemble to achieve state-of-the-art results. Automatic machine learning with ensembling, or "AutoML+ensembling", tools provide a simple interface to train and evaluate many ensembles of DNNs to achieve high accuracy by reducing overfitting. Many researchers and practitioners now understand the benefit of ensembling DNNs. Further, several winners and top performers in challenges routinely use ensembles to improve accuracy. However, ensembles of DNNs suffer from three main limitations that prevent them from being widely deployed in research and industrial applications.
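The idea of greedily selecting an ensemble under a computing-cost constraint can be sketched as follows. This is a generic greedy forward selection with a cost budget, written under simplifying assumptions (binary classification, averaged probabilities, a single scalar cost per model). It is not the multi-objective algorithm from the paper, and all names and numbers are hypothetical.

```python
import numpy as np

def ensemble_accuracy(probs, labels, members):
    """Accuracy of the averaged class-1 probabilities of the chosen members."""
    mean_prob = np.mean(probs[members], axis=0)
    return float(np.mean((mean_prob > 0.5) == (labels == 1)))

def greedy_select(probs, labels, costs, budget):
    """Add, one at a time, the model that most improves validation accuracy
    while the total cost stays within the budget."""
    selected, total_cost, best_acc = [], 0.0, 0.0
    while True:
        best_candidate, best_candidate_acc = None, best_acc
        for i in range(len(probs)):
            if i in selected or total_cost + costs[i] > budget:
                continue
            acc = ensemble_accuracy(probs, labels, selected + [i])
            if acc > best_candidate_acc:  # require a strict improvement
                best_candidate, best_candidate_acc = i, acc
        if best_candidate is None:
            return selected, best_acc
        selected.append(best_candidate)
        total_cost += costs[best_candidate]
        best_acc = best_candidate_acc

# Hypothetical validation data: 3 models, 4 samples, class-1 probabilities.
labels = np.array([0, 1, 1, 0])
probs = np.array([[0.2, 0.9, 0.4, 0.1],
                  [0.6, 0.8, 0.9, 0.3],
                  [0.9, 0.1, 0.2, 0.8]])
costs = [1.0, 1.0, 1.0]

chosen, acc = greedy_select(probs, labels, costs, budget=2.0)
```

Here the first two models each make one mistake on their own, but their average classifies all four samples correctly, so the greedy search selects both and stops when the budget is exhausted.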
Robust 3D Vision for Autonomous Robots
This paper presents a fault-tolerant 3D vision system for autonomous robotic operation. In particular, pose estimation of space objects is achieved using 3D vision data in an integrated Kalman filter (KF) and an Iterative Closest Point (ICP) algorithm in a closed-loop configuration. The initial guess for the internal ICP iteration is provided by the state estimate propagation of the Kalman filter. The Kalman filter is capable of estimating not only the target's states but also its inertial parameters. This allows the motion of the target to be predictable as soon as the filter converges. Consequently, the ICP can maintain pose tracking over a wider range of velocities due to the increased precision of ICP initialization. Furthermore, incorporating the target's dynamics model in the estimation process allows the estimator to continuously provide pose estimates even when the sensor temporarily loses its signal, namely due to obstruction. The capabilities of the pose estimation methodology are demonstrated on a ground testbed for Automated Rendezvous & Docking. In this experiment, Neptec's Laser Camera System (LCS) is used for real-time scanning of a satellite model attached to a manipulator arm, which is driven by a simulator according to orbital and attitude dynamics. The results showed that robust tracking of the free-floating tumbling satellite can be achieved only if the Kalman filter and ICP are in a closed-loop configuration.
Latent Heterogeneous Graph Network for Incomplete Multi-View Learning
Zhu, Pengfei, Yao, Xinjie, Wang, Yu, Cao, Meng, Hui, Binyuan, Zhao, Shuai, Hu, Qinghua
Multi-view learning has progressed rapidly in recent years. Although many previous studies assume that each instance appears in all views, it is common in real-world applications for instances to be missing from some views, resulting in incomplete multi-view data. To tackle this problem, we propose a novel Latent Heterogeneous Graph Network (LHGN) for incomplete multi-view learning, which aims to use multiple incomplete views as fully as possible in a flexible manner. By learning a unified latent representation, a trade-off between consistency and complementarity among different views is implicitly realized. To explore the complex relationship between samples and latent representations, a neighborhood constraint and a view-existence constraint are proposed, for the first time, to construct a heterogeneous graph. Finally, to avoid any inconsistencies between training and test phase, a transductive learning technique is applied based on graph learning for classification tasks. Extensive experimental results on real-world datasets demonstrate the effectiveness of our model over existing state-of-the-art approaches.
A Review of Knowledge Graph Completion
Zamini, Mohamad, Reza, Hassan, Rabiei, Minou
Information extraction methods have proved to be effective at triple extraction from structured or unstructured data. The organization of such triples in the form of (head entity, relation, tail entity) is called the construction of Knowledge Graphs (KGs). Most current knowledge graphs are incomplete. In order to use KGs in downstream tasks, it is desirable to predict missing links in KGs. Different approaches have recently been proposed for representation learning of KGs by embedding both entities and relations into a low-dimensional vector space, aiming to predict unknown triples based on previously visited triples. According to whether triples are treated independently or dependently, we divide the task of knowledge graph completion into conventional and graph neural network (GNN) representation learning, and we discuss both in more detail. In conventional approaches, each triple is processed independently, whereas in GNN-based approaches, triples also take their local neighborhood into account.
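To make the (head entity, relation, tail entity) representation and embedding-based link prediction concrete, the sketch below scores triples with a translation-style distance, as in TransE, one well-known conventional approach. The entities, relation, and hand-set 2D embeddings are toy assumptions chosen for illustration. In practice the embeddings would be learned from the known triples.

```python
import numpy as np

# Hypothetical hand-set embeddings; a real model would learn these.
E = {
    "Paris": np.array([0.0, 0.0]),
    "France": np.array([1.0, 0.0]),
    "Berlin": np.array([0.0, 1.0]),
    "Germany": np.array([1.0, 1.0]),
}
R = {"capital_of": np.array([1.0, 0.0])}

# Known triples in (head, relation, tail) form.
known = {("Paris", "capital_of", "France")}

def transe_score(h, r, t):
    """Higher is better: negative distance between h + r and t."""
    return -float(np.linalg.norm(E[h] + R[r] - E[t]))

def predict_tail(h, r):
    """Rank candidate tails for a missing link (h, r, ?), skipping known triples."""
    candidates = [t for t in E if t != h and (h, r, t) not in known]
    return max(candidates, key=lambda t: transe_score(h, r, t))

predicted = predict_tail("Berlin", "capital_of")
```

Each triple is scored independently here, which matches the description of conventional approaches. A GNN-based model would additionally aggregate information from each entity's local neighborhood.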
High-quality Task Division for Large-scale Entity Alignment
Liu, Bing, Hua, Wen, Zuccon, Guido, Zhao, Genghong, Zhang, Xia
Entity Alignment (EA) aims to match equivalent entities that refer to the same real-world objects and is a key step for Knowledge Graph (KG) fusion. Most neural EA models cannot be applied to large-scale real-life KGs due to their excessive consumption of GPU memory and time. One promising solution is to divide a large EA task into several subtasks such that each subtask only needs to match two small subgraphs of the original KGs. However, it is challenging to divide the EA task without losing effectiveness. Existing methods display low coverage of potential mappings, insufficient evidence in context graphs, and largely differing subtask sizes. In this work, we design the DivEA framework for large-scale EA with high-quality task division. To include in the EA subtasks a high proportion of the potential mappings originally present in the large EA task, we devise a counterpart discovery method that exploits the locality principle of the EA task and the power of trained EA models. Unique to our counterpart discovery method is the explicit modelling of the chance of a potential mapping. We also introduce an evidence passing mechanism to quantify the informativeness of context entities and find the most informative context graphs with flexible control of the subtask size. Extensive experiments show that DivEA achieves higher EA performance than alternative state-of-the-art solutions.
Comparison-based Conversational Recommender System with Relative Bandit Feedback
Xie, Zhihui, Yu, Tong, Zhao, Canzhe, Li, Shuai
With recent advances in conversational recommendation, a recommender system is able to actively and dynamically elicit user preferences via conversational interactions. To achieve this, the system periodically queries users' preferences on attributes and collects their feedback. However, most existing conversational recommender systems only enable the user to provide absolute feedback on the attributes. In practice, absolute feedback is usually limited, as users tend to provide biased feedback when expressing their preferences. Instead, the user is often more inclined to express comparative preferences, since user preferences are inherently relative. To enable users to provide comparative preferences during conversational interactions, we propose a novel comparison-based conversational recommender system. Relative feedback, though more practical, is not easy to incorporate, since its feedback scale is always mismatched with users' absolute preferences. To effectively collect and understand relative feedback in an interactive manner, we further propose a new bandit algorithm, which we call RelativeConUCB. Experiments on both synthetic and real-world datasets validate the advantage of our proposed method over existing bandit algorithms in conversational recommender systems.
A Generic Self-Supervised Framework of Learning Invariant Discriminative Features
Ntelemis, Foivos, Jin, Yaochu, Thomas, Spencer A.
Self-supervised learning (SSL) has become a popular method for generating invariant representations without the need for human annotations. Nonetheless, the desired invariant representation is achieved by utilising prior online transformation functions on the input data. As a result, each SSL framework is customised for a particular data type, e.g., visual data, and further modifications are required if it is used for other dataset types. On the other hand, the autoencoder (AE), which is a generic and widely applicable framework, mainly focuses on dimension reduction and is not suited for learning invariant representations. This paper proposes a generic SSL framework based on a constrained self-labelling assignment process that prevents degenerate solutions. Specifically, the prior transformation functions are replaced with a self-transformation mechanism, derived through an unsupervised training process of adversarial training, for imposing invariant representations. Via the self-transformation mechanism, pairs of augmented instances can be generated from the same input data. Finally, a training objective based on contrastive learning is designed by leveraging both the self-labelling assignment and the self-transformation mechanism. Despite the fact that the self-transformation process is very generic, the proposed training strategy outperforms a majority of state-of-the-art representation learning methods based on AE structures. To validate the performance of our method, we conduct experiments on four types of data, namely visual, audio, text, and mass spectrometry data, and compare them in terms of four quantitative metrics. Our comparison results indicate that the proposed method demonstrates robustness and successfully identifies patterns within the datasets.