Deep learning methods in speaker recognition: a review
Sztahó, Dávid, Szaszák, György, Beke, András
This paper summarizes applied deep learning practices in the field of speaker recognition, covering both verification and identification. Speaker recognition has long been a widely studied topic in speech technology. Many research works have been carried out, yet little progress has been achieved in the past 5-6 years. However, as deep learning (DL) techniques advance in most machine learning fields, they are replacing the former state-of-the-art methods in speaker recognition too. DL appears to be becoming the new state-of-the-art solution for both speaker verification and identification. The standard x-vectors, in addition to i-vectors, are used as baselines in most of the recent works. The increasing amount of gathered data opens up the territory to DL methods, where they are most effective.
Sequential Recommendation with Relation-Aware Kernelized Self-Attention
Ji, Mingi, Joo, Weonyoung, Song, Kyungwoo, Kim, Yoon-Yeong, Moon, Il-Chul
Recent studies have found that sequential recommendation is improved by the attention mechanism. Following this development, we propose Relation-Aware Kernelized Self-Attention (RKSA), which adopts the self-attention mechanism of the Transformer and augments it with a probabilistic model. The original self-attention of the Transformer is a deterministic measure without relation-awareness. Therefore, we introduce a latent space to the self-attention, and the latent space models the recommendation context from relations as a multivariate skew-normal distribution with a kernelized covariance matrix built from co-occurrences, item characteristics, and user information. This work merges the self-attention of the Transformer and sequential recommendation by adding a probabilistic model of the recommendation task specifics. We evaluated RKSA on benchmark datasets, and RKSA shows significant improvements compared to recent baseline models. Also, RKSA was able to produce a latent space model that explains the reasons for a recommendation.
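For orientation, the deterministic self-attention that the abstract takes as its starting point can be sketched as follows. This is plain causal scaled dot-product attention in NumPy, not RKSA itself: the latent skew-normal model and kernelized covariance are not reproduced here, and the function name, shapes, and random inputs are assumptions made for illustration.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Deterministic scaled dot-product self-attention over item embeddings.

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head).
    Returns the attended outputs and the attention weight matrix.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Causal mask: position t may only attend to positions <= t.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax, stabilised by subtracting the row maximum.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Hypothetical toy input: a sequence of 5 items with 8-dimensional embeddings.
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out, attn = causal_self_attention(x, w_q, w_k, w_v)
```

Each row of `attn` is a fixed function of the embeddings, which is the sense in which the abstract calls the original mechanism deterministic.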
ASCAI: Adaptive Sampling for acquiring Compact AI
Javaheripi, Mojan, Samragh, Mohammad, Javidi, Tara, Koushanfar, Farinaz
This paper introduces ASCAI, a novel adaptive sampling methodology that can learn how to effectively compress Deep Neural Networks (DNNs) for accelerated inference on resource-constrained platforms. Modern DNN compression techniques comprise various hyperparameters that require per-layer customization to ensure high accuracy. Choosing such hyperparameters is cumbersome as the pertinent search space grows exponentially with the number of model layers. To effectively traverse this large space, we devise an intelligent sampling mechanism that adapts the sampling strategy using customized operations inspired by genetic algorithms. As a special case, we consider the space of model compression as a vector space. The adaptively selected samples enable ASCAI to automatically learn how to tune per-layer compression hyperparameters to optimize the accuracy/model-size trade-off. Our extensive evaluations show that ASCAI outperforms rule-based and reinforcement learning methods in terms of compression rate and/or accuracy.
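To make the idea of genetic-algorithm-style sampling over per-layer hyperparameters concrete, here is a minimal sketch of a generic genetic search over per-layer pruning rates. It is not the ASCAI algorithm: the `fitness` function is a hypothetical stand-in for a real accuracy/model-size evaluation, and the `sensitivity` values, population size, and mutation scale are assumptions chosen only for illustration.

```python
import random

def fitness(rates, sensitivity):
    """Toy score (hypothetical proxy): reward compression, penalise pruning sensitive layers."""
    compression = sum(rates) / len(rates)
    accuracy_loss = sum(s * r * r for s, r in zip(sensitivity, rates))
    return compression - accuracy_loss

def evolve(sensitivity, pop_size=20, generations=30, mutation_std=0.1, seed=0):
    """Search per-layer pruning rates in [0, 1] with selection, crossover and mutation."""
    rng = random.Random(seed)
    n = len(sensitivity)
    population = [[rng.random() for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda r: fitness(r, sensitivity), reverse=True)
        parents = population[: pop_size // 2]  # selection: keep the fitter half unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover: each layer inherits its rate from one of the two parents.
            child = [rng.choice(pair) for pair in zip(a, b)]
            # Mutation: small Gaussian perturbation, clipped back into [0, 1].
            child = [min(1.0, max(0.0, r + rng.gauss(0.0, mutation_std))) for r in child]
            children.append(child)
        population = parents + children
    return max(population, key=lambda r: fitness(r, sensitivity))

# Hypothetical four-layer model; a larger value means the layer tolerates less pruning.
sensitivity = [4.0, 1.0, 0.5, 2.0]
best = evolve(sensitivity)
```

Because the fitter half of each generation is carried over unchanged, the best score never decreases from one generation to the next.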
$\ell_{\infty}$ Vector Contraction for Rademacher Complexity
Foster, Dylan J., Rakhlin, Alexander
Rademacher complexity plays a fundamental role in learning theory, where it tightly bounds the supremum of the empirical process (Koltchinskii and Panchenko, 2000; Bartlett and Mendelson, 2003) and is used to prove generalization guarantees for empirical risk minimization and other learning rules.
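As a concrete illustration of the quantity involved, the empirical Rademacher complexity of a finite function class can be estimated by Monte Carlo. The sketch below assumes the standard definition, the expectation over random signs of the largest sign-weighted sample average; the function name and the toy function class are hypothetical, and the code does not touch the vector contraction result itself.

```python
import numpy as np

def empirical_rademacher(outputs, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ max_f (1/n) * sum_i sigma_i * f(x_i) ].

    outputs: (n_functions, n_samples) array; row j holds f_j evaluated on the sample.
    """
    rng = np.random.default_rng(seed)
    n = outputs.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))  # i.i.d. Rademacher signs
    correlations = sigma @ outputs.T / n                 # shape (n_draws, n_functions)
    return correlations.max(axis=1).mean()

# Hypothetical classes evaluated on 50 sample points, with values in [-1, 1].
rng = np.random.default_rng(1)
f = rng.uniform(-1, 1, size=(1, 50))
extra = rng.uniform(-1, 1, size=(8, 50))
small = empirical_rademacher(np.vstack([f, -f]))
large = empirical_rademacher(np.vstack([f, -f, extra]))
```

With the same seed both calls draw the same signs, so the estimate for the larger class is at least as large as for the smaller one, reflecting that richer classes correlate better with random noise.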
Optimal Mini-Batch Size Selection for Fast Gradient Descent
Perrone, Michael P., Khan, Haidar, Kim, Changhoan, Kyrillidis, Anastasios, Quinn, Jerry, Salapura, Valentina
Jerry Quinn, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598. Valentina Salapura, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598. Abstract: This paper presents a methodology for selecting the mini-batch size that minimizes Stochastic Gradient Descent (SGD) learning time for single and multiple learner problems. By de-coupling algorithmic analysis issues from hardware and software implementation details, we reveal a robust empirical inverse law between mini-batch size and the average number of SGD updates required to converge to a specified error threshold. Combining this empirical inverse law with measured system performance, we create an accurate, closed-form model of average training time and show how this model can be used to identify quantifiable implications for both algorithmic and hardware aspects of machine learning. We demonstrate the inverse law empirically, on both image recognition (MNIST, CIFAR10 and CIFAR100) and machine translation (Europarl) tasks, and provide a theoretic justification via proving a novel bound on mini-batch SGD training. Introduction: In this paper, we present an empirical law, with theoretical justification, linking the number of learning iterations to the mini-batch size. From this result, we derive a principled methodology for selecting mini-batch size w.r.t. This methodology saves training time and provides both intuition and a principled approach for optimizing machine learning algorithms and machine learning hardware system design. Further, we use our methodology to show that focusing on weak scaling can lead to suboptimal training times because, by neglecting the dependence of convergence time on the size of the mini-batch used, weak scaling does not always minimize the training time.
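The way an inverse law combines with measured system performance into a closed-form training-time model can be sketched as below. The specific forms are assumptions made for illustration rather than the paper's fitted model: updates to converge are taken as N(B) = n_inf + alpha / B, and time per update as t_fixed + t_per_sample * B. All parameter names and values are hypothetical.

```python
import math

def updates_to_converge(batch_size, n_inf, alpha):
    """Inverse law (assumed form): N(B) = n_inf + alpha / B."""
    return n_inf + alpha / batch_size

def training_time(batch_size, n_inf, alpha, t_fixed, t_per_sample):
    """Total time = updates needed * time per update (assumed linear in batch size)."""
    time_per_update = t_fixed + t_per_sample * batch_size
    return updates_to_converge(batch_size, n_inf, alpha) * time_per_update

def optimal_batch_size(n_inf, alpha, t_fixed, t_per_sample):
    """Setting dT/dB = n_inf * t_per_sample - alpha * t_fixed / B**2 to zero gives B*."""
    return math.sqrt(alpha * t_fixed / (n_inf * t_per_sample))

# Hypothetical constants: 1000 updates in the large-batch limit, 1 ms fixed overhead
# per update and 10 microseconds of compute per sample.
params = dict(n_inf=1000, alpha=64000, t_fixed=1e-3, t_per_sample=1e-5)
b_star = optimal_batch_size(**params)
t_star = training_time(b_star, **params)
```

Under these assumptions a larger batch reduces the number of updates but makes each update slower, so total time has an interior minimum. This is the trade-off the abstract says weak scaling neglects.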
Modelling EHR timeseries by restricting feature interaction
Zhang, Kun, Xue, Yuan, Flores, Gerardo, Rajkomar, Alvin, Cui, Claire, Dai, Andrew M.
Time series data are prevalent in electronic health records, mostly in the form of physiological parameters such as vital signs and lab tests. The patterns of these values may be significant indicators of patients' clinical states, and there might be patterns that are unknown to clinicians but are highly predictive of some outcomes. Many of these values are also missing, which makes it difficult to apply existing methods like decision trees. We propose a recurrent neural network model that reduces overfitting to noisy observations by limiting interactions between features. We analyze its performance on mortality, ICD-9 and AKI prediction from observational values on the Medical Information Mart for Intensive Care III (MIMIC-III) dataset. Our models result in an improvement of 1.1% [p<0.01] in AU-ROC for mortality prediction under the MetaVision subset and 1.0% and 2.2% [p<0.01] respectively for mortality and AKI under the full MIMIC-III dataset compared to existing state-of-the-art interpolation, embedding and decay-based recurrent models.
Mining News Events from Comparable News Corpora: A Multi-Attribute Proximity Network Modeling Approach
Kim, Hyungsul, El-Kishky, Ahmed, Ren, Xiang, Han, Jiawei
We present ProxiModel, a novel event mining framework for extracting high-quality structured event knowledge from large, redundant, and noisy news data sources. The proposed model differentiates itself from other approaches by modeling the event correlation both within each individual document and across the corpus. To facilitate this, we introduce the concept of a proximity network, a novel space-efficient data structure that enables scalable event mining. This proximity network captures the corpus-level co-occurrence statistics for candidate event descriptors, event attributes, as well as their connections. We probabilistically model the proximity network as a generative process with sparsity-inducing regularization. This allows us to efficiently and effectively extract high-quality and interpretable news events. Experiments on three different news corpora demonstrate that the proposed method is effective and robust at generating high-quality event descriptors and attributes. We briefly detail many interesting applications of our proposed framework such as news summarization, event tracking and multi-dimensional analysis on news. Finally, we explore a case study on visualizing the events for a Japan Tsunami news corpus and demonstrate ProxiModel's ability to automatically summarize emerging news events.
Seq-U-Net: A One-Dimensional Causal U-Net for Efficient Sequence Modelling
Stoller, Daniel, Tian, Mi, Ewert, Sebastian, Dixon, Simon
Convolutional neural networks (CNNs) with dilated filters such as the Wavenet or the Temporal Convolutional Network (TCN) have shown good results in a variety of sequence modelling tasks. However, efficiently modelling long-term dependencies in these sequences is still challenging. Although the receptive field of these models grows exponentially with the number of layers, computing the convolutions over very long sequences of features in each layer is time- and memory-intensive, prohibiting the use of longer receptive fields in practice. To increase efficiency, we make use of the "slow feature" hypothesis stating that many features of interest are slowly varying over time. For this, we use a U-Net architecture that computes features at multiple time-scales and adapt it to our auto-regressive scenario by making convolutions causal. We apply our model ("Seq-U-Net") to a variety of tasks including language and audio generation. In comparison to TCN and Wavenet, our network consistently saves memory and computation time, with speed-ups for training and inference of over 4x in the audio generation experiment in particular, while achieving a comparable performance in all tasks.
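The two properties the abstract relies on, causality and a receptive field that grows exponentially with depth, can be illustrated with a minimal NumPy sketch of a stack of causal dilated convolutions. This shows the Wavenet/TCN-style baseline layer only, not the Seq-U-Net architecture; the function names, kernels, and dilation schedule are assumptions for illustration.

```python
import numpy as np

def causal_dilated_conv1d(x, kernel, dilation=1):
    """y[t] = sum_j kernel[j] * x[t - j * dilation], with zeros before the sequence start."""
    pad = (len(kernel) - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])  # left-pad only, so no future leakage
    return np.array([
        sum(kernel[j] * padded[pad + t - j * dilation] for j in range(len(kernel)))
        for t in range(len(x))
    ])

def receptive_field(kernel_size, dilations):
    """Number of past inputs (including the current one) that influence one output."""
    return 1 + (kernel_size - 1) * sum(dilations)

# Hypothetical stack: kernel size 2 with dilations doubling per layer.
dilations = [1, 2, 4, 8]
impulse = np.zeros(32)
impulse[0] = 1.0
response = impulse
for d in dilations:
    response = causal_dilated_conv1d(response, np.ones(2), dilation=d)
```

With doubling dilations, four layers already cover 16 time steps, but each layer still computes an output at every time step. That per-step cost over long sequences is the expense the abstract describes as prohibitive.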
Solving Inverse Problems by Joint Posterior Maximization with a VAE Prior
González, Mario, Almansa, Andrés, Delbracio, Mauricio, Musé, Pablo, Tan, Pauline
In this paper we address the problem of solving ill-posed inverse problems in imaging where the prior is a neural generative model. Specifically we consider the decoupled case where the prior is trained once and can be reused for many different log-concave degradation models without retraining. Whereas previous MAP-based approaches to this problem lead to highly non-convex optimization algorithms, our approach computes the joint (space-latent) MAP that naturally leads to alternate optimization algorithms and to the use of a stochastic encoder to accelerate computations. The resulting technique is called JPMAP because it performs Joint Posterior Maximization using an Autoencoding Prior. We show theoretical and experimental evidence that the proposed objective function is quite close to bi-convex. Indeed it satisfies a weak bi-convexity property which is sufficient to guarantee that our optimization scheme converges to a stationary point. Experimental results also show the higher quality of the solutions obtained by our JPMAP approach with respect to other non-convex MAP approaches which more often get stuck in spurious local optima.
Predicting Drug-Drug Interactions from Molecular Structure Images
Dhami, Devendra Singh, Kunapuli, Gautam, Page, David, Natarajan, Sriraam
Adverse drug events (ADEs) are "injuries resulting from medical intervention related to a drug" (Nebeker, Barach, and Samore 2004), and are distinct from medication errors (inappropriate prescription, dispensing, usage etc.) as they are caused by drugs at normal dosages. According to the National Center for Health Statistics (NCHS 2014), 48.9% of Americans took at least one prescription drug in the last 30 days, 23.1% took at least three, and 11.9% took at least