Supplementary Material: AutoSync: Learning to Synchronize for Data-Parallel Distributed Deep Learning
As $\delta$ is relatively small compared with the other time-consumption terms, we treat it as a constant that does not scale with variable size. The second process introduces network overhead (e.g., latency) and network communication. For $v_i \in V_{CC}$, we model the 5 most commonly used collective primitives: AllReduce, ReduceScatter, AllGather, Broadcast, and Reduce [12]. $I_1$, $I_2$, $I_3$ are true when AllReduce, (ReduceScatter and AllGather), and (Broadcast and Reduce) are activated, respectively. Note that the LSTM walks through each $v_i \in V^0_{G,\theta}$ strictly following their original forward (backward) order in the computational graph, so as to inject this information into the modeling. A global load balancer (clb) and group assigner (cam) assign their values using randomized and approximate solutions, illustrated in Algorithm 1 and Algorithm 2, respectively.
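To make the indicator-gated cost model concrete, here is a minimal sketch of how the communication time for one variable might be computed. The latency constant, bandwidth term, ring-style volume formulas, and all parameter names are illustrative assumptions, not the paper's exact model.

```python
# A hedged sketch of a per-variable collective-communication cost model.

def comm_cost(size_bytes, n_workers, primitive, delta=1e-4, bandwidth=1e9):
    """Estimate communication time for one variable v_i.

    delta: fixed per-message latency, treated as a constant that does
    not scale with variable size (as noted above).
    """
    # Indicator encoding: I1 = AllReduce, I2 = ReduceScatter + AllGather,
    # I3 = Broadcast + Reduce.
    I1 = primitive == "allreduce"
    I2 = primitive == "reducescatter_allgather"
    I3 = primitive == "broadcast_reduce"

    if I1 or I2:
        # Ring AllReduce (or ReduceScatter followed by AllGather) moves
        # roughly 2(n-1)/n of the tensor per worker.
        volume = 2 * (n_workers - 1) / n_workers * size_bytes
    elif I3:
        # Reduce to a root, then Broadcast back: ~2x the tensor size.
        volume = 2 * size_bytes
    else:
        raise ValueError(f"unknown primitive: {primitive}")

    return delta + volume / bandwidth
```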
VCC: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens
Transformers are central in modern natural language processing and computer vision applications. Despite recent works devoted to reducing the quadratic cost of such models with respect to sequence length, dealing with ultra long sequences (e.g., $>$16K tokens) remains challenging. Applications such as answering questions based on a book or summarizing a scientific article are inefficient or infeasible. Here, we propose to significantly improve the efficiency of Transformers for ultra long sequences by compressing the sequence into a much smaller representation at each layer. Specifically, by exploiting the fact that in many tasks only a small subset of special tokens, which we call VIP-tokens, are most relevant to the final prediction, we propose a VIP-token centric compression (VCC) scheme which selectively compresses the sequence based on the impact of its tokens on approximating the representation of the VIP-tokens. Compared with competitive baselines, our algorithm is not only efficient (achieving more than $3\times$ compute efficiency gain over baselines at 4K and 16K lengths), but also offers competitive or better performance on a large number of tasks. Further, we show that our algorithm scales to 128K tokens (or more) while consistently offering accuracy improvements. Code is available at https://github.com/mlpen/VCC.
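As a rough illustration of the selection principle in the abstract, the sketch below scores non-VIP tokens by the attention they receive from the VIP-tokens, keeps the top-k exactly, and pools the rest into a single summary token. The actual VCC scheme is multi-resolution and applied at every layer; every name and step here is an assumption for illustration only.

```python
import torch

def vip_centric_compress(x, vip_idx, k):
    """x: (seq_len, dim) token embeddings; vip_idx: list of VIP-token indices."""
    vip = x[vip_idx]                                   # (n_vip, dim), kept exact
    mask = torch.ones(x.size(0), dtype=torch.bool)
    mask[vip_idx] = False
    rest = x[mask]                                     # non-VIP tokens

    # Impact of each non-VIP token on the VIP-tokens' representation,
    # measured by total attention weight received from the VIP queries.
    attn = torch.softmax(vip @ rest.T / x.size(1) ** 0.5, dim=-1)
    impact = attn.sum(dim=0)                           # (n_rest,)

    top = impact.topk(min(k, rest.size(0))).indices
    keep = torch.zeros(rest.size(0), dtype=torch.bool)
    keep[top] = True
    summary = rest[~keep].mean(dim=0, keepdim=True)    # crude pooling of the rest

    return torch.cat([vip, rest[keep], summary], dim=0)
```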
VCC-INFUSE: Towards Accurate and Efficient Selection of Unlabeled Examples in Semi-supervised Learning
Fang, Shijie, Feng, Qianhan, Lin, Tong
Despite the progress of Semi-supervised Learning (SSL), existing methods fail to utilize unlabeled data effectively and efficiently. Many pseudo-label-based methods select unlabeled examples based on inaccurate confidence scores from the classifier. Most prior work also uses all available unlabeled data without pruning, making it difficult to handle large amounts of unlabeled data. To address these issues, we propose two methods: Variational Confidence Calibration (VCC) and Influence-Function-based Unlabeled Sample Elimination (INFUSE). VCC is a universal plugin for SSL confidence calibration, using a variational autoencoder to select more accurate pseudo labels based on three types of consistency scores. INFUSE is a data pruning method that constructs a core dataset of unlabeled examples under SSL. Our methods are effective across multiple datasets and settings, reducing classification error rates and saving training time. Together, VCC-INFUSE reduces the error rate of FlexMatch on the CIFAR-100 dataset by 1.08% while saving nearly half of the training time.
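The sketch below shows the standard confidence-thresholded pseudo-labeling step that VCC is designed to make more reliable. The `calibrate` argument is a hypothetical placeholder for the VCC module (a variational autoencoder over consistency scores in the paper), and the threshold value is an assumption.

```python
import torch

def select_pseudo_labels(logits, calibrate, threshold=0.95):
    """logits: (batch, n_classes) classifier outputs on unlabeled examples."""
    probs = torch.softmax(logits, dim=-1)
    conf, labels = probs.max(dim=-1)
    conf = calibrate(conf)       # VCC's role: replace the raw, often
                                 # inaccurate confidence with a calibrated score
    keep = conf >= threshold     # only confidently labeled examples are used
    return labels[keep], keep
```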
Video Abnormal Event Detection by Learning to Complete Visual Cloze Tests
Wang, Siqi, Yu, Guang, Cai, Zhiping, Liu, Xinwang, Zhu, En, Yin, Jianping, Liao, Qing
Video abnormal event detection (VAD) is a vital semi-supervised task that requires learning with only roughly labeled normal videos, as anomalies are often practically unavailable. Although deep neural networks (DNNs) enable great progress in VAD, existing solutions typically suffer from two issues: (1) the precise and comprehensive localization of video events is ignored; (2) the video semantics and temporal context are under-explored. To address these issues, we are motivated by the prevalent cloze test in education and propose a novel approach named visual cloze completion (VCC), which performs VAD by learning to complete "visual cloze tests" (VCTs). Specifically, VCC first localizes each video event and encloses it in a spatio-temporal cube (STC). To achieve both precise and comprehensive localization, appearance and motion are used as mutually complementary cues to mark the object region associated with each video event. For each marked region, a normalized patch sequence is extracted from temporally adjacent frames and stacked into the STC. By comparing each patch and the patch sequence of an STC to a visual "word" and "sentence" respectively, we can deliberately erase a certain "word" (patch) to yield a VCT. DNNs are then trained to infer the erased patch from video semantics, so as to complete the VCT. To fully exploit the temporal context, each patch in the STC is erased in turn to create multiple VCTs, and the erased patch's optical flow is also inferred to integrate richer motion clues. Meanwhile, a new DNN architecture is designed as a model-level solution to utilize video semantics and temporal context. Extensive experiments demonstrate that VCC achieves state-of-the-art VAD performance. Our code and results are available at https://github.com/yuguangnudt/VEC_VAD/tree/VCC
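A minimal sketch of the VCT-construction step described above: each patch in a spatio-temporal cube is erased in turn, and a model must reconstruct it from the remaining "sentence". Patch extraction and the completion network itself are omitted; the array layout is an assumption.

```python
import numpy as np

def make_vcts(stc):
    """stc: (T, H, W, C) stack of T normalized patches for one video event."""
    vcts = []
    for t in range(stc.shape[0]):
        incomplete = stc.copy()
        target = incomplete[t].copy()   # the erased "word" to be inferred
        incomplete[t] = 0               # blank out patch t
        vcts.append((incomplete, target, t))
    return vcts                         # one cloze test per erased patch
```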
Very Compact Clusters with Structural Regularization via Similarity and Connectivity
Clustering algorithms have improved significantly alongside deep neural networks, which provide effective representations of data. Existing methods are built upon a deep autoencoder and a self-training process that leverages the distribution of cluster assignments of samples. However, as the fundamental objective of the autoencoder is efficient data reconstruction, the learnt space may be sub-optimal for clustering. Moreover, self-training requires highly effective codes (i.e., representations) of the data; otherwise, the initial cluster centers often cause stability issues. Many state-of-the-art clustering algorithms use convolution operations to extract efficient codes, but their applications are limited to image data. In this regard, we propose an end-to-end deep clustering algorithm, Very Compact Clusters (VCC), for general datasets, which takes advantage of the distributions of local relationships of samples near the boundary of clusters, so that they can be properly separated and pulled to cluster centers to form compact clusters. Experimental results on various datasets illustrate that our proposed approach achieves better clustering performance than most state-of-the-art clustering methods, and the data embeddings learned by VCC without convolution for image data are even comparable with specialized convolutional methods.
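For intuition, a generic center-pulling loss of the kind that tightens points near cluster boundaries is sketched below. This is an illustration of the "compact clusters" idea only, not VCC's exact formulation (which also exploits local similarity and connectivity structure).

```python
import torch

def compactness_loss(z, centers, assignments):
    """z: (n, d) embeddings; centers: (k, d); assignments: (n,) cluster ids.

    Pulling each embedding toward its assigned center shrinks
    within-cluster spread, so boundary samples get separated and
    drawn into compact clusters.
    """
    return ((z - centers[assignments]) ** 2).sum(dim=1).mean()
```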
A Model of Fast Concept Inference with Object-Factorized Cognitive Programs
Sawyer, Daniel P., Lázaro-Gredilla, Miguel, George, Dileep
The ability of humans to quickly identify general concepts from a handful of images has proven difficult to emulate with robots. Recently, a computer architecture was developed that allows robots to mimic some aspects of this human ability by modeling concepts as cognitive programs using an instruction set of primitive cognitive functions. This allowed a robot to emulate human imagination by simulating candidate programs in a world model before generalizing to the physical world. However, this model used a naive search algorithm that required 30 minutes to discover a single concept, and became intractable for programs with more than 20 instructions. To circumvent this bottleneck, we present an algorithm that emulates the human cognitive heuristics of object factorization and sub-goaling, allowing human-level inference speed, improving accuracy, and making the output more explainable.
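The toy sketch below illustrates why naive program search is intractable: enumerating all programs over an instruction set grows exponentially with program length, which is the bottleneck the paper's object-factorization and sub-goaling heuristics circumvent. The instruction-set interface here is a hypothetical simplification.

```python
from itertools import product

def induce_program(instructions, examples, max_len=3):
    """instructions: dict mapping name -> callable(state) -> state;
    examples: list of (input_state, output_state) pairs."""
    for length in range(1, max_len + 1):
        # |instructions| ** length candidates: the exponential blow-up
        # that makes naive search intractable for long programs.
        for prog in product(instructions, repeat=length):
            def run(state, prog=prog):
                for name in prog:
                    state = instructions[name](state)
                return state
            if all(run(x) == y for x, y in examples):
                return prog   # first program consistent with all examples
    return None
```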
Beyond imitation: Zero-shot task transfer on robots by learning concepts as cognitive programs
Lázaro-Gredilla, Miguel, Lin, Dianhuan, Guntupalli, J. Swaroop, George, Dileep
Humans can infer concepts from image pairs and apply those in the physical world in a completely different setting, enabling tasks like IKEA assembly from diagrams. If robots could represent and infer high-level concepts, it would significantly improve their ability to understand our intent and to transfer tasks between different environments. To that end, we introduce a computational framework that replicates aspects of human concept learning. Concepts are represented as programs on a novel computer architecture consisting of a visual perception system, working memory, and action controller. The instruction set of this "cognitive computer" has commands for parsing a visual scene, directing gaze and attention, imagining new objects, manipulating the contents of a visual working memory, and controlling arm movement. Inferring a concept corresponds to inducing a program that can transform the input to the output. Some concepts require the use of imagination and recursion. Previously learned concepts simplify the learning of subsequent more elaborate concepts, and create a hierarchy of abstractions. We demonstrate how a robot can use these abstractions to interpret novel concepts presented to it as schematic images, and then apply those concepts in dramatically different situations. By bringing cognitive science ideas on mental imagery, perceptual symbols, embodied cognition, and deictic mechanisms into the realm of machine learning, our work brings us closer to the goal of building robots that have interpretable representations and commonsense.
Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering
Agrawal, Aishwarya, Batra, Dhruv, Parikh, Devi, Kembhavi, Aniruddha
A number of studies have found that today's Visual Question Answering (VQA) models are heavily driven by superficial correlations in the training data and lack sufficient image grounding. To encourage development of models geared towards the latter, we propose a new setting for VQA where, for every question type, train and test sets have different prior distributions of answers. Specifically, we present new splits of the VQA v1 and VQA v2 datasets, which we call Visual Question Answering under Changing Priors (VQA-CP v1 and VQA-CP v2, respectively). First, we evaluate several existing VQA models under this new setting and show that their performance degrades significantly compared to the original VQA setting. Second, we propose a novel Grounded Visual Question Answering model (GVQA) that contains inductive biases and restrictions in the architecture specifically designed to prevent the model from 'cheating' by primarily relying on priors in the training data. Specifically, GVQA explicitly disentangles the recognition of visual concepts present in the image from the identification of the plausible answer space for a given question, enabling the model to generalize more robustly across different distributions of answers. GVQA is built off an existing VQA model, Stacked Attention Networks (SAN). Our experiments demonstrate that GVQA significantly outperforms SAN on both the VQA-CP v1 and VQA-CP v2 datasets. Interestingly, it also outperforms more powerful VQA models such as Multimodal Compact Bilinear Pooling (MCB) in several cases. GVQA offers strengths complementary to SAN when trained and evaluated on the original VQA v1 and VQA v2 datasets. Finally, GVQA is more transparent and interpretable than existing VQA models.
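A schematic sketch of the disentangling idea: one branch scores answers from visual evidence, another predicts which answers are plausible for the question, and the two are combined multiplicatively so the model cannot answer from language priors alone. Layer shapes and the fusion are placeholders, not the published GVQA architecture.

```python
import torch
import torch.nn as nn

class DisentangledVQA(nn.Module):
    def __init__(self, img_dim=2048, q_dim=300, n_answers=1000):
        super().__init__()
        self.visual = nn.Linear(img_dim, n_answers)    # evidence from the image
        self.plausible = nn.Linear(q_dim, n_answers)   # answer space from the question

    def forward(self, img_feat, q_feat):
        # Gating visual scores by a question-derived plausibility mask
        # forces the answer to be grounded in the image.
        return torch.sigmoid(self.plausible(q_feat)) * self.visual(img_feat)
```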
A Spoofing Benchmark for the 2018 Voice Conversion Challenge: Leveraging from Spoofing Countermeasures for Speech Artifact Assessment
Kinnunen, Tomi, Lorenzo-Trueba, Jaime, Yamagishi, Junichi, Toda, Tomoki, Saito, Daisuke, Villavicencio, Fernando, Ling, Zhenhua
Voice conversion (VC) aims at converting speaker characteristics without altering content. Due to training data limitations and modeling imperfections, it is difficult to achieve believable speaker mimicry without introducing processing artifacts; performance assessment of VC therefore usually involves both speaker similarity and quality evaluation by a human panel. As a time-consuming, expensive, and non-reproducible process, this hinders rapid prototyping of new VC technology. We address artifact assessment using an alternative, objective approach that leverages prior work on spoofing countermeasures (CMs) for automatic speaker verification. Therein, CMs are used for rejecting "fake" inputs such as replayed, synthetic, or converted speech, but their potential for automatic speech artifact assessment remains unknown. This study serves to fill that gap. As a supplement to subjective results for the 2018 Voice Conversion Challenge (VCC'18) data, we configure a standard constant-Q cepstral coefficient CM to quantify the extent of processing artifacts. The equal error rate (EER) of the CM, a confusability index of VC samples with real human speech, serves as our artifact measure. Two clusters of VCC'18 entries are identified: low-quality ones with detectable artifacts (low EERs), and higher-quality ones with fewer artifacts. None of the VCC'18 systems, however, is perfect: all EERs are below 30% (the ideal value would be 50%). Our preliminary findings suggest the potential of CMs outside their original application, as a supplemental optimization and benchmarking tool to enhance VC technology.
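For reference, this is how a countermeasure's EER can serve as an artifact measure: score human and converted trials with the CM, then find the operating point where false-acceptance and false-rejection rates meet. This is a standard EER computation, not code from the VCC'18 benchmark itself.

```python
import numpy as np

def eer(human_scores, converted_scores):
    """Higher score = more human-like. An EER near 50% means converted
    speech is indistinguishable from real human speech; a low EER means
    the CM easily detects processing artifacts."""
    thresholds = np.sort(np.concatenate([human_scores, converted_scores]))
    far = np.array([np.mean(converted_scores >= t) for t in thresholds])  # fakes accepted
    frr = np.array([np.mean(human_scores < t) for t in thresholds])       # humans rejected
    i = np.argmin(np.abs(far - frr))   # threshold where the two rates cross
    return (far[i] + frr[i]) / 2
```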