Goto

Collaborating Authors

 signature matrix


Text Similarity using K-Shingling, Minhashing and LSH(Locality Sensitive Hashing)

#artificialintelligence

Text similarity plays an important role in Natural Language Processing (NLP) and there are several areas where this has been utilized extensively. Some of the applications include Information retrieval, text categorization, topic detection, machine translation, text summarization, document clustering, plagiarism detection, news recommendation, etc. encompassing almost all domains. But sometimes, it becomes difficult to understand the concept behind the Text Similarity Algorithms. This write-up will show an implementation of Text Similarity along with an explanation of the required concepts. But before I start, let me tell you that there might be several ways and several algorithms to perform the same task.


Scalable Signal Reconstruction for a Broad Range of Applications

Communications of the ACM

Signal reconstruction problem (SRP) is an important optimization problem where the objective is to identify a solution to an underdetermined system of linear equations that is closest to a given prior. It has a substantial number of applications in diverse areas, such as network traffic engineering, medical image reconstruction, acoustics, astronomy, and many more. Unfortunately, most of the common approaches for solving SRP do not scale to large problem sizes. We propose a novel and scalable algorithm for solving this critical problem. Specifically, we make four major contributions. First, we propose a dual formulation of the problem and develop the DIRECT algorithm that is significantly more efficient than the state of the art. Second, we show how adapting database techniques developed for scalable similarity joins provides a substantial speedup over DIRECT. Third, we describe several practical techniques that allow our algorithm to scale--on a single machine--to settings that are orders of magnitude larger than previously studied. Finally, we use the database techniques of materialization and reuse to extend our result to dynamic settings where the input to the SRP changes. Extensive experiments on real-world and synthetic data confirm the efficiency, effectiveness, and scalability of our proposal. The database community has been at the forefront of grappling with challenges of big data and has developed numerous techniques for the scalable processing and analysis of massive datasets. These techniques often originate from solving core data management challenges but then find their way into effectively addressing the needs of big data analytics. We study how database techniques can benefit large-scale signal reconstruction,13 which is of interest to research communities as diverse as computer networks,15 medical imaging,7 etc. We demonstrate that the scalability of existing solutions can be significantly improved using ideas originally developed for similarity joins5 and selectivity estimation for set similarity queries.3 Signal reconstruction problem (SRP): The essence of SRP is to solve a linear system of the form AX b, where X is a high-dimensional unknown signal (represented by an m-d vector in Rm), b is a low-dimensional projection of X that can be observed in practice (represented by an n-d vector in Rn with n m), and A is an n m matrix that captures the linear relationship between X and b.


A Deep Neural Network for Unsupervised Anomaly Detection and Diagnosis in Multivariate Time Series Data

arXiv.org Machine Learning

Nowadays, multivariate time series data are increasingly collected in various real world systems, e.g., power plants, wearable devices, etc. Anomaly detection and diagnosis in multivariate time series refer to identifying abnormal status in certain time steps and pinpointing the root causes. Building such a system, however, is challenging since it not only requires to capture the temporal dependency in each time series, but also need encode the inter-correlations between different pairs of time series. In addition, the system should be robust to noise and provide operators with different levels of anomaly scores based upon the severity of different incidents. Despite the fact that a number of unsupervised anomaly detection algorithms have been developed, few of them can jointly address these challenges. In this paper, we propose a Multi-Scale Convolutional Recurrent Encoder-Decoder (MSCRED), to perform anomaly detection and diagnosis in multivariate time series data. Specifically, MSCRED first constructs multi-scale (resolution) signature matrices to characterize multiple levels of the system statuses in different time steps. Subsequently, given the signature matrices, a convolutional encoder is employed to encode the inter-sensor (time series) correlations and an attention based Convolutional Long-Short Term Memory (ConvLSTM) network is developed to capture the temporal patterns. Finally, based upon the feature maps which encode the inter-sensor correlations and temporal information, a convolutional decoder is used to reconstruct the input signature matrices and the residual signature matrices are further utilized to detect and diagnose anomalies. Extensive empirical studies based on a synthetic dataset and a real power plant dataset demonstrate that MSCRED can outperform state-of-the-art baseline methods.