Goto

Collaborating Authors

 constant time



Fitting Low-Rank Tensors in Constant Time

Neural Information Processing Systems

In this paper, we develop an algorithm that approximates the residual error of Tucker decomposition, one of the most popular tensor decomposition methods, with a provable guarantee.




Mining the Minoria: Unknown, Under-represented, and Under-performing Minority Groups

Dehghankar, Mohsen, Asudeh, Abolfazl

arXiv.org Artificial Intelligence

Due to a variety of reasons, such as privacy, data in the wild often misses the grouping information required for identifying minorities. On the other hand, it is known that machine learning models are only as good as the data they are trained on and, hence, may underperform for the under-represented minority groups. The missing grouping information presents a dilemma for responsible data scientists who find themselves in an unknown-unknown situation, where not only do they not have access to the grouping attributes but do not also know what groups to consider. This paper is an attempt to address this dilemma. Specifically, we propose a minority mining problem, where we find vectors in the attribute space that reveal potential groups that are under-represented and under-performing. Technically speaking, we propose a geometric transformation of data into a dual space and use notions such as the arrangement of hyperplanes to design an efficient algorithm for the problem in lower dimensions. Generalizing our solution to the higher dimensions is cursed by dimensionality. Therefore, we propose a solution based on smart exploration of the search space for such cases. We conduct comprehensive experiments using real-world and synthetic datasets alongside the theoretical analysis. Our experiment results demonstrate the effectiveness of our proposed solutions in mining the unknown, under-represented, and under-performing minorities.


Convergent Bounds on the Euclidean Distance

Neural Information Processing Systems

Given a set V of n vectors in d-dimensional space, we provide an efficient method for computing quality upper and lower bounds of the Euclidean distances between a pair of vectors in V. For this purpose, we define a distance measure, called the MS-distance, by using the mean and the standard deviation values of vectors in V. Once we compute the mean and the standard deviation values of vectors in V in O(dn) time, the MS-distance provides upper and lower bounds of Euclidean distance between any pair of vectors in V in constant time. Furthermore, these bounds can be refined further in such a way to converge monotonically to the exact Euclidean distance within d refinement steps. An analysis on a random sequence of refinement steps shows that the MS-distance provides very tight bounds in only a few refinement steps. The MS-distance can be used to various applications where the Euclidean distance is used to measure the proximity or similarity between objects. We provide experimental results on the nearest and the farthest neighbor searches.


(1,1)-Cluster Editing is Polynomial-time Solvable

Gutin, Gregory, Yeo, Anders

arXiv.org Artificial Intelligence

A graph $H$ is a clique graph if $H$ is a vertex-disjoin union of cliques. Abu-Khzam (2017) introduced the $(a,d)$-{Cluster Editing} problem, where for fixed natural numbers $a,d$, given a graph $G$ and vertex-weights $a^*:\ V(G)\rightarrow \{0,1,\dots, a\}$ and $d^*{}:\ V(G)\rightarrow \{0,1,\dots, d\}$, we are to decide whether $G$ can be turned into a cluster graph by deleting at most $d^*(v)$ edges incident to every $v\in V(G)$ and adding at most $a^*(v)$ edges incident to every $v\in V(G)$. Results by Komusiewicz and Uhlmann (2012) and Abu-Khzam (2017) provided a dichotomy of complexity (in P or NP-complete) of $(a,d)$-{Cluster Editing} for all pairs $a,d$ apart from $a=d=1.$ Abu-Khzam (2017) conjectured that $(1,1)$-{Cluster Editing} is in P. We resolve Abu-Khzam's conjecture in affirmative by (i) providing a serious of five polynomial-time reductions to $C_3$-free and $C_4$-free graphs of maximum degree at most 3, and (ii) designing a polynomial-time algorithm for solving $(1,1)$-{Cluster Editing} on $C_3$-free and $C_4$-free graphs of maximum degree at most 3.


Data Structures Related to Machine Learning Algorithms - KDnuggets

#artificialintelligence

The Statsbot team has invited Peter Mills to tell you about data structures for machine learning approaches. So you've decided to move beyond canned algorithms and start to code your own machine learning methods. Maybe you've got an idea for a cool new way of clustering data, or maybe you are frustrated by the limitations in your favorite statistical classification package. In either case, the better your knowledge of data structures and algorithms, the easier time you'll have when it comes time to code up. I don't think the data structures used in machine learning are significantly different than those used in other areas of software development.


k\=oan: A Corrected CBOW Implementation

İrsoy, Ozan, Benton, Adrian, Stratos, Karl

arXiv.org Machine Learning

It is a common belief in the NLP community that continuous bag-of-words (CBOW) word embeddings tend to underperform skip-gram (SG) embeddings. We find that this belief is founded less on theoretical differences in their training objectives but more on faulty CBOW implementations in standard software libraries such as the official implementation word2vec.c and Gensim. We show that our correct implementation of CBOW yields word embeddings that are fully competitive with SG on various intrinsic and extrinsic tasks while being more than three times as fast to train. We release our implementation, k\=oan, at https://github.com/bloomberg/koan.


Three Success Stories About Compact Data Structures

Communications of the ACM

Technology evolution is no longer keeping pace with the growth of data. We are facing problems storing and processing the huge amounts of data produced every day. People rely on data-intensive applications and new paradigms (for example, edge computing) to try to keep computation closer to where data is produced and needed. Thus, the need to store and query data in devices where capacity is surpassed by data volume is routine today, ranging from astronomy data to be processed by supercomputers, to personal data to be processed by wearable sensors. The scale is different, yet the underlying problem is the same.