constant time
Mining the Minoria: Unknown, Under-represented, and Under-performing Minority Groups
Dehghankar, Mohsen, Asudeh, Abolfazl
Due to a variety of reasons, such as privacy, data in the wild often misses the grouping information required for identifying minorities. On the other hand, it is known that machine learning models are only as good as the data they are trained on and, hence, may underperform for the under-represented minority groups. The missing grouping information presents a dilemma for responsible data scientists who find themselves in an unknown-unknown situation, where not only do they not have access to the grouping attributes but do not also know what groups to consider. This paper is an attempt to address this dilemma. Specifically, we propose a minority mining problem, where we find vectors in the attribute space that reveal potential groups that are under-represented and under-performing. Technically speaking, we propose a geometric transformation of data into a dual space and use notions such as the arrangement of hyperplanes to design an efficient algorithm for the problem in lower dimensions. Generalizing our solution to the higher dimensions is cursed by dimensionality. Therefore, we propose a solution based on smart exploration of the search space for such cases. We conduct comprehensive experiments using real-world and synthetic datasets alongside the theoretical analysis. Our experiment results demonstrate the effectiveness of our proposed solutions in mining the unknown, under-represented, and under-performing minorities.
- North America > United States > Illinois > Cook County > Chicago (0.06)
- North America > United States > California (0.04)
- Europe > Italy > Liguria (0.04)
Convergent Bounds on the Euclidean Distance
Given a set V of n vectors in d-dimensional space, we provide an efficient method for computing quality upper and lower bounds of the Euclidean distances between a pair of vectors in V. For this purpose, we define a distance measure, called the MS-distance, by using the mean and the standard deviation values of vectors in V. Once we compute the mean and the standard deviation values of vectors in V in O(dn) time, the MS-distance provides upper and lower bounds of Euclidean distance between any pair of vectors in V in constant time. Furthermore, these bounds can be refined further in such a way to converge monotonically to the exact Euclidean distance within d refinement steps. An analysis on a random sequence of refinement steps shows that the MS-distance provides very tight bounds in only a few refinement steps. The MS-distance can be used to various applications where the Euclidean distance is used to measure the proximity or similarity between objects. We provide experimental results on the nearest and the farthest neighbor searches.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > New York > New York County > New York City (0.05)
- Asia > South Korea > Gyeongsangbuk-do > Pohang (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
(1,1)-Cluster Editing is Polynomial-time Solvable
A graph $H$ is a clique graph if $H$ is a vertex-disjoin union of cliques. Abu-Khzam (2017) introduced the $(a,d)$-{Cluster Editing} problem, where for fixed natural numbers $a,d$, given a graph $G$ and vertex-weights $a^*:\ V(G)\rightarrow \{0,1,\dots, a\}$ and $d^*{}:\ V(G)\rightarrow \{0,1,\dots, d\}$, we are to decide whether $G$ can be turned into a cluster graph by deleting at most $d^*(v)$ edges incident to every $v\in V(G)$ and adding at most $a^*(v)$ edges incident to every $v\in V(G)$. Results by Komusiewicz and Uhlmann (2012) and Abu-Khzam (2017) provided a dichotomy of complexity (in P or NP-complete) of $(a,d)$-{Cluster Editing} for all pairs $a,d$ apart from $a=d=1.$ Abu-Khzam (2017) conjectured that $(1,1)$-{Cluster Editing} is in P. We resolve Abu-Khzam's conjecture in affirmative by (i) providing a serious of five polynomial-time reductions to $C_3$-free and $C_4$-free graphs of maximum degree at most 3, and (ii) designing a polynomial-time algorithm for solving $(1,1)$-{Cluster Editing} on $C_3$-free and $C_4$-free graphs of maximum degree at most 3.
- North America > United States > California > Orange County > Laguna Hills (0.04)
- Europe > Denmark > Southern Denmark (0.04)
- Africa > South Africa > Gauteng > Johannesburg (0.04)
Data Structures Related to Machine Learning Algorithms - KDnuggets
The Statsbot team has invited Peter Mills to tell you about data structures for machine learning approaches. So you've decided to move beyond canned algorithms and start to code your own machine learning methods. Maybe you've got an idea for a cool new way of clustering data, or maybe you are frustrated by the limitations in your favorite statistical classification package. In either case, the better your knowledge of data structures and algorithms, the easier time you'll have when it comes time to code up. I don't think the data structures used in machine learning are significantly different than those used in other areas of software development.
k\=oan: A Corrected CBOW Implementation
İrsoy, Ozan, Benton, Adrian, Stratos, Karl
It is a common belief in the NLP community that continuous bag-of-words (CBOW) word embeddings tend to underperform skip-gram (SG) embeddings. We find that this belief is founded less on theoretical differences in their training objectives but more on faulty CBOW implementations in standard software libraries such as the official implementation word2vec.c and Gensim. We show that our correct implementation of CBOW yields word embeddings that are fully competitive with SG on various intrinsic and extrinsic tasks while being more than three times as fast to train. We release our implementation, k\=oan, at https://github.com/bloomberg/koan.
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Middle East > Malta > Port Region > Southern Harbour District > Valletta (0.04)
Three Success Stories About Compact Data Structures
Technology evolution is no longer keeping pace with the growth of data. We are facing problems storing and processing the huge amounts of data produced every day. People rely on data-intensive applications and new paradigms (for example, edge computing) to try to keep computation closer to where data is produced and needed. Thus, the need to store and query data in devices where capacity is surpassed by data volume is routine today, ranging from astronomy data to be processed by supercomputers, to personal data to be processed by wearable sensors. The scale is different, yet the underlying problem is the same.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.05)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.05)
- South America > Chile > Biobío Region > Concepción Province > Concepción (0.05)
- (4 more...)