
Collaborating Authors

 Phillips, Jeff M.


Robust High-Dimensional Mean Estimation With Low Data Size, an Empirical Study

arXiv.org Machine Learning

Robust statistics aims to compute quantities that represent data even when a fraction of it may be arbitrarily corrupted. The most essential statistic is the mean, and in recent years there has been a flurry of theoretical advancement for efficiently estimating the mean in high dimensions on corrupted data. While several algorithms have been proposed that achieve near-optimal error, they all rely on large data size requirements as a function of dimension. In this paper, we perform extensive experiments over various mean estimation techniques where the data size might not meet this requirement due to the high-dimensional setting. For data with inliers generated from a Gaussian with known covariance, we find experimentally that several robust mean estimation techniques can practically improve upon the sample mean, with the quantum entropy scaling approach from Dong et al.
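
As a quick illustration of the setting (a minimal sketch, not any of the surveyed estimators), the snippet below corrupts a fraction of Gaussian samples and compares the sample mean against a simple robust baseline, the coordinate-wise median; the dimension, corruption level, and outlier placement are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eps = 100, 500, 0.1                 # dimension, data size, corruption fraction
true_mean = np.zeros(d)

X = rng.normal(loc=true_mean, scale=1.0, size=(n, d))   # inliers ~ N(0, I)
X[: int(eps * n)] = 10.0                                 # adversarially placed outliers

sample_mean = X.mean(axis=0)
coord_median = np.median(X, axis=0)        # a simple robust baseline

print("sample mean error:  ", np.linalg.norm(sample_mean - true_mean))
print("coord-median error: ", np.linalg.norm(coord_median - true_mean))
```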


No Dimensional Sampling Coresets for Classification

arXiv.org Artificial Intelligence

We refine and generalize what is known about coresets for classification problems via the sensitivity sampling framework. Such coresets seek the smallest possible subsets of input data, so one can optimize a loss function on the coreset and ensure approximation guarantees with respect to the original data. Our analysis provides the first no dimensional coresets, so the size does not depend on the dimension. Moreover, our results are general: they apply to distributional input, can use i.i.d. samples (and so provide sample complexity bounds), and work for a variety of loss functions. A key tool we develop is a Rademacher complexity version of the main sensitivity sampling approach, which may be of independent interest.
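
The following sketch illustrates the general sensitivity sampling recipe on a weighted logistic loss; the surrogate importance score is invented purely for illustration and carries none of the paper's guarantees.

```python
import numpy as np

def sensitivity_sample(X, y, m, rng):
    """Sample m rows with probability proportional to a surrogate score; return importance weights."""
    n = X.shape[0]
    norms = np.linalg.norm(X, axis=1)
    s = 1.0 / n + norms / norms.sum()       # crude surrogate sensitivity, illustration only
    p = s / s.sum()
    idx = rng.choice(n, size=m, replace=True, p=p)
    w = 1.0 / (m * p[idx])                  # importance weights
    return X[idx], y[idx], w

def logistic_loss(theta, X, y, w=None):
    """(Weighted) mean logistic loss; labels y are in {-1, +1}."""
    return np.average(np.log1p(np.exp(-y * (X @ theta))), weights=w)

rng = np.random.default_rng(1)
X = rng.normal(size=(10000, 20))
y = np.where(rng.random(10000) < 0.5, -1.0, 1.0)
theta = rng.normal(size=20)

Xc, yc, w = sensitivity_sample(X, y, m=500, rng=rng)
print(logistic_loss(theta, X, y), logistic_loss(theta, Xc, yc, w))   # should be close
```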


Linear Distance Metric Learning with Noisy Labels

arXiv.org Artificial Intelligence

In linear distance metric learning, we are given data in one Euclidean metric space and the goal is to find an appropriate linear map to another Euclidean metric space which respects certain distance conditions as much as possible. In this paper, we formalize a simple and elegant method which reduces to a general continuous convex loss optimization problem, and for different noise models we derive the corresponding loss functions. We show that even if the data is noisy, the ground truth linear metric can be learned to any precision provided access to enough samples, and we provide a corresponding sample complexity bound. Moreover, we present an effective way to truncate the learned model to a low-rank model that can provably maintain the accuracy in the loss function and in the parameters - the first results of this type. Several experimental observations on synthetic and real data sets support and inform our theoretical results.
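
A minimal sketch of the general idea, assuming a squared-error loss on noiseless target distances (the paper's noise models and derived losses are not reproduced here): learn a PSD matrix $M$ by projected gradient descent, then read off a linear map $L$ with $M = LL^\top$.

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs * np.clip(vals, 0.0, None)) @ vecs.T

def learn_metric(diffs, targets, d, steps=2000, lr=1e-3):
    """Fit M so that (u-v)^T M (u-v) matches target squared distances (diffs = u - v)."""
    M = np.eye(d)
    for _ in range(steps):
        dist2 = np.einsum("nd,de,ne->n", diffs, M, diffs)
        resid = dist2 - targets
        grad = 2 * np.einsum("n,nd,ne->de", resid, diffs, diffs) / len(targets)
        M = project_psd(M - lr * grad)
    return M

rng = np.random.default_rng(2)
d = 5
L_true = rng.normal(size=(d, d))
diffs = rng.normal(size=(2000, d))                                    # pairwise difference vectors
targets = np.einsum("nd,nd->n", diffs @ L_true.T, diffs @ L_true.T)   # noiseless targets here

M = learn_metric(diffs, targets, d)
L_learned = np.linalg.cholesky(M + 1e-9 * np.eye(d))                  # a linear map with M = L L^T
```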


Sketching Multidimensional Time Series for Fast Discord Mining

arXiv.org Artificial Intelligence

Time series discords are a useful primitive for time series anomaly detection, and the matrix profile is capable of capturing discords effectively. Many research efforts aim to improve the scalability of discord discovery with respect to the length of the time series. However, there is surprisingly little work focused on reducing the time complexity of matrix profile computation associated with the dimensionality of a multidimensional time series. In this work, we propose a sketch for discord mining among multi-dimensional time series. After an initial pre-processing of the sketch, which is as fast as reading the data, discord mining has a runtime independent of the dimensionality of the original data. On several real-world examples from water treatment and transportation, the proposed algorithm improves the throughput by at least an order of magnitude (50X) and has only minimal impact on the quality of the approximated solution. Additionally, the proposed method can handle the dynamic addition or deletion of dimensions with inconsequential overhead. This allows a data analyst to consider "what-if" scenarios in real time while exploring the data.
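
To make the sketch-then-mine idea concrete (illustrative only; neither the projection nor the discord search below is the paper's construction), one can project a $D$-dimensional series to a single random combination in one pass and then run a naive matrix-profile-style discord search whose cost no longer depends on $D$.

```python
import numpy as np

def znorm(x):
    return (x - x.mean()) / (x.std() + 1e-8)

def top_discord(ts, m):
    """Index and score of the length-m subsequence with the largest nearest-neighbor distance."""
    n = len(ts) - m + 1
    subs = np.stack([znorm(ts[i:i + m]) for i in range(n)])
    best_idx, best_dist = -1, -1.0
    for i in range(n):
        d = np.linalg.norm(subs - subs[i], axis=1)
        d[max(0, i - m // 2): i + m // 2] = np.inf    # exclusion zone around trivial matches
        nn = d.min()
        if nn > best_dist:
            best_idx, best_dist = i, nn
    return best_idx, best_dist

rng = np.random.default_rng(3)
t = np.arange(2000)
base = np.sin(2 * np.pi * t / 100)                    # shared periodic behavior
T = base[:, None] + 0.1 * rng.normal(size=(2000, 8))  # an 8-dimensional time series
T[900:950] = 0.1 * rng.normal(size=(50, 8))           # anomaly: the pattern drops out
sketch = T @ rng.normal(size=8)                       # one random projection: a 1-d sketch
print(top_discord(sketch, m=100))                     # discord typically found near index 900
```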


An Efficient Content-based Time Series Retrieval System

arXiv.org Artificial Intelligence

A Content-based Time Series Retrieval (CTSR) system is an information retrieval system that lets users interact with time series originating from multiple domains, such as finance, healthcare, and manufacturing. For example, users seeking to learn more about the source of a time series can submit the time series as a query to the CTSR system and retrieve a list of relevant time series with associated metadata. By analyzing the retrieved metadata, users can gather more information about the source of the time series. Because the CTSR system is required to work with time series data from diverse domains, it needs a high-capacity model to effectively measure the similarity between different time series. On top of that, the model within the CTSR system has to compute the similarity scores efficiently, as users interact with the system in real time. In this paper, we propose an effective and efficient CTSR model that outperforms alternative models while still providing reasonable inference runtimes. To demonstrate the capability of the proposed method in solving business problems, we compare it against alternative models using our in-house transaction data. Our findings reveal that the proposed model is the most suitable solution for our transaction data problem.
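
As a point of reference for what a CTSR query looks like (a baseline sketch only, not the proposed model), a catalog can be ranked by a simple z-normalized distance to the query and the metadata of the top hits returned.

```python
import numpy as np

def znorm(x):
    return (x - x.mean()) / (x.std() + 1e-8)

def retrieve(query, catalog, metadata, k=3):
    """Return the metadata and distance of the k catalog series closest to the query."""
    q = znorm(np.asarray(query, dtype=float))
    scores = [np.linalg.norm(q - znorm(np.asarray(s, dtype=float))) for s in catalog]
    order = np.argsort(scores)[:k]
    return [(metadata[i], scores[i]) for i in order]

rng = np.random.default_rng(4)
catalog = [np.sin(np.linspace(0, f, 128)) + 0.1 * rng.normal(size=128) for f in (5, 10, 20)]
metadata = ["domain: finance", "domain: healthcare", "domain: manufacturing"]
print(retrieve(np.sin(np.linspace(0, 10, 128)), catalog, metadata, k=2))
```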


For Kernel Range Spaces a Constant Number of Queries Are Sufficient

arXiv.org Artificial Intelligence

We introduce the notion of an $\varepsilon$-cover for a kernel range space. A kernel range space concerns a set of points $X \subset \mathbb{R}^d$ and the space of all queries by a fixed kernel (e.g., a Gaussian kernel $K(p,\cdot) = \exp(-\|p-\cdot\|^2)$). For a point set $X$ of size $n$, a query returns a vector of values $R_p \in \mathbb{R}^n$, where the $i$th coordinate $(R_p)_i = K(p,x_i)$ for $x_i \in X$. An $\varepsilon$-cover is a subset of points $Q \subset \mathbb{R}^d$ such that for any $p \in \mathbb{R}^d$ we have $\frac{1}{n} \|R_p - R_q\|_1\leq \varepsilon$ for some $q \in Q$. This is a smooth analog of Haussler's notion of $\varepsilon$-covers for combinatorial range spaces (e.g., defined by subsets of points within a ball query), where the resulting vectors $R_p$ are in $\{0,1\}^n$ instead of $[0,1]^n$. The kernel versions of these range spaces show up in data analysis tasks where the coordinates may be uncertain or imprecise, and hence one wishes to add some flexibility to the notion of inside and outside of a query range. Our main result is that, unlike for combinatorial range spaces, the size of kernel $\varepsilon$-covers is independent of the input size $n$ and dimension $d$. We obtain a bound of $(1/\varepsilon)^{\tilde O(1/\varepsilon^2)}$, where $\tilde{O}(f(1/\varepsilon))$ hides log factors in $(1/\varepsilon)$ that can depend on the kernel. This implies that by relaxing the notion of boundaries in range queries, the curse of dimensionality eventually disappears, and it may help explain the success of machine learning in very high dimensions. We also complement this result with a lower bound of almost $(1/\varepsilon)^{\Omega(1/\varepsilon)}$, showing that the exponential dependence on $1/\varepsilon$ is necessary.
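
The definitions above translate directly into a few lines of code; the sketch below (with arbitrary point sets, an arbitrary candidate $Q$, and the Gaussian kernel from the abstract) computes $R_p$ and checks the $\varepsilon$-cover condition for a single query.

```python
import numpy as np

def R(p, X):
    """Query vector: (R_p)_i = K(p, x_i) for the Gaussian kernel K(p, x) = exp(-||p - x||^2)."""
    return np.exp(-np.linalg.norm(X - p, axis=1) ** 2)

def covered(p, Q, X, eps):
    """True if some q in Q satisfies (1/n) * ||R_p - R_q||_1 <= eps."""
    n = len(X)
    return any(np.abs(R(p, X) - R(q, X)).sum() / n <= eps for q in Q)

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))     # point set in R^3
Q = rng.normal(size=(50, 3))      # a candidate cover (illustrative only, not a constructed cover)
p = rng.normal(size=3)
print(covered(p, Q, X, eps=0.2))
```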


Practical and Configurable Network Traffic Classification Using Probabilistic Machine Learning

arXiv.org Artificial Intelligence

Network traffic classification that is widely applicable and highly accurate is valuable for many network security and management tasks. A flexible and easily configurable classification framework is ideal, as it can be customized for use in a wide variety of networks. In this paper, we propose a highly configurable and flexible machine learning traffic classification method that relies only on statistics of sequences of packets to distinguish known, or approved, traffic from unknown traffic. Our method is based on likelihood estimation, provides a measure of certainty for classification decisions, and can classify traffic at adjustable certainty levels. Our classification method can also be applied in different classification scenarios, each prioritizing a different classification goal. We demonstrate how our classification scheme and all of its configurations perform well on real-world traffic from a high-performance computing network environment.
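
A hedged sketch of the likelihood-with-threshold idea (the features, the per-class Gaussian model, and the cutoff below are placeholders, not the paper's feature set or model): fit a density per approved class on flow statistics and label a flow unknown unless some class explains it above a configurable likelihood cutoff.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_classes(flows_by_class):
    """Fit a Gaussian (mean, covariance) per approved traffic class."""
    return {c: (np.mean(F, axis=0), np.cov(F, rowvar=False) + 1e-6 * np.eye(F.shape[1]))
            for c, F in flows_by_class.items()}

def classify(x, models, log_lik_cutoff):
    """Return the best-scoring class, or 'unknown' if no class clears the cutoff."""
    scores = {c: multivariate_normal.logpdf(x, mean=mu, cov=S) for c, (mu, S) in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= log_lik_cutoff else "unknown"

rng = np.random.default_rng(6)
# Toy "statistics of packet sequences": e.g., mean packet size, mean inter-arrival time.
flows = {"ssh": rng.normal([200, 5], 1.0, size=(500, 2)),
         "http": rng.normal([900, 1], 1.0, size=(500, 2))}
models = fit_classes(flows)
print(classify(np.array([201, 5]), models, log_lik_cutoff=-10.0))
print(classify(np.array([500, 50]), models, log_lik_cutoff=-10.0))   # likely "unknown"
```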


The Kernel Spatial Scan Statistic

arXiv.org Machine Learning

Kulldorff's (1997) seminal paper on spatial scan statistics (SSS) has led to many methods considering different regions of interest, different statistical models, and different approximations, while also having numerous applications in epidemiology, environmental monitoring, and homeland security. SSS provides a way to rigorously test for the existence of an anomaly and provide statistical guarantees as to how "anomalous" that anomaly is. However, these methods rely on defining specific regions where the spatial information a point contributes is limited to a binary 0 or 1, indicating either inside or outside the region, while in reality anomalies tend to follow smooth distributions with density decaying further from an epicenter. In this work, we propose a method that addresses this shortcoming through a continuous scan statistic that generalizes SSS by allowing the point contribution to be defined by a kernel. We provide extensive experimental and theoretical results showing that our methods can be computed efficiently while providing high statistical power for detecting anomalous regions.
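
The shift from binary regions to kernel weights can be sketched as follows; the discrepancy function here is a simple linear surrogate chosen for illustration, not the paper's statistic, and the candidate epicenters and bandwidth are arbitrary.

```python
import numpy as np

def kernel_scan(points, labels, centers, bandwidth):
    """labels: 1 for 'case' points, 0 for baseline; returns the highest-scoring epicenter."""
    best = (None, -np.inf)
    n_case, n_base = labels.sum(), (1 - labels).sum()
    for c in centers:
        # Kernel weight replaces the 0/1 inside/outside indicator of a fixed region.
        w = np.exp(-np.linalg.norm(points - c, axis=1) ** 2 / bandwidth ** 2)
        score = abs(w[labels == 1].sum() / n_case - w[labels == 0].sum() / n_base)
        if score > best[1]:
            best = (c, score)
    return best

rng = np.random.default_rng(7)
pts = rng.uniform(0, 10, size=(1000, 2))
p_case = 0.1 + 0.5 * np.exp(-np.linalg.norm(pts - [7, 7], axis=1) ** 2)   # planted hotspot
labels = (rng.uniform(size=1000) < p_case).astype(int)
center, score = kernel_scan(pts, labels, centers=pts[rng.choice(1000, 50)], bandwidth=1.5)
print(center, score)    # best center is typically near (7, 7)
```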


Relative Error RKHS Embeddings for Gaussian Kernels

arXiv.org Machine Learning

We show how to obliviously embed into the reproducing kernel Hilbert space associated with Gaussian kernels, so that distance in this space (the kernel distance) has only $(1+\varepsilon)$-relative error. This holds only when comparing point sets at a kernel distance of at least $\alpha$; this parameter appears only as a poly-logarithmic factor in the dimension of an intermediate embedding, but not in the final embedding. The main insight is to modify the well-traveled random Fourier features to be slightly biased and have higher variance, but so that they can be defined as a convolution over the function space. This result provides the first guaranteed algorithmic results for LSH of the kernel distance on point sets and low-dimensional shapes and distributions, and for relative error bounds on the kernel two-sample test.
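
For context, here are the standard (unmodified) random Fourier features for the kernel $K(x,y) = \exp(-\|x-y\|^2)$; the paper's biased, higher-variance modification and its relative-error guarantee are not reproduced in this sketch, which only achieves the usual additive error.

```python
import numpy as np

def rff(X, D, rng):
    """Map rows of X to D random Fourier features approximating K(x, y) = exp(-||x - y||^2)."""
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0), size=(d, D))   # spectral density of this kernel is N(0, 2I)
    b = rng.uniform(0.0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(8)
X = rng.normal(scale=0.3, size=(5, 2))                # nearby low-dimensional points
Z = rff(X, D=4000, rng=rng)

exact = np.exp(-np.linalg.norm(X[0] - X[1]) ** 2)
print(exact, Z[0] @ Z[1])                             # agree up to additive error ~ 1/sqrt(D)
```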


Absolute Orientation for Word Embedding Alignment

arXiv.org Machine Learning

We propose a new technique to align word embeddings which are derived from different source datasets or created using different mechanisms (e.g., GloVe or word2vec). We design a simple, closed-form solution to find the optimal rotation, and optionally scaling, which minimizes the root mean squared error or maximizes the average cosine similarity between two embeddings of the same vocabulary into a space of the same dimension. Our methods extend approaches known as Absolute Orientation, which are popular for aligning objects in three dimensions. We extend them to arbitrary dimensions, show that a simple scaling solution can be derived independently of the rotation, and show that it optimizes cosine similarity. We then demonstrate how to evaluate the similarity of embeddings from different sources or mechanisms, and that certain properties like synonyms and analogies are preserved across the embeddings and can be enhanced by simply aligning and averaging ensembles of embeddings.
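
The closed-form core of Absolute Orientation is the orthogonal Procrustes solution; the sketch below (with illustrative variable names, and without the paper's full averaging pipeline) recovers an orthogonal map and an optional scale via one SVD.

```python
import numpy as np

def align(A, B, scale=False):
    """Orthogonal map R (and scale s) minimizing ||s * A @ R - B||_F for row-aligned A, B."""
    U, S, Vt = np.linalg.svd(A.T @ B)
    R = U @ Vt
    s = S.sum() / (A ** 2).sum() if scale else 1.0
    return R, s

rng = np.random.default_rng(9)
A = rng.normal(size=(1000, 50))                        # embedding 1 (vocabulary x dimension)
R_true = np.linalg.qr(rng.normal(size=(50, 50)))[0]    # hidden orthogonal transform
B = 2.0 * A @ R_true + 0.01 * rng.normal(size=A.shape) # embedding 2: rotated, scaled, noisy copy
R, s = align(A, B, scale=True)
print(np.linalg.norm(s * A @ R - B))                   # small residual after alignment
```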