AITopics | rptree

Collaborating Authors

rptree

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Random Projection Forest Initialization for Graph Convolutional Networks

Alshammari, Mashaan, Stavrakakis, John, Ahmed, Adel F., Takatsuka, Masahiro

arXiv.org Artificial IntelligenceOct-22-2023

Graph convolutional networks (GCNs) were a great step towards extending deep learning to unstructured data such as graphs. But GCNs still need a constructed graph to work with. To solve this problem, classical graphs such as $k$-nearest neighbor are usually used to initialize the GCN. Although it is computationally efficient to construct $k$-nn graphs, the constructed graph might not be very useful for learning. In a $k$-nn graph, points are restricted to have a fixed number of edges, and all edges in the graph have equal weights. We present a new way to construct the graph and initialize the GCN. It is based on random projection forest (rpForest). rpForest enables us to assign varying weights on edges indicating varying importance, which enhanced the learning. The number of trees is a hyperparameter in rpForest. We performed spectral analysis to help us setting this parameter in the right range. In the experiments, initializing the GCN using rpForest provides better results compared to $k$-nn initialization.

adjacency matrix, graph, random projection forest initialization, (9 more...)

arXiv.org Artificial Intelligence

doi: 10.1016/j.mex.2023.102315

2302.12001

Country:

Asia > Middle East > Saudi Arabia > Eastern Province > Dhahran (0.14)
Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > New York > New York County > New York City (0.04)
Asia > Middle East > Saudi Arabia > Riyadh Province > Riyadh (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)

Add feedback

The Effect of Points Dispersion on the $k$-nn Search in Random Projection Forests

Alshammari, Mashaan, Stavrakakis, John, Ahmed, Adel F., Takatsuka, Masahiro

arXiv.org Artificial IntelligenceFeb-25-2023

Partitioning trees are efficient data structures for $k$-nearest neighbor search. Machine learning libraries commonly use a special type of partitioning trees called $k$d-trees to perform $k$-nn search. Unfortunately, $k$d-trees can be ineffective in high dimensions because they need more tree levels to decrease the vector quantization (VQ) error. Random projection trees rpTrees solve this scalability problem by using random directions to split the data. A collection of rpTrees is called rpForest. $k$-nn search in an rpForest is influenced by two factors: 1) the dispersion of points along the random direction and 2) the number of rpTrees in the rpForest. In this study, we investigate how these two factors affect the $k$-nn search with varying $k$ values and different datasets. We found that with larger number of trees, the dispersion of points has a very limited effect on the $k$-nn search. One should use the original rpTree algorithm by picking a random direction regardless of the dispersion of points.

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ACCESS.2022.3195488

2302.1316

Country:

Asia > Middle East > Saudi Arabia > Eastern Province > Dhahran (0.14)
Oceania > Australia > New South Wales > Sydney (0.14)
North America > United States > Pennsylvania (0.04)
(4 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.94)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.33)

Add feedback

$DC^2$: A Divide-and-conquer Algorithm for Large-scale Kernel Learning with Application to Clustering

Wang, Ke Alexander, Bian, Xinran, Liu, Pan, Yan, Donghui

arXiv.org Machine LearningNov-15-2019

Divide-and-conquer is a general strategy to deal with large scale problems. It is typically applied to generate ensemble instances, which potentially limits the problem size it can handle. Additionally, the data are often divided by random sampling which may be suboptimal. To address these concerns, we propose the $DC^2$ algorithm. Instead of ensemble instances, we produce structure-preserving signature pieces to be assembled and conquered. $DC^2$ achieves the efficiency of sampling-based large scale kernel methods while enabling parallel multicore or clustered computation. The data partition and subsequent compression are unified by recursive random projections. Empirically dividing the data by random projections induces smaller mean squared approximation errors than conventional random sampling. The power of $DC^2$ is demonstrated by our clustering algorithm $rpfCluster^+$, which is as accurate as some fastest approximate spectral clustering algorithms while maintaining a running time close to that of K-means clustering. Analysis on $DC^2$ when applied to spectral clustering shows that the loss in clustering accuracy due to data division and reduction is upper bounded by the data approximation error which would vanish with recursive random projections. Due to its easy implementation and flexibility, we expect $DC^2$ to be applicable to general large scale learning problems.

algorithm, partition, random projection, (15 more...)

arXiv.org Machine Learning

1911.06944

Country:

North America > United States > Massachusetts > Bristol County > Dartmouth (0.14)
Asia > Middle East > Jordan (0.05)
North America > United States > New York > Tompkins County > Ithaca (0.04)
(3 more...)

Genre: Research Report (0.50)

Industry: Education (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.78)

Add feedback

Learning over inherently distributed data

Yan, Donghui, Xu, Ying

arXiv.org Machine LearningJul-30-2019

The recent decades have seen a surge of interests in distributed computing. Existing work focus primarily on either distributed computing platforms, data query tools, or, algorithms to divide big data and conquer at individual machines etc. It is, however, increasingly often that the data of interest are inherently distributed, i.e., data are stored at multiple distributed sites due to diverse collection channels, business operations etc. We propose to enable learning and inference in such a setting via a general framework based on the distortion minimizing local transformations. This framework only requires a small amount of local signatures to be shared among distributed sites, eliminating the need of having to transmitting big data. Computation can be done very efficiently via parallel local computation. The error incurred due to distributed computing vanishes when increasing the size of local signatures. As the shared data need not be in their original form, data privacy may also be preserved. Experiments on linear (logistic) regression and Random Forests have shown promise of this approach. This framework is expected to apply to a general class of tools in learning and inference with the continuity property.

data mining, machine learning, regression, (20 more...)

arXiv.org Machine Learning

1907.13208

Country: North America > United States > Massachusetts (0.46)

Genre: Research Report > New Finding (0.49)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.51)

Add feedback

Fast communication-efficient spectral clustering over distributed data

Yan, Donghui, Wang, Yingjie, Wang, Jin, Wu, Guodong, Wang, Honggang

arXiv.org Machine LearningMay-5-2019

The last decades have seen a surge of interests in distributed computing thanks to advances in clustered computing and big data technology. Existing distributed algorithms typically assume {\it all the data are already in one place}, and divide the data and conquer on multiple machines. However, it is increasingly often that the data are located at a number of distributed sites, and one wishes to compute over all the data with low communication overhead. For spectral clustering, we propose a novel framework that enables its computation over such distributed data, with "minimal" communications while a major speedup in computation. The loss in accuracy is negligible compared to the non-distributed setting. Our approach allows local parallel computing at where the data are located, thus turns the distributed nature of the data into a blessing; the speedup is most substantial when the data are evenly distributed across sites. Experiments on synthetic and large UC Irvine datasets show almost no loss in accuracy with our approach while about 2x speedup under various settings with two distributed sites. As the transmitted data need not be in their original form, our framework readily addresses the privacy concern for data sharing in distributed computing.

data mining, machine learning, spectral, (21 more...)

arXiv.org Machine Learning

doi: 10.1109/TBDATA.2019.2907985

1905.01596

Country:

North America > United States > Massachusetts > Bristol County > Dartmouth (0.14)
Asia > Middle East > Jordan (0.05)
Pacific Ocean > North Pacific Ocean > San Francisco Bay (0.04)
(4 more...)

Genre: Research Report (0.50)

Industry: Information Technology > Security & Privacy (0.48)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Communications (1.00)
(4 more...)

Add feedback

K-nearest Neighbor Search by Random Projection Forests

Yan, Donghui, Wang, Yingjie, Wang, Jin, Wang, Honggang, Li, Zhenpeng

arXiv.org Machine LearningDec-30-2018

K-nearest neighbor (kNN) search refers to the problem of finding K points closest toa given data point on a distance metric of interest. It is an important task in a wide range of applications, including similarity search in data mining [15,19], fast kernel methods in machine learning [17, 30, 38], nonparametric density estimation [5, 29, 31] and intrinsic dimension estimation [6, 26] in statistics, aswell as anomaly detection algorithms [2, 10, 37]. Numerous algorithms have been proposed for kNN search; the readers are referred to [35, 46] and references therein. Our interest is kNN search in emerging applications. Two 1 salient features of such applications are the expected scalability of the algorithms andtheir ability to handle data of high dimensionality. Additionally, such applications often desire more accurate kNN search. For example, robotic route planning [23] and face-based surveillance systems [34] require a high accuracy forthe robust execution of tasks. However, most existing work on kNN search [1, 4, 12, 15] have focused mainly on the fast computation and accuracy isofalessconcern.

knn search, projection, rptree, (15 more...)

arXiv.org Machine Learning

1812.11689

Country:

North America > United States > Massachusetts > Bristol County > Dartmouth (0.14)
Asia > Middle East > Jordan (0.04)
North America > United States > Wisconsin (0.04)
(3 more...)

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (1.00)

Add feedback

Incorporating Deep Features in the Analysis of Tissue Microarray Images

Yan, Donghui, Randolph, Timothy W., Zou, Jian, Gong, Peng

arXiv.org Machine LearningNov-25-2018

Tissue microarray (TMA) images have been used increasingly often in cancer studies and the validation of biomarkers. TACOMA---a cutting-edge automatic scoring algorithm for TMA images---is comparable to pathologists in terms of accuracy and repeatability. Here we consider how this algorithm may be further improved. Inspired by the recent success of deep learning, we propose to incorporate representations learnable through computation. We explore representations of a group nature through unsupervised learning, e.g., hierarchical clustering and recursive space partition. Information carried by clustering or spatial partitioning may be more concrete than the labels when the data are heterogeneous, or could help when the labels are noisy. The use of such information could be viewed as regularization in model fitting. It is motivated by major challenges in TMA image scoring---heterogeneity and label noise, and the cluster assumption in semi-supervised learning. Using this information on TMA images of breast cancer, we have reduced the error rate of TACOMA by about 6%. Further simulations on synthetic data provide insights on when such representations would likely help. Although we focus on TMAs, learnable representations of this type are expected to be applicable in other settings.

artificial intelligence, machine learning, tma image, (16 more...)

arXiv.org Machine Learning

1812.00887

Country: North America > United States > California (0.67)

Genre: Research Report (0.82)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Oncology > Breast Cancer (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback