Yan, Donghui
A Deep Neural Network Based Approach to Building Budget-Constrained Models for Big Data Analysis
Ming, Rui, Xu, Haiping, Gibbs, Shannon E., Yan, Donghui, Shao, Ming
Deep learning approaches require the collection of data on many different input features or variables for accurate model training and prediction. Since data collection on input features can be costly, it is crucial to reduce the cost by selecting a subset of features and developing a budget-constrained model (BCM). In this paper, we introduce an approach to eliminating less important features for big data analysis using Deep Neural Networks (DNNs). Once a DNN model has been developed, we identify the weak links and weak neurons, and remove some input features to bring the model cost within a given budget. The experimental results show that our approach is feasible and supports user selection of a suitable BCM within a given budget.
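A minimal sketch of the pruning step, assuming feature importance is scored by the magnitude of first-layer weights; the paper's exact weak-link and weak-neuron criteria may differ.

import numpy as np

def select_features(W1, costs, budget):
    # W1: (n_features, n_hidden) first-layer weight matrix of a trained DNN
    # costs: per-feature data-collection cost; budget: total allowed cost
    importance = np.abs(W1).sum(axis=1)   # weak links have small magnitude
    order = np.argsort(-importance)       # most important features first
    kept, total = [], 0.0
    for j in order:
        if total + costs[j] <= budget:
            kept.append(j)
            total += costs[j]
    return sorted(kept)

Sweeping the budget parameter yields a family of candidate BCMs from which a user can choose.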
Improving Short Text Classification With Augmented Data Using GPT-3
Balkus, Salvador, Yan, Donghui
GPT-3 is a large-scale natural language model developed by OpenAI that can perform many different tasks, including topic classification. Although researchers claim that it requires only a small number of in-context examples to learn a task, in practice GPT-3 requires these training examples to be either of exceptional quality or of a higher quantity than can easily be created by hand. To address this issue, this study teaches GPT-3 to classify whether a question is related to data science by augmenting a small training set with additional examples generated by GPT-3 itself. This study compares two classifiers: the GPT-3 Classification Endpoint with augmented examples, and the GPT-3 Completion Endpoint with an optimal training set chosen using a genetic algorithm. We find that while the augmented Completion Endpoint achieves upwards of 80 percent validation accuracy, the augmented Classification Endpoint yields more consistent accuracy on unseen examples. In this way, giving large-scale machine learning models like GPT-3 the ability to propose their own additional training examples can improve classification performance.
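A schematic of the self-augmentation loop; generate() is a hypothetical stand-in for a GPT-3 Completion API call, and the prompt format is assumed for illustration only.

def augment(seed_examples, generate, n_new=50, max_calls=10):
    # seed_examples: list of (question, label) pairs created by hand
    # generate: hypothetical function that sends a prompt to GPT-3
    header = ("Each line is a question and whether it is about data science "
              "(Yes/No). Write more lines in the same format.\n")
    seed = "\n".join(f"{q} -> {label}" for q, label in seed_examples)
    new_examples = []
    for _ in range(max_calls):
        if len(new_examples) >= n_new:
            break
        for line in generate(header + seed).splitlines():
            q, sep, label = line.rpartition("->")
            if sep:  # keep only lines that follow the expected format
                new_examples.append((q.strip(), label.strip()))
    return new_examples[:n_new]

The augmented set can then be supplied to the Classification Endpoint, or embedded as in-context examples for the Completion Endpoint.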
Learning Low-dimensional Manifolds for Scoring of Tissue Microarray Images
Yan, Donghui, Zou, Jian, Li, Zhenpeng
Tissue microarray (TMA) images have emerged as an important high-throughput tool for cancer study and the validation of biomarkers. Efforts have been dedicated to further improving the accuracy of TACOMA, a cutting-edge automatic scoring algorithm for TMA images. One major advance is due to deepTacoma, an algorithm that incorporates suitable deep representations of a group nature. Inspired by recent advances in semi-supervised learning and deep learning, we propose mfTacoma to learn alternative deep representations in the context of TMA image scoring. In particular, mfTacoma learns low-dimensional manifolds, a common latent structure in high-dimensional data. Deep representation learning and manifold learning typically require large data. By encoding a deep representation of the manifolds as regularizing features, mfTacoma effectively leverages manifold information that is potentially crude due to small data. Our experiments show that deep features by manifolds outperform two alternatives: deep features by linear manifolds with principal component analysis, and deep features obtained by leveraging the group property.
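A sketch of manifold features as regularizers, with Isomap standing in for the manifold learner and a random forest for the scorer; both are illustrative assumptions rather than the paper's exact choices.

import numpy as np
from sklearn.manifold import Isomap
from sklearn.ensemble import RandomForestClassifier

def fit_with_manifold_features(X, y, n_components=5):
    # learn a low-dimensional manifold embedding of the image features
    embedding = Isomap(n_components=n_components).fit_transform(X)
    # append the embedding as extra, regularizing features
    X_aug = np.hstack([X, embedding])
    return RandomForestClassifier().fit(X_aug, y)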
Similarity Kernel and Clustering via Random Projection Forests
Yan, Donghui, Gu, Songxiang, Xu, Ying, Qin, Zhiwei
Similarity plays a fundamental role in many areas, including data mining, machine learning, statistics and various applied domains. Inspired by the success of ensemble methods and the flexibility of trees, we propose to learn a similarity kernel called rpf-kernel through random projection forests (rpForests). Our theoretical analysis reveals a highly desirable property of rpf-kernel: far-away (dissimilar) points have a low similarity value while nearby (similar) points have a high similarity, and the similarities have a natural interpretation as the probability of points remaining in the same leaf nodes during the growth of rpForests. The learned rpf-kernel leads to an effective clustering algorithm, rpfCluster. On a wide variety of real and benchmark datasets, rpfCluster compares favorably to K-means clustering, spectral clustering, and a state-of-the-art clustering ensemble algorithm, Cluster Forests. Our approach is simple to implement and readily adapts to the geometry of the underlying data. Given its desirable theoretical properties and competitive empirical performance when applied to clustering, we expect rpf-kernel to be applicable to many problems of an unsupervised nature, or as a regularizer in some supervised or weakly supervised settings.
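A compact sketch of the rpf-kernel construction; for brevity each tree draws one random direction per level rather than per node, a simplification of full rpForests.

import numpy as np

def leaf_ids(X, depth=8, rng=None):
    # one random projection tree: at each level, project the points onto a
    # random direction and split every current node at its median
    rng = rng or np.random.default_rng()
    ids = np.zeros(len(X), dtype=int)
    for _ in range(depth):
        proj = X @ rng.normal(size=X.shape[1])
        new = ids.copy()
        for node in np.unique(ids):
            m = ids == node
            new[m] = 2 * node + (proj[m] > np.median(proj[m]))
        ids = new
    return ids

def rpf_kernel(X, n_trees=100):
    # similarity of two points = fraction of trees where they share a leaf
    K = np.zeros((len(X), len(X)))
    for _ in range(n_trees):
        ids = leaf_ids(X)
        K += ids[:, None] == ids[None, :]
    return K / n_trees

Feeding K to a standard kernel or spectral clustering routine gives an rpfCluster-style algorithm.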
Learning over inherently distributed data
Yan, Donghui, Xu, Ying
The recent decades have seen a surge of interest in distributed computing. Existing work focuses primarily on distributed computing platforms, data query tools, or algorithms that divide big data and conquer on individual machines. It is, however, increasingly often the case that the data of interest are inherently distributed, i.e., stored at multiple distributed sites due to diverse collection channels, business operations, and so on. We propose to enable learning and inference in such a setting via a general framework based on distortion-minimizing local transformations. This framework requires only a small amount of local signatures to be shared among distributed sites, eliminating the need to transmit big data. Computation can be done very efficiently via parallel local computation. The error incurred due to distributed computing vanishes as the size of the local signatures increases. As the shared data need not be in their original form, data privacy may also be preserved. Experiments on linear (logistic) regression and Random Forests have shown the promise of this approach. The framework is expected to apply to a general class of learning and inference tools that satisfy a continuity property.
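The sketch below illustrates the signature-sharing idea, with k-means centroids (plus majority labels) assumed as the local signatures; the paper's distortion-minimizing local transformation is not reproduced here.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def local_signature(X, y, k=50):
    # computed at each site; y is assumed integer-coded
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    labels = np.array([np.bincount(y[km.labels_ == c]).argmax()
                       for c in range(k)])
    return km.cluster_centers_, labels   # only this small summary is shared

def fit_over_sites(sites, k=50):
    # sites: list of (X, y) pairs, each held at a different location
    parts = [local_signature(X, y, k) for X, y in sites]
    Xs = np.vstack([c for c, _ in parts])
    ys = np.concatenate([l for _, l in parts])
    return LogisticRegression(max_iter=1000).fit(Xs, ys)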
Fast communication-efficient spectral clustering over distributed data
Yan, Donghui, Wang, Yingjie, Wang, Jin, Wu, Guodong, Wang, Honggang
The last decades have seen a surge of interest in distributed computing thanks to advances in clustered computing and big data technology. Existing distributed algorithms typically assume that all the data are already in one place, and divide the data and conquer on multiple machines. However, it is increasingly often the case that the data are located at a number of distributed sites, and one wishes to compute over all the data with low communication overhead. For spectral clustering, we propose a novel framework that enables its computation over such distributed data with minimal communication and a major speedup in computation. The loss in accuracy is negligible compared to the non-distributed setting. Our approach allows local parallel computing where the data are located, thus turning the distributed nature of the data into a blessing; the speedup is most substantial when the data are evenly distributed across sites. Experiments on synthetic and large UC Irvine datasets show almost no loss in accuracy and about a 2x speedup under various settings with two distributed sites. As the transmitted data need not be in their original form, our framework readily addresses the privacy concern of data sharing in distributed computing.
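A schematic of the distributed computation, with local k-means compression assumed for illustration; the paper's local transformation may differ.

import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

def distributed_spectral(sites, k_local=100, n_clusters=3):
    # each site compresses its data and transmits only the representatives
    reps = np.vstack([KMeans(n_clusters=k_local, n_init=10)
                      .fit(X).cluster_centers_ for X in sites])
    rep_labels = SpectralClustering(n_clusters=n_clusters).fit_predict(reps)
    # each original point inherits the label of its nearest representative
    out = []
    for X in sites:
        d = ((X[:, None, :] - reps[None, :, :]) ** 2).sum(-1)
        out.append(rep_labels[d.argmin(axis=1)])
    return out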
Cost-sensitive Selection of Variables by Ensemble of Model Sequences
Yan, Donghui, Qin, Zhiwei, Gu, Songxiang, Xu, Haiping, Shao, Ming
Many applications require the collection of data on different variables or measurements over a number of system performance metrics. For example, some cyber systems rely on scanning various system metrics to detect or predict potential cyber intrusions or threats. In the maintenance of airplanes or major factory machinery, measurements of different system components and their usage statistics are collected to determine when maintenance is required. In medical diagnosis, a patient may be asked to take various medical tests, such as on blood pressure, cholesterol level, heart rate and so on, so that the doctor can determine whether the patient has a certain disease. In the development of an e-commerce product that predicts the click or purchase of a product at an e-commerce website, much data related to a user's shopping behavior is collected, and often extra data relevant to the product or the user's shopping behavior is purchased from a third-party vendor. The data collected on the various measures need to be combined, and if cost is a concern, a subset of the measures needs to be selected to satisfy the budget constraint. This is the cost-sensitive variable selection problem, which we address with an ensemble of model sequences.
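As a concrete illustration of the budget-constrained setup (a greedy baseline, not the paper's ensemble of model sequences), one can repeatedly add the variable with the best cross-validated accuracy gain per unit cost until the budget is exhausted.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

def greedy_select(X, y, costs, budget):
    selected, remaining = [], list(range(X.shape[1]))
    spent, best_acc = 0.0, 0.0
    while remaining:
        gains = []
        for j in remaining:
            if spent + costs[j] > budget:
                continue
            acc = cross_val_score(LogisticRegression(max_iter=1000),
                                  X[:, selected + [j]], y, cv=3).mean()
            gains.append(((acc - best_acc) / costs[j], acc, j))
        if not gains:
            break   # no affordable variable remains
        _, best_acc, j = max(gains)
        selected.append(j)
        remaining.remove(j)
        spent += costs[j]
    return selected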
K-nearest Neighbor Search by Random Projection Forests
Yan, Donghui, Wang, Yingjie, Wang, Jin, Wang, Honggang, Li, Zhenpeng
K-nearest neighbor (kNN) search refers to the problem of finding the K points closest to a given data point under a distance metric of interest. It is an important task in a wide range of applications, including similarity search in data mining [15, 19], fast kernel methods in machine learning [17, 30, 38], nonparametric density estimation [5, 29, 31] and intrinsic dimension estimation [6, 26] in statistics, as well as anomaly detection algorithms [2, 10, 37]. Numerous algorithms have been proposed for kNN search; the reader is referred to [35, 46] and references therein. Our interest is kNN search in emerging applications. Two salient features of such applications are the expected scalability of the algorithms and their ability to handle data of high dimensionality. Additionally, such applications often demand more accurate kNN search. For example, robotic route planning [23] and face-based surveillance systems [34] require high accuracy for the robust execution of tasks. However, most existing work on kNN search [1, 4, 12, 15] has focused mainly on fast computation, with accuracy being of less concern.
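A sketch of the ensemble idea for all-pairs kNN search: each random projection tree contributes a small candidate set, and exact distances are computed only within the union of candidates. As in the rpf-kernel sketch above, one random direction per level is a simplification of rpForests.

import numpy as np

def leaf_ids(X, depth=8, rng=None):
    # one random projection tree over the data points
    rng = rng or np.random.default_rng()
    ids = np.zeros(len(X), dtype=int)
    for _ in range(depth):
        proj = X @ rng.normal(size=X.shape[1])
        new = ids.copy()
        for node in np.unique(ids):
            m = ids == node
            new[m] = 2 * node + (proj[m] > np.median(proj[m]))
        ids = new
    return ids

def rpforest_knn(X, k=5, n_trees=10):
    forests = [leaf_ids(X) for _ in range(n_trees)]
    neighbors = []
    for i in range(len(X)):
        # candidates: points sharing a leaf with point i in at least one tree
        cand = np.unique(np.concatenate(
            [np.flatnonzero(ids == ids[i]) for ids in forests]))
        d = ((X[cand] - X[i]) ** 2).sum(axis=1)
        neighbors.append(cand[np.argsort(d)[1:k + 1]])  # drop point i itself
    return neighbors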
Incorporating Deep Features in the Analysis of Tissue Microarray Images
Yan, Donghui, Randolph, Timothy W., Zou, Jian, Gong, Peng
Tissue microarray (TMA) images have been used increasingly often in cancer studies and the validation of biomarkers. TACOMA---a cutting-edge automatic scoring algorithm for TMA images---is comparable to pathologists in terms of accuracy and repeatability. Here we consider how this algorithm may be further improved. Inspired by the recent success of deep learning, we propose to incorporate representations learnable through computation. We explore representations of a group nature through unsupervised learning, e.g., hierarchical clustering and recursive space partition. Information carried by clustering or spatial partitioning may be more concrete than the labels when the data are heterogeneous, or could help when the labels are noisy. The use of such information could be viewed as regularization in model fitting. It is motivated by major challenges in TMA image scoring---heterogeneity and label noise, and the cluster assumption in semi-supervised learning. Using this information on TMA images of breast cancer, we have reduced the error rate of TACOMA by about 6%. Further simulations on synthetic data provide insights on when such representations would likely help. Although we focus on TMAs, learnable representations of this type are expected to be applicable in other settings.
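A minimal sketch of group-type representations: cluster the images without using labels and append the memberships as extra features. The clusterer and one-hot encoding here are illustrative assumptions.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def add_group_features(X, n_clusters=10):
    # unsupervised grouping of the image feature vectors
    groups = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)
    onehot = np.eye(n_clusters)[groups]   # encode memberships as features
    return np.hstack([X, onehot])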
Classification under Data Contamination with Application to Remote Sensing Image Mis-registration
Yan, Donghui, Gong, Peng, Chen, Aiyou, Zhong, Liheng
This work is motivated by the problem of image mis-registration in remote sensing, where we are interested in determining the resulting loss in the accuracy of pattern classification. A statistical formulation is given in which we propose to use data contamination to model and understand the phenomenon of image mis-registration. The model is widely applicable to many other types of errors as well, for example, measurement errors and gross errors. The impact of data contamination on classification is studied under a statistical learning theoretical framework. A closed-form asymptotic bound is established for the resulting loss in classification accuracy, which is less than $\epsilon/(1-\epsilon)$ for a data contamination fraction of $\epsilon$. Our bound is sharper than similar bounds in the domain adaptation literature and, unlike such bounds, it applies to classifiers with an infinite Vapnik-Chervonenkis (VC) dimension. Extensive simulations have been conducted on both synthetic and real datasets under various types of data contamination, including label flipping, feature swapping, and the replacement of feature values with data generated from a random source such as a Gaussian or Cauchy distribution. Our simulation results show that the bound we derive is fairly tight.
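In symbols (with notation assumed here rather than taken from the paper): if $\hat{f}_\epsilon$ denotes the classifier learned from $\epsilon$-contaminated data and $\hat{f}$ its clean-data counterpart, the bound claimed above reads
$$ \mathrm{err}(\hat{f}_\epsilon) - \mathrm{err}(\hat{f}) \;\le\; \frac{\epsilon}{1-\epsilon}, \qquad 0 \le \epsilon < 1, $$
so, for example, contaminating $10\%$ of the data ($\epsilon = 0.1$) costs at most about $0.1/0.9 \approx 0.111$ in classification accuracy.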