AITopics | Accuracy

K-Metamodes: frequency- and ensemble-based distributed k-modes clustering for security analytics

arXiv.org Machine Learning, Sep-30-2019

Processing of Big Security Data, such as log messages, is now commonly used for intrusion detection purposes. Its heterogeneous nature, as well as its combination of numerical and categorical attributes, does not allow existing data mining methods to be applied directly to the data without feature preprocessing. Therefore, a rather computationally expensive conversion of categorical attributes into vector space has to be used for analysis of such data. However, the well-known k-modes algorithm can cluster categorical data directly and avoid conversion into the vector space. The existing implementations of k-modes for Big Data processing are ensemble-based and utilise two-step clustering, where data subsets are first clustered independently and the resulting cluster modes are then clustered again in order to calculate metamodes valid for all data subsets. In this paper, a novel frequency-based distance function is proposed for the second step of ensemble-based k-modes clustering. In addition, the existing feature discretisation method from the previous work is utilised in order to adapt k-modes for processing of mixed data sets. The resulting k-metamodes algorithm was tested on two public security data sets and reached higher effectiveness in comparison with the previous work.
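The plain k-modes step that the abstract builds on can be sketched as follows. This is a minimal illustration, not the paper's k-metamodes: it uses the simple matching (Hamming) dissimilarity rather than the proposed frequency-based distance, and the function names and sample log records are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))


def mode_of(records):
    """Column-wise most frequent value; ties go to the value seen first."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def k_modes(records, initial_modes, iterations=10):
    """Assign each record to its nearest mode, then recompute the modes."""
    modes = list(initial_modes)
    for _ in range(iterations):
        clusters = [[] for _ in modes]
        for rec in records:
            nearest = min(range(len(modes)), key=lambda i: hamming(rec, modes[i]))
            clusters[nearest].append(rec)
        # Keep the old mode when a cluster ends up empty.
        new_modes = [mode_of(c) if c else m for c, m in zip(clusters, modes)]
        if new_modes == modes:
            break
        modes = new_modes
    return modes


# Hypothetical categorical log records: (service, outcome, user).
records = [
    ("ssh", "fail", "root"), ("ssh", "fail", "admin"), ("ssh", "fail", "root"),
    ("http", "ok", "guest"), ("http", "ok", "guest"), ("http", "error", "guest"),
]
modes = k_modes(records, [records[0], records[3]])
# modes == [("ssh", "fail", "root"), ("http", "ok", "guest")]
```

In a two-step ensemble of the kind the abstract describes, the modes returned for each data subset would themselves be passed to a second clustering run to obtain metamodes; the distance used in that second step is where the paper's contribution lies.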

algorithm, distance function, frequency-based distance function, (10 more...)

arXiv.org Machine Learning

1909.13721

Country:

Europe > Germany > Brandenburg > Potsdam (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)

Genre: Research Report (0.40)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.96)

Unsupervised Evaluation Metrics and Learning Criteria for Non-Parallel Textual Transfer

Pang, Richard Yuanzhe, Gimpel, Kevin

arXiv.org Artificial Intelligence, Sep-30-2019

We consider the problem of automatically generating textual paraphrases with modified attributes or properties, focusing on the setting without parallel data (Hu et al., 2017; Shen et al., 2017). This setting poses challenges for evaluation. We show that the metric of post-transfer classification accuracy is insufficient on its own, and propose additional metrics based on semantic preservation and fluency as well as a way to combine them into a single overall score. We contribute new loss functions and training strategies to address the different metrics. Semantic preservation is addressed by adding a cyclic consistency loss and a loss based on paraphrase pairs, while fluency is improved by integrating losses based on style-specific language models. We experiment with a Yelp sentiment dataset and a new literature dataset that we propose, using multiple models that extend prior work (Shen et al., 2017). We demonstrate that our metrics correlate well with human judgments, at both the sentence-level and system-level. Automatic and manual evaluation also show large improvements over the baseline method of Shen et al. (2017). We hope that our proposed metrics can speed up system development for new textual transfer tasks while also encouraging the community to address our three complementary aspects of transfer quality.

computational linguistic, metric, proceedings, (12 more...)

arXiv.org Artificial Intelligence

1810.11878

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Illinois > Cook County > Chicago (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(5 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.35)

Multi-classifier prediction of knee osteoarthritis progression from incomplete imbalanced longitudinal data

Widera, Paweł, Welsing, Paco M. J., Ladel, Christoph, Loughlin, John, Lafeber, Floris P. F. J., Dop, Florence Petit, Larkin, Jonathan, Weinans, Harrie, Mobasheri, Ali, Bacardit, Jaume

arXiv.org Machine Learning, Sep-29-2019

Conventional inclusion criteria used in osteoarthritis clinical trials are not very effective in selecting patients who would benefit the most from a therapy under test. Typically these criteria select a majority of patients who show no or limited disease progression during the short evaluation window of the study. As a consequence, less insight into the relative effect of the treatment can be gained from the collected data, and the efforts and resources invested in running the study do not pay off. This could be avoided if selection criteria were more predictive of future disease progression. In this article, we formulate the patient selection problem as a multi-class classification task, with classes based on clinically relevant measures of progression (over a time scale typical for clinical trials). Using data from two long-term knee osteoarthritis studies, OAI and CHECK, we tested multiple algorithms and learning process configurations (including multi-classifier approaches, cost-sensitive learning, and feature selection) to identify the best performing machine learning models. We examined the behaviour of the best models, with respect to prediction errors and the impact of the features used, to confirm their clinical relevance. We found that the model-based selection outperforms the conventional inclusion criteria, reducing by 20-25% the number of patients who show no progression and making the representation of the patient categories more even. This result indicates that our machine learning approach could lead to efficiency improvements in clinical trial design.

knee osteoarthritis progression, prediction, progression, (14 more...)

arXiv.org Machine Learning

1909.13408

Country:

Europe > United Kingdom (0.04)
Europe > Netherlands > South Holland > Delft (0.04)
Europe > Finland > Northern Ostrobothnia > Oulu (0.04)
(6 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Rheumatology (1.00)
Health & Medicine > Therapeutic Area > Musculoskeletal (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)

Active Anomaly Detection for time-domain discoveries

Ishida, Emille E. O., Kornilov, Matwey V., Malanchev, Konstantin L., Pruzhinskaya, Maria V., Volnova, Alina A., Korolev, Vladimir S., Mondon, Florian, Sreejith, Sreevarsha, Malancheva, Anastasia, Das, Shubhomoy

arXiv.org Machine Learning, Sep-29-2019

We present the first application of adaptive machine learning to the identification of anomalies in a data set of non-periodic astronomical light curves. The method follows an active learning strategy where highly informative objects are selected to be labelled. This new information is subsequently used to improve the machine learning model, allowing its accuracy to evolve with the addition of every new classification. For the case of anomaly detection, the algorithm aims to maximize the number of real anomalies presented to the expert by slightly modifying the decision boundary of a traditional isolation forest in each iteration. As a proof of concept, we apply the Active Anomaly Discovery (AAD) algorithm to light curves from the Open Supernova Catalog and compare its results to those of a static Isolation Forest (IF). For both methods, we visually inspected the objects with the 2% highest anomaly scores. We show that AAD was able to identify 80% more true anomalies than IF. This result is the first evidence that AAD algorithms can play a central role in the search for new physics in the era of large-scale sky surveys.

algorithm, anomaly, anomaly score, (12 more...)

arXiv.org Machine Learning

1909.13260

Country:

Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.06)
Asia > Russia (0.05)
North America > United States > Washington > Whitman County > Pullman (0.04)
(3 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.30)

Learning Sparse Nonparametric DAGs

Zheng, Xun, Dan, Chen, Aragam, Bryon, Ravikumar, Pradeep, Xing, Eric P.

arXiv.org Machine Learning, Sep-28-2019

We develop a framework for learning sparse nonparametric directed acyclic graphs (DAGs) from data. Our approach is based on a recent algebraic characterization of DAGs that led to the first fully continuous optimization for score-based learning of DAG models parametrized by a linear structural equation model (SEM). We extend this algebraic characterization to nonparametric SEM by leveraging nonparametric sparsity based on partial derivatives, resulting in a continuous optimization problem that can be applied to a variety of nonparametric and semiparametric models including GLMs, additive noise models, and index models as special cases. We also explore the use of neural networks and orthogonal basis expansions to model nonlinearities for general nonparametric models. Extensive empirical study confirms the necessity of nonlinear dependency and the advantage of continuous optimization for score-based learning.
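The abstract does not spell out the algebraic characterization it extends. Assuming it is the trace-exponential acyclicity function from the linear-SEM work it cites, h(W) = tr(exp(W ∘ W)) - d, which is zero exactly when the weighted adjacency matrix W describes a DAG, a minimal pure-Python sketch looks like this. The function names and the truncated power series are illustrative choices, not the authors' implementation.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def acyclicity(w, terms=20):
    """Approximate h(W) = tr(exp(W * W)) - d with a truncated power series.

    Since tr(exp(A)) = d + sum_{k>=1} tr(A^k) / k!, the value returned is
    the sum over k >= 1; it is zero when the graph has no directed cycles.
    """
    d = len(w)
    a = [[x * x for x in row] for row in w]  # element-wise (Hadamard) square
    power = [[float(i == j) for j in range(d)] for i in range(d)]  # A^0
    factorial = 1.0
    total = 0.0
    for k in range(1, terms + 1):
        power = matmul(power, a)
        factorial *= k
        total += sum(power[i][i] for i in range(d)) / factorial
    return total


dag = [[0.0, 0.8, 0.3],
       [0.0, 0.0, 0.5],
       [0.0, 0.0, 0.0]]    # edges 0->1, 0->2, 1->2: no cycle
cycle = [[0.0, 1.0],
         [1.0, 0.0]]       # edges 0->1 and 1->0: a 2-cycle

h_dag = acyclicity(dag)      # 0.0
h_cycle = acyclicity(cycle)  # positive, so the constraint is violated
```

Because h is smooth in W, it can serve as an equality constraint in a continuous optimizer, which is what makes the "fully continuous optimization" mentioned in the abstract possible.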

additive model, machine learning research, optimization problem, (13 more...)

arXiv.org Machine Learning

1909.13189

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Asia > Middle East > Jordan (0.04)
North America > United States > New York (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.68)
(2 more...)

Learning Generative Adversarial RePresentations (GAP) under Fairness and Censoring Constraints

Liao, Jiachun, Huang, Chong, Kairouz, Peter, Sankar, Lalitha

arXiv.org Machine Learning, Sep-27-2019

We present Generative Adversarial rePresentations (GAP) as a data-driven framework for learning censored and/or fair representations. GAP leverages recent advancements in adversarial learning to allow a data holder to learn universal representations that decouple a set of sensitive attributes from the rest of the dataset. Under GAP, finding the optimal mechanism (the decorrelating encoder, or decorrelator) is formulated as a constrained minimax game between a data encoder and an adversary. We show that for appropriately chosen adversarial loss functions, GAP provides censoring guarantees against strong information-theoretic adversaries and enforces demographic parity. We also evaluate the performance of GAP on multi-dimensional Gaussian mixture models and real datasets, and show how a designer can certify that representations learned under an adversary with a fixed architecture perform well against more complex adversaries.

adversary, dataset, representation, (15 more...)

arXiv.org Machine Learning

1910.00411

Country: Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.81)

Industry:

Law > Civil Rights & Constitutional Law (0.94)
Information Technology > Security & Privacy (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Data Science > Data Mining (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

A New Covariance Estimator for Sufficient Dimension Reduction in High-Dimensional and Undersized Sample Problems

Olorede, Kabir Opeyemi, Yahya, Waheed Babatunde

arXiv.org Machine Learning, Sep-27-2019

The application of standard sufficient dimension reduction methods for reducing the dimension space of predictors without losing regression information requires inverting the covariance matrix of the predictors. This poses a number of challenges, especially when analyzing high-dimensional data sets in which the number of predictors $p$ is much larger than the number of samples $n$ ($n \ll p$). This work proposes a new covariance estimator, called the Maximum Entropy Covariance (MEC), that addresses the loss of covariance information when similar covariance matrices are linearly combined using the Maximum Entropy (ME) principle. By benefitting naturally from slicing or discretizing the range of the response variable $y$ into $H$ non-overlapping categories, $h_{1}, \ldots, h_{H}$, MEC first combines the covariance matrices arising from samples in each $y$ slice $h \in H$ and then selects the one that maximizes entropy under the principle of maximum uncertainty. The MEC estimator is then formed from a convex mixture of this entropy-maximizing sample covariance estimate $S_{\mathrm{mec}}$ and the pooled sample covariance estimate $\mathbf{S}_{p}$ across the $H$ slices, without requiring time-consuming covariance optimization procedures. MEC deals directly with the singularity and instability of the sample group covariance estimate in both regression and classification problems. The efficiency of the MEC estimator is studied with existing sufficient dimension reduction methods such as Sliced Inverse Regression (SIR) and Sliced Average Variance Estimator (SAVE), as demonstrated on classification and regression problems using real-life leukemia cancer data and customers' electricity load profiles from smart meter data sets, respectively.

covariance matrix, estimator, predictor, (12 more...)

arXiv.org Machine Learning

1909.13017

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Europe > Austria > Vienna (0.14)
Africa > Nigeria > Kwara State > Ilorin (0.05)
(7 more...)

Genre: Research Report > Experimental Study (0.93)

Industry:

Health & Medicine > Therapeutic Area > Hematology (0.90)
Health & Medicine > Therapeutic Area > Oncology > Leukemia (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

Improved histogram-based anomaly detector with the extended principal component features

Aryal, Sunil, Baniya, Arbind Agrahari, Santosh, KC

arXiv.org Machine Learning, Sep-27-2019

In this era of big data, databases are growing rapidly in terms of the number of records. Fast automatic detection of anomalous records in these massive databases is a challenging task. Traditional distance-based anomaly detectors are not applicable to these massive datasets. Recently, a simple but extremely fast anomaly detector using one-dimensional histograms has been introduced. The anomaly score of a data instance is computed as the product of the probability masses of the histogram bins into which it falls in each dimension. It has been shown to produce competitive results compared to many state-of-the-art methods on many datasets. Because it assumes that data features are independent of each other, it yields poor detection accuracy when there is correlation between features. To address this issue, we propose to increase the feature size by adding more features based on principal components. Our results show that using the original input features together with principal components significantly improves the detection accuracy of the histogram-based anomaly detector without compromising much in terms of run-time.
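The scoring rule described in the abstract (a product of per-dimension histogram probability masses) can be sketched as below. This is an illustration under stated assumptions rather than the detector from the paper: equal-width bins, add-one smoothing, clipping of out-of-range values to the edge bins, and summing logarithms instead of multiplying are all choices made here, and the principal-component features the authors add are not included.

```python
import math


def fit_histograms(data, bins=5):
    """Build one equal-width histogram per feature.

    Returns a list of (minimum, bin_width, probability_masses) tuples.
    """
    n = len(data)
    hists = []
    for col in zip(*data):
        lo, hi = min(col), max(col)
        width = (hi - lo) / bins or 1.0  # guard against constant features
        counts = [0] * bins
        for x in col:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        # Add-one smoothing so that empty bins keep a non-zero mass.
        masses = [(c + 1) / (n + bins) for c in counts]
        hists.append((lo, width, masses))
    return hists


def log_score(point, hists):
    """Sum of log masses, i.e. the log of the product; lower is more anomalous."""
    total = 0.0
    for x, (lo, width, masses) in zip(point, hists):
        idx = min(max(int((x - lo) / width), 0), len(masses) - 1)
        total += math.log(masses[idx])
    return total


# Hypothetical data: five similar records and one far-away record.
data = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1), (1.0, 1.0), (5.0, 5.0)]
hists = fit_histograms(data)
inlier_score = log_score((1.0, 1.0), hists)
outlier_score = log_score((5.0, 5.0), hists)  # lower than inlier_score
```

Each feature is scored independently, which is the independence assumption the abstract identifies as the weakness; appending principal-component projections as extra columns before calling `fit_histograms` would be one way to follow the authors' idea.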

anomaly, dataset, spad, (13 more...)

arXiv.org Machine Learning

1909.12702

Country:

North America > United States > South Dakota > Clay County > Vermillion (0.14)
Asia (0.04)
Oceania > Australia > Victoria (0.04)
North America > United States > California > Orange County > Irvine (0.04)

Genre: Research Report > New Finding (0.86)

Industry:

Health & Medicine (0.69)
Information Technology > Security & Privacy (0.69)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Hidden Stratification Causes Clinically Meaningful Failures in Machine Learning for Medical Imaging

Oakden-Rayner, Luke, Dunnmon, Jared, Carneiro, Gustavo, Ré, Christopher

arXiv.org Machine Learning, Sep-26-2019

Machine learning models for medical image analysis often suffer from poor performance on important subsets of a population that are not identified during training or testing. For example, overall performance of a cancer detection model may be high, but the model still consistently misses a rare but aggressive cancer subtype. We refer to this problem as hidden stratification, and observe that it results from incompletely describing the meaningful variation in a dataset. While hidden stratification can substantially reduce the clinical efficacy of machine learning models, its effects remain difficult to measure. In this work, we assess the utility of several possible techniques for measuring and describing hidden stratification effects, and characterize these effects both on multiple medical imaging datasets and via synthetic experiments on the well-characterised CIFAR-100 benchmark dataset. We find evidence that hidden stratification can occur in unidentified imaging subsets with low prevalence, low label quality, subtle distinguishing features, or spurious correlates, and that it can result in relative performance differences of over 20% on clinically important subsets. Finally, we explore the clinical implications of our findings, and suggest that evaluation of hidden stratification should be a critical component of any machine learning deployment in medical imaging.
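The effect described in the abstract can be made concrete by comparing aggregate recall with recall on each subclass. The sketch below uses made-up counts and hypothetical subclass names; it is not one of the measurement techniques evaluated in the paper, only the basic per-subset breakdown that those techniques rely on.

```python
from collections import defaultdict


def recall_by_subclass(records):
    """Recall on positive cases, split by subclass.

    records: iterable of (subclass, true_label, predicted_label) with 0/1 labels.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for subclass, truth, pred in records:
        if truth == 1:
            totals[subclass] += 1
            hits[subclass] += int(pred == 1)
    return {s: hits[s] / totals[s] for s in totals}


# Made-up example: 90 common-subtype cancers (88 detected)
# and 10 rare-subtype cancers (4 detected).
records = (
    [("common", 1, 1)] * 88 + [("common", 1, 0)] * 2
    + [("rare", 1, 1)] * 4 + [("rare", 1, 0)] * 6
)
per_subclass = recall_by_subclass(records)
overall = sum(pred for _, _, pred in records) / len(records)
# overall recall is 0.92, while recall on the rare subtype is only 0.4
```

The aggregate figure hides the failure on the rare subtype, which is why such a breakdown requires subclass labels that are often missing from the dataset in the first place.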

dataset, stratification, subclass, (15 more...)

arXiv.org Machine Learning

1909.12475

Country:

Oceania > Australia > South Australia > Adelaide (0.04)
North America > United States > California > Santa Clara County > Stanford (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > Canada > Ontario > Toronto (0.04)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.88)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Health Care Technology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)

Crowdsourcing via Pairwise Co-occurrences: Identifiability and Algorithms

Ibrahim, Shahana, Fu, Xiao, Kargas, Nikos, Huang, Kejun

arXiv.org Machine Learning, Sep-26-2019

The data deluge comes with high demands for data labeling. Crowdsourcing (or, more generally, ensemble learning) techniques aim to produce accurate labels by integrating noisy, non-expert labeling from annotators. The classic Dawid-Skene estimator and its accompanying expectation maximization (EM) algorithm have been widely used, but their theoretical properties are not fully understood. Tensor methods were proposed to guarantee identification of the Dawid-Skene model, but the sample complexity is a hurdle for applying such approaches, since the tensor methods hinge on the availability of third-order statistics that are hard to reliably estimate given limited data. In this paper, we propose a framework using pairwise co-occurrences of the annotator responses, which naturally admits lower sample complexity. We show that the approach can identify the Dawid-Skene model under realistic conditions. We propose an algebraic algorithm reminiscent of convex geometry-based structured matrix factorization to solve the model identification problem efficiently, and an identifiability-enhanced algorithm for handling more challenging and critical scenarios. Experiments show that the proposed algorithms outperform the state-of-the-art algorithms under a variety of scenarios.

algorithm, annotator, confusion matrix, (15 more...)

arXiv.org Machine Learning

1909.12325

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)
North America > United States > Florida > Alachua County > Gainesville (0.14)
North America > United States > Oregon > Benton County > Corvallis (0.04)
(2 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Communications > Social Media > Crowdsourcing (0.86)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.67)
(2 more...)
