AITopics | Marchant, Neil G.

Collaborating Authors

Marchant, Neil G.

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Adaptive Data Analysis for Growing Data

Marchant, Neil G., Rubinstein, Benjamin I. P.

arXiv.org Machine LearningMay-22-2024

Reuse of data in adaptive workflows poses challenges regarding overfitting and the statistical validity of results. Previous work has demonstrated that interacting with data via differentially private algorithms can mitigate overfitting, achieving worst-case generalization guarantees with asymptotically optimal data requirements. However, such past work assumes data is static and cannot accommodate situations where data grows over time. In this paper we address this gap, presenting the first generalization bounds for adaptive analysis in the dynamic data setting. We allow the analyst to adaptively schedule their queries conditioned on the current size of the data, in addition to previous queries and responses. We also incorporate time-varying empirical accuracy bounds and mechanisms, allowing for tighter guarantees as data accumulates. In a batched query setting, the asymptotic data requirements of our bound grows with the square-root of the number of adaptive queries, matching prior works' improvement over data splitting for the static setting. We instantiate our bound for statistical queries with the clipped Gaussian mechanism, where it empirically outperforms baselines composed from static bounds.

artificial intelligence, machine learning, query, (13 more...)

arXiv.org Machine Learning

2405.13375

Country: Oceania > Australia (0.14)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

RS-Del: Edit Distance Robustness Certificates for Sequence Classifiers via Randomized Deletion

Huang, Zhuoqun, Marchant, Neil G., Lucas, Keane, Bauer, Lujo, Ohrimenko, Olga, Rubinstein, Benjamin I. P.

arXiv.org Machine LearningOct-26-2023

Randomized smoothing is a leading approach for constructing classifiers that are certifiably robust against adversarial examples. Existing work on randomized smoothing has focused on classifiers with continuous inputs, such as images, where $\ell_p$-norm bounded adversaries are commonly studied. However, there has been limited work for classifiers with discrete or variable-size inputs, such as for source code, which require different threat models and smoothing mechanisms. In this work, we adapt randomized smoothing for discrete sequence classifiers to provide certified robustness against edit distance-bounded adversaries. Our proposed smoothing mechanism randomized deletion (RS-Del) applies random deletion edits, which are (perhaps surprisingly) sufficient to confer robustness against adversarial deletion, insertion and substitution edits. Our proof of certification deviates from the established Neyman-Pearson approach, which is intractable in our setting, and is instead organized around longest common subsequences. We present a case study on malware detection--a binary classification problem on byte sequences where classifier evasion is a well-established threat model. When applied to the popular MalConv malware detection model, our smoothing mechanism RS-Del achieves a certified accuracy of 91% at an edit distance radius of 128 bytes.

classifier, data mining, machine learning, (20 more...)

arXiv.org Machine Learning

2302.01757

Country: North America > United States > Wisconsin (0.14)

Genre: Research Report > New Finding (0.68)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Regional Government > North America Government > United States Government (0.45)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(4 more...)

Add feedback

A general framework for label-efficient online evaluation with asymptotic guarantees

Marchant, Neil G., Rubinstein, Benjamin I. P.

arXiv.org Machine LearningJun-12-2020

Achieving statistically significant evaluation with passive sampling of test data is challenging in settings such as extreme classification and record linkage, where significant class imbalance is prevalent. Adaptive importance sampling focuses labeling on informative regions of the instance space, however it breaks data independence assumptions - commonly required for asymptotic guarantees that assure estimates approximate population performance and provide practical confidence intervals. In this paper we develop an adaptive importance sampling framework for supervised evaluation that defines a sequence of proposal distributions given a user-defined discriminative model of p(y|x) and a generalized performance measure to evaluate. Under verifiable conditions on the model and performance measure, we establish strong consistency and a (martingale) central limit theorem for resulting performance estimates. We instantiate our framework with worked examples given stochastic or deterministic label oracle access. Both examples leverage Dirichlet-tree models for practical online evaluation, with the deterministic case achieving asymptotic optimality. Experiments on seven datasets demonstrate an average mean-squared error superior to state-of-the-art samplers on fixed label budgets.

artificial intelligence, natural language, proposal, (21 more...)

arXiv.org Machine Learning

2006.06963

Country:

Asia (0.67)
North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report (0.82)

Industry: Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.94)
Information Technology > Data Science > Data Mining (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)

Add feedback

d-blink: Distributed End-to-End Bayesian Entity Resolution

Marchant, Neil G., Steorts, Rebecca C., Kaplan, Andee, Rubinstein, Benjamin I. P., Elazar, Daniel N.

arXiv.org Machine LearningSep-13-2019

Entity resolution (ER) (record linkage or de-duplication) is the process of merging together noisy databases, often in the absence of a unique identifier. A major advancement in ER methodology has been the application of Bayesian generative models. Such models provide a natural framework for clustering records to unobserved (latent) entities, while providing exact uncertainty quantification and tight performance bounds. Despite these advancements, existing models do not scale to realistically-sized databases (larger than 1000 records) and they do not incorporate probabilistic blocking. In this paper, we propose "distributed Bayesian linkage" or d-blink -- the first scalable and distributed end-to-end Bayesian model for ER, which propagates uncertainty in blocking, matching and merging. We make several novel contributions, including: (i) incorporating probabilistic blocking directly into the model through auxiliary partitions; (ii) support for missing values; (iii) a partially-collapsed Gibbs sampler; and (iv) a novel perturbation sampling algorithm (leveraging the Vose-Alias method) that enables fast updates of the entity attributes. Finally, we conduct experiments on five data sets which show that d-blink can achieve significant efficiency gains -- in excess of 300$\times$ -- when compared to existing non-distributed methods.

artificial intelligence, bayesian inference, partition, (22 more...)

arXiv.org Machine Learning

1909.06039

Country:

North America > United States > New York (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Colorado (0.14)

Genre: Research Report (1.00)

Industry: Government > Regional Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.84)

Add feedback

In Search of an Entity Resolution OASIS: Optimal Asymptotic Sequential Importance Sampling

Marchant, Neil G., Rubinstein, Benjamin I. P.

arXiv.org Machine LearningJun-25-2017

Entity resolution (ER) presents unique challenges for evaluation methodology. While crowdsourcing platforms acquire ground truth, sound approaches to sampling must drive labelling efforts. In ER, extreme class imbalance between matching and non-matching records can lead to enormous labelling requirements when seeking statistically consistent estimates for rigorous evaluation. This paper addresses this important challenge with the OASIS algorithm: a sampler and F-measure estimator for ER evaluation. OASIS draws samples from a (biased) instrumental distribution, chosen to ensure estimators with optimal asymptotic variance. As new labels are collected OASIS updates this instrumental distribution via a Bayesian latent variable model of the annotator oracle, to quickly focus on unlabelled items providing more information. We prove that resulting estimates of F-measure, precision, recall converge to the true population values. Thorough comparisons of sampling methods on a variety of ER datasets demonstrate significant labelling reductions of up to 83% without loss to estimate accuracy.

crowdsourcing, instrumental distribution, social media, (24 more...)

arXiv.org Machine Learning

1703.00617

Country: Oceania > Australia (0.28)

Genre: Research Report (0.82)

Industry: Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Data Science > Data Mining (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
(4 more...)

Add feedback