AITopics

2111.15546

Country:

North America > United States > Washington > King County > Seattle (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (1.00)

Industry: Transportation > Air (0.83)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

#artificialintelligenceDec-20-2022, 04:50:46 GMT

Learning Machine Learning Part 1: Introduction and Revoke-Obfuscation

The more a feature decreases the impurity, the more important the feature is. In random forests, the impurity decrease from each feature can be averaged across trees to determine the final importance of the variable.

algorithm, dataset, logistic regression, (15 more...)

#artificialintelligence

Country: North America > United States (0.04)

Industry:

Leisure & Entertainment > Games (0.93)
Information Technology (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.97)

Chen, Mayee F., Nachman, Benjamin, Sala, Frederic

Resonant Anomaly Detection with Multiple Reference Datasets

An important class of techniques for resonant anomaly detection in high energy physics builds models that can distinguish between reference and target datasets, where only the latter has appreciable signal. Such techniques, including Classification Without Labels (CWoLa) and Simulation Assisted Likelihood-free Anomaly Detection (SALAD) rely on a single reference dataset. They cannot take advantage of commonly-available multiple datasets and thus cannot fully exploit available information. In this work, we propose generalizations of CWoLa and SALAD for settings where multiple reference datasets are available, building on weak supervision techniques. We demonstrate improved performance in a number of settings with realistic and synthetic data. As an added benefit, our generalizations enable us to provide finite-sample guarantees, improving on existing asymptotic analyses.

artificial intelligence, data mining, machine learning, (15 more...)

doi: 10.1007/JHEP07(2023)188

2212.10579

Country:

North America > United States > California > Alameda County > Berkeley (0.14)
North America > United States > Wisconsin > Dane County > Madison (0.14)
South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
(5 more...)

Genre: Research Report (0.64)

Industry:

Semiconductors & Electronics (0.46)
Information Technology (0.46)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)

Leiter, Christoph, Nguyen, Hoa, Eger, Steffen

BMX: Boosting Machine Translation Metrics with Explainability

State-of-the-art machine translation evaluation metrics are based on black-box language models. Hence, recent works consider their explainability with the goals of better understandability for humans and better metric analysis, including failure cases. In contrast, we explicitly leverage explanations to boost the metrics' performance. In particular, we perceive explanations as word-level scores, which we convert, via power means, into sentence-level scores. We combine this sentence-level score with the original metric to obtain a better metric. Our extensive evaluation and analysis across 5 datasets, 5 metrics and 4 explainability techniques shows that some configurations reliably improve the original metrics' correlation with human judgment. On two held datasets for testing, we obtain improvements in 15/18 resp. 4/4 cases. The gains in Pearson correlation are up to 0.032 resp. 0.055. We make our code available.

machine learning, metric, natural language, (19 more...)

2212.10469

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > Pennsylvania (0.04)
North America > Dominican Republic (0.04)
(8 more...)

Genre:

Research Report (1.00)
Overview (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.35)

Cho, Woo-Hyung, Henderson, Shane, Shmoys, David

Scheduling with Predictions

There is significant interest in deploying machine learning algorithms for diagnostic radiology, as modern learning techniques have made it possible to detect abnormalities in medical images within minutes. While machine-assisted diagnoses cannot yet reliably replace human reviews of images by a radiologist, they could inform prioritization rules for determining the order by which to review patient cases so that patients with time-sensitive conditions could benefit from early intervention. We study this scenario by formulating it as a learning-augmented online scheduling problem. We are given information about each arriving patient's urgency level in advance, but these predictions are inevitably error-prone. In this formulation, we face the challenges of decision making under imperfect information, and of responding dynamically to prediction error as we observe better data in real-time. We propose a simple online policy and show that this policy is in fact the best possible in certain stylized settings. We also demonstrate that our policy achieves the two desiderata of online algorithms with predictions: consistency (performance improvement with prediction accuracy) and robustness (protection against the worst case). We complement our theoretical findings with empirical evaluations of the policy under settings that more accurately reflect clinical scenarios in the real world.

artificial intelligence, machine learning, prediction, (17 more...)

2212.10433

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > New York (0.04)

Genre: Research Report (0.82)

Industry:

Health & Medicine > Nuclear Medicine (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.94)

Zhang, Qian, Makur, Anuran, Azizzadenesheli, Kamyar

Functional Linear Regression of Cumulative Distribution Functions

The estimation of cumulative distribution functions (CDFs) is an important learning task with a great variety of downstream applications, such as risk assessments in predictions and decision making. In this paper, we study functional regression of contextual CDFs where each data point is sampled from a linear combination of context dependent CDF basis functions. We propose functional ridge-regression-based estimation methods that estimate CDFs accurately everywhere. In particular, given $n$ samples with $d$ basis functions, we show estimation error upper bounds of $\widetilde{O}(\sqrt{d/n})$ for fixed design, random design, and adversarial context cases. We also derive matching information theoretic lower bounds, establishing minimax optimality for CDF functional regression. Furthermore, we remove the burn-in time in the random design setting using an alternative penalized estimator. Then, we consider agnostic settings where there is a mismatch in the data generation process. We characterize the error of the proposed estimators in terms of the mismatched error, and show that the estimators are well-behaved under model mismatch. Finally, to complete our study, we formalize infinite dimensional models where the parameter space is an infinite dimensional Hilbert space, and establish self-normalized estimation error upper bounds for this setting.

artificial intelligence, data mining, machine learning, (19 more...)

2205.14545

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (1.00)

Industry: Information Technology (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.50)
Information Technology > Data Science > Data Mining > Big Data (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.34)

Learning from Training Dynamics: Identifying Mislabeled Data Beyond Manually Designed Features

Jia, Qingrui, Li, Xuhong, Yu, Lei, Bian, Jiang, Zhao, Penghao, Li, Shupeng, Xiong, Haoyi, Dou, Dejing

While mislabeled or ambiguously-labeled samples in the training set could negatively affect the performance of deep models, diagnosing the dataset and identifying mislabeled samples helps to improve the generalization power. Training dynamics, i.e., the traces left by iterations of optimization algorithms, have recently been proved to be effective to localize mislabeled samples with hand-crafted features. In this paper, beyond manually designed features, we introduce a novel learning-based solution, leveraging a noise detector, instanced by an LSTM network, which learns to predict whether a sample was mislabeled using the raw training dynamics as input. Specifically, the proposed method trains the noise detector in a supervised manner using the dataset with synthesized label noises and can adapt to various datasets (either naturally or synthesized label-noised) without retraining. We conduct extensive experiments to evaluate the proposed method. We train the noise detector based on the synthesized label-noised CIFAR dataset and test such noise detector on Tiny ImageNet, CUB-200, Caltech-256, WebVision and Clothing1M. Results show that the proposed method precisely detects mislabeled samples on various datasets without further adaptation, and outperforms state-of-the-art methods. Besides, more experiments demonstrate that the mislabel identification can guide a label correction, namely data debugging, providing orthogonal improvements of algorithm-centric state-of-the-art techniques from the data aspect.

artificial intelligence, deep learning, machine learning, (16 more...)

2212.09321

Country:

Asia > China > Beijing > Beijing (0.04)
Asia > China > Zhejiang Province > Hangzhou (0.04)
North America > Canada > Ontario > Toronto (0.04)

Genre: Research Report > Promising Solution (1.00)

Industry:

Education (0.93)
Transportation (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Integrating Heterogeneous Domain Information into Relation Extraction: A Case Study on Drug-Drug Interaction Extraction

Asada, Masaki

The development of deep neural networks has improved representation learning in various domains, including textual, graph structural, and relational triple representations. This development opened the door to new relation extraction beyond the traditional text-oriented relation extraction. However, research on the effectiveness of considering multiple heterogeneous domain information simultaneously is still under exploration, and if a model can take an advantage of integrating heterogeneous information, it is expected to exhibit a significant contribution to many problems in the world. This thesis works on Drug-Drug Interactions (DDIs) from the literature as a case study and realizes relation extraction utilizing heterogeneous domain information. First, a deep neural relation extraction model is prepared and its attention mechanism is analyzed. Next, a method to combine the drug molecular structure information and drug description information to the input sentence information is proposed, and the effectiveness of utilizing drug molecular structures and drug descriptions for the relation extraction task is shown. Then, in order to further exploit the heterogeneous information, drug-related items, such as protein entries, medical terms and pathways are collected from multiple existing databases and a new data set in the form of a knowledge graph (KG) is constructed. A link prediction task on the constructed data set is conducted to obtain embedding representations of drugs that contain the heterogeneous domain information. Finally, a method that integrates the input sentence information and the heterogeneous KG information is proposed. The proposed model is trained and evaluated on a widely used data set, and as a result, it is shown that utilizing heterogeneous domain information significantly improves the performance of relation extraction from the literature.

information, machine learning, natural language, (20 more...)

2212.10714

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Italy > Tuscany > Florence (0.04)
North America > United States > Wisconsin > Dane County > Madison (0.04)
(14 more...)

Genre:

Research Report > New Finding (0.93)
Research Report > Experimental Study (0.92)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Kandpal, Nikhil, Wallace, Eric, Raffel, Colin

Deduplicating Training Data Mitigates Privacy Risks in Language Models

Past work has shown that large language models are susceptible to privacy attacks, where adversaries generate sequences from a trained model and detect which sequences are memorized from the training set. In this work, we show that the success of these attacks is largely due to duplication in commonly used web-scraped training sets. We first show that the rate at which language models regenerate training sequences is superlinearly related to a sequence's count in the training set. For instance, a sequence that is present 10 times in the training data is on average generated ~1000 times more often than a sequence that is present only once. We next show that existing methods for detecting memorized sequences have near-chance accuracy on non-duplicated training sequences. Finally, we find that after applying methods to deduplicate training data, language models are considerably more secure against these types of privacy attacks. Taken together, our results motivate an increased focus on deduplication in privacy-sensitive applications and a reevaluation of the practicality of existing privacy attacks.

large language model, machine learning, natural language, (14 more...)

2202.06539

Country: North America > United States > Maryland > Baltimore (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.70)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Galaxy Image Classification using Hierarchical Data Learning with Weighted Sampling and Label Smoothing

Ma, Xiaohua, Li, Xiangru, Luo, Ali, Zhang, Jinqu, Li, Hui

With the development of a series of Galaxy sky surveys in recent years, the observations increased rapidly, which makes the research of machine learning methods for galaxy image recognition a hot topic. Available automatic galaxy image recognition researches are plagued by the large differences in similarity between categories, the imbalance of data between different classes, and the discrepancy between the discrete representation of Galaxy classes and the essentially gradual changes from one morphological class to the adjacent class (DDRGC). These limitations have motivated several astronomers and machine learning experts to design projects with improved galaxy image recognition capabilities. Therefore, this paper proposes a novel learning method, ``Hierarchical Imbalanced data learning with Weighted sampling and Label smoothing" (HIWL). The HIWL consists of three key techniques respectively dealing with the above-mentioned three problems: (1) Designed a hierarchical galaxy classification model based on an efficient backbone network; (2) Utilized a weighted sampling scheme to deal with the imbalance problem; (3) Adopted a label smoothing technique to alleviate the DDRGC problem. We applied this method to galaxy photometric images from the Galaxy Zoo-The Galaxy Challenge, exploring the recognition of completely round smooth, in between smooth, cigar-shaped, edge-on and spiral. The overall classification accuracy is 96.32\%, and some superiorities of the HIWL are shown based on recall, precision, and F1-Score in comparing with some related works. In addition, we also explored the visualization of the galaxy image features and model attention to understand the foundations of the proposed scheme.

artificial intelligence, machine learning, pattern recognition, (17 more...)

doi: 10.1093/mnras/stac3770

2212.10081

Country:

Asia > China (0.46)
North America > United States (0.28)
Europe (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.74)