AITopics

2204.07657

Country:

South America > Uruguay > Artigas > Artigas (0.04)
Oceania > Australia (0.04)
North America > United States > Oregon (0.04)
(5 more...)

Genre: Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Government > Regional Government > North America Government > United States Government (0.94)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Awasthi, Pranjal, Jung, Christopher, Morgenstern, Jamie

Distributionally Robust Data Join

arXiv.org Artificial IntelligenceJun-14-2023

Suppose we are given two datasets: a labeled dataset and unlabeled dataset which also has additional auxiliary features not present in the first dataset. What is the most principled way to use these datasets together to construct a predictor? The answer should depend upon whether these datasets are generated by the same or different distributions over their mutual feature sets, and how similar the test distribution will be to either of those distributions. In many applications, the two datasets will likely follow different distributions, but both may be close to the test distribution. We introduce the problem of building a predictor which minimizes the maximum loss over all probability distributions over the original features, auxiliary features, and binary labels, whose Wasserstein distance is $r_1$ away from the empirical distribution over the labeled dataset and $r_2$ away from that of the unlabeled dataset. This can be thought of as a generalization of distributionally robust optimization (DRO), which allows for two data sources, one of which is unlabeled and may contain auxiliary features.

artificial intelligence, machine learning, optimization problem, (15 more...)

2202.05797

Country:

North America > Canada > Quebec > Montreal (0.04)
North America > United States > New York (0.04)
Europe > Sweden > Norrbotten County > Luleå (0.04)

Genre: Research Report > New Finding (0.68)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.45)

arXiv.org Artificial IntelligenceJun-14-2023

Investigating Membership Inference Attacks under Data Dependencies

Humphries, Thomas, Oya, Simon, Tulloch, Lindsey, Rafuse, Matthew, Goldberg, Ian, Hengartner, Urs, Kerschbaum, Florian

Training machine learning models on privacy-sensitive data has become a popular practice, driving innovation in ever-expanding fields. This has opened the door to new attacks that can have serious privacy implications. One such attack, the Membership Inference Attack (MIA), exposes whether or not a particular data point was used to train a model. A growing body of literature uses Differentially Private (DP) training algorithms as a defence against such attacks. However, these works evaluate the defence under the restrictive assumption that all members of the training set, as well as non-members, are independent and identically distributed. This assumption does not hold for many real-world use cases in the literature. Motivated by this, we evaluate membership inference with statistical dependencies among samples and explain why DP does not provide meaningful protection (the privacy parameter $\epsilon$ scales with the training set size $n$) in this more general case. We conduct a series of empirical evaluations with off-the-shelf MIAs using training sets built from real-world data showing different types of dependencies among samples. Our results reveal that training set dependencies can severely increase the performance of MIAs, and therefore assuming that data samples are statistically independent can significantly underestimate the performance of MIAs.

artificial intelligence, experiment, machine learning, (17 more...)

2010.12112

Country:

North America > Canada > Ontario > Waterloo Region > Waterloo (0.14)
North America > United States > Texas (0.04)
Europe > Switzerland (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)
Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Automatic and Accurate Classification of Hotel Bathrooms from Images with Deep Learning

Temiz, Hakan

Hotel bathrooms are one of the most important places in terms of customer satisfaction, and where the most complaints are reported. To share their experiences, guests rate hotels, comment, and share images of their positive or negative ratings. An important part of the room images shared by guests is related to bathrooms. Guests tend to prove their satisfaction or dissatisfaction with the bathrooms with images in their comments. These Positive or negative comments and visuals potentially affect the prospective guests. In this study, two different versions of a deep learning algorithm were designed to classify hotel bathrooms as satisfactory (good) or unsatisfactory (bad, when any defects such as dirtiness, deficiencies, malfunctions were present) by analyzing images. The best-performer between the two models was determined as a result of a series of extensive experimental studies. The models were trained for each of 144 combinations of 5 hyper-parameter sets with a data set containing more than 11 thousand bathroom images, specially created for this study. The "HotelBath" data set was shared also with the community with this study. Four different image sizes were taken into consideration: 128, 256, 512 and 1024 pixels in both directions. The classification performances of the models were measured with several metrics. Both algorithms showed very attractive performances even with many combinations of hyper-parameters. They can classify bathroom images with very high accuracy. Suh that the top algorithm achieved an accuracy of 92.4% and an AUC (area under the curve) score of 0.967. In addition, other metrics also proved the success...

bathroom, classification performance, image size, (14 more...)

doi: 10.29137/umagd.1217004

2306.07727

Country:

Asia > Middle East > Republic of Türkiye > Batman Province > Batman (0.05)
Asia > Middle East > Republic of Türkiye > Artvin Province > Artvin (0.04)
Oceania > New Zealand (0.04)
(6 more...)

Genre: Research Report > New Finding (1.00)

Industry: Consumer Products & Services > Hotels (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Quidwai, Mujahid Ali, Li, Chunhui, Dube, Parijat

Beyond Black Box AI-Generated Plagiarism Detection: From Sentence to Document Level

The increasing reliance on large language models (LLMs) in academic writing has led to a rise in plagiarism. Existing AI-generated text classifiers have limited accuracy and often produce false positives. We propose a novel approach using natural language processing (NLP) techniques, offering quantifiable metrics at both sentence and document levels for easier interpretation by human evaluators. Our method employs a multi-faceted approach, generating multiple paraphrased versions of a given question and inputting them into the LLM to generate answers. By using a contrastive loss function based on cosine similarity, we match generated sentences with those from the student's response. Our approach achieves up to 94% accuracy in classifying human and AI text, providing a robust and adaptable solution for plagiarism detection in academic settings. This method improves with LLM advancements, reducing the need for new model training or reconfiguration, and offers a more transparent way of evaluating and detecting AI-generated text.

large language model, machine learning, natural language, (20 more...)

2306.08122

Country:

Africa > Chad > Tibesti > Bardai (0.05)
North America > United States > New York (0.04)

Genre:

Research Report > Promising Solution (0.34)
Overview > Innovation (0.34)

Industry:

Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.61)
Transportation > Air (0.41)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.56)

Probing Out-of-Distribution Robustness of Language Models with Parameter-Efficient Transfer Learning

Cho, Hyunsoo, Park, Choonghyun, Kim, Junyeop, Kim, Hyuhng Joon, Yoo, Kang Min, Lee, Sang-goo

As the size of the pre-trained language model (PLM) continues to increase, numerous parameter-efficient transfer learning methods have been proposed recently to compensate for the tremendous cost of fine-tuning. Despite the impressive results achieved by large pre-trained language models (PLMs) and various parameter-efficient transfer learning (PETL) methods on sundry benchmarks, it remains unclear if they can handle inputs that have been distributionally shifted effectively. In this study, we systematically explore how the ability to detect out-of-distribution (OOD) changes as the size of the PLM grows or the transfer methods are altered. Specifically, we evaluated various PETL techniques, including fine-tuning, Adapter, LoRA, and prefix-tuning, on three different intention classification tasks, each utilizing various language models with different scales.

large language model, machine learning, natural language, (20 more...)

2301.1166

Country:

Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
Asia > South Korea > Seoul > Seoul (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.82)
(2 more...)

Veselovsky, Veniamin, Ribeiro, Manoel Horta, West, Robert

Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks

Large language models (LLMs) are remarkable data annotators. They can be used to generate high-fidelity supervised training data, as well as survey and experimental data. With the widespread adoption of LLMs, human gold--standard annotations are key to understanding the capabilities of LLMs and the validity of their results. However, crowdsourcing, an important, inexpensive way to obtain human annotations, may itself be impacted by LLMs, as crowd workers have financial incentives to use LLMs to increase their productivity and income. To investigate this concern, we conducted a case study on the prevalence of LLM usage by crowd workers. We reran an abstract summarization task from the literature on Amazon Mechanical Turk and, through a combination of keystroke detection and synthetic text classification, estimate that 33-46% of crowd workers used LLMs when completing the task. Although generalization to other, less LLM-friendly tasks is unclear, our results call for platforms, researchers, and crowd workers to find new ways to ensure that human data remain human, perhaps using the methodology proposed here as a stepping stone. Code/data: https://github.com/epfl-dlab/GPTurk

large language model, machine learning, natural language, (18 more...)

2306.07899

Country:

North America > United States > California (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)

Genre: Research Report > New Finding (0.48)

Industry: Health & Medicine > Therapeutic Area (0.95)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.47)

Differential Privacy with Random Projections and Sign Random Projections

Li, Ping, Li, Xiaoyun

In this paper, we develop a series of differential privacy (DP) algorithms from a family of random projections (RP) for general applications in machine learning, data mining, and information retrieval. Among the presented algorithms, iDP-SignRP is remarkably effective under the setting of ``individual differential privacy'' (iDP), based on sign random projections (SignRP). Also, DP-SignOPORP considerably improves existing algorithms in the literature under the standard DP setting, using ``one permutation + one random projection'' (OPORP), where OPORP is a variant of the celebrated count-sketch method with fixed-length binning and normalization. Without taking signs, among the DP-RP family, DP-OPORP achieves the best performance. Our key idea for improving DP-RP is to take only the signs, i.e., $sign(x_j) = sign\left(\sum_{i=1}^p u_i w_{ij}\right)$, of the projected data. The intuition is that the signs often remain unchanged when the original data ($u$) exhibit small changes (according to the ``neighbor'' definition in DP). In other words, the aggregation and quantization operations themselves provide good privacy protections. We develop a technique called ``smooth flipping probability'' that incorporates this intuitive privacy benefit of SignRPs and improves the standard DP bit flipping strategy. Based on this technique, we propose DP-SignOPORP which satisfies strict DP and outperforms other DP variants based on SignRP (and RP), especially when $\epsilon$ is not very large (e.g., $\epsilon = 5\sim10$). Moreover, if an application scenario accepts individual DP, then we immediately obtain an algorithm named iDP-SignRP which achieves excellent utilities even at small~$\epsilon$ (e.g., $\epsilon<0.5$).

data mining, machine learning, natural language, (19 more...)

2306.01751

Country:

Europe > Austria > Vienna (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
North America > Canada > Ontario > Toronto (0.14)
(36 more...)

Genre: Research Report > New Finding (0.93)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

Lin, Zong-Zhi, Pike, Thomas D., Bailey, Mark M., Bastian, Nathaniel D.

A Hypergraph-Based Machine Learning Ensemble Network Intrusion Detection System

Network intrusion detection systems (NIDS) to detect malicious attacks continue to meet challenges. NIDS are often developed offline while they face auto-generated port scan infiltration attempts, resulting in a significant time lag from adversarial adaption to NIDS response. To address these challenges, we use hypergraphs focused on internet protocol addresses and destination ports to capture evolving patterns of port scan attacks. The derived set of hypergraph-based metrics are then used to train an ensemble machine learning (ML) based NIDS that allows for real-time adaption in monitoring and detecting port scanning activities, other types of attacks, and adversarial intrusions at high accuracy, precision and recall performances. This ML adapting NIDS was developed through the combination of (1) intrusion examples, (2) NIDS update rules, (3) attack threshold choices to trigger NIDS retraining requests, and (4) a production environment with no prior knowledge of the nature of network traffic. 40 scenarios were auto-generated to evaluate the ML ensemble NIDS comprising three tree-based models. The resulting ML Ensemble NIDS was extended and evaluated with the CIC-IDS2017 dataset. Results show that under the model settings of an Update-ALL-NIDS rule (specifically retrain and update all the three models upon the same NIDS retraining request) the proposed ML ensemble NIDS evolved intelligently and produced the best results with nearly 100% detection performance throughout the simulation.

artificial intelligence, machine learning, nid, (14 more...)

2211.03933

Country:

North America > United States > Maryland > Montgomery County > Bethesda (0.04)
North America > United States > Illinois > Cook County > Evanston (0.04)
North America > United States > Hawaii (0.04)
Europe > Portugal (0.04)

Genre: Research Report > New Finding (0.48)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military > Cyberwarfare (0.48)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.66)

Wu, Caesar, Lib, Yuan-Fang, Bouvry, Pascal

Survey of Trustworthy AI: A Meta Decision of AI

arXiv.org Artificial IntelligenceJun-12-2023

When making strategic decisions, we are often confronted with overwhelming information to process. The situation can be further complicated when some pieces of evidence are contradicted each other or paradoxical. The challenge then becomes how to determine which information is useful and which ones should be eliminated. This process is known as meta-decision. Likewise, when it comes to using Artificial Intelligence (AI) systems for strategic decision-making, placing trust in the AI itself becomes a meta-decision, given that many AI systems are viewed as opaque "black boxes" that process large amounts of data. Trusting an opaque system involves deciding on the level of Trustworthy AI (TAI). We propose a new approach to address this issue by introducing a novel taxonomy or framework of TAI, which encompasses three crucial domains: articulate, authentic, and basic for different levels of trust. To underpin these domains, we create ten dimensions to measure trust: explainability/transparency, fairness/diversity, generalizability, privacy, data governance, safety/robustness, accountability, reproducibility, reliability, and sustainability. We aim to use this taxonomy to conduct a comprehensive survey and explore different TAI approaches from a strategic decision-making perspective.

data mining, machine learning, natural language, (21 more...)

2306.0038

Country:

Europe > Switzerland (0.14)
North America > United States > Massachusetts (0.14)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)

Genre:

Overview (1.00)
Research Report > New Finding (0.92)
Research Report > Experimental Study (0.67)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area (1.00)
(4 more...)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(5 more...)