
How to DP-fy Your Data: A Practical Guide to Generating Synthetic Data With Differential Privacy

Ponomareva, Natalia, Xu, Zheng, McMahan, H. Brendan, Kairouz, Peter, Rosenblatt, Lucas, Cohen-Addad, Vincent, Guzmán, Cristóbal, McKenna, Ryan, Andrew, Galen, Bie, Alex, Yu, Da, Kurakin, Alex, Zadimoghaddam, Morteza, Vassilvitskii, Sergei, Terzis, Andreas

arXiv.org Machine Learning

High-quality data is needed to unlock the full potential of AI for end users. However, finding new sources of such data is getting harder: most publicly available human-generated data will soon have been used. Additionally, publicly available data is often not representative of the users of a particular system -- for example, a research speech dataset of contractors interacting with an AI assistant will likely be more homogeneous, well articulated, and self-censored than the real-world commands that end users will issue. Unlocking high-quality data grounded in real user interactions is therefore of vital interest. However, the direct use of user data comes with significant privacy risks. Differential Privacy (DP) is a well-established framework for reasoning about and limiting information leakage, and is a gold standard for protecting user privacy. The focus of this work, \emph{Differentially Private Synthetic Data}, refers to synthetic data that preserves the overall trends of the source data while providing strong privacy guarantees to the individuals who contributed to the source dataset. DP synthetic data can unlock the value of datasets that have previously been inaccessible due to privacy concerns, and can replace the use of sensitive datasets that previously had only rudimentary protections like ad-hoc rule-based anonymization. In this paper we explore the full suite of techniques surrounding DP synthetic data, the types of privacy protections they offer, and the state of the art for various modalities (image, tabular, text, and decentralized). We outline all the components needed in a system that generates DP synthetic data, from sensitive data handling and preparation to usage tracking and empirical privacy testing. We hope that this work will result in increased adoption of DP synthetic data, spur additional research, and increase trust in DP synthetic data approaches.
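To make the DP guarantee concrete, here is a minimal sketch of the classic Laplace mechanism, one of the basic building blocks DP systems use to release statistics with bounded information leakage. The dataset, seed, and function names are illustrative assumptions, not the paper's method:

```python
import math
import random

def laplace_noise(scale, rng):
    # Sample Laplace(0, scale) via the inverse-CDF transform.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_mechanism(true_value, sensitivity, epsilon, seed=None):
    # Release a noisy statistic satisfying epsilon-DP: the noise scale
    # grows with the query's sensitivity and shrinks as epsilon grows.
    rng = random.Random(seed)
    return true_value + laplace_noise(sensitivity / epsilon, rng)

# Example: privately release a count query over a (toy) sensitive dataset.
ages = [34, 29, 41, 52, 38]
true_count = sum(1 for a in ages if a > 30)  # count queries have sensitivity 1
noisy_count = laplace_mechanism(true_count, sensitivity=1, epsilon=1.0, seed=42)
```

DP synthetic data generators compose many such noisy measurements (or DP model training steps) so that the released dataset as a whole satisfies the privacy budget.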


Prompt Engineering Guidance for Conceptual Agent-based Model Extraction using Large Language Models

Khatami, Siamak, Frantz, Christopher

arXiv.org Artificial Intelligence

This document contains detailed information about the prompts used in the experimental process discussed in the paper "Toward Automating Agent-based Model Generation: A Benchmark for Model Extraction using Question-Answering Techniques". The paper aims to utilize Question-Answering (QA) models to extract the information necessary to implement Agent-based Modeling (ABM) from conceptual models. It presents the extracted information in formats that can be read by both humans and computers (i.e., JavaScript Object Notation (JSON)), enabling manual use by humans and automatic code generation by Large Language Models (LLMs).
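To illustrate the kind of machine-readable output this pipeline targets, here is a toy sketch of an ABM description extracted into JSON. The schema (agents/environment/schedule keys) is an illustrative assumption, not the paper's exact format:

```python
import json

# A conceptual-model sentence that a QA model might process.
conceptual_text = (
    "Households decide each month whether to move based on the share of "
    "similar neighbors; the city grid has 50x50 cells."
)

# Hypothetical extraction result in a human- and machine-readable form.
extracted = {
    "agents": [
        {
            "name": "Household",
            "attributes": ["location", "similarity_threshold"],
            "behaviors": ["evaluate_neighborhood", "move_if_unhappy"],
        }
    ],
    "environment": {"type": "grid", "width": 50, "height": 50},
    "schedule": {"step": "monthly"},
}

# JSON output that an LLM (or a human) can turn into simulation code.
print(json.dumps(extracted, indent=2))
```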


HTML-LSTM: Information Extraction from HTML Tables in Web Pages using Tree-Structured LSTM

Kawamura, Kazuki, Yamamoto, Akihiro

arXiv.org Artificial Intelligence

In this paper, we propose a novel method for extracting information from HTML tables that have similar contents but different structures. We aim to integrate multiple HTML tables into a single table for retrieving information contained in various Web pages. The method is designed by extending the tree-structured LSTM, a neural network for tree-structured data, in order to extract both the linguistic and the structural information of HTML data. We evaluate the proposed method through experiments using real data published on the WWW.
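A prerequisite for any tree-structured model over HTML is turning the markup into an explicit tree. The following stdlib-only sketch shows that preprocessing step for a small table; the HTML-LSTM architecture itself is not reproduced here, and the node schema is an assumption for illustration:

```python
from html.parser import HTMLParser

class TableTreeBuilder(HTMLParser):
    """Build a nested dict tree from HTML, suitable for tree-structured models."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "root", "text": "", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "text": "", "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        # Attach text content to the current (innermost) node.
        self.stack[-1]["text"] += data.strip()

html_doc = ("<table><tr><th>Name</th><th>Age</th></tr>"
            "<tr><td>Ada</td><td>36</td></tr></table>")
builder = TableTreeBuilder()
builder.feed(html_doc)
table = builder.root["children"][0]
# Flatten cell texts row by row to check the recovered structure.
cells = [cell["text"] for row in table["children"] for cell in row["children"]]
print(cells)  # -> ['Name', 'Age', 'Ada', '36']
```

Each node keeps both its text (linguistic information) and its position in the tree (structural information), which is exactly the pairing the HTML-LSTM approach exploits.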


Question Answering as Programming for Solving Time-Sensitive Questions

Zhu, Xinyu, Yang, Cheng, Chen, Bei, Li, Siheng, Lou, Jian-Guang, Yang, Yujiu

arXiv.org Artificial Intelligence

Question answering plays a pivotal role in human daily life because it involves our acquisition of knowledge about the world. However, due to the dynamic and ever-changing nature of real-world facts, the answer can be completely different when the time constraint in the question changes. Recently, Large Language Models (LLMs) have shown remarkable intelligence in question answering, yet our experiments reveal that the aforementioned problems still pose a significant challenge to existing LLMs. This can be attributed to LLMs' inability to perform rigorous reasoning based on surface-level text semantics. To overcome this limitation, rather than requiring LLMs to answer the question directly, we propose a novel approach in which we reframe the $\textbf{Q}$uestion $\textbf{A}$nswering task $\textbf{a}$s $\textbf{P}$rogramming ($\textbf{QAaP}$). Concretely, by leveraging modern LLMs' superior capability in understanding both natural language and programming language, we harness LLMs to represent diversely expressed text as well-structured code and to select the best-matching answer from multiple candidates through programming. We evaluate our QAaP framework on several time-sensitive question answering datasets and achieve improvements of up to $14.5$% over strong baselines. Our code and data are available at https://github.com/TianHongZXY/qaap
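The core idea, representing textual facts as structured records and resolving the time constraint programmatically rather than by surface-level matching, can be sketched as follows. The candidate facts, field names, and selection function are illustrative assumptions, not the paper's implementation:

```python
from datetime import date

# Hypothetical candidates an LLM might extract from text for a question like
# "Who held the position in 2020?", each with a validity interval.
candidates = [
    {"answer": "Alice", "start": date(2015, 1, 1), "end": date(2018, 12, 31)},
    {"answer": "Bob",   "start": date(2019, 1, 1), "end": date(2023, 6, 30)},
]

def answer_at(question_date, facts):
    # Select the candidate whose validity interval covers the queried date;
    # this check is exact, unlike fuzzy text matching.
    for fact in facts:
        if fact["start"] <= question_date <= fact["end"]:
            return fact["answer"]
    return None

print(answer_at(date(2020, 5, 1), candidates))  # -> Bob
```

The point is that once the facts are code, the time comparison becomes a rigorous program check instead of a semantic guess by the model.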


GETTING OVER A.I 🤖. The cult of Artificial Intelligence…

#artificialintelligence

The cult of Artificial Intelligence seems to have taken over the world! From facial recognition sensors to chatbots, everything is bound to a certain algorithm. But when it comes to innovation in AI, a particular company seems to be the forerunner. OpenAI is a research organization that conducts research in the field of artificial intelligence (AI). It was founded in 2015 by a group of entrepreneurs, including Elon Musk and Sam Altman, with the goal of advancing the field of AI in a way that is safe and beneficial to humanity.


Which models are interpretable?

#artificialintelligence

Model explanation is an essential task in supervised machine learning. Explaining how a model represents information is crucial to understanding the dynamics that rule our data. Let's see some models that are easy to interpret. Data scientists have the role of extracting information from raw data. They aren't engineers, nor are they software developers.


La veille de la cybersécurité

#artificialintelligence

Technological breakthroughs have revolutionized the way individuals work and conduct business. For instance, people must develop skills that will enable them to find new jobs, because it is predicted that automation could replace up to a third of all jobs by 2030. Consider the following to demonstrate how crucial document AI will be in the future: did you know that 70% of enterprise documents are free-form text, such as written documents and emails? This highlights the need for software that can automatically extract information and decode text from all of your documents without human input. Machine learning is what has made such document AI possible.


How AI is transforming chat channels?

#artificialintelligence

AI is used in chat channels to assist with tasks such as customer service, order fulfillment, and product research. For example, customer service can use AI to answer customer questions, identify customer needs, and make recommendations. AI can also be used to monitor chat channels for problem keywords and phrases and automatically respond with appropriate solutions. Conversational AI is the process of using machine learning and deep neural networks to enable users to communicate with computer systems in natural language. The system extracts user intent from text or voice input and transforms the text into structured data.
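The "extract intent from text and transform it into structured data" step can be sketched with a toy rule-based classifier. Real conversational AI uses learned intent classifiers and slot fillers, so the patterns and intent names below are purely illustrative assumptions:

```python
import re

# Hypothetical intent patterns for a customer-service chat channel.
INTENT_PATTERNS = {
    "order_status": re.compile(r"\b(where|status|track)\b.*\border\b"),
    "product_info": re.compile(r"\b(tell me|info|details)\b.*\bproduct\b"),
}

def extract_intent(utterance):
    # Map a user utterance to a structured record: an intent plus any
    # extracted slots (here, just an order number like "#1234").
    text = utterance.lower()
    for intent, pattern in INTENT_PATTERNS.items():
        if pattern.search(text):
            order_id = re.search(r"#(\d+)", utterance)
            slots = {"order_id": order_id.group(1)} if order_id else {}
            return {"intent": intent, "slots": slots}
    return {"intent": "fallback", "slots": {}}

print(extract_intent("Where is my order #1234?"))
```

The structured output is what lets downstream systems (order lookup, recommendations, escalation) act on the conversation automatically.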


Is there any difference between data science and machine learning?

#artificialintelligence

Data science and machine learning are two wonderful and exciting disciplines and are a great part of our lives. People sometimes confuse them, but they are quite different things. Data science is, as the name suggests, the science of data: a set of techniques and tools that let the data scientist extract the information hidden in data. Such a mining process can be done using statistical tools or mathematical models.


Breaking Privacy in Federated Learning

#artificialintelligence

Federated learning is a new way of training a machine learning model using distributed data that is not centralized in a server. It works by training a generic (shared) model with a given user’s private…