Results


A Survey on Data Pricing: from Economics to Data Science

arXiv.org Artificial Intelligence

How can we assess the value of data objectively, systematically and quantitatively? Pricing data, or information goods in general, has been studied and practiced in dispersed areas and principles, such as economics, marketing, electronic commerce, data management, data mining and machine learning. In this article, we present a unified, interdisciplinary and comprehensive overview of this important direction. We examine various motivations behind data pricing, understand the economics of data pricing and review the development and evolution of pricing models according to a series of fundamental principles. We discuss both digital products and data products. We also consider a series of challenges and directions for future work.


Applications of Differential Privacy to European Privacy Law (GDPR) and Machine Learning

#artificialintelligence

Differential privacy is a data anonymization technique that's used by major technology companies such as Apple and Google. The goal of differential privacy is simple: allow data analysts to build accurate models without sacrificing the privacy of the individual data points. But what does "sacrificing the privacy of the data points" mean? Well, let's think about an example. Suppose I have a dataset that contains information (age, gender, treatment, marriage status, other medical conditions, etc.) about every person who was treated for breast cancer at Hospital X.
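To make the goal concrete, here is a minimal sketch of one standard differentially private release, the Laplace mechanism applied to a counting query. The snippet above does not say which mechanism the article goes on to describe, so this is an illustration only. The record layout, the function names and the choice of epsilon are all hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """Release a noisy count of the records matching `predicate`.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy for this single query.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical hospital records; only the age field is used here.
rng = random.Random(0)
patients = [{"age": age} for age in range(20, 90)]
noisy = private_count(patients, lambda r: r["age"] >= 60, epsilon=0.5, rng=rng)
print("noisy count of patients aged 60+:", round(noisy, 2))
```

The analyst sees only the noisy count. A smaller epsilon means more noise and stronger privacy, at the cost of accuracy.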


Confidential computing: the final frontier of data security

#artificialintelligence

Data threats never rest, nor should the protection of your sensitive information. That's the driving principle behind confidential computing, which seeks to plug a potentially crippling hole in data security. Confidential computing provides a secure platform for multiple parties to combine, analyze and learn from sensitive data without exposing their data or machine learning algorithms to the other party. This technique goes by several names -- multiparty computing, federated learning and privacy-preserving analytics, among them -- and confidential computing can enable this type of collaboration while preserving privacy and regulatory compliance. Data exists in three states: in transit when it is moving through the network; at rest when stored; and in use as it's being processed.


Deconvoluting Kernel Density Estimation and Regression for Locally Differentially Private Data

arXiv.org Machine Learning

Local differential privacy has become the gold standard in the privacy literature for gathering or releasing sensitive individual data points in a privacy-preserving manner. However, locally differentially private data can distort the probability density of the data because of the additive noise used to ensure privacy. In fact, the density of privacy-preserving data (no matter how many samples we gather) is always flatter than the density function of the original data points, because it is convolved with the density function of the privacy-preserving noise. The effect is especially pronounced when using slow-decaying privacy-preserving noise, such as Laplace noise. This can result in under- or over-estimation of the heavy hitters. This is an important challenge facing social scientists due to the use of differential privacy in the 2020 Census in the United States. In this paper, we develop density estimation methods using smoothing kernels. We use the framework of deconvoluting kernel density estimators to remove the effect of privacy-preserving noise. This approach also allows us to adapt the results from nonparametric regression with errors-in-variables to develop regression models based on locally differentially private data. We demonstrate the performance of the developed methods on financial and demographic datasets.
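The flattening effect and its correction can be sketched with the textbook deconvoluting kernel for Laplace noise. This is not necessarily the estimator developed in the paper. With a Gaussian kernel K, bandwidth h and Laplace scale b, the corrected kernel is K*(u) = K(u) - (b/h)^2 K''(u), where K''(u) = (u^2 - 1) K(u). The sample size, bandwidth and noise scale below are arbitrary assumptions.

```python
import math
import random

def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def deconv_kernel(u, b, h):
    """Gaussian kernel corrected for additive Laplace(0, b) noise.

    K*(u) = K(u) - (b/h)^2 * K''(u), with K''(u) = (u^2 - 1) * K(u).
    """
    return gaussian_kernel(u) * (1.0 - (b / h) ** 2 * (u * u - 1.0))

def kde(x, sample, h, kernel):
    """Kernel density estimate at x with bandwidth h."""
    return sum(kernel((x - z) / h) for z in sample) / (len(sample) * h)

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

# Assumed setup: standard normal data, privatized with Laplace(0, 1) noise.
rng = random.Random(1)
b, h = 1.0, 0.5
noisy = [rng.gauss(0.0, 1.0) + laplace_noise(b, rng) for _ in range(20000)]

true_peak = 1.0 / math.sqrt(2.0 * math.pi)   # true density at x = 0
naive_peak = kde(0.0, noisy, h, gaussian_kernel)
corrected_peak = kde(0.0, noisy, h, lambda u: deconv_kernel(u, b, h))
print("true:", round(true_peak, 3))
print("naive KDE on noisy data:", round(naive_peak, 3))
print("deconvoluting KDE:", round(corrected_peak, 3))
```

The naive estimate at the mode sits well below the true density because the noisy data are flatter. The corrected kernel recovers most of the gap, with some bias left over from the bandwidth.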


6 Privacy Solutions for Big Data and Machine Learning

#artificialintelligence

Travelers who wander the banana pancake trail through Southeast Asia will all get roughly the same experience. They'll eat crummy food on one of fifty boats floating around Halong Bay, then head up to the highlands of Sapa for a faux cultural experience with hill tribes that grow dreadful cannabis. After that, it's on to Laos to float the river in Vang Vieng while smashed on opium tea. Eventually, you'll see someone wearing a t-shirt with the classic slogan – "same same, but different." The phrase originates with Southeast Asian vendors, who often respond to queries about the authenticity of the fake goods they're selling with "same same, but different." It's a phrase that appropriately describes how the technology world loves to spin things as fresh and new when they've hardly changed at all.


Coresets for Regressions with Panel Data

arXiv.org Machine Learning

This paper introduces the problem of coresets for regression problems to panel data settings. We first define coresets for several variants of regression problems with panel data and then present efficient algorithms to construct coresets whose size depends polynomially on $1/\varepsilon$ (where $\varepsilon$ is the error parameter) and the number of regression parameters, independent of the number of individuals in the panel data or the number of time units for which each individual is observed. Our approach is based on the Feldman-Langberg framework, in which a key step is to upper bound the "total sensitivity", which is roughly the sum of the maximum influences of all individual-time pairs taken over all possible choices of regression parameters. Empirically, we assess our approach with synthetic and real-world datasets; the coresets constructed using our approach are much smaller than the full dataset and indeed accelerate the computation of the regression objective.
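The sensitivity-sampling idea can be sketched for plain least squares with a single slope parameter. This simplification ignores the panel structure (individuals, time units and correlated errors) that the paper actually handles, so it is not the paper's algorithm. Each point's sensitivity is upper-bounded by its leverage score in the matrix with rows (x, y). Points are sampled in proportion to that bound and reweighted so the coreset cost is an unbiased estimate of the full cost. All function names and data are hypothetical.

```python
import random

def leverage_scores(rows):
    """Leverage of each row (x, y) in the n-by-2 matrix A = [x, y].

    score_i = a_i^T (A^T A)^{-1} a_i, which upper-bounds how large a share
    of the total squared error point i can have for any slope beta.
    """
    sxx = sum(x * x for x, _ in rows)
    sxy = sum(x * y for x, y in rows)
    syy = sum(y * y for _, y in rows)
    det = sxx * syy - sxy * sxy
    return [(syy * x * x - 2.0 * sxy * x * y + sxx * y * y) / det
            for x, y in rows]

def build_coreset(rows, m, rng):
    """Sample m points proportionally to leverage; weight each by 1/(m * p_i)."""
    scores = leverage_scores(rows)
    total = sum(scores)
    probs = [s / total for s in scores]
    picks = rng.choices(range(len(rows)), weights=probs, k=m)
    return [(rows[i], 1.0 / (m * probs[i])) for i in picks]

def cost(rows, beta):
    """Full least-squares objective for the model y = beta * x."""
    return sum((y - beta * x) ** 2 for x, y in rows)

def coreset_cost(coreset, beta):
    """Weighted objective evaluated on the coreset only."""
    return sum(w * (y - beta * x) ** 2 for (x, y), w in coreset)

# Synthetic data (an assumption): y = 3x + noise, 5000 points, coreset of 1000.
rng = random.Random(2)
xs = [rng.uniform(-5.0, 5.0) for _ in range(5000)]
data = [(x, 3.0 * x + rng.gauss(0.0, 1.0)) for x in xs]
small = build_coreset(data, 1000, rng)
for beta in (0.0, 3.0, 4.0):
    print(beta, round(cost(data, beta), 1), round(coreset_cost(small, beta), 1))
```

Because the sampling probabilities track each point's worst-case influence, the weighted cost should stay close to the full cost across different slopes, not just near the best fit.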


Does Palantir See Too Much?

#artificialintelligence

On a bright Tuesday afternoon in Paris last fall, Alex Karp was doing tai chi in the Luxembourg Gardens. He wore blue Nike sweatpants, a blue polo shirt, orange socks, charcoal-gray sneakers and white-framed sunglasses with red accents that inevitably drew attention to his most distinctive feature, a tangle of salt-and-pepper hair rising skyward from his head. Under a canopy of chestnut trees, Karp executed a series of elegant tai chi and qigong moves, shifting the pebbles and dirt gently under his feet as he twisted and turned. A group of teenagers watched in amusement. After 10 minutes or so, Karp walked to a nearby bench, where one of his bodyguards had placed a cooler and what looked like an instrument case. The cooler held several bottles of the nonalcoholic German beer that Karp drinks (he would crack one open on the way out of the park). The case contained a wooden sword, which he needed for the next part of his routine. "I brought a real sword the last time I was here, but the police stopped me," he said matter of factly as he began slashing the air with the sword. Those gendarmes evidently didn't know that Karp, far from being a public menace, was the chief executive of an American company whose software has been deployed on behalf of public safety in France. The company, Palantir Technologies, is named after the seeing stones in J.R.R. Tolkien's "The Lord of the Rings." Its two primary software programs, Gotham and Foundry, gather and process vast quantities of data in order to identify connections, patterns and trends that might elude human analysts. The stated goal of all this "data integration" is to help organizations make better decisions, and many of Palantir's customers consider its technology to be transformative. Karp claims a loftier ambition, however. "We built our company to support the West," he says. To that end, Palantir says it does not do business in countries that it considers adversarial to the U.S. and its allies, namely China and Russia. In the company's early days, Palantir employees, invoking Tolkien, described their mission as "saving the shire." The brainchild of Karp's friend and law-school classmate Peter Thiel, Palantir was founded in 2003. It was seeded in part by In-Q-Tel, the C.I.A.'s venture-capital arm, and the C.I.A. remains a client. Palantir's technology is rumored to have been used to track down Osama bin Laden -- a claim that has never been verified but one that has conferred an enduring mystique on the company. These days, Palantir is used for counterterrorism by a number of Western governments.


Meet modern compliance: Using AI and data to manage business risk better

#artificialintelligence

In June 2020, when the U.S. Department of Justice (DoJ) issued updated guidance on how to evaluate corporate compliance programs, it came with a clear mandate to companies: Compliance programs must use robust technology and data analytics to assess their own actions and those of any third parties they do business with, from the point of engagement onward. At the very least, companies are expected to be able to explain the rationale for using third parties, whether they have relationships with foreign officials, and any potential risks to their reputation. This is a compliance game-changer. Historically, organizations could argue that they simply did not have the information available to identify potential compliance dissonance across their networks: the "needle in a haystack" defense. Organizations are now expected to show that they are leveraging data and applying modern analytics to draw insights and navigate the risks across their entire business network.


Using Data and Respecting Users

Communications of the ACM

Transaction data is like a friendship tie: both parties must respect the relationship, and if one party exploits it, the relationship sours. As data becomes increasingly valuable, firms must take care not to exploit their users or they will sour their ties. Ethical uses of data cover a spectrum: at one end, using patient data in healthcare to cure patients is little cause for concern. At the other end, selling data to third parties who exploit users is serious cause for concern.[2] Between these two extremes lies a vast gray area where firms need better ways to frame data risks and rewards in order to make better legal and ethical choices.