 label data


Learning from Concealed Labels

arXiv.org Artificial Intelligence

Annotating data for sensitive labels (e.g., disease, smoking) poses a potential threat to individual privacy in many real-world scenarios. To cope with this problem, we propose a novel setting to protect the privacy of each instance, namely learning from concealed labels for multi-class classification. Concealed labels prevent sensitive labels from appearing in the label set during the label collection stage, which specifies none and some randomly sampled insensitive labels as the concealed label set to annotate sensitive data. In this paper, an unbiased estimator can be established from concealed data under mild assumptions, and the learned multi-class classifier can not only classify instances from insensitive labels accurately but also recognize instances from the sensitive labels. Moreover, we bound the estimation error and show that the multi-class classifier achieves the optimal parametric convergence rate. Experiments demonstrate the significance and effectiveness of the proposed method for concealed labels on synthetic and real-world datasets.


Scale AI launches rapid data-labeling service

#artificialintelligence

Amid the boom of AI in application building, companies face a significant data-labeling problem, especially when it comes to labeling images or other media content they want to train deep learning algorithms on. Today data-labeling and infrastructure provider Scale AI launched a service called Scale Rapid that aims to solve this problem by labeling a data sample within one to three hours. Users can review the work to make sure the labeling is being done correctly, iterate upon their labeling instructions if necessary, and then ramp up to have Scale AI label the rest of their dataset. This is the latest in a series of products Scale AI has launched in the last year as it seeks to maintain its leadership in the labeling sphere. In April, the company raised $325 million, bringing its total raised to over $602 million.


Learning with Different Amounts of Annotation: From Zero to Many Labels

arXiv.org Artificial Intelligence

Training NLP systems typically assumes access to annotated data that has a single human label per example. Given imperfect labeling from annotators and the inherent ambiguity of language, we hypothesize that a single label is not sufficient to learn the spectrum of language interpretation. We explore new annotation distribution schemes, assigning multiple labels per example for a small subset of training examples. Introducing such multi-label examples at the cost of annotating fewer examples brings clear gains on natural language inference and entity typing tasks, even when we simply first train with single-label data and then fine-tune with multi-label examples. Extending a MixUp data augmentation framework, we propose a learning algorithm that can learn from training examples with different amounts of annotation (with zero, one, or multiple labels). This algorithm efficiently combines signals from uneven training data and brings additional gains in low annotation budget and cross-domain settings. Together, our method achieves consistent gains in two tasks, suggesting that distributing labels unevenly among training examples can be beneficial for many NLP tasks.
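
As a rough illustration of the idea in this abstract (not the authors' algorithm), the sketch below turns zero, one, or several annotator labels into a soft label distribution and then mixes two examples MixUp-style. The function names, the uniform fallback for unlabeled examples, and the fixed mixing weight are assumptions made purely for illustration.

```python
from collections import Counter


def soft_label(labels, num_classes):
    """Turn a list of annotator labels (possibly empty) into a distribution."""
    if not labels:
        # Assumption: an example with no labels gets a uniform distribution.
        return [1.0 / num_classes] * num_classes
    counts = Counter(labels)
    return [counts.get(c, 0) / len(labels) for c in range(num_classes)]


def mixup(x1, y1, x2, y2, lam):
    """Linearly interpolate two feature vectors and their label distributions."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y


# Hypothetical toy data: three classes, two-dimensional features.
single = soft_label([0], 3)            # one annotator
multi = soft_label([1, 1, 2, 1], 3)    # four annotators who disagree
unlabeled = soft_label([], 3)          # no annotation at all

# MixUp usually samples lam from a Beta distribution; it is fixed here
# so the result is deterministic.
mixed_x, mixed_y = mixup([1.0, 0.0], single, [0.0, 1.0], multi, lam=0.5)
```

The mixed target remains a valid probability distribution, so it could be used with a cross-entropy loss against soft targets.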


6 Reasons to Spend More Time Thinking About Labels

#artificialintelligence

Quite a few of the issues should be addressed as part of established machine learning operations. Some issues may be resolved through support functions such as legal, people, general data management and smart procedure design -- more on that in a later post. For now, let's focus on the all-important labels, as opposed to the features.


The ultimate guide to data labeling: How to label data for ML

#artificialintelligence

Artificial Intelligence (AI) is driving the future, and you should be ready for it to have a competitive advantage. Machine learning (ML) is a subset of AI that provides software applications with the ability to detect patterns and make accurate predictions. ML gave us self-driving cars, email spam filtering, traffic detection, and more. To train the highest-quality ML models, you need to feed their algorithms with accurately labeled data. This blog post covers everything you need to know about data labeling to make informed decisions for your business.


Aggregate Learning for Mixed Frequency Data

arXiv.org Machine Learning

Large and acute economic shocks such as the 2007-2009 financial crisis and the current COVID-19 infections rapidly change the economic environment. In such a situation, the importance of real-time economic analysis using alternative data is emerging. Alternative data such as search query and location data are closer to real-time and richer than official statistics, which are typically released once a month in an aggregated form. We take advantage of the spatio-temporal granularity of alternative data and propose a Mixed-Frequency Aggregate Learning (MF-AGL) model that predicts economic indicators for smaller areas in real time. We apply the model to a real-world problem: prediction of the number of job applicants, which is closely related to the unemployment rate. We find that the proposed model predicts (i) the regional heterogeneity of the labor market condition and (ii) the rapidly changing economic status. The model can be applied to various tasks, especially economic analysis.


How Synthetic Data Sets Can Improve Computer Vision Models

#artificialintelligence

In recent years, deep learning models have produced a substantial number of advances in various areas, including computer vision. Computer vision typically works by analysing images that have been captured using a physical camera sensor, followed by a human-in-the-loop process that requires annotators to label things of interest. It's important to note that the more sophisticated the annotation is, the more laborious labelling can be. But it provides for a much richer analysis of the image itself. For example, for spotting a tiny detail within an image, a simple bounding box around the object might suffice. But once you start looking to get a robot to grasp something, you might need a segmentation mask to flesh out the fine contours of the object.


Methods of Data Labeling in Machine Learning

#artificialintelligence

Accruing a large amount of data is relatively simple. Data can be scraped, created or copied and then stored in large data stores. A key driver in developing an intelligent model, however, is not just a sheer mass of data but also an effective strategy for labeling it intelligently, adding structure and sense to the data. Data labeling can, therefore, be described as a way to organize information depending on its content. This content determines the tag or label to be assigned to a specific piece of information after it has been processed.


How to Label Data -- Create ML for Object Detection

#artificialintelligence

The new Create ML app, just announced at WWDC 2019, is an incredibly easy way to train your own personalized machine learning models. All that's required is dragging a folder containing your training data into the tool, and Create ML does the rest of the heavy lifting. So how do we prepare our data? When doing image or sound classification we just need to organize the data into folders, but if we want to do object detection the task becomes a bit more complicated. With object detection, we need to specify some additional information.
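
To give a sense of what that additional information can look like, here is a minimal sketch that builds an annotations file from corner-style bounding boxes. It assumes the commonly documented Create ML object detection layout: a JSON array with one entry per image, where each annotation has a `label` and `coordinates` whose `x` and `y` are the box center in pixels. The helper name and the sample image are hypothetical; check Apple's documentation before relying on the exact schema.

```python
import json


def to_createml_entry(image_name, boxes):
    """Build one annotation entry.

    boxes: list of (label, x_min, y_min, x_max, y_max) tuples in pixels.
    """
    annotations = []
    for label, x_min, y_min, x_max, y_max in boxes:
        width = x_max - x_min
        height = y_max - y_min
        annotations.append({
            "label": label,
            "coordinates": {
                # Assumption: x and y are the center of the box, not a corner.
                "x": x_min + width / 2,
                "y": y_min + height / 2,
                "width": width,
                "height": height,
            },
        })
    return {"image": image_name, "annotations": annotations}


# Hypothetical example: one image containing one labeled object.
entries = [to_createml_entry("dog.jpg", [("dog", 10, 20, 110, 220)])]
annotations_json = json.dumps(entries, indent=2)
```

The resulting string would typically be saved as a JSON file in the same folder as the training images.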


Amazon SageMaker Ground Truth AWS

#artificialintelligence

Amazon SageMaker Ground Truth helps you build highly accurate training datasets for machine learning quickly. SageMaker Ground Truth offers easy access to public and private human labelers and provides them with built-in workflows and interfaces for common labeling tasks. Additionally, SageMaker Ground Truth can lower your labeling costs by up to 70% using automatic labeling, which works by training Ground Truth from data labeled by humans so that the service learns to label data independently. Successful machine learning models are built on the shoulders of large volumes of high-quality training data. But the process of creating the training data necessary to build these models is often expensive, complicated, and time-consuming.