Truth Inference at Scale: A Bayesian Model for Adjudicating Highly Redundant Crowd Annotations

arXiv.org Machine Learning

Crowd-sourcing is a cheap and popular means of creating training and evaluation datasets for machine learning. However, it poses the problem of `truth inference', as individual workers cannot be wholly trusted to provide reliable annotations. Research into models of annotation aggregation attempts to infer a latent `true' annotation, which has been shown to improve the utility of crowd-sourced data. However, existing techniques beat simple baselines only in low-redundancy settings, where the number of annotations per instance is low ($\le 3$), or in situations where workers are unreliable and produce low-quality annotations (e.g., through spamming, random, or adversarial behaviours). As we show, datasets produced by crowd-sourcing are often not of this type: the data is highly redundantly annotated ($\ge 5$ annotations per instance), and the vast majority of workers produce high-quality outputs. In these settings, the majority vote heuristic performs very well, and most truth inference models underperform this simple baseline. We propose a novel technique based on a Bayesian graphical model with conjugate priors and simple iterative expectation-maximisation inference. Our technique performs competitively with the state-of-the-art benchmark methods, and significance tests show that it is the only method that significantly outperforms the majority vote heuristic at the one-sided 0.025 level. Moreover, our technique is simple, is implemented in only 50 lines of code, and trains in seconds.
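To make the comparison in the abstract concrete, the sketch below implements the majority vote baseline next to a generic EM aggregator. The abstract does not specify the paper's model, so this is not the authors' method: it assumes a simple "one-coin" worker model (each worker has a single accuracy and errs uniformly over the other labels), a Beta prior on that accuracy, and a uniform prior over true labels. The names `majority_vote`, `em_one_coin`, `prior_correct` and `prior_wrong` are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter, defaultdict


def majority_vote(annotations):
    """annotations: iterable of (item, worker, label) -> {item: label}."""
    votes = defaultdict(Counter)
    for item, _worker, label in annotations:
        votes[item][label] += 1
    # Break ties by the smallest label so the result is deterministic.
    return {item: min(c, key=lambda lab: (-c[lab], lab))
            for item, c in votes.items()}


def em_one_coin(annotations, n_iter=20, prior_correct=2.0, prior_wrong=1.0):
    """Illustrative EM under an ASSUMED one-coin worker model.

    prior_correct / prior_wrong are Beta pseudo-counts (an assumption,
    not values taken from the paper).
    """
    annotations = list(annotations)
    labels = sorted({lab for _, _, lab in annotations})
    k = len(labels)
    if k < 2:  # nothing to infer with fewer than two distinct labels
        return majority_vote(annotations)

    # Initialise each item's label posterior from its vote proportions.
    votes = defaultdict(Counter)
    for item, _worker, lab in annotations:
        votes[item][lab] += 1
    post = {item: {lab: c[lab] / sum(c.values()) for lab in labels}
            for item, c in votes.items()}

    for _ in range(n_iter):
        # M-step: Beta-smoothed accuracy from expected correct counts.
        correct, total = defaultdict(float), defaultdict(float)
        for item, worker, lab in annotations:
            correct[worker] += post[item][lab]
            total[worker] += 1.0
        acc = {w: (correct[w] + prior_correct)
               / (total[w] + prior_correct + prior_wrong) for w in total}

        # E-step: label posterior per item, uniform prior over labels.
        logp = {item: dict.fromkeys(labels, 0.0) for item in votes}
        for item, worker, lab in annotations:
            a = acc[worker]
            for cand in labels:
                p = a if cand == lab else (1.0 - a) / (k - 1)
                logp[item][cand] += math.log(p)
        for item, lp in logp.items():
            top = max(lp.values())  # subtract the max for numerical stability
            w = {lab: math.exp(v - top) for lab, v in lp.items()}
            z = sum(w.values())
            post[item] = {lab: v / z for lab, v in w.items()}

    return {item: max(labels, key=lambda lab: p[lab])
            for item, p in post.items()}
```

Both functions take a list of `(item, worker, label)` triples and return one inferred label per item. On clean, highly redundant data the two usually agree, which is consistent with the abstract's point that majority vote is a hard baseline to beat in that regime.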


The Siri of the cell – tech podcast

The Guardian

How can scientists deal with the huge volume of new research published on a daily basis? How can computers go further than merely parsing scientific papers, and actually suggest hypotheses themselves? When will we see a computer as another member of the lab team, serving hundreds of scientists simultaneously from its huge data set of extant research? This is the work of John Bachman, a systems biology PhD from Harvard Medical School, and Ben Giori, a postdoctoral fellow at Harvard Medical School's systems pharmacology lab. They're part of Darpa's Big Mechanism project, which is developing technology to read research abstracts and papers to extract pieces of causal mechanisms, then to assemble these pieces into more complete causal models, and to produce explanations.


The Morning After: Wednesday, May 10th 2017

Engadget

How's it gone so far? Microsoft's big annual conference kicks off today, and we've sniffed out what you can expect. We also get the full reveal of Amazon's Echo-with-a-screen. It's not pretty, but it does sound pretty smart. What to expect at Microsoft's Build 2017 conference: While it's a mobile computing world, Microsoft has no shortage of projects we need to be updated on.


Apple says the iPhone doesn't listen to your conversations

Engadget

Last month, members of the House Energy and Commerce Committee fired off a letter to Apple following reports that phones and other devices, such as smart speakers, can listen in on conversations. Now, the tech giant has sent the Representatives its response: iPhones, it says, don't listen to people's conversations and don't share people's spoken words with third parties. In what could be interpreted as a dig at its staunchest competitors, Cupertino explains in the letter (courtesy of CNET) that the customer is not its product and that its business model "does not depend on collecting vast amounts of personally identifiable information to enrich targeted profiles marketed to advertisers." In the original letter the lawmakers sent, they specifically noted reports that third-party apps could access the data devices supposedly collect while listening for their "trigger words," such as "Hey, Siri," "OK Google" and "Hey, Alexa." During Facebook's congressional hearing back in April, Senator Gary Peters (D-MI) even asked Mark Zuckerberg whether the social network listens in on people through their phone mics in order to serve relevant ads.


Tech Advances Make It Easier to Assign Blame for Cyberattacks

WSJ.com: WSJD - Technology

"All you have to do is look at the attacks that have taken place recently--WannaCry, NotPetya and others--and see how quickly the industry and government is coming out and assigning responsibility to nation states such as North Korea, Russia and Iran," said Dmitri Alperovitch, chief technology officer at CrowdStrike Inc., a cybersecurity company that has investigated a number of state-sponsored hacks. The White House and other countries took roughly six months to blame North Korea and Russia for the WannaCry and NotPetya attacks, respectively, while it took about three years for U.S. authorities to indict a North Korean hacker for the 2014 attack against Sony . Forensic systems are gathering and analyzing vast amounts of data from digital databases and registries to glean clues about an attacker's infrastructure. These clues, which may include obfuscation techniques and domain names used for hacking, can add up to what amounts to a unique footprint, said Chris Bell, chief executive of Diskin Advanced Technologies, a startup that uses machine learning to attribute cyberattacks. Additionally, the increasing amount of data related to cyberattacks--including virus signatures, the time of day the attack took place, IP addresses and domain names--makes it easier for investigators to track organized hacking groups and draw conclusions about them.