Hidalgo, César
Measuring and Controlling Divisiveness in Rank Aggregation
Colley, Rachael, Grandi, Umberto, Hidalgo, César, Macedo, Mariana, Navarrete, Carlos
Rank aggregation is the problem of ordering a set of issues according to a set of individual rankings given as input. This problem has been studied extensively in computational social choice (see, e.g., Brandt et al. 2016) when the rankings are assumed to represent human preferences over, for example, candidates in a political election, projects to be funded, or more generally alternative proposals. The most common approach in this literature is to identify normative desiderata for the aggregation process, including computational requirements such as the existence of tractable algorithms for its calculation, and to characterise the aggregators that satisfy them. Rank aggregation also has a wide spectrum of applications, from metasearch engines [Dwork et al., 2001] to bioinformatics
Sherlock: A Deep Learning Approach to Semantic Data Type Detection
Hulsebos, Madelon, Hu, Kevin, Bakker, Michiel, Zgraggen, Emanuel, Satyanarayan, Arvind, Kraska, Tim, Demiralp, Çağatay, Hidalgo, César
Correctly detecting the semantic type of data columns is crucial for data science tasks such as automated data cleaning, schema matching, and data discovery. Existing data preparation and analysis systems rely on dictionary lookups and regular expression matching to detect semantic types. However, these matching-based approaches are often not robust to dirty data and detect only a limited number of types. We introduce Sherlock, a multi-input deep neural network for detecting semantic types. We train Sherlock on $686{,}765$ data columns retrieved from the VizNet corpus by matching $78$ semantic types from DBpedia to column headers. We characterize each matched column with $1{,}588$ features describing the statistical properties, character distributions, word embeddings, and paragraph vectors of column values. Sherlock achieves a support-weighted F$_1$ score of $0.89$, exceeding that of machine learning baselines, dictionary and regular expression benchmarks, and the consensus of crowdsourced annotations.
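The support-weighted F$_1$ score reported above averages the per-type F$_1$ scores, weighting each semantic type by its support (the number of columns that truly have that type). The following is a minimal sketch of that metric under the standard definition, not the authors' evaluation code; the function name `support_weighted_f1` and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def support_weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its number of true instances."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one labelled example is required")

    support = Counter(y_true)      # true instances per class
    predicted = Counter(y_pred)    # predictions per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    weighted_sum = 0.0
    for label, n in support.items():
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weighted_sum += n * f1     # weight the class F1 by its support

    return weighted_sum / len(y_true)


# Toy example with hypothetical semantic-type labels for four columns.
y_true = ["city", "city", "city", "name"]
y_pred = ["city", "city", "name", "name"]
score = support_weighted_f1(y_true, y_pred)
print(round(score, 4))
```

Because the weights are proportional to support, frequent types dominate the score; an unweighted (macro) average would instead treat rare and common types equally.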