Statistical Learning
Learning to Identify Locally Actionable Health Anomalies
Chen, Kuang (University of California, Berkeley) | Brunskill, Emma (University of California, Berkeley) | Dick, Jonathan (University of Chicago) | Dhadialla, Prabhjot (Columbia University)
Local information access (LIA) programs tap into existing public health data flows, and present data in simple and useful ways to ground staff. LIAs hold great potential for improving rural health systems in developing regions; benefits include more evidence-based decision making and optimizations at a local scale, as well as improved service delivery and data quality. Our fledgling LIA program in rural Uganda currently provides clinicians with a small set of static data visualizations for discussion. To increase the program’s effectiveness, we want to automatically identify relevant data visualizations. We propose an adaptive tool that learns from local clinicians’ decision-making processes to predict and generate visualizations that show actionable anomalies.
Who’s Calling? Demographics of Mobile Phone Use in Rwanda
Blumenstock, Joshua Evan (University of California, Berkeley) | Gillick, Dan (University of California, Berkeley) | Eagle, Nathan (Santa Fe Institute)
Despite the increasing ubiquity of mobile phones in the developing world, remarkably little is known about the structure and demographics of the mobile phone market. While a few qualitative studies have detailed social norms of phone use in specific communities (Donner 2007; Burrell 2009), and a handful of quantitative researchers have begun to analyze ... But whereas in the general Rwandan populace males tend to be much better educated (76.3% of males are literate, but only 64.7% of females), among mobile phone users it is the women who achieve higher levels of education: the median woman completes secondary school, while the median man does not (t = 4.79). Table 1 shows a few statistics on asset ownership, with associated sampling error.
People, Quakes, and Communications: Inferences from Call Dynamics about a Seismic Event and its Influences on a Population
Kapoor, Ashish (Microsoft Research) | Eagle, Nathan (The Santa Fe Institute) | Horvitz, Eric (Microsoft Research)
We explore the prospect of inferring the epicenter and influences of seismic activity from changes in background phone communication activities logged at cell towers. In particular, we explore the perturbations in Rwandan call data invoked by an earthquake in February 2008 centered in the Lac Kivu region of the Democratic Republic of the Congo. Beyond the initial seismic event, we investigate the challenge of assessing the distribution of the persistence of needs over geographic regions, using the persistence of call anomalies after the earthquake as a proxy for lasting influences and the potential need for assistance. We also infer uncertainties in the inferences and consider the prospect of identifying the value of surveying the areas so that surveillance resources can be best triaged.
A Gender-Centric Analysis of Calling Behavior in a Developing Economy Using Call Detail Records
Frias-Martinez, Vanessa (Telefonica Research, Madrid) | Frias-Martinez, Enrique (Telefonica Research, Madrid) | Oliver, Nuria (Telefonica Research, Madrid)
The gender divide in access to technology in developing economies makes gender characterization and automatic gender identification two of the most critical needs for improving cell phone-based services. Gender identification has typically been solved using voice or image processing. However, such techniques cannot be applied to cell phone networks, mostly due to privacy concerns. In this paper, we present a study aimed at characterizing and automatically identifying the gender of a cell phone user in a developing economy based on behavioral, social and mobility variables. Our contributions are twofold: (1) understanding the role that gender plays in phone usage, and (2) evaluating common machine learning approaches for gender identification. The analysis was carried out using the encrypted CDRs (Call Detail Records) of approximately 10,000 users from a developing economy, whose gender was known a priori. Our results indicate that behavioral and social variables, including the number of input/output calls and the in-degree/out-degree of the social network, reveal statistically significant differences between male and female callers. Finally, we propose a new gender identification algorithm that can achieve classification rates of up to 80% when the percentage of predicted instances is reduced.
Mining Road Traffic Accident Data to Improve Safety: Role of Road-Related Factors on Accident Severity in Ethiopia
Beshah, Tibebe (Addis Ababa University) | Hill, Shawndra (University of Pennsylvania)
Road traffic accidents (RTAs) are a major public health concern, resulting in an estimated 1.2 million deaths and 50 million injuries worldwide each year. In the developing world, RTAs are among the leading causes of death and injury; Ethiopia in particular experiences the highest rate of such accidents. Thus, methods to reduce accident severity are of great interest to traffic agencies and the public at large. In this work, we applied data mining technologies to link recorded road characteristics to accident severity in Ethiopia, and developed a set of rules that could be used by the Ethiopian Traffic Agency to improve safety.
Elliptical slice sampling
Murray, Iain | Adams, Ryan Prescott | MacKay, David J. C.
Many probabilistic models introduce strong dependencies between variables using a latent multivariate Gaussian distribution or a Gaussian process. We present a new Markov chain Monte Carlo algorithm for performing inference in models with multivariate Gaussian priors. Its key properties are: 1) it has simple, generic code applicable to many models, 2) it has no free parameters, 3) it works well for a variety of Gaussian process based models. These properties make our method ideal for use while model building, removing the need to spend time deriving and tuning updates for more complex algorithms.
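As an illustration of the "simple, generic code" and "no free parameters" properties, the following is a minimal sketch of one elliptical slice sampling update as the algorithm is commonly presented, assuming a zero-mean Gaussian prior. The function names (`elliptical_slice_step`, `run_chain`), the argument layout, and the use of a Cholesky factor to draw from the prior are assumptions made for this example, not details taken from the abstract.

```python
import numpy as np

def elliptical_slice_step(f, log_lik, chol_sigma, rng):
    """One elliptical slice sampling update for a prior f ~ N(0, Sigma).

    f          : current state, shape (d,)
    log_lik    : callable returning the log-likelihood of a state
    chol_sigma : lower Cholesky factor L with L @ L.T == Sigma
    rng        : numpy.random.Generator
    """
    # An auxiliary draw from the prior defines the ellipse through f.
    nu = chol_sigma @ rng.standard_normal(f.shape[0])
    # Log-likelihood threshold that an accepted point must exceed.
    log_y = log_lik(f) + np.log(rng.uniform())
    # Initial proposal angle and the bracket that will be shrunk.
    theta = rng.uniform(0.0, 2.0 * np.pi)
    theta_min, theta_max = theta - 2.0 * np.pi, theta
    while True:
        proposal = f * np.cos(theta) + nu * np.sin(theta)
        if log_lik(proposal) > log_y:
            return proposal
        # Rejected: shrink the bracket toward theta = 0 (the current state).
        if theta < 0.0:
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)

def run_chain(f0, log_lik, chol_sigma, n_samples, rng):
    """Run the sampler for n_samples updates and return all states."""
    samples = np.empty((n_samples, f0.shape[0]))
    f = f0
    for i in range(n_samples):
        f = elliptical_slice_step(f, log_lik, chol_sigma, rng)
        samples[i] = f
    return samples
```

A hypothetical usage: with an identity prior covariance and a unit-variance Gaussian likelihood around an observation `y`, calling `run_chain(np.zeros(2), lambda f: -0.5 * np.sum((y - f) ** 2), np.eye(2), 4000, np.random.default_rng(0))` should give samples whose mean approaches `y / 2`. Note that the update has no step size or proposal scale to tune.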
Linear Time Feature Selection for Regularized Least-Squares
Pahikkala, Tapio | Airola, Antti | Salakoski, Tapio
We propose a novel algorithm for greedy forward feature selection for regularized least-squares (RLS) regression and classification, also known as the least-squares support vector machine or ridge regression. The algorithm, which we call greedy RLS, starts from the empty feature set, and on each iteration adds the feature whose addition provides the best leave-one-out cross-validation performance. Our method is considerably faster than the previously proposed ones, since its time complexity is linear in the number of training examples, the number of features in the original data set, and the desired size of the set of selected features. Therefore, as a side effect we obtain a new training algorithm for learning sparse linear RLS predictors which can be used for large scale learning. This speed is possible due to matrix calculus based short-cuts for leave-one-out and feature addition. We experimentally demonstrate the scalability of our algorithm and its ability to find good quality feature sets.
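The selection loop described above (start from the empty set, add the feature with the best leave-one-out performance) can be sketched as follows. This is a deliberately naive baseline, not the linear-time greedy RLS algorithm: it recomputes a full ridge solution for every candidate feature and relies only on the standard closed-form leave-one-out residual for ridge regression without an intercept. The function names and the regularization parameter `lam` are assumptions made for this example.

```python
import numpy as np

def loo_squared_error(X, y, lam):
    """Exact leave-one-out squared error for ridge regression (no intercept)."""
    n, d = X.shape
    # Hat matrix H = X (X^T X + lam I)^{-1} X^T maps y to the fitted values.
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
    residuals = y - H @ y
    # Closed-form LOO residual: e_i / (1 - H_ii); H_ii < 1 whenever lam > 0.
    loo_residuals = residuals / (1.0 - np.diag(H))
    return float(np.sum(loo_residuals ** 2))

def greedy_forward_selection(X, y, k, lam=1.0):
    """Greedily pick k feature indices, each minimizing the LOO squared error."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(k):
        # Score every candidate by the LOO error of the enlarged feature set.
        scores = [loo_squared_error(X[:, selected + [j]], y, lam)
                  for j in remaining]
        best = remaining[int(np.argmin(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

Each candidate evaluation here costs a fresh linear solve, so the sketch scales far worse than the method in the abstract; it is meant only to make the selection criterion concrete.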
Context-based Word Acquisition for Situated Dialogue in a Virtual World
To tackle the vocabulary problem in conversational systems, previous work has applied unsupervised learning approaches to co-occurring speech and eye gaze during interaction to automatically acquire new words. Although these approaches have shown promise, several issues related to human language behavior and human-machine conversation have not been addressed. First, psycholinguistic studies have shown certain temporal regularities between human eye movement and language production. While these regularities can potentially guide the acquisition process, they have not been incorporated in the previous unsupervised approaches. Second, conversational systems generally have an existing knowledge base about the domain and vocabulary. While the existing knowledge can potentially help bootstrap and constrain the acquired new words, it has not been incorporated in the previous models. Third, eye gaze could serve different functions in human-machine conversation. Some gaze streams may not be closely coupled with the speech stream, and thus are potentially detrimental to word acquisition. Automated recognition of closely coupled speech-gaze streams based on conversation context is therefore important. To address these issues, we developed new approaches that incorporate user language behavior, domain knowledge, and conversation context in word acquisition. We evaluated these approaches in the context of situated dialogue in a virtual world. Our experimental results have shown that incorporating the above three types of contextual information significantly improves word acquisition performance.
Predicting Positive and Negative Links in Online Social Networks
Leskovec, Jure | Huttenlocher, Daniel | Kleinberg, Jon
We study online social networks in which relationships can be either positive (indicating relations such as friendship) or negative (indicating relations such as opposition or antagonism). Such a mix of positive and negative links arises in a variety of online settings; we study datasets from Epinions, Slashdot and Wikipedia. We find that the signs of links in the underlying social networks can be predicted with high accuracy, using models that generalize across this diverse range of sites. These models provide insight into some of the fundamental principles that drive the formation of signed links in networks, shedding light on theories of balance and status from social psychology; they also suggest social computing applications by which the attitude of one user toward another can be estimated from evidence provided by their relationships with other members of the surrounding social network.
Universality, Characteristic Kernels and RKHS Embedding of Measures
Sriperumbudur, Bharath K. | Fukumizu, Kenji | Lanckriet, Gert R. G.
A Hilbert space embedding for probability measures has recently been proposed, wherein any probability measure is represented as a mean element in a reproducing kernel Hilbert space (RKHS). Such an embedding has found applications in homogeneity testing, independence testing, dimensionality reduction, etc., with the requirement that the reproducing kernel is characteristic, i.e., the embedding is injective. In this paper, we generalize this embedding to finite signed Borel measures, wherein any finite signed Borel measure is represented as a mean element in an RKHS. We show that the proposed embedding is injective if and only if the kernel is universal. This, therefore, provides a novel characterization of universal kernels, which are proposed in the context of achieving the Bayes risk by kernel-based classification/regression algorithms. By exploiting this relation between universality and the embedding of finite signed Borel measures into an RKHS, we establish the relation between universal and characteristic kernels.
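For readers unfamiliar with the construction, the embedding referred to above can be written out as follows. The notation is an assumption made for this sketch (input space $\mathcal{X}$, kernel $k$ with RKHS $\mathcal{H}$, probability measure $\mathbb{P}$, finite signed Borel measure $\mu$) and is not taken from the abstract.

```latex
% Mean element of a probability measure P in the RKHS H of kernel k
\mu_{\mathbb{P}} \;=\; \int_{\mathcal{X}} k(\cdot, x)\, d\mathbb{P}(x) \;\in\; \mathcal{H}

% k is characteristic when this map is injective on probability measures:
\mu_{\mathbb{P}} = \mu_{\mathbb{Q}} \;\Longrightarrow\; \mathbb{P} = \mathbb{Q}

% The same construction applied to a finite signed Borel measure \mu
\mu \;\longmapsto\; \int_{\mathcal{X}} k(\cdot, x)\, d\mu(x) \;\in\; \mathcal{H}
```

Under this notation, the abstract's main claim is that the last map is injective exactly when $k$ is universal.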