South America
Avoiding and Escaping Depressions in Real-Time Heuristic Search
Heuristics used for solving hard real-time search problems have regions with depressions. Such regions are bounded areas of the search space in which the heuristic function is inaccurate compared to the actual cost to reach a solution. Early real-time search algorithms, like LRTA*, easily become trapped in those regions since the heuristic values of their states may need to be updated multiple times, which results in costly solutions. State-of-the-art real-time search algorithms, like LSS-LRTA* or LRTA*(k), improve LRTA*'s mechanism to update the heuristic, resulting in improved performance. Those algorithms, however, do not guide search towards avoiding depressed regions. This paper presents depression avoidance, a simple real-time search principle to guide search towards avoiding states that have been marked as part of a heuristic depression. We propose two ways in which depression avoidance can be implemented: mark-and-avoid and move-to-border. We implement these strategies on top of LSS-LRTA* and RTAA*, producing four new real-time heuristic search algorithms: aLSS-LRTA*, daLSS-LRTA*, aRTAA*, and daRTAA*. When the objective is to find a single solution by running the real-time search algorithm once, we show that daLSS-LRTA* and daRTAA* outperform their predecessors, sometimes by an order of magnitude. Of the four new algorithms, daRTAA* produces the best solutions given a fixed deadline on the average time allowed per planning episode. We prove that all our algorithms have good theoretical properties: in finite search spaces, they find a solution if one exists, and converge to an optimal solution after a number of trials.
A Privacy-Aware Bayesian Approach for Combining Classifier and Cluster Ensembles
Acharya, Ayan, Hruschka, Eduardo R., Ghosh, Joydeep
This paper introduces a privacy-aware Bayesian approach that combines ensembles of classifiers and clusterers to perform semi-supervised and transductive learning. We consider scenarios where instances and their classification/clustering results are distributed across different data sites and have sharing restrictions. As a special case, the privacy-aware computation of the model when instances of the target data are distributed across different data sites is also discussed. Experimental results show that the proposed approach can provide good classification accuracies while adhering to the data/model sharing constraints.
Generalized Biwords for Bitext Compression and Translation Spotting
Sánchez-Martínez, F., Carrasco, R. C., Martínez-Prieto, M. A., Adiego, J.
Large bilingual parallel texts (also known as bitexts) are usually stored in a compressed form, and previous work has shown that they can be more efficiently compressed if the fact that the two texts are mutual translations is exploited. For example, a bitext can be seen as a sequence of biwords (pairs of parallel words with a high probability of co-occurrence) that can be used as an intermediate representation in the compression process. However, the simple biword approach described in the literature can only exploit one-to-one word alignments and cannot tackle the reordering of words. We therefore introduce a generalization of biwords which can describe multi-word expressions and reorderings. We also describe some methods for the binary compression of generalized biword sequences, and compare their performance when different schemes are applied to the extraction of the biword sequence. In addition, we show that this generalization of biwords allows for the implementation of an efficient algorithm that searches the compressed bitext for words or text segments in one of the texts and retrieves their counterpart translations in the other text (an application usually referred to as translation spotting), with only some minor modifications to the compression algorithm.
Improving Crowd Labeling through Expert Evaluation
Khattak, Faiza Khan (Columbia University) | Salleb-Aouissi, Ansaf (Columbia University)
We propose a general scheme for quality-controlled labeling of large-scale data using multiple labels from the crowd and a “few” ground truth labels from an expert in the field. Expert-labeled instances are used to assign weights to the expertise of each crowd labeler and to the difficulty of each instance. Ground truth labels for all instances are then approximated through those weights and the crowd labels. We argue that injecting a little expertise into the labeling process will significantly improve the accuracy of the labeling task. Our empirical evaluation demonstrates that our methodology is efficient and effective, as it gives better quality labels than majority voting and other state-of-the-art methods, even in the presence of a large proportion of low-quality labelers in the crowd.
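The idea of weighting crowd labelers by their agreement with a few expert labels can be illustrated with a minimal sketch. This is not the authors' scheme: the function names, the Laplace-smoothed accuracy used as a weight, and the toy data are all assumptions made for illustration, and the per-instance difficulty weights mentioned in the abstract are omitted.

```python
from collections import defaultdict


def labeler_weights(crowd_labels, expert_labels, smoothing=1.0):
    """Estimate one weight per labeler from agreement with expert labels.

    crowd_labels:  dict labeler -> dict instance -> label
    expert_labels: dict instance -> ground-truth label (the "few" expert labels)
    The smoothed-accuracy weight is an illustrative assumption.
    """
    weights = {}
    for labeler, labels in crowd_labels.items():
        seen = [i for i in labels if i in expert_labels]
        correct = sum(labels[i] == expert_labels[i] for i in seen)
        # Laplace smoothing: a labeler with no expert overlap gets weight 0.5.
        weights[labeler] = (correct + smoothing) / (len(seen) + 2 * smoothing)
    return weights


def weighted_vote(crowd_labels, weights):
    """Pick, for every instance, the label with the largest total weight."""
    scores = defaultdict(lambda: defaultdict(float))
    for labeler, labels in crowd_labels.items():
        for instance, label in labels.items():
            scores[instance][label] += weights[labeler]
    return {i: max(s, key=s.get) for i, s in scores.items()}


# Hypothetical toy data: labeler "a" is reliable, "b" and "c" are not.
crowd = {
    "a": {"x1": 1, "x2": 0, "x3": 1},
    "b": {"x1": 0, "x2": 1, "x3": 0},
    "c": {"x1": 0, "x2": 1, "x3": 0},
}
expert = {"x1": 1, "x2": 0}  # expert labels exist only for x1 and x2

weights = labeler_weights(crowd, expert)
labels = weighted_vote(crowd, weights)
```

In this toy example, plain majority voting would assign label 0 to x3, whereas the weighted vote follows the one labeler who agreed with the expert on the labeled instances.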
Web Resources Recommendation based on Dynamic Prediction of User Consumption on the Social Web
Rojas-Potosi, Luis Antonio (Universidad del Cauca) | Suarez-Meza, Luis Javier (Universidad del Cauca) | Ordoñez-Ante, Leandro (Universidad del Cauca) | Corrales, Juan Carlos (Universidad del Cauca)
The Web is a giant repository of resources (services and content), where discovery and recommendation systems are used to deliver a ranked list of the most relevant web resources that meet user requirements. Nowadays, these systems are based on simulating and automating the user's search criteria, considering the relation between consumption trends and the different kinds of relationships users have with their virtual and physical environments, based on information from the Social Web and mobile device sensors, among other sources. These systems are executed once an explicit user query has been received; however, some resources are useful in specific situations in which they have a high probability of being consumed but, in the absence of a query, are not recommended to users. In this regard, the question is: how can a successful Web resource recommendation be made without a user query? To answer this question, this research proposal presents a novel approach to recommending Web resources based on dynamic prediction of user consumption on the Social Web. The approach emulates user behavior, resource dynamism, and context opportunities in real time, catching the best situations in which to make an asynchronous (unexpected by the user) recommendation of a useful resource and to boost Web resource consumption.
Context tree selection and linguistic rhythm retrieval from written texts
Galves, Antonio, Galves, Charlotte, García, Jesús E., Garcia, Nancy L., Leonardi, Florencia
The starting point of this article is the question "How to retrieve fingerprints of rhythm in written texts?" We address this problem in the case of Brazilian and European Portuguese. These two dialects of Modern Portuguese share the same lexicon and most of the sentences they produce are superficially identical. Yet they are conjectured, on linguistic grounds, to implement different rhythms. We show that this linguistic question can be formulated as a problem of model selection in the class of variable length Markov chains. To carry out this approach, we compare texts from European and Brazilian Portuguese. These texts are first encoded according to some basic rhythmic features of the sentences which can be automatically retrieved. This is an entirely new approach from the linguistic point of view. Our statistical contribution is the introduction of the smallest maximizer criterion, which is a constant-free procedure for model selection. As a by-product, this provides a solution to the problem of the optimal choice of the penalty constant when using the BIC to select a variable length Markov chain. Besides proving the consistency of the smallest maximizer criterion when the sample size diverges, we also conduct a simulation study comparing our approach with both the standard BIC selection and the Peres-Shields order estimation. Applied to the linguistic sample constituted for our case study, the smallest maximizer criterion assigns different context-tree models to the two dialects of Portuguese. The features of the selected models are compatible with current conjectures discussed in the linguistic literature.
FoodMood: Measuring Global Food Sentiment One Tweet at a Time
Dixon, Natalie (Affect Lab Foundation) | Jakic, Bruno (AI Applied) | Lagerweij, Roderick (AI Applied) | Mooij, Mark (AI Applied) | Yudin, Ekaterina (Affect Lab Foundation)
Do Happy Meals really make us happy? Do salads make us blue? Is cake our comfort? FoodMood is an interactive data visualisation project that gives citizens a rare opportunity to engage with, reflect on, acknowledge, and understand the connection between emotion, obesity and food. The project explores the opportunities presented by the data-sharing world of today’s cities using global English-language tweets about food coupled with sentiment analysis. It aims to gain a better understanding of global food consumption patterns and their impact on the daily emotional well-being of people against the backdrop of country data such as Gross Domestic Product (GDP) and obesity levels. A key finding is that tweets can be used to find a relationship between certain foods, food sentiment and obesity levels in countries. Overall, FoodMood shows a majority positive sentiment towards food. Other findings, although constantly evolving, indicate trends such as: globally, meat enjoys a high sentiment rating and is often tweeted about; fast-food companies dominate the food consumption landscapes of most countries’ tweets, although not all of them enjoy equal sentiment ratings across countries. Ultimately, FoodMood reveals a hidden layer of meaningful digital, social, and cultural data that provide a basis for further analysis.
Unsupervised Real-Time Company Name Disambiguation in Twitter
Muñoz, Agustín D. Delgado (UNED University) | Unanue, Raquel Martínez (UNED University) | García-Plaza, Alberto Pérez (UNED University) | Fresno, Víctor (UNED University)
This paper presents a new approach to disambiguating company names on the Twitter social network. We have focused on lightening the processing needed to compare company profiles with tweets in order to obtain a competitive real-time system. With this aim, we use only the home page of each company as the information source to create a unique profile. We then compute the similarity of a tweet to a profile by comparing the content of the tweet with the profile. Neither step uses any other external information source, and the whole process is unsupervised. We have tested our application on the WePS-3 CLEF ORM test corpus, obtaining encouraging results.
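Comparing the content of a tweet with a company profile can be sketched as follows. The abstract does not say which similarity measure is used, so the bag-of-words representation and cosine similarity here are assumptions, and the company name and texts are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical profile built from a company's home page text.
profile = bag_of_words("Orbit sells running shoes and sports apparel")

related = cosine_similarity(bag_of_words("new Orbit running shoes are great"), profile)
unrelated = cosine_similarity(bag_of_words("the moon is in orbit tonight"), profile)
```

Under these assumptions, the tweet about the company scores higher than the tweet that merely contains the ambiguous word, so a threshold on the score could separate the two; choosing that threshold is outside the scope of this sketch.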
Visualizing Topic Models
Chaney, Allison June-Barlow (Princeton University) | Blei, David M. (Princeton University)
Managing large collections of documents is an important problem for many areas of science, industry, and culture. Probabilistic topic modeling offers a promising solution. Topic modeling is an unsupervised machine learning method that learns the underlying themes in a large collection of otherwise unorganized documents. This discovered structure summarizes and organizes the documents. However, topic models are high-level statistical tools—a user must scrutinize numerical distributions to understand and explore their results. In this paper, we present a method for visualizing topic models. Our method creates a navigator of the documents, allowing users to explore the hidden structure that a topic model discovers. These browsing interfaces reveal meaningful patterns in a collection, helping end-users explore and understand its contents in new ways. We provide open source software of our method.
Facebook and Privacy: The Balancing Act of Personality, Gender, and Relationship Currency
Quercia, Daniele (University of Cambridge) | Casas, Diego Las (Universidade Federal de Minas Gerais) | Pesce, Joao Paulo (Universidade Federal de Minas Gerais) | Stillwell, David (University of Cambridge) | Kosinski, Michal (University of Cambridge) | Almeida, Virgilio (Universidade Federal de Minas Gerais) | Crowcroft, Jon (University of Cambridge)
Social media profiles are telling examples of the everyday need for disclosure and concealment. The balance between concealment and disclosure varies across individuals, and personality traits might partly explain this variability. Experimental findings on the relationship between information disclosure and personality have so far been inconsistent. We thus study this relationship anew with 1,313 Facebook users in the United States using two personality tests: the big five personality test and the self-monitoring test. We model the process of information disclosure in a principled way using Item Response Theory and correlate the resulting user disclosure scores with personality traits. We find a correlation with the trait of Openness and observe gender effects, in that men and women share equal amounts of private information, but men tend to make it more publicly available, well beyond their social circles. Interestingly, geographic (e.g., residence, hometown) and work-related information is used as relationship currency, in that it is selectively shared with social contacts and is rarely shared with the Facebook community at large.