Auditing algorithms has emerged as a methodology for holding algorithms accountable by testing whether they are fair. This process often relies on repeated use of a platform to record inputs and their corresponding outputs. For example, to audit Google search, one repeatedly submits queries and captures the returned search pages. The goal is then to discover, in the collected data, patterns that reveal the ``secrets'' of algorithmic decision making. This knowledge discovery process makes some algorithm auditing tasks great applications for data mining techniques. In this paper, we introduce one particular algorithm audit, that of Google's Top stories. We describe the process of data collection, exploration, and analysis for this application and share some of the gleaned insights. Concretely, our analysis suggests that Google might be trying to burst the famous ``filter bubble'' by choosing lesser-known publishers for the third position in the Top stories.
This paper compares content-dependent and content-independent features for identifying the age range and gender of short texts' authors. Eight content-dependent features based on word n-gram profiles are used. In addition, ninety-eight content-independent features covering all the linguistic aspects of texts, from phonology to discourse, are used. These features were extracted from three corpora of different sizes and types. Experiments were conducted using four different machine learning algorithms combined with these features. The results show that content-dependent features perform better for gender identification on all three corpora, whereas content-independent features perform better for age range identification.
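The content-dependent features mentioned above can be sketched as word n-gram profile overlaps. The tokenisation, profile size, and function names below are illustrative assumptions, not the paper's actual feature set:

```python
from collections import Counter

def word_ngrams(text, n=2):
    """Extract word n-grams from a text (illustrative whitespace tokenisation)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_profile(texts, n=2, size=100):
    """Build a class profile: the `size` most frequent n-grams across training texts."""
    counts = Counter()
    for t in texts:
        counts.update(word_ngrams(t, n))
    return {g for g, _ in counts.most_common(size)}

def profile_overlap(text, profile, n=2):
    """Content-dependent feature: fraction of the text's n-grams found in a class profile."""
    grams = word_ngrams(text, n)
    if not grams:
        return 0.0
    return sum(g in profile for g in grams) / len(grams)
```

In a setup like this, one profile would be built per class (e.g., per gender or age range) and the overlap scores fed to a classifier as features.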
We focus on the problem of modeling deterministic equations over continuous variables in discrete Bayesian networks. This is typically achieved by a discretisation of both input and output variables and a degenerate quantification of the corresponding conditional probability tables. This approach, based on classical probabilities, cannot properly model the information loss induced by the discretisation. We show that a reliable modeling of such epistemic uncertainty can instead be achieved by credal sets, i.e., convex sets of probability mass functions. This transforms the original Bayesian network into a credal network, possibly returning interval-valued inferences, which are robust with respect to the information loss induced by the discretisation. Algorithmic strategies for an optimal choice of the discretisation bins are also discussed.
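As an illustration of how discretising a deterministic equation induces epistemic uncertainty, the sketch below builds an interval-valued (credal) column of a conditional probability table for Y = f(X). The grid-sampling scheme and function names are illustrative assumptions, not the paper's algorithm:

```python
def reachable_output_bins(f, x_lo, x_hi, y_edges, samples=1000):
    """Output bins that f can reach from the input bin [x_lo, x_hi] (grid sampling).
    Bins are treated as closed intervals; ties on an edge go to the lower bin."""
    reached = set()
    for k in range(samples + 1):
        y = f(x_lo + (x_hi - x_lo) * k / samples)
        for j in range(len(y_edges) - 1):
            if y_edges[j] <= y <= y_edges[j + 1]:
                reached.add(j)
                break
    return reached

def credal_cpt_column(f, x_lo, x_hi, y_edges):
    """Interval-valued CPT column: a (lower, upper) probability per output bin.
    An unreachable bin gets (0, 0); a reachable bin gets (0, 1) when several
    bins are reachable (epistemic uncertainty from discretisation), and the
    degenerate (1, 1) when it is the only reachable one."""
    reached = reachable_output_bins(f, x_lo, x_hi, y_edges)
    col = []
    for j in range(len(y_edges) - 1):
        if j not in reached:
            col.append((0.0, 0.0))
        elif len(reached) == 1:
            col.append((1.0, 1.0))
        else:
            col.append((0.0, 1.0))
    return col
```

A classical quantification would force a single degenerate column here; the interval-valued column instead keeps every probability assignment compatible with the bin-level information.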
This paper presents experimental results showing that discourse structure is a useful element in identifying the focus of negation. We define features extracted from RST-like discourse trees. We experiment with the largest publicly available corpus and an off-the-shelf discourse parser. Results show that discourse structure is especially beneficial when predicting the focus of negation in long sentences.
The Winograd Schema (WS) challenge, proposed as an alternative to the Turing Test, has become the new standard for evaluating progress in natural language understanding (NLU). In this paper, however, we will not be concerned with how this challenge might be addressed. Instead, our aim here is threefold: (i) we will first formally ``situate'' the WS challenge in the data-information-knowledge continuum, suggesting where in that continuum a good WS resides; (ii) we will show that a WS is just a special case of a more general phenomenon in language understanding, namely the missing text phenomenon (henceforth, MTP); in particular, we will argue that what we usually call thinking in the process of language understanding involves discovering a significant amount of ``missing text'' - text that is not explicitly stated, but is often implicitly assumed as shared background knowledge; and (iii) we conclude with a brief discussion on why MTP is inconsistent with the data-driven and machine learning approach to language understanding.
A Computer-Assisted Reading and Analysis of Texts (CARAT) process is a complex technology that connects language, text, information, and knowledge theories with computational formalizations, statistical approaches, symbolic approaches, standard and non-standard logics, etc. This process should always remain under the control of users, according to their subjectivity, their knowledge, and the purpose of their analysis. It thus becomes important to design platforms that support the design of CARAT tools, their management, their adaptation to new needs, and experimentation with them. Although several platforms for mining data, including textual data, have emerged in recent years, they lack flexibility and sound formal foundations. In this paper, we propose a formal model with strong logical foundations, based on typed applicative systems.
A hybrid recommender system fuses multiple data sources, usually with static, non-adjustable weightings, to deliver recommendations. One limitation of this approach is its inability to match user preferences in all situations. In this paper, we present two user-controllable hybrid recommender interfaces, which offer a set of sliders to dynamically tune the impact of different sources of relevance on the final ranking. Two user studies were performed to design and evaluate the proposed interfaces.
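The slider-based blending such interfaces expose can be sketched as a weighted combination of per-source relevance scores. The function name, score normalisation, and data layout below are illustrative assumptions, not the paper's implementation:

```python
def hybrid_scores(source_scores, weights):
    """Blend per-source relevance scores using user-tuned slider weights.

    source_scores: {source_name: {item: score}} -- one score dict per relevance source.
    weights: {source_name: slider value} -- the current slider positions.
    Returns a single {item: blended score} dict used to produce the final ranking."""
    total = sum(weights.values()) or 1.0  # avoid division by zero when all sliders are at 0
    items = {i for scores in source_scores.values() for i in scores}
    return {
        item: sum(w * source_scores[s].get(item, 0.0) for s, w in weights.items()) / total
        for item in items
    }
```

Moving a slider simply changes one entry of `weights`, so the ranking can be recomputed interactively without retraining any of the underlying recommenders.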
In a variety of online settings involving interaction with end users, it is critical for systems to adapt to changes in user preference. User preferences on items tend to change over time due to a variety of factors, such as change in context, the task being performed, or other short-term or long-term external factors. Recommender systems, in particular, need to be able to capture these dynamics in user preferences in order to remain tuned to the most current interests of users. In this work, we present a recommendation framework that takes into account the dynamics of user preferences. We propose an approach based on Hidden Markov Models (HMMs) to identify change-points in the sequence of user interactions which reflect significant changes in preference according to the sequential behavior of all the users in the data. The proposed framework leverages the identified change-points to generate recommendations using a sequence-aware non-negative matrix factorization model. We empirically demonstrate the effectiveness of the HMM-based change detection method as compared to standard baseline methods. Additionally, we evaluate the performance of the proposed recommendation method and show that it compares favorably to state-of-the-art sequence-aware recommendation models.
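HMM-based change-point detection of the kind described can be illustrated with a toy sketch: Viterbi decoding over a Gaussian-emission HMM whose hidden states are preference regimes, flagging the time steps where the most likely regime switches. The emission model, parameters, and function names are illustrative assumptions, not the paper's model:

```python
import math

def gaussian_logpdf(x, mu, sigma):
    """Log-density of a Gaussian emission."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def viterbi_change_points(obs, means, sigma=1.0, stay=0.95):
    """Decode the most likely hidden-state path and return the time steps where
    the state switches (candidate change-points). `means` gives one emission
    mean per preference regime; the transition model favours staying in the
    current regime with probability `stay`."""
    k = len(means)
    log_stay, log_switch = math.log(stay), math.log((1 - stay) / (k - 1))
    # initialisation: uniform prior over regimes
    v = [gaussian_logpdf(obs[0], m, sigma) - math.log(k) for m in means]
    back = []
    for t in range(1, len(obs)):
        new_v, ptr = [], []
        for j in range(k):
            best_i = max(range(k), key=lambda i: v[i] + (log_stay if i == j else log_switch))
            new_v.append(v[best_i] + (log_stay if best_i == j else log_switch)
                         + gaussian_logpdf(obs[t], means[j], sigma))
            ptr.append(best_i)
        v, back = new_v, back + [ptr]
    # backtrack the most likely path
    path = [max(range(k), key=lambda j: v[j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return [t for t in range(1, len(path)) if path[t] != path[t - 1]]
```

In a recommendation setting, the observations would be derived from user interaction sequences rather than raw scalars, and the detected change-points would segment each user's history before factorization.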
The accuracy of the Top-N recommendation task is challenged in systems that rely mainly on implicit user feedback. Adversarial training has achieved successful results in identifying real data distributions in various domains (e.g., image processing). Nonetheless, adversarial training applied to recommendation is still challenged by the interpretation of negative implicit feedback, which causes it to converge slowly and affects its convergence stability. This is often attributed to the high sparsity of implicit feedback and the discrete values characteristic of item recommendation. To face these challenges, we propose a novel model named convolutional adversarial latent factor model (CALF), which uses adversarial training in generative and discriminative models for implicit feedback recommendations. We assume that users prefer observed items over generated items and then apply a pairwise product to model the user-item interactions. The latent features then become the input of our convolutional neural network (CNN), which learns correlations among embedding dimensions. Finally, Rao-Blackwellized sampling is adopted to deal with the discrete values, optimizing CALF and stabilizing the training step. We conducted extensive experiments on three different benchmark datasets, on which our proposed model demonstrates its effectiveness for item recommendation.
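The pairwise-product interaction map that feeds the CNN can be illustrated in a few lines: an outer product of the user and item embeddings, followed by a convolution that mixes neighbouring embedding dimensions. This is a bare sketch of the idea, not the CALF architecture itself:

```python
def pairwise_product(u, v):
    """Outer (pairwise) product of user and item embeddings: a d x d interaction map."""
    return [[a * b for b in v] for a in u]

def conv2d_valid(x, kernel):
    """Single-channel valid convolution over the interaction map, as one CNN
    layer would apply it to learn correlations among embedding dimensions."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(x) - kh + 1):
        row = []
        for j in range(len(x[0]) - kw + 1):
            row.append(sum(x[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out
```

Unlike a plain inner product, the d x d map preserves every dimension-pair term, which is what gives the convolutional filters something to learn from.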
Discovering causal relations in a knowledge base is nowadays a challenging issue, as it offers a brand new way of understanding complex domains. In this paper, we present a method that combines an ontology with an object-oriented extension of Bayesian networks (BNs), called the probabilistic relational model (PRM), in order to help a user check his/her assumptions about causal relations between data and discover new relationships. These assumptions are also important as they guide the PRM construction and provide learning under causal constraints.