rating data
Urban Incident Prediction with Graph Neural Networks: Integrating Government Ratings and Crowdsourced Reports
Balachandar, Sidhika, Sadhuka, Shuvom, Berger, Bonnie, Pierson, Emma, Garg, Nikhil
Graph neural networks (GNNs) are widely used in urban spatiotemporal forecasting, such as predicting infrastructure problems. In this setting, government officials wish to know in which neighborhoods incidents like potholes or rodent issues occur. The true state of incidents (e.g., street conditions) for each neighborhood is observed via government inspection ratings. However, these ratings are only conducted for a sparse set of neighborhoods and incident types. We also observe the state of incidents via crowdsourced reports, which are more densely observed but may be biased due to heterogeneous reporting behavior. First, for such settings, we propose a multiview, multioutput GNN-based model that uses both unbiased rating data and biased reporting data to predict the true latent state of incidents. Second, we investigate a case study of New York City urban incidents and collect, standardize, and make publicly available a dataset of 9,615,863 crowdsourced reports and 1,041,415 government inspection ratings over 3 years and across 139 types of incidents. Finally, we show on both real and semi-synthetic data that our model can better predict the latent state compared to models that use only reporting data or models that use only rating data, especially when rating data is sparse and reports are predictive of ratings. We also quantify demographic biases in crowdsourced reporting, e.g., higher-income neighborhoods report problems at higher rates. Our analysis showcases a widely applicable approach for latent state prediction using heterogeneous, sparse, and biased data.
Discovering Semantic Subdimensions through Disentangled Conceptual Representations
Zhang, Yunhao, Wang, Shaonan, Lin, Nan, Dong, Xinyi, Li, Chong, Zong, Chengqing
Understanding the core dimensions of conceptual semantics is fundamental to uncovering how meaning is organized in language and the brain. Existing approaches often rely on predefined semantic dimensions that offer only broad representations, overlooking finer conceptual distinctions. This paper proposes a novel framework to investigate the subdimensions underlying coarse-grained semantic dimensions. Specifically, we introduce a Disentangled Continuous Semantic Representation Model (DCSRM) that decomposes word embeddings from large language models into multiple sub-embeddings, each encoding specific semantic information. Using these sub-embeddings, we identify a set of interpretable semantic subdimensions. To assess their neural plausibility, we apply voxel-wise encoding models to map these subdimensions to brain activation. Our work offers more fine-grained interpretable semantic subdimensions of conceptual meaning. Further analyses reveal that semantic dimensions are structured according to distinct principles, with polarity emerging as a key factor driving their decomposition into subdimensions. The neural correlates of the identified subdimensions support their cognitive and neuroscientific plausibility.
Crowdsourcing with Difficulty: A Bayesian Rating Model for Heterogeneous Items
Han, Seong Woo, Adฤฑgรผzel, Ozan, Carpenter, Bob
In applied statistics and machine learning, the "gold standards" used for training are often biased and almost always noisy. Dawid and Skene's justifiably popular crowdsourcing model adjusts for rater (coder, annotator) sensitivity and specificity, but fails to capture distributional properties of rating data gathered for training, which in turn biases training. In this study, we introduce a general purpose measurement-error model with which we can infer consensus categories by adding item-level effects for difficulty, discriminativeness, and guessability. We further show how to constrain the bimodal posterior of these models to avoid (or if necessary, allow) adversarial raters. We validate our model's goodness of fit with posterior predictive checks, the Bayesian analogue of $\chi^2$ tests. Dawid and Skene's model is rejected by goodness of fit tests, whereas our new model, which adjusts for item heterogeneity, is not rejected. We illustrate our new model with two well-studied data sets, binary rating data for caries in dental X-rays and implication in natural language.
Discordance Minimization-based Imputation Algorithms for Missing Values in Rating Data
Park, Young Woong, Kim, Jinhak, Zhu, Dan
Ratings are frequently used to evaluate and compare subjects in various applications, from education to healthcare, because ratings provide succinct yet credible measures for comparing subjects. However, when multiple rating lists are combined or considered together, subjects often have missing ratings, because most rating lists do not rate every subject in the combined list. In this study, we propose analyses on missing value patterns using six real-world data sets in various applications, as well as the conditions for applicability of imputation algorithms. Based on the special structures and properties derived from the analyses, we propose optimization models and algorithms that minimize the total rating discordance across rating providers to impute missing ratings in the combined rating lists, using only the known rating information. The total rating discordance is defined as the sum of the pairwise discordance metric, which can be written as a quadratic function. Computational experiments based on real-world and synthetic rating data sets show that the proposed methods outperform the state-of-the-art general imputation methods in the literature in terms of imputation accuracy.
Applications of K Nearest Neighbor algorithm part1(Artificial Intelligence)
Abstract: Demands for minimum parameter setup in machine learning models are desirable to avoid time-consuming optimization processes. The k-Nearest Neighbors is one of the most effective and straightforward models employed in numerous problems. Despite its well-known performance, it requires the value of k for specific data distribution, thus demanding expensive computational efforts. This paper proposes a k-Nearest Neighbors classifier that bypasses the need to define the value of k. The model computes the k value adaptively considering the data distribution of the training set.
Make your Own Book and Movie Recommender System using Surprise
Surprise (stands for Simple Python RecommendatIon System Engine) is a Python library for building and analyzing recommender systems that deal with explicit rating data. It provides various ready-to-use prediction algorithms such as baseline algorithms, neighborhood methods, matrix factorization-based ( SVD, PMF, SVD, NMF), and many others. Also, various similarity measures (cosine, MSD, Pearsonโฆ) are built-in. In both, I will use collaborative filtering techniques and content-based techniques to filter items, and no worries I will explain the differences between them. I use here the MovieLens dataset.
Leveraging Cross Feedback of User and Item Embeddings for Variational Autoencoder based Collaborative Filtering
Jin, Yuan, Zhao, He, Liu, Ming, Du, Lan, Li, Yunfeng, Xu, Ruohua, Gao, Longxiang
Matrix factorization (MF) has been widely applied to collaborative filtering in recommendation systems. Its Bayesian variants can derive posterior distributions of user and item embeddings, and are more robust to sparse ratings. However, the Bayesian methods are restricted by their update rules for the posterior parameters due to the conjugacy of the priors and the likelihood. Neural networks can potentially address this issue by capturing complex mappings between the posterior parameters and the data. In this paper, we propose a variational auto-encoder based Bayesian MF framework. It leverages not only the data but also the information from the embeddings to approximate their joint posterior distribution. The approximation is an iterative procedure with cross feedback of user and item embeddings to the others' encoders. More specifically, user embeddings sampled in the previous iteration, alongside their ratings, are fed back into the item-side encoders to compute the posterior parameters for the item embeddings in the current iteration, and vice versa. The decoder network then reconstructs the data using the MF with the currently re-sampled user and item embeddings. We show the effectiveness of our framework in terms of reconstruction errors across five real-world datasets. We also perform ablation studies to illustrate the importance of the cross feedback component of our framework in lowering the reconstruction errors and accelerating the convergence.
Explainable Restricted Boltzmann Machines for Collaborative Filtering
Abdollahi, Behnoush, Nasraoui, Olfa
Most accurate recommender systems are black-box models, hiding the reasoning behind their recommendations. Yet explanations have been shown to increase the user's trust in the system in addition to providing other benefits such as scrutability, meaning the ability to verify the validity of recommendations. This gap between accuracy and transparency or explainability has generated an interest in automated explanation generation methods. Restricted Boltzmann Machines (RBM) are accurate models for CF that also lack interpretability. In this paper, we focus on RBM based collaborative filtering recommendations, and further assume the absence of any additional data source, such as item content or user attributes. We thus propose a new Explainable RBM technique that computes the top-n recommendation list from items that are explainable. Experimental results show that our method is effective in generating accurate and explainable recommendations.
Hierarchical Compound Poisson Factorization
Basbug, Mehmet E., Engelhardt, Barbara E.
Non-negative matrix factorization models based on a hierarchical Gamma-Poisson structure capture user and item behavior effectively in extremely sparse data sets, making them the ideal choice for collaborative filtering applications. Hierarchical Poisson factorization (HPF) in particular has proved successful for scalable recommendation systems with extreme sparsity. HPF, however, suffers from a tight coupling of sparsity model (absence of a rating) and response model (the value of the rating), which limits the expressiveness of the latter. Here, we introduce hierarchical compound Poisson factorization (HCPF) that has the favorable Gamma-Poisson structure and scalability of HPF to high-dimensional extremely sparse matrices. More importantly, HCPF decouples the sparsity model from the response model, allowing us to choose the most suitable distribution for the response. HCPF can capture binary, non-negative discrete, non-negative continuous, and zero-inflated continuous responses. We compare HCPF with HPF on nine discrete and three continuous data sets and conclude that HCPF captures the relationship between sparsity and response better than HPF.
Collaborative Filtering Tutorials Across Languages
Collaborative filtering is the process of filtering for information using techniques involving collaboration among multiple agents. Applications of collaborative filtering typically involve very large data sets. This article covers some good tutorials regarding collaborative filtering we came across in Python, Java and R. Crab engine aims to provide a rich set of components from which you can construct a customized recommender system from a set of algorithms. The tutorial is from official documentation of Crab. This article presents an implementation of the collaborative filtering algorithm, that filters information for a user based on a collection of user profiles.