 Personal Assistant Systems


Choosing a Proxy Metric from Past Experiments

arXiv.org Machine Learning

In many randomized experiments, the treatment effect of the long-term metric (i.e., the primary outcome of interest) is often difficult or infeasible to measure. Such long-term metrics are often slow to react to changes and sufficiently noisy that they are challenging to faithfully estimate in short-horizon experiments. A common alternative is to measure several short-term proxy metrics in the hope they closely track the long-term metric -- so they can be used to effectively guide decision-making in the near-term. We introduce a new statistical framework to both define and construct an optimal proxy metric for use in a homogeneous population of randomized experiments. Our procedure first reduces the construction of an optimal proxy metric in a given experiment to a portfolio optimization problem which depends on the true latent treatment effects and noise level of the experiment under consideration. We then denoise the observed treatment effects of the long-term metric and a set of proxies in a historical corpus of randomized experiments to extract estimates of the latent treatment effects for use in the optimization problem. One key insight derived from our approach is that the optimal proxy metric for a given experiment is not a priori fixed; rather it should depend on the sample size (or effective noise level) of the randomized experiment for which it is deployed. To instantiate and evaluate our framework, we employ our methodology in a large corpus of randomized experiments from an industrial recommendation system and construct proxy metrics that perform favorably relative to several baselines.
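The point that the best proxy depends on sample size can be illustrated with a toy calculation. The sketch below is not the paper's portfolio optimization: it only compares two hypothetical single proxies, and every name and number in it (`Proxy`, `rho`, `latent_sd`, `unit_noise_var`, the two example proxies) is an assumption made for illustration. It assumes an observed proxy effect equals its latent effect plus independent noise with variance `unit_noise_var / n`.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Proxy:
    name: str
    rho: float             # assumed correlation of latent proxy effect with latent long-term effect
    latent_sd: float       # assumed std. dev. of the latent proxy effect across experiments
    unit_noise_var: float  # assumed per-unit noise variance of the proxy measurement


def observed_correlation(proxy: Proxy, n: int) -> float:
    """Correlation between the noisy observed proxy effect and the latent
    long-term effect, when measurement noise has variance unit_noise_var / n."""
    observed_sd = sqrt(proxy.latent_sd ** 2 + proxy.unit_noise_var / n)
    return proxy.rho * proxy.latent_sd / observed_sd


def best_proxy(proxies: list[Proxy], n: int) -> Proxy:
    """Pick the single proxy whose observed effect tracks the long-term effect best."""
    return max(proxies, key=lambda p: observed_correlation(p, n))


# Two hypothetical proxies: one well aligned but noisy, one less aligned but quiet.
aligned_but_noisy = Proxy("aligned_but_noisy", rho=0.9, latent_sd=1.0, unit_noise_var=1000.0)
quiet_but_weaker = Proxy("quiet_but_weaker", rho=0.6, latent_sd=1.0, unit_noise_var=10.0)
candidates = [aligned_but_noisy, quiet_but_weaker]

small_experiment_choice = best_proxy(candidates, n=100)
large_experiment_choice = best_proxy(candidates, n=100_000)
```

Under these assumed numbers the quieter proxy wins in the small experiment and the better-aligned proxy wins in the large one, which is the qualitative behavior the abstract describes.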


Is Rotten Tomatoes Certified Rotten?

Slate

This week, Stephen and Dana are joined by guest host Kat Chow, journalist and author of the 2021 memoir Seeing Ghosts. The panel begins by wading through HELL, Chris Fleming's new hour-long comedy special that's both puzzling and delightfully goofy. Then, the three consider Astrakan, a deeply dark and unsettling first feature from director David Depesseville, and attempt to parse through the film's (intentionally?) Finally, they conclude by discussing Rotten Tomatoes, the widely used critical review aggregation site and subject of the recent Vulture exposé by Lane Brown, "The Decomposition of Rotten Tomatoes," which details a "gaming of the system" by Hollywood PR teams. In the exclusive Slate Plus segment, the panel dives into the 2023 U.S. Open, specifically the effect of extreme heat on gameplay and how the sport will need to contend with climate change going forward.


Apple Watch Series 9 can handle Siri requests without your iPhone

Engadget

It's September, which means the air is thick with the promise of fall, school is back in session, and Apple just revealed a new Apple Watch. This year, at its annual fall event, the company is showing off the Apple Watch Series 9. The Series 9 features a new processor, the S9 chip, and a quad-core neural engine, which promises 18-hour battery life and overall performance boosts. On the software side, watchOS 10 is poised to be the biggest UI overhaul in Apple Watch history, with a renewed focus on widgets, and a slew of app and input updates. The Series 9 is available to order today and it's due to hit the market on September 22.


RoDia: A New Dataset for Romanian Dialect Identification from Speech

arXiv.org Artificial Intelligence

Dialect identification is a critical task in speech processing and language technology, enhancing various applications such as speech recognition, speaker verification, and many others. While most research studies have been dedicated to dialect identification in widely spoken languages, limited attention has been given to dialect identification in low-resource languages, such as Romanian. To address this research gap, we introduce RoDia, the first dataset for Romanian dialect identification from speech. The RoDia dataset includes a varied compilation of speech samples from five distinct regions of Romania, covering both urban and rural environments, totaling 2 hours of manually annotated speech data. Along with our dataset, we introduce a set of competitive models to be used as baselines for future research. The top scoring model achieves a macro F1 score of 59.83% and a micro F1 score of 62.08%, indicating that the task is challenging. We thus believe that RoDia is a valuable resource that will stimulate research aiming to address the challenges of Romanian dialect identification. We publicly release our dataset and code at https://github.com/codrut2/RoDia.
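For readers unfamiliar with the two reported scores, the sketch below shows how macro and micro F1 differ for single-label classification. The function name `f1_scores` and the toy labels are assumptions for illustration; they are not RoDia data or the authors' evaluation code.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_count, fp_count, fn_count):
        denom = 2 * tp_count + fp_count + fn_count
        return 2 * tp_count / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class weighs the same.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes first, so every sample weighs the same.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical, imbalanced toy labels (two made-up regions).
y_true = ["region_a"] * 4 + ["region_b"] * 2
y_pred = ["region_a", "region_a", "region_a", "region_b", "region_b", "region_a"]
macro_f1, micro_f1 = f1_scores(y_true, y_pred)
```

In the single-label setting, micro F1 coincides with accuracy, while macro F1 drops when the smaller classes are predicted less well.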


A Co-design Study for Multi-Stakeholder Job Recommender System Explanations

arXiv.org Artificial Intelligence

Recent legislation proposals have significantly increased the demand for eXplainable Artificial Intelligence (XAI) in many businesses, especially in so-called `high-risk' domains, such as recruitment. Within recruitment, AI has become commonplace, mainly in the form of job recommender systems (JRSs), which try to match candidates to vacancies, and vice versa. However, common XAI techniques often fall short in this domain due to the different levels and types of expertise of the individuals involved, making explanations difficult to generalize. To determine the explanation preferences of the different stakeholder types - candidates, recruiters, and companies - we created and validated a semi-structured interview guide. Using grounded theory, we structurally analyzed the results of these interviews and found that different stakeholder types indeed have strongly differing explanation preferences. Candidates indicated a preference for brief, textual explanations that allow them to quickly judge potential matches. On the other hand, hiring managers preferred visual graph-based explanations that provide a more technical and comprehensive overview at a glance. Recruiters found more exhaustive textual explanations preferable, as those provided them with more talking points to convince both parties of the match. Based on these findings, we describe guidelines on how to design an explanation interface that fulfills the requirements of all three stakeholder types. Furthermore, we provide the validated interview guide, which can assist future research in determining the explanation preferences of different stakeholder types.


Would You Rather Stay Home Alone or Online Date?: A Game for Single Women

The New Yorker

Would you rather spend a quiet evening by yourself, reading an awful book with a contrived plot and cringy dialogue . . . Would you rather go for a solo walk and get attacked by hissing Canada geese in heat . . . Would you rather go to a dog park on your own, receive weird looks from dog owners because you have no dog, and get your leg humped by three muddy puppies who smell like pee . . . Would you rather sit at home alone on a Saturday night and binge-watch "The Great British Bake Off" while on a strict no-carb, no-sugar diet . . . Would you rather go to a coffee shop by yourself and sit next to someone who starts loudly conducting a phone interview . . .


This wireless heads up display for your car is more than $150 off now

PCWorld

We're past the days of MapQuest, but people still spend a dangerous amount of time looking at their phones while driving. You need a better solution, and this 9″ Wireless Heads Up Car Display has you covered. Compatible with Apple CarPlay, Android Auto, and wireless compatible mirror linking functions, this display helps you navigate, control music playback, and manage calls via Siri or Google Assistant on a safer dashboard display that you won't have to look into your lap to use. The intuitive tool installs easily on your dashboard via a self-adhesive bracket that doesn't alter your stereo setup. Then, it gives you optimal visibility day and night with automatic brightness adjustment while 4Ω 3W speakers ensure you can hear your music and voice instructions easily.


Offline Recommender System Evaluation under Unobserved Confounding

arXiv.org Machine Learning

Off-Policy Estimation (OPE) methods allow us to learn and evaluate decision-making policies from logged data. This makes them an attractive choice for the offline evaluation of recommender systems, and several recent works have reported successful adoption of OPE methods to this end. An important assumption that makes this work is the absence of unobserved confounders: random variables that influence both actions and rewards at data collection time. Because the data collection policy is typically under the practitioner's control, the unconfoundedness assumption is often left implicit, and its violations are rarely dealt with in the existing literature. This work aims to highlight the problems that arise when performing off-policy estimation in the presence of unobserved confounders, specifically focusing on a recommendation use-case. We focus on policy-based estimators, where the logging propensities are learned from logged data. We characterise the statistical bias that arises due to confounding, and show how existing diagnostics are unable to uncover such cases. Because the bias depends directly on the true and unobserved logging propensities, it is non-identifiable. As the unconfoundedness assumption is famously untestable, this becomes especially problematic. This paper emphasises this common, yet often overlooked issue. Through synthetic data, we empirically show how naïve propensity estimation under confounding can lead to severely biased metric estimates that are allowed to fly under the radar. We aim to cultivate an awareness among researchers and practitioners of this important problem, and touch upon potential research directions towards mitigating its effects.
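The bias described above can be made concrete with a small exact calculation. This is a simplified illustration, not the paper's experimental setup: the binary confounder, the two-action setting, and all probabilities (`P_U`, `LOGGING_P_A1`, `REWARD_A1`) are assumed values. It compares the expected inverse propensity scoring (IPS) estimate for a target policy that always plays action 1, once with the true confounder-dependent propensities and once with the marginal propensity that would be learned from logs in which the confounder is not recorded.

```python
# Assumed toy environment with an unobserved binary confounder u.
P_U = {0: 0.5, 1: 0.5}           # P(u)
LOGGING_P_A1 = {0: 0.1, 1: 0.9}  # logging policy: P(a = 1 | u)
REWARD_A1 = {0: 0.2, 1: 0.8}     # E[reward | a = 1, u]


def expected_ips(propensity):
    """Expected IPS estimate of the 'always play action 1' policy when
    propensity(u) is the value plugged into the importance weight."""
    return sum(
        P_U[u] * LOGGING_P_A1[u] * REWARD_A1[u] / propensity(u)
        for u in P_U
    )


# Ground truth: value of always playing action 1, averaged over u.
true_value = sum(P_U[u] * REWARD_A1[u] for u in P_U)

# Propensity a practitioner would estimate without access to u: P(a = 1).
marginal_propensity = sum(P_U[u] * LOGGING_P_A1[u] for u in P_U)

ips_with_true_propensities = expected_ips(lambda u: LOGGING_P_A1[u])
ips_with_marginal_propensity = expected_ips(lambda u: marginal_propensity)
confounding_bias = ips_with_marginal_propensity - true_value
```

With these assumed numbers, the estimate using true propensities recovers the true value of about 0.5, while the estimate using the marginal propensity comes out at about 0.74, because action 1 was logged mostly in contexts where the confounder also made rewards high.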


VideolandGPT: A User Study on a Conversational Recommender System

arXiv.org Artificial Intelligence

This paper investigates how large language models (LLMs) can enhance recommender systems, with a specific focus on Conversational Recommender Systems that leverage user preferences and personalised candidate selections from existing ranking models. We introduce VideolandGPT, a recommender system for a Video-on-Demand (VOD) platform, Videoland, which uses ChatGPT to select from a predetermined set of contents, considering the additional context indicated by users' interactions with a chat interface. We evaluate ranking metrics, user experience, and fairness of recommendations, comparing a personalised and a non-personalised version of the system, in a between-subject user study. Our results indicate that the personalised version outperforms the non-personalised in terms of accuracy and general user satisfaction, while both versions increase the visibility of items which are not in the top of the recommendation lists. However, both versions present inconsistent behavior in terms of fairness, as the system may generate recommendations which are not available on Videoland.


RecFusion: A Binomial Diffusion Process for 1D Data for Recommendation

arXiv.org Artificial Intelligence

In this paper we propose RecFusion, which comprises a set of diffusion models for recommendation. Unlike image data, which contain spatial correlations, a user-item interaction matrix, commonly utilized in recommendation, lacks spatial relationships between users and items. We formulate diffusion on a 1D vector and propose binomial diffusion, which explicitly models binary user-item interactions with a Bernoulli process. We show that RecFusion approaches the performance of complex VAE baselines on the core recommendation setting (top-n recommendation for binary non-sequential feedback) and the most common datasets (MovieLens and Netflix). Our proposed diffusion models that are specialized for 1D and/or binary setups have implications beyond recommendation systems, such as in the medical domain with MRI and CT scans.
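As a rough illustration of what a Bernoulli forward process on a binary interaction vector can look like, the sketch below uses one common binomial-diffusion form, in which each bit is resampled as Bernoulli(x * (1 - beta) + 0.5 * beta). This parameterization, the noise schedule, and the names `forward_step` and `marginal_prob` are assumptions; the abstract does not specify RecFusion's exact formulation.

```python
import random


def forward_step(x_prev, beta, rng):
    """One noising step: each bit keeps its value with weight (1 - beta)
    and is pulled toward a fair coin flip with weight beta."""
    return [1 if rng.random() < x * (1.0 - beta) + 0.5 * beta else 0 for x in x_prev]


def marginal_prob(x0_bit, betas):
    """Closed-form P(x_t = 1 | x_0) after applying all steps in betas."""
    alpha_bar = 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
    return x0_bit * alpha_bar + 0.5 * (1.0 - alpha_bar)


rng = random.Random(0)           # fixed seed so the sketch is repeatable
x0 = [1, 0, 0, 1, 0, 0, 0, 1]    # hypothetical user row: 1 = interacted with item
betas = [0.1] * 5                # assumed constant noise schedule

x_t = x0
for beta in betas:
    x_t = forward_step(x_t, beta, rng)

p_one_given_interaction = marginal_prob(1, betas)
p_one_given_no_interaction = marginal_prob(0, betas)
```

As more steps are applied, both marginal probabilities approach 0.5, so the vector converges to uniform binary noise regardless of the original interactions.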