Cold Case: The Lost MNIST Digits

Neural Information Processing Systems

Although the popular MNIST dataset \citep{mnist} is derived from the NIST database \citep{nist-sd19}, the precise processing steps of this derivation have been lost to time. We propose a reconstruction that is accurate enough to serve as a replacement for the MNIST dataset, with insignificant changes in accuracy. We trace each MNIST digit to its NIST source and its rich metadata, such as writer identifier and partition identifier. We also reconstruct the complete MNIST test set with 60,000 samples instead of the usual 10,000. Since the remaining 50,000 were never distributed, they enable us to investigate the impact of twenty-five years of MNIST experiments on the reported testing performances. Our results unambiguously confirm the trends observed by \citet{recht2018cifar,recht2019imagenet}: although the misclassification rates are slightly off, classifier ordering and model selection remain broadly reliable. We attribute this phenomenon to the pairing benefits of comparing classifiers on the same digits.
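The pairing benefit mentioned at the end of the abstract can be illustrated with a small simulation. This is a minimal sketch, not the paper's experiment: the two classifiers, the notion of "hard" digits, and every error probability below are hypothetical values chosen only to make the errors of both classifiers concentrate on the same examples.

```python
import random
import statistics


def simulate_errors(n, p_hard, rng):
    """Simulate per-example 0/1 errors for two hypothetical classifiers.

    A fraction p_hard of examples is 'hard'; both classifiers fail far more
    often on those, so their errors are positively correlated.
    All probabilities are illustrative assumptions, not measured values.
    """
    errs_a, errs_b = [], []
    for _ in range(n):
        hard = rng.random() < p_hard
        p_a = 0.60 if hard else 0.002
        p_b = 0.50 if hard else 0.001
        errs_a.append(1 if rng.random() < p_a else 0)
        errs_b.append(1 if rng.random() < p_b else 0)
    return errs_a, errs_b


def paired_se(errs_a, errs_b):
    """Standard error of the error-rate difference measured on the same examples."""
    diffs = [a - b for a, b in zip(errs_a, errs_b)]
    return statistics.stdev(diffs) / len(diffs) ** 0.5


def unpaired_se(errs_a, errs_b):
    """Standard error of the difference if each rate came from an independent test set."""
    n = len(errs_a)
    return (statistics.variance(errs_a) / n + statistics.variance(errs_b) / n) ** 0.5


if __name__ == "__main__":
    errs_a, errs_b = simulate_errors(10_000, 0.05, random.Random(0))
    print("paired SE:  ", paired_se(errs_a, errs_b))
    print("unpaired SE:", unpaired_se(errs_a, errs_b))
```

Because the shared hard examples contribute the same noise to both error rates, that noise cancels in the per-example difference, so the paired standard error comes out smaller than the unpaired one. This is one way to see why classifier ordering can stay reliable even when absolute error rates shift.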


TikTok Is Now Collecting Even More Data About Its Users. Here Are the 3 Biggest Changes

WIRED

According to its new privacy policy, TikTok now collects more data on its users, including their precise location, after majority ownership officially switched to a group based in the US. When TikTok users in the US opened the app today, they were greeted with a pop-up asking them to agree to the social media platform's new terms of service and privacy policy before they could resume scrolling. These changes are part of TikTok's transition to new ownership. In order to continue operating in the US, TikTok was compelled by the US government to transition from Chinese control to a new, American-majority corporate entity.


149 Million Usernames and Passwords Exposed by Unsecured Database

WIRED

This "dream wish list for criminals" includes millions of Gmail, Facebook, banking logins, and more. The researcher who discovered it suspects they were collected using infostealing malware. A database containing 149 million account usernames and passwords (including 48 million for Gmail, 17 million for Facebook, and 420,000 for the cryptocurrency platform Binance) has been removed after a researcher reported the exposure to the hosting provider. The longtime security analyst who discovered the database, Jeremiah Fowler, could not find indications of who owned or operated it, so he worked to notify the host, which took down the trove because it violated a terms of service agreement. In addition to email and social media logins for a number of platforms, Fowler also observed credentials for government systems from multiple countries as well as consumer banking and credit card logins and media streaming platforms.



Generative Retrieval Meets Multi-Graded Relevance

Neural Information Processing Systems

Generative retrieval represents a novel approach to information retrieval, utilizing an encoder-decoder architecture to directly produce relevant document identifiers (docids) for queries. While this method offers benefits, current implementations are limited to scenarios with binary relevance data, overlooking the potential for documents to have multi-graded relevance. Extending generative retrieval to accommodate multi-graded relevance poses challenges, including the need to reconcile likelihood probabilities for docid pairs and the possibility of multiple relevant documents sharing the same identifier. To address these challenges, we introduce a new framework called GRaded Generative Retrieval (GR$^2$). Our approach focuses on two key components: ensuring relevant and distinct identifiers, and implementing multi-graded constrained contrastive training. Firstly, we aim to create identifiers that are both semantically relevant and sufficiently distinct to represent individual documents effectively. This is achieved by jointly optimizing the relevance and distinctness of docids through a combination of docid generation and autoencoder models. Secondly, we incorporate information about the relationship between relevance grades to guide the training process. Specifically, we leverage a constrained contrastive training strategy to bring the representations of queries and the identifiers of their relevant documents closer together, based on their respective relevance grades. Extensive experiments on datasets with both multi-graded and binary relevance demonstrate the effectiveness of our method.


Federated Transformer: Multi-Party Vertical Federated Learning on Practical Fuzzily Linked Data

Neural Information Processing Systems

Federated Learning (FL) is an evolving paradigm that enables multiple parties to collaboratively train models without sharing raw data. Among its variants, Vertical Federated Learning (VFL) is particularly relevant in real-world, cross-organizational collaborations, where distinct features of a shared instance group are contributed by different parties. In these scenarios, parties are often linked using fuzzy identifiers, leading to a common practice termed multi-party fuzzy VFL. Existing models generally address either multi-party VFL or fuzzy VFL between two parties. Extending these models to practical multi-party fuzzy VFL typically results in significant performance degradation and increased costs for maintaining privacy.


Autoregressive Search Engines: Generating Substrings as Document Identifiers

Neural Information Processing Systems

Knowledge-intensive language tasks require NLP systems to both provide the correct answer and retrieve supporting evidence for it in a given corpus. Autoregressive language models are emerging as the de-facto standard for generating answers, with newer and more powerful systems emerging at an astonishing pace. In this paper we argue that all this (and future) progress can be directly applied to the retrieval problem with minimal intervention to the models' architecture. Previous work has explored ways to partition the search space into hierarchical structures and retrieve documents by autoregressively generating their unique identifier. In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers. This setup allows us to use an autoregressive model to generate and score distinctive ngrams, which are then mapped to full passages through an efficient data structure. Empirically, we show this not only outperforms prior autoregressive approaches but also leads to an average improvement of at least 10 points over more established retrieval solutions for passage-level retrieval on the KILT benchmark, establishing new state-of-the-art downstream performance on some datasets, while using a considerably lighter memory footprint than competing systems.
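The idea of treating every ngram of a passage as a possible identifier can be sketched in a few lines. This is a toy illustration under stated assumptions: a plain dictionary stands in for the "efficient data structure" (the abstract does not say which one is used), the ngram scores are supplied by hand instead of coming from an autoregressive model, and the helper names `build_index` and `rank_passages` are hypothetical.

```python
from collections import defaultdict


def ngrams(tokens, max_n):
    """Yield every contiguous n-gram (as a tuple) with 1 <= n <= max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def build_index(passages, max_n=3):
    """Map each n-gram to the set of passage ids that contain it.

    A dict-based inverted index is an assumption made for readability;
    it is far less memory-efficient than a compressed text index.
    """
    index = defaultdict(set)
    for pid, text in passages.items():
        for gram in ngrams(text.lower().split(), max_n):
            index[gram].add(pid)
    return index


def rank_passages(index, scored_ngrams):
    """Sum the scores of generated n-grams over the passages they occur in."""
    totals = defaultdict(float)
    for gram_text, score in scored_ngrams:
        gram = tuple(gram_text.lower().split())
        for pid in index.get(gram, ()):
            totals[pid] += score
    # Highest total first; ties broken by passage id for determinism.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    passages = {
        "p1": "the eiffel tower is in paris",
        "p2": "the tower of london is in england",
    }
    index = build_index(passages)
    # Hand-written stand-ins for model-generated, model-scored n-grams.
    print(rank_passages(index, [("eiffel tower", 2.0), ("is in", 0.5)]))
```

A distinctive ngram such as "eiffel tower" occurs in few passages and therefore discriminates well, while a common one such as "is in" matches many passages and contributes little to the ranking. Summing scores is only one plausible aggregation rule.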


xtdml: Double Machine Learning Estimation to Static Panel Data Models with Fixed Effects in R

Polselli, Annalivia

arXiv.org Machine Learning

The double machine learning (DML) method combines the predictive power of machine learning with statistical estimation to conduct inference about the structural parameter of interest. This paper presents the R package `xtdml`, which implements DML methods for partially linear panel regression models with low-dimensional fixed effects and high-dimensional confounding variables, as proposed by Clarke and Polselli (2025). The package provides functionalities to: (a) learn nuisance functions with machine learning algorithms from the `mlr3` ecosystem, (b) handle unobserved individual heterogeneity by choosing among first-difference transformation, within-group transformation, and correlated random effects, and (c) transform the covariates with min-max normalization and polynomial expansion to improve learning performance. We showcase the use of `xtdml` with both simulated and real longitudinal data.
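Two of the panel transformations named in (b), together with the min-max normalization from (c), are standard operations that can be sketched generically. The snippet below is written in Python for illustration and does not reproduce the `xtdml` API; the function names are hypothetical, correlated random effects are not covered, and rows are assumed to be sorted by unit and then by time.

```python
from collections import defaultdict


def within_transform(ids, values):
    """Within-group transformation: subtract each unit's own mean."""
    sums, counts = defaultdict(float), defaultdict(int)
    for unit, v in zip(ids, values):
        sums[unit] += v
        counts[unit] += 1
    return [v - sums[unit] / counts[unit] for unit, v in zip(ids, values)]


def first_difference(ids, values):
    """First-difference transformation within each unit.

    Assumes rows are sorted by unit, then time. The first observation of
    every unit has no predecessor and is dropped.
    """
    return [
        values[k] - values[k - 1]
        for k in range(1, len(values))
        if ids[k] == ids[k - 1]
    ]


def min_max(values):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    ids = [1, 1, 1, 2, 2, 2]
    # Unit 1 has a fixed effect near 10, unit 2 near 100 (illustrative data).
    y = [11, 12, 13, 105, 107, 109]
    print(within_transform(ids, y))
    print(first_difference(ids, y))
    print(min_max([2, 4, 6]))
```

Both panel transformations remove any additive, time-constant unit effect: demeaning subtracts it along with the unit mean, and differencing cancels it between consecutive periods. That is why the large level gap between the two units disappears from the transformed values.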