fingerprint
Scalable Fingerprinting of Large Language Models
Model fingerprinting has emerged as a powerful tool for model owners to identify their shared model given API access. In order to lower false discovery rate, fight fingerprint leakage, and defend against coalitions of model users attempting to bypass detection, we argue that scaling up the number of fingerprints one can embed into a model, i.e. Scalability of fingerprints, is critical. Hence, we pose scalability as a crucial requirement for fingerprinting schemes. We experiment with fingerprint design at a scale significantly larger than previously considered, and introduce a new method, dubbed Perinucleus sampling, to generate scalable, persistent, and harmless fingerprints. We demonstrate that this scheme can add 24,576 fingerprints to a Llama-3.1-8B
MS-BART: Unified Modeling of Mass Spectra and Molecules for Structure Elucidation
Mass spectrometry (MS) plays a critical role in molecular identification, significantly advancing scientific discovery. However, structure elucidation from MS data remains challenging due to the scarcity of annotated spectra. While largescale pretraining has proven effective in addressing data scarcity in other domains, applying this paradigm to mass spectrometry is hindered by the complexity and heterogeneity of raw spectral signals. To address this, we propose MS-BART, a unified modeling framework that maps mass spectra and molecular structures into a shared token vocabulary, enabling cross-modal learning through large-scale pretraining on reliably computed fingerprint-molecule datasets. Multi-task pretraining objectives further enhance MS-BART's generalization by jointly optimizing denoising and translation task. The pretrained model is subsequently transferred to experimental spectra through finetuning on fingerprint predictions generated with MIST, a pre-trained spectral inference model, thereby enhancing robustness to real-world spectral variability. While finetuning alleviates the distributional difference, MS-BART still suffers molecular hallucination and requires further alignment. We therefore introduce a chemical feedback mechanism that guides the model toward generating molecules closer to the reference structure. Extensive evaluations demonstrate that MS-BART achieves SOTA performance across 5/12 key metrics on MassSpecGym and NPLIB1 and is faster by one order of magnitude than competing diffusion-based methods, while comprehensive ablation studies systematically validate the model's effectiveness and robustness.
Measuring Fingerprints of Web-filtered Text Datasets and Fingerprint Propagation Through Training
We investigate fingerprints in pretraining datasets for large language models (LLMs) through dataset classification experiments. Building on prior work demonstrating the existence of fingerprints or biases in popular computer vision datasets, we analyze popular open-source pretraining datasets for LLMs derived from CommonCrawl including C4, RefinedWeb, DolmaCC, RedPajama-V2, FineWeb, and DCLM-Baseline. Despite those datasets being obtained with similar curation steps, neural networks can classify surprisingly well which dataset a single text sequence belongs to, significantly better than a human can. This indicates that small differences in filtering and processing pipelines induce fingerprints, that we find are evident in formatting, vocabulary, and content distributions. Such fingerprints can negatively impact cross-dataset generalization. Additionally, we show that these fingerprints propagate through training: sequences generated by models trained on those datasets can be accurately classified by a classifier trained on the original datasets. This can offer insights into data characteristics that are typically undisclosed by LLM developers, including pretraining mixture proportions and finetuning data sources.
Scalable Fingerprinting of Large Language Models
Model fingerprinting has emerged as a powerful tool for model owners to identify their shared model given API access. In order to lower false discovery rate, fight fingerprint leakage, and defend against coalitions of model users attempting to bypass detection, we argue that scaling up the number of fingerprints one can embed into a model, i.e. of fingerprints, is critical. Hence, we pose scalability as a crucial requirement for fingerprinting schemes. We experiment with fingerprint design at a scale significantly larger than previously considered, and introduce a new method, dubbed Perinucleus sampling, to generate scalable, persistent, and harmless fingerprints. We demonstrate that this scheme can add 24,576 fingerprints to a Llama-3.1-8B
CodeMerge: Codebook-Guided Model Merging for Robust Test-Time Adaptation in Autonomous Driving
Maintaining robust 3D perception under dynamic and unpredictable test-time conditions remains a critical challenge for autonomous driving systems. Existing test-time adaptation (TTA) methods often fail in high-variance tasks like 3D object detection due to unstable optimization and sharp minima. While recent model merging strategies based on linear mode connectivity (LMC) offer improved stability by interpolating between fine-tuned checkpoints, they are computationally expensive, requiring repeated checkpoint access and multiple forward passes. In this paper, we introduce CodeMerge, a lightweight and scalable model merging framework that bypasses these limitations by operating in a compact latent space. Instead of loading full models, CodeMerge represents each checkpoint with a low-dimensional fingerprint derived from the source model's penultimate features and constructs a key-value codebook. We compute merging coefficients using ridge leverage scores on these fingerprints, enabling efficient model composition without compromising adaptation quality. Our method achieves strong performance across challenging benchmarks, improving end-to-end 3D detection 14.9\% NDS on nuScenes-C and LiDAR-based detection by over 7.6\% mAP on nuScenes-to-KITTI, while benefiting downstream tasks such as online mapping, motion prediction and planning even without training.
Measuring Fingerprints of Web-filtered Text Datasets and Fingerprint Propagation Through Training
We investigate fingerprints in pretraining datasets for large language models (LLMs) through dataset classification experiments. Building on prior work demonstrating the existence of fingerprints or biases in popular computer vision datasets, we analyze popular open-source pretraining datasets for LLMs derived from CommonCrawl including C4, RefinedWeb, DolmaCC, RedPajama-V2, FineWeb, and DCLM-Baseline. Despite those datasets being obtained with similar curation steps, neural networks can classify surprisingly well which dataset a single text sequence belongs to, significantly better than a human can. This indicates that small differences in filtering and processing pipelines induce fingerprints, that we find are evident in formatting, vocabulary, and content distributions. Such fingerprints can negatively impact cross-dataset generalization. Additionally, we show that these fingerprints propagate through training: sequences generated by models trained on those datasets can be accurately classified by a classifier trained on the original datasets. This can offer insights into data characteristics that are typically undisclosed by LLM developers, including pretraining mixture proportions and finetuning data sources.
Experts issue urgent warning over doing a 'peace' sign in photos - amid fears hackers can steal your FINGERPRINTS and copy them
Married doctor's affair with glamorous younger woman explodes into Fatal Attraction-style court war... X-rated photo claims, leaked recordings and a sinister threat: 'I'll never stop' NBA rocked as Grizzlies star Brandon Clarke dies suddenly at 29... a month after being arrested on drug charges The unassuming apps all cheaters use to hide their affairs: Where to look on your partner's phone to see exactly what they are up to... and the subtle red flags to never ignore I've treated so many cocaine users. This is the one sign that makes it so obvious you have a problem, how it can kill you in a night... and the embarrassing sexual side effect you may not have heard of: DR PHILIPPA KAYE Explosive Supreme Court LEAK reveals stinging whispers about'belligerent' justice read the wild rants troubling both sides of the aisle Surge in cancer patients taking 20 cent'wonder drug' after Mel Gibson claims that friends beat incurable disease thanks to drug The'marry me' sex move that'll make even the most commitment-phobic of men beg to see you again... and it worked for THREE of my friends Trump's chilling'treason' note revealed as he hunts down Iran war leakers... and Israel bombshell sparks fury Hollywood's $350k matchmaker exposes the secret love lives of the rich and famous: Diva demands, fake names, NDAs... and how to know if your relationship is doomed Secret trove of injury photos that blow apart married tech mogul's family-man image revealed in explosive lawsuit: Bruises, beatings and forced sex acts he allegedly inflicted on girlfriend Furious argument explodes on CNN after panelist flagged Kevin O'Leary's old age during foul-mouthed fight about politics He knew Elizabeth Taylor's secrets. Johnny Depp came to him for answers. But Hollywood's greatest confidante buried a betrayal that was too dangerous to expose Experts issue urgent warning over doing a'peace' sign in photos - amid fears hackers can steal your FINGERPRINTS and copy them Your latest selfie could be giving hackers everything they need to crack your accounts, experts have warned. Cybersecurity researchers have issued an urgent warning against doing a'peace' sign in photos, amid fears that criminals could steal your fingerprints.
Deep Self-Dissimilarities as Powerful Visual Fingerprints
Features extracted from deep layers of classification networks are widely used as image descriptors. Here, we exploit an unexplored property of these features: their internal dissimilarity. While small image patches are known to have similar statistics across image scales, it turns out that the internal distribution of deep features varies distinctively between scales. We show how this deep self dissimilarity (DSD) property can be used as a powerful visual fingerprint. Particularly, we illustrate that full-reference and no-reference image quality measures derived from DSD are highly correlated with human preference. In addition, incorporating DSD as a loss function in training of image restoration networks, leads to results that are at least as photo-realistic as those obtained by GAN based methods, while not requiring adversarial training.