Supplementary material for TopoSRL: Topology Preserving Self-Supervised Simplicial Representation Learning

Neural Information Processing Systems

Theorem 1 concerns minimizing the expected loss L, where the two views are drawn from a probability distribution conditioned on the original data distribution X, the augmented complex is distributed as X, and features are T-dimensional. A similar result can be established for the second term in Equation (S4), which reduces the variance of the representations of simplices and their neighborhoods within the same augmented simplicial complex.

In Table S1, we provide details about the datasets used in the experiments in the paper, namely contact-high-school, contact-primary-school, senate-bills, and email-Enron. A simplex in contact-high-school and contact-primary-school represents a group of people who were in close proximity, and the classes are the classrooms the students belong to. In senate-bills, a simplex is the set of co-sponsors of a bill put forth in the Senate, and the classes are the political parties the sponsors belong to.



IDGen: Item Discrimination Induced Prompt Generation for LLM Evaluation

Fan Lin

Neural Information Processing Systems

As Large Language Models (LLMs) grow increasingly adept at managing complex tasks, the evaluation set must keep pace with these advancements to ensure it remains sufficiently discriminative. Item Discrimination (ID) theory, which is widely used in educational assessment, measures the ability of individual test items to differentiate between high and low performers. Inspired by this theory, we propose an ID-induced prompt synthesis framework for evaluating LLMs, ensuring the evaluation set can be continually updated and refined in line with model abilities.
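For a concrete reference point, the classical discrimination index from ID theory is the gap in per-item performance between the top- and bottom-ranked examinees. The sketch below is a generic illustration of that statistic; the function name, the 27% grouping convention, and the array layout are our assumptions, not IDGen's implementation:

```python
import numpy as np

def discrimination_index(item_scores: np.ndarray, total_scores: np.ndarray,
                         group_frac: float = 0.27) -> np.ndarray:
    """Classical item-discrimination index: per-item difference in mean score
    between the top and bottom groups of examinees.

    item_scores:  (n_examinees, n_items) binary or graded item scores.
    total_scores: (n_examinees,) overall scores used to rank examinees.
    group_frac:   fraction of examinees in each extreme group (27% is the
                  conventional choice in educational measurement).
    """
    n = len(total_scores)
    k = max(1, int(round(group_frac * n)))
    order = np.argsort(total_scores)
    low, high = order[:k], order[-k:]
    # D = p_high - p_low; items with D near 0 fail to separate strong from
    # weak performers and are candidates for replacement.
    return item_scores[high].mean(axis=0) - item_scores[low].mean(axis=0)
```

Items with a high index separate strong and weak models well; an analogous statistic over model responses would flag prompts that no longer discriminate.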




An information-theoretic quantification of the content of communication between brain regions

Neural Information Processing Systems

Quantifying the amount, content, and direction of communication between brain regions is key to understanding brain function. Traditional methods to analyze brain activity based on the Wiener-Granger causality principle quantify the overall information propagated by neural activity between simultaneously recorded brain regions, but do not reveal the information flow about specific features of interest (such as sensory stimuli). Here, we develop a new information-theoretic measure termed Feature-specific Information Transfer (FIT), quantifying how much information about a specific feature flows between two regions.
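For context, the Wiener-Granger-style quantity that FIT refines can be illustrated with a plug-in transfer entropy estimator on discretized activity, TE(X→Y) = I(Y_t; X_{t-1} | Y_{t-1}). This is a generic sketch for already-binned signals, not the paper's FIT estimator, which further isolates the feature-specific component of the flow:

```python
import numpy as np
from collections import Counter

def transfer_entropy(x: np.ndarray, y: np.ndarray) -> float:
    """Plug-in transfer entropy TE(X -> Y) = I(Y_t ; X_{t-1} | Y_{t-1})
    for discrete (already binned) time series x and y, in bits."""
    xp, yp, yt = x[:-1], y[:-1], y[1:]  # past of X, past of Y, present of Y

    def H(*cols):
        # Plug-in joint entropy of the given columns.
        joint = Counter(zip(*cols))
        n = len(cols[0])
        p = np.array([c / n for c in joint.values()])
        return float(-np.sum(p * np.log2(p)))

    # TE = H(Y_t | Y_{t-1}) - H(Y_t | Y_{t-1}, X_{t-1}), expanded into
    # joint entropies via the chain rule.
    return H(yt, yp) - H(yp) + H(yp, xp) - H(yt, yp, xp)
```

On a pair where y simply copies x with a one-step lag, TE(x→y) approaches the entropy rate of x while TE(y→x) stays near zero, recovering the expected directionality.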


On the Trade-off of Intra-/Inter-class Diversity for Supervised Pre-training

Neural Information Processing Systems

Pre-training datasets are critical for building state-of-the-art machine learning models, motivating rigorous study on their impact on downstream tasks. In this work, we study the impact of the trade-off between the intra-class diversity (the number of samples per class) and the inter-class diversity (the number of classes) of a supervised pre-training dataset. Empirically, given a fixed pre-training dataset size, we find that the best downstream performance comes with a balance on the intra-/inter-class diversity. To understand the underlying mechanism, we show theoretically that downstream performance depends monotonically on both types of diversity.
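The trade-off can be probed empirically by subsampling a labeled pool at a fixed budget while sweeping the number of classes against the number of samples per class. The following is a minimal sketch of that protocol; the function name and configuration grid are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

def subsample_configs(labels: np.ndarray, budget: int,
                      class_counts=(5, 10, 25, 50), seed: int = 0):
    """Enumerate (num_classes, samples_per_class) configurations at a fixed
    pre-training budget, trading inter-class diversity (more classes) against
    intra-class diversity (more samples per class). Returns index sets."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    configs = {}
    for c in class_counts:
        m = budget // c  # samples per class at this class count
        if c > len(classes) or m == 0:
            continue
        chosen = rng.choice(classes, size=c, replace=False)
        idx = np.concatenate([
            rng.choice(np.flatnonzero(labels == k), size=m, replace=False)
            for k in chosen
        ])
        configs[(c, m)] = idx
    return configs
```

Pre-training on each index set and comparing downstream accuracy across the `(c, m)` grid exposes the balance point the abstract describes.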




Post-processing Private Synthetic Data for Improving Utility on Selected Measures

Neural Information Processing Systems

Existing private synthetic data generation algorithms are agnostic to downstream tasks. However, end users may have specific requirements that the synthetic data must satisfy. Failure to meet these requirements could significantly reduce the utility of the data for downstream use. We introduce a post-processing technique that improves the utility of the synthetic data with respect to measures selected by the end user, while preserving strong privacy guarantees and dataset quality. Our technique involves resampling from the synthetic data to filter out samples that do not meet the selected utility measures, using an efficient stochastic first-order algorithm to find optimal resampling weights. Through comprehensive numerical experiments, we demonstrate that our approach consistently improves the utility of synthetic data across multiple benchmark datasets and state-of-the-art synthetic data generation algorithms.
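The resampling idea can be sketched as a convex problem: find weights on the probability simplex so that weighted query answers over the synthetic records match target values. Below is a minimal exponentiated-gradient illustration under our own assumptions (function name, objective, and solver are ours; the paper's actual algorithm and utility measures may differ):

```python
import numpy as np

def fit_resampling_weights(answers: np.ndarray, targets: np.ndarray,
                           steps: int = 2000, lr: float = 0.5) -> np.ndarray:
    """Find per-record resampling weights w on the probability simplex so
    that weighted query answers on the synthetic data match target values.

    answers: (n_queries, n_records) value of each utility query per record.
    targets: (n_queries,) desired weighted averages for those queries.
    Minimizes 0.5 * ||answers @ w - targets||^2 with exponentiated gradient
    (mirror descent on the simplex), a simple stochastic-friendly
    first-order scheme.
    """
    n = answers.shape[1]
    w = np.full(n, 1.0 / n)            # start from the uniform distribution
    for _ in range(steps):
        residual = answers @ w - targets   # query-level mismatch
        grad = answers.T @ residual        # gradient of the squared loss
        w = w * np.exp(-lr * grad)         # multiplicative (mirror) update
        w /= w.sum()                       # renormalize onto the simplex
    return w
```

Resampling then draws records with probabilities `w` (e.g. `np.random.choice(n, size=k, p=w)`), down-weighting records that violate the selected measures while leaving the privacy guarantee of the upstream mechanism untouched by post-processing.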