nucleus
Sampling from Flow Language Models via Marginal-Conditioned Bridges
Azangulov, Iskander, Zhang, Leo
Flow Language Models (FLMs) are a recently introduced class of language models which adapt continuous flow matching for one-hot encoded token sequences. Their denoisers have a special structure absent from generic continuous diffusion models: each block of the denoising mean is a posterior marginal distribution over the clean token at that position. Standard DDPM-style samplers collapse these marginals to a single conditional-mean endpoint and bridge toward this simplex-valued point, which is generally not a valid one-hot sequence. We argue that the natural sampler for an FLM is instead posterior-predictive. At each reverse step, we sample a clean one-hot endpoint from the factorized posterior defined by the FLM token marginals, and then sample the next continuous state from the analytic Ornstein--Uhlenbeck bridge conditioned on that endpoint. The method is training-free, uses the same model evaluations as standard sampling, and gives a principled interface for token-level decoding controls such as temperature scaling and nucleus truncation. We show that, under exact posterior marginals, the endpoint approximation error is exactly the conditional multi-information among token positions. The induced one-step bridge kernel preserves all token-wise posterior-predictive marginals and loses only the residual cross-position dependence. Finally, we prove a Girsanov path-space comparison showing that the marginal-conditioned bridge has a no-larger denoising-error term than the frozen conditional-mean bridge, with strict improvement whenever intermediate coordinate-wise bridge observations reveal additional information about the clean token. Experiments with FLMs show that the sampler improves the quality--diversity tradeoff. Code is available at: github.com/imbirik/mcb.
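The reverse step described in this abstract (sample a one-hot endpoint from the per-position marginals, then sample the next state from the bridge conditioned on it) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `nucleus_truncate`, `sample_endpoint`, and `bridge_step` are hypothetical, the forward process is assumed to be the variance-preserving form x_t = alpha_t * x_0 + sigma_t * eps, and the bridge is the standard Gaussian conditional q(x_s | x_t, x_0) for that process. The paper's exact schedule and parameterization may differ.

```python
import numpy as np

def nucleus_truncate(probs, top_p):
    """Per position, keep the smallest top-probability set reaching mass top_p, then renormalize."""
    order = np.argsort(-probs, axis=-1)                   # token indices, most likely first
    sorted_p = np.take_along_axis(probs, order, axis=-1)
    cum = np.cumsum(sorted_p, axis=-1)
    keep_sorted = (cum - sorted_p) < top_p                # mass before this token is still below top_p
    keep = np.zeros_like(keep_sorted)
    np.put_along_axis(keep, order, keep_sorted, axis=-1)  # map the mask back to vocabulary order
    out = np.where(keep, probs, 0.0)
    return out / out.sum(axis=-1, keepdims=True)

def sample_endpoint(marginals, rng, temperature=1.0, top_p=1.0):
    """Draw one clean token per position from (L, V) marginals; return a one-hot (L, V) array."""
    logp = np.log(np.clip(marginals, 1e-12, None)) / temperature
    p = np.exp(logp - logp.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    if top_p < 1.0:
        p = nucleus_truncate(p, top_p)
    L, V = p.shape
    tokens = np.array([rng.choice(V, p=p[i]) for i in range(L)])  # positions sampled independently
    return np.eye(V)[tokens]

def bridge_step(x_t, x0, alpha_t, sigma_t, alpha_s, sigma_s, rng):
    """Sample x_s ~ q(x_s | x_t, x0) for s < t, assuming x_t = alpha_t * x0 + sigma_t * eps."""
    a_ts = alpha_t / alpha_s
    var_ts = sigma_t**2 - a_ts**2 * sigma_s**2            # noise added going from s to t
    mean = (a_ts * sigma_s**2 / sigma_t**2) * x_t + (alpha_s * var_ts / sigma_t**2) * x0
    var = var_ts * sigma_s**2 / sigma_t**2
    return mean + np.sqrt(max(var, 0.0)) * rng.standard_normal(x_t.shape)

# Toy usage: random marginals stand in for the FLM denoiser output (an assumption).
rng = np.random.default_rng(0)
L, V = 4, 6
marginals = rng.dirichlet(np.ones(V), size=L)
t, s = 1.0, 0.5
alpha = lambda u: np.exp(-u)                              # assumed OU-style schedule
sigma = lambda u: np.sqrt(1.0 - np.exp(-2.0 * u))
x_t = rng.standard_normal((L, V))
x0_hat = sample_endpoint(marginals, rng, temperature=0.8, top_p=0.9)
x_s = bridge_step(x_t, x0_hat, alpha(t), sigma(t), alpha(s), sigma(s), rng)
```

The only difference from a conditional-mean step in this sketch is the endpoint passed to `bridge_step`: a sampled one-hot array rather than the marginals themselves. Temperature and nucleus truncation act on the marginals before sampling, which is where token-level decoding controls enter.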
NIS3D: A Completely Annotated Benchmark for Dense 3D Nuclei Image Segmentation
The existing nuclei segmentation benchmarks either worked on 2D only or annotated a small number of 3D cells, perhaps due to the high cost of 3D annotation for large-scale data. To fulfill the critical need, we constructed NIS3D, a 3D, high cell density, large-volume, and completely annotated Nuclei Image Segmentation benchmark, assisted by our newly designed semi-automatic annotation software. NIS3D provides more than 22,000 cells across multiple most-used species in this area. Each cell is labeled by three independent annotators, so we can measure the variability of each annotation. A confidence score is computed for each cell, allowing more nuanced testing and performance comparison. A comprehensive review on the methods of segmenting 3D dense nuclei was conducted. The benchmark was used to evaluate the performance of several selected state-of-the-art segmentation algorithms. The best of current methods is still far away from human-level accuracy, corroborating the necessity of generating such a benchmark. The testing results also demonstrated the strength and weakness of each method and pointed out the directions of further methodological development.
De-Anonymizing Text by Fingerprinting Language Generation
Components of machine learning systems are not (yet) perceived as security hotspots. Secure coding practices, such as ensuring that no execution paths depend on confidential inputs, have not yet been adopted by ML developers. We initiate the study of code security of ML systems by investigating how nucleus sampling--a popular approach for generating text, used for applications such as auto-completion--unwittingly leaks texts typed by users.