Industry
Sparse Functional Singular Value Decomposition for Biclustering and Triclustering Longitudinal Data
Zhao, Yue, Chekouo, Thierry, Safo, Sandra
Identifying subtypes of complex conditions, such as Inflammatory Bowel Disease (IBD), often requires capturing latent patterns in longitudinal omics data. However, these data are typically high-dimensional, sparsely sampled, and irregularly observed over time, posing substantial challenges for conventional (bi)clustering and functional data analysis methods. We propose Tri-SfSVD, a unified sparse functional Singular Value Decomposition framework for discovering biclusters and triclusters in longitudinal data. Unlike existing functional biclustering methods that rely on ad hoc imputation or enforce restrictive shape-homogeneity assumptions, Tri-SfSVD integrates continuous trajectory estimation with simultaneous subject, feature, and temporal selection within a single optimization framework. By imposing sparse penalties across subjects, variables, and temporal subregions, the proposed method works directly on observed data to uncover localized structures at the subject, subject-feature, and subject-feature-time levels. Extensive simulations demonstrate that Tri-SfSVD outperforms existing approaches in high-dimensional settings. Applied to IBD multi-omics data, the method identified three biclusters linking sample clusters with distinct IBD-related clinical characteristics to microbial pathway groups associated with specific bacterial taxa, providing interpretable subject-pathway associations for characterizing disease heterogeneity. Applied to multi-channel EEG data, the method identified three triclusters linking sample clusters with distinct alcohol-related phenotypes to localized brain activity patterns, including subgroup differences separated by temporal subregions within the same spatial region.
Zero-Copy Semantic Contagion: An In-Memory Streaming Architecture for Evolving Attention Graphs
Per-ticker forecasting models dominate financial time-series work yet remain blind to cross-company propagation: a foundry disruption in Taiwan does not register in a single-asset model until Apple's own price has already moved. To address this limitation, we introduce a heterogeneous Rust-Python streaming architecture that maps cross-company attention as a continuous-time graph driven directly from text. We show that on the ingestion side, a zero-copy Rust edge parses news records in $\sim$100 ns and scans the target equity universe in $\sim$1.2 $\mu$s. On the inference end, a multivariate Neural Hawkes Process featuring per-node continuous-time LSTM states and a bilinear latent projection propagates directed excitation, while an adaptive pruning rule bounds the computational cost of dynamic neighborhood updates. Combining these stages, we demonstrate an end-to-end processing latency of $\sim$13 ms per incoming news record on a single commodity CPU. Evaluated on a one-month temporal holdout of the FNSPID corpus (638 articles across 47 tickers), the system delivers a $1.70\times$ precision lift over random at the 90th-percentile next-day return threshold, and $3.36\times$ over a same-sector baseline. Crucially, removing the graph topology collapses precision to zero, confirming that the dynamic attention network is the sole driver of cross-company signal in this architecture.
Symmetric Divergence and Normalized Similarity: A Unified Topological Framework for Representation Analysis
Topological Data Analysis (TDA) offers a principled, intrinsic lens for comparing neural representations. However, existing paired topological divergences (e.g., RTD) are limited by heuristic asymmetry and, more critically, unbounded scores that depend on sample size, hindering reliable cross-scenario benchmarking. To address these challenges, we develop a unified topological toolkit serving two complementary needs: fine-grained structural diagnosis and robust, standardized evaluation. First, we complete the RTD framework by introducing Symmetric Representation Topology Divergence (SRTD) and its efficient variant SRTD-lite. Beyond resolving the theoretical asymmetry of prior variants, SRTD consolidates diagnostic information into a single, comprehensive cross-barcode signature. This allows for precise localization of structural discrepancies and serves as an effective optimization objective without the overhead of dual directional computations. Second, to enable reliable benchmarking across heterogeneous settings, we propose Normalized Topological Similarity (NTS). By measuring the rank correlation of hierarchical merge orders, NTS yields a scale-invariant metric bounded between -1 and 1, effectively overcoming the scale and sample-dependence of unnormalized divergences. Experiments across synthetic and real-world deep learning settings demonstrate that our toolkit captures functional shifts in CNNs missed by geometric measures and robustly maps LLM genealogy even under distance saturation, offering a rigorous, topology-aware perspective that complements measures like CKA.
Trajectory-Aware Node Contributions and the Limits of Static Controllability
Kuskova, Valentina, Zaytsev, Dmitry, Coppedge, Michael
A recurring data mining task in complex networks is to determine how individual nodes contribute to system behavior. Existing approaches rely on either static-graph centralities or control-theoretic quantities such as controllability Gramians, which assume linear, time-invariant dynamics. Estimated systems, however, are typically nonlinear and time-varying. We define "emergent contribution (EC)," a finite-horizon measure of a node's dynamical leverage: the metric-weighted energy of its impulse response accumulated along the system trajectory. Computed from the Jacobians of any differentiable model, EC is estimator-agnostic and reduces exactly to average controllability in the linear, time-invariant limit. Our contribution is a characterization of when the two measures agree and when they diverge. Using a controlled synthetic family with known ground-truth contribution, we construct a phase diagram spanning nonlinearity, regime structure, persistence, and perturbation amplitude. EC and average controllability agree under static or smoothly drifting dynamics and both track ground truth. Divergence emerges under persistent regime switching, is strongest under persistent sign reversal, and disappears when the sign reversal is removed. At extreme perturbation amplitudes, both measures degrade, identifying the limits of local linearization. We place five estimated real systems from several domains within this phase space. Their placement serves as a diagnostic of when EC provides information beyond static controllability and therefore justifies its additional computational cost. On one panel examined in depth, a twenty-seed retraining ensemble reveals a robust variance--leverage dissociation: nodes whose perturbations propagate widely despite low within-system variance, a pattern recovered neither by static centralities nor by variance-based summaries.
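The impulse-response energy described in this abstract can be illustrated with a minimal sketch. It assumes an identity metric, a discrete-time system, and a unit impulse at one node; the function names and the example matrices are hypothetical and are not taken from the paper. With a constant Jacobian `A`, the accumulated energy equals the trace of the finite-horizon controllability Gramian for that node, $\sum_k \lVert A^k e_i \rVert^2$, which is the linear, time-invariant limit mentioned above.

```python
def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def impulse_energy(jacobians, node, n):
    """Energy of a unit impulse at `node`, propagated through the given
    sequence of Jacobians (one per time step). Identity metric assumed."""
    v = [1.0 if j == node else 0.0 for j in range(n)]
    energy = sum(x * x for x in v)          # k = 0 term
    for jac in jacobians:
        v = matvec(jac, v)                  # push the perturbation one step forward
        energy += sum(x * x for x in v)
    return energy

# Hypothetical 2-node system: node 0 drives node 1.
A = [[0.5, 0.0], [1.0, 0.5]]
# Same system with the sign of the coupling reversed (a second "regime").
B = [[0.5, 0.0], [-1.0, 0.5]]

lti_node0 = impulse_energy([A, A], node=0, n=2)       # constant dynamics
lti_node1 = impulse_energy([A, A], node=1, n=2)
switched_node0 = impulse_energy([A, B], node=0, n=2)  # regime switch mid-horizon
print(lti_node0, lti_node1, switched_node0)
```

In this toy case the regime switch lowers node 0's accumulated energy relative to the constant-dynamics value, which is the kind of gap a static Gramian computed from `A` alone would not show.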
Conformal Risk Sharing: Certified Cost Allocation with Participation Guarantees
Sharing the financial impact of rare adverse events across a group can soften extreme individual burdens, but any participant made worse off by the arrangement has reason to leave. A credible mechanism must therefore provide each agent with a trustworthy cap on their future obligation and should be deployed only if the aggregate harm across participants is bounded. We formalise this as the Certified Allocation Problem: from finite data and without distributional assumptions, find a redistribution rule, produce obligation caps for every participant, and verify that no participant is made materially worse off. We propose Conformal Risk Sharing, which solves this problem by pairing an interpretable sharing policy with split conformal calibration. The sharing intensity is tuned on training data, while held-out calibration data produces distribution-free per-agent guarantees (valid under exchangeability). Experiments on synthetic and real-world data, including precipitation and energy-cooperative data, confirm that the framework can substantially reduce extreme obligations for high-risk agents while controlling harm to others.
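The split conformal calibration step can be sketched as follows. This is a generic split-conformal quantile, not the paper's procedure: the function name `conformal_cap`, the use of raw held-out obligations as scores, and the numbers are all assumptions made for illustration. Under exchangeability, a new obligation falls at or below the returned cap with probability at least $1-\alpha$.

```python
import math

def conformal_cap(calibration_scores, alpha):
    """Split-conformal upper bound: the ceil((n + 1) * (1 - alpha))-th
    smallest calibration score, or infinity if n is too small."""
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # not enough calibration data for this level
    return sorted(calibration_scores)[k - 1]

# Hypothetical held-out obligations for one agent under a fixed sharing rule.
calibration_obligations = [3.0, 1.0, 4.0, 2.0, 7.0, 5.0, 6.0]
cap = conformal_cap(calibration_obligations, alpha=0.25)
print(cap)
```

The finite-sample correction `(n + 1)` is what makes the bound valid without distributional assumptions; with very few calibration points the only valid cap is infinite.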
Mamba-Assisted Non-Markovian Closure for Reduced-Order Modeling
Wei, Zhi-Feng, Qadeer, Saad, Stinis, Panos
Reduced-order modeling of high-dimensional dynamical systems is often hindered by the non-Markovian closure term that represents the effect of unresolved variables on the resolved dynamics. Inspired by the Mori--Zwanzig formalism, in which the closure takes the form of a memory functional of the resolved trajectory, we recast closure modeling as a sequence modeling problem and propose the Mamba-Assisted Closure (MAC) framework: a Mamba-based sequence model, trained to predict the closure from the resolved trajectory, is coupled with the reduced-order governing equations through a numerical integrator to advance the resolved variables in time. A key feature of the framework is its exploitation of the dual representation of state-space models -- the model is trained in a sequence-to-sequence fashion via the convolutional form, and deployed for step-by-step autoregressive rollout via the recurrent form, yielding both efficient long-trajectory training and constant per-step inference cost. On the viscous Burgers' equation and the chaotic two-scale Lorenz '96 system, the MAC model substantially outperforms the Markovian reduced-order model, the GRU-based sequence model, and the Wilks method in predictive accuracy and long-time rollout stability.
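The dual representation referred to above can be checked on a toy example. The sketch below uses a scalar linear state-space layer with fixed parameters `a`, `b`, `c`, which is a deliberate simplification and not the MAC architecture; all names and values are hypothetical. The recurrent form updates one hidden state per step, while the convolutional form applies the kernel $K_k = c\,a^k b$ to the whole input sequence at once.

```python
def recurrent_form(a, b, c, inputs):
    """Step-by-step rollout: h_t = a * h_{t-1} + b * u_t, y_t = c * h_t."""
    h, outputs = 0.0, []
    for u in inputs:
        h = a * h + b * u
        outputs.append(c * h)
    return outputs

def convolutional_form(a, b, c, inputs):
    """Whole-sequence evaluation with the kernel K_k = c * a**k * b."""
    kernel = [c * (a ** k) * b for k in range(len(inputs))]
    return [
        sum(kernel[k] * inputs[t - k] for k in range(t + 1))
        for t in range(len(inputs))
    ]

u = [1.0, 0.0, 2.0, 1.0]
y_rec = recurrent_form(0.5, 1.0, 2.0, u)
y_conv = convolutional_form(0.5, 1.0, 2.0, u)
print(y_rec, y_conv)
```

Both functions return the same sequence. The recurrent form needs only the current state at each step, which is what gives constant per-step cost during autoregressive rollout.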
TabSODA: Tabular Diffusion based Imputation with Skip Pattern Detection and Ordinal Awareness
Chen, Yuyu, Kim, Taehyo, Shu, Hai, Feng, Yang
Missing data imputation in large-scale surveys faces two challenges that are not well handled by current tabular diffusion methods. First, \emph{structural skips}, cells made inapplicable by questionnaire design, should not be imputed but are often conflated with item nonresponse. Second, \emph{ordinal} responses encode ordered categories, yet most pipelines treat them as nominal levels through one-hot or analog-bit encodings. We introduce \textbf{TabSODA} (\textbf{Tab}ular diffusion with \textbf{S}kip pattern detection and \textbf{O}r\textbf{d}inal \textbf{A}wareness), an Expectation-Maximization (EM)-based diffusion imputer built on the Elucidated Diffusion Model (EDM) framework. TabSODA propagates structural skips through the denoising loss and reverse-time sampler, and represents ordinal variables with cumulative-probit scalar latents while retaining analog-bit encodings for nominal variables. When a codebook skip mask is available, TabSODA uses it directly; otherwise, the TabSODA+SKIP variant estimates the mask from raw responses and questionnaire order using a CART-based skip-pattern miner. On the Population Assessment of Tobacco and Health (PATH) study and the National Survey on Drug Use and Health (NSDUH), two nationally representative U.S.\ surveys, TabSODA reduces ordinal MACE by up to $23.7\%$ and improves categorical accuracy by up to $9\%$ over the strongest baseline across MCAR, MAR, and MNAR masking. The skip miner achieves near-perfect precision on both datasets, allowing TabSODA+SKIP to closely track the codebook-mask variant.
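A cumulative-probit representation of an ordinal variable can be sketched as below. This is the textbook ordered-probit link, $P(y \le k) = \Phi(\tau_k - z)$, shown under the assumption that a single scalar latent `z` and fixed thresholds are used; the threshold values and function names are hypothetical, and how TabSODA learns or uses them is not specified here.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ordinal_probs(latent, thresholds):
    """Category probabilities from P(y <= k) = Phi(threshold_k - latent).
    `thresholds` must be increasing; there are len(thresholds) + 1 categories."""
    cumulative = [norm_cdf(t - latent) for t in thresholds] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs

thresholds = [-1.0, 1.0]            # hypothetical cut points for 3 ordered levels
p_mid = ordinal_probs(0.0, thresholds)
p_high = ordinal_probs(2.0, thresholds)
print(p_mid, p_high)
```

Because one scalar moves probability mass monotonically from lower to higher categories, the ordering of the levels is built into the encoding, unlike one-hot or analog-bit codes.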
Optimally taming biases in black-box models for efficient semiparametric estimation
Gu, Yihong, Yin, Qishuo, Cai, Tianxi, Fan, Jianqing
Modern semiparametric estimation often relies on flexible black-box machine learning methods to estimate nuisance functions, raising a fundamental question: how do nuisance estimation errors propagate into inference for low-dimensional target parameters? The dominant paradigm, exemplified by double machine learning (DML), yields error bounds in which nuisance estimation errors enter multiplicatively. Although this paradigm is widely adopted, it remains unclear whether the multiplicative-rate dependence is optimal for black-box models. In this paper, we start by revisiting the partial linear model $Y = \mu_0(X)+T\cdot\beta_0+\varepsilon$ under a structure-agnostic setting, where the nuisance function $\mu_0$ is estimated using a generic machine learning model, with approximation error $\delta^a_\mu$ and stochastic error $\delta_\mu^s$. We show that the standard DML rate is not optimal in the regime where the auxiliary function $\mathbb{E}[T|X=x]$ cannot be consistently estimated. We propose a new estimator for $\beta_0$ that achieves a sharper rate of $n^{-1/2}+\delta^a_\mu+(\delta_\mu^s)^2$ and establish a matching lower bound demonstrating its optimality. Our results reveal a new principle: the first-order stochastic error of nuisance estimation can be eliminated without imposing any additional assumptions. This also leads to a revised tuning strategy favoring under-smoothing, where $\delta^a_\mu\asymp(\delta_\mu^s)^2$, rather than the classical bias-variance trade-off $\delta^a_\mu\asymp \delta_\mu^s$. Under mild additional conditions, the estimator is asymptotically normal with minimal asymptotic variance. The proposed method extends to a broad class of semi-parametric linear functional estimation problems, including average treatment effect estimation. Our results imply that popular orthogonal score methods in semiparametric estimation with black-box nuisance learners can be substantially improved.
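For orientation, the multiplicative dependence can be read off the usual cross-fitted DML estimator for the partial linear model. The notation $m_0(x)=\mathbb{E}[T|X=x]$, $\hat m$, $\hat\mu$, $\delta_m$, and $\delta_\mu$ (the $L_2$ errors of $\hat m$ and $\hat\mu$) is introduced here for illustration and does not appear in the abstract; the bound is the commonly stated form of the DML rate, given as a sketch rather than as the paper's own statement.

```latex
\hat\beta_{\mathrm{DML}}
  = \frac{\sum_{i=1}^{n}\bigl(T_i-\hat m(X_i)\bigr)\bigl(Y_i-\hat\mu(X_i)\bigr)}
         {\sum_{i=1}^{n}\bigl(T_i-\hat m(X_i)\bigr)\,T_i},
\qquad
\bigl|\hat\beta_{\mathrm{DML}}-\beta_0\bigr|
  \;\lesssim\; n^{-1/2} + \delta_\mu\,\delta_m .
```

When $\hat m$ is not consistent, $\delta_m$ does not vanish and the product term is of the same order as $\delta_\mu$ itself. That is the regime in which the abstract's rate $n^{-1/2}+\delta^a_\mu+(\delta_\mu^s)^2$ is sharper, since the stochastic part of the nuisance error enters squared.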
DiffSlack: Learning under Nonlinear Inequality Constraints via Learnable Slack Variables
Wang, Ziqian, Fang, Chenxi, Zhang, Zhen
Enforcing nonlinear inequality constraints in neural networks remains challenging, especially when the output is subject to many coupled constraints. Existing hard constraint methods often impose structural restrictions on the constraint set or introduce substantial computational overhead for large-scale nonlinear problems. Here, we propose DiffSlack, a differentiable projection layer for nonlinear inequality-constrained neural prediction. DiffSlack reformulates inequalities as equalities with learnable slack variables, which are predicted as part of the augmented network output and provide a data-driven warm start for damped Gauss-Newton projection. The projection layer maps raw predictions onto the augmented feasible manifold while preserving end-to-end differentiability. A two-stage curriculum further stabilizes training and improves constraint satisfaction. We evaluate DiffSlack on vehicle path planning with 200 nonlinear inequality constraints from collision avoidance, curvature limits, and waypoint spacing. Compared with existing learning-based baselines, DiffSlack achieves a higher planning success rate and stronger geometric constraint satisfaction under a comparable inference budget. Ablation studies further show that the hard projection layer reduces sensitivity to supervision quality. Closed-loop tracking in CARLA and real-world vehicle experiments confirms the executability of the generated trajectories. These results demonstrate that DiffSlack provides a practical and scalable approach to embedding hard inequality constraints into neural networks for engineering applications.
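The slack reformulation and Gauss-Newton correction can be sketched for a single constraint. The sketch assumes a squared slack, so that $g(x)\le 0$ becomes $h(x,s)=g(x)+s^2=0$; the abstract does not state which slack parameterization DiffSlack uses, and the function names, damping value, and unit-disk constraint are hypothetical. Each iteration takes a damped minimum-norm Gauss-Newton step on $h$ in the augmented variables $(x, s)$.

```python
def project_with_slack(x, s, g, grad_g, damping=0.5, steps=100):
    """Damped Gauss-Newton iterations on h(x, s) = g(x) + s**2 = 0
    for one scalar constraint; returns the corrected (x, s)."""
    x = list(x)
    for _ in range(steps):
        h = g(x) + s * s
        jac = grad_g(x) + [2.0 * s]        # gradient of h w.r.t. (x, s)
        jjt = sum(j * j for j in jac)
        if jjt == 0.0:                     # degenerate Jacobian: stop
            break
        scale = damping * h / jjt          # minimum-norm step length
        x = [xi - scale * ji for xi, ji in zip(x, jac[:-1])]
        s = s - scale * jac[-1]
    return x, s

def unit_disk(x):
    """Constraint g(x) <= 0 means x lies inside the unit disk."""
    return x[0] ** 2 + x[1] ** 2 - 1.0

def unit_disk_grad(x):
    return [2.0 * x[0], 2.0 * x[1]]

# Raw (infeasible) prediction with a predicted slack used as a warm start.
x_proj, s_proj = project_with_slack([2.0, 0.0], 0.5, unit_disk, unit_disk_grad)
print(x_proj, s_proj, unit_disk(x_proj))
```

Once $h(x,s)=0$ holds, $g(x)=-s^2\le 0$, so the inequality is satisfied by construction. The operations are all differentiable, which is what allows such a layer to sit inside end-to-end training.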
HyFAD: Hybrid Time-Frequency Diffusion with Frequency-Aware Embedding for Time Series Imputation
Gao, Hongfan, Shen, Wangmeng, Yang, Bin, Hu, Jilin
Diffusion models have demonstrated strong performance in time series modeling due to their ability to progressively capture complex data distributions through iterative denoising. However, existing approaches struggle with frequency-sensitive denoising, high-frequency reconstruction, and balancing global trends with local dynamics. To address these limitations, we propose \textbf{HyFAD}, a \textbf{Hy}brid time-frequency \textbf{D}iffusion model with \textbf{F}requency-\textbf{A}ware embedding for time series imputation. Built upon the DDPM paradigm, HyFAD adopts a coupled time-frequency diffusion framework, in which the reverse denoising proceeds sequentially from the time domain to the frequency domain, enabling coarse-to-fine generation. Specifically, the time-domain diffusion process captures low-frequency global trends, while the frequency-domain diffusion process refines high-frequency spectral components. We further introduce a frequency-aware step embedding that exploits the relationship between diffusion steps and spectral components, providing step-dependent spectral guidance and facilitating more accurate band-wise reconstruction. Extensive experiments on multiple benchmark datasets demonstrate that HyFAD achieves state-of-the-art performance. Our source code is available at https://github.com/hongfangao/HyFAD.