A Debiased MDI Feature Importance Measure for Random Forests

Neural Information Processing Systems

In particular, interpreting Random Forests (RFs) [2] and their variants [14, 28, 27, 29, 1, 12] has become an important area of research due to the wide-ranging applications of RFs in various scientific areas, such as genome-wide association studies (GWAS) [7], gene expression microarrays [13, 23], and gene regulatory networks [9].



Principled Federated Random Forests for Heterogeneous Data

arXiv.org Machine Learning

Random Forests (RF) are among the most powerful and widely used predictive models for centralized tabular data, yet few methods exist to adapt them to the federated learning setting. Unlike most federated learning approaches, the piecewise-constant nature of RF prevents exact gradient-based optimization. As a result, existing federated RF implementations rely on unprincipled heuristics: for instance, aggregating decision trees trained independently on clients fails to optimize the global impurity criterion, even under simple distribution shifts. We propose FedForest, a new federated RF algorithm for horizontally partitioned data that naturally accommodates diverse forms of client data heterogeneity, from covariate shift to more complex outcome shift mechanisms. We prove that our splitting procedure, based on aggregating carefully chosen client statistics, closely approximates the split selected by a centralized algorithm. Moreover, FedForest allows splits on client indicators, enabling a non-parametric form of personalization that is absent from prior federated random forest methods. Empirically, we demonstrate that the resulting federated forests closely match centralized performance across heterogeneous benchmarks while remaining communication-efficient.
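
The idea of choosing a split from aggregated client statistics can be illustrated with a minimal sketch. This is not the FedForest procedure itself, which the abstract does not specify. It assumes a regression tree with squared-error impurity, a single feature, and a list of candidate thresholds shared by all clients. The function names (`client_stats`, `best_split`) are hypothetical. Each client sends only counts, sums, and sums of squares, and the server adds them up before scoring each threshold.

```python
import numpy as np

def client_stats(x, y, thresholds):
    """Sufficient statistics computed locally by one client (hypothetical helper).

    For every candidate threshold t, return the count, sum, and sum of squares
    of the targets y that fall on the left side (x <= t), plus the client totals.
    """
    left = x[None, :] <= thresholds[:, None]            # shape (T, n)
    n_left = left.sum(axis=1)
    s_left = (left * y[None, :]).sum(axis=1)
    q_left = (left * (y ** 2)[None, :]).sum(axis=1)
    totals = np.array([len(y), y.sum(), (y ** 2).sum()], dtype=float)
    return np.stack([n_left, s_left, q_left]).astype(float), totals

def sse(n, s, q):
    """Sum of squared errors around the mean, from count, sum, sum of squares."""
    return np.where(n > 0, q - s ** 2 / np.maximum(n, 1), 0.0)

def best_split(left_stats, total_stats, thresholds):
    """Server side: pick the threshold minimising the total squared error."""
    n_l, s_l, q_l = left_stats
    n, s, q = total_stats
    n_r, s_r, q_r = n - n_l, s - s_l, q - q_l
    cost = sse(n_l, s_l, q_l) + sse(n_r, s_r, q_r)
    cost = np.where((n_l > 0) & (n_r > 0), cost, np.inf)  # forbid empty children
    return thresholds[int(np.argmin(cost))]
```

Because these statistics are additive, summing the per-client arrays gives the same split as pooling the raw data on the same threshold grid. Handling heterogeneity and splits on client indicators, as the abstract describes, would need more than this sketch shows.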


Empowering Decision Trees via Shape Function Branching

arXiv.org Artificial Intelligence

Decision trees are prized for their interpretability and strong performance on tabular data. Yet, their reliance on simple axis-aligned linear splits often forces deep, complex structures to capture non-linear feature effects, undermining human comprehension of the constructed tree. To address this limitation, we propose a novel generalization of a decision tree, the Shape Generalized Tree (SGT), in which each internal node applies a learnable axis-aligned shape function to a single feature, enabling rich, non-linear partitioning in one split. As users can easily visualize each node's shape function, SGTs are inherently interpretable and provide intuitive, visual explanations of the model's decision mechanisms. To learn SGTs from data, we propose ShapeCART, an efficient induction algorithm for SGTs. We further extend the SGT framework to bivariate shape functions (S$^2$GT) and multi-way trees (SGT$_K$), and present Shape$^2$CART and ShapeCART$_K$, extensions to ShapeCART for learning S$^2$GTs and SGT$_K$s, respectively. Experiments on various datasets show that SGTs achieve superior performance with reduced model size compared to traditional axis-aligned linear trees.
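
A toy example may help show why a shape function at a node can replace several threshold splits. The sketch below is not ShapeCART. It assumes binary labels, a single feature, and a piecewise-constant shape function over fixed bin edges, and all function names are hypothetical. Each bin is routed left or right according to its majority class, so one node can isolate an interval in the middle of the feature range.

```python
import numpy as np

def gini(y):
    """Gini impurity of a binary 0/1 label array."""
    if len(y) == 0:
        return 0.0
    p = float(np.mean(y))
    return 2.0 * p * (1.0 - p)

def weighted_gini(y, go_left):
    """Size-weighted impurity of the two children of a split."""
    n = len(y)
    n_left = int(go_left.sum())
    return (n_left * gini(y[go_left]) + (n - n_left) * gini(y[~go_left])) / n

def best_threshold_split(x, y):
    """Lowest impurity reachable by a single axis-aligned threshold x <= t."""
    xs = np.unique(x)
    candidates = (xs[:-1] + xs[1:]) / 2.0
    return min(weighted_gini(y, x <= t) for t in candidates)

def binned_shape_split(x, y, edges):
    """One split using a piecewise-constant shape function over the given bins.

    Every bin is sent left if its majority label is 0, otherwise right.
    """
    bins = np.digitize(x, edges)
    go_left = np.zeros(len(x), dtype=bool)
    for b in np.unique(bins):
        in_bin = bins == b
        go_left[in_bin] = y[in_bin].mean() < 0.5
    return weighted_gini(y, go_left)
```

With the feature values 0 through 8 and positive labels only for values 3 to 5, no single threshold separates the classes, while the binned shape split with edges at 2.5 and 5.5 does so in one node.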


ACT: Agentic Classification Tree

arXiv.org Artificial Intelligence

When used in high-stakes settings, AI systems are expected to produce decisions that are transparent, interpretable, and auditable, a requirement increasingly expected by regulations. Decision trees such as CART provide clear and verifiable rules, but they are restricted to structured tabular data and cannot operate directly on unstructured inputs such as text. In practice, large language models (LLMs) are widely used for such data, yet prompting strategies such as chain-of-thought or prompt optimization still rely on free-form reasoning, limiting their ability to ensure trustworthy behaviors. We present the Agentic Classification Tree (ACT), which extends decision-tree methodology to unstructured inputs by formulating each split as a natural-language question, refined through impurity-based evaluation and LLM feedback via TextGrad. Experiments on text benchmarks show that ACT matches or surpasses prompting-based baselines while producing transparent and interpretable decision paths.
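
The impurity-based evaluation of a question-style split can be sketched as follows. This is only an illustration under stated assumptions, not the ACT implementation. The LLM that answers each question is replaced by hypothetical yes/no callables, the TextGrad refinement step is omitted, and the names `split_impurity` and `pick_question` are invented for this example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(texts, labels, answer_fn):
    """Weighted Gini impurity after routing each text by a yes/no answer.

    answer_fn stands in for an LLM answering a natural-language question.
    """
    answers = [bool(answer_fn(t)) for t in texts]
    yes = [lab for a, lab in zip(answers, labels) if a]
    no = [lab for a, lab in zip(answers, labels) if not a]
    return (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)

def pick_question(texts, labels, candidates):
    """Return the candidate question whose split has the lowest impurity.

    candidates maps each question string to its (stubbed) answer function.
    """
    return min(candidates, key=lambda q: split_impurity(texts, labels, candidates[q]))
```

In a full system the selected question would become the node's split and the procedure would recurse on the two child subsets. Only the scoring step is shown here.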



comments, we organize our responses as follows

Neural Information Processing Systems

We thank the reviewers for their valuable feedback that will significantly improve our paper. This is indeed a limitation of Theorem 1. The CHIP data included in our simulation studies shows that MDI-oob works in this setting. We plan to add this plot in our supplementary material. Reviewers 2 and 3: Give theoretical/empirical evidence that MDI-oob can "debias" MDI. Empirically, we compute the MDI-oob for the first simulation.



Human-AI Synergy in Adaptive Active Learning for Continuous Lithium Carbonate Crystallization Optimization

arXiv.org Artificial Intelligence

As demand for high-purity lithium surges with the growth of the electric vehicle (EV) industry, cost-effective extraction from lower-grade North American sources like the Smackover Formation is critical. These resources, unlike high-purity South American brines, require innovative purification techniques to be economically viable. Continuous crystallization is a promising method for producing battery-grade lithium carbonate, but its optimization is challenged by a complex parameter space and limited data. This study introduces a Human-in-the-Loop (HITL) assisted active learning framework to optimize the continuous crystallization of lithium carbonate. By integrating human expertise with data-driven insights, our approach accelerates the optimization of lithium extraction from challenging sources. Our results demonstrate the framework's ability to rapidly adapt to new data, significantly improving the process's tolerance to critical impurities like magnesium from the industry standard of a few hundred ppm to as high as 6000 ppm. This breakthrough makes the exploitation of low-grade, impurity-rich lithium resources feasible, potentially reducing the need for extensive pre-refinement processes. By leveraging artificial intelligence, we have refined operational parameters and demonstrated that lower-grade materials can be used without sacrificing product quality. This advancement is a significant step towards economically harnessing North America's vast lithium reserves, such as those in the Smackover Formation, and enhancing the sustainability of the global lithium supply chain.


Modeling Data Diversity for Joint Instance and Verbalizer Selection in Cold-Start Scenarios

arXiv.org Artificial Intelligence

Prompt-based methods leverage the knowledge of pre-trained language models (PLMs) trained with a masked language modeling (MLM) objective; however, these methods are sensitive to template, verbalizer, and few-shot instance selection, particularly in cold-start settings with no labeled data. Existing studies overlook the dependency between instances and verbalizers, where instance-label probabilities depend on verbalizer token proximity in the embedding space. To address this, we propose COLDSELECT, a joint verbalizer and instance selection approach that models data diversity. COLDSELECT maps PLM vocabulary and $h_{[MASK]}$ embeddings into a shared space, applying dimensionality reduction and clustering to ensure efficient and diverse selection. By optimizing for minimal uncertainty and maximal diversity, COLDSELECT captures data relationships effectively. Experiments on eight benchmarks demonstrate COLDSELECT's superiority in reducing uncertainty and enhancing generalization, outperforming baselines in verbalizer and few-shot instance selection for cold-start scenarios.