selection process
Overcoming Selection Bias in Statistical Studies With Amortized Bayesian Inference
Arruda, Jonas, Chervet, Sophie, Staudt, Paula, Wieser, Andreas, Hoelscher, Michael, Sermet-Gaudelus, Isabelle, Binder, Nadine, Opatowski, Lulla, Hasenauer, Jan
Selection bias arises when the probability that an observation enters a dataset depends on variables related to the quantities of interest, leading to systematic distortions in estimation and uncertainty quantification. For example, in epidemiological or survey settings, individuals with certain outcomes may be more likely to be included, resulting in biased prevalence estimates with potentially substantial downstream impact. Classical corrections, such as inverse-probability weighting or explicit likelihood-based models of the selection process, rely on tractable likelihoods, which limits their applicability in complex stochastic models with latent dynamics or high-dimensional structure. Simulation-based inference enables Bayesian analysis without tractable likelihoods but typically assumes missingness at random and thus fails when selection depends on unobserved outcomes or covariates. Here, we develop a bias-aware simulation-based inference framework that explicitly incorporates selection into neural posterior estimation. By embedding the selection mechanism directly into the generative simulator, the approach enables amortized Bayesian inference without requiring tractable likelihoods. This recasting of selection bias as part of the simulation process allows us to both obtain debiased estimates and explicitly test for the presence of bias. The framework integrates diagnostics to detect discrepancies between simulated and observed data and to assess posterior calibration. The method recovers well-calibrated posterior distributions across three statistical applications with diverse selection mechanisms, including settings in which likelihood-based approaches yield biased estimates. These results recast the correction of selection bias as a simulation problem and establish simulation-based inference as a practical and testable strategy for parameter estimation under selection bias.
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.05)
- Europe > Germany > Baden-Württemberg > Freiburg (0.04)
- Europe > France > Île-de-France > Paris > Paris (0.04)
- (6 more...)
- Information Technology > Enterprise Applications > Customer Relationship Management (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.85)
Hypothesis on the Functional Advantages of the Selection-Broadcast Cycle Structure: Global Workspace Theory and Dealing with a Real-Time World
Nakanishi, Junya, Baba, Jun, Yoshikawa, Yuichiro, Kamide, Hiroko, Ishiguro, Hiroshi
This paper discusses the functional advantages of the Selection-Broadcast Cycle structure proposed by Global Workspace Theory (GWT), inspired by human consciousness, particularly focusing on its applicability to artificial intelligence and robotics in dynamic, real-time scenarios. While previous studies often examined the Selection and Broadcast processes independently, this research emphasizes their combined cyclic structure and the resulting benefits for real-time cognitive systems. Specifically, the paper identifies three primary benefits: Dynamic Thinking Adaptation, Experience-Based Adaptation, and Immediate Real-Time Adaptation. This work highlights GWT's potential as a cognitive architecture suitable for sophisticated decision-making and adaptive performance in unsupervised, dynamic environments. It suggests new directions for the development and implementation of robust, general-purpose AI and robotics systems capable of managing complex, real-world tasks.
Improved seeding strategies for k-means and k-GMM
Carrière, Guillaume, Cazals, Frédéric
We revisit the randomized seeding techniques for k-means clustering and k-GMM (Gaussian Mixture model fitting with Expectation-Maximization), formalizing their three key ingredients: the metric used for seed sampling, the number of candidate seeds, and the metric used for seed selection. This analysis yields novel families of initialization methods exploiting a lookahead principle--conditioning the seed selection to an enhanced coherence with the final metric used to assess the algorithm, and a multipass strategy to tame down the effect of randomization. Experiments show a consistent constant factor improvement over classical contenders in terms of the final metric (SSE for k-means, log-likelihood for k-GMM), at a modest overhead. In particular, for k-means, our methods improve on the recently designed multi-swap strategy, which was the first one to outperform the greedy k-means++ seeding. Our experimental analysis also shed light on subtle properties of k-means often overlooked, including the (lack of) correlations between the SSE upon seeding and the final SSE, the variance reduction phenomena observed in iterative seeding methods, and the sensitivity of the final SSE to the pool size for greedy methods. Practically, our most effective seeding methods are strong candidates to become one of the--if not the--standard techniques. From a theoretical perspective, our formalization of seeding opens the door to a new line of analytical approaches.
- Europe > France > Provence-Alpes-Côte d'Azur (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Europe > United Kingdom > Wales (0.04)
- Asia (0.04)
Sequential Cohort Selection
Nana, Hortence Phalonne, Dimitrakakis, Christos
We study the problem of fair cohort selection from an unknown population, with a focus on university admissions. We start with the one-shot setting, where the admission policy must be fixed in advance and remain transparent, before observing the actual applicant pool. In contrast, the sequential setting allows the policy to be updated across stages as new applicant data becomes available. This is achieved by optimizing admission policies using a population model, trained on data from previous admission cycles. We also study the fairness properties of the resulting policies in the one-shot setting, including meritocracy and group parity.
Enhancing Selection of Climate Tech Startups with AI -- A Case Study on Integrating Human and AI Evaluations in the ClimaTech Great Global Innovation Challenge
Turliuk, Jennifer, Sevilla, Alejandro, Gorza, Daniela, Hynes, Tod
This case study examines the ClimaTech Great Global Innovation Challenge's approach to selecting climate tech startups by integrating human and AI evaluations. The competition aimed to identify top startups and enhance the accuracy and efficiency of the selection process through a hybrid model. Research shows data-driven approaches help VC firms reduce bias and improve decision-making. Machine learning models have outperformed human investors in deal screening, helping identify high-potential startups. Incorporating AI aimed to ensure more equitable and objective evaluations. The methodology included three phases: initial AI review, semi-finals judged by humans, and finals using a hybrid weighting. In phase one, 57 applications were scored by an AI tool built with StackAI and OpenAI's GPT-4o, and the top 36 advanced. In the semi-finals, human judges, unaware of AI scores, evaluated startups on team quality, market potential, and technological innovation. Each score - human or AI - was weighted equally, resulting in 75 percent human and 25 percent AI influence. In the finals, with five human judges, weighting shifted to 83.3 percent human and 16.7 percent AI. There was a moderate positive correlation between AI and human scores - Spearman's = 0.47 - indicating general alignment with key differences. Notably, the final four startups, selected mainly by humans, were among those rated highest by the AI. This highlights the complementary nature of AI and human judgment. The study shows that hybrid models can streamline and improve startup assessments. The ClimaTech approach offers a strong framework for future competitions by combining human expertise with AI capabilities.
- Banking & Finance > Trading (0.48)
- Banking & Finance > Capital Markets (0.34)
- Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.62)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.55)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.55)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)
C2RUST-BENCH: A Minimized, Representative Dataset for C-to-Rust Transpilation Evaluation
Sirlanci, Melih, Yagemann, Carter, Lin, Zhiqiang
Despite the effort in vulnerability detection over the last two decades, memory safety vulnerabilities continue to be a critical problem. Recent reports suggest that the key solution is to migrate to memory-safe languages. To this end, C-to-Rust transpilation becomes popular to resolve memory-safety issues in C programs. Recent works propose C-to-Rust transpilation frameworks; however, a comprehensive evaluation dataset is missing. Although one solution is to put together a large enough dataset, this increases the analysis time in automated frameworks as well as in manual efforts for some cases. In this work, we build a method to select functions from a large set to construct a minimized yet representative dataset to evaluate the C-to-Rust transpilation. We propose C2RUST-BENCH that contains 2,905 functions, which are representative of C-to-Rust transpilation, selected from 15,503 functions of real-world programs.
- Information Technology > Software (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.54)
A Survey for What Developers Require in AI-powered Tools that Aid in Component Selection in CBSD
Ansari, Mahdi Jaberzadeh, Barcomb, Ann
Although it has been more than four decades that the first components-based software development (CBSD) studies were conducted, there is still no standard method or tool for component selection which is widely accepted by the industry. The gulf between industry and academia contributes to the lack of an accepted tool. We conducted a mixed methods survey of nearly 100 people engaged in component-based software engineering practice or research to better understand the problems facing industry, how these needs could be addressed, and current best practices employed in component selection. We also sought to identify and prioritize quality criteria for component selection from an industry perspective. In response to the call for CBSD component selection tools to incorporate recent technical advances, we also explored the perceptions of professionals about AI-driven tools, present and envisioned.
- North America > United States (1.00)
- Europe (1.00)
- Asia (1.00)
- North America > Canada > Alberta (0.28)
- Research Report > New Finding (1.00)
- Questionnaire & Opinion Survey (1.00)
- Research Report > Experimental Study (0.93)
Applicability of the Minimal Dominating Set for Influence Maximisation in Multilayer Networks
Czuba, Michał, Jia, Mingshan, Bródka, Piotr, Musial, Katarzyna
The minimal dominating set (MDS) is a well-established concept in network controllability and has been successfully applied in various domains, including sensor placement, network resilience, and epidemic containment. In this study, we adapt the local-improvement MDS routine and explore its potential for enhancing seed selection for influence maximisation in multilayer networks (MLN). We employ the Linear Threshold Model (LTM), which offers an intuitive representation of influence spread or opinion dynamics by accounting for peer influence accumulation. To ensure interpretability, we utilise rank-refining seed selection methods, with the results further filtered with MDS. Our findings reveal that incorporating MDS into the seed selection process improves spread only within a specific range of situations. Notably, the improvement is observed for larger seed set budgets, lower activation thresholds, and when an "AND" strategy is used to aggregate influence across network layers. This scenario reflects situations where an individual does not require the majority of their acquaintances to hold a target opinion, but must be influenced across all social circles.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Europe > Poland > Lower Silesia Province > Wroclaw (0.04)
- Europe > United Kingdom > England > Dorset > Bournemouth (0.04)
- (5 more...)
PAIR: A Novel Large Language Model-Guided Selection Strategy for Evolutionary Algorithms
Ali, Shady, Ashraf, Mahmoud, Hegazy, Seif, Salem, Fatty, Mokhtar, Hoda, Gaber, Mohamed Medhat, Alrefaie, Mohamed Taher
Evolutionary Algorithms (EAs) employ random or simplistic selection methods, limiting their exploration of solution spaces and convergence to optimal solutions. The randomness in performing crossover or mutations may limit the model's ability to evolve efficiently. This paper introduces Preference-Aligned Individual Reciprocity (PAIR), a novel selection approach leveraging Large Language Models to emulate human-like mate selection, thereby introducing intelligence to the pairing process in EAs. PAIR prompts an LLM to evaluate individuals within a population based on genetic diversity, fitness level, and crossover compatibility, guiding more informed pairing decisions. We evaluated PAIR against a baseline method called LLM-driven EA (LMEA), published recently. Results indicate that PAIR significantly outperforms LMEA across various TSP instances, achieving lower optimality gaps and improved convergence. This performance is especially noticeable when combined with the flash thinking model, demonstrating increased population diversity to escape local optima. In general, PAIR provides a new strategy in the area of in-context learning for LLM-driven selection in EAs via sophisticated preference modelling, paving the way for improved solutions and further studies into LLM-guided optimization.
- Europe > Middle East > Cyprus > Nicosia > Nicosia (0.04)
- Asia > Japan > Honshū > Tōhoku > Iwate Prefecture > Morioka (0.04)
- Asia > Japan > Honshū > Chūbu > Toyama Prefecture > Toyama (0.04)
- Africa > Middle East > Egypt (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)
Add-One-In: Incremental Sample Selection for Large Language Models via a Choice-Based Greedy Paradigm
Li, Zhuo, Du, Yuhao, Jiao, Xiaoqi, Guo, Yiwen, Feng, Yuege, Wan, Xiang, Gao, Anningzhe, Hu, Jinpeng
Selecting high-quality and diverse training samples from extensive datasets plays a crucial role in reducing training overhead and enhancing the performance of Large Language Models (LLMs). However, existing studies fall short in assessing the overall value of selected data, focusing primarily on individual quality, and struggle to strike an effective balance between ensuring diversity and minimizing data point traversals. Therefore, this paper introduces a novel choice-based sample selection framework that shifts the focus from evaluating individual sample quality to comparing the contribution value of different samples when incorporated into the subset. Thanks to the advanced language understanding capabilities of LLMs, we utilize LLMs to evaluate the value of each option during the selection process. Furthermore, we design a greedy sampling process where samples are incrementally added to the subset, thereby improving efficiency by eliminating the need for exhaustive traversal of the entire dataset with the limited budget. Extensive experiments demonstrate that selected data from our method not only surpass the performance of the full dataset but also achieves competitive results with state-of-the-art (SOTA) studies, while requiring fewer selections. Moreover, we validate our approach on a larger medical dataset, highlighting its practical applicability in real-world applications.
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- Asia > China > Hong Kong (0.04)
- Asia > China > Anhui Province > Hefei (0.04)