categorical value
Improving Representation Learning of Complex Critical Care Data with ICU-BERT
Santos, Ricardo, Carreiro, André V., Peng, Xi, Gamboa, Hugo, Fröhlich, Holger
The multivariate, asynchronous nature of real-world clinical data, such as that generated in Intensive Care Units (ICUs), challenges traditional AI-based decision-support systems. These often assume data regularity and feature independence and frequently rely on limited data scopes and manual feature engineering. The potential of generative AI technologies has not yet been fully exploited to analyze clinical data. We introduce ICU-BERT, a transformer-based model pre-trained on the MIMIC-IV database using a multi-task scheme to learn robust representations of complex ICU data with minimal preprocessing. ICU-BERT employs a multi-token input strategy, incorporating dense embeddings from a biomedical Large Language Model to learn a generalizable representation of complex and multivariate ICU data. With an initial evaluation of five tasks and four additional ICU datasets, ICU-BERT results indicate that ICU-BERT either compares to or surpasses current performance benchmarks by leveraging fine-tuning. By integrating structured and unstructured data, ICU-BERT advances the use of foundational models in medical informatics, offering an adaptable solution for clinical decision support across diverse applications.
- Europe > Portugal (0.04)
- Europe > Germany (0.04)
- North America > United States (0.04)
Estimating the Optimal Number of Clusters in Categorical Data Clustering by Silhouette Coefficient
Dinh, Duy-Tai, Fujinami, Tsutomu, Huynh, Van-Nam
The problem of estimating the number of clusters (say k) is one of the major challenges for the partitional clustering. This paper proposes an algorithm named k-SCC to estimate the optimal k in categorical data clustering. For the clustering step, the algorithm uses the kernel density estimation approach to define cluster centers. In addition, it uses an information-theoretic based dissimilarity to measure the distance between centers and objects in each cluster. The silhouette analysis based approach is then used to evaluate the quality of different clusterings obtained in the former step to choose the best k. Comparative experiments were conducted on both synthetic and real datasets to compare the performance of k-SCC with three other algorithms. Experimental results show that k-SCC outperforms the compared algorithms in determining the number of clusters for each dataset.
- Asia > Singapore (0.04)
- Asia > Japan (0.04)
- North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
- North America > United States > California > Alameda County > Oakland (0.04)
Learning to Solve Abstract Reasoning Problems with Neurosymbolic Program Synthesis and Task Generation
Bednarek, Jakub, Krawiec, Krzysztof
The ability to think abstractly and reason by analogy is a prerequisite to rapidly adapt to new conditions, tackle newly encountered problems by decomposing them, and synthesize knowledge to solve problems comprehensively. We present TransCoder, a method for solving abstract problems based on neural program synthesis, and conduct a comprehensive analysis of decisions made by the generative module of the proposed architecture. At the core of TransCoder is a typed domain-specific language, designed to facilitate feature engineering and abstract reasoning. In training, we use the programs that failed to solve tasks to generate new tasks and gather them in a synthetic dataset. As each synthetic task created in this way has a known associated program (solution), the model is trained on them in supervised mode. Solutions are represented in a transparent programmatic form, which can be inspected and verified. We demonstrate TransCoder's performance using the Abstract Reasoning Corpus dataset, for which our framework generates tens of thousands of synthetic problems with corresponding solutions and facilitates systematic progress in learning.
- Europe > Poland > Greater Poland Province > Poznań (0.05)
- North America > United States > New York > New York County > New York City (0.04)
Linguistics from a topological viewpoint
Fortunately numbers are the dimensions of the k-th persistent homology there are many such suitable options, one option is the parameterized by the threshold r. p-Wasserstein distance with p > 0 being a parameter. Especially when p =, we call the -Wasserstein distance More than just counting topological structures with the bottleneck distance. We skip the exact definition of persistent Betti numbers, we can detect at which threshold p-Wasserstein distance here since it is too technical, the values a topological structure is born and dead.
- South America > Peru (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- South America > Colombia (0.04)
- (9 more...)
Generating Likely Counterfactuals Using Sum-Product Networks
Nemecek, Jiri, Pevny, Tomas, Marecek, Jakub
Due to user demand and recent regulation (GDPR, AI Act), decisions made by AI systems need to be explained. These decisions are often explainable only post hoc, where counterfactual explanations are popular. The question of what constitutes the best counterfactual explanation must consider multiple aspects, where "distance from the sample" is the most common. We argue that this requirement frequently leads to explanations that are unlikely and, therefore, of limited value. Here, we present a system that provides high-likelihood explanations. We show that the search for the most likely explanations satisfying many common desiderata for counterfactual explanations can be modeled using mixed-integer optimization (MIO). In the process, we propose an MIO formulation of a Sum-Product Network (SPN) and use the SPN to estimate the likelihood of a counterfactual, which can be of independent interest. A numerical comparison against several methods for generating counterfactual explanations is provided.
- Europe (0.28)
- North America > United States > New York > New York County > New York City (0.04)
- Oceania > Australia > New South Wales > Sydney (0.04)
- (4 more...)
Comparative Analysis of Transformers for Modeling Tabular Data: A Casestudy using Industry Scale Dataset
Singh, Usneek, Arora, Piyush, Ganesan, Shamika, Kumar, Mohit, Kulkarni, Siddhant, Joshi, Salil R.
We perform a comparative analysis of transformer-based models designed for modeling tabular data, specifically on an industry-scale dataset. While earlier studies demonstrated promising outcomes on smaller public or synthetic datasets, the effectiveness did not extend to larger industry-scale datasets. The challenges identified include handling high-dimensional data, the necessity for efficient pre-processing of categorical and numerical features, and addressing substantial computational requirements. To overcome the identified challenges, the study conducts an extensive examination of various transformer-based models using both synthetic datasets and the default prediction Kaggle dataset (2022) from American Express. The paper presents crucial insights into optimal data pre-processing, compares pre-training and direct supervised learning methods, discusses strategies for managing categorical and numerical features, and highlights trade-offs between computational resources and performance. Focusing on temporal financial data modeling, the research aims to facilitate the systematic development and deployment of transformer-based models in real-world scenarios, emphasizing scalability.
Hybrid Models for Mixed Variables in Bayesian Optimization
Luo, Hengrui, Cho, Younghyun, Demmel, James W., Li, Xiaoye S., Liu, Yang
This paper presents a new type of hybrid models for Bayesian optimization (BO) adept at managing mixed variables, encompassing both quantitative (continuous and integer) and qualitative (categorical) types. Our proposed new hybrid models merge Monte Carlo Tree Search structure (MCTS) for categorical variables with Gaussian Processes (GP) for continuous ones. Addressing efficiency in searching phase, we juxtapose the original (frequentist) upper confidence bound tree search (UCTS) and the Bayesian Dirichlet search strategies, showcasing the tree architecture's integration into Bayesian optimization. Central to our innovation in surrogate modeling phase is online kernel selection for mixed-variable BO. Our innovations, including dynamic kernel selection, unique UCTS (hybridM) and Bayesian update strategies (hybridD), position our hybrid models as an advancement in mixed-variable surrogate models. Numerical experiments underscore the hybrid models' superiority, highlighting their potential in Bayesian optimization. Keywords: Gaussian processes, Monte Carlo tree search, categorical variables, online kernel selection. The discussion of different types of encodings can be found in Cerda et al. (2018). 1 Introduction Our motivating problem is to optimize a "black-box" function with "mixed" variables, lacking an analytic expression. "Mixed" signifies the function's input variables comprise both continuous (quantitative) and categorical (qualitative) variables, common in machine learning and scientific computing tasks like performance tuning of mathematical libraries and application codes at runtime and compile-time (Balaprakash et al., 2018). Bayesian optimization (BO) with Gaussian process (GP) surrogate models is a prevalent method for optimizing noisy, expensive black-box functions, primarily designed for continuous-variable functions (Shahriari et al., 2016; Sid-Lakhdar et al., 2020). Extending BO to mixed-variable functions presents theoretical and computational challenges due to variable type differences (Table 1). Continuous variables have uncountably many values with magnitudes and intrinsic ordering, allowing natural gradient definition. In contrast, categorical variables, having finitely many values without intrinsic ordering or magnitude, require encoding in the GP context, potentially inducing discontinuity and degrading GP performance (Luo et al., 2021). The empirical rule of thumb for handling an integer variable (Karlsson et al., 2020) is to treat it as a categorical variable if the number of integer values (i.e., number of categorical values) is small, or as a continuous variable with embedding (a.k.a.
- North America > United States > California > Alameda County > Berkeley (0.14)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Africa > Sudan (0.04)
- Energy (1.00)
- Government > Regional Government (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.88)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.66)
Machine Learning Based Missing Values Imputation in Categorical Datasets
Ishaq, Muhammad, iftikhar, Laila, Khan, Majid, Khan, Asfandyar, Khan, Arshad
This study explored the use of machine learning algorithms for predicting and imputing missing values in categorical datasets. We focused on ensemble models that use the error correction output codes (ECOC) framework, including SVM-based and KNN-based ensemble models, as well as an ensemble classifier that combines SVM, KNN, and MLP models. We applied these algorithms to three datasets: the CPU dataset, the hypothyroid dataset, and the Breast Cancer dataset. Our experiments showed that the machine learning algorithms were able to achieve good performance in predicting and imputing the missing values, with some variations depending on the specific dataset and missing value pattern. The ensemble models using the error correction output codes (ECOC) framework were particularly effective in improving the accuracy and robustness of the predictions, compared to individual models. However, there are also challenges and limitations to using deep learning for missing value imputation, including the need for large amounts of labeled data and the potential for overfitting. Further research is needed to evaluate the effectiveness and efficiency of deep learning algorithms for missing value imputation and to develop strategies for addressing the challenges and limitations that may arise.
- South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
- Europe > Switzerland (0.04)
- Europe > Netherlands (0.04)
- Asia > Pakistan > Khyber Pakhtunkhwa > Peshawar Division > Peshawar District > Peshawar (0.04)
- Information Technology > Data Science > Data Quality > Data Cleaning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.74)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.69)
Data Science Interview Guide. Data Science is quite a large and…
Data Science is quite a large and diverse field. As a result, it is really difficult to be a jack of all trades. Traditionally, Data Science would focus on mathematics, computer science and domain expertise. While I will briefly cover some computer science fundamentals, the bulk of this blog will mostly cover the mathematical basics one might either need to brush up on (or even take an entire course). In most data science workplaces, software skills are a must. While I understand most of you reading this are more math heavy by nature, realize the bulk of data science (dare I say 80%) is collecting, cleaning and processing data into a useful form.
How to perform ordinal encoding using sklearn? - The Security Buddy
A categorical variable contains categorical data, such as name, gender, address, etc. There are different types of categorical variables. A nominal categorical variable contains categorical data that cannot be ranked over each other. For example, name, address, gender, etc. But let's say there is a categorical variable that contains categorical data that can be ranked over each other.