 independent feature



Dependency-aware synthetic tabular data generation

arXiv.org Artificial Intelligence

Synthetic tabular data is increasingly used in privacy-sensitive domains such as health care, but existing generative models often fail to preserve inter-attribute relationships. In particular, functional dependencies (FDs) and logical dependencies (LDs), which capture deterministic and rule-based associations between features, are rarely retained in synthetic datasets, and when they are, they are often retained poorly. To address this research gap, we propose the Hierarchical Feature Generation Framework (HFGF) for synthetic tabular data generation. We created benchmark datasets with known dependencies to evaluate HFGF. The framework first generates independent features using any standard generative model, and then reconstructs dependent features based on predefined FD and LD rules. Our experiments on four benchmark datasets with varying sizes, feature imbalance, and dependency complexity demonstrate that HFGF improves the preservation of FDs and LDs across six generative models, including CTGAN, TVAE, and GReaT. Our findings demonstrate that HFGF can significantly enhance the structural fidelity and downstream utility of synthetic tabular data.
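
The two-stage idea in this abstract (generate independent features first, then reconstruct dependent ones from predefined rules) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy table, the column names, the helper functions, and the use of independent resampling as a stand-in for a generative model such as CTGAN are all assumptions made for illustration.

```python
import random

# Hypothetical toy table: "city" and "age" are treated as independent features,
# "country" is functionally determined by "city" (an FD), and "is_adult"
# follows a logical rule on "age" (an LD).
real_rows = [
    {"city": "Berlin", "country": "DE", "age": 34},
    {"city": "Munich", "country": "DE", "age": 15},
    {"city": "Paris", "country": "FR", "age": 52},
]


def learn_fd(rows, determinant, dependent):
    """Build a lookup table determinant -> dependent from the real data."""
    mapping = {}
    for row in rows:
        key, value = row[determinant], row[dependent]
        if mapping.setdefault(key, value) != value:
            raise ValueError(f"{determinant} -> {dependent} is not functional")
    return mapping


def sample_independent(rows, columns, n, rng):
    """Stand-in for a generative model: resample each column independently."""
    return [{c: rng.choice(rows)[c] for c in columns} for _ in range(n)]


def generate(rows, independent, fds, rules, n, seed=0):
    """Stage 1: sample independent features. Stage 2: rebuild dependent ones."""
    rng = random.Random(seed)
    fd_maps = {(det, dep): learn_fd(rows, det, dep) for det, dep in fds}
    synthetic = sample_independent(rows, independent, n, rng)
    for row in synthetic:
        # Functional dependencies: look the dependent value up from the determinant.
        for (det, dep), mapping in fd_maps.items():
            row[dep] = mapping[row[det]]
        # Logical dependencies: evaluate each predefined rule on the row.
        for column, rule in rules.items():
            row[column] = rule(row)
    return synthetic


synthetic_rows = generate(
    real_rows,
    independent=["city", "age"],
    fds=[("city", "country")],
    rules={"is_adult": lambda row: row["age"] >= 18},
    n=5,
)
for row in synthetic_rows:
    print(row)
```

Because the dependent columns are never sampled directly, every synthetic row satisfies the listed rules by construction, whatever the quality of the first-stage sampler.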


A Comprehensive Analysis on the Learning Curve in Kernel Ridge Regression

arXiv.org Artificial Intelligence

This paper conducts a comprehensive study of the learning curves of kernel ridge regression (KRR) under minimal assumptions. Our contributions are three-fold: 1) we analyze the role of key properties of the kernel, such as its spectral eigen-decay, the characteristics of the eigenfunctions, and the smoothness of the kernel; 2) we demonstrate the validity of the Gaussian Equivalent Property (GEP), which states that the generalization performance of KRR remains the same when the whitened features are replaced by standard Gaussian vectors, thereby shedding light on the success of previous analyses under the Gaussian Design Assumption; 3) we derive novel bounds that improve over existing bounds across a broad range of settings, such as (in)dependent feature vectors and various combinations of eigen-decay rates in the over/underparameterized regimes.


EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning Yuan Qi

Neural Information Processing Systems

For many real-world applications, we often need to select correlated variables, such as genetic variations and imaging features associated with Alzheimer's disease, in a high-dimensional space. The correlation between variables presents a challenge to classical variable selection methods. To address this challenge, the elastic net has been developed and successfully applied in many applications. Despite its great success, the elastic net does not exploit the correlation information embedded in the data to select correlated variables. To overcome this limitation, we present a novel hybrid model, EigenNet, that uses the eigenstructures of data to guide variable selection.


How to Verify the Assumptions of Linear Regression

#artificialintelligence

Originally published on Towards AI. Linear regression is a model that estimates the relationship between independent variables and a dependent variable using a straight line.
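
As a rough illustration of fitting a straight line between one independent variable and a dependent variable, here is a minimal ordinary least squares sketch in plain Python. The data points and the `fit_line` helper are made up for illustration and are not taken from the article. The residuals computed at the end are the quantities one would typically inspect when checking the model's assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data, roughly following y = 2x + 1 with some noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(xs, ys)

# Residuals: observed value minus the value predicted by the fitted line.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print("residuals:", [round(r, 2) for r in residuals])
```

With an intercept in the model, the residuals always sum to zero up to floating-point error, so that alone says nothing about fit quality. Checks of linearity or constant variance look at how the residuals are distributed against `xs` or the fitted values.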


All About Decision Tree

#artificialintelligence

Originally published on Towards AI. The decision tree is one of the most powerful and important algorithms in supervised machine learning.


How to Use PySpark for Data Processing and Machine Learning

#artificialintelligence

PySpark is an interface for Apache Spark in Python. It is often used for large-scale data processing and machine learning. We just released a PySpark crash course on freeCodeCamp.org. Krish, who teaches the course, is a lead data scientist and runs a popular YouTube channel. Apache Spark is written in the Scala programming language. To support Python with Spark, the Apache Spark community released a tool called PySpark. PySpark allows people to work with Resilient Distributed Datasets (RDDs) in Python through a library called Py4J. From the course transcript: So we are going to start an Apache Spark series. Specifically, we will be focusing on how we can use Spark with Python. We are going to discuss the library called PySpark, and we will try to understand why Spark is actually required. And we will probably also try to cover a lot of ...


Multiple Linear Regression Using Python and Scikit-learn

#artificialintelligence

This article was published as a part of the Data Science Blogathon. If you are learning data science, you already have some understanding of what machine learning is. In today's digital world, nearly everyone has heard of machine learning because it is a trending technology worldwide. Each step toward the future is driven by this technology, and the technology in turn is driven by data scientists like you and me. This article discusses only machine learning, so if you don't know what it is, here is a brief introduction: machine learning is the study of computer algorithms that improve automatically through experience and the use of data. That is the simple definition; looking deeper, we find that a huge number of algorithms are used in model building.


Fully Explained K-Nearest Neighbors with Python

#artificialintelligence

Hello everyone, this is another article in the series of fully explained machine learning algorithms. In this article, we will discuss the k-nearest neighbors classification problem. A good article flows like a story and gives readers as much information as possible in a short time. So, we will discuss this supervised learning technique for classification. The main goal is to predict the label of a new data point based on the samples nearest to it.
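
The idea of predicting a new point from nearby samples can be sketched in a few lines of plain Python. This is a minimal illustration, not the article's code: the training points, the labels, the `knn_predict` helper, and the choice of Euclidean distance with majority voting are assumptions.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort the labeled points by Euclidean distance to the query and keep k.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical two-cluster training set: (features, label).
train = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.8), "a"),
    ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"),
    ((5.2, 4.9), "b"),
    ((4.8, 5.1), "b"),
]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.1, 5.0)))
```

An odd `k` avoids tied votes in two-class problems. For larger datasets, a spatial index or a library implementation would replace the full sort used here.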


Random Forests in Machine Learning

#artificialintelligence

This article was published as a part of the Data Science Blogathon. Random Forests are often referred to as black-box models. Let's try to crack it open and see what is inside. Oops! Our plane has crashed, but fortunately, we are all safe. We are data scientists, so we want to open the black box and see what random things have been recorded inside it. Yes, let's come to our topic.