Central Clustering of Categorical Data with Automated Feature Weighting

AAAI Conferences

The ability to cluster high-dimensional categorical data is essential for many machine learning applications such as bioinformatics. Currently, central clustering of categorical data is a difficult problem due to the lack of a geometrically interpretable definition of a cluster center. In this paper, we propose a novel kernel-density-based definition using a Bayes-type probability estimator. A new algorithm called k-centers is then proposed for central clustering of categorical data, incorporating a new feature weighting scheme by which each attribute is automatically assigned a weight measuring its individual contribution to the clusters. Experimental results on real-world data show outstanding performance of the proposed algorithm, especially in recognizing biological patterns in DNA sequences.
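To make the notion of a "center" for categorical data concrete, here is a minimal R sketch of frequency-based central clustering, where each cluster center is simply the per-attribute mode. This is an illustrative stand-in only, not the paper's k-centers algorithm: the kernel-density center definition, the Bayes-type probability estimator, and the attribute weights are specific to the paper and are not reproduced here.

```r
# Illustrative only: frequency-based central clustering of categorical data.
# NOT the paper's k-centers; centers here are plain per-attribute modes.
simple_central_clustering <- function(X, k, max_iter = 20) {
  X <- as.data.frame(lapply(X, as.character), stringsAsFactors = FALSE)
  n <- nrow(X)
  assign_vec <- sample(seq_len(k), n, replace = TRUE)  # random initial clusters
  centers <- NULL
  for (iter in seq_len(max_iter)) {
    # Center of each cluster: the most frequent category per attribute
    centers <- lapply(seq_len(k), function(j) {
      rows <- X[assign_vec == j, , drop = FALSE]
      if (nrow(rows) == 0)                             # reseed an empty cluster
        rows <- X[sample(n, 1), , drop = FALSE]
      vapply(rows, function(col) names(which.max(table(col))), character(1))
    })
    # Reassign each row to the nearest center by Hamming (mismatch) distance
    new_assign <- vapply(seq_len(n), function(i) {
      row <- unlist(X[i, ])
      which.min(vapply(centers, function(ctr) sum(row != ctr), numeric(1)))
    }, integer(1))
    if (all(new_assign == assign_vec)) break           # converged
    assign_vec <- new_assign
  }
  list(cluster = assign_vec, centers = centers)
}
```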


Logistic Regression categorical data issues • /r/MachineLearning

@machinelearnbot

I have created a model in R using data with a lot of categorical variables, and it works well enough (70% classification rate). I need to transfer the code to Python to launch it to production; however, when I transfer the code, the results drop considerably (40% classification rate). I think it may have to do with how I encode the categorical data. Any ideas?
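One likely culprit: R's glm() silently dummy-codes factor columns through model.matrix(), ordering levels alphabetically and dropping the first level of each factor as the reference, so the Python side has to reproduce exactly the same columns (same levels, same reference, same order). A minimal R sketch, using a made-up toy data frame, of how to inspect what R is actually fitting:

```r
# Toy data frame (hypothetical; stands in for the real training data)
df <- data.frame(
  colour  = factor(c("red", "blue", "blue", "green")),
  size    = factor(c("S", "M", "L", "M")),
  outcome = c(1, 0, 0, 1)
)

# The design matrix glm() fits on: levels are ordered alphabetically and
# the first level of each factor is dropped as the reference category
X <- model.matrix(outcome ~ colour + size, data = df)
colnames(X)
#> "(Intercept)" "colourgreen" "colourred" "sizeM" "sizeS"

# The Python port must build identical dummy columns, or the learned
# coefficients and predictions will not transfer
```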


Proper train and test sets when using ML on a dataset? • /r/MachineLearning

@machinelearnbot

I just completed a take-home assessment as part of the interview process for a company. I was told I didn't pass because my answer lacked proper training and test sets. The data set consisted of a mix of categorical and numerical predictors, with the dependent variable being numerical. I removed all rows with NA values and generated boxplots for each predictor. For one variable, I replaced all of its outliers with the median. For some other variables that held percentage values, I did not remove the outliers because they did not seem like obvious outliers (for example, the boxplot flagged values greater than 0.1 as outliers, but all of those values still ranged from 0 to 1, so I didn't think they were typos). I then ran a Lasso linear regression model.
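For reference, a minimal sketch of the step the reviewers were looking for, assuming the glmnet package is installed; `df` and the response column `y` are placeholders for the assessment data:

```r
library(glmnet)

set.seed(42)
n         <- nrow(df)
train_idx <- sample(n, size = floor(0.8 * n))  # 80/20 train/test split

# Encode the categorical predictors once on the full data so the dummy
# columns line up, then split by rows
X <- model.matrix(y ~ . - 1, data = df)
y <- df$y

# Fit the Lasso (alpha = 1), choosing lambda by cross-validation on the
# training rows only
fit <- cv.glmnet(X[train_idx, ], y[train_idx], alpha = 1)

# Evaluate on the held-out rows
pred <- predict(fit, newx = X[-train_idx, ], s = "lambda.min")
rmse <- sqrt(mean((y[-train_idx] - pred)^2))
```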


Visualizing the relationship between multiple variables

#artificialintelligence

Visualizing the relationship between multiple variables can get messy very quickly. This post is about how the ggpairs() function in the GGally package handles this task, along with my own method for visualizing pairwise relationships when all the variables are categorical. GGally::ggpairs() does a really good job of visualizing the pairwise relationships within a group of variables. Let's demonstrate this on a small segment of the vehicles dataset from the fueleconomy package, first seeing how ggpairs() visualizes relationships between quantitative variables; the visualization changes a little when we have a mix of quantitative and categorical variables.
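A minimal sketch of the kind of call the post walks through, assuming the GGally, fueleconomy, and dplyr packages are installed; the column selection here is illustrative, not the post's exact segment:

```r
library(GGally)
library(fueleconomy)  # provides the `vehicles` dataset
library(dplyr)

veh <- vehicles %>%
  filter(make == "Toyota") %>%
  select(displ, cty, hwy, drive) %>%  # three quantitative + one categorical
  na.omit() %>%
  mutate(drive = factor(drive))

# Pairwise panels: scatterplots and correlations for quantitative pairs;
# where the categorical `drive` is involved, ggpairs() switches to grouped
# boxplots and faceted histograms instead
ggpairs(veh)
```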