Extreme-K categorical samples problem

Chou, Elizabeth, McVey, Catie, Hsieh, Yin-Chen, Enriquez, Sabrina, Hsieh, Fushing

Jul-29-2020–arXiv.org Machine Learning

With histograms as its foundation, we develop Categorical Exploratory Data Analysis (CEDA) for the extreme-$K$ sample problem and illustrate its universal applicability through four 1D categorical datasets. Given a sizable $K$, CEDA's ultimate goal is to discover the data's information content by carrying out two data-driven computational tasks: 1) establish a tree geometry upon the $K$ populations as a platform for discovering a wide spectrum of patterns among populations; 2) evaluate each geometric pattern's reliability. In CEDA, each population gives rise to a row vector of category proportions. Along the data matrix's row axis, we discuss the pros and cons of Euclidean distance versus its weighted version for building a binary clustering tree geometry. The criterion of choice rests on the degree of uniformness within the column blocks framed by this binary clustering tree. Each tree leaf (population) is then encoded with a binary code sequence, and so is each tree-based pattern. To evaluate reliability, we adopt row-wise multinomial randomness to generate an ensemble of matrix mimicries, and hence an ensemble of mimicked binary trees. The reliability of any observed pattern is its recurrence rate within the tree ensemble. A high reliability value means a deterministic pattern. Our four applications of CEDA illuminate four significant aspects of extreme-$K$ sample problems.
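
The reliability evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain (unweighted) Euclidean distance, average linkage (the abstract does not name a linkage), and a "pattern" taken to be a set of populations that forms one branch of the tree. All function names (`proportions`, `tree_clades`, `mimic`, `reliability`) are hypothetical and introduced only for this example.

```python
import random
from itertools import combinations

def proportions(counts):
    """Turn one population's category counts into a row of proportions."""
    total = sum(counts)
    return [c / total for c in counts]

def euclidean(p, q):
    """Plain Euclidean distance between two proportion rows."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def average_linkage(rows, a, b):
    """Mean pairwise distance between clusters a and b (an assumed linkage)."""
    total = sum(euclidean(rows[i], rows[j]) for i in a for j in b)
    return total / (len(a) * len(b))

def tree_clades(rows):
    """Build a binary clustering tree by repeated merging.

    Returns the set of branches (frozensets of row indices) created by the merges.
    """
    clusters = [frozenset([i]) for i in range(len(rows))]
    clades = set()
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(rows, pair[0], pair[1]))
        merged = a | b
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        clades.add(merged)
    return clades

def mimic(counts, rng):
    """Redraw one row from a multinomial with the observed proportions and total."""
    draws = rng.choices(range(len(counts)), weights=counts, k=sum(counts))
    return [draws.count(j) for j in range(len(counts))]

def reliability(count_matrix, clade, n_mimics=200, seed=0):
    """Recurrence rate of a branch across the ensemble of mimicked trees."""
    rng = random.Random(seed)
    target = frozenset(clade)
    hits = 0
    for _ in range(n_mimics):
        mimicked = [proportions(mimic(row, rng)) for row in count_matrix]
        if target in tree_clades(mimicked):
            hits += 1
    return hits / n_mimics
```

For a made-up count matrix such as `[[90, 5, 5], [85, 10, 5], [5, 90, 5], [10, 85, 5]]`, calling `reliability(counts, {0, 1})` estimates how often populations 0 and 1 share a branch across the mimicked trees. A value near 1 corresponds to what the abstract calls a deterministic pattern.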

artificial intelligence, machine learning, tree geometry

Country:
- Africa (0.04)
- South America > Chile
  - Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States
  - Colorado (0.04)
  - California > Yolo County
    - Davis (0.04)
- Europe > Norway
  - Northern Norway > Troms > Tromsø (0.04)
- Asia
  - Taiwan (0.05)
  - Macao (0.04)
  - Middle East > Jordan (0.04)
  - China (0.04)

Genre:
- Research Report (0.50)

Industry:
- Leisure & Entertainment > Sports > Baseball (1.00)

Technology:
- Information Technology
  - Artificial Intelligence > Machine Learning (1.00)
  - Data Science (0.88)
