Challenges of Heterogeneity in Big Data: A Comparative Study of Classification in Large-Scale Structured and Unstructured Domains

Eduardo, González Trigueros Jesús, Alejandro, Alonso Sánchez, Emilio, Muñoz Rivera, Jaqueline, Peñarán Prieto Mariana, Natalia, Mendoza González Camila

Dec-2-2025–arXiv.org Artificial Intelligence

This study analyzes the impact of heterogeneity ("Variety") in Big Data by comparing classification strategies across structured (Epsilon) and unstructured (Rest-Mex, IMDB) domains. A dual methodology was implemented: evolutionary and Bayesian hyperparameter optimization (Genetic Algorithms, Optuna) in Python for numerical data, and distributed processing in Apache Spark for massive textual corpora. The results reveal a "complexity paradox": in high-dimensional spaces, optimized linear models (SVM, Logistic Regression) outperformed deep architectures and Gradient Boosting. Conversely, in text-based domains, the constraints of distributed fine-tuning led to overfitting in complex models, whereas robust feature engineering -- specifically Transformer-based embeddings (ROBERTa) and Bayesian Target Encoding -- enabled simpler models to generalize effectively. This work provides a unified framework for algorithm selection based on data nature and infrastructure constraints.

data mining, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

Dec-2-2025

arXiv.org PDF

Add feedback

Genre:
- Research Report
  - Experimental Study (0.48)
  - New Finding (0.34)

Industry:
- Media > Film (0.46)
- Health & Medicine (0.46)

Technology:
- Information Technology
  - Data Science > Data Mining (1.00)
  - Artificial Intelligence
    - Natural Language > Text Processing (1.00)
    - Representation & Reasoning
      - Search (0.94)
      - Optimization (0.93)
    - Machine Learning
      - Statistical Learning (1.00)
      - Neural Networks > Deep Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found