Challenges of Heterogeneity in Big Data: A Comparative Study of Classification in Large-Scale Structured and Unstructured Domains
Jesús Eduardo González Trigueros, Alejandro Alonso Sánchez, Emilio Muñoz Rivera, Mariana Jaqueline Peñarán Prieto, Camila Natalia Mendoza González
arXiv.org Artificial Intelligence
This study analyzes the impact of heterogeneity ("Variety") in Big Data by comparing classification strategies across structured (Epsilon) and unstructured (Rest-Mex, IMDB) domains. A dual methodology was implemented: evolutionary and Bayesian hyperparameter optimization (Genetic Algorithms, Optuna) in Python for numerical data, and distributed processing in Apache Spark for massive textual corpora. The results reveal a "complexity paradox": in high-dimensional spaces, optimized linear models (SVM, Logistic Regression) outperformed deep architectures and Gradient Boosting. Conversely, in text-based domains, the constraints of distributed fine-tuning led to overfitting in complex models, whereas robust feature engineering, specifically Transformer-based embeddings (RoBERTa) and Bayesian Target Encoding, enabled simpler models to generalize effectively. This work provides a unified framework for algorithm selection based on data nature and infrastructure constraints.
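The abstract names Bayesian Target Encoding without giving a formula. As a minimal sketch only, the snippet below uses one common formulation: each category's mean target is shrunk toward the global mean by a prior weight. The function names and the `prior_weight` parameter are hypothetical, and nothing here is taken from the authors' implementation.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, prior_weight=10.0):
    """Learn a smoothed mean-target value per category.

    Assumption: "Bayesian" here means shrinking each category mean
    toward the global mean, with prior_weight acting as a pseudo-count.
    """
    if len(categories) != len(targets) or len(targets) == 0:
        raise ValueError("categories and targets must be non-empty and equal in length")

    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1

    # Posterior-style estimate: rare categories stay close to the global mean.
    encoding = {
        category: (sums[category] + prior_weight * global_mean)
        / (counts[category] + prior_weight)
        for category in counts
    }
    return encoding, global_mean


def apply_target_encoding(categories, encoding, global_mean):
    """Map categories to encoded values; unseen categories get the global mean."""
    return [encoding.get(category, global_mean) for category in categories]


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    cats = ["a", "a", "b"]
    ys = [1, 1, 0]
    enc, prior = fit_target_encoding(cats, ys, prior_weight=1.0)
    print(apply_target_encoding(["a", "b", "unseen"], enc, prior))
```

In practice the encoding would be fitted on training folds only, so that the target values of held-out rows do not leak into their own features.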
Dec-2-2025
- Country:
- Europe > Sweden
- Östergötland County > Linköping (0.04)
- North America > Mexico
- Guanajuato (0.04)
- Genre:
- Research Report
- Experimental Study (0.48)
- New Finding (0.34)
- Industry:
- Health & Medicine (0.46)
- Media > Film (0.46)