AITopics | microdata

microdata

An Uncertainty Principle is a Price of Privacy-Preserving Microdata

Neural Information Processing Systems, Feb-9-2026, 00:46:16 GMT

Privacy-protected microdata are often the desired output of a differentially private algorithm since microdata is familiar and convenient for downstream users. However, there is a statistical price for this kind of convenience.

artificial intelligence, dataset, query, (15 more...)

Neural Information Processing Systems

Country:

North America > United States > New Jersey > Middlesex County > Piscataway (0.04)
North America > United States > California (0.04)
Europe > Croatia > Zagreb County > Zagreb (0.04)

Industry: Government > Regional Government > North America Government > United States Government (0.95)

Technology:

Information Technology > Artificial Intelligence (0.47)
Information Technology > Data Science (0.47)

From Unstructured Data to Demand Counterfactuals: Theory and Practice

Christensen, Timothy, Compiani, Giovanni

arXiv.org Machine Learning, Jan-12-2026

Empirical models of demand for differentiated products rely on low-dimensional product representations to capture substitution patterns. These representations are increasingly proxied by applying ML methods to high-dimensional, unstructured data, including product descriptions and images. When proxies fail to capture the true dimensions of differentiation that drive substitution, standard workflows will deliver biased counterfactuals and invalid inference. We develop a practical toolkit that corrects this bias and ensures valid inference for a broad class of counterfactuals. Our approach applies to market-level and/or individual data, requires minimal additional computation, is efficient, delivers simple formulas for standard errors, and accommodates data-dependent proxies, including embeddings from fine-tuned ML models. It can also be used with standard quantitative attributes when mismeasurement is a concern. In addition, we propose diagnostics to assess the adequacy of the proxy construction and dimension. The approach yields meaningful improvements in predicting counterfactual substitution in both simulations and an empirical application.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Machine Learning

2601.05374

Country: North America > United States (1.00)

Genre: Research Report (0.81)

Industry:

Automobiles & Trucks (0.46)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.92)
Information Technology > Information Management (0.84)

Interval Fisher's Discriminant Analysis and Visualisation

Pinheiro, Diogo, Oliveira, M. Rosário, Kravchenko, Igor, Oliveira, Lina

arXiv.org Machine Learning, Dec-16-2025

In Data Science, entities are typically represented by single valued measurements. Symbolic Data Analysis extends this framework to more complex structures, such as intervals and histograms, that express internal variability. We propose an extension of multiclass Fisher's Discriminant Analysis to interval-valued data, using Moore's interval arithmetic and the Mallows' distance. Fisher's objective function is generalised to consider simultaneously the contributions of the centres and the ranges of intervals and is numerically maximised. The resulting discriminant directions are then used to classify interval-valued observations. To support visual assessment, we adapt the class map, originally introduced for conventional data, to classifiers that assign labels through minimum distance rules. We also extend the silhouette plot to this setting and use stacked mosaic plots to complement the visual display of class assignments. Together, these graphical tools provide insight into classifier performance and the strength of class membership. Applications to real datasets illustrate the proposed methodology and demonstrate its value in interpreting classification results for interval-valued data.

discriminant analysis, mallow, matrix, (16 more...)

arXiv.org Machine Learning

2512.11945

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.05)
Europe > Spain > Galicia > Madrid (0.05)
Asia > China > Hong Kong (0.05)
(7 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

A Typology of Synthetic Datasets for Dialogue Processing in Clinical Contexts

Bedrick, Steven, Doğruöz, A. Seza, Nisioi, Sergiu

arXiv.org Artificial Intelligence, Nov-20-2025

Synthetic datasets are used across linguistic domains and NLP tasks, particularly in scenarios where authentic data is limited (or even non-existent). One such domain is that of clinical (healthcare) contexts, where there exist significant and long-standing challenges (e.g., privacy, anonymization, and data governance) which have led to the development of an increasing number of synthetic datasets. One increasingly important category of clinical dataset is that of clinical dialogues, which are especially sensitive and difficult to collect, and as such are commonly synthesized. While such synthetic datasets have been shown to be sufficient in some situations, little theory exists to inform how they may be best used and generalized to new applications. In this paper, we provide an overview of how synthetic datasets are created, evaluated, and used for dialogue-related tasks in the medical domain. Additionally, we propose a novel typology for use in classifying types and degrees of data synthesis, to facilitate comparison and evaluation.

data mining, large language model, machine learning, (22 more...)

arXiv.org Artificial Intelligence

2505.03025

Country:

North America > United States (1.00)
Asia (1.00)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.28)

Genre: Overview (1.00)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Health & Medicine > Health Care Providers & Services (0.67)
Health & Medicine > Consumer Health (0.67)
Health & Medicine > Health Care Technology > Medical Record (0.47)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.96)
(4 more...)

A Multi-level Analysis of Factors Associated with Student Performance: A Machine Learning Approach to the SAEB Microdata

Tertulino, Rodrigo, Almeida, Ricardo

arXiv.org Artificial Intelligence, Nov-18-2025

Identifying the determinants of academic success in basic education represents a central challenge for educational research and policymaking, particularly in a country with Brazil's vast dimensions and socioeconomic heterogeneity (Issah et al. 2023). A systemic approach is crucial, as student performance is influenced by a complex interplay of factors spanning individual, academic, socioeconomic, and institutional domains (Barragán Moreno and Guzmán Rincón 2025). The System of Assessment of Basic Education (SAEB), conducted by the National Institute for Educational Studies and Research Anísio Teixeira (INEP) (INEP 2025), provides a rich, multi-level dataset uniquely suited for such an analysis (Bonamino et al. 2010). The public availability of its anonymized microdata enables the research community to investigate the intricate relationships between student proficiency and a wide array of contextual factors, from socioeconomic backgrounds to school infrastructure and teacher profiles. Consequently, the SAEB microdata is an essential resource for data-driven research aimed at informing and evaluating educational policies in the country (Lundberg and Lee 2017b; Mazoni and Oliveira 2023). While traditional statistical methods are common, the Educational Data Mining (EDM) paradigm offers powerful tools for uncovering complex, non-linear patterns from such data (Romero and Ventura 2010). Furthermore, we demonstrate that by interpreting the model's classification results with XAI techniques, our method provides data-driven insights for educators and policymakers (Idrizi 2024). The primary objective of this research is thus to develop and evaluate a multi-level machine learning model to identify the key systemic factors associated with the academic performance of 9th-grade and high school students, using the SAEB microdata.
Building upon this perspective, the study shifts its analytical focus from purely individual student interventions toward addressing the systemic determinants that shape educational outcomes in Brazilian basic education.

artificial intelligence, machine learning, student, (17 more...)

arXiv.org Artificial Intelligence

2510.22266

Country:

North America > United States (0.93)
South America (0.67)

Genre:

Research Report > New Finding (1.00)
Instructional Material (1.00)

Industry:

Education > Assessment & Standards > Student Performance (1.00)
Education > Educational Setting > Higher Education (0.69)
Education > Curriculum > Subject-Specific Education (0.67)
Education > Educational Setting > K-12 Education > Secondary School (0.55)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

639d79cc857a6c76c2723b7e014fccb0-Paper.pdf

Neural Information Processing Systems, Aug-14-2025, 20:20:47 GMT

dataset, differential privacy, query, (17 more...)

Neural Information Processing Systems

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > California (0.04)
North America > United States > New Jersey > Middlesex County > Piscataway (0.04)
Europe > Croatia > Zagreb County > Zagreb (0.04)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
(2 more...)

Deep Contrastive Learning for Feature Alignment: Insights from Housing-Household Relationship Inference

Qian, Xiao, Dong, Shangjia, Davidson, Rachel

arXiv.org Artificial Intelligence, Feb-16-2025

Housing and household characteristics are key determinants of social and economic well-being, yet our understanding of their interrelationships remains limited. This study addresses this knowledge gap by developing a deep contrastive learning (DCL) model to infer housing-household relationships using the American Community Survey (ACS) Public Use Microdata Sample (PUMS). More broadly, the proposed model is suitable for a class of problems where the goal is to learn joint relationships between two distinct entities without explicitly labeled ground truth data. Our proposed dual-encoder DCL approach leverages co-occurrence patterns in PUMS and introduces a bisect K-means clustering method to overcome the absence of ground truth labels. The dual-encoder DCL architecture is designed to handle the semantic differences between housing (building) and household (people) features while mitigating noise introduced by clustering. To validate the model, we generate a synthetic ground truth dataset and conduct comprehensive evaluations. The model further demonstrates its superior performance in capturing housing-household relationships in Delaware compared to state-of-the-art methods. A transferability test in North Carolina confirms its generalizability across diverse sociodemographic and geographic contexts. Finally, the post-hoc explainable AI analysis using SHAP values reveals that tenure status and mortgage information play a more significant role in housing-household matching than traditionally emphasized factors such as the number of persons and rooms.

artificial intelligence, housing unit, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2502.11205

Country: North America > United States > North Carolina (0.25)

Genre:

Research Report > Promising Solution (0.48)
Research Report > New Finding (0.46)

Industry:

Banking & Finance > Real Estate (0.55)
Health & Medicine > Health Care Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Evaluating utility in synthetic banking microdata applications

Caceres, Hugo E., Moews, Ben

arXiv.org Artificial Intelligence, Oct-29-2024

Financial regulators such as central banks collect vast amounts of data, but access to the resulting fine-grained banking microdata is severely restricted by banking secrecy laws. Recent developments have resulted in mechanisms that generate faithful synthetic data, but current evaluation frameworks lack a focus on the specific challenges of banking institutions and microdata. We develop a framework that considers the utility and privacy requirements of regulators, and apply this to financial usage indices, term deposit yield curves, and credit card transition matrices. Using the Central Bank of Paraguay's data, we provide the first implementation of synthetic banking microdata using a central bank's collected information, with the resulting synthetic datasets for all three domain applications being publicly available and featuring information not yet released in statistical disclosure. We find that applications less susceptible to post-processing information loss, which are based on frequency tables, are particularly suited for this approach, and that marginal-based inference mechanisms outperform generative adversarial network models for these applications. Our results demonstrate that synthetic data generation is a promising privacy-enhancing technology for financial regulators seeking to complement their statistical disclosure, while highlighting the crucial role of evaluating such endeavors in terms of utility and privacy requirements.

artificial intelligence, data mining, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2410.22519

Country:

North America > United States > New York > New York County > New York City (0.14)
Europe > United Kingdom (0.04)
South America > Paraguay > Asunción > Asunción (0.04)
(4 more...)

Genre: Research Report > New Finding (0.68)

Industry:

Law > Statutes (1.00)
Law > Business Law (1.00)
Information Technology > Security & Privacy (1.00)
(2 more...)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Information Management (1.00)
Information Technology > Data Science > Data Mining (1.00)
(2 more...)

Value-Enriched Population Synthesis: Integrating a Motivational Layer

Aguilera, Alba, Albertí, Miquel, Osman, Nardine, Curto, Georgina

arXiv.org Artificial Intelligence, Aug-18-2024

In recent years, computational improvements have allowed for more nuanced, data-driven and geographically explicit agent-based simulations. So far, simulations have struggled to adequately represent the attributes that motivate the actions of the agents. In fact, existing population synthesis frameworks generate agent profiles limited to socio-demographic attributes. In this paper, we introduce a novel value-enriched population synthesis framework that integrates a motivational layer with the traditional individual and household socio-demographic layers. Our research highlights the significance of extending the profile of agents in synthetic populations by incorporating data on values, ideologies, opinions and vital priorities, which motivate the agents' behaviour. This motivational layer can help us develop a more nuanced decision-making mechanism for the agents in social simulation settings. Our methodology integrates microdata and macrodata within different Bayesian network structures. This contribution makes it possible to generate synthetic populations with integrated value systems that preserve the inherent socio-demographic distributions of the real population in any specific region.

data source, motivational, synthetic population, (17 more...)

arXiv.org Artificial Intelligence

2408.09407

Country:

Oceania > Australia (0.14)
Europe > Spain > Catalonia (0.05)
North America > Canada (0.04)
(9 more...)

Genre: Research Report (0.40)

Industry:

Government (0.46)
Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.87)

A Deep Generative Framework for Joint Households and Individuals Population Synthesis

Qian, Xiao, Gangwal, Utkarsh, Dong, Shangjia, Davidson, Rachel

arXiv.org Artificial Intelligence, Jun-30-2024

Household and individual-level sociodemographic data are essential for understanding human-infrastructure interaction and policymaking. However, the Public Use Microdata Sample (PUMS) offers only a sample at the state level, while census tract data only provides the marginal distributions of variables without correlations. Therefore, we need an accurate synthetic population dataset that maintains consistent variable correlations observed in microdata, preserves household-individual and individual-individual relationships, adheres to state-level statistics, and accurately represents the geographic distribution of the population. We propose a deep generative framework leveraging the variational autoencoder (VAE) to generate a synthetic population with the aforementioned features. The methodological contributions include (1) a new data structure for capturing household-individual and individual-individual relationships, (2) a transfer learning process with pre-training and fine-tuning steps to generate households and individuals whose aggregated distributions align with the census tract marginal distribution, and (3) a decoupled binary cross-entropy (D-BCE) loss function enabling distribution shift and out-of-sample record generation. Model results for an application in Delaware, USA demonstrate the ability to ensure the realism of generated household-individual records and accurately describe population statistics at the census tract level compared to existing methods. Furthermore, testing in North Carolina, USA yielded promising results, supporting the transferability of our method.

household, marginal distribution, microdata, (11 more...)

arXiv.org Artificial Intelligence

2407.01643

Country:

North America > United States > North Carolina (0.25)
North America > United States > Delaware > New Castle County > Newark (0.14)
Europe > United Kingdom (0.14)
(6 more...)

Genre:

Research Report > New Finding (0.93)
Research Report > Experimental Study (0.68)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
Law (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.67)
