official statistics


Which Imputation Fits Which Feature Selection Method? A Survey-Based Simulation Study

Schwerter, Jakob, Romero, Andrés, Dumpert, Florian, Pauly, Markus

arXiv.org Machine Learning

Tree-based learning methods such as Random Forest and XGBoost are still the gold-standard prediction methods for tabular data. Feature importance measures are usually considered for feature selection as well as to assess the effect of features on the outcome variables in the model. This also applies to survey data, which are frequently encountered in the social sciences and official statistics. These types of datasets often present the challenge of missing values. The typical solution is to impute the missing data before applying the learning method. However, given the large number of possible imputation methods available, the question arises as to which should be chosen to achieve the 'best' reflection of feature importance and feature selection in subsequent analyses. In the present paper, we investigate this question in a survey-based simulation study for eight state-of-the-art imputation methods and three learners. The imputation methods comprise listwise deletion, three MICE options, four missRanger options, as well as the recently proposed mixGBoost imputation approach. As learners, we consider the two most common tree-based methods, Random Forest and XGBoost, as well as an interpretable linear model with regularization.
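As a toy illustration of why the imputation choice matters for the downstream feature assessment discussed above, the sketch below (not from the paper; the simulated data and the ridge stand-in for the regularized linear learner are invented for illustration) imputes a partially missing feature in two ways and compares the coefficients the learner then assigns:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate survey-like data: x2 depends on x1, the outcome depends on both.
n = 2000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)

# Introduce values missing at random in x2.
miss = rng.random(n) < 0.4
x2_obs = x2.copy()
x2_obs[miss] = np.nan

def ridge_coefs(X, y, lam=1.0):
    """Ridge estimate (X'X + lam*I)^{-1} X'y as a stand-in linear learner."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Imputation A: unconditional mean (ignores the x1-x2 relationship).
x2_mean = np.where(miss, np.nanmean(x2_obs), x2_obs)

# Imputation B: conditional mean via regression on x1 (one MICE-style step).
obs = ~miss
slope, intercept = np.polyfit(x1[obs], x2_obs[obs], 1)
x2_reg = np.where(miss, slope * x1 + intercept, x2_obs)

imp_a = ridge_coefs(np.column_stack([x1, x2_mean]), y)
imp_b = ridge_coefs(np.column_stack([x1, x2_reg]), y)
print("mean imputation coefs:      ", imp_a)  # x2's weight is attenuated
print("regression imputation coefs:", imp_b)  # closer to the true (1, 2)
```

Mean imputation shrinks the apparent importance of the partially observed feature and inflates its correlated neighbour, which is exactly the kind of distortion a feature-selection step would inherit.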


Leveraging Machine Learning for Official Statistics: A Statistical Manifesto

Puts, Marco, Salgado, David, Daas, Piet

arXiv.org Machine Learning

Machine learning presents both opportunities and challenges for official statistics production, so it is important to apply it with statistical rigor. Although machine learning has enjoyed rapid technological advances in recent years, its application often lacks the methodological robustness necessary to produce high-quality statistical results. In order to account for all sources of error in machine learning models, the Total Machine Learning Error (TMLE) is presented as a framework analogous to the Total Survey Error Model used in survey methodology. As a means of ensuring that ML models are both internally and externally valid, the TMLE model addresses issues such as representativeness and measurement errors. Several case studies are presented, illustrating the importance of applying more rigor to the application of machine learning in official statistics.


A step towards the integration of machine learning and small area estimation

Żądło, Tomasz, Chwila, Adam

arXiv.org Machine Learning

The use of machine-learning techniques has grown in numerous research areas. Currently, it is also widely used in statistics, including official statistics, for data collection (e.g. satellite imagery, web scraping and text mining, data cleaning, integration and imputation) but also for data analysis. However, the usage of these methods in survey sampling, including small area estimation, is still very limited. Therefore, we propose a predictor supported by these algorithms which can be used to predict any population or subpopulation characteristics based on cross-sectional and longitudinal data. Machine learning methods have already been shown to be very powerful in identifying and modelling complex and nonlinear relationships between the variables, which means that they have very good properties in case of strong departures from the classic assumptions. Therefore, we analyse the performance of our proposal under a different set-up which, in our opinion, is of greater importance in real-life surveys. We study only small departures from the assumed model, to show that our proposal is a good alternative in this case as well, even in comparison with methods that are optimal under the model. Moreover, we propose a method for estimating the accuracy of machine learning predictors, which makes it possible to compare their accuracy with that of classic methods, where accuracy is measured as in survey sampling practice. Solving this problem is indicated in the literature as one of the key issues in the integration of these approaches. The simulation studies are based on a real longitudinal dataset, freely available from the Polish Local Data Bank, where we consider the problem of predicting subpopulation characteristics in the last period while "borrowing strength" from other subpopulations and time periods.
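The survey-sampling notion of accuracy referred to above can be illustrated with a small Monte Carlo sketch (not the authors' method; the population, the pooled-regression "synthetic" predictor standing in for an ML predictor, and all parameters are invented): repeat the sampling design many times and measure the predictor's squared error against the fixed finite-population domain mean.

```python
import numpy as np

rng = np.random.default_rng(1)

# Artificial population: 4 subpopulations (domains), outcome linked to a covariate.
N, domains = 4000, 4
dom = rng.integers(domains, size=N)
x = rng.normal(loc=dom, scale=1.0, size=N)
y = 2.0 + 1.5 * x + rng.normal(size=N)
true_mean_d0 = y[dom == 0].mean()          # target: finite-population domain-0 mean

def synthetic_predictor(xs, ys, x_dom0_pop):
    """Fit a pooled regression on the sample ('borrowing strength' across
    domains) and predict the domain-0 mean from its population covariates."""
    slope, intercept = np.polyfit(xs, ys, 1)
    return intercept + slope * x_dom0_pop.mean()

# Design-based accuracy assessment: repeat the sampling, accumulate squared error.
errors = []
for _ in range(300):
    s = rng.choice(N, size=200, replace=False)
    pred = synthetic_predictor(x[s], y[s], x[dom == 0])
    errors.append((pred - true_mean_d0) ** 2)
mse = float(np.mean(errors))
print("Monte Carlo design-based MSE:", mse)
```

The point is the evaluation protocol, not the predictor: any ML predictor can be dropped into the loop and its design-based MSE compared with that of a classic small area estimator on the same draws.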


The Applicability of Federated Learning to Official Statistics

Stock, Joshua, Hauke, Oliver, Weißmann, Julius, Federrath, Hannes

arXiv.org Artificial Intelligence

This work investigates the potential of Federated Learning (FL) for official statistics and shows how well the performance of FL models can keep up with centralized learning methods. FL is particularly interesting for official statistics because its use can safeguard the privacy of data holders, thus facilitating access to a broader range of data. By simulating three different use cases, important insights into the applicability of the technology are gained. The use cases are based on a medical insurance data set, a fine dust pollution data set and a mobile radio coverage data set, all of which come from domains close to official statistics. We provide a detailed analysis of the results, including a comparison of centralized and FL algorithm performances for each simulation. In all three use cases, we were able to train models via FL that reach performance very close to the centralized model benchmarks. Our key observations and their implications for transferring the simulations into practice are summarized. We conclude that FL has the potential to emerge as a pivotal technology in future use cases of official statistics.
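The centralized-vs-federated comparison the study performs can be sketched in miniature with federated averaging (a FedAvg-style loop; the linear-regression clients standing in for the data holders are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Three data holders (e.g. separate institutions) with local linear-regression data.
def make_client(n):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ np.array([1.0, 3.0]) + 0.1 * rng.normal(size=n)
    return X, y

clients = [make_client(n) for n in (50, 80, 120)]

def local_update(w, X, y, lr=0.1, epochs=20):
    """A few local gradient-descent steps on the client's own data."""
    for _ in range(epochs):
        w = w - lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

# Federated averaging: broadcast, train locally, average weighted by data size.
w = np.zeros(2)
sizes = np.array([len(y) for _, y in clients], dtype=float)
for _ in range(30):                      # communication rounds
    local = [local_update(w.copy(), X, y) for X, y in clients]
    w = np.average(local, axis=0, weights=sizes)

# Centralized benchmark: least squares on the pooled data.
Xa = np.vstack([X for X, _ in clients])
ya = np.concatenate([y for _, y in clients])
w_central = np.linalg.lstsq(Xa, ya, rcond=None)[0]
print("federated:", w, " centralized:", w_central)
```

Only model weights cross institutional boundaries in the loop; the raw data stay with each holder, which is the privacy property that makes FL attractive for official statistics.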


Changing Data Sources in the Age of Machine Learning for Official Statistics

De Boom, Cedric, Reusens, Michael

arXiv.org Artificial Intelligence

Data science has become increasingly essential for the production of official statistics, as it enables the automated collection, processing, and analysis of large amounts of data. With such data science practices in place, reporting becomes more timely, more insightful and more flexible. However, the quality and integrity of data-science-driven statistics rely on the accuracy and reliability of the data sources and the machine learning techniques that support them. In particular, changes in data sources are inevitable and pose significant risks that are crucial to address in the context of machine learning for official statistics. This paper gives an overview of the main risks, liabilities, and uncertainties associated with changing data sources in the context of machine learning for official statistics. We provide a checklist of the most prevalent origins and causes of changing data sources, not only on a technical level but also regarding ownership, ethics, regulation, and public perception. Next, we highlight the repercussions of changing data sources on statistical reporting. These include technical effects such as concept drift, bias, availability, validity, accuracy and completeness, but also the neutrality and potential discontinuation of the statistical offering. We offer a few important precautionary measures, such as enhancing robustness in both data sourcing and statistical techniques, and thorough monitoring. In doing so, machine-learning-based official statistics can maintain integrity, reliability, consistency, and relevance in policy-making, decision-making, and public discourse.
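The kind of monitoring recommended above is often implemented with a simple distribution-shift statistic; the sketch below uses the Population Stability Index on invented data (the paper itself prescribes no particular statistic, so PSI is just one common choice):

```python
import numpy as np

def psi(reference, current, bins=10):
    """Population Stability Index between a reference and a current batch:
    PSI = sum((p - q) * ln(p / q)) over quantile bins of the reference."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    # Clip both batches into the reference range so every value is binned.
    p = np.histogram(np.clip(reference, edges[0], edges[-1]), bins=edges)[0] / len(reference)
    q = np.histogram(np.clip(current, edges[0], edges[-1]), bins=edges)[0] / len(current)
    p, q = np.clip(p, 1e-6, None), np.clip(q, 1e-6, None)
    return float(np.sum((p - q) * np.log(p / q)))

rng = np.random.default_rng(3)
reference = rng.normal(0, 1, 5000)            # behaviour of the source at launch
stable = rng.normal(0, 1, 5000)               # source still behaves as before
shifted = rng.normal(0.5, 1.2, 5000)          # source has drifted

psi_stable = psi(reference, stable)
psi_shifted = psi(reference, shifted)
print("PSI stable: ", psi_stable)             # near zero: no alarm
print("PSI shifted:", psi_shifted)            # clearly larger: flag for review
```

Run per feature on each incoming batch, a statistic like this turns "changes in data sources are inevitable" into an operational alert rather than a silent degradation of the published figures.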


How international collaboration is advancing machine learning in official statistics

#artificialintelligence

New technologies and data sources have tremendous potential to improve statistical production. They offer a way to generate statistics in a more timely, accurate and cost-efficient manner. Yet, keeping up with the pace of change is challenging, especially for National Statistical Organisations (NSOs) that must innovate with care to maintain a "gold standard" in their outputs. International cooperation between NSOs and other official statistical bodies is one way to help accelerate change in a responsible way. In 2021, the Office for National Statistics (ONS) and the United Nations Economic Commission for Europe (UNECE) Machine Learning Group (ML 2021) demonstrated the benefits of international cooperation for technological advance.


Join the ONS-UNECE Machine Learning Group 2021 Webinar 19 November 2021

#artificialintelligence

Machine Learning (ML) holds great potential for statistical organisations. It can make the production of statistics more efficient by automating specific processes or assisting humans in carrying out the process. It also allows statistical organisations to use new types of data such as social media data and imagery. Many national statistical offices (NSOs) are investigating how ML can be used to increase the relevance and quality of official statistics in an environment of growing demands for trusted information, rapidly developing and accessible technologies, and numerous competitors. Machine learning is revolutionising the way statistical organisations produce statistics.


Dynamic Question Ordering in Online Surveys

Early, Kirstin, Mankoff, Jennifer, Fienberg, Stephen E.

arXiv.org Machine Learning

Online surveys have the potential to support adaptive questions, where later questions depend on earlier responses. Past work has taken a rule-based approach, applied uniformly across all respondents. We envision a richer interpretation of adaptive questions, which we call dynamic question ordering (DQO), where question order is personalized. Such an approach could increase engagement, and therefore response rate, as well as imputation quality. We present a DQO framework to improve survey completion and imputation. In the general survey-taking setting, we want to maximize survey completion, and so we focus on ordering questions to engage the respondent and ideally collect all information, or at least the information that best characterizes the respondent, for accurate imputations. In another scenario, our goal is to provide a personalized prediction. Since it is possible to give reasonable predictions with only a subset of questions, we are not concerned with motivating users to answer all questions. Instead, we want to order questions to obtain the information that most reduces prediction uncertainty, while not being too burdensome. We illustrate this framework with an example of providing energy estimates to prospective tenants. We also discuss DQO for national surveys and consider connections between our statistics-based question-ordering approach and cognitive survey methodology.
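The uncertainty-reduction ordering described above can be sketched under a joint Gaussian model of questions and target (the variable names, echoing the energy-estimate example, and the covariance values are invented for illustration): greedily ask whichever question most shrinks the target's conditional variance.

```python
import numpy as np

# Joint covariance over three survey questions and the prediction target,
# as might be estimated from historical responses (illustrative values).
names = ["home_size", "occupants", "insulation", "energy_use"]
S = np.array([
    [1.0, 0.3, 0.2, 0.8],
    [0.3, 1.0, 0.1, 0.4],
    [0.2, 0.1, 1.0, 0.3],
    [0.8, 0.4, 0.3, 1.0],
])
target = 3

def cond_var(cov, t, answered):
    """Var(target | answered questions) under a joint Gaussian model."""
    if not answered:
        return cov[t, t]
    a = list(answered)
    return cov[t, t] - cov[t, a] @ np.linalg.solve(cov[np.ix_(a, a)], cov[a, t])

# Greedy dynamic ordering: always ask the question whose answer would most
# reduce the remaining predictive uncertainty about the target.
answered, order = [], []
remaining = [0, 1, 2]
while remaining:
    nxt = min(remaining, key=lambda q: cond_var(S, target, answered + [q]))
    order.append(names[nxt])
    answered.append(nxt)
    remaining.remove(nxt)
    print(f"ask {names[nxt]:10s} -> residual variance "
          f"{cond_var(S, target, answered):.3f}")
print("question order:", order)
```

In a personalized setting the covariance conditioning would use each respondent's actual answers so far, so different respondents can receive different orders; the survey can also stop early once the residual variance falls below an acceptable threshold, addressing the burden concern.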