Imputation of missing sub-hourly precipitation data in a large sensor network: a machine learning approach
Chivers, Benedict Delahaye, Wallbank, John, Cole, Steven J., Sebek, Ondrej, Stanley, Simon, Fry, Matthew, Leontidis, Georgios
Precipitation data collected at sub-hourly resolution presents specific challenges for missing data recovery because it is largely stochastic in nature and highly unbalanced in the duration of rain vs non-rain. Here we present a two-step analysis utilising current machine learning techniques for imputing precipitation data sampled at 30-minute intervals by dividing the task into (a) the classification of rain or non-rain samples, and (b) regressing the absolute values of predicted rain samples. Investigating 37 weather stations in the UK, this machine learning process produces more accurate predictions for recovering precipitation data than an established surface fitting technique utilising neighbouring rain gauges. Increasing the features available for training the machine learning algorithms increases performance, with the integration of weather data at the target site with externally sourced rain gauges providing the highest performance. This method informs machine learning models by utilising information in concurrently collected environmental data to make accurate predictions of missing rain data. Capturing complex nonlinear relationships from weakly correlated variables is critical for data recovery at sub-hourly resolutions. Such pipelines for data recovery can be developed and deployed for highly automated and near-instantaneous imputation of missing values in ongoing datasets at high temporal resolutions.

Keywords: machine learning, data imputation, gradient boosted trees, environmental sensor networks, precipitation, soil moisture

1. Introduction

Precipitation data is of critical importance across multiple lines of enquiry, informing statistical models and analysis relating to weather forecasting, extreme weather events, climate change, water-resource management, droughts, flooding, agricultural impact, and hydroelectric power.
Historical rainfall data can reveal long-term trends in environmental hydrological issues, while real-time data input allows for immediate forecasting of future conditions. Distributed networks of rain gauges are typically used to provide precipitation data at the Earth's surface at varying temporal resolutions and can cover large geographical areas (Kidd, 2001). As is the case in many databases, particularly those utilising physical sensors, the problem of missing data arises. Missing data can be a result of sensor failure, data storage/transmission failure, or post-collection quality control procedures that remove identified problem data (Blenkinsop et al., 2017). Missing data in precipitation databases represents a serious limitation for the effective use of the data. Given the global scale and importance of precipitation and meteorological data (Sun et al., 2018), developing solutions to missing data is of paramount importance for maximising information gain.
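The two-step scheme described in the abstract, (a) classify each 30-minute sample as rain or non-rain and then (b) regress an amount only for samples predicted as rain, can be illustrated with a minimal sketch. The models below are deliberately simple stand-ins (a threshold on the mean of neighbouring gauges and a least-squares scale factor), not the gradient boosted trees named in the keywords. All function names and the toy gauge values are hypothetical and chosen only for illustration.

```python
from statistics import mean


def fit_rain_threshold(neighbours, target):
    """Step (a) stand-in: choose the neighbour-mean threshold that best
    separates rain (target > 0) from non-rain in the training samples."""
    means = [mean(row) for row in neighbours]
    labels = [t > 0 for t in target]

    def n_correct(threshold):
        return sum((m > threshold) == y for m, y in zip(means, labels))

    return max(sorted(set(means)), key=n_correct)


def fit_rain_scale(neighbours, target):
    """Step (b) stand-in: least-squares slope through the origin,
    fitted on rain samples only."""
    pairs = [(mean(row), t) for row, t in zip(neighbours, target) if t > 0]
    numerator = sum(m * t for m, t in pairs)
    denominator = sum(m * m for m, _ in pairs)
    return numerator / denominator if denominator else 0.0


def impute(neighbours, threshold, scale):
    """Classify each sample first; regress an amount only where rain is predicted."""
    filled = []
    for row in neighbours:
        m = mean(row)
        filled.append(scale * m if m > threshold else 0.0)
    return filled


# Toy training data (made up): rainfall in mm per 30 minutes at two
# neighbouring gauges, and at the target gauge.
train_neighbours = [[0.0, 0.0], [0.0, 0.2], [1.0, 1.4], [2.0, 2.0], [0.0, 0.0], [0.4, 0.8]]
train_target = [0.0, 0.0, 1.2, 2.0, 0.0, 0.6]

threshold = fit_rain_threshold(train_neighbours, train_target)
scale = fit_rain_scale(train_neighbours, train_target)

# Two time steps where the target gauge is missing.
missing_neighbours = [[0.0, 0.1], [1.0, 3.0]]
filled = impute(missing_neighbours, threshold, scale)
print(filled)
```

Separating the two steps means the regressor is fitted only on rain samples, so the large number of zero-rain intervals does not dominate the fit. In a fuller pipeline each stand-in could be replaced by a trained classifier and regressor that also take the concurrently collected weather variables at the target site as features.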
May-2-2020