Recent developments in SCADA (Supervisory Control and Data Acquisition) systems for physical infrastructure, such as high-pressure gas pipeline systems and electric grids, have generated enormous amounts of time series data. These data open great opportunities for advanced knowledge discovery and data mining methods to identify system failures faster and earlier than operations experts can. This paper presents our work, in collaboration with a utility company, on a grand challenge: using advanced data mining methods to detect leaks on a high-pressure gas transmission system. We developed unsupervised leak detection models that analyze billions of data records to identify leaks of different sizes and impacts with very low false positive rates. In particular, our solution was able to identify small leaks leading to rupture events, as well as small leaks that current detection systems cannot identify. Such high-fidelity early identification enables operations personnel to take preventive measures against possible catastrophic events. We then formulate several generic detection methods with models derived from time series anomaly detection methods. We show that our leak detection models are superior to the SCADA alarm system, a mass balance model, and other generic time series anomaly detection models in terms of both detection accuracy and computation time.
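The abstract names a mass balance model as one of the baselines. The following is a minimal sketch of that general idea, not the model evaluated in the paper: it flags time steps where the rolling mean of inflow minus outflow exceeds a threshold. The function name, the window length, the threshold, and the synthetic flow values are all assumptions made for illustration, and the sketch ignores changes in line pack, which a real pipeline model would have to account for.

```python
from collections import deque
from typing import List, Sequence


def mass_balance_alarms(
    inflow: Sequence[float],
    outflow: Sequence[float],
    window: int = 5,
    threshold: float = 1.0,
) -> List[int]:
    """Return time indices where the rolling mean imbalance exceeds threshold.

    Hypothetical helper: imbalance is simply inflow - outflow per time step,
    averaged over the last `window` samples to damp measurement noise.
    """
    if len(inflow) != len(outflow):
        raise ValueError("inflow and outflow must have the same length")
    recent = deque(maxlen=window)  # holds the last `window` imbalances
    alarms = []
    for t, (q_in, q_out) in enumerate(zip(inflow, outflow)):
        recent.append(q_in - q_out)
        # Only evaluate once a full window of samples is available.
        if len(recent) == window and sum(recent) / window > threshold:
            alarms.append(t)
    return alarms


# Synthetic example: a leak of 3 flow units begins at t = 10.
inflow = [100.0] * 20
outflow = [100.0] * 10 + [97.0] * 10
alarms = mass_balance_alarms(inflow, outflow, window=5, threshold=1.0)
```

In this synthetic run the first alarm is raised at t = 11, one sample after the leak starts, because the rolling mean needs two leaking samples to cross the threshold. A longer window lowers the false positive rate at the cost of a longer detection delay, which is the trade-off that makes small leaks hard for this kind of baseline.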
A commonly used stochastic model for derivative and commodity market analysis is the Barndorff-Nielsen and Shephard (BN-S) model. Though this model is very efficient and analytically tractable, it suffers from the absence of long-range dependence, among other issues. In this paper, the analysis is restricted to crude oil price dynamics. A simple way of improving the BN-S model through the implementation of various machine learning algorithms is proposed. This refined BN-S model is more efficient and has fewer parameters than other models that are used in practice as improvements of the BN-S model. The procedure and the model show the application of data science for extracting a "deterministic component" out of processes that are usually considered to be completely stochastic. Empirical applications validate the efficacy of the proposed model for long-range dependence.
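For orientation, the classical BN-S model is usually written as below. The notation follows the common textbook form and is an assumption here, since the abstract does not state the equations; the refined model proposed in the paper is not reproduced.

```latex
\begin{align}
  dX_t &= (\mu + \beta \sigma_t^2)\,dt + \sigma_t\,dW_t + \rho\,dZ_{\lambda t}, \\
  d\sigma_t^2 &= -\lambda \sigma_t^2\,dt + dZ_{\lambda t}, \qquad \sigma_0^2 > 0.
\end{align}
```

Here $X_t$ is the log price, $W$ is a Brownian motion, $Z$ is a subordinator independent of $W$, $\lambda > 0$, and $\rho \le 0$ captures leverage. In the stationary case the autocorrelation of $\sigma_t^2$ at lag $h$ is $e^{-\lambda |h|}$, which decays exponentially and is the sense in which the classical model lacks long-range dependence.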
To realize efficient computational fluid dynamics (CFD) prediction of two-phase flow, this paper proposes a multi-scale framework based on a physics-guided data-driven approach. Central to this framework is a Feature Similarity Measurement (FSM) technique, developed for error estimation in two-phase flow simulation using coarse-mesh CFD, so that fast-running coarse-mesh simulations can achieve accuracy comparable to that of fine-mesh simulations. By defining physics-guided parameters and variable gradients as physical features, FSM can capture the underlying local patterns in the coarse-mesh CFD simulation. Massive low-fidelity data and the corresponding high-fidelity data are used to explore the underlying information relevant to the main simulation errors and the effects of phenomenological scaling. By learning from previous simulation data, a surrogate model using a deep feedforward neural network (DFNN) can be developed and trained to estimate the simulation error of coarse-mesh CFD. The research documented here supports the feasibility of physics-guided deep learning methods for coarse-mesh CFD simulations, which have potential for efficient industrial design.
We consider Bayesian analysis of a class of multiple changepoint models. While there are a variety of efficient ways to analyse these models if the parameters associated with each segment are independent, there are few general approaches for models where the parameters are dependent. Under the assumption that the dependence is Markov, we propose an efficient online algorithm for sampling from an approximation to the posterior distribution of the number and position of the changepoints. In a simulation study, we show that the error introduced by the approximation is negligible. We illustrate the power of our approach through fitting piecewise polynomial models to data, under a model which allows for either continuity or discontinuity of the underlying curve at each changepoint. This method is competitive with, or outperforms, other methods for inferring curves from noisy data, and, uniquely, it allows for inference of the locations of discontinuities in the underlying curve.
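To make the notion of a changepoint concrete, here is a deliberately simplified sketch. It is not the Bayesian online algorithm described above: it swaps in an exhaustive least-squares search for a single changepoint in a piecewise constant signal (a degree-zero polynomial on each segment, with a discontinuity at the change). The function names and the data are hypothetical.

```python
from typing import Sequence, Tuple


def sse(values: Sequence[float]) -> float:
    """Sum of squared deviations of `values` from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_single_changepoint(y: Sequence[float]) -> Tuple[int, float]:
    """Return (k, cost) where splitting y into y[:k] and y[k:] minimizes
    the total within-segment squared error.

    Illustrative only: one changepoint, constant mean per segment,
    no prior, no posterior sampling.
    """
    if len(y) < 2:
        raise ValueError("need at least two observations")
    best_k, best_cost = 1, float("inf")
    for k in range(1, len(y)):  # both segments must be non-empty
        cost = sse(y[:k]) + sse(y[k:])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost


# Synthetic data with a jump between index 3 and index 4.
y = [0.0, 0.1, -0.1, 0.0, 5.0, 5.1, 4.9, 5.0]
k, cost = best_single_changepoint(y)
```

On this data the search returns k = 4, the first index of the second segment. Unlike the approach in the abstract, this sketch gives a single point estimate with no measure of uncertainty, handles only one changepoint, and treats the two segments as unrelated.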
A fundamental objective in reinforcement learning is the maintenance of a proper balance between exploration and exploitation. This problem becomes more challenging when the agent can only partially observe the states of its environment. In this paper we propose a dual-policy method for jointly learning the agent behavior and the balance between exploration and exploitation in partially observable environments. The method subsumes traditional exploration, in which the agent takes actions to gather information about the environment, and active learning, in which the agent queries an oracle for optimal actions (with an associated cost for employing the oracle). The form of the employed exploration is dictated by the specific problem. Theoretical guarantees are provided concerning the optimality of the balancing of exploration and exploitation. The effectiveness of the method is demonstrated by experimental results on benchmark problems.
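The exploration and exploitation trade-off can be illustrated with a much simpler technique than the dual-policy method above: epsilon-greedy action selection on a fully observable Bernoulli bandit. The sketch below is only that baseline, with a fixed exploration rate rather than a learned balance. The function name, arm probabilities, step count, and seed are all hypothetical choices.

```python
import random
from typing import List, Tuple


def epsilon_greedy(
    success_probs: List[float],
    steps: int = 2000,
    epsilon: float = 0.1,
    seed: int = 0,
) -> Tuple[List[int], List[float]]:
    """Run epsilon-greedy on a Bernoulli bandit.

    Returns (pull counts per arm, estimated mean reward per arm).
    """
    rng = random.Random(seed)  # local generator keeps runs reproducible
    n_arms = len(success_probs)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest current estimate.
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the sample mean for the chosen arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates


counts, estimates = epsilon_greedy([0.2, 0.8])
```

With a fixed epsilon the agent keeps exploring at the same rate even after the better arm is clear. Adapting that rate to the problem, and doing so under partial observability, is the part the abstract's method addresses and this sketch does not.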