Network topology identification (TI) is an essential function for distributed energy resources management systems (DERMS) to organize and operate widespread distributed energy resources (DERs). In this paper, discriminant analysis (DA) is deployed to develop a network TI function that relies only on the measurements available to DERMS. The propounded method is able to identify the network switching configuration, as well as the status of protective devices. Following, to improve the TI resiliency against the interruption of communication channels, a quadratic programming optimization approach is proposed to recover the missing signals. By deploying the propounded data recovery approach and Bayes' theorem together, a benchmark is developed afterward to identify anomalous measurements. This benchmark can make the TI function resilient against cyber-attacks. Having a low computational burden, this approach is fast-track and can be applied in real-time applications. Sensitivity analysis is performed to assess the contribution of different measurements and the impact of the system load type and loading level on the performance of the proposed approach.
Kairouz, Peter, McMahan, H. Brendan, Avent, Brendan, Bellet, Aurélien, Bennis, Mehdi, Bhagoji, Arjun Nitin, Bonawitz, Keith, Charles, Zachary, Cormode, Graham, Cummings, Rachel, D'Oliveira, Rafael G. L., Rouayheb, Salim El, Evans, David, Gardner, Josh, Garrett, Zachary, Gascón, Adrià, Ghazi, Badih, Gibbons, Phillip B., Gruteser, Marco, Harchaoui, Zaid, He, Chaoyang, He, Lie, Huo, Zhouyuan, Hutchinson, Ben, Hsu, Justin, Jaggi, Martin, Javidi, Tara, Joshi, Gauri, Khodak, Mikhail, Konečný, Jakub, Korolova, Aleksandra, Koushanfar, Farinaz, Koyejo, Sanmi, Lepoint, Tancrède, Liu, Yang, Mittal, Prateek, Mohri, Mehryar, Nock, Richard, Özgür, Ayfer, Pagh, Rasmus, Raykova, Mariana, Qi, Hang, Ramage, Daniel, Raskar, Ramesh, Song, Dawn, Song, Weikang, Stich, Sebastian U., Sun, Ziteng, Suresh, Ananda Theertha, Tramèr, Florian, Vepakomma, Praneeth, Wang, Jianyu, Xiong, Li, Xu, Zheng, Yang, Qiang, Yu, Felix X., Yu, Han, Zhao, Sen
Federated learning (FL) is a machine learning setting where many clients (e.g. mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g. service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches. Motivated by the explosive growth in FL research, this paper discusses recent advances and presents an extensive collection of open problems and challenges.
Such information includes: the database in modern hospital systems, usually known as Electronic Health Records (EHR), which store the patients' diagnosis, medication, laboratory test results, medical image data, etc.; information on various health behaviors tracked and stored by wearable devices, ubiquitous sensors and mobile applications, such as the smoking status, alcoholism history, exercise level, sleeping conditions, etc.; information collected by census or various surveys regarding sociodemographic factors of the target cohort; and information on people's mental health inferred from their social media activities or social networks such as Twitter, Facebook, etc. These health-related data come from heterogeneous sources, describe assorted aspects of the individual's health conditions. Such data is rich in structure and information which has great research potentials for revealing unknown medical knowledge about genomic epidemiology, disease developments and correlations, drug discoveries, medical diagnosis, mental illness prevention, health behavior adaption, etc. In real-world problems, the number of features relating to a certain health condition could grow exponentially with the development of new information techniques for collecting and measuring data. To reveal the causal influence between various factors and a certain disease or to discover the correlations among diseases from data at such a tremendous scale, requires the assistance of advanced information technology such as data mining, machine learning, text mining, etc. Machine learning technology not only provides a way for learning qualitative relationships among features and patients, but also the quantitative parameters regarding the strength of such correlations.
The vast majority of network datasets contains errors and omissions, although this is rarely incorporated in traditional network analysis. Recently, an increasing effort has been made to fill this methodological gap by developing network reconstruction approaches based on Bayesian inference. These approaches, however, rely on assumptions of uniform error rates and on direct estimations of the existence of each edge via repeated measurements, something that is currently unavailable for the majority of network data. Here we develop a Bayesian reconstruction approach that lifts these limitations by not only allowing for heterogeneous errors, but also for individual edge measurements without direct error estimates. Our approach works by coupling the inference approach with structured generative network models, which enable the correlations between edges to be used as reliable error estimates. Although our approach is general, we focus on the stochastic block model as the basic generative process, from which efficient nonparametric inference can be performed, and yields a principled method to infer hierarchical community structure from noisy data. We demonstrate the efficacy of our approach with a variety of empirical and artificial networks.
Today, the amount of data that can be retrieved from communications networks is extremely high and diverse (e.g., data regarding users behavior, traffic traces, network alarms, signal quality indicators, etc.). Advanced mathematical tools are required to extract useful information from this large set of network data. In particular, Machine Learning (ML) is regarded as a promising methodological area to perform network-data analysis and enable, e.g., automatized network self-configuration and fault management. In this survey we classify and describe relevant studies dealing with the applications of ML to optical communications and networking. Optical networks and system are facing an unprecedented growth in terms of complexity due to the introduction of a huge number of adjustable parameters (such as routing configurations, modulation format, symbol rate, coding schemes, etc.), mainly due to the adoption of, among the others, coherent transmission/reception technology, advanced digital signal processing and to the presence of nonlinear effects in optical fiber systems. Although a good number of research papers have appeared in the last years, the application of ML to optical networks is still in its early stage. In this survey we provide an introductory reference for researchers and practitioners interested in this field. To stimulate further work in this area, we conclude the paper proposing new possible research directions.
We present a new approach to estimating the interdependence of industries in an economy by applying data science solutions. By exploiting interfirm buyer--seller network data, we show that the problem of estimating the interdependence of industries is similar to the problem of uncovering the latent block structure in network science literature. To estimate the underlying structure with greater accuracy, we propose an extension of the sparse block model that incorporates node textual information and an unbounded number of industries and interactions among them. The latter task is accomplished by extending the well-known Chinese restaurant process to two dimensions. Inference is based on collapsed Gibbs sampling, and the model is evaluated on both synthetic and real-world datasets. We show that the proposed model improves in predictive accuracy and successfully provides a satisfactory solution to the motivated problem. We also discuss issues that affect the future performance of this approach.
Many aspects of the design of efficient crowdsourcing processes, such as defining workers bonuses, fair prices and time limits of the tasks, involve knowledge of the likely duration of the task at hand. In this work we introduce a new timesensitive Bayesian aggregation method that simultaneously estimates a tasks duration and obtains reliable aggregations of crowdsourced judgments. Our method, called BCCTime, uses latent variables to represent the uncertainty about the workers completion time, the tasks duration and the workers accuracy. To relate the quality of a judgment to the time a worker spends on a task, our model assumes that each task is completed within a latent time window within which all workers with a propensity to genuinely attempt the labelling task (i.e., no spammers) are expected to submit their judgments. In contrast, workers with a lower propensity to valid labelling, such as spammers, bots or lazy labellers, are assumed to perform tasks considerably faster or slower than the time required by normal workers. Specifically, we use efficient message-passing Bayesian inference to learn approximate posterior probabilities of (i) the confusion matrix of each worker, (ii) the propensity to valid labelling of each worker, (iii) the unbiased duration of each task and (iv) the true label of each task. Using two real- world public datasets for entity linking tasks, we show that BCCTime produces up to 11% more accurate classifications and up to 100% more informative estimates of a tasks duration compared to stateoftheart methods.
When can reliable inference be drawn in the "Big Data" context? This paper presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large scale inference. In large scale data applications like genomics, connectomics, and eco-informatics the dataset is often variable-rich but sample-starved: a regime where the number $n$ of acquired samples (statistical replicates) is far fewer than the number $p$ of observed variables (genes, neurons, voxels, or chemical constituents). Much of recent work has focused on understanding the computational complexity of proposed methods for "Big Data." Sample complexity however has received relatively less attention, especially in the setting when the sample size $n$ is fixed, and the dimension $p$ grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime where both variable dimension and sample size go to infinity at comparable rates; 3) the purely high dimensional asymptotic regime where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche but only the latter regime applies to exa-scale data dimension. We illustrate this high dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that are of interest. We demonstrate various regimes of correlation mining based on the unifying perspective of high dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.
A standard technique for understanding underlying dependency structures among a set of variables posits a shared conditional probability distribution for the variables measured on individuals within a group. This approach is often referred to as module networks, where individuals are represented by nodes in a network, groups are termed modules, and the focus is on estimating the network structure among modules. However, estimation solely from node-specific variables can lead to spurious dependencies, and unverifiable structural assumptions are often used for regularization. Here, we propose an extended model that leverages direct observations about the network in addition to node-specific variables. By integrating complementary data types, we avoid the need for structural assumptions. We illustrate theoretical and practical significance of the model and develop a reversible-jump MCMC learning procedure for learning modules and model parameters. We demonstrate the method accuracy in predicting modular structures from synthetic data and capability to learn influence structures in twitter data and regulatory modules in the Mycobacterium tuberculosis gene regulatory network.
In probabilistic approaches to classification and information extraction, one typically builds a statistical model of words under the assumption that future data will exhibit the same regularities as the training data. In many data sets, however, there are scope-limited features whose predictive power is only applicable to a certain subset of the data. For example, in information extraction from web pages, word formatting may be indicative of extraction category in different ways on different web pages. The difficulty with using such features is capturing and exploiting the new regularities encountered in previously unseen data. In this paper, we propose a hierarchical probabilistic model that uses both local/scope-limited features, such as word formatting, and global features, such as word content. The local regularities are modeled as an unobserved random parameter which is drawn once for each local data set. This random parameter is estimated during the inference process and then used to perform classification with both the local and global features--- a procedure which is akin to automatically retuning the classifier to the local regularities on each newly encountered web page. Exact inference is intractable and we present approximations via point estimates and variational methods. Empirical results on large collections of web data demonstrate that this method significantly improves performance from traditional models of global features alone.