RANK: Large-Scale Inference with Graphical Nonlinear Knockoffs

arXiv.org Machine Learning

Power and reproducibility are key to enabling refined scientific discoveries in contemporary big data applications with general high-dimensional nonlinear models. In this paper, we provide theoretical foundations on the power and robustness of the model-free knockoffs procedure introduced recently in Candès, Fan, Janson and Lv (2016) in the high-dimensional setting when the covariate distribution is characterized by a Gaussian graphical model. We establish that under mild regularity conditions, the power of the oracle knockoffs procedure with known covariate distribution in high-dimensional linear models is asymptotically one as the sample size goes to infinity. When moving away from the ideal case, we suggest a modified model-free knockoffs method called graphical nonlinear knockoffs (RANK) to accommodate the unknown covariate distribution. We provide theoretical justifications on the robustness of our modified procedure by showing that the false discovery rate (FDR) is asymptotically controlled at the target level and the power is asymptotically one with the estimated covariate distribution. To the best of our knowledge, this is the first formal theoretical result on the power of the knockoffs procedure. Simulation results demonstrate that compared to existing approaches, our method performs competitively in both FDR control and power. A real data set is analyzed to further assess the performance of the suggested knockoffs procedure.
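
The abstract above does not spell out how a knockoffs procedure turns feature statistics into a selected set. As a rough illustration only, the sketch below applies the standard "knockoff+" data-dependent threshold of Barber and Candès (2015) to a vector of statistics W, where a large positive W_j suggests that feature j is important. The function names and the way W is obtained are assumptions made for this example; it is not the RANK procedure itself, which additionally estimates the covariate distribution.

```python
import numpy as np

def knockoff_plus_threshold(W, q):
    """Smallest t with (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q.

    W : knockoff statistics, one per feature (hypothetical input here).
    q : target FDR level.
    Returns np.inf when no candidate threshold meets the bound.
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the distinct nonzero magnitudes of W.
    candidates = np.unique(np.abs(W[W != 0]))  # np.unique returns sorted values
    for t in candidates:
        fdp_estimate = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_estimate <= q:
            return float(t)
    return float("inf")

def knockoff_select(W, q):
    """Indices of features whose statistic reaches the threshold."""
    W = np.asarray(W, dtype=float)
    t = knockoff_plus_threshold(W, q)
    return np.flatnonzero(W >= t)

# Toy statistics, invented for illustration.
W_demo = [5.0, 4.0, 3.0, 2.5, 2.0, 1.5, -1.0, 0.5, -0.3, 0.2]
selected = knockoff_select(W_demo, q=0.2)
```

The "+1" in the numerator is what makes the estimate of the false discovery proportion conservative; dropping it gives the plain knockoff threshold, which controls a modified FDR instead.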


Unknown sparsity in compressed sensing: Denoising and inference

arXiv.org Machine Learning

The theory of Compressed Sensing (CS) asserts that an unknown signal $x\in\mathbb{R}^p$ can be accurately recovered from an underdetermined set of $n$ linear measurements with $n\ll p$, provided that $x$ is sufficiently sparse. However, in applications, the degree of sparsity $\|x\|_0$ is typically unknown, and the problem of directly estimating $\|x\|_0$ has been a longstanding gap between theory and practice. A closely related issue is that $\|x\|_0$ is a highly idealized measure of sparsity, and for real signals with entries not equal to 0, the value $\|x\|_0=p$ is not a useful description of compressibility. In our previous conference paper [Lop13] that examined these problems, we considered an alternative measure of "soft" sparsity, $\|x\|_1^2/\|x\|_2^2$, and designed a procedure to estimate $\|x\|_1^2/\|x\|_2^2$ that does not rely on sparsity assumptions. The present work offers a new deconvolution-based method for estimating unknown sparsity, which has wider applicability and sharper theoretical guarantees. In particular, we introduce a family of entropy-based sparsity measures $s_q(x):=\big(\frac{\|x\|_q}{\|x\|_1}\big)^{\frac{q}{1-q}}$ parameterized by $q\in[0,\infty]$. This family interpolates between $\|x\|_0=s_0(x)$ and $\|x\|_1^2/\|x\|_2^2=s_2(x)$ as $q$ ranges over $[0,2]$. For any $q\in (0,2]\setminus\{1\}$, we propose an estimator $\hat{s}_q(x)$ whose relative error converges at the dimension-free rate of $1/\sqrt{n}$, even when $p/n\to\infty$. Our main results also describe the limiting distribution of $\hat{s}_q(x)$, as well as some connections to Basis Pursuit Denoising, the Lasso, deterministic measurement matrices, and inference problems in CS.
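
To make the family $s_q$ concrete, here is a minimal sketch that evaluates the definition directly on a known vector. It only illustrates the quantity being estimated; it is not the paper's estimator $\hat{s}_q$, which works from linear measurements of $x$. The function name is an assumption, the case $q=0$ is handled through the identity $s_0(x)=\|x\|_0$ stated above, and the limiting cases $q=1$ and $q=\infty$ are deliberately left out.

```python
import numpy as np

def soft_sparsity(x, q):
    """Evaluate s_q(x) = (||x||_q / ||x||_1)^(q / (1 - q)) for a nonzero x.

    Supports q = 0 (count of nonzero entries) and finite q > 0 with q != 1.
    """
    x = np.abs(np.asarray(x, dtype=float))
    if q == 0:
        return float(np.count_nonzero(x))
    if q < 0 or q == 1 or not np.isfinite(q):
        raise ValueError("this sketch handles q = 0 or finite q > 0, q != 1")
    norm_q = np.sum(x ** q) ** (1.0 / q)
    norm_1 = np.sum(x)
    return float((norm_q / norm_1) ** (q / (1.0 - q)))

# A vector with k equal nonzero entries has s_q = k for every supported q.
x_flat = [3.0, 3.0, 3.0, 3.0, 0.0, 0.0, 0.0, 0.0]

# A vector with no exact zeros: ||x||_0 equals p = 8, while s_2 stays
# close to 1 because one entry dominates.
x_compressible = [10.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
s0 = soft_sparsity(x_compressible, 0)
s2 = soft_sparsity(x_compressible, 2)
```

For $q=2$ the exponent is $q/(1-q)=-2$, so the expression reduces to $\|x\|_1^2/\|x\|_2^2$, matching the "soft" sparsity measure of the earlier conference paper.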


A State-Space Approach to Dynamic Nonnegative Matrix Factorization

arXiv.org Machine Learning

Nonnegative matrix factorization (NMF) has been actively investigated and used in a wide range of problems in the past decade. A significant amount of attention has been given to developing NMF algorithms that are suitable for modeling time series with strong temporal dependencies. In this paper, we propose a novel state-space approach to perform dynamic NMF (D-NMF). In the proposed probabilistic framework, the NMF coefficients act as the state variables and their dynamics are modeled using a multi-lag nonnegative vector autoregressive (N-VAR) model within the process equation. We use expectation maximization and propose a maximum-likelihood estimation framework to estimate the basis matrix and the N-VAR model parameters. Interestingly, the N-VAR model parameters are obtained by simply applying NMF. Moreover, we derive a maximum a posteriori estimate of the state variables (i.e., the NMF coefficients) that is based on a prediction step and an update step, similarly to the Kalman filter. We illustrate the benefits of the proposed approach using different numerical simulations where D-NMF significantly outperforms its static counterpart. Experimental results for three different applications show that the proposed approach outperforms two state-of-the-art NMF approaches that exploit temporal dependencies, namely a nonnegative hidden Markov model and a frame stacking approach, while it requires less memory and computational power.
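
As a rough illustration of the prediction step mentioned above, the sketch below propagates NMF coefficients through a multi-lag N-VAR model, $h_t \approx \sum_{k=1}^{K} A_k h_{t-k}$, and maps the prediction to the observation space through a basis matrix. All names and toy values are assumptions for this example; the noise model, the update step, and the EM parameter estimation of the paper are not reproduced here.

```python
import numpy as np

def nvar_predict(lag_matrices, history):
    """One-step N-VAR prediction of the state (the NMF coefficients).

    lag_matrices : [A_1, ..., A_K], each a nonnegative (r x r) array.
    history      : [h_{t-1}, ..., h_{t-K}], most recent first, each of length r.
    Nonnegative matrices applied to nonnegative vectors keep the
    prediction nonnegative.
    """
    if len(lag_matrices) != len(history):
        raise ValueError("need exactly one past state per lag matrix")
    prediction = np.zeros(len(history[0]))
    for A_k, h_past in zip(lag_matrices, history):
        prediction += np.asarray(A_k, dtype=float) @ np.asarray(h_past, dtype=float)
    return prediction

def predict_observation(basis, lag_matrices, history):
    """Predicted nonnegative observation: basis matrix times predicted state."""
    return np.asarray(basis, dtype=float) @ nvar_predict(lag_matrices, history)

# Toy two-lag model with r = 2 components and 3 observed dimensions.
A_demo = [np.diag([0.5, 0.5]), np.diag([0.25, 0.25])]
history_demo = [np.array([2.0, 4.0]), np.array([4.0, 8.0])]  # h_{t-1}, h_{t-2}
basis_demo = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

h_pred = nvar_predict(A_demo, history_demo)
v_pred = predict_observation(basis_demo, A_demo, history_demo)
```

In a full filter this prediction would then be corrected against the newly observed frame, which is the role the abstract assigns to the update step.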


Deepening Data Capital Through Cloud-Based Machine Learning and Artificial Intelligence - Wikibon Research

@machinelearnbot

Business data is a capital asset. That's because, in a classical economic framework, data is a factor of production, is not depleted in the process of production, and gains value from human inputs, contributions, and sweat equity. As a capital asset, data's value goes far beyond the sunk cost of its physical instantiation in a database or even the cost of restoring it if it were to be lost, stolen, or corrupted. Data's full value resides in the full range of potential business decisions, processes, engagements, and outcomes that it might support. And its value in those derives in great part from the data-driven insights that can be unlocked through analytic tools.


Introducing the Yahoo News Ranked Multi-label Corpus, a Novel Dataset to Improve Multilabel Learning

@machinelearnbot

Most content-based websites, like Yahoo News, HuffPost, or any given news site, organize their stories according to subject matter or in some similar way. You can imagine that websites with a huge number of stories need an automated method to filter or categorize them as the content is ingested into their systems. For example, algorithms that power Yahoo News label news articles with tags (e.g., Military conflict, Nuclear policy, Refugees) as they are ingested, and then display the content by subject matter and/or on a personalized feed. This process of labeling content with all its relevant tags is known as Multilabel Learning (MLL). Up to now, whenever scientists and engineers have used MLL to create their own specific models to label content however they like, they have used datasets that have pre-computed features like bag-of-words, or dense representations like doc2vec.
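
To show what "pre-computed features" and multilabel targets look like in practice, here is a minimal sketch using only the Python standard library. It turns one article into a bag-of-words count vector and its tags into a multi-hot label vector. The vocabulary, the example sentence, and the function names are invented for illustration; only the three tag names come from the text above, and nothing here reflects the actual format of the Yahoo News corpus.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def multi_hot(tags, label_set):
    """Encode a set of tags as a 0/1 vector over a fixed label ordering."""
    return [1 if label in tags else 0 for label in label_set]

# Hypothetical vocabulary and article; the labels are the tags quoted above.
vocabulary = ["border", "treaty", "troops", "refugees"]
label_set = ["Military conflict", "Nuclear policy", "Refugees"]

article = "Troops moved toward the border as refugees fled the border towns"
article_tags = {"Military conflict", "Refugees"}

features = bag_of_words(article, vocabulary)
targets = multi_hot(article_tags, label_set)
```

Unlike single-label classification, the target vector may contain several ones at once, which is what distinguishes a multilabel problem.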


Amazon and Microsoft agree their voice assistants will talk (to each other)

#artificialintelligence

Those betting big on AI making voice the dominant user interface of the future are not betting so big as to believe their respective artificially intelligent voice assistants will be the sole vocal oracle that Internet users want or need. And so Microsoft's Satya Nadella and Amazon's Jeff Bezos are today announcing a tie-up, which will -- at an unspecified point later this year -- enable users of the latter's Alexa voice assistant to ask her to summon Microsoft's Cortana voice assistant to ask it to do stuff, and vice versa. Here are the pair's respective statements on the move: Quoth Satya Nadella, CEO, Microsoft: "Ensuring Cortana is available for our customers everywhere and across any device is a key priority for us. Bringing Cortana's knowledge, Office 365 integration, commitments, and reminders to Alexa is a great step toward that goal." Said Jeff Bezos, founder and CEO, Amazon: "The world is big and so multifaceted.


These Researchers Are Using AI and Bitcoin to Save Lives

#artificialintelligence

Sex trafficking is a serious issue, and it's made all the more difficult to stop due to the challenge of identifying its victims and perpetrators. However, a new tool that combines artificial intelligence (AI) and bitcoin may help us to end this illegal practice. Developed by PhD Candidate Rebecca Portnoff and her team at the University of California, the tool uses machine learning to sift through thousands upon thousands of online sex ads in order to identify patterns that can help investigators. The digital currency bitcoin is the most common form of payment used by sex traffickers on Backpage, a website largely associated with sex trafficking. In 2015, credit card companies began preventing people from using credit/debit cards on the site, forcing people to use bitcoin.


Machine Learning Part I: The Terminator Meets the Internet of Things

#artificialintelligence

The internet of things (IoT), machine learning, deep learning, and artificial intelligence (AI) are concepts you've probably heard or read about, but chances are you may not fully understand their differences and the impact they can have on your business. This blog series will help break down how your data is handled, starting at the beginning with how the IoT has revolutionized how we interact with technology, all the way through to AI, and into our ever-evolving future. In its simplest form, the Internet of Things (IoT) is an internal network of devices that communicate, share, and interpret exchanged data. We will uncover the next step of the process, IoT data collection, in the next blog post in this series.


Machine Learning Part I: The Terminator Meets the Internet of Things

#artificialintelligence

The internet of things (IoT), machine learning, deep learning, and artificial intelligence (AI) are concepts you've probably heard or read about, but chances are you may not fully understand their differences and the impact they can have on your business. We're here to help explain how they relate to each other, how they each manipulate your data, and why they matter. This blog series will help break down how your data is handled, starting at the beginning with how the IoT has revolutionized how we interact with technology, all the way through to AI, and into our ever-evolving future. To make things a little more interesting, we'll also talk about one of our favorite movies, The Terminator. If you haven't seen it, you may want to go stream it now.


Interview: How Connected Cars Can Learn from Fintech

@machinelearnbot

Hello David and Sam, A good way to prevent attackers from gaining control of a car or its systems is to only allow access to the data, and only indirectly. Why should any person or program ever have access to the car itself, or any of its devices? Mr. Shawki's thinking is correct here: when it comes to security for the IoT, legacy ideas have to give way to different thinking. The traditional mind-set among financial and industrial applications is that connected systems must follow a server-client model. The server has the data, and the client needs to connect to the server to get it.