Performance Analysis
Prepaid or Postpaid? That is the question. Novel Methods of Subscription Type Prediction in Mobile Phone Services
Liao, Yongjun, Du, Wei, Karsai, Márton, Sarraute, Carlos, Minnoni, Martin, Fleury, Eric
In this paper we investigate the behavioural differences between mobile phone customers with prepaid and postpaid subscriptions. Our study reveals that (a) postpaid customers are more active in terms of service usage and (b) there are strong structural correlations in the mobile phone call network, as connections between customers of the same subscription type are much more frequent than those between customers of different subscription types. Based on these observations we provide methods to detect the subscription type of customers by using information about their personal call statistics together with their egocentric networks. The key to our first approach is to cast this classification problem as a problem of graph labelling, which can be solved by max-flow min-cut algorithms. Our experiments show that, by using both user attributes and relationships, the proposed graph labelling approach is able to achieve a classification accuracy of $\sim 87\%$, which outperforms supervised learning methods using only user attributes by $\sim 7\%$. In our second problem we aim to infer the subscription type of customers of external operators. We propose to solve this problem with approximate methods that use node attributes, and with a two-way indirect inference method based on observed homophilic structural correlations. Our results have straightforward applications in behavioural prediction and personal marketing.
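The graph labelling idea can be illustrated with a generic binary min-cut formulation. The abstract does not give the authors' energy function, so the sketch below is an assumption throughout: each customer has a hypothetical cost for being labelled prepaid or postpaid (standing in for an attribute-based classifier), each call-graph edge has a hypothetical penalty paid when its endpoints receive different labels, and a small Edmonds-Karp max-flow finds the cheapest labelling.

```python
from collections import defaultdict, deque


def min_cut_labels(unary, edges, source="PRE", sink="POST"):
    """Label nodes 'prepaid' or 'postpaid' by a minimum s-t cut.

    unary: {node: (cost_if_prepaid, cost_if_postpaid)}  (hypothetical costs)
    edges: [(u, v, penalty)] paid when u and v get different labels.
    """
    cap = defaultdict(lambda: defaultdict(float))
    for node, (cost_pre, cost_post) in unary.items():
        # A node left on the source side is labelled prepaid and cuts node->sink.
        cap[source][node] += cost_post
        cap[node][sink] += cost_pre
    for u, v, penalty in edges:
        cap[u][v] += penalty
        cap[v][u] += penalty

    def reachable_from_source():
        # Breadth-first search over edges with remaining residual capacity.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    if v == sink:
                        return parent
                    queue.append(v)
        return parent

    while True:
        parent = reachable_from_source()
        if sink not in parent:
            break
        # Recover the augmenting path and push its bottleneck flow.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= flow
            cap[v][u] += flow

    # Nodes still reachable from the source form the prepaid side of the cut.
    return {n: "prepaid" if n in parent else "postpaid" for n in unary}


# Toy example: c leans prepaid on attributes alone, but calls b (postpaid) a lot.
unary = {"a": (0.1, 0.9), "b": (0.9, 0.1), "c": (0.45, 0.55)}
with_graph = min_cut_labels(unary, [("b", "c", 0.5)])
attributes_only = min_cut_labels(unary, [])
```

In this toy case the edge penalty flips customer c from prepaid to postpaid, which is the kind of homophily effect the abstract describes; the numbers are illustrative and not taken from the paper.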
Cisco Embraces Machine Learning to Maintain Its Dominance -- The Motley Fool
Networking hardware giant Cisco Systems (NASDAQ:CSCO) has managed to maintain its dominance in the switching and routing markets despite significant shifts in the networking landscape. Software-defined networking and cloud computing have both threatened Cisco's business model of selling expensive, proprietary boxes in recent years. Even with these challenges, Cisco controlled 55.6% of the Ethernet switching market and 45% of the combined service provider and enterprise router market during the fourth quarter of 2016. This dominance doesn't mean that Cisco can sit on its hands. The company has been growing its software and services businesses, aiming to become a seller of solutions, not just hardware.
How Feature Engineering Can Help You Do Well in a Kaggle Competition – Part 2
In the first part of this series, I introduced the Outbrain Click Prediction machine learning competition. That post described some preliminary and important data science tasks like exploratory data analysis and feature engineering performed for the competition, using a Spark cluster deployed on Google Dataproc. In this post, I describe the competition evaluation, the design of my cross-validation strategy and my baseline models using statistics and tree ensembles. In that competition, Kagglers were required to rank recommended ads by decreasing predicted likelihood of being clicked. Sponsored search advertising, contextual advertising, display advertising and real-time bidding auctions have all relied heavily on the ability of learned models to predict ad click-through rates (CTRs) accurately, quickly and reliably.
What Is Steganography?
You know all too well at this point that all sorts of digital attacks are lurking on the internet. You could encounter ransomware, a virus, or a sketchy phish at any moment. Even creepier, though, some malicious code can actually hide inside other, benign software--and be programmed to jump out when you aren't expecting it. Hackers are increasingly using this technique, known as steganography, to trick internet users and smuggle malicious payloads past security scanners and firewalls. Unlike cryptography, which works to obscure content so it can't be understood, steganography's goal is to hide the fact that content exists at all by embedding it in something else. And since steganography is a concept, not a specific method of clandestine data delivery, it can be used in all sorts of ingenious (and worrying) attacks.
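To make "embedding it in something else" concrete, here is a minimal sketch of one classic textbook technique, least-significant-bit (LSB) embedding. It is only an illustration and is not described in the article: the cover data is assumed to be a plain byte sequence (for example raw pixel values), and the function names are hypothetical.

```python
def embed(cover: bytes, secret: bytes) -> bytes:
    """Hide `secret` in the lowest bit of successive cover bytes."""
    bits = [(byte >> shift) & 1 for byte in secret for shift in range(7, -1, -1)]
    if len(bits) > len(cover):
        raise ValueError("cover is too small for this secret")
    stego = bytearray(cover)
    for i, bit in enumerate(bits):
        # Clear the lowest bit, then set it to the secret bit.
        stego[i] = (stego[i] & 0xFE) | bit
    return bytes(stego)


def extract(stego: bytes, n_bytes: int) -> bytes:
    """Read back `n_bytes` of hidden data from the lowest bits."""
    out = bytearray()
    for b in range(n_bytes):
        value = 0
        for i in range(8):
            value = (value << 1) | (stego[b * 8 + i] & 1)
        out.append(value)
    return bytes(out)


cover = bytes(range(64))          # stand-in for image pixel values
stego = embed(cover, b"hi")
recovered = extract(stego, 2)
```

Each cover byte changes by at most one, so the carrier looks essentially unchanged while still holding the hidden bytes.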
Blockchains for Artificial Intelligence » Brave New Coin
In recent years, Artificial Intelligence (AI) researchers have finally cracked problems that they've worked on for decades, from Go to human-level speech recognition. A key piece was the ability to gather and learn on mountains of data, which pulled error rates past the success line. In short, big data has transformed AI, to an almost unreasonable level. Blockchain technology could transform AI too, in its own particular ways. Some applications of blockchains to AI are mundane, like audit trails on AI models. Some appear almost unreasonable, like AI that can own itself -- AI DAOs. All of them are opportunities. This article will explore these applications. Before we discuss applications, let's first review what's different about blockchains compared to traditional big-data distributed databases like MongoDB. We can think of blockchains as "blue ocean" databases: they escape the "bloody red ocean" of sharks competing in an existing market, opting instead to be in a blue ocean of uncontested market space. Famous blue ocean examples are Wii for video game consoles (compromise raw performance, but have new mode of interaction), or Yellow Tail for wines (ignore the pretentious specs for wine lovers; make wine more accessible to beer lovers). By traditional database standards, traditional blockchains like Bitcoin are terrible: low throughput, low capacity, high latency, poor query support, and so on. But in blue-ocean thinking, that's ok, because blockchains introduced three new characteristics: decentralized / shared control, immutable / audit trails, and native assets / exchanges.
In Search of an Entity Resolution OASIS: Optimal Asymptotic Sequential Importance Sampling
Marchant, Neil G., Rubinstein, Benjamin I. P.
Entity resolution (ER) presents unique challenges for evaluation methodology. While crowdsourcing platforms acquire ground truth, sound approaches to sampling must drive labelling efforts. In ER, extreme class imbalance between matching and non-matching records can lead to enormous labelling requirements when seeking statistically consistent estimates for rigorous evaluation. This paper addresses this important challenge with the OASIS algorithm: a sampler and F-measure estimator for ER evaluation. OASIS draws samples from a (biased) instrumental distribution, chosen to ensure estimators with optimal asymptotic variance. As new labels are collected OASIS updates this instrumental distribution via a Bayesian latent variable model of the annotator oracle, to quickly focus on unlabelled items providing more information. We prove that resulting estimates of F-measure, precision, recall converge to the true population values. Thorough comparisons of sampling methods on a variety of ER datasets demonstrate significant labelling reductions of up to 83% without loss to estimate accuracy.
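As a rough illustration of estimating F-measure from a biased sample, the sketch below computes an importance-weighted F-measure from labelled pairs. It is not the OASIS algorithm: it assumes a fixed instrumental distribution, with each weight supplied by the caller as the ratio of the target probability to the sampling probability, and it omits the adaptive Bayesian update entirely. The function name and data layout are hypothetical.

```python
def weighted_f_measure(samples, alpha=0.5):
    """Importance-weighted F-measure estimate.

    samples: list of (true_label, predicted_label, weight) with labels in {0, 1};
             weight is assumed to be p(item) / q(item) for the sampling scheme.
    alpha:   0.5 gives the balanced F1 measure.
    """
    true_pos = sum(w * t * p for t, p, w in samples)
    predicted_pos = sum(w * p for t, p, w in samples)
    actual_pos = sum(w * t for t, p, w in samples)
    denom = alpha * predicted_pos + (1 - alpha) * actual_pos
    return true_pos / denom if denom > 0 else float("nan")


# With uniform weights the estimate reduces to the ordinary F1 score.
uniform = [(1, 1, 1.0), (1, 0, 1.0), (0, 1, 1.0), (0, 0, 1.0)]
f1 = weighted_f_measure(uniform)
```

Because the estimate is a ratio of weighted sums, rescaling all weights by a constant leaves it unchanged, so the weights only need to be known up to normalisation.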
MLDB.ai Blog
The business world is full of streams of items that need to be filtered or evaluated: parts on an assembly line, resumés in an application pile, emails in a delivery queue, transactions awaiting processing. Machine learning techniques are increasingly being used to make such processes more efficient: image processing to flag bad parts, text analysis to surface good candidates, spam filtering to sort email, fraud detection to lower transaction costs etc. In this article, I show how you can take business factors into account when using machine learning to solve these kinds of problems with binary classifiers. Specifically, I show how the concept of expected utility from the field of economics maps onto the Receiver Operating Characteristic (ROC) space often used by machine learning practitioners to compare and evaluate models for binary classification. I begin with a parable illustrating the dangers of not taking such factors into account. This concrete story is followed by a more formal mathematical look at the use of indifference curves in ROC space to avoid this kind of problem and guide model development. I wrap up with some recommendations for successfully using binary classifiers to solve business problems.
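The mapping from expected utility to ROC space can be sketched in a few lines. The formula below is the standard decision-theoretic one, and the utilities, prevalence and ROC points are made-up illustrative values rather than figures from the article.

```python
def expected_utility(fpr, tpr, prevalence, u_tp, u_fn, u_fp, u_tn):
    """Expected utility per item of operating at point (fpr, tpr) in ROC space."""
    positives = prevalence * (tpr * u_tp + (1 - tpr) * u_fn)
    negatives = (1 - prevalence) * (fpr * u_fp + (1 - fpr) * u_tn)
    return positives + negatives


def indifference_slope(prevalence, u_tp, u_fn, u_fp, u_tn):
    """Slope dTPR/dFPR of the lines of equal expected utility in ROC space."""
    return (1 - prevalence) * (u_tn - u_fp) / (prevalence * (u_tp - u_fn))


# Hypothetical business setting: positives are rare and costly to miss.
business = dict(prevalence=0.1, u_tp=10.0, u_fn=-10.0, u_fp=-1.0, u_tn=0.0)
roc_points = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.9), (1.0, 1.0)]

# Pick the operating point with the highest expected utility.
best_point = max(roc_points, key=lambda pt: expected_utility(*pt, **business))
slope = indifference_slope(**business)
```

All operating points on a line with this slope are equally good for the business, so a shallow slope (as here) favours accepting more false positives in exchange for catching more true positives.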
Cross-validation failure: small sample sizes lead to large error bars
Predictive models ground many state-of-the-art developments in statistical brain image analysis: decoding, MVPA, searchlight, or extraction of biomarkers. The principled approach to establish their validity and usefulness is cross-validation, testing prediction on unseen data. Here, I would like to raise awareness of error bars of cross-validation, which are often underestimated. Simple experiments show that sample sizes of many neuroimaging studies inherently lead to large error bars, e.g. $\pm$10% for 100 samples. The standard error across folds strongly underestimates them. These large error bars compromise the reliability of conclusions drawn with predictive models, such as biomarkers or method developments where, unlike with cognitive neuroimaging MVPA approaches, more samples cannot be acquired by repeating the experiment across many subjects. Solutions to increase sample size must be investigated, tackling possible increases in heterogeneity of the data.
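The order of magnitude of the quoted error bar can be checked with a back-of-the-envelope binomial approximation. This is a simplification and not necessarily the experiment used in the paper: it assumes independent test predictions, a normal approximation, and a worst-case accuracy of 0.5.

```python
import math


def binomial_halfwidth(n_samples, accuracy=0.5, z=1.96):
    """Half-width of an approximate 95% interval on measured accuracy."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n_samples)


halfwidth_100 = binomial_halfwidth(100)      # roughly 0.098, i.e. about 10%
halfwidth_1000 = binomial_halfwidth(1000)    # roughly 0.031
```

The half-width shrinks only with the square root of the sample size, so ten times more samples narrows the interval by a factor of about three.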
Analyzing Oscar Data
She graduated from the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, which ran from April 11th to July 1st, 2016. This post is based on her final class project - Capstone, due on the 12th week of the program. Have you ever seen a marketing ad for a movie and thought, wow I have to see that! Then you go see it, it's a great film, the actor roles are amazing, in your book it's won an Oscar, and it's not even nominated?
Efficient Approximate Solutions to Mutual Information Based Global Feature Selection
Venkateswara, Hemanth, Lade, Prasanth, Lin, Binbin, Ye, Jieping, Panchanathan, Sethuraman
Mutual Information (MI) is often used for feature selection when developing classifier models. Estimating the MI for a subset of features is often intractable. We demonstrate that, under the assumption of conditional independence, MI between a subset of features can be expressed as the Conditional Mutual Information (CMI) between pairs of features. But selecting features with the highest CMI turns out to be a hard combinatorial problem. In this work, we have applied two unique global methods, Truncated Power Method (TPower) and Low Rank Bilinear Approximation (LowRank), to solve the feature selection problem. These algorithms provide very good approximations to the NP-hard CMI based feature selection problem. We experimentally demonstrate the effectiveness of these procedures across multiple datasets and compare them with existing MI based global and iterative feature selection procedures.
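For readers unfamiliar with MI-based selection, the sketch below shows the simplest baseline: estimate the MI between each discrete feature and the class label from counts, then rank features by it. It does not implement the paper's TPower or LowRank methods or any CMI terms, and the feature names and data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats for two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    return sum(
        (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in joint.items()
    )


labels = [0, 0, 1, 1]
features = {
    "informative": [0, 0, 1, 1],   # identical to the label
    "noise": [0, 1, 0, 1],         # independent of the label
}

# Rank features by decreasing MI with the label.
ranking = sorted(
    features, key=lambda name: mutual_information(features[name], labels), reverse=True
)
```

Ranking features one at a time like this ignores redundancy between features, which is the gap that pairwise CMI-based global methods aim to close.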