Performance Analysis
Learning Concept Graphs from Online Educational Data
Liu, Hanxiao, Ma, Wanli, Yang, Yiming, Carbonell, Jaime
This paper addresses an open challenge in educational data mining, i.e., the problem of automatically mapping online courses from different providers (universities, MOOCs, etc.) onto a universal space of concepts, and predicting latent prerequisite dependencies (directed links) among both concepts and courses. We propose a novel approach for inference within and across course-level and concept-level directed graphs. In the training phase, our system projects partially observed course-level prerequisite links onto directed concept-level links; in the testing phase, the induced concept-level links are used to infer the unknown course-level prerequisite links. Whereas courses may be specific to one institution, concepts are shared across different providers. The bi-directional mappings enable our system to perform interlingua-style transfer learning, e.g. treating the concept graph as the interlingua and transferring the prerequisite relations across universities via the interlingua. Experiments on our newly collected datasets of courses from MIT, Caltech, Princeton and CMU show promising results.
Efficient AUC Optimization for Information Ranking Applications
Adequate evaluation of an information retrieval system to estimate future performance is a crucial task. Area under the ROC curve (AUC) is widely used to evaluate the generalization of a retrieval system. However, the objective function optimized in many retrieval systems is the error rate and not the AUC value. This paper provides an efficient and effective non-linear approach to optimize AUC using additive regression trees, with a special emphasis on the use of multi-class AUC (MAUC) because multiple relevance levels are widely used in many ranking applications. Compared to a conventional linear approach, the performance of the non-linear approach is comparable on binary-relevance benchmark datasets and is better on multi-relevance benchmark datasets.
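To make the gap between error rate and AUC concrete, here is a minimal sketch for the binary-relevance case. It is not the paper's regression-tree method. It only computes AUC as the fraction of (relevant, non-relevant) pairs that a scorer orders correctly, next to a thresholded error rate. The function names, the 0.5 threshold, and the toy scores are assumptions made for illustration.

```python
from itertools import product


def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p, n in product(pos, neg):
        if p > n:
            correct += 1.0
        elif p == n:
            correct += 0.5
    return correct / (len(pos) * len(neg))


def error_rate(scores, labels, threshold=0.5):
    """Fraction of examples misclassified when predicting 1 for score >= threshold."""
    wrong = sum(1 for s, y in zip(scores, labels) if (s >= threshold) != (y == 1))
    return wrong / len(labels)


# Hypothetical toy data: two scorers evaluated on the same four documents.
labels = [1, 0, 1, 0]
scorer_a = [0.9, 0.6, 0.4, 0.1]
scorer_b = [0.9, 0.6, 0.1, 0.4]

for name, scores in (("A", scorer_a), ("B", scorer_b)):
    print(name, error_rate(scores, labels), pairwise_auc(scores, labels))
```

On this toy data both scorers make the same number of thresholded mistakes, yet they order the documents differently, so their AUC values differ. That is why minimizing error rate does not necessarily maximize AUC.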
My thoughts on big data and data science: no, it's not hype
Each time a credit card is swiped or processed online, an analytic algorithm is used to detect if it's fraudulent or not (and the answer must come in less than 3 seconds most of the time, with a low false negative rate). Each time you do a Google search, an analytic engine determines which search results to show you, and which ads to display. Each time someone posts something on Facebook, an analytic algorithm is run to determine if it must be rejected (promotion, spam, porn, etc.) or not. Each Tweet posted is analyzed by analytic algorithms (designed by a number of different companies) to detect new viral trends (for journalists), disease spread, intelligence leaks, or many other things. Each time you browse Amazon, the customized content delivered to you is analytically "calculated" to optimize Amazon's revenue.
Machine learning and social engineering attacks
In my last post I promised to use some real-world use cases from the recent Verizon Data Breach Digest report to illustrate potential ways that machine learning can be used to detect or prevent similar incidents. For my first example, I've chosen the case of a manufacturer whose designs for an innovative new model of heavy construction equipment were stolen following a social engineering attack. They were tipped off when a primary competitor, located on another continent, introduced a new piece of equipment that looked like an exact copy of a model recently developed by the victim company. To paraphrase the Verizon report, it went like this. The threat actors identified an employee who they suspected would have access to the new product designs they were after -- the chief design engineer.
Developing an ICU scoring system with interaction terms using a genetic algorithm
Gan, Chee Chun, Learmonth, Gerard
ICU mortality scoring systems attempt to predict patient mortality using predictive models with various clinical predictors. Examples of such systems are APACHE, SAPS and MPM. However, most such scoring systems do not actively look for and include interaction terms, despite physicians intuitively taking such interactions into account when making a diagnosis. One barrier to including such terms in predictive models is the difficulty of using most variable selection methods in high-dimensional datasets. A genetic algorithm framework for variable selection with logistic regression models is used to search for two-way interaction terms in a clinical dataset of adult ICU patients, with separate models being built for each category of diagnosis upon admittance to the ICU. The models had good discrimination across all categories, with a weighted average AUC of 0.84 (>0.90 for several categories), and the genetic algorithm was able to find several significant interaction terms, which may provide greater insight into mortality prediction for health practitioners. The GA-selected models outperformed stepwise selection and random forest models, and the approach provides greater flexibility in variable selection because it can optimize over any modeler-defined model performance metric instead of a specific variable importance metric.
Datacratic MLDB
By using machine learning algorithms, we are increasingly able to use computers to perform intellectual tasks at a level approaching that of humans. Given that computers cost less than employees, many people are afraid that humans will therefore necessarily lose their jobs to computers. Contrary to this belief, in this article I show that even when a computer can perform a task more economically than a human, careful analysis suggests that humans and computers working together can sometimes yield even better business outcomes than simply replacing one with the other. Specifically, I show how a classifier with a reject option can increase worker productivity for certain types of tasks, and I show how to construct and tune such a classifier from a simple scoring function by using two thresholds. I begin with a parable featuring the same characters as the one from Part 1 of this Machine Learning Meets Economics series.
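A classifier with a reject option built from a scoring function and two thresholds can be sketched as follows. This is a minimal illustration and not the article's own code. The names `classify_with_reject` and `route`, the labels "positive", "negative" and "defer", and the example threshold values are all assumptions.

```python
def classify_with_reject(score, low, high):
    """Decide automatically when the score is confident, otherwise defer.

    Scores at or above `high` are labeled positive, scores at or below `low`
    are labeled negative, and anything strictly in between goes to a human.
    """
    if low > high:
        raise ValueError("low must not exceed high")
    if score >= high:
        return "positive"
    if score <= low:
        return "negative"
    return "defer"


def route(scores, low, high):
    """Return the per-item decisions and the fraction deferred to humans."""
    decisions = [classify_with_reject(s, low, high) for s in scores]
    deferred = decisions.count("defer") / len(decisions)
    return decisions, deferred


# Hypothetical scores from some upstream scoring function.
decisions, deferred = route([0.05, 0.5, 0.95, 0.3], low=0.2, high=0.8)
print(decisions, deferred)
```

Moving the two thresholds apart sends more items to human reviewers and fewer to the automatic path. Moving them together does the opposite, and setting them equal recovers an ordinary single-threshold classifier.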
How this AI-human partnership takes cybersecurity to a new level
In the ongoing battle against cyber attacks, a man-machine collaboration could offer a new path to security. To keep up with cyber threats, the cybersecurity industry has turned to assistance from unsupervised artificial intelligence systems that operate independently from human analysts. But the Computer Science and Artificial Intelligence Laboratory (CSAIL) at the Massachusetts Institute of Technology in Cambridge, Mass., in partnership with the machine-learning startup PatternEx, is offering a fresh approach. Their new program, AI2, draws on what humans and machines each do best: It allows human analysts to build upon the large-scale pattern recognition and learning capabilities of artificial intelligence. The industry standard right now is unsupervised machine learning, CSAIL research scientist Kalyan Veeramachaneni, who helped develop the program, says in a phone interview with The Christian Science Monitor.
Variational inference for rare variant detection in deep, heterogeneous next-generation sequencing data
The detection of rare variants is important for understanding the genetic heterogeneity in mixed samples. Recently, next-generation sequencing (NGS) technologies have enabled the identification of single nucleotide variants (SNVs) in mixed samples with high resolution. Yet, the noise inherent in the biological processes involved in next-generation sequencing necessitates the use of statistical methods to identify true rare variants. We propose a novel Bayesian statistical model and a variational expectation-maximization (EM) algorithm to estimate non-reference allele frequency (NRAF) and identify SNVs in heterogeneous cell populations. We demonstrate that our variational EM algorithm has sensitivity and specificity comparable to those of a Markov Chain Monte Carlo (MCMC) sampling inference algorithm, and is more computationally efficient on tests of low coverage ($27\times$ and $298\times$) data. Furthermore, we show that our model with a variational EM inference algorithm has higher specificity than many state-of-the-art algorithms. In an analysis of a directed evolution longitudinal yeast data set, we are able to identify a time-series trend in non-reference allele frequency and detect novel variants that have not yet been reported. Our model also detects the emergence of a beneficial variant earlier than was previously shown, and a pair of concomitant variants.
The Area Under an ROC Curve
ROC curves can also be constructed from clinical prediction rules. The graphs at right come from a study of how clinical findings predict strep throat (Wigton RS, Connor JL, Centor RM). In that study, the presence of tonsillar exudate, fever, adenopathy and the absence of cough all predicted strep. The curves were constructed by computing the sensitivity and specificity of increasing numbers of clinical findings (from 0 to 4) in predicting strep. The study compared patients in Virginia and Nebraska and found that the rule performed more accurately in Virginia (area under the curve .78)
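The construction described above (one ROC point per cutoff on the number of findings present) can be sketched as follows. The patient data here are invented purely for illustration and are not from the cited study. The function names and the use of the trapezoid rule for the area are likewise assumptions.

```python
def roc_points(finding_counts, has_disease, max_count=4):
    """ROC points for the rule "test positive if at least k findings are present".

    Returns (1 - specificity, sensitivity) pairs for k = max_count + 1 down
    to 0, so the list runs from (0, 0) to (1, 1).
    """
    pos = sum(1 for d in has_disease if d)
    neg = len(has_disease) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need both diseased and non-diseased patients")
    points = []
    for k in range(max_count + 1, -1, -1):
        tp = sum(1 for c, d in zip(finding_counts, has_disease) if d and c >= k)
        fp = sum(1 for c, d in zip(finding_counts, has_disease) if not d and c >= k)
        points.append((fp / neg, tp / pos))
    return points


def trapezoid_auc(points):
    """Area under a piecewise-linear ROC curve given points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical patients: number of findings present (0 to 4) and strep status.
finding_counts = [4, 3, 3, 1, 2, 1, 0, 0]
has_disease = [1, 1, 1, 1, 0, 0, 0, 0]

points = roc_points(finding_counts, has_disease)
print(points)
print(trapezoid_auc(points))
```

Each cutoff trades sensitivity against specificity: requiring more findings before calling the test positive lowers the false positive rate but misses more true cases.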
MIT Develops AI That Detects 85 Percent of Cyber-Attacks
MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), together with researchers from security firm PatternEx, has revealed a new artificial intelligence (AI) system called AI2, which can detect 85 percent of cyber-attacks, with false positive rates five times lower than existing solutions. The new system relies not only on artificial intelligence (AI) but also on user input, which researchers call analyst intuition (AI), hence the name AI2. Researchers said they fed AI2 over 3.6 billion lines of log files, allowing the system to scan the content with unsupervised machine-learning techniques. At the end of each day, the system presents its findings to a human operator, who then confirms or dismisses security alerts. This human feedback is then incorporated into AI2's learning system and used the next day for analyzing new logs. After their tests had concluded, MIT and PatternEx researchers said AI2 achieved an 85 percent accuracy rate in detecting cyber-attacks, which is 2.92 times better than similar automated cyber-attack detection systems used today.