Accuracy
The Data Science Bowl Passion. Curiosity. Purpose. Presented by Booz Allen and Kaggle
Lung cancer is one of the most common types of cancer, with nearly 225,000 new cases of the disease expected in the U.S. in 2016. Using a data set of high-resolution scans of lungs provided by the National Cancer Institute, participants will develop artificial intelligence algorithms to accurately determine when lesions in the lungs are cancerous. This will dramatically reduce the false positive rate that prevents low-dose CT scans from being widely used for lung cancer detection. Competition results have the potential to advance our understanding of how all types of cancer develop and spread in the body. They'll also free radiologists to spend more time with patients.
Effective and Extensible Feature Extraction Method Using Genetic Algorithm-Based Frequency-Domain Feature Search for Epileptic EEG Multi-classification
In this paper, a genetic algorithm-based frequency-domain feature search (GAFDS) method is proposed for the electroencephalogram (EEG) analysis of epilepsy. In this method, frequency-domain features are first searched and then combined with nonlinear features. Subsequently, these features are selected and optimized to classify EEG signals. The extracted features are analyzed experimentally. The features extracted by GAFDS show remarkable independence, and they are superior to the nonlinear features in terms of the ratio of inter-class distance to intra-class distance. Moreover, the proposed feature search method can additionally search for features of instantaneous frequency in a signal after Hilbert transformation. The classification results achieved using these features are reasonable; thus, GAFDS exhibits good extensibility. Multiple classic classifiers (i.e., k-nearest neighbor, linear discriminant analysis, decision tree, AdaBoost, multilayer perceptron, and Naïve Bayes) achieve good results by using the features generated by the GAFDS method and the optimized selection. Specifically, the accuracies for the two-classification and three-classification problems may reach up to 99% and 97%, respectively. Results of several cross-validation experiments illustrate that GAFDS is effective in feature extraction for EEG classification. Therefore, the proposed feature selection and optimization model can improve classification accuracy.
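The abstract compares features by the ratio of inter-class distance to intra-class distance but does not give the formula. The following is a minimal sketch for a single scalar feature, assuming (this is not taken from the paper) that inter-class distance is the mean absolute distance between class centroids and intra-class distance is the mean absolute distance of samples to their own class centroid. The function name `inter_intra_ratio` is hypothetical.

```python
from statistics import mean


def inter_intra_ratio(feature_by_class):
    """Ratio of inter-class to intra-class distance for one scalar feature.

    feature_by_class maps a class label to a list of feature values.
    Both distance definitions are illustrative assumptions, not the
    definitions used by the GAFDS authors.
    """
    if len(feature_by_class) < 2:
        raise ValueError("need at least two classes")

    # Centroid (mean feature value) of each class.
    centroids = {label: mean(values) for label, values in feature_by_class.items()}
    labels = list(centroids)

    # Inter-class distance: average gap between every pair of centroids.
    pair_gaps = [
        abs(centroids[a] - centroids[b])
        for i, a in enumerate(labels)
        for b in labels[i + 1:]
    ]
    inter = mean(pair_gaps)

    # Intra-class distance: average gap between a sample and its own centroid.
    sample_gaps = [
        abs(x - centroids[label])
        for label, values in feature_by_class.items()
        for x in values
    ]
    intra = mean(sample_gaps)
    if intra == 0:
        raise ValueError("intra-class distance is zero; ratio is undefined")
    return inter / intra
```

Under these assumptions, a larger ratio means the feature separates the classes more cleanly relative to the spread within each class.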
Smart buildings predict when critical systems are about to fail
Imagine a building that tells you, before it happens, that the heating is about to fail. Some companies are using machine learning to do just that. Software firm CGnal, based in Milan, Italy, recently analysed a year's worth of data from the heating and ventilation units in an Italian hospital. Sensors are now commonly built into heating, ventilation and air conditioning units, and the team had records such as temperature, humidity and electricity use, relating to appliances in operating theatres and first aid rooms as well as corridors. They trained a machine learning algorithm on data from the first half of 2015, looking for differences in the readings of similar appliances.
Multivariate Confidence Intervals
Korpela, Jussi, Oikarinen, Emilia, Puolamäki, Kai, Ukkonen, Antti
Confidence intervals are a popular way to visualize and analyze data distributions. Unlike p-values, they can convey information about both statistical significance and effect size. However, very little work exists on applying confidence intervals to multivariate data. In this paper we define confidence intervals for multivariate data that extend the one-dimensional definition in a natural way. In our definition every variable is associated with its own confidence interval as usual, but a data vector can be outside a few of these and still be considered to be within the confidence area. We analyze the problem and show that the resulting confidence areas retain the good qualities of their one-dimensional counterparts: they are informative and easy to interpret. Furthermore, we show that the problem of finding multivariate confidence intervals is hard, but provide efficient approximate algorithms to solve the problem.
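The membership rule described in the abstract (a vector may fall outside a few per-variable intervals and still count as inside the confidence area) can be sketched as below. This covers only the membership test, not the hard problem of finding the intervals. The function name `within_confidence_area` and the parameter `max_outside` are hypothetical names chosen for illustration.

```python
def within_confidence_area(vector, intervals, max_outside):
    """Return True if `vector` violates at most `max_outside` intervals.

    vector      -- sequence of numbers, one per variable
    intervals   -- sequence of (low, high) pairs, one per variable
    max_outside -- how many variables may fall outside their interval
                   (hypothetical parameter name)
    """
    if len(vector) != len(intervals):
        raise ValueError("vector and intervals must have the same length")

    # Count the variables whose value lies outside its own interval.
    outside = sum(
        1 for x, (low, high) in zip(vector, intervals) if not low <= x <= high
    )
    return outside <= max_outside
```

With `max_outside=0` this reduces to requiring every variable to lie within its own interval.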
"Flow Size Difference" Can Make a Difference: Detecting Malicious TCP Network Flows Based on Benford's Law
Iorliam, Aamo, Tirunagari, Santosh, Ho, Anthony T. S., Li, Shujun, Waller, Adrian, Poh, Norman
Statistical characteristics of network traffic have attracted a significant amount of research for automated network intrusion detection, some of which looked at applications of natural statistical laws such as Zipf's law, Benford's law and the Pareto distribution. In this paper, we present the application of Benford's law to a new network flow metric, "flow size difference", which has not been studied before by other researchers, to build an unsupervised flow-based intrusion detection system (IDS). The method was inspired by our observation on a large number of TCP flow datasets where normal flows tend to follow Benford's law closely but malicious flows tend to deviate significantly from it. The proposed IDS is unsupervised, so it can be easily deployed without any training. It has two simple operational parameters with a clear semantic meaning, allowing the IDS operator to set and adapt their values intuitively to adjust the overall performance of the IDS. We tested the proposed IDS on two datasets (one closed and one public), and proved its efficiency in terms of AUC (area under the ROC curve). Our work showed that the "flow size difference" has great potential to improve the performance of any flow-based network IDS.
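Benford's law states that the leading digit d (1 to 9) occurs with probability log10(1 + 1/d). The sketch below shows one way a Benford check could look. Two points are assumptions, not details from the abstract: "flow size difference" is taken to mean the absolute difference between the sizes of consecutive flows, and a chi-square-style divergence is used as the deviation score. All function names are hypothetical, and the paper's two operational parameters are not modeled here.

```python
import math
from collections import Counter


def benford_expected():
    """Benford's law: P(d) = log10(1 + 1/d) for leading digit d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(n):
    """Return the first significant digit of a non-zero integer."""
    n = abs(int(n))
    while n >= 10:
        n //= 10
    return n


def flow_size_differences(flow_sizes):
    """Assumed definition: absolute difference of consecutive flow sizes."""
    return [abs(b - a) for a, b in zip(flow_sizes, flow_sizes[1:])]


def first_digit_distribution(values):
    """Observed relative frequency of each leading digit, ignoring zeros."""
    digits = [leading_digit(v) for v in values if int(v) != 0]
    if not digits:
        raise ValueError("no non-zero values to analyse")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


def benford_divergence(observed, expected):
    """Chi-square-style distance between observed and expected frequencies."""
    return sum((observed[d] - expected[d]) ** 2 / expected[d] for d in range(1, 10))
```

A window of flows would then be scored with `benford_divergence(first_digit_distribution(flow_size_differences(sizes)), benford_expected())`; a larger score indicates a stronger deviation from Benford's law, and any alert threshold would have to be chosen by the operator.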
Rare Disease Physician Targeting: A Factor Graph Approach
Cai, Yong, Wang, Yunlong, Dai, Dong
In rare disease physician targeting, a major challenge is how to identify physicians who are treating diagnosed or underdiagnosed rare disease patients. Rare diseases have an extremely low incidence rate. For a specified rare disease, only a small number of patients are affected and only a fraction of physicians are involved. The existing targeting methodologies, such as segmentation and profiling, were developed under a mass-market assumption. They are not suitable for the rare disease market, where the target classes are extremely imbalanced. The authors propose a graphical model approach to predict targets by jointly modeling physician and patient features from different data spaces and utilizing the extra relational information. Through an empirical example with medical claim and prescription data, the proposed approach demonstrates better accuracy in finding target physicians. The graph representation also provides visual interpretability of the relationships among physicians and patients. The model can be extended to incorporate more complex dependency structures. This article contributes to the literature exploring the benefit of utilizing relational dependencies among entities in the healthcare industry.
Google Says Its AI Catches 99.9 Percent of Gmail Spam
About a decade ago, spam brought email to near-ruin. The contest to save your inbox was on, with two of the world's biggest tech companies vying for the title of top spam-killer. By February 2012, Microsoft boasted that its spam filters were removing all but 3 percent of the junk messages from Hotmail, the company's online email service at the time. Google responded by claiming that its service, Gmail, removed all but about one percent of spam messages, adding that its false positive rate--legitimate mail misidentified as spam--was also about one percent. It was a point of pride for the two companies, particularly Microsoft, whose Hotmail service once carried such a poor reputation for spam. And the relative success of both showed that heuristic technologies--which identify spam based on pre-defined rules--were working.
Hidden "Signature" in Online Photos Could Help Nab Child Abusers
Police may soon have a new way to catch pedophiles who distribute child abuse photos anonymously online. The technology could also help law enforcement agencies in other ways, such as identifying smartphone thieves who take pictures with the stolen gadgets and then post their snapshots on the Internet. Riccardo Satta, scientific project officer of the European Commission Joint Research Center's Institute for the Protection and Security of the Citizen, described the work he did with fellow researcher Pasquale Stirparo at the Computers, Privacy and Data Protection Conference in Brussels held in January. The key is the ability to spot a unique, unremovable pattern--or signature--that each digital camera imprints on photographs. By comparing the signature from a specific camera with those found in images posted to social media, a forensic investigator would be able to establish that all the images had been taken by the same camera.
Scientists Report Initial Success With a Blood Test for Ovarian Cancer
The test is still experimental, unavailable to the public outside of clinical trials. Its developers say it needs further study in many more women to determine whether the early findings hold up. If it does come to market, it will not be for several years, and its use might initially be limited to women at high risk. Dr. Emanuel F. Petricoin, who helped create the test, said, "I'm all too aware as an F.D.A. scientist of promising early results that start to fail as you go into the real world." Ovarian cancer is not common, but it is often deadly.