DeepCapture: Image Spam Detection Using Deep Learning and Data Augmentation

arXiv.org Machine Learning

Image spam emails are often used to evade text-based spam filters, which detect spam emails by their frequently used keywords. In this paper, we propose a new image spam email detection tool called DeepCapture, built on a convolutional neural network (CNN) model. There have been many efforts to detect image spam emails, but performance degrades significantly against entirely new and unseen image spam emails because of overfitting during the training phase. To address this issue, we focus on developing a more robust model that is less prone to overfitting. Our key idea is to build a CNN-XGBoost framework consisting of only eight layers and to train it on a large number of samples generated with data augmentation techniques tailored to the image spam detection task. To show the feasibility of DeepCapture, we evaluate its performance on publicly available datasets consisting of 6,000 spam and 2,313 non-spam image samples. The experimental results show that DeepCapture achieves an F1-score of 88%, a 6 percentage point improvement over the best existing spam detection model, CNN-SVM [19], which has an F1-score of 82%. Moreover, DeepCapture outperformed existing image spam detection solutions on new and unseen image datasets.
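For readers comparing the reported scores, the F1-score is the harmonic mean of precision and recall. The sketch below is a minimal illustration and not the authors' evaluation code; the function name `f1_score` and the confusion counts in the usage note are hypothetical.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the F1-score computed from confusion counts.

    tp: spam images correctly flagged as spam (true positives)
    fp: non-spam images wrongly flagged as spam (false positives)
    fn: spam images that were missed (false negatives)
    """
    if tp == 0:
        # With no true positives, precision and recall are both zero
        # (or undefined), so the score is reported as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```

With hypothetical counts `f1_score(tp=8, fp=2, fn=2)`, precision and recall are both 0.8, so the F1-score is also approximately 0.8.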


Why Are Libraries Failing At Web Archiving And Are We Losing Our Digital History?

Forbes - Tech

Last fall I was invited to speak at an event examining the state of web archiving by libraries and especially how libraries are handling the archival of online news, which is both high velocity and highly fluid. After spending a day and a half talking with librarians, archivists, journalists, information scientists, government officials and technologists, it was remarkable how little had changed since the first web archiving meeting I was invited to speak at, held at the Library of Congress three quarters of a decade ago. The talks were identical, the topics unchanged, the progress unmoved. Perhaps the best summary of the meeting came from a presentation by one of the Library of Congress' digital leads, who proudly announced the Library's new initiative to use RSS feeds to improve its crawling of news websites. She discussed how, after years of archiving news websites, the Library had learned (more than 15 years after the commercial world) that performing a traditional web crawl of each news site, beginning on its homepage and crawling breadth-first across the entire site, made it difficult to archive rapidly changing news sites that published content at high velocities.
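The feed-driven approach described above can be sketched as reading article URLs directly from a site's RSS feed rather than discovering them by crawling outward from the homepage. This is a minimal illustration, not the Library's implementation; the feed content, the `news.example` URLs, and the function name `article_links` are made up, and a real crawler would also fetch the feed over HTTP and poll it on a schedule.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical RSS 2.0 feed, inlined so the sketch runs without network access.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>https://news.example/</link>
    <item><title>Story A</title><link>https://news.example/a</link></item>
    <item><title>Story B</title><link>https://news.example/b</link></item>
  </channel>
</rss>"""


def article_links(feed_xml: str) -> List[str]:
    """Return the article URLs listed in the <item> elements of an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    links = []
    # Only item-level links are collected; the channel-level <link>
    # points at the site homepage, not at an article.
    for link in root.findall("./channel/item/link"):
        if link.text and link.text.strip():
            links.append(link.text.strip())
    return links
```

Each returned URL can be queued for archiving as soon as it appears in the feed, which avoids waiting for a breadth-first crawl to reach newly published pages.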


CEUR-WS.org - CEUR Workshop Proceedings (free, open-access publishing, computer science/information systems/information technology)

#artificialintelligence

Workshop organizers can look here for information on how to submit electronic workshop proceedings to CEUR. The procedure takes a few minutes. It requires the ability to upload files to an FTP server and to send a fax in which you acknowledge the nature of CEUR Workshop Proceedings as a free service for academia and science. There were a few thousand real paper downloads per day from CEUR-WS.org (downloads by search engines and from the proceedings FTP archive are not counted). As of Nov 2017, CEUR-WS.org had published about 35,000 papers distributed over roughly 1,970 volumes.


The UEA multivariate time series classification archive, 2018

arXiv.org Machine Learning

In 2002, the UCR time series classification archive was first released with sixteen datasets. It expanded gradually until 2015, when it grew from 45 datasets to 85. In October 2018, more datasets were added, bringing the total to 128. The new archive contains a wide range of problems, including variable-length series, but it still contains only univariate time series classification problems. One of the motivations for introducing the archive was to encourage researchers to perform a more rigorous evaluation of newly proposed time series classification (TSC) algorithms. It has worked: most recent research into TSC uses all 85 datasets to evaluate algorithmic advances. Research into multivariate time series classification (MTSC), where more than one series is associated with each class label, is in the position that univariate TSC research was in a decade ago. Algorithms are evaluated using very few datasets, and claims of improvement are not based on statistical comparisons. We aim to address this problem by forming the first iteration of the MTSC archive, to be hosted at the website www.timeseriesclassification.com. Like the univariate archive, this formulation was a collaborative effort between researchers at the University of East Anglia (UEA) and the University of California, Riverside (UCR). The 2018 vintage consists of 30 datasets with a wide range of cases, dimensions and series lengths. For this first iteration of the archive, we format all data to be of equal length, include no series with missing data and provide train/test splits.
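As a rough illustration of the formatting constraints in the last sentence (equal-length series, no missing values), the sketch below validates a small in-memory dataset and reports its shape as cases, dimensions, and series length. The nested-list layout, the toy values, and the function name `archive_shape` are assumptions for this example, not the archive's file format or tooling.

```python
import math
from typing import List, Tuple

# One case is a list of dimensions; each dimension is a list of time points.
Case = List[List[float]]

# Hypothetical toy dataset: 2 cases, 2 dimensions, 3 time points each.
TOY_DATASET: List[Case] = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]],
]


def archive_shape(cases: List[Case]) -> Tuple[int, int, int]:
    """Return (n_cases, n_dimensions, series_length) for a well-formed dataset.

    Raises ValueError if the dataset is empty, if cases differ in their
    number of dimensions, if series differ in length, or if any value is NaN.
    """
    if not cases or not cases[0] or not cases[0][0]:
        raise ValueError("dataset must contain at least one non-empty series")
    n_dims = len(cases[0])
    length = len(cases[0][0])
    for case in cases:
        if len(case) != n_dims:
            raise ValueError("cases differ in number of dimensions")
        for series in case:
            if len(series) != length:
                raise ValueError("series differ in length")
            if any(math.isnan(value) for value in series):
                raise ValueError("series contains a missing (NaN) value")
    return len(cases), n_dims, length
```

Calling `archive_shape(TOY_DATASET)` returns `(2, 2, 3)`; a dataset with ragged or incomplete series raises `ValueError` instead.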