Number of connected devices is steadily increasing and these devices continuously generate data streams. Real-time processing of data streams is arousing interest despite many challenges. Clustering is one of the most suitable methods for real-time data stream processing, because it can be applied with less prior information about the data and it does not need labeled instances. However, data stream clustering differs from traditional clustering in many aspects and it has several challenging issues. Here, we provide information regarding the concepts and common characteristics of data streams, such as concept drift, data structures for data streams, time window models and outlier detection. We comprehensively review recent data stream clustering algorithms and analyze them in terms of the base clustering technique, computational complexity and clustering accuracy. A comparison of these algorithms is given along with still open problems. We indicate popular data stream repositories and datasets, stream processing tools and platforms. Open problems about data stream clustering are also discussed.
This paper describes an engine to optimize web publisher revenues from second-price auctions. These auctions are widely used to sell online ad spaces in a mechanism called real-time bidding (RTB). Optimization within these auctions is crucial for web publishers, because setting appropriate reserve prices can significantly increase revenue. We consider a practical real-world setting where the only available information before an auction occurs consists of a user identifier and an ad placement identifier. The real-world challenges we had to tackle consist mainly of tracking the dependencies on both the user and placement in an highly non-stationary environment and of dealing with censored bid observations. These challenges led us to make the following design choices: (i) we adopted a relatively simple non-parametric regression model of auction revenue based on an incremental time-weighted matrix factorization which implicitly builds adaptive users' and placements' profiles; (ii) we jointly used a non-parametric model to estimate the first and second bids' distribution when they are censored, based on an on-line extension of the Aalen's Additive model. Our engine is a component of a deployed system handling hundreds of web publishers across the world, serving billions of ads a day to hundreds of millions of visitors. The engine is able to predict, for each auction, an optimal reserve price in approximately one millisecond and yields a significant revenue increase for the web publishers.
The increasing use of Internet-of-Things (IoT) devices for monitoring a wide spectrum of applications, along with the challenges of "big data" streaming support they often require for data analysis, is nowadays pushing for an increased attention to the emerging edge computing paradigm. In particular, smart approaches to manage and analyze data directly on the network edge, are more and more investigated, and Artificial Intelligence (AI) powered edge computing is envisaged to be a promising direction. In this paper, we focus on Data Centers (DCs) and Supercomputers (SCs), where a new generation of high-resolution monitoring systems is being deployed, opening new opportunities for analysis like anomaly detection and security, but introducing new challenges for handling the vast amount of data it produces. In detail, we report on a novel lightweight and scalable approach to increase the security of DCs/SCs, that involves AI-powered edge computing on high-resolution power consumption. The method -- called pAElla -- targets real-time Malware Detection (MD), it runs on an out-of-band IoT-based monitoring system for DCs/SCs, and involves Power Spectral Density of power measurements, along with AutoEncoders. Results are promising, with an F1-score close to 1, and a False Alarm and Malware Miss rate close to 0%. We compare our method with State-of-the-Art MD techniques and show that, in the context of DCs/SCs, pAElla can cover a wider range of malware, significantly outperforming SoA approaches in terms of accuracy. Moreover, we propose a methodology for online training suitable for DCs/SCs in production, and release open dataset and code.
Edge intelligence refers to a set of connected systems and devices for data collection, caching, processing, and analysis in locations close to where data is captured based on artificial intelligence. The aim of edge intelligence is to enhance the quality and speed of data processing and protect the privacy and security of the data. Although recently emerged, spanning the period from 2011 to now, this field of research has shown explosive growth over the past five years. In this paper, we present a thorough and comprehensive survey on the literature surrounding edge intelligence. We first identify four fundamental components of edge intelligence, namely edge caching, edge training, edge inference, and edge offloading, based on theoretical and practical results pertaining to proposed and deployed systems. We then aim for a systematic classification of the state of the solutions by examining research results and observations for each of the four components and present a taxonomy that includes practical problems, adopted techniques, and application goals. For each category, we elaborate, compare and analyse the literature from the perspectives of adopted techniques, objectives, performance, advantages and drawbacks, etc. This survey article provides a comprehensive introduction to edge intelligence and its application areas. In addition, we summarise the development of the emerging research field and the current state-of-the-art and discuss the important open issues and possible theoretical and technical solutions.
Distributed learning has become a critical enabler of the massively connected world envisioned by many. This article discusses four key elements of scalable distributed processing and real-time intelligence -- problems, data, communication and computation. Our aim is to provide a fresh and unique perspective about how these elements should work together in an effective and coherent manner. In particular, we provide a selective review about the recent techniques developed for optimizing non-convex models (i.e., problem classes), processing batch and streaming data (i.e., data types), over the networks in a distributed manner (i.e., communication and computation paradigm). We describe the intuitions and connections behind a core set of popular distributed algorithms, emphasizing how to trade off between computation and communication costs. Practical issues and future research directions will also be discussed. We are living in a highly connected world, and it will become exponentially more connected in a decade. These devices collect a huge amount of real-time data, perform complex computational tasks, and provide vital services which significantly improve our lives and enrich our collective productivity. THC, MH, HTW are ordered alphabetically, and contributed equally. MH is the corresponding author. THC is with the School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China. MH and XZ are with the ECE Department, University of Minnesota, MN, USA. HTW is with the Department of SEEM, The Chinese University of Hong Kong, Hong Kong SAR, China.
--Global physical event detection has traditionally relied on dense coverage of physical sensors around the world; while this is an expensive undertaking, there have not been alternatives until recently. The ubiquity of social networks and human sensors in the field provides a tremendous amount of real-time, live data about true physical events from around the world. However, while such human sensor data have been exploited for retrospective large-scale event detection, such as hurricanes or earthquakes, they has been limited to no success in exploiting this rich resource for general physical event detection. Prior implementation approaches have suffered from the concept drift phenomenon, where real-world data exhibits constant, unknown, unbounded changes in its data distribution, making static machine learning models ineffective in the long term. We propose and implement an end-to-end collaborative drift adaptive system that integrates corroborative and probabilistic sources to deliver real-time predictions. Furthermore, out system is adaptive to concept drift and performs automated continuous learning to maintain high performance. We demonstrate our approach in a real-time demo available online for landslide disaster detection, with extensibility to other real-world physical events such as flooding, wildfires, hurricanes, and earthquakes. Physical event detection, such as extreme weather events or traffic accidents have long been the domain of static event processors operating on numeric sensor data or human actors manually identifying event types. However, the emergence of big data and associated data processing and analytics tools and systems have led to several applications in large-scale event and trend detection in the streaming domain -. However, it is important to note that many of these works are a form of retrospective analysis, as opposed to true real-time event detection, since they perform analyses on cleaned and processed data within a short-time frame in the past, with the assumption that their approaches are sustainable and will continue to function over time.
The models are updated using a CNN, which ensures robustness to noise, scaling and minor variations of the targets' appearance. As with many other related approaches, an online implementation offloads most of the processing to an external server leaving the embedded device from the vehicle to carry out only minor and frequently-needed tasks. Since quick reactions of the system are crucial for proper and safe vehicle operation, performance and a rapid response of the underlying software is essential, which is why the online approach is popular in this field. Also in the context of ensuring robustness and stability, some authors apply fusion techniques to information extracted from CNN layers. It has been previously mentioned that important correlations can be drawn from deep and shallow layers which can be exploited together for identifying robust features in the data.
With the explosive growth of e-commerce and the booming of e-payment, detecting online transaction fraud in real time has become increasingly important to Fintech business. To tackle this problem, we introduce the TitAnt, a transaction fraud detection system deployed in Ant Financial, one of the largest Fintech companies in the world. The system is able to predict online real-time transaction fraud in mere milliseconds. We present the problem definition, feature extraction, detection methods, implementation and deployment of the system, as well as empirical effectiveness. Extensive experiments have been conducted on large real-world transaction data to show the effectiveness and the efficiency of the proposed system.
Recently, deep learning models play more and more important roles in contents recommender systems. However, although the performance of recommendations is greatly improved, the "Matthew effect" becomes increasingly evident. While the head contents get more and more popular, many competitive long-tail contents are difficult to achieve timely exposure because of lacking behavior features. This issue has badly impacted the quality and diversity of recommendations. To solve this problem, look-alike algorithm is a good choice to extend audience for high quality long-tail contents. But the traditional look-alike models which widely used in online advertising are not suitable for recommender systems because of the strict requirement of both real-time and effectiveness. This paper introduces a real-time attention based look-alike model (RALM) for recommender systems, which tackles the challenge of conflict between real-time and effectiveness. RALM realizes real-time look-alike audience extension benefiting from seeds-to-user similarity prediction and improves the effectiveness through optimizing user representation learning and look-alike learning modeling. For user representation learning, we propose a novel neural network structure named attention merge layer to replace the concatenation layer, which significantly improves the expressive ability of multi-fields feature learning. On the other hand, considering the various members of seeds, we design global attention unit and local attention unit to learn robust and adaptive seeds representation with respect to a certain target user. At last, we introduce seeds clustering mechanism which not only reduces the time complexity of attention units prediction but also minimizes the loss of seeds information at the same time. According to our experiments, RALM shows superior effectiveness and performance than popular look-alike models.
Yet there's also no point in accessing richer sources of data unless you have an architecture that can consume it. An AI-ready architecture is able to address different shapes and granularities of data such as transactions, logs, geospatial information, sensors and social. In addition, real-time time-series data is key to the constant feed of input that propels data-driven devices, from smart-home appliances and health devices to self-driving cars. Make sure your AI architecture has the capability to consume different data structures in different time dimensions, especially real time. Is your organization identifying and classifying data at the point of ingestion?