Noname manuscript No. (will be inserted by the editor)Time series classification for varying length series Chang Wei Tan · Fran cois Petitjean· Eamonn Keogh · Geoffrey I. Webb the date of receipt and acceptance should be inserted later Abstract Research into time series classification has tended to focus on the case of series of uniform length. However, it is common for real-world time series data to have unequal lengths. Differing time series lengths may arise from a number of fundamentally different mechanisms. In this work, we identify and evaluate two classes of such mechanisms - variations in sampling rate relative to the relevant signal and variations between the start and end points of one time series relative to one another. We investigate how time series generated by each of these classes of mechanism are best addressed for time series classification. We perform extensive experiments and provide practical recommendations on how variations in length should be handled in time series classification. Keywords Time Series Classification, Proximity Forest, Dynamic Time Warping 1 Introduction Time series classification (TSC) is an important task in many modern world applications such as remote sensing (Pelletier et al., 2019; Petitjean et al., 2012), astronomy (Batista et al., 2011), speech recognition (Hamooni et al., 2016), and insect classification (Chen et al., 2014). The time series to be classified are the observed outputs generated by some process. The classification task often relates to identifying the class of the process that generated the series. Each class of process might be considered as a realization of one or more ideals (in the Platonic sense) or prototypes. The resulting time series can then beChang Wei Tan · Fran cois Petitjean· Geoffrey I. Webb Faculty of Information Technology 25 Exhibition Walk Monash University, Melbourne VIC 3800, Australia Email: firstname.lastname@example.org,email@example.com,firstname.lastname@example.org An observed time series might differ from the ideal in many ways. Much of the research on time series distance measures in the last decade can be seen as the introduction of techniques to mitigate these differences, either as a preprocessing step or directly in a distance measure. For example, variations in amplitude and offset are typically addressed in time series classification by normalization of the series (Rakthanmanon et al., 2012). Some observed values may be erroneous and might be addressed by outlier detection (Basu and Meckesheimer, 2007) and subsequent reinterpolation (Pelletier et al., 2019).
Many time series data mining problems can be solved with repeated use of distance measure. Examples of such tasks include similarity search, clustering, classification, anomaly detection and segmentation. For over two decades it has been known that the Dynamic Time Warping (DTW) distance measure is the best measure to use for most tasks, in most domains. Because the classic DTW algorithm has quadratic time complexity, many ideas have been introduced to reduce its amortized time, or to quickly approximate it. One of the most cited approximate approaches is FastDTW. The FastDTW algorithm has well over a thousand citations and has been explicitly used in several hundred research efforts. In this work, we make a surprising claim. In any realistic data mining application, the approximate FastDTW is much slower than the exact DTW. This fact clearly has implications for the community that uses this algorithm: allowing it to address much larger datasets, get exact results, and do so in less time. Our observation also has a more sobering lesson for the community. This work may serve as a reminder to the community to exercise more caution in uncritically accepting published results.
This paper presents an efficient approach for subsequence search in data streams. The problem consists in identifying coherent repetitions of a given reference time-series, eventually multi-variate, within a longer data stream. Dynamic Time Warping (DTW) is the metric most widely used to implement pattern query, but its computational complexity is a well-known issue. In this paper we present an approach aimed at learning a kernel able to approximate DTW to be used for efficiently analyse streaming data collected from wearable sensors, reducing the burden of computation. Contrary to kernel, DTW allows for comparing time series with different length. Thus, to use a kernel, a feature embedding is used to represent a time-series as a fixed length vector. Each vector component is the DTW between the given time-series and a set of 'basis' series, usually randomly chosen. The vector size is the number of basis series used for the feature embedding. Searching for the portion of the data stream minimizing the DTW with the reference subsequence leads to a global optimization problem. The proposed approach has been validated on a benchmark dataset related to the identification of users depending on their walking activity. A comparison with a traditional DTW implementation is also provided.
The UCR Time Series Archive - introduced in 2002, has become an important resource in the time series data mining community, with at least one thousand published papers making use of at least one dataset from the archive. The original incarnation of the archive had sixteen datasets but since that time, it has gone through periodic expansions. The last expansion took place in the summer of 2015 when the archive grew from 45 datasets to 85 datasets. This paper introduces and will focus on the new data expansion from 85 to 128 datasets. Beyond expanding this valuable resource, this paper offers pragmatic advice to anyone who may wish to evaluate a new algorithm on the archive. Finally, this paper makes a novel and yet actionable claim: of the hundreds of papers that show an improvement over the standard baseline (1-Nearest Neighbor classification), a large fraction may be misattributing the reasons for their improvement. Moreover, they may have been able to achieve the same improvement with a much simpler modification, requiring just a single line of code.
The classification of time series data is a well-studied problem with numerous practical applications, such as medical diagnosis and speech recognition. A popular and effective approach is to classify new time series in the same way as their nearest neighbours, whereby proximity is defined using Dynamic Time Warping (DTW) distance, a measure analogous to sequence alignment in bioinformatics. However, practitioners are not only interested in accurate classification, they are also interested in why a time series is classified a certain way. To this end, we introduce here the problem of finding a minimum length subsequence of a time series, the removal of which changes the outcome of the classification under the nearest neighbour algorithm with DTW distance. Informally, such a subsequence is expected to be relevant for the classification and can be helpful for practitioners in interpreting the outcome. We describe a simple but optimized implementation for detecting these subsequences and define an accompanying measure to quantify the relevance of every time point in the time series for the classification. In tests on electrocardiogram data we show that the algorithm allows discovery of important subsequences and can be helpful in detecting abnormalities in cardiac rhythms distinguishing sick from healthy patients.