significant pattern
Temporal Window Smoothing of Exogenous Variables for Improved Time Series Prediction
Kamal, Mustafa, Hashem, Niyaz Bin, Krambroeckers, Robin, Mohammed, Nabeel, Rahman, Shafin
Although most transformer-based time series forecasting models primarily depend on endogenous inputs, recent state-of-the-art approaches have significantly improved performance by incorporating external information through exogenous inputs. However, these methods face challenges, such as redundancy when endogenous and exogenous inputs originate from the same source and limited ability to capture long-term dependencies due to fixed look-back windows. In this paper, we propose a method that whitens the exogenous input to reduce redundancy that may persist within the data based on global statistics. Additionally, our approach helps the exogenous input to be more aware of patterns and trends over extended periods. By introducing this refined, globally context-aware exogenous input to the endogenous input without increasing the lookback window length, our approach guides the model towards improved forecasting. Our approach achieves state-of-the-art performance in four benchmark datasets, consistently outperforming 11 baseline models. These results establish our method as a robust and effective alternative for using exogenous inputs in time series forecasting.
- Pacific Ocean > North Pacific Ocean > San Francisco Bay (0.04)
- North America > United States > California > San Francisco County > San Francisco (0.04)
- North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.04)
- (3 more...)
- Research Report > Promising Solution (0.48)
- Overview > Innovation (0.34)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.82)
Efficient Discovery of Significant Patterns with Few-Shot Resampling
Pellegrina, Leonardo, Vandin, Fabio
Significant pattern mining is a fundamental task in mining transactional data, requiring to identify patterns significantly associated with the value of a given feature, the target. In several applications, such as biomedicine, basket market analysis, and social networks, the goal is to discover patterns whose association with the target is defined with respect to an underlying population, or process, of which the dataset represents only a collection of observations, or samples. A natural way to capture the association of a pattern with the target is to consider its statistical significance, assessing its deviation from the (null) hypothesis of independence between the pattern and the target. While several algorithms have been proposed to find statistically significant patterns, it remains a computationally demanding task, and for complex patterns such as subgroups, no efficient solution exists. We present FSR, an efficient algorithm to identify statistically significant patterns with rigorous guarantees on the probability of false discoveries. FSR builds on a novel general framework for mining significant patterns that captures some of the most commonly considered patterns, including itemsets, sequential patterns, and subgroups. FSR uses a small number of resampled datasets, obtained by assigning i.i.d. labels to each transaction, to rigorously bound the supremum deviation of a quality statistic measuring the significance of patterns. FSR builds on novel tight bounds on the supremum deviation that require to mine a small number of resampled datasets, while providing a high effectiveness in discovering significant patterns. As a test case, we consider significant subgroup mining, and our evaluation on several real datasets shows that FSR is effective in discovering significant subgroups, while requiring a small number of resampled datasets.
- North America > United States > New York > New York County > New York City (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Italy (0.04)
- (3 more...)
- Research Report > Experimental Study (0.48)
- Research Report > New Finding (0.46)
- Information Technology (0.48)
- Health & Medicine (0.46)
DataVinci: Learning Syntactic and Semantic String Repairs
Singh, Mukul, Cambronero, José, Gulwani, Sumit, Le, Vu, Negreanu, Carina, Verbruggen, Gust
String data is common in real-world datasets: 67.6% of values in a sample of 1.8 million real Excel spreadsheets from the web were represented as text. Systems that successfully clean such string data can have a significant impact on real users. While prior work has explored errors in string data, proposed approaches have often been limited to error detection or require that the user provide annotations, examples, or constraints to fix the errors. Furthermore, these systems have focused independently on syntactic errors or semantic errors in strings, but ignore that strings often contain both syntactic and semantic substrings. We introduce DataVinci, a fully unsupervised string data error detection and repair system. DataVinci learns regular-expression-based patterns that cover a majority of values in a column and reports values that do not satisfy such patterns as data errors. DataVinci can automatically derive edits to the data error based on the majority patterns and constraints learned over other columns without the need for further user interaction. To handle strings with both syntactic and semantic substrings, DataVinci uses an LLM to abstract (and re-concretize) portions of strings that are semantic prior to learning majority patterns and deriving edits. Because not all data can result in majority patterns, DataVinci leverages execution information from an existing program (which reads the target data) to identify and correct data repairs that would not otherwise be identified. DataVinci outperforms 7 baselines on both error detection and repair when evaluated on 4 existing and new benchmarks.
- North America > United States > New York (0.04)
- North America > United States > Nevada (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- (4 more...)
Unsupervised Context Sensitive Language Acquisition from a Large Corpus
Solan, Zach, Horn, David, Ruppin, Eytan, Edelman, Shimon
We describe a pattern acquisition algorithm that learns, in an unsupervised fashion, a streamlined representation of linguistic structures from a plain natural-language corpus. This paper addresses the issues of learning structured knowledge from a large-scale natural language data set, and of generalization to unseen text. The implemented algorithm represents sentences as paths on a graph whose vertices are words (or parts of words). Significant patterns, determined by recursive context-sensitive statistical inference, form new vertices. Linguistic constructions are represented by trees composed of significant patterns and their associated equivalence classes. An input module allows the algorithm to be subjected to a standard test of English as a Second Language (ESL) proficiency. The results are encouraging: the model attains a level of performance considered to be "intermediate" for 9th-grade students, despite having been trained on a corpus (CHILDES) containing transcribed speech of parents directed to small children.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.05)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- (7 more...)
Unsupervised Context Sensitive Language Acquisition from a Large Corpus
Solan, Zach, Horn, David, Ruppin, Eytan, Edelman, Shimon
We describe a pattern acquisition algorithm that learns, in an unsupervised fashion, a streamlined representation of linguistic structures from a plain natural-language corpus. This paper addresses the issues of learning structured knowledge from a large-scale natural language data set, and of generalization to unseen text. The implemented algorithm represents sentences as paths on a graph whose vertices are words (or parts of words). Significant patterns, determined by recursive context-sensitive statistical inference, form new vertices. Linguistic constructions are represented by trees composed of significant patterns and their associated equivalence classes. An input module allows the algorithm to be subjected to a standard test of English as a Second Language (ESL) proficiency. The results are encouraging: the model attains a level of performance considered to be "intermediate" for 9th-grade students, despite having been trained on a corpus (CHILDES) containing transcribed speech of parents directed to small children.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.05)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- (7 more...)
Unsupervised Context Sensitive Language Acquisition from a Large Corpus
Solan, Zach, Horn, David, Ruppin, Eytan, Edelman, Shimon
We describe a pattern acquisition algorithm that learns, in an unsupervised fashion,a streamlined representation of linguistic structures from a plain natural-language corpus. This paper addresses the issues of learning structuredknowledge from a large-scale natural language data set, and of generalization to unseen text. The implemented algorithm represents sentencesas paths on a graph whose vertices are words (or parts of words). Significant patterns, determined by recursive context-sensitive statistical inference, form new vertices. Linguistic constructions are represented bytrees composed of significant patterns and their associated equivalence classes. An input module allows the algorithm to be subjected toa standard test of English as a Second Language (ESL) proficiency. Theresults are encouraging: the model attains a level of performance consideredto be "intermediate" for 9th-grade students, despite having been trained on a corpus (CHILDES) containing transcribed speech of parents directed to small children.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.05)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- (6 more...)