Galbrun, Esther
Fast Redescription Mining Using Locality-Sensitive Hashing
Karjalainen, Maiju, Galbrun, Esther, Miettinen, Pauli
A redescription is a pattern that characterises roughly the same entities in two different ways, and redescription mining is the task of automatically extracting redescriptions from the input dataset, given user-defined constraints. Redescription mining has found applications in various fields of science, such as ecometrics. Ecometrics aims to identify and model the functional relationships between traits of organisms and their environments [5, 7]. For instance, the teeth of large plant-eating mammals are adapted to the food that is available in their environment, which in turn depends on the climatic conditions, potentially allowing one to reason about the climate in the past based on the fossil record. To apply redescription mining in this context, the entities in the dataset represent localities, with two sets of attributes recording respectively the distribution of dental traits among species and the climatic conditions at each locality [11, 19].
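As a minimal sketch of the idea above, the snippet below checks whether two queries, one over dental traits and one over climate, select roughly the same localities. The abstract does not name a similarity measure, so the Jaccard index is used here as an assumption. The locality names, attribute names, values, and thresholds are all hypothetical toy data, not taken from the paper.

```python
# Toy dataset (hypothetical): each locality has one dental-trait attribute
# and one climate attribute.
localities = {
    "loc_a": {"mean_hypsodonty": 1.1, "annual_precip_mm": 1400},
    "loc_b": {"mean_hypsodonty": 2.4, "annual_precip_mm": 450},
    "loc_c": {"mean_hypsodonty": 2.1, "annual_precip_mm": 580},
    "loc_d": {"mean_hypsodonty": 2.2, "annual_precip_mm": 900},
    "loc_e": {"mean_hypsodonty": 1.5, "annual_precip_mm": 520},
}


def support(data, predicate):
    """Return the set of entities whose attributes satisfy the predicate."""
    return {name for name, attrs in data.items() if predicate(attrs)}


def jaccard(a, b):
    """Jaccard index of two sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Two descriptions of (hopefully) the same localities, one per attribute set.
# The thresholds are illustrative assumptions.
def dental_query(attrs):
    return attrs["mean_hypsodonty"] >= 2.0


def climate_query(attrs):
    return attrs["annual_precip_mm"] <= 600


supp_dental = support(localities, dental_query)
supp_climate = support(localities, climate_query)
similarity = jaccard(supp_dental, supp_climate)
```

A pair of queries would count as a redescription when this similarity exceeds a user-chosen threshold, which is one possible form of the user-defined constraints mentioned above.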
Discovering Useful Compact Sets of Sequential Rules in a Long Sequence
Bourrand, Erwan, Galárraga, Luis, Galbrun, Esther, Fromont, Elisa, Termier, Alexandre
We are interested in understanding the underlying generation process for long sequences of symbolic events. To do so, we propose COSSU, an algorithm to mine small and meaningful sets of sequential rules. The rules are selected using an MDL-inspired criterion that favors compactness and relies on a novel rule-based encoding scheme for sequences. Our evaluation shows that COSSU can successfully retrieve relevant sets of closed sequential rules from a long sequence. Such rules constitute an interpretable model that exhibits competitive accuracy for the tasks of next-element prediction and classification.
The Minimum Description Length Principle for Pattern Mining: A Survey
Galbrun, Esther
The aim of this document is to review the development of pattern mining methods based on and inspired by the Minimum Description Length (MDL) principle. Although this is an unrealistic goal, we strive for completeness. The reader is expected to be familiar with common pattern mining tasks and techniques, but not necessarily with concepts from information theory and coding, of which we therefore give an outline in Section 2. Background work is covered in Section 3, starting with the theory behind the MDL principle and similar principles, going over a few examples of uses of the principle in the adjacent fields of machine learning and natural language processing, and ending with a review of data mining methods that involve practical compression as a tool or that consider the problem of selecting patterns.