Tree Learning: Optimal Algorithms and Sample Complexity
Avdiukhin, Dmitrii, Yaroslavtsev, Grigory, Vainstein, Danny, Fischer, Orr, Das, Sauman, Mirza, Faraz
–arXiv.org Artificial Intelligence
The algorithmic problem of constructing hierarchical data representations has been of major importance for many decades, due to its applications in statistics [Ward Jr, 1963, Gower and Ross, 1969], entomology [Michener and Sokal, 1957], plant biology [Sorensen, 1948], genomics [Eisen et al., 1998] and other domains. Efficient collection of labeled data is a problem of key importance for construction of hierarchical data representations. In this paper we consider the problem of constructing tree representations of data from labeled samples, focusing on understanding the optimal number of samples required for this task. The most basic type of a label that allows one to construct a tree representation consists of a triplet of points (x, y, z) labeled according to the induced hierarchical structure within the triplet. For example, given images of a cat, a dog, and a plane, the label would describe the cat and the dog as being more similar to each other than to the plane. This "odd one out" type of label is the simplest one to collect in a crowdsourcing setting. In this paper, we focus on understanding the number of such labeled samples required in order to construct a tree representation of the underlying data, which enables one to accurately predict subsequent labels in the future.
arXiv.org Artificial Intelligence
Feb-9-2023