joinability
Snoopy: Effective and Efficient Semantic Join Discovery via Proxy Columns
Guo, Yuxiang, Mao, Yuren, Hu, Zhonghao, Chen, Lu, Gao, Yunjun
Semantic join discovery, which aims to find columns in a table repository with high semantic joinabilities to a query column, is crucial for dataset discovery. Existing methods can be divided into two categories: cell-level methods and column-level methods. However, neither of them ensures both effectiveness and efficiency simultaneously. Cell-level methods, which compute the joinability by counting cell matches between columns, enjoy ideal effectiveness but suffer from poor efficiency. In contrast, column-level methods, which determine joinability only by computing the similarity of column embeddings, are efficient but suffer from poor effectiveness due to three issues in their column embeddings: (i) semantics-joinability-gap, (ii) size limit, and (iii) permutation sensitivity. To address these issues, this paper proposes to compute column embeddings via proxy columns; furthermore, a novel column-level semantic join discovery framework, Snoopy, is presented, leveraging proxy-column-based embeddings to bridge effectiveness and efficiency. Specifically, the proposed column embeddings are derived from the implicit column-to-proxy-column relationships, which are captured by the lightweight approximate-graph-matching-based column projection. To acquire good proxy columns for guiding the column projection, we introduce a rank-aware contrastive learning paradigm. Extensive experiments on four real-world datasets demonstrate that Snoopy outperforms SOTA column-level methods by 16% in Recall@25 and 10% in NDCG@25, and achieves superior efficiency, being at least 5 orders of magnitude faster than cell-level solutions and 3.5x faster than existing column-level methods.
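The contrast between cell-level and column-level scoring described in this abstract can be illustrated with a minimal sketch. It is not Snoopy's proxy-column method: the character-trigram "embedding", the threshold `tau`, and all function names are assumptions chosen only to keep the example self-contained. The cell-level score compares every query cell with every target cell, while the column-level score compares one vector per column.

```python
from collections import Counter
from math import sqrt

def trigram_vector(cell):
    """Toy cell embedding (assumption): bag of character trigrams, not a real text encoder."""
    text = f"  {cell.strip().lower()} "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def cell_level_joinability(query, target, tau=0.6):
    """Fraction of query cells that have at least one target cell with similarity >= tau.

    Cost grows with len(query) * len(target), which is why cell-level methods are slow.
    """
    if not query:
        return 0.0
    target_vecs = [trigram_vector(c) for c in target]
    matched = 0
    for cell in query:
        qv = trigram_vector(cell)
        if any(cosine(qv, tv) >= tau for tv in target_vecs):
            matched += 1
    return matched / len(query)

def column_level_similarity(query, target):
    """Cosine between summed cell vectors: a single comparison per column pair."""
    qv = sum((trigram_vector(c) for c in query), Counter())
    tv = sum((trigram_vector(c) for c in target), Counter())
    return cosine(qv, tv)
```

The column-level score is cheap but only a proxy for the cell-level count, which is the effectiveness gap the abstract refers to.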
DeepJoin: Joinable Table Discovery with Pre-trained Language Models
Dong, Yuyang, Xiao, Chuan, Nozawa, Takuma, Enomoto, Masafumi, Oyamada, Masafumi
Due to its usefulness in data enrichment for data analysis tasks, joinable table discovery has become an important operation in data lake management. Existing approaches target equi-joins, the most common way of combining tables for creating a unified view, or semantic joins, which tolerate misspellings and different formats to deliver more join results. They are either exact solutions whose running time is linear in the sizes of query column and target table repository or approximate solutions lacking precision. In this paper, we propose Deepjoin, a deep learning model for accurate and efficient joinable table discovery. Our solution is an embedding-based retrieval, which employs a pre-trained language model (PLM) and is designed as one framework serving both equi- and semantic joins. We propose a set of contextualization options to transform column contents to a text sequence. The PLM reads the sequence and is fine-tuned to embed columns to vectors such that columns are expected to be joinable if they are close to each other in the vector space. Since the output of the PLM is fixed in length, the subsequent search procedure becomes independent of the column size. With a state-of-the-art approximate nearest neighbor search algorithm, the search time is logarithmic in the repository size. To train the model, we devise techniques for preparing training data as well as data augmentation. The experiments on real datasets demonstrate that by training on a small subset of a corpus, Deepjoin generalizes to large datasets and its precision consistently outperforms that of other approximate solutions. Deepjoin is even more accurate than an exact solution to semantic joins when evaluated with labels from experts. Moreover, when equipped with a GPU, Deepjoin is up to two orders of magnitude faster than existing solutions.
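The pipeline this abstract outlines (column to text, text to fixed-length vector, nearest-neighbor search) can be sketched as follows. Everything here is a stand-in: the text format in `column_to_text` is a hypothetical contextualization, `embed` is a hashed bag of tokens rather than a fine-tuned PLM, and `top_k` is a brute-force scan where an approximate nearest neighbor index would normally be used. The point is only that the vector length, and hence the search cost per candidate, does not depend on the column size.

```python
import hashlib
import math

DIM = 64  # fixed embedding length (arbitrary value for this sketch)

def column_to_text(name, cells, max_cells=50):
    """Hypothetical contextualization: column name followed by distinct cell values."""
    distinct = list(dict.fromkeys(c.strip().lower() for c in cells))[:max_cells]
    return f"{name}: " + ", ".join(distinct)

def embed(text):
    """Toy stand-in for a PLM encoder: hashed token counts, L2-normalised to length DIM."""
    vec = [0.0] * DIM
    for token in text.replace(",", " ").replace(":", " ").split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def top_k(query_vec, index, k=2):
    """Brute-force inner-product search over {column_id: vector}; ANN would replace this."""
    scored = [(sum(a * b for a, b in zip(query_vec, vec)), col_id)
              for col_id, vec in index.items()]
    return [col_id for _, col_id in sorted(scored, reverse=True)[:k]]

# Usage: build the index once, then answer queries against it.
repository = {
    "cities_a": ("city", ["Paris", "Berlin", "Madrid", "Rome"]),
    "cities_b": ("town", ["Berlin", "Rome", "Vienna"]),
    "fruits": ("fruit", ["apple", "banana", "cherry"]),
}
index = {cid: embed(column_to_text(name, cells)) for cid, (name, cells) in repository.items()}
query_vec = embed(column_to_text("city", ["Rome", "Paris", "Berlin"]))
results = top_k(query_vec, index, k=2)
```

With this toy encoder the ranking reflects only token overlap; tolerance to misspellings and formats would have to come from the trained model the abstract describes.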
Syntactic Confluence Criteria for Positive/Negative-Conditional Term Rewriting Systems
We study how the following known ideas for showing confluence of unconditional or conditional term rewriting systems can be combined into practically more useful confluence criteria for conditional systems: our syntactical separation into constructor and non-constructor symbols; Huet's introduction and Toyama's generalization of parallel closedness for non-noetherian unconditional systems; the use of shallow confluence for proving confluence of noetherian and non-noetherian conditional systems; the idea that certain kinds of limited confluence can be assumed when checking the fulfilledness or infeasibility of the conditions of conditional critical pairs; and the idea that (when termination is given) only prime superpositions have to be considered and certain normalization restrictions can be applied to the substitutions fulfilling the conditions of conditional critical pairs. Besides combining and improving known methods, we present the following new ideas and results: we strengthen the criterion for overlay joinable noetherian systems, and, by using the expressiveness of our syntactical separation into constructor and non-constructor symbols, we are able to present criteria for level confluence that are not actually criteria for shallow confluence, and to weaken the severe requirement of normality (stiffened with left-linearity) in the criteria for shallow confluence of noetherian and non-noetherian conditional systems to the easily satisfied requirement of quasi-normality. Finally, the paper as a whole may also serve as a practically useful overview of the syntactical means for showing confluence of conditional term rewriting systems.
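In this abstract "joinable" has the rewriting sense: two terms are joinable if both can be rewritten to a common term, which is unrelated to joining table columns. A minimal sketch of that notion follows, under strong simplifying assumptions: it uses unconditional string rewriting rather than conditional term rewriting, it explores reachable strings exhaustively (so it only suits small, terminating examples), and the function names and rule sets are illustrative. It does not implement any of the paper's criteria.

```python
def one_step(term, rules):
    """All strings obtained from `term` by applying one rule at one position."""
    results = set()
    for lhs, rhs in rules:
        start = term.find(lhs)
        while start != -1:
            results.add(term[:start] + rhs + term[start + len(lhs):])
            start = term.find(lhs, start + 1)
    return results

def reachable(term, rules, limit=10_000):
    """Strings reachable in zero or more steps; `limit` guards against non-termination."""
    seen, frontier = {term}, [term]
    while frontier:
        current = frontier.pop()
        for nxt in one_step(current, rules):
            if nxt not in seen:
                if len(seen) >= limit:
                    raise RuntimeError("search limit reached")
                seen.add(nxt)
                frontier.append(nxt)
    return seen

def joinable(s, t, rules):
    """True if s and t can both be rewritten to some common string."""
    return not reachable(s, rules).isdisjoint(reachable(t, rules))

# "abb" rewrites to both "bb" and "ab"; under these rules the two results meet at "b".
confluent_rules = [("ab", "b"), ("bb", "b")]
peak_ok = joinable("bb", "ab", confluent_rules)

# "aba" rewrites to both "ba" and "aa"; under these rules the two results never meet.
diverging_rules = [("ab", "b"), ("ba", "a")]
peak_bad = joinable("ba", "aa", diverging_rules)
```

A peak whose two results are not joinable, as in the second rule set, is a witness that the system is not confluent.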