Many autonomous and heterogeneous information sources are becoming increasingly available to users through the Internet, especially through the World Wide Web. In order to make the information available in a consolidated, uniform, and efficient manner, it is necessary to integrate the different information sources. The integration of Internet sources poses several challenges that have not been sufficiently addressed by work on the integration of corporate databases residing on an Intranet [LMR90]. We believe that the most important ones are heterogeneity, large number of sources, redundancy, availability, source autonomy, and diverse access methods and querying interfaces.
We present a computationally efficient technique to compute the distance of high-dimensional appearance descriptor vectors between image windows. The method exploits the relation between appearance distance and spatial overlap. We derive an upper bound on appearance distance given the spatial overlap of two windows in an image, and use it to bound the distances of many pairs between two images. We propose algorithms that build on these basic operations to efficiently solve tasks relevant to many computer vision applications, such as finding all pairs of windows between two images with distance smaller than a threshold, or finding the single pair with the smallest distance. In experiments on the PASCAL VOC 07 dataset, our algorithms accurately solve these problems while greatly reducing the number of appearance distances computed, and achieve larger speedups than approximate nearest neighbour algorithms based on trees and on hashing. For example, our algorithm finds the most similar pair of windows between two images while computing only 1% of all distances on average.
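To make the idea of bounding distances from spatial overlap more concrete, here is a minimal Python sketch. It rests on assumptions that are not stated in the abstract: spatial overlap is measured as intersection-over-union of axis-aligned boxes, the appearance distance obeys the triangle inequality (as Euclidean distance does), and the overlap-derived upper bound is simply passed in as a number, since its actual form is not given here. The names `iou` and `can_skip` are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2).

    Assumption: this is one common way to measure spatial overlap; the
    abstract does not say which overlap measure is used.
    """
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def can_skip(d_ref, neighbour_bound, threshold):
    """Decide whether a pair (w', v) can be skipped without computing d(w', v).

    d_ref           -- already computed distance d(w, v)
    neighbour_bound -- upper bound on d(w, w'), e.g. derived from the overlap
                       of w and w' (hypothetical input, not the paper's formula)
    threshold       -- only pairs with distance <= threshold are of interest

    By the triangle inequality, d(w', v) >= d(w, v) - d(w, w')
                                         >= d_ref - neighbour_bound.
    If that lower bound already exceeds the threshold, the pair cannot qualify.
    """
    return d_ref - neighbour_bound > threshold
```

For instance, `iou((0, 0, 2, 2), (1, 1, 3, 3))` evaluates to 1/7, and `can_skip(10.0, 2.0, 5.0)` is true because the distance to the neighbouring window can be no smaller than 8.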
The goal of this notebook is to build and analyse a map of the 10,000 most popular subreddits on Reddit. To do this we need a means to measure the similarity of two subreddits. In a great article on FiveThirtyEight, Trevor Martin analysed subreddits by considering the overlap of users commenting on two different subreddits. Our interest is a little broader -- we want to map out and visualize the space of subreddits, and attempt to cluster subreddits into their natural groups. With that done, we can explore some of the clusters and find interesting stories to tell.
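As a rough illustration of measuring similarity through commenter overlap, the sketch below scores each pair of subreddits by the Jaccard index of their commenter sets. This is an assumption made for illustration only: the text above does not say which overlap measure is used, and the data, the `commenters` dictionary, and the function names are all hypothetical.

```python
from itertools import combinations

# Hypothetical toy data: subreddit name -> set of commenter ids.
commenters = {
    "python": {"ann", "bob", "cat", "dan"},
    "learnpython": {"bob", "cat", "dan", "eve"},
    "aww": {"eve", "fay"},
}


def jaccard(a, b):
    """Shared commenters divided by all commenters of either subreddit."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_similarity(data):
    """Return {(name_a, name_b): similarity} for every unordered pair."""
    return {
        (s, t): jaccard(data[s], data[t])
        for s, t in combinations(sorted(data), 2)
    }


similarities = pairwise_similarity(commenters)
```

In this toy data, "learnpython" and "python" share three of five distinct commenters, so they score 0.6, while "aww" and "python" share none and score 0. A real analysis at the scale of 10,000 subreddits would likely need a sparse matrix rather than Python sets.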
There will be gift-giving as well as trees and menorahs aplenty this weekend as Christmas and Hanukkah exactly overlap for the first time in almost 30 years. Hanukkah may be seen by many non-Jews as the Jewish equivalent of Christmas, but it is rare that the two holidays coincide and rarer still that they begin at the same time. But, in 2016, the first day of Hanukkah will be Dec. 25, although in the Hebrew calendar days begin at sundown the day before, meaning the celebrations will actually get going on Christmas Eve. The last time that happened was in 1978. The two holidays do overlap more frequently, the last time being five years ago when Christmas fell midway between the eight days of Hanukkah.
Semantic similarity measures, or inversely, semantic distance measures, are useful in a variety of circumstances, from spell-checking applications to a lightweight replacement for parsing within a natural language engine. Within this work, we examine the (Jiang & Conrath 1997) algorithm, evaluated by (Budanitsky & Hirst 2000) as being the best performing, and subject the algorithm to a series of tests. We also propose a novel technique which corrects a crucial weakness of the original algorithm, and show that its application improves semantic distance measures for cases where the underlying linguistic network causes deficiencies.
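For readers unfamiliar with the measure, the Jiang & Conrath distance is commonly stated as dist(c1, c2) = IC(c1) + IC(c2) - 2 * IC(lcs(c1, c2)), where IC(c) = -log p(c) is the information content of a concept and lcs is the lowest common subsumer in the taxonomy. The sketch below applies that standard formula to a hypothetical toy taxonomy with made-up corpus counts. It illustrates the original measure only and does not attempt the correction proposed in this work.

```python
import math

# Hypothetical toy taxonomy (child -> parent); "animal" is the root.
parent = {
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
    "bird": "animal",
    "animal": None,
}

# Hypothetical cumulative corpus counts: each concept's count includes
# the counts of everything beneath it in the taxonomy.
count = {"animal": 100, "mammal": 60, "bird": 40, "dog": 40, "cat": 20}


def ancestors(concept):
    """Return the concept followed by its ancestors, nearest first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = parent[concept]
    return chain


def information_content(concept):
    """IC(c) = -log p(c), with p(c) estimated from the cumulative counts."""
    return -math.log(count[concept] / count["animal"])


def lowest_common_subsumer(a, b):
    """First ancestor of a (walking upwards) that is also an ancestor of b."""
    ancestors_of_b = set(ancestors(b))
    for node in ancestors(a):
        if node in ancestors_of_b:
            return node
    raise ValueError("concepts share no ancestor")


def jcn_distance(a, b):
    """Jiang & Conrath distance: IC(a) + IC(b) - 2 * IC(lcs(a, b))."""
    lcs = lowest_common_subsumer(a, b)
    return (information_content(a) + information_content(b)
            - 2 * information_content(lcs))
```

With these counts, `jcn_distance("dog", "cat")` equals ln(4.5), which is smaller than `jcn_distance("dog", "bird")` at ln(6.25), because "dog" and "cat" share the more specific subsumer "mammal". The result depends entirely on the structure and counts of the underlying taxonomy.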