Chenthamarakshan, Vijil (IBM T J Watson Research Center Yorktown Heights) | Melville, Prem (IBM T J Watson Research Center Yorktown Heights) | Sindhwani, Vikas (IBM T J Watson Research Center Yorktown Heights) | Lawrence, Richard D (IBM T J Watson Research Center Yorktown Heights)
The rapid construction of supervised text classification models is becoming a pervasive need across many modern applications. To reduce human-labeling bottlenecks, many new statistical paradigms (e.g., active, semi-supervised, transfer and multi-task learning) have been vigorously pursued in recent literature with varying degrees of empirical success. Concurrently, the emergence of Web 2.0 platforms in the last decade has enabled a world-wide, collaborative human effort to construct a massive ontology of concepts with very rich, detailed and accurate descriptions. In this paper we propose a new framework to extract supervisory information from such ontologies and complement it with a shift in human effort from direct labeling of examples in the domain of interest to the much more efficient identification of concept-class associations. Through empirical studies on text categorization problems using the Wikipedia ontology, we show that this shift allows very high-quality models to be immediately induced at virtually no cost.
This paper presents an approach to acquire knowledge from Wikipedia categories and the category network. Many Wikipedia categories have complex names which reflect human classification and organizing instances, and thus encode knowledge about class attributes, taxonomic and other semantic relations. We decode the names and refer back to the network to induce relations between concepts in Wikipedia represented through pages or categories. The category structure allows us to propagate a relation detected between constituents of a category name to numerous concept links. The results of the process are evaluated against ResearchCyc and a subset also by human judges. The results support the idea that Wikipedia category names are a rich source of useful and accurate knowledge.
Integration of ontologies begins with establishing mappings between their concept entries. We map categories from the largest manually-built ontology, Cyc, onto Wikipedia articles describing corresponding concepts. Our method draws both on Wikipedia's rich but chaotic hyperlink structure and Cyc's carefully defined taxonomic and commonsense knowledge. On 9,333 manual alignments by one person, we achieve an F-measure of 90%; on 100 alignments by six human subjects the average agreement of the method with the subject is close to their agreement with each other. We cover 62.8% of Cyc categories relating to commonsense knowledge and discuss what further information might be added to Cyc given this substantial new alignment.
We discuss the problems associated with versioning ontologies in distributed environments. This is an important issue because ontologies can be of great use in structuring and querying intemet information, but many of the Intemet's characteristics, such as distributed ownership, rapid evolution, and heterogeneity, make ontology management difficult. We present SHOE, a web-based knowledge representation language that supports multiple versions of ontologies. We then discuss the features of SHOE that address ontology versioning, the affects of ontology revision on SHOE web pages, and methods for implementing ontology integration using SHOE's extension and version mechanisms. 1. Introduction As the use of ontologies becomes more prevalent, there is a more pressing need for good ontology management schemes. This is especially true once an ontology has been used to structure data, since changing it can be very expensive. Often the solution is to "get it right the first time", however, in long term applications, there is always the chance that new information will be discovered or that different features of the domain will become important. Therefore, we must think of ontology development as an ongoing process. In a centralized environment, it may be possible to coordinate ontology revisions with corresponding revisions to the data that was structured using the ontology. However, as the volume of data increases this become more difficult.
We take the category system in Wikipedia as a conceptual network. We label the semantic relations between categories using methods based on connectivity in the network and lexicosyntactic matching. As a result we are able to derive a large scale taxonomy containing a large amount of subsumption, i.e. isa, relations. We evaluate the quality of the created resource by comparing it with ResearchCyc, one of the largest manually annotated ontologies, as well as computing semantic similarity between words in benchmarking datasets.