Information Extraction
Exploiting Background Knowledge to Build Reference Sets for Information Extraction
Michelson, Matthew (Fetch Technologies) | Knoblock, Craig A. (University of Southern California / Information Sciences Institute)
Previous work on information extraction from unstructured, ungrammatical text (e.g. classified ads) showed that exploiting a set of background knowledge, called a "reference set," greatly improves the precision and recall of the extractions. However, finding a source for this reference set is often difficult, if not impossible. Further, even if a source is found, it might not overlap well with the text for extraction. In this paper we present an approach to building the reference set directly from the text itself. Our approach eliminates the need to find the source for the reference set, and ensures better overlap between the text and reference set. Starting with a small amount of background knowledge, our technique constructs tuples representing the entities in the text to form a reference set. Our results show that our method outperforms manually constructed reference sets, since hand-built reference sets may not overlap with the entities in the unstructured, ungrammatical text. We also ran experiments comparing our method to the supervised approach of Conditional Random Fields (CRFs) using simple, generic features. These results show our method achieves an improvement in F1-measure for 6 of 9 attributes and is competitive in performance on the others, all without training data.
Expanding Domain Sentiment Lexicon through Double Propagation
Qiu, Guang (College of Computer Science, Zhejiang University) | Liu, Bing (Department of Computer Science, University of Illinois at Chicago) | Bu, Jiajun (College of Computer Science, Zhejiang University) | Chen, Chun (College of Computer Science, Zhejiang University)
In most sentiment analysis applications, the sentiment lexicon plays a key role. However, it is hard, if not impossible, to collect and maintain a universal sentiment lexicon for all application domains because different words may be used in different domains. The main existing technique extracts such sentiment words from a large domain corpus based on different conjunctions and the idea of sentiment coherency in a sentence. In this paper, we propose a novel propagation approach that exploits the relations between sentiment words and topics or product features that the sentiment words modify, and also sentiment words and product features themselves to extract new sentiment words. As the method propagates information through both sentiment words and features, we call it double propagation. The extraction rules are designed based on relations described in dependency trees. A new method is also proposed to assign polarities to newly discovered sentiment words in a domain. Experimental results show that our approach is able to extract a large number of new sentiment words. The polarity assignment method is also effective.
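The propagation idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract says the extraction rules are based on dependency-tree relations, whereas this sketch assumes those relations have already been extracted into plain word pairs. The function name `double_propagation` and the three pair lists are hypothetical, and polarity assignment is not covered.

```python
def double_propagation(seed_sentiment, modifies, sentiment_conj, feature_conj):
    """Grow sentiment-word and feature sets from a small seed lexicon.

    modifies:       (sentiment_word, feature) pairs, e.g. from a parser (assumed input)
    sentiment_conj: pairs of words joined by a conjunction, both candidate sentiment words
    feature_conj:   pairs of nouns joined by a conjunction, both candidate features
    """
    sentiment = set(seed_sentiment)
    features = set()
    changed = True
    while changed:  # repeat until no rule adds anything new
        changed = False
        for word, target in modifies:
            # Known sentiment word -> the thing it modifies is a feature.
            if word in sentiment and target not in features:
                features.add(target)
                changed = True
            # Known feature -> the word modifying it is a sentiment word.
            if target in features and word not in sentiment:
                sentiment.add(word)
                changed = True
        for a, b in sentiment_conj:
            # One side known -> the conjoined word is a sentiment word too.
            if (a in sentiment) != (b in sentiment):
                sentiment.update((a, b))
                changed = True
        for a, b in feature_conj:
            # One side known -> the conjoined noun is a feature too.
            if (a in features) != (b in features):
                features.update((a, b))
                changed = True
    return sentiment, features


# Toy, made-up relations for a product-review domain.
sentiment, features = double_propagation(
    seed_sentiment={"good"},
    modifies=[("good", "screen"), ("crisp", "screen"), ("crisp", "audio")],
    sentiment_conj=[("crisp", "vivid")],
    feature_conj=[("audio", "battery")],
)
```

Starting from the single seed "good", the loop alternates between discovering features through known sentiment words and discovering sentiment words through known features, which is the sense in which information propagates in both directions.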
Creating Relational Data from Unstructured and Ungrammatical Data Sources
Michelson, M., Knoblock, C. A.
In order for agents to act on behalf of users, they will have to retrieve and integrate vast amounts of textual data on the World Wide Web. However, much of the useful data on the Web is neither grammatical nor formally structured, making querying difficult. Examples of these types of data sources are online classifieds like Craigslist and auction item listings like eBay. We call this unstructured, ungrammatical data "posts." The unstructured nature of posts makes query and integration difficult because the attributes are embedded within the text. Also, these attributes do not conform to standardized values, which prevents queries based on a common attribute value. The schema is unknown and the values may vary dramatically making accurate search difficult. Creating relational data for easy querying requires that we define a schema for the embedded attributes and extract values from the posts while standardizing these values. Traditional information extraction (IE) is inadequate to perform this task because it relies on clues from the data, such as structure or natural language, neither of which are found in posts. Furthermore, traditional information extraction does not incorporate data cleaning, which is necessary to accurately query and integrate the source. The two-step approach described in this paper creates relational data sets from unstructured and ungrammatical text by addressing both issues. To do this, we require a set of known entities called a "reference set." The first step aligns each post to each member of each reference set. This allows our algorithm to define a schema over the post and include standard values for the attributes defined by this schema. The second step performs information extraction for the attributes, including attributes not easily represented by reference sets, such as a price. 
In this manner we create a relational structure over previously unstructured data, supporting deep and accurate queries over the data as well as standard values for integration. Our experimental results show that our technique matches the posts to the reference set accurately and efficiently and outperforms state-of-the-art extraction systems on the extraction task from posts.
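As a rough illustration of the two steps described above, the sketch below aligns a post to a reference set and then extracts an attribute that the reference set does not cover. It is an assumption-laden toy, not the paper's algorithm: the alignment uses plain token overlap as a stand-in for the actual matching method, and `REFERENCE_SET`, `best_match`, `extract_price`, and `to_row` are hypothetical names with made-up data.

```python
import re

# Hypothetical reference set: known entities with standardized attribute values.
REFERENCE_SET = [
    {"make": "Honda", "model": "Civic"},
    {"make": "Honda", "model": "Accord"},
    {"make": "Toyota", "model": "Camry"},
]


def tokens(text):
    """Lowercased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_match(post, reference_set):
    """Step 1 (simplified): pick the record sharing the most tokens with the post."""
    post_tokens = tokens(post)
    best, best_score = None, 0.0
    for record in reference_set:
        record_tokens = tokens(" ".join(record.values()))
        score = len(post_tokens & record_tokens) / len(record_tokens)
        if score > best_score:
            best, best_score = record, score
    return best, best_score


def extract_price(post):
    """Step 2 (simplified): pull out an attribute not in the reference set."""
    match = re.search(r"\$\s?(\d[\d,]*)", post)
    return int(match.group(1).replace(",", "")) if match else None


def to_row(post):
    """Build one relational row: standardized values plus the extracted price."""
    record, _score = best_match(post, REFERENCE_SET)
    row = dict(record) if record else {}
    row["price"] = extract_price(post)
    return row


row = to_row("93 civic 4dr, runs great $1,200 obo")
```

Note that the post never mentions "Honda"; the make is filled in from the matched reference record, which is how alignment supplies standard values for querying. Exact token matching would miss misspellings and abbreviations common in real posts, so it should be read only as a placeholder for a more robust matcher.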
MITA: An Information-Extraction Approach to the Analysis of Free-Form Text in Life Insurance Applications
Glasgow, Barry, Mandell, Alan, Binney, Dan, Ghemri, Lila, Fisher, David
MetLife processes over 260,000 life insurance applications a year. Underwriting of these applications is labor intensive. Automation is difficult because the applications include many free-form text fields. MetLife's intelligent text analyzer (MITA) uses the information-extraction technique of natural language processing to structure the extensive textual fields on a life insurance application. Knowledge engineering, with the help of underwriters as domain experts, was performed to elicit significant concepts for both medical and occupational textual fields. A corpus of 20,000 life insurance applications provided the syntactical and semantic patterns in which these underwriting concepts occur. These patterns, in conjunction with the concepts, formed the frameworks for information extraction. Extension of the information-extraction work developed by Wendy Lehnert was used to populate these frameworks with classes obtained from the systematized nomenclature of human and veterinary medicine and the Dictionary of Occupational Titles ontologies. These structured frameworks can then be analyzed by conventional knowledge-based systems. MITA is currently processing 20,000 life insurance applications a month. Eighty-nine percent of the textual fields processed by MITA exceed the established confidence-level threshold and are potentially available for further analysis by domain-specific analyzers.
Empirical Methods in Information Extraction
This article surveys the use of empirical, machine-learning methods for a particular natural language-understanding task: information extraction. The author presents a generic architecture for information-extraction systems and then surveys the learning algorithms that have been developed to address the problems of accuracy, portability, and knowledge acquisition for each component of the architecture.
Wrap-Up: a Trainable Discourse Module for Information Extraction
The vast amounts of on-line text now available have led to renewed interest in information extraction (IE) systems that analyze unrestricted text, producing a structured representation of selected information from the text. This paper presents a novel approach that uses machine learning to acquire knowledge for some of the higher level IE processing. Wrap-Up is a trainable IE discourse component that makes intersentential inferences and identifies logical relations among information extracted from the text. Previous corpus-based approaches were limited to lower level processing such as part-of-speech tagging, lexical disambiguation, and dictionary construction. Wrap-Up is fully trainable, and not only automatically decides what classifiers are needed, but even derives the feature set for each classifier automatically. Performance equals that of a partially trainable discourse module requiring manual customization for each domain.
Pattern Matching and Discourse Processing in Information Extraction from Japanese Text
Kitani, T., Eriguchi, Y., Hara, M.
Information extraction is the task of automatically picking up information of interest from an unconstrained text. Information of interest is usually extracted in two steps. First, sentence level processing locates relevant pieces of information scattered throughout the text; second, discourse processing merges coreferential information to generate the output. In the first step, pieces of information are locally identified without recognizing any relationships among them. A key word search or simple pattern search can achieve this purpose. The second step requires deeper knowledge in order to understand relationships among separately identified pieces of information. Previous information extraction systems focused on the first step, partly because they were not required to link up each piece of information with other pieces. To link the extracted pieces of information and map them onto a structured output format, complex discourse processing is essential. This paper reports on a Japanese information extraction system that merges information using a pattern matcher and discourse processor. Evaluation results show a high level of system performance which approaches human performance.
Automatically constructing a dictionary for information extraction tasks
Knowledge-based natural language processing systems have achieved good success with certain tasks but they are often criticized because they depend on a domain-specific dictionary that requires a great deal of manual knowledge engineering. This knowledge engineering bottleneck makes knowledge-based NLP systems impractical for real-world applications because they cannot be easily scaled up or ported to new domains. In response to this problem, we developed a system called AutoSlog that automatically builds a domain-specific dictionary of concepts for extracting information from text. Using AutoSlog, we constructed a dictionary for the domain of terrorist event descriptions in only 5 person-hours. We then compared the AutoSlog dictionary with a handcrafted dictionary that was built by two highly skilled graduate students and required approximately 1500 person-hours of effort. We evaluated the two dictionaries using two blind test sets of 100 texts each. Overall, the AutoSlog dictionary achieved 98% of the performance of the handcrafted dictionary. On the first test set, the AutoSlog dictionary obtained 96.3% of the performance of the handcrafted dictionary. On the second test set, the overall scores were virtually indistinguishable with the AutoSlog dictionary achieving 99.7% of the performance of the handcrafted dictionary.