AITopics | Information Extraction

Information Extraction

Exploiting Background Knowledge to Build Reference Sets for Information Extraction

Michelson, Matthew (Fetch Technologies) | Knoblock, Craig A. (University of Southern California / Information Sciences Institute)

AAAI Conferences, Jun-23-2009

Previous work on information extraction from unstructured, ungrammatical text (e.g., classified ads) showed that exploiting a set of background knowledge, called a "reference set," greatly improves the precision and recall of the extractions. However, finding a source for this reference set is often difficult, if not impossible. Further, even if a source is found, it might not overlap well with the text for extraction. In this paper we present an approach to building the reference set directly from the text itself. Our approach eliminates the need to find the source for the reference set, and ensures better overlap between the text and reference set. Starting with a small amount of background knowledge, our technique constructs tuples representing the entities in the text to form a reference set. Our results show that our method outperforms manually constructed reference sets, since hand-built reference sets may not overlap with the entities in the unstructured, ungrammatical text. We also ran experiments comparing our method to the supervised approach of Conditional Random Fields (CRFs) using simple, generic features. These results show that our method achieves an improvement in F1-measure for 6 of 9 attributes and is competitive on the others, all without training data.

entity tree, extraction, model number, (16 more...)

AAAI Conferences

Twenty-First International Joint Conference on Artificial Intelligence

Country: North America > United States > California > Los Angeles County > El Segundo (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Automobiles & Trucks > Manufacturer (1.00)
Transportation > Passenger (0.94)
Transportation > Ground > Road (0.94)
Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.85)
Information Technology > Data Science > Data Mining > Text Mining (0.61)

Expanding Domain Sentiment Lexicon through Double Propagation

Qiu, Guang (College of Computer Science, Zhejiang University) | Liu, Bing (Department of Computer Science, University of Illinois at Chicago) | Bu, Jiajun (College of Computer Science, Zhejiang University) | Chen, Chun (College of Computer Science, Zhejiang University)

AAAI Conferences, Jun-23-2009

In most sentiment analysis applications, the sentiment lexicon plays a key role. However, it is hard, if not impossible, to collect and maintain a universal sentiment lexicon for all application domains because different words may be used in different domains. The main existing technique extracts such sentiment words from a large domain corpus based on different conjunctions and the idea of sentiment coherency in a sentence. In this paper, we propose a novel propagation approach that exploits the relations between sentiment words and topics or product features that the sentiment words modify, and also sentiment words and product features themselves to extract new sentiment words. As the method propagates information through both sentiment words and features, we call it double propagation. The extraction rules are designed based on relations described in dependency trees. A new method is also proposed to assign polarities to newly discovered sentiment words in a domain. Experimental results show that our approach is able to extract a large number of new sentiment words. The polarity assignment method is also effective.
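
The propagation idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the paper derives its extraction rules from dependency trees, whereas this toy assumes the (sentiment word, feature) modification pairs have already been extracted, and it uses only the word-to-feature relation. The function name `double_propagation` and the sample data are hypothetical.

```python
def double_propagation(pairs, seed_words):
    """Grow sentiment-word and feature sets from a few seed words.

    pairs: (word, feature) tuples meaning "word modifies feature".
    This is an assumed, pre-extracted input; the paper obtains such
    relations from dependency trees.
    """
    pairs = list(pairs)          # iterated several times below
    sentiment = set(seed_words)
    features = set()
    changed = True
    while changed:               # repeat until nothing new is found
        changed = False
        for word, feature in pairs:
            # Known sentiment word -> the thing it modifies is a feature.
            if word in sentiment and feature not in features:
                features.add(feature)
                changed = True
            # Known feature -> a word modifying it is a sentiment word.
            if feature in features and word not in sentiment:
                sentiment.add(word)
                changed = True
    return sentiment, features


# Hypothetical modification pairs from product reviews.
pairs = [
    ("great", "screen"),
    ("crisp", "screen"),
    ("crisp", "audio"),
    ("tinny", "audio"),
    ("blue", "case"),
]
words, feats = double_propagation(pairs, {"great"})
print(sorted(words), sorted(feats))
```

Starting from the single seed "great", the loop reaches "crisp" and "tinny" through the shared features "screen" and "audio". The pair ("blue", "case") is never reached, because no known sentiment word or feature links to it.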

polarity, relation, sentiment word, (15 more...)

AAAI Conferences

Twenty-First International Joint Conference on Artificial Intelligence

Country:

North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Asia > China (0.04)

Genre: Research Report > New Finding (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)

Creating Relational Data from Unstructured and Ungrammatical Data Sources

Michelson, M., Knoblock, C. A.

Journal of Artificial Intelligence Research, Mar-28-2008

In order for agents to act on behalf of users, they will have to retrieve and integrate vast amounts of textual data on the World Wide Web. However, much of the useful data on the Web is neither grammatical nor formally structured, making querying difficult. Examples of these types of data sources are online classifieds like Craigslist and auction item listings like eBay. We call this unstructured, ungrammatical data "posts." The unstructured nature of posts makes query and integration difficult because the attributes are embedded within the text. Also, these attributes do not conform to standardized values, which prevents queries based on a common attribute value. The schema is unknown and the values may vary dramatically, making accurate search difficult. Creating relational data for easy querying requires that we define a schema for the embedded attributes and extract values from the posts while standardizing these values. Traditional information extraction (IE) is inadequate to perform this task because it relies on clues from the data, such as structure or natural language, neither of which is found in posts. Furthermore, traditional information extraction does not incorporate data cleaning, which is necessary to accurately query and integrate the source. The two-step approach described in this paper creates relational data sets from unstructured and ungrammatical text by addressing both issues. To do this, we require a set of known entities called a "reference set." The first step aligns each post to each member of each reference set. This allows our algorithm to define a schema over the post and include standard values for the attributes defined by this schema. The second step performs information extraction for the attributes, including attributes not easily represented by reference sets, such as a price. In this manner we create a relational structure over previously unstructured data, supporting deep and accurate queries over the data as well as standard values for integration. Our experimental results show that our technique matches the posts to the reference set accurately and efficiently and outperforms state-of-the-art extraction systems on the extraction task from posts.
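
The two steps described in this abstract can be illustrated with a rough sketch. It is not the paper's algorithm: plain token overlap stands in for the alignment step, and a regular expression stands in for extracting an attribute (price) that a reference set cannot represent. The reference set, the post, and the helper names `tokens` and `best_match` are all hypothetical.

```python
import re


def tokens(text):
    """Lowercased whitespace tokens of a string, as a set."""
    return set(text.lower().split())


def best_match(post, reference_set):
    """Return the reference record whose values overlap the post most.

    Score = fraction of the record's tokens that appear in the post.
    This is a toy stand-in for the paper's alignment step.
    """
    post_tokens = tokens(post)

    def score(record):
        record_tokens = tokens(" ".join(record.values()))
        return len(post_tokens & record_tokens) / len(record_tokens)

    return max(reference_set, key=score)


# Hypothetical reference set of known entities (cars).
reference_set = [
    {"make": "Honda", "model": "Civic"},
    {"make": "Honda", "model": "Accord"},
    {"make": "Toyota", "model": "Camry"},
]
post = "02 honda accord low miles $4500 obo"

# Step 1: align the post to a reference record, which supplies the
# schema (make, model) and standardized values for the post.
match = best_match(post, reference_set)

# Step 2: extract an attribute no reference set lists, here the price.
price_match = re.search(r"\$(\d[\d,]*)", post)
price = int(price_match.group(1).replace(",", "")) if price_match else None

print(match, price)
```

Once the post is aligned, a query for `make = "Honda"` finds it even though the post itself spells the make in lowercase.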

algorithm, extraction, phoebus, (15 more...)

Journal of Artificial Intelligence Research

doi: 10.1613/jair.2409

AI Access Foundation

10541

Country:

North America > United States > New York (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
North America > United States > Washington > King County > Seattle (0.04)
(4 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Education > Educational Setting > Online (0.67)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Databases (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
(5 more...)

MITA: An Information-Extraction Approach to the Analysis of Free-Form Text in Life Insurance Applications

Glasgow, Barry, Mandell, Alan, Binney, Dan, Ghemri, Lila, Fisher, David

AI Magazine, Mar-15-1998

MetLife processes over 260,000 life insurance applications a year. MetLife's intelligent text analyzer (MITA) uses the information-extraction technique of natural language processing to structure the extensive textual fields on a life insurance application. MITA is currently processing 20,000 life insurance applications a month. Eighty-nine percent of the textual fields processed by MITA exceed the established confidence-level threshold and are potentially available for further analysis by domain-specific analyzers.

artificial intelligence, banking & finance, life insurance application, (10 more...)

AI Magazine

Industry:

Banking & Finance > Risk Management (1.00)
Banking & Finance > Insurance (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.75)

MITA: An Information-Extraction Approach to the Analysis of Free-Form Text in Life Insurance Applications

Glasgow, Barry, Mandell, Alan, Binney, Dan, Ghemri, Lila, Fisher, David

AI Magazine, Mar-15-1998

MetLife processes over 260,000 life insurance applications a year. Underwriting of these applications is labor intensive. Automation is difficult because the applications include many free-form text fields. MetLife's intelligent text analyzer (MITA) uses the information-extraction technique of natural language processing to structure the extensive textual fields on a life insurance application. Knowledge engineering, with the help of underwriters as domain experts, was performed to elicit significant concepts for both medical and occupational textual fields. A corpus of 20,000 life insurance applications provided the syntactical and semantic patterns in which these underwriting concepts occur. These patterns, in conjunction with the concepts, formed the frameworks for information extraction. An extension of the information-extraction work developed by Wendy Lehnert was used to populate these frameworks with classes obtained from the Systematized Nomenclature of Human and Veterinary Medicine (SNOMED) and the Dictionary of Occupational Titles ontologies. These structured frameworks can then be analyzed by conventional knowledge-based systems. MITA is currently processing 20,000 life insurance applications a month. Eighty-nine percent of the textual fields processed by MITA exceed the established confidence-level threshold and are potentially available for further analysis by domain-specific analyzers.

data mining, mita, natural language, (20 more...)

AI Magazine

Country: North America > United States > California > San Francisco County > San Francisco (0.16)

Industry:

Banking & Finance > Insurance (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.67)

Technology:

Information Technology > Data Science > Data Mining > Text Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (1.00)
(2 more...)

Empirical Methods in Information Extraction

Cardie, Claire

AI Magazine, Dec-15-1997

This article surveys the use of empirical, machine-learning methods for a particular natural language understanding task: information extraction. The author presents a generic architecture for information-extraction systems and then surveys the learning algorithms that have been developed to address the problems of accuracy, portability, and knowledge acquisition for each component of the architecture.

artificial intelligence, information extraction, survey article, (4 more...)

AI Magazine

Genre: Overview (0.87)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)

Empirical Methods in Information Extraction

Cardie, Claire

AI Magazine, Dec-15-1997

data mining, information, natural language, (17 more...)

AI Magazine

Country: North America > United States > California > San Francisco County > San Francisco (0.15)

Genre: Overview (0.66)

Industry: Health & Medicine (0.93)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)

Wrap-Up: a Trainable Discourse Module for Information Extraction

Soderland, S., Lehnert, W.

Journal of Artificial Intelligence Research, Dec-1-1994

The vast amounts of on-line text now available have led to renewed interest in information extraction (IE) systems that analyze unrestricted text, producing a structured representation of selected information from the text. This paper presents a novel approach that uses machine learning to acquire knowledge for some of the higher level IE processing. Wrap-Up is a trainable IE discourse component that makes intersentential inferences and identifies logical relations among information extracted from the text. Previous corpus-based approaches were limited to lower level processing such as part-of-speech tagging, lexical disambiguation, and dictionary construction. Wrap-Up is fully trainable, and not only automatically decides what classifiers are needed, but even derives the feature set for each classifier automatically. Performance equals that of a partially trainable discourse module requiring manual customization for each domain.

company type, neg, packaging type, (14 more...)

Journal of Artificial Intelligence Research

doi: 10.1613/jair.68

AI Access Foundation

10125

Country:

Asia > Indonesia > Java > Jakarta > Jakarta (0.06)
Asia > Japan > Honshū > Chūbu > Aichi Prefecture > Nagoya (0.05)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.71)
Information Technology > Data Science > Data Mining > Text Mining (0.60)

Pattern Matching and Discourse Processing in Information Extraction from Japanese Text

Kitani, T., Eriguchi, Y., Hara, M.

Journal of Artificial Intelligence Research, Aug-1-1994

Information extraction is the task of automatically picking up information of interest from an unconstrained text. Information of interest is usually extracted in two steps. First, sentence level processing locates relevant pieces of information scattered throughout the text; second, discourse processing merges coreferential information to generate the output. In the first step, pieces of information are locally identified without recognizing any relationships among them. A key word search or simple pattern search can achieve this purpose. The second step requires deeper knowledge in order to understand relationships among separately identified pieces of information. Previous information extraction systems focused on the first step, partly because they were not required to link up each piece of information with other pieces. To link the extracted pieces of information and map them onto a structured output format, complex discourse processing is essential. This paper reports on a Japanese information extraction system that merges information using a pattern matcher and discourse processor. Evaluation results show a high level of system performance which approaches human performance.
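
The two-step division described in this abstract can be illustrated with a toy sketch. It is not the reported system: it runs on English rather than Japanese, the pattern is a single hypothetical regular expression, and "discourse processing" is reduced to resolving a pronoun to the most recently mentioned entity. All names and sentences are invented for illustration.

```python
import re

# Step 1 pattern (hypothetical): "<Subject> announced|priced <value>".
MENTION = re.compile(
    r"^(?P<subject>[A-Z]\w*) (?P<relation>announced|priced) (?P<value>.+)$"
)
PRONOUNS = {"It", "They"}


def locate_mentions(sentences):
    """Step 1: find pieces of information sentence by sentence,
    without relating them to each other."""
    mentions = []
    for sentence in sentences:
        match = MENTION.match(sentence)
        if match:
            mentions.append(match.groupdict())
    return mentions


def merge_mentions(mentions):
    """Step 2: merge coreferential mentions into one record per entity.
    A pronoun subject is resolved to the last named entity (a crude
    stand-in for real discourse processing)."""
    records = {}
    last_entity = None
    for mention in mentions:
        subject = mention["subject"]
        if subject in PRONOUNS:
            if last_entity is None:
                continue  # nothing to resolve the pronoun against
            subject = last_entity
        last_entity = subject
        records.setdefault(subject, {})[mention["relation"]] = mention["value"]
    return records


sentences = [
    "Acme announced a new tablet",
    "Weather was mild that day",
    "It priced the tablet at 500 yen",
]
records = merge_mentions(locate_mentions(sentences))
print(records)
```

Step 1 alone yields two unrelated mentions, one with the subject "It". Only the merge in step 2 produces a single structured record for "Acme" holding both the announcement and the price.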

information extraction, pattern matching and discourse processing, tanabe pharmaceutical, (3 more...)

Journal of Artificial Intelligence Research

doi: 10.1613/jair.53

AI Access Foundation

10119

Country: Asia > Japan (0.08)

Industry: Health & Medicine (0.37)

Technology:

Information Technology > Data Science > Data Mining > Text Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.91)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.91)

Automatically constructing a dictionary for information extraction tasks

Riloff, E.

Classics, Feb-1-1993

Knowledge-based natural language processing systems have achieved good success with certain tasks but they are often criticized because they depend on a domain-specific dictionary that requires a great deal of manual knowledge engineering. This knowledge engineering bottleneck makes knowledge-based NLP systems impractical for real-world applications because they cannot be easily scaled up or ported to new domains. In response to this problem, we developed a system called AutoSlog that automatically builds a domain-specific dictionary of concepts for extracting information from text. Using AutoSlog, we constructed a dictionary for the domain of terrorist event descriptions in only 5 person-hours. We then compared the AutoSlog dictionary with a handcrafted dictionary that was built by two highly skilled graduate students and required approximately 1500 person-hours of effort. We evaluated the two dictionaries using two blind test sets of 100 texts each. Overall, the AutoSlog dictionary achieved 98% of the performance of the handcrafted dictionary. On the first test set, the AutoSlog dictionary obtained 96.3% of the performance of the handcrafted dictionary. On the second test set, the overall scores were virtually indistinguishable, with the AutoSlog dictionary achieving 99.7% of the performance of the handcrafted dictionary.
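
The figures quoted in this abstract can be put side by side with a few lines of arithmetic. The numbers come from the abstract; the variable names are illustrative. The abstract does not say how the overall 98% was computed, so averaging the two per-test-set figures is an assumption made here only as a consistency check.

```python
# Effort figures from the abstract (person-hours).
autoslog_hours = 5
handcrafted_hours = 1500
effort_ratio = handcrafted_hours / autoslog_hours

# Relative performance on the two blind test sets (% of handcrafted).
relative_scores = [96.3, 99.7]

# Assumption: a plain mean of the two test sets, used only to compare
# against the reported overall figure of 98%.
mean_relative = sum(relative_scores) / len(relative_scores)

print(f"effort ratio: {effort_ratio:.0f}x")
print(f"mean relative performance: {mean_relative:.1f}%")
```

Under that assumption, the handcrafted dictionary took 300 times the effort for a gap of about two percentage points in relative performance.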

artificial intelligence, expert system, natural language, (17 more...)

Classics

Country:

North America > United States > Massachusetts > Hampshire County > Amherst (0.14)
South America (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
(2 more...)

Industry:

Law Enforcement & Public Safety > Terrorism (0.72)
Government (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.95)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.68)
