Homeland security researchers and analysts must, more than ever, process large volumes of textual information. Information extraction techniques have been proposed to help alleviate the burden of information overload. These techniques, however, require retraining and/or knowledge re-engineering when document types vary, as they do in the homeland security domain. Also, while effectively reducing the volume of information, information extraction techniques do not point researchers to unanticipated interesting relationships identified within the text. We present the Arizona TerrorNet, a system that utilizes loosely specified information extraction rules to extract less choreographed relationships between known terrorists. Extracted relations are combined into a network and displayed with a network visualizer. We processed 200 unseen documents with TerrorNet, which extracted over 500 relationships between known terrorists. An Al Qaeda network expert made a preliminary inspection of the network and confirmed many of its links.
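The pipeline described above, combining extracted relations into a network, can be sketched with standard data structures. This is a minimal illustration, not TerrorNet's actual implementation: the triple format, names, and relation labels below are hypothetical.

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) triples, as an information
# extraction pipeline might emit them; the names are illustrative only.
triples = [
    ("Person A", "met_with", "Person B"),
    ("Person B", "funded", "Person C"),
    ("Person A", "met_with", "Person C"),
]

def build_network(triples):
    """Aggregate extracted triples into an adjacency map with labeled edges,
    suitable for handing off to a network visualizer."""
    network = defaultdict(set)
    for subj, rel, obj in triples:
        network[subj].add((rel, obj))
    return dict(network)

net = build_network(triples)
print(net["Person A"])  # labeled edges out of "Person A"
```

Representing the network as an adjacency map keyed by entity makes it cheap to merge relations extracted from many documents: each new triple is a single set insertion, and duplicate extractions collapse automatically.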
The application of statistical approaches to problems in natural language processing generally requires large (1,000,000-word) corpora to produce useful results. In this paper we show that a well-known statistical technique, the t test, can be applied to smaller corpora than was previously thought possible, by relying on semantic features rather than lexical items in a corpus of limited domain. We apply the t test to the problem of resolving relative pronoun antecedents, using collocation frequency data collected from the 500,000-word MUC-4 corpus. We conduct two experiments in which t is calculated with lexical items and with semantic feature representations. We show that the test cases that are relevant to the MUC-4 domain produce more significant values of t than those that are irrelevant. We also show that the t test correctly resolves the relative pronoun in 91.07% of the relevant test cases where the value of t is significant.

Introduction

The use of statistical techniques in natural language processing generally requires large corpora to produce useful results. We believe, however, that statistical techniques can be successfully applied to much smaller corpora if the texts are drawn from a limited domain. The limited nature of the corpus may compensate for its size because the texts share common properties.
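To make the statistic concrete, the standard t score for a collocation compares the observed frequency of a word pair with the frequency expected if the two words occurred independently. The sketch below uses made-up counts, not data from the MUC-4 corpus, and illustrates the general collocation t test rather than this paper's specific antecedent-resolution setup.

```python
import math

def t_score(c1, c2, c12, n):
    """t statistic for a bigram collocation: observed bigram probability
    versus the probability expected under independence, with the sample
    variance approximated by the observed probability (valid for rare events)."""
    x_bar = c12 / n            # observed bigram probability
    mu = (c1 / n) * (c2 / n)   # expected probability under independence
    return (x_bar - mu) / math.sqrt(x_bar / n)

# Illustrative counts: a bigram seen 8 times in a 500,000-word corpus,
# whose component words occur 40 and 200 times respectively.
t = t_score(40, 200, 8, 500_000)
print(round(t, 2))  # → 2.82
```

For large samples, a value above roughly 2.576 lets us reject the independence hypothesis at the 0.005 level, so the illustrative pair above would count as a significant collocation.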
Information extraction (IE) systems have been tailored to extract fixed target information from documents in a fixed language. To be truly useful for information analysts, the target information must be user-definable and the source documents should cover multiple languages. We will map out the path toward such open-target, multilingual IE systems, identifying the necessary technological breakthroughs along the path. We also discuss a Japanese-English named entity extraction system under development, which represents a case of the next step along the path.

Introduction: Toward Multilingual Information Extraction Systems

The natural language processing field has witnessed rapid development of information extraction (IE) technology since the early 1990s, driven by the series of Message Understanding Conferences (MUCs) and the government-sponsored TIPSTER program.1 This technology enables rapid, robust, and automatic extraction of certain predefined target information from real-world online texts or speech transcripts accessible through computer networks. Information analysts, whose task is to keep track of changing states of affairs on particular topics such as microelectronic products and international terrorist activities, can use IE technology to accomplish their tasks more efficiently and effectively.
While this reality has become more tangible in recent years through consumer technology, such as Amazon's Alexa or Apple's Siri, the applications of AI software are already widespread, ranging from credit card fraud detection at VISA to payload scheduling operations at NASA to insider trading surveillance on the NASDAQ. Broadly defined as the imitation of human cognition by a machine, recent interest in AI has been driven by advances in machine learning, in which computer algorithms learn from data without human direction.1 Most sophisticated processes that involve some form of prediction generated from a large data set use this type of AI, including image recognition, web search, speech-to-text language processing, and e-commerce product recommendations.2 AI is increasingly incorporated into devices that consumers keep with them at all times, such as smartphones, and powers consumer technologies on the horizon, such as self-driving cars. And there is anticipation that these advances will continue to accelerate: a recent survey of leading AI researchers predicted that, within the next 10 years, AI will outperform humans in transcribing speech, translating languages, and driving a truck.3
In the summer of 2013, Brazil experienced a period of conflict triggered by a series of protests. While the popular press covered the events, little empirical work has investigated how first-hand reporting of the protests occurred and evolved over social media, and how such exposure in turn impacted the demonstrations themselves. In this study we examine over 42 million tweets shared during the three months of conflict in order to uncover patterns in online and offline protest-related activity, as well as to explore relationships between language use in tweets and the emotions and underlying motivations of protesters. Our findings show that peaks in Twitter activity coincide with days on which heavy protesting took place, that the words in tweets reflect emotional characteristics of protest-related events, and, less expectedly, that these emotions convey both positive and negative sentiment.