Government
Accelerating the Discovery of Data Quality Rules: A Case Study
Yeh, Peter Z. (Accenture) | Puri, Colin A. (Accenture) | Wagman, Mark (Accenture) | Easo, Ajay K (Accenture)
Poor-quality data is a growing and costly problem that affects many enterprises across all aspects of their business, ranging from operational efficiency to revenue protection. In this paper, we present an application, the Data Quality Rules Accelerator (DQRA), that accelerates Data Quality (DQ) efforts (e.g., data profiling and cleansing) by automatically discovering DQ rules for detecting inconsistencies in data. We then present two evaluations. The first evaluation compares DQRA to existing solutions and shows that DQRA either outperformed these solutions or achieved comparable performance on metrics such as precision, recall, and runtime. The second evaluation is a case study in which DQRA was piloted at a large utilities company to improve data quality as part of a legacy migration effort. DQRA was able to discover rules that detected data inconsistencies directly impacting revenue and operational efficiency. Moreover, DQRA significantly reduced the effort required to develop these rules compared to the state of the practice. Finally, we describe ongoing efforts to deploy DQRA.
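The abstract does not describe DQRA's rule format or discovery algorithm, so the following is only a minimal sketch of the general idea of mining a rule from data and then using it to flag inconsistencies. The column names (`rate_code`, `meter_type`), the thresholds, and both helper functions are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def discover_rules(rows, lhs, rhs, min_support=2, min_confidence=0.75):
    """Return {lhs_value: rhs_value} for lhs values that mostly imply one rhs value.

    Hypothetical stand-in for rule discovery: a rule is kept when the most
    common rhs value covers at least `min_confidence` of the rows that share
    the same lhs value, and at least `min_support` such rows exist.
    """
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[lhs]][row[rhs]] += 1

    rules = {}
    for lhs_value, rhs_counts in counts.items():
        total = sum(rhs_counts.values())
        rhs_value, hits = rhs_counts.most_common(1)[0]
        if total >= min_support and hits / total >= min_confidence:
            rules[lhs_value] = rhs_value
    return rules


def find_violations(rows, lhs, rhs, rules):
    """Return indices of rows whose rhs value contradicts a discovered rule."""
    return [
        i
        for i, row in enumerate(rows)
        if row[lhs] in rules and row[rhs] != rules[row[lhs]]
    ]


# Made-up records for illustration only.
rows = (
    [{"rate_code": "RES", "meter_type": "single_phase"}] * 4
    + [{"rate_code": "RES", "meter_type": "three_phase"}]  # inconsistent row
    + [{"rate_code": "COM", "meter_type": "three_phase"}] * 2
)

rules = discover_rules(rows, "rate_code", "meter_type")
violations = find_violations(rows, "rate_code", "meter_type", rules)
print(rules)
print(violations)
```

In this toy setting the rule "RES implies single_phase" is learned from the majority of rows, and the one row that contradicts it is reported for review.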
Abductive Inference for Combat: Using SCARE-S2 to Find High-Value Targets in Afghanistan
Shakarian, Paulo (U.S. Army) | Nagel, Mago (University of Maryland) | Schuetzle, Brittany (University of Maryland) | Subrahmanian, V.S. (University of Maryland)
Recently, geospatial abduction was introduced by the authors in [Shakarian et al. 2010] as a way to infer unobserved geographic phenomena from a set of known observations and constraints between the two. In this paper, we introduce the SCARE-S2 software tool, which applies geospatial abduction to the environment of Afghanistan. Unlike previous work, where we looked for small weapon caches supporting local attacks, here we look for insurgent high-value targets (HVTs) supporting insurgent operations in two provinces. These HVTs include the locations of insurgent leaders and major supply depots. Applying this method of inference to Afghanistan introduces several practical issues not addressed in previous work. Namely, we are conducting inference in a much larger area (24,940 sq km as compared to 675 sq km in previous work), on more varied terrain, and must consider the influence of many local tribes. We address all of these problems and evaluate our software on 6 months of real-world counter-insurgency data. We show that we are able to abduce regions of relatively small area (on average, under 100 sq km, each containing, on average, 4.8 villages) that are denser in HVTs (35 times denser than the overall area considered).
Monitoring Entities in an Uncertain World: Entity Resolution and Referential Integrity
Minton, Steven N. (InferLink Corporation) | Macskassy, Sofus A. (Fetch Technologies) | LaMonica, Peter (Air Force Research Laboratory) | See, Kane (Fetch Technologies) | Knoblock, Craig A. (University of Southern California) | Barish, Greg (Fetch Technologies) | Michelson, Matthew (Fetch Technologies) | Liuzzi, Raymond (Raymond Technologies)
This paper describes a system to help intelligence analysts track and analyze information being published in multiple sources, particularly open sources on the Web. The system integrates technology for Web harvesting, natural language extraction, and network analytics, and allows analysts to view and explore the results via a Web application. One of the difficult problems we address is the entity resolution problem, which occurs when there are multiple, differing ways to refer to the same entity. The problem is particularly complex when noisy data is being aggregated over time, there is no clean master list of entities, and the entities under investigation are intentionally being deceptive. Our system must not only perform entity resolution with noisy data, but must also gracefully recover when entity resolution mistakes are subsequently corrected. We present a case study in arms trafficking that illustrates the issues, and describe how they are addressed.
Emerging Applications for Intelligent Diabetes Management
Marling, Cindy (Ohio University) | Wiley, Matthew (Ohio University) | Bunescu, Razvan (Ohio University) | Shubrook, Jay (Ohio University) | Schwartz, Frank (Ohio University)
Diabetes management is a difficult task for patients, who must monitor and control their blood glucose levels in order to avoid serious diabetic complications. It is a difficult task for physicians, who must manually interpret large volumes of blood glucose data to tailor therapy to the needs of each patient. This paper describes three emerging applications that employ AI to ease this task and shares difficulties encountered in transitioning AI technology from university researchers to patients and physicians.
Hybrid Qualitative Simulation of Military Operations
Hinrichs, Thomas (Northwestern University) | Forbus, Kenneth (Northwestern University) | Kleer, Johan de (PARC) | Yoon, Sungwook (PARC) | Jones, Eric (BAE Systems AIT) | Hyland, Robert (BAE Systems AIT) | Wilson, Jason (BAE Systems AIT)
Our goal is to enable military planners to rapidly critique alternative battle plans by simulating multiple outcomes of adversarial plans. We describe a novel simulator, SimPath, that combines qualitative reasoning, a geographic information system (GIS), and targeted probabilistic calculations to envision how adversarial battle plans can play out. We outline the problem and describe the overall operation of the simulator. We then explain how qualitative process theory is extended with actions to model military tasks, how envisioning is factored to reduce combinatorial explosion, and how probabilities are computed for transitions and used to filter possibilities. Empirical results, including an experiment conducted by an independent evaluator, are summarized. The results show that it is possible to identify dozens of possible outcomes on each of 9 combinations of adversarial plans (COAs) in under two minutes. We close with a discussion of future work.
The Stock Sonar — Sentiment Analysis of Stocks Based on a Hybrid Approach
Feldman, Ronen (The Hebrew University of Jerusalem) | Rosenfeld, Benjamin (Digital Trowel) | Bar-Haim, Roy (Digital Trowel) | Fresko, Moshe (Digital Trowel)
The Stock Sonar (TSS) is a stock sentiment analysis application based on a novel hybrid approach. While previous work focused on document-level sentiment classification, or extracted only generic sentiment at the phrase level, TSS integrates sentiment dictionaries, phrase-level compositional patterns, and predicate-level semantic events. TSS generates precise in-text sentiment tagging as well as sentiment-oriented event summaries for a given stock, which are also aggregated into sentiment scores. Hence, TSS allows investors to get the essence of thousands of articles every day and may help them make timely, informed trading decisions. The extracted sentiment is also shown to improve the accuracy of an existing document-level sentiment classifier.
A Machine Learning Based System for Semi-Automatically Redacting Documents
Cumby, Chad (Accenture Technology Labs) | Ghani, Rayid (Accenture Technology Labs)
Redacting text documents has traditionally been a mostly manual activity, making it expensive and prone to disclosure risks. This paper describes a semi-automated system to ensure a specified level of privacy in text data sets. Recent work has attempted to quantify the likelihood of privacy breaches for text data. We build on these notions to provide a means of obstructing such breaches by framing it as a multi-class classification problem. Our system gives users fine-grained control over the level of privacy needed to obstruct sensitive concepts present in that data. Additionally, our system is designed to respect a user-defined utility metric on the data (such as disclosure of a particular concept), which our methods try to maximize while anonymizing. We describe our redaction framework and algorithms, as well as a prototype tool built into Microsoft Word that allows enterprise users to redact documents before sharing them internally and to obscure client-specific information. In addition, we present an experimental evaluation on publicly available data sets that shows the effectiveness of our approach against both automated attackers and human subjects. The results show that we are able to preserve the utility of a text corpus while reducing disclosure risk of the sensitive concept.
Learning by Demonstration Technology for Military Planning and Decision Making: A Deployment Story
Myers, Karen (SRI International) | Kolojejchick, Jake (General Dynamics C4 Systems) | Angiolillo, Carl (General Dynamics C4 Systems) | Cummings, Tim (General Dynamics C4 Systems) | Garvey, Tom (SRI International) | Gervasio, Melinda (SRI International) | Haines, Will (SRI International) | Jones, Chris (SRI International) | Knittel, Janette (General Dynamics C4 Systems) | Morley, David (SRI International) | Ommert, William (General Dynamics C4 Systems) | Potter, Scott (General Dynamics C4 Systems)
Learning by demonstration technology has long held the promise to empower non-programmers to customize and extend software. We describe the deployment of a learning by demonstration capability to support user creation of automated procedures in a collaborative planning environment that is used widely by the U.S. Army. This technology, which has been in operational use since the summer of 2010, has helped to reduce user workloads by automating repetitive and time-consuming tasks. The technology has also provided the unexpected benefit of enabling standardization of products and processes.
Towards Evolutionary Nonnegative Matrix Factorization
Wang, Fei (IBM Research) | Tong, Hanghang (IBM Research) | Lin, Ching-Yung (IBM Research)
Nonnegative Matrix Factorization (NMF) techniques have aroused considerable interest in the field of artificial intelligence in recent years because of their good interpretability and computational efficiency. However, in many real-world applications, the data features usually evolve smoothly over time. In this case, it would be very expensive in both computation and storage to rerun the whole NMF procedure each time the data features change. In this paper, we propose Evolutionary Nonnegative Matrix Factorization (eNMF), which aims to incrementally update the factorized matrices in a computation- and space-efficient manner as the data matrix varies. We devise such an evolutionary procedure for both asymmetric and symmetric NMF. Finally, we conduct experiments on several real-world data sets to demonstrate the efficacy and efficiency of eNMF.
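To make the motivation concrete, here is a minimal sketch in NumPy. It is not the eNMF algorithm, whose update rules the abstract does not give. It uses the standard Lee and Seung multiplicative updates for asymmetric NMF and simply warm-starts from the previous factors when the data matrix changes slightly, as a naive stand-in for incremental updating. The matrix sizes, rank, perturbation, and iteration counts are arbitrary assumptions.

```python
import numpy as np


def nmf(X, rank, W=None, H=None, iters=200, eps=1e-9, seed=0):
    """Approximate X (nonnegative) as W @ H using multiplicative updates.

    If W and H are supplied they are used as the starting point (warm start);
    otherwise the factors are initialized at random.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, rank)) if W is None else W.copy()
    H = rng.random((rank, m)) if H is None else H.copy()
    for _ in range(iters):
        # Standard multiplicative updates; eps guards against division by zero.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def rel_error(X, W, H):
    """Relative Frobenius reconstruction error."""
    return np.linalg.norm(X - W @ H) / np.linalg.norm(X)


rng = np.random.default_rng(42)
X = rng.random((30, 3)) @ rng.random((3, 20))        # synthetic rank-3 data
X_new = X * (1.0 + 0.01 * rng.random(X.shape))       # data after a small, smooth change

W, H = nmf(X, rank=3)                                # full factorization of the old data
W_warm, H_warm = nmf(X_new, rank=3, W=W, H=H, iters=20)  # short refinement on the new data

print(rel_error(X_new, W, H), rel_error(X_new, W_warm, H_warm))
```

The warm start reuses work already done on the old matrix, so only a short refinement is run on the new one. A dedicated incremental method such as the one the paper proposes would aim to avoid even this repeated pass over the full data.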
Efficient Methods for Lifted Inference with Aggregate Factors
Choi, Jaesik (University of Illinois at Urbana-Champaign) | Braz, Rodrigo de Salvo (SRI International) | Bui, Hung H. (SRI International)
Aggregate factors (that is, those based on aggregate functions such as SUM, AVERAGE, AND, etc.) in probabilistic relational models can compactly represent dependencies among a large number of relational random variables. However, propositional inference on a factor aggregating n k-valued random variables into an r-valued result random variable is O(r k 2^n). Lifted methods can ameliorate this to O(r n^k) in general and O(r k log n) for commutative associative aggregators. In this paper, we propose (a) an exact solution constant in n when k = 2 for certain aggregate operations such as AND, OR and SUM, and (b) a close approximation for inference with aggregate factors with time complexity constant in n. This approximate inference involves an analytical solution for some operations when k > 2. The approximation is based on the fact that the typically used aggregate functions can be represented by linear constraints in the standard (k-1)-simplex in R^k, where k is the number of possible values for random variables. This includes even aggregate functions that are commutative but not associative (e.g., the MODE operator that chooses the most frequent value). Our algorithm takes polynomial time in k (which is only 2 for binary variables) regardless of r and n, and the error decreases as n increases. Therefore, for most applications (in which a close approximation suffices) our algorithm is a much more efficient solution than existing algorithms. We present experimental results supporting these claims. We also present (c) a third contribution, which further optimizes aggregations over multiple groups of random variables with distinct distributions.
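The gap between propositional and lifted computation can be illustrated with a small sketch for the binary case (k = 2). This is not the paper's algorithm. It assumes n independent, identically distributed binary variables with P(X = 1) = p, and compares brute-force enumeration of all 2^n assignments against closed forms that exploit interchangeability: a binomial distribution for SUM and 1 - (1 - p)^n for OR. The function names and the values of n and p are illustrative assumptions.

```python
from itertools import product
from math import comb


def sum_dist_enumerate(n, p):
    """Distribution of SUM over n i.i.d. binary variables by enumerating 2^n assignments."""
    dist = [0.0] * (n + 1)
    for assignment in product((0, 1), repeat=n):
        ones = sum(assignment)
        dist[ones] += p**ones * (1 - p) ** (n - ones)
    return dist


def sum_dist_counting(n, p):
    """Same distribution from counts alone: only the number of ones matters."""
    return [comb(n, s) * p**s * (1 - p) ** (n - s) for s in range(n + 1)]


def or_prob(n, p):
    """P(OR = 1): one minus the probability that every variable is 0."""
    return 1 - (1 - p) ** n


n, p = 10, 0.3
brute = sum_dist_enumerate(n, p)
lifted = sum_dist_counting(n, p)
max_gap = max(abs(a - b) for a, b in zip(brute, lifted))
print(max_gap, or_prob(n, p))
```

Enumeration already visits 1,024 assignments at n = 10 and doubles with each additional variable, while the counting form needs only n + 1 terms and the OR probability is a single expression.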