IBM Research - India
Content and Context: Two-Pronged Bootstrapped Learning for Regex-Formatted Entity Extraction
Simoes, Stanley (Indian Institute of Technology Madras) | P, Deepak (Queen's University Belfast) | Sairamesh, Munu (Indian Institute of Technology Madras) | Khemani, Deepak (Indian Institute of Technology Madras) | Mehta, Sameep (IBM Research - India)
Regular expressions are an important building block of rule-based information extraction systems. Regexes can encode rules to recognize instances of simple entities, which can then feed into the identification of more complex cross-entity relationships. Manually crafting a regex that recognizes all possible instances of an entity is difficult since an entity can manifest in a variety of different forms. Thus, the problem of automatically generalizing manually crafted seed regexes to improve the recall of IE systems has attracted research attention. In this paper, we propose a bootstrapped approach to improve recall for the extraction of regex-formatted entities, with the only source of supervision being the seed regex. Our approach starts from a manually authored, high-precision seed regex for the entity of interest, and uses the matches of the seed regex and the context around these matches to identify more instances of the entity. These are then used to identify a set of diverse, high-recall regexes that are representative of this entity. Through an empirical evaluation over multiple real-world document corpora, we illustrate the effectiveness of our approach.
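The core bootstrapping idea described in this abstract, matching a seed regex and then using the context around its matches to surface new candidate instances, can be sketched in a few lines. This is a deliberately minimal toy illustration, not the authors' algorithm; the function name, token-level matching, and frequency threshold are all invented for exposition:

```python
import re
from collections import Counter

def bootstrap_matches(corpus, seed_regex, context_window=2, min_support=2):
    """Toy context-based bootstrapping: collect seed-regex matches and
    the word windows preceding them, then flag other tokens that occur
    in the same frequent contexts as candidate entity instances."""
    seed = re.compile(seed_regex)
    contexts = Counter()
    matches = set()
    # Pass 1: record seed matches and the contexts they appear in.
    for doc in corpus:
        tokens = doc.split()
        for i, tok in enumerate(tokens):
            if seed.fullmatch(tok):
                matches.add(tok)
                contexts[tuple(tokens[max(0, i - context_window):i])] += 1
    frequent = {c for c, n in contexts.items() if n >= min_support}
    # Pass 2: tokens in a frequent context that the seed misses are candidates.
    candidates = set()
    for doc in corpus:
        tokens = doc.split()
        for i, tok in enumerate(tokens):
            left = tuple(tokens[max(0, i - context_window):i])
            if left in frequent and not seed.fullmatch(tok):
                candidates.add(tok)
    return matches, candidates
```

Here a seed like `[A-Z]{2}-\d{3}` would match `AB-123` and `CD-456` after "order id", and the shared context would then flag a differently formatted token such as `XY999` as a candidate, even though the seed regex misses it.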
Automatically Augmenting Titles of Research Papers for Better Discovery
Pallan, Madhavan (IBM Research - India) | Srivastava, Biplav (IBM Research - India)
It is well known that the title of an article impacts how well it is discovered and read by potential readers. With both people and search engines (acting on behalf of people) accessing papers from digital libraries, it is important that paper titles promote discovery. In this paper, we investigate the characteristics of titles of AI papers and then propose automatic ways to augment them so that they can be better indexed and discovered by users. A user study with researchers shows that they overwhelmingly prefer the augmented titles over the originals for being more helpful.
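One naive form of title augmentation is appending salient terms from the paper's abstract that the title omits. The sketch below is purely illustrative of that idea, assuming a simple frequency heuristic; the paper's actual augmentation method is not specified here, and the function name and stopword list are invented:

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a proper one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "we",
             "is", "that", "on", "with", "this", "by", "it"}

def augment_title(title, abstract, k=3):
    """Append the k most frequent non-stopword abstract terms missing
    from the title, to aid keyword-based indexing and discovery."""
    title_words = set(re.findall(r"[a-z]+", title.lower()))
    words = [w for w in re.findall(r"[a-z]+", abstract.lower())
             if w not in STOPWORDS and w not in title_words and len(w) > 3]
    top = [w for w, _ in Counter(words).most_common(k)]
    return title if not top else f"{title} ({', '.join(top)})"
```

For a vague title like "A New Approach", the augmented version would surface topical terms from the abstract in parentheses, giving search engines concrete keywords to index.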
The D-SCRIBE Process for Building a Scalable Ontology
Schloss, Robert (IBM T. J. Watson Research Center) | Uceda-Sosa, Rosario (IBM T. J. Watson Research Center) | Srivastava, Biplav (IBM Research - India)
In this paper, we describe the D-SCRIBE process used to build ontologies that are expected to see significant domain expansion after their initial introduction and whose coverage of concepts needs to be validated across a series of related applications. This process has been used to build SCRIBE, a very modular, ambitious ontology for information about events triggered by humans or nature, response activities by agencies that provide public services in cities using resources and assets (land parcels, buildings, vehicles, equipment), and their communication (requests, work orders, sensor reports). SCRIBE reuses concepts from previously existing ontologies and data exchange standards, and D-SCRIBE retains traceability to these source influences.
Towards Timely Public Health Decisions to Tackle Seasonal Diseases With Open Government Data
Srivastava, Vandana (Freelance Analyst) | Srivastava, Biplav (IBM Research - India)
Improving public health is a major responsibility of any government, and is of major interest to citizens and scientific communities around the world. Here, one sees two extremes. On one hand, tremendous progress has been made in recent years in understanding the causes, spread and remedies of common and regularly occurring diseases like Dengue, Malaria and Japanese Encephalitis (JE). On the other hand, public agencies treat these diseases in an ad hoc manner, without learning from the experience of previous years. Specifically, they get alerted only once reported cases have already risen substantially in the known disease season, reactively initiate a few actions, and then document the disease impact (cases, deaths) for that period, only to forget these lessons by the next season. They thus miss the opportunity, which scientific progress could have enabled, to reduce preventable deaths and sickness and their corresponding economic impact. The gap is universal but especially prominent in developing countries like India. In this paper, we show that if public agencies provide historical disease impact information openly, it can be analyzed with statistical and machine learning techniques, correlated with emerging best practices in disease control, and simulated to optimize social benefit, providing timely guidance for new disease seasons and regions. We illustrate using open data for mosquito-borne communicable diseases and published public health results on the efficacy of Dengue control methods, applying them to a simulated typical city to maximize benefits with available resources. The exercise further lets us suggest strategies for new regions anywhere in the world, how data could be better recorded by city agencies, and which prevention methods the medical community should focus on for wider impact.
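The abstract's premise, that openly published historical case counts enable proactive rather than reactive alerting, can be illustrated with a minimal seasonal-baseline sketch. This is a toy assumption-laden example (per-month mean plus one standard deviation as an alert threshold), not the paper's statistical or machine learning pipeline:

```python
from statistics import mean, stdev

def monthly_alert_thresholds(history, k=1.0):
    """history: list of years, each a list of 12 monthly case counts.
    Returns a per-month alert threshold of mean + k * stdev across years,
    so agencies can be warned early in a season rather than after its peak."""
    thresholds = []
    for month in range(12):
        counts = [year[month] for year in history]
        sd = stdev(counts) if len(counts) > 1 else 0.0
        thresholds.append(mean(counts) + k * sd)
    return thresholds

def months_on_alert(current_year, thresholds):
    """Indices (0 = January) where this year's counts exceed the baseline."""
    return [m for m, (c, t) in enumerate(zip(current_year, thresholds)) if c > t]
```

With two years of Dengue-like seasonal data peaking in the monsoon months, an unusually high early-season count would trip the alert in June, well before the August peak when reactive responses typically begin.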
Integrated Operations (Re-)Scheduling from Mine to Ship
Sampath, Kameshwaran (IBM Research - India) | Tezabwala, Alfiya (IBM Research - India) | Chabrier, Alain (IBM Software Group) | Payne, Julian (IBM Software Group) | Tiozzo, Fabio (IBM Software Group)
Mining companies have complex supply chains that start from the mining location and stretch thousands of kilometers to the end customer in a different country and continent. The logistics of moving materials from mine to ship is composed of a series of optimization problems, such as berth allocation, ship scheduling, stockyard scheduling, and rail scheduling, which are individually NP-hard. In this paper, we present a scheduling application, called IBM Optimization: Mine to Ship, for end-to-end integrated operations scheduling. The application is built on IBM ILOG ODM Enterprise with advanced features like rescheduling under deviations and disturbances, and maintenance scheduling. The modeling and computational complexity of integrated scheduling optimization is tamed using a hybrid optimization technique that leverages mathematical programming and constraint programming. The application benefits mining companies with increased resource utilization, higher throughput, reduced cost of operations, and higher revenue.
Towards Analyzing Micro-Blogs for Detection and Classification of Real-Time Intentions
Banerjee, Nilanjan (IBM Research - India) | Chakraborty, Dipanjan (IBM Research - India) | Joshi, Anupam (IBM Research - India) | Mittal, Sumit (IBM Research - India, New Delhi) | Rai, Angshu (IBM Research - India) | Ravindran, Balaraman (Indian Institute of Technology, Madras)
Micro-blog forums, such as Twitter, constitute a powerful medium today that people use to express their thoughts and intentions on a daily, and in many cases hourly, basis. Extracting the ‘Real-Time Intention’ (RTI) of a user from such short text updates presents a huge opportunity for web personalization and social networking around dynamic user context. In this paper, we explore the novel problem of detecting and classifying RTIs from micro-blogs. We find that employing a heuristic-based ensemble approach on a reduced dimension of the feature space, based on a wide spectrum of linguistic and statistical features of RTI expressions, achieves significant improvement in detecting RTIs compared to the word-level features used in many social media classification tasks today. Our solution approach takes into account various salient characteristics of micro-blogs relevant to such classification: high dimensionality, sparseness of data, limited context, grammatical incorrectness, etc.
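To make the notion of a heuristic ensemble concrete, the toy below combines a few hand-written intention cues by majority vote. The specific patterns and the voting threshold are invented for illustration; the paper's actual ensemble uses a far richer set of linguistic and statistical features:

```python
import re

# Toy cues for "real-time intention" in a short post: explicit desire
# verbs, requests for recommendations, and question form.
HEURISTICS = [
    lambda t: bool(re.search(r"\b(want|need|looking for)\b", t)),
    lambda t: bool(re.search(r"\b(anyone|recommend|suggest)\b", t)),
    lambda t: t.strip().endswith("?"),
]

def has_rti(text, min_votes=2):
    """Flag a post as expressing a real-time intention when at least
    min_votes of the heuristics fire (a simple ensemble vote)."""
    text = text.lower()
    return sum(h(text) for h in HEURISTICS) >= min_votes
```

Ensemble voting over several weak cues is one simple way to cope with the sparseness and grammatical noisiness of micro-blog text that the abstract highlights, since no single surface pattern is reliable on its own.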
Design and Analysis of Value Creation Networks
Kameshwaran, Sampath (IBM Research - India) | Mehta, Sameep (IBM Research-India) | Pandit, Vinayaka (IBM Research - India)
There are many diverse domains, like academic collaboration, the service industry, and movies, where a group of agents is involved in a set of activities through interactions or collaborations to create value. The end result of the value creation process is two-pronged: firstly, there is a cumulative value created due to the interactions, and secondly, a network that captures the pattern of historical interactions between the agents. In this paper we summarize our efforts towards the design and analysis of value creation networks: 1) representing interactions and value creation as a network, 2) identifying the contribution of a node based on the values created from various activities, and 3) ranking nodes based on the structural properties of interactions and the resulting values. To highlight the efficacy of our proposed algorithms, we present results on IMDB and services industry data.