Building Watson: An Overview of the DeepQA Project

AI Magazine

IBM Research undertook a challenge to build a computer system that could compete at the human champion level, in real time, on the American TV quiz show Jeopardy! The extent of the challenge includes fielding a real-time automatic contestant on the show, not merely a laboratory exercise. The Jeopardy! Challenge helped us address requirements that led to the design of the DeepQA architecture and the implementation of Watson. After three years of intense research and development by a core team of about 20 researchers, Watson is performing at human expert levels in terms of precision, confidence, and speed at the Jeopardy! quiz show. Our results strongly suggest that DeepQA is an effective and extensible architecture that may be used as a foundation for combining, deploying, evaluating, and advancing a wide range of algorithmic techniques to rapidly advance the field of QA.


The AI Behind Watson -- The Technical Article

The Jeopardy! Challenge helped us address requirements that led to the design of the DeepQA architecture and the implementation of Watson. After three years of intense research and development by a core team of about 20 researchers, Watson is performing at human expert levels in terms of precision, confidence, and speed at the Jeopardy! quiz show. Our results strongly suggest that DeepQA is an effective and extensible architecture that may be used as a foundation for combining, deploying, evaluating, and advancing a wide range of algorithmic techniques to rapidly advance the field of QA. The architecture and methodology developed as part of this project have highlighted the need to take a systems-level approach to research in QA, and we believe this applies to research in the broader field of AI. We have developed many different algorithms for addressing different kinds of problems in QA and plan to publish many of them in more detail in the future.
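The abstract names the DeepQA architecture without describing it; the published papers characterize it as a pipeline of question analysis, candidate (hypothesis) generation, independent evidence scoring, and merging of scores into a single confidence. The Python sketch below is a minimal illustration of that data flow only: every name in it is hypothetical, and the scoring and merging models are abstracted as callables rather than the system's actual components.

    # Minimal sketch of a DeepQA-style answering pipeline. The stage
    # ordering follows the published descriptions of the architecture;
    # all class and function names here are hypothetical.
    from dataclasses import dataclass, field

    @dataclass
    class Candidate:
        answer: str
        scores: dict = field(default_factory=dict)  # one entry per evidence scorer
        confidence: float = 0.0

    def answer_question(question, analyzers, generators, scorers, merge):
        """analyzers: callables extracting question features (focus, answer type, ...).
        generators: callables proposing candidate answers, e.g. from search.
        scorers: name -> callable scoring one kind of evidence for a candidate.
        merge: learned model combining a candidate's scores into a confidence."""
        analysis = {}
        for analyze in analyzers:
            analysis.update(analyze(question))
        candidates = [c for gen in generators for c in gen(question, analysis)]
        for cand in candidates:
            for name, score_fn in scorers.items():
                cand.scores[name] = score_fn(question, analysis, cand)
            cand.confidence = merge(cand.scores)
        # A real-time contestant must also decide whether the top confidence
        # is high enough to buzz in; here we simply return the best candidate.
        return max(candidates, key=lambda c: c.confidence, default=None)

Passing the analyzers, generators, and scorers in as parameters mirrors the extensibility claim above: new algorithmic techniques slot in as additional callables without changing the pipeline itself.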


Leveraging Wikipedia Characteristics for Search and Candidate Generation in Question Answering

AAAI Conferences

Most existing Question Answering (QA) systems adopt a type-and-generate approach to candidate generation that relies on a pre-defined domain ontology. This paper describes a type-independent search and candidate generation paradigm for QA that leverages Wikipedia characteristics. This approach is particularly useful for adapting QA systems to domains where reliable answer type identification and type-based answer extraction are not available. We present a three-pronged search approach motivated by the relations an answer-justifying, title-oriented document may have with the question/answer pair. We further show how Wikipedia metadata such as anchor texts and redirects can be utilized to effectively extract candidate answers from search results without a type ontology. Our experimental results show that our strategies obtained high binary recall in both search and candidate generation on TREC questions, a domain that has mature answer type extraction technology, as well as on Jeopardy! questions, a domain without such technology. Our high-recall search and candidate generation approach has also led to high overall QA performance in Watson, our end-to-end system.
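To make the candidate-generation idea concrete, here is a rough Python sketch, under our own assumptions, of how article titles, redirect titles, and anchor texts from ranked search hits could be pooled into candidates without any type ontology. The inputs (search_hits, redirects, anchor_texts) are hypothetical stand-ins for a preprocessed Wikipedia dump, not the paper's actual implementation.

    # Type-independent candidate generation from Wikipedia search results:
    # treat each hit's title, its redirect titles, and the anchor strings
    # used to link to it as candidate answers, weighted by search rank.
    from collections import Counter

    def generate_candidates(search_hits, redirects, anchor_texts, max_candidates=20):
        """search_hits: ranked list of Wikipedia article titles for the query.
        redirects: dict mapping article title -> list of redirect titles.
        anchor_texts: dict mapping article title -> Counter of anchor strings."""
        counts = Counter()
        for rank, title in enumerate(search_hits):
            weight = 1.0 / (rank + 1)           # trust higher-ranked hits more
            counts[title] += weight             # the title itself is a candidate
            for alias in redirects.get(title, []):
                counts[alias] += weight         # redirects add surface variants
            anchors = anchor_texts.get(title, {})
            total = sum(anchors.values()) or 1
            for anchor, freq in anchors.items():
                counts[anchor] += weight * freq / total  # favor common anchors
        return [cand for cand, _ in counts.most_common(max_candidates)]

Weighting aliases by how often they actually occur as link anchors is one plausible way to favor common surface forms over obscure redirects; it is a design choice of this sketch, not something the abstract specifies.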


Using Syntactic Features in Answer Reranking

AAAI Conferences

This paper describes a baseline question answering system for Swedish on which we measured the contribution of syntactic features. The system includes modules that carry out question analysis, hypothesis generation, and reranking of answers. It was trained and evaluated on questions from a data set inspired by the Swedish television quiz show Kvitt eller Dubbelt -- Tiotusenkronorsfrågan. We used an HTML dump of the Swedish version of Wikipedia as the knowledge source, and we show in this paper that paragraph retrieval from this corpus gives acceptable coverage of answers when targeting Kvitt eller Dubbelt questions, especially single-word-answer questions. Given a question, the hypothesis generation module retrieves a list of paragraphs, ranks them using a vector space model score, and extracts a set of candidates. The question analysis part performs lexical answer type prediction. To compute a baseline ranking, we sorted answer candidates according to their frequencies in the most relevant paragraphs. The reranker module uses information from the previous stages, as well as grammatical information from a dependency parser, to estimate the correctness of the generated answer candidates. The correctness estimate is then used to re-weight the baseline ranking. A 5-fold cross-validation showed that the median rank of the correct candidate improved from 21 in the baseline version to 10 with the reranker.
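As an illustration of the two ranking stages just described, the sketch below (our own, in Python) first scores candidates by their frequency in the most relevant paragraphs and then re-weights that baseline with a correctness estimate supplied by a trained reranker. The correctness callable and all names are hypothetical, and the syntactic and question-analysis features are abstracted away.

    # Baseline frequency ranking, then reranking by a correctness estimate.
    from collections import Counter

    def baseline_scores(candidates, top_paragraphs):
        """Score each candidate by how often it occurs in the top paragraphs."""
        freq = Counter()
        for cand in candidates:
            for para in top_paragraphs:
                freq[cand] += para.lower().count(cand.lower())
        return freq

    def rerank(candidates, top_paragraphs, correctness):
        """correctness(cand) -> probability-like score from a trained model;
        the baseline frequency score is re-weighted by that estimate."""
        base = baseline_scores(candidates, top_paragraphs)
        return sorted(candidates, key=lambda c: base[c] * correctness(c), reverse=True)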


Explaining Watson: Polymath Style

AAAI Conferences

Our paper offers two contributions in one. First, we argue that IBM's Jeopardy!-playing machine needs a formal semantics. We present several arguments as we discuss the system, and we situate the work in the broader context of contemporary AI. Our second point is that work in this area might well be done as a broad collaborative project. Hence our "Blue Sky" contribution is a proposal to organize a polymath-style effort aimed at developing formal tools for the study of state-of-the-art question-answering systems, and other large-scale NLP efforts whose architectures and algorithms lack a theoretical foundation.