Steps toward a Cognitive Vision System

AI Magazine 

An adequate natural language description of developments in a real-world scene can be taken as proof of "understanding what is going on." An algorithmic system that generates natural language descriptions from video recordings of road traffic scenes can be said to "understand" its input to the extent that the algorithmically generated text is acceptable to the humans judging it.

The ability to present a "variant formulation" without distorting the essential parts of the original message is taken as a cue that these essentials have been "understood." During art lessons, in particular those concerned with classical or ecclesiastic paintings, students are initially invited merely to describe what they see. Frequently, considerable a priori knowledge about ancient mythology or biblical traditions is required to characterize the depicted scene succinctly. Lack of the corresponding knowledge about other cultures can make it difficult for someone with only a European education to really understand, and describe in an appropriate manner, a painting by, for example, a classical Far Eastern artist.

The familiar human experiences mentioned in the preceding paragraph will now be "morphed" into a scientific challenge: to design and implement an algorithmic engine that generates an appropriate textual description of essential developments in a video sequence recorded from a real-world scene. Such an algorithmic engine will serve as one example of a cognitive vision system (CVS), a phrasing that, as the experienced reader will have noticed, leaves room for more than one way to introduce the concept of a CVS. An alternative clearly consists in coupling a computer vision system with a robotic system of some kind and assessing the reactions of such a compound system. Whoever accepts the formulation "one of the actions available to an agent is to produce language. This is called a speech act" (Russell and Norvig 1995) is unlikely to consider the two variants of a CVS alluded to previously as being fundamentally different.

With regard to the first CVS version in particular, the following remarks are submitted for consideration: Obviously, we avoid a precise definition of understanding in favor of having humans compare the reaction of an algorithmic engine to that expected from a human. This fuzzy approach toward the circumscription of a CVS opens the road to constructive criticism, that is, to incremental system improvement, by pinpointing aspects of an output text that are not yet considered satisfactory.