
Collaborating Authors

Siskind, Jeffrey Mark


Tricks from Deep Learning

arXiv.org Machine Learning

The deep learning community has devised a diverse set of methods to make practical the gradient-based optimization, over large datasets, of large and highly complex models with deeply cascaded nonlinearities. Taken as a whole, these methods constitute a breakthrough, allowing computational structures that are quite wide, very deep, and with an enormous number and variety of free parameters to be optimized effectively. The result now dominates much of practical machine learning, with applications in machine translation, computer vision, and speech recognition. Many of these methods, viewed through the lens of algorithmic differentiation (AD), can be seen as either addressing issues with the gradient itself or finding ways of achieving increased efficiency using tricks that are AD-related but not provided by current AD systems. The goal of this paper is to explain not just those methods of most relevance to AD, but also the technical constraints and mindset that led to their discovery. After explaining this context, we present a "laundry list" of methods developed by the deep learning community. Two of these are discussed in further mathematical detail: a way to dramatically reduce the size of the tape when performing reverse-mode AD on a (theoretically) time-reversible process such as an ODE integrator, and a new mathematical insight that allows for the implementation of a stochastic Newton's method.
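
The tape-reduction trick lends itself to a compact illustration. Below is a minimal sketch, not the paper's implementation: a leapfrog integrator is run forward with no tape of intermediate states, and the backward pass reconstructs each earlier state by running the integrator in reverse before applying a hand-written vector-Jacobian product for that step. The quadratic potential, step size, and all names here are illustrative assumptions.

```python
import numpy as np

def grad_U(q):
    # Gradient of a toy quadratic potential U(q) = 0.5 * ||q||^2.
    return q

def hess_U_vp(q, v):
    # Hessian-vector product of U; the identity map for the quadratic toy.
    return v

def leapfrog_step(q, p, h):
    p_half = p - 0.5 * h * grad_U(q)
    q_new = q + h * p_half
    p_new = p_half - 0.5 * h * grad_U(q_new)
    return q_new, p_new

def step_vjp(q, p, h, qbar_new, pbar_new):
    # Vector-Jacobian product of one leapfrog step, obtained by
    # transposing its three sub-steps in reverse order.
    p_half = p - 0.5 * h * grad_U(q)
    q_new = q + h * p_half
    g_qnew = qbar_new - 0.5 * h * hess_U_vp(q_new, pbar_new)
    g_phalf = pbar_new + h * g_qnew
    g_q = g_qnew - 0.5 * h * hess_U_vp(q, g_phalf)
    return g_q, g_phalf

def grad_via_reversal(q0, p0, h, n_steps):
    # Forward pass in O(1) memory: no tape of intermediate states.
    q, p = q0.copy(), p0.copy()
    for _ in range(n_steps):
        q, p = leapfrog_step(q, p, h)
    loss = 0.5 * float(q @ q)
    qbar, pbar = q.copy(), np.zeros_like(p)
    # Backward pass: rewind the (exactly reversible) integrator one step
    # at a time, and apply the step's VJP at each recovered state.
    for _ in range(n_steps):
        q, p = leapfrog_step(q, p, -h)   # running with -h inverts a step
        qbar, pbar = step_vjp(q, p, h, qbar, pbar)
    return loss, qbar, pbar

loss, dq0, dp0 = grad_via_reversal(np.array([1.0, 0.0]),
                                   np.array([0.0, 1.0]), 0.01, 1000)
print(loss, dq0, dp0)
```

Because the leapfrog step is exactly invertible (up to floating-point error), memory stays constant in the number of steps, at the cost of one extra integration pass during the gradient computation.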


Learning to Describe Video with Weak Supervision by Exploiting Negative Sentential Information

AAAI Conferences

Most previous work on video description trains individual parts of speech independently. It is more appealing, from a linguistic point of view, for word models for all parts of speech to be learned simultaneously from whole sentences, a hypothesis suggested by some linguists for child language acquisition. In this paper, we learn to describe video by discriminatively training positive sentential labels against negative ones in a weakly supervised fashion: the meaning representations (i.e., HMMs) of the individual words in these labels are learned from whole sentences without any correspondence annotation of what those words denote in the video. Textual descriptions are then generated for new video using the trained word models.
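
As a rough illustration of the training signal, here is a minimal sketch, not the paper's system: each word's meaning is a small discrete HMM over quantized track features, a sentence scores the whole feature sequence with all of its words' HMMs, and the objective pushes the positive sentential label above the negatives. The vocabulary, feature quantization, and random parameters are toy assumptions, and SciPy is assumed available for logsumexp.

```python
import numpy as np
from scipy.special import logsumexp

def forward_loglik(log_pi, log_A, log_B, obs):
    # Standard HMM forward algorithm in log space.
    alpha = log_pi + log_B[:, obs[0]]
    for o in obs[1:]:
        alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_B[:, o]
    return logsumexp(alpha)

def sentence_loglik(hmms, sentence, obs):
    # A sentence scores the same feature sequence with every word's HMM;
    # no annotation says which word denotes what in the video.
    return sum(forward_loglik(*hmms[w], obs) for w in sentence)

def contrastive_loss(hmms, obs, pos_sentence, neg_sentences):
    # Discriminative objective: the positive sentential label should
    # outscore the negative ones.
    pos = sentence_loglik(hmms, pos_sentence, obs)
    neg = logsumexp([sentence_loglik(hmms, s, obs) for s in neg_sentences])
    return neg - pos

def random_hmm(rng, n_states=3, n_symbols=8):
    # Random, properly normalized log-probability parameters.
    norm = lambda a, ax: a - logsumexp(a, axis=ax, keepdims=True)
    return (norm(rng.normal(size=n_states), 0),
            norm(rng.normal(size=(n_states, n_states)), 1),
            norm(rng.normal(size=(n_states, n_symbols)), 1))

rng = np.random.default_rng(0)
hmms = {w: random_hmm(rng) for w in
        ["person", "pick", "up", "put", "down", "ball"]}
obs = rng.integers(0, 8, size=20)        # quantized track features
print(contrastive_loss(hmms, obs,
                       pos_sentence=["person", "pick", "up", "ball"],
                       neg_sentences=[["person", "put", "down", "ball"]]))
```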


Conducting Neuroscience to Guide the Development of AI

AAAI Conferences

Study of the human brain through fMRI can potentially benefit the pursuit of artificial intelligence. Four examples are presented. First, fMRI decoding of the brain activity of subjects watching video clips yields higher accuracy than state-of-the-art computer-vision approaches to activity recognition. Second, novel methods are presented that decode aggregate representations of complex visual stimuli by decoding their independent constituents. Third, cross-modal studies demonstrate the ability to decode the brain activity induced in subjects watching video stimuli when trained on the brain activity induced in subjects seeing text or hearing speech stimuli and vice versa. Fourth, the time course of brain processing while watching video stimuli is probed with scanning that trades off the amount of the brain scanned for the frequency at which it is scanned. Techniques like these can be used to study how the human brain grounds language in visual perception and may motivate development of novel approaches in AI.
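
The third example, cross-modal decoding, can be illustrated in a few lines. The sketch below uses synthetic data, not fMRI recordings: a linear classifier is trained on simulated brain responses from one modality (video) and tested on responses from another (text) that share the same per-class signal, which is the generalization the study tests. scikit-learn and all dimensions here are assumptions of the toy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_voxels, n_per_class, n_classes = 200, 30, 4
prototypes = rng.normal(size=(n_classes, n_voxels))  # shared class signal

def simulate(modality_shift, noise=1.0):
    # Each modality adds its own systematic shift on top of the shared
    # per-class signal, plus trial-by-trial noise.
    X = np.vstack([p + modality_shift
                   + noise * rng.normal(size=(n_per_class, n_voxels))
                   for p in prototypes])
    y = np.repeat(np.arange(n_classes), n_per_class)
    return X, y

X_video, y_video = simulate(rng.normal(scale=0.3, size=n_voxels))
X_text,  y_text  = simulate(rng.normal(scale=0.3, size=n_voxels))

# Train on one modality, test on the other.
clf = LogisticRegression(max_iter=1000).fit(X_video, y_video)
print("cross-modal accuracy:", clf.score(X_text, y_text))
```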


Seeing What You're Told: Sentence-Guided Activity Recognition In Video

arXiv.org Artificial Intelligence

We present a system that demonstrates how the compositional structure of events, in concert with the compositional structure of language, can interplay with the underlying focusing mechanisms in video action recognition, thereby providing a medium not only for top-down and bottom-up integration but also for multi-modal integration between vision and language. We show how the roles played by participants (nouns), their characteristics (adjectives), the actions performed (verbs), the manner of such actions (adverbs), and the changing spatial relations between participants (prepositions), in the form of whole sentential descriptions mediated by a grammar, guide the activity-recognition process. Further, the utility and expressiveness of our framework are demonstrated by performing three separate tasks in the domain of multi-activity videos: sentence-guided focus of attention, generation of sentential descriptions of video, and query-based video search, simply by leveraging the framework in different ways.
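
A minimal sketch of sentence-guided focus of attention, under toy assumptions rather than the paper's models: each word contributes a score over the tracks filling its grammatical roles, and the sentence selects the track-to-role assignment that best satisfies all of its words jointly. The tracks, predicates, and the example sentence "the person approached the chair" are illustrative.

```python
from itertools import permutations
import numpy as np

# Toy tracks: a detector label plus an x-trajectory over four frames.
tracks = [
    {"label": "person", "x": np.array([0.0, 1.0, 2.0, 3.0])},
    {"label": "chair",  "x": np.array([5.0, 5.0, 5.0, 5.0])},
    {"label": "person", "x": np.array([9.0, 9.0, 9.0, 9.0])},
]

def noun_score(word, track):
    # A noun constrains the detector label of its role filler.
    return 0.0 if track["label"] == word else -10.0

def verb_score(verb, agent, patient):
    # "approach": the agent-patient distance should shrink over time.
    d = np.abs(agent["x"] - patient["x"])
    return float(d[0] - d[-1]) if verb == "approach" else 0.0

def sentence_score(assignment):
    # "The person approached the chair": all words score jointly.
    agent, patient = assignment
    return (noun_score("person", agent) + noun_score("chair", patient)
            + verb_score("approach", agent, patient))

# Sentence-guided focus of attention: choose which tracks fill the agent
# and patient roles so as to best satisfy the whole sentence.
agent, patient = max(permutations(tracks, 2), key=sentence_score)
print(agent["label"], "starting at x =", agent["x"][0], "->", patient["label"])
```

Of the two "person" tracks, the sentence picks out the one actually moving toward the chair, illustrating how the description focuses attention among otherwise indistinguishable detections.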


Simultaneous Object Detection, Tracking, and Event Recognition

arXiv.org Artificial Intelligence

The common internal structure and algorithmic organization of object detection, detection-based tracking, and event recognition facilitate a general approach to integrating these three components. This supports multidirectional information flow between the components, allowing object detection to influence tracking and event recognition, and event recognition to influence tracking and object detection. The performance of the combination can exceed the performance of the components in isolation, and this can be achieved with linear asymptotic complexity.
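
The kind of joint inference implied here can be sketched as a single dynamic program (toy scores, not the paper's models): per-frame candidate detections form a lattice, and one Viterbi pass jointly maximizes detection confidence, tracking smoothness, and agreement with an event model, in time linear in the number of frames. The event model below, a preference for motion of a given sign, and all weights are assumptions.

```python
import numpy as np

def joint_viterbi(det_scores, positions, event_velocity_sign, lam=1.0, mu=2.0):
    # det_scores[t][k]: detector confidence of candidate k in frame t.
    # positions[t][k]:  its (1-D, toy) image position.
    T, K = len(det_scores), len(det_scores[0])
    best = np.array(det_scores[0], dtype=float)
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cur = np.full(K, -np.inf)
        for k in range(K):
            for j in range(K):
                dx = positions[t][k] - positions[t - 1][j]
                s = (best[j] + det_scores[t][k]
                     - lam * dx * dx                     # tracking smoothness
                     + mu * event_velocity_sign * dx)    # event-model agreement
                if s > cur[k]:
                    cur[k], back[t, k] = s, j
        best = cur
    # Backtrace the jointly optimal interpretation; total cost is
    # linear in T (quadratic only in the per-frame candidate count K).
    k = int(np.argmax(best))
    path = [k]
    for t in range(T - 1, 0, -1):
        k = int(back[t, k])
        path.append(k)
    return path[::-1], float(best.max())

# Two equally confident candidates per frame: one moving right, one static.
# The event term (preferring rightward motion) pulls the track onto the
# moving candidate -- event recognition influencing tracking.
det = [[1.0, 1.0]] * 4
pos = [[0.0, 5.0], [1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
print(joint_viterbi(det, pos, event_velocity_sign=+1))
```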


Seeing Unseeability to See the Unseeable

arXiv.org Artificial Intelligence

We present a framework that allows an observer to determine the occluded portions of a structure by finding the maximum-likelihood estimate of those occluded portions consistent with the visible image evidence and a consistency model. Doing this requires determining which portions of the structure are occluded in the first place. Since each process relies on the other, we solve both problems in tandem. We extend our framework to determine the confidence of one's assessment of which portions of an observed structure are occluded, and of the estimate of that occluded structure, by determining the sensitivity of that assessment to potential new observations. We further extend our framework to determine a robotic action whose execution would allow a new observation that would maximally increase one's confidence.
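
A minimal sketch of the tandem estimation on a 1-D toy, not the paper's formulation: the code alternates between fitting a smooth low-order consistency model to the samples currently believed visible (the maximum-likelihood estimate of the structure) and re-marking as occluded the samples that model cannot explain. The signal, the degree-5 polynomial model, and the 0.5 residual threshold are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 50)
truth = np.sin(2 * np.pi * x)
obs = truth + 0.05 * rng.normal(size=x.size)
obs[20:30] = 5.0                       # an occluder replaces the signal here

occluded = np.zeros(x.size, dtype=bool)
for _ in range(5):
    # (1) ML estimate of the structure from currently visible evidence,
    #     under a smooth low-order consistency model (toy: degree-5 fit).
    coef = np.polyfit(x[~occluded], obs[~occluded], deg=5)
    model = np.polyval(coef, x)
    # (2) Re-estimate which portions are occluded: samples the
    #     consistency model cannot explain are marked occluded.
    occluded = np.abs(obs - model) > 0.5

estimate = np.where(occluded, model, obs)  # fill occluded span from the model
print("inferred occluded indices:", np.flatnonzero(occluded))
```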


Video In Sentences Out

arXiv.org Artificial Intelligence

We present a system that produces sentential descriptions of video: who did what to whom, and where and how they did it. The action class is rendered as a verb, participant objects as noun phrases, properties of those objects as adjectival modifiers in those noun phrases, spatial relations between those participants as prepositional phrases, and characteristics of the event as prepositional-phrase adjuncts and adverbial modifiers. Extracting the information needed to render these linguistic entities requires an approach to event recognition that recovers object tracks, the track-to-role assignments, and changing body posture.
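
The rendering step alone can be sketched compactly. Below is a minimal, hypothetical example, not the paper's generator: given an event recognizer's output (event class, track-to-role assignments, object properties, and a spatial relation), it assembles the sentence from templates following the mapping the abstract describes. All field names and values are illustrative.

```python
def noun_phrase(obj):
    # Object properties become adjectival modifiers in the noun phrase.
    adjs = " ".join(obj.get("adjectives", []))
    return "the " + (adjs + " " if adjs else "") + obj["noun"]

def render(event):
    # Verb from the event class, noun phrases from track labels, a
    # prepositional phrase from a spatial relation, and an adverb from
    # event characteristics.
    parts = [noun_phrase(event["agent"]), event["verb"],
             noun_phrase(event["patient"])]
    if "landmark" in event:
        parts.append(event["relation"] + " " + noun_phrase(event["landmark"]))
    if "adverb" in event:
        parts.append(event["adverb"])
    s = " ".join(parts) + "."
    return s[0].upper() + s[1:]

event = {
    "verb": "picked up",
    "agent": {"noun": "person"},
    "patient": {"noun": "ball", "adjectives": ["red"]},
    "relation": "to the left of",
    "landmark": {"noun": "chair"},
    "adverb": "quickly",
}
print(render(event))
# The person picked up the red ball to the left of the chair quickly.
```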