ENTER: Event Based Interpretable Reasoning for VideoQA

arXiv.org Artificial Intelligence

In this paper, we present ENTER, an interpretable Video Question Answering (VideoQA) system based on event graphs. Event graphs convert videos into graphical representations, where video events form the nodes and event-event relationships (temporal/causal/hierarchical) form the edges. This structured representation offers many benefits: 1) Interpretable VideoQA via generated code that parses the event graph; 2) Incorporation of contextual visual information in the reasoning process (code generation) via event graphs; 3) Robust VideoQA via Hierarchical Iterative Update of the event graphs. Existing interpretable VideoQA systems are often top-down, disregarding low-level visual information in the reasoning plan generation, and are brittle. While bottom-up approaches produce responses from visual data, they lack interpretability. Experimental results on NExT-QA, IntentQA, and EgoSchema demonstrate that our method not only outperforms existing top-down approaches while remaining competitive with bottom-up approaches but, more importantly, offers superior interpretability and explainability in the reasoning process.
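As a rough illustration of the event-graph idea described in this abstract, the following minimal Python sketch stores events as nodes and typed relationships as edges. The class name `EventGraph`, its methods, and the sample events are hypothetical and are not taken from the ENTER paper; the sketch only assumes the three relation types the abstract names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Relation types named in the abstract; everything else here is hypothetical.
RELATIONS = {"temporal", "causal", "hierarchical"}


@dataclass
class EventGraph:
    # Node id -> short text description of a video event.
    events: Dict[int, str] = field(default_factory=dict)
    # Directed, typed edges stored as (source id, relation, target id).
    edges: List[Tuple[int, str, int]] = field(default_factory=list)

    def add_event(self, event_id: int, description: str) -> None:
        self.events[event_id] = description

    def add_edge(self, src: int, relation: str, dst: int) -> None:
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        if src not in self.events or dst not in self.events:
            raise KeyError("both endpoints must be existing events")
        self.edges.append((src, relation, dst))

    def related(self, src: int, relation: str) -> List[str]:
        # Descriptions of events reachable from src over one edge of this type.
        return [self.events[d] for s, r, d in self.edges
                if s == src and r == relation]


if __name__ == "__main__":
    graph = EventGraph()
    graph.add_event(0, "child runs across the yard")
    graph.add_event(1, "child trips on a toy")
    graph.add_event(2, "child starts crying")
    graph.add_edge(0, "temporal", 1)
    graph.add_edge(1, "causal", 2)
    # A generated reasoning program could answer "what did tripping cause?"
    print(graph.related(1, "causal"))
```

A question-specific program generated by a language model could then call methods such as `related` to traverse the graph, which is one way the reasoning steps could remain inspectable.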


The US Army's Vision of Soldiers in Exoskeletons Lives On

WIRED

After decades of research and development, the United States Army is taking yet another run at developing a powered exoskeleton to help soldiers carry heavy loads on the battlefield--but don't expect a futuristic suit of combat armor straight out of Starship Troopers or Iron Man anytime soon. Soldiers assigned to the Army's 1-78 Field Artillery Battalion training unit at Fort Sill, Oklahoma, recently completed a three-day "proof of concept" evaluation of several off-the-shelf "exoskeleton suits" in late September and early October, officials confirmed to WIRED. The evaluation was overseen by the service's Combat Capabilities Development Command (DEVCOM), the organization responsible for developing new technology for soldiers. Official photos from the evaluation published to social media showed Advanced Individual Training students hauling artillery shells to and from an M109 Paladin self-propelled howitzer and an M777 towed howitzer with telltale black exoskeleton harnesses contrasted against their camouflage uniforms, part of a field exercise undertaken "to assess the potential of human augmentation, improve soldier performance, and determine if these exoskeletons meet the demands of our warfighters," as the service put it. While a DEVCOM spokesperson declined to identify which commercially produced systems were evaluated by soldiers, the Army announced its intent in August to award a contract to exoskeleton maker SUITX to "give users experience of advanced soldier augmentation technologies," according to a government notice.


Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers

arXiv.org Artificial Intelligence

Visual reasoning is dominated by end-to-end neural networks scaled to billions of model parameters and training examples. However, even the largest models struggle with compositional reasoning, generalization, fine-grained spatial and temporal reasoning, and counting. Visual reasoning with large language models (LLMs) as controllers can, in principle, address these limitations by decomposing the task and solving subtasks by orchestrating a set of (visual) tools. Recently, these models achieved great performance on tasks such as compositional visual question answering, visual grounding, and video temporal reasoning. Nevertheless, in their current form, these models heavily rely on human engineering of in-context examples in the prompt, which are often dataset- and task-specific and require significant labor by highly skilled programmers. In this work, we present a framework that mitigates these issues by introducing spatially and temporally abstract routines and by leveraging a small number of labeled examples to automatically generate in-context examples, thereby avoiding human-created in-context examples. On a number of visual reasoning tasks, we show that our framework leads to consistent gains in performance, makes the LLMs-as-controllers setup more robust, and removes the need for human engineering of in-context examples.


Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models

arXiv.org Artificial Intelligence

Solving complex visual tasks such as "Who invented the musical instrument on the right?" involves a composition of skills: understanding space, recognizing instruments, and also retrieving prior knowledge. Recent work shows promise by decomposing such tasks using a large language model (LLM) into an executable program that invokes specialized vision models. However, generated programs are error-prone: they omit necessary steps, include spurious ones, and are unable to recover when the specialized models give incorrect outputs. Moreover, they require loading multiple models, incurring high latency and computation costs. We propose Visual Program Distillation (VPD), an instruction tuning framework that produces a vision-language model (VLM) capable of solving complex visual tasks with a single forward pass. VPD distills the reasoning ability of LLMs by using them to sample multiple candidate programs, which are then executed and verified to identify a correct one. It translates each correct program into a language description of the reasoning steps, which are then distilled into a VLM. Extensive experiments show that VPD improves the VLM's ability to count, understand spatial relations, and reason compositionally. Our VPD-trained PaLI-X outperforms all prior VLMs, achieving state-of-the-art performance across complex vision tasks, including MMBench, OK-VQA, A-OKVQA, TallyQA, POPE, and Hateful Memes. An evaluation with human annotators also confirms that VPD improves model response factuality and consistency. Finally, experiments on content moderation demonstrate that VPD is also helpful for adaptation to real-world applications with limited data.


Recursive Visual Programming

arXiv.org Artificial Intelligence

Visual Programming (VP) has emerged as a powerful framework for Visual Question Answering (VQA). By generating and executing bespoke code for each question, these methods demonstrate impressive compositional and reasoning capabilities, especially in few-shot and zero-shot scenarios. However, existing VP methods generate all code in a single function, resulting in code that is suboptimal in terms of both accuracy and interpretability. Inspired by human coding practices, we propose Recursive Visual Programming (RVP), which simplifies generated routines, provides more efficient problem solving, and can manage more complex data structures. RVP tackles VQA tasks with an iterative, recursive code generation approach, allowing decomposition of complicated problems into smaller parts. Notably, RVP is capable of dynamic type assignment, i.e., as the system recursively generates a new piece of code, it autonomously determines the appropriate return type and crafts the requisite code to generate that output. We show RVP's efficacy through extensive experiments on benchmarks including VSR, COVR, GQA, and NExT-QA, underscoring the value of adopting human-like recursive and modular programming techniques for solving VQA tasks through coding.


US jets intercept Russian Tu-95 bombers near Alaska; first encounter there since US drone taken down

FOX News

U.S. fighter jets intercepted Russian bomber aircraft near Alaska Monday, according to the Alaskan Region of North American Aerospace Defense Command (NORAD). NORAD made the announcement Wednesday in an official statement. "The Alaskan Region of North American Aerospace Defense Command (NORAD) detected, tracked, positively identified and intercepted two Russian aircraft entering and operating within the Alaska Air Defense Identification Zone (ADIZ) on April 17, 2023," the defense organization said.


Pentagon goes on AI hiring spree to bring machine learning capabilities to the battlefield

FOX News

The Pentagon is hiring data scientists, technologists and engineers as part of its effort to incorporate artificial intelligence into the machinery used to wage war. The Defense Department has posted several AI jobs on USAjobs.gov over the last few weeks, including many with salaries well into six figures. One of the higher paying jobs advertised in the last few weeks is for a senior technologist for "cognitive and decision science" at the U.S. Navy's Point Loma Complex in San Diego. That job starts at $170,000 and could pay as much as $212,000 a year for someone who can help insert "cutting-edge technology" into Navy weaponry and equipment.


Your smart speakers are listening to you. Here's how to delete their recordings.

Popular Science

The Echo smart speaker, powered by Amazon's artificially intelligent assistant Alexa, keeps a digital ear out for its wake phrase, "Hey Alexa." When it hears these words, it starts recording the sounds that come next--your spoken commands--and then it saves these snippets in the cloud. But if it really makes you uncomfortable, then manually make the speaker stop listening when you're not using it: Tap the microphone button on the top of the device, and it will stop listening for your next "hey Alexa." To review and potentially delete the snippets that the Echo has been saving, you have to go through the Alexa app (for Android and iOS) or the Amazon website. Let's start with the former. When you open the Alexa app on your phone, the front page displays a list of saved words and phrases that you've directed at your Echo speaker.


New Niger drone video shows harrowing escape of surviving U.S. forces amid friendly fire

The Japan Times

WASHINGTON – Dramatic new drone video of the Niger ambush that killed four American soldiers shows U.S. forces desperately trying to escape and fighting for their lives after friendly Nigerien forces mistook them for the enemy. It describes how the fleeing troops set up a quick defensive location on the edge of a swamp and -- thinking they were soon to die -- wrote messages home to their loved ones. The video, released by the Pentagon with explanatory narration, includes more than 10 minutes of drone footage, file tape and animation that wasn't made public last week when the military released a portion of the final report on the October attack. The video depicts for the first time the harrowing hours as troops held off their enemy and waited for rescue. There were 46 U.S. and Nigerien troops out on the initial mission in the West African nation, going after but failing to find a high-value militant, then collecting intelligence at a site where the insurgent had been.


KBEmacs: Where's the AI?

AI Magazine

The Programmer's Apprentice project uses the domain of programming as a vehicle for studying (and attempting to duplicate) human problem solving behavior. Recognizing that it will be a long time before it is possible to fully duplicate an expert programmer's abilities, the project seeks to develop an intelligent assistant system, the Programmer's Apprentice (PA), which will help a programmer in various phases of the programming task. The Knowledge-Based Editor in Emacs (KBEmacs) is an initial step in the direction of the PA. A question that has been asked about KBEmacs is, "Where's the AI?" Going beyond this, the article uses the development of KBEmacs as an example that illustrates a number of general features of the process of developing an applied AI system. As part of this, the article compares the way AI ideas are used in KBEmacs with the way they were used in the initial proposal for the PA.