Robot Operation of Home Appliances by Reading User Manuals

arXiv.org Artificial Intelligence

Operating home appliances, among the most common tools in every household, is a critical capability for assistive home robots. This paper presents ApBot, a robot system that operates novel household appliances by "reading" their user manuals. ApBot faces multiple challenges: (i) inferring goal-conditioned partial policies from their unstructured, textual descriptions in a user manual, (ii) grounding the policies to the appliance in the physical world, and (iii) executing the policies reliably over potentially many steps, despite compounding errors. To tackle these challenges, ApBot constructs a structured, symbolic model of an appliance from its manual, with the help of a large vision-language model (VLM). It grounds the symbolic actions visually to control panel elements. Finally, ApBot closes the loop by updating the model based on visual feedback. Our experiments show that across a wide range of simulated and real-world appliances, ApBot achieves consistent and statistically significant improvements in task success rate, compared with state-of-the-art large VLMs used directly as control policies. These results suggest that structured internal representations play an important role in robust robot operation of home appliances, especially complex ones.
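The abstract describes ApBot's loop only at a high level. The sketch below illustrates the general idea under simplifying assumptions: a hand-written symbolic model (state variables plus the predicted effect of each control-panel button) stands in for the model ApBot builds from a manual with a VLM, breadth-first search stands in for its planner, and feedback is handled by re-planning from the observed state after every press rather than by ApBot's own model-update procedure. The toy appliance, its button names, and `press_and_observe` are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Appliance = Dict[str, str]  # symbolic state: variable name -> value

@dataclass
class Action:
    name: str  # control-panel element the action is grounded to (hypothetical)
    effect: Callable[[Appliance], Appliance]  # predicted state after pressing it

def satisfies(state: Appliance, goal: Appliance) -> bool:
    return all(state.get(k) == v for k, v in goal.items())

def plan(start: Appliance, goal: Appliance, actions: List[Action],
         max_depth: int = 10) -> Optional[List[str]]:
    """Breadth-first search for the shortest sequence of presses reaching the goal."""
    frontier = deque([(start, [])])
    seen = {tuple(sorted(start.items()))}
    while frontier:
        state, path = frontier.popleft()
        if satisfies(state, goal):
            return path
        if len(path) >= max_depth:
            continue
        for action in actions:
            nxt = action.effect(dict(state))  # effects act on a copy
            key = tuple(sorted(nxt.items()))
            if key not in seen:
                seen.add(key)
                frontier.append((nxt, path + [action.name]))
    return None

def run_closed_loop(start: Appliance, goal: Appliance, actions: List[Action],
                    press_and_observe: Callable[[str], Appliance],
                    max_steps: int = 20) -> bool:
    """Execute one press at a time, re-planning from the *observed* state."""
    state = dict(start)
    for _ in range(max_steps):
        path = plan(state, goal, actions)
        if path is None:
            return False  # goal unreachable under the current model
        if not path:
            return True   # observed state already satisfies the goal
        state = press_and_observe(path[0])
    return False

# Toy appliance (hypothetical): MODE cycles modes, START runs unless mode is off.
MODES = ["off", "defrost", "bake"]

def press_mode(s: Appliance) -> Appliance:
    s["mode"] = MODES[(MODES.index(s["mode"]) + 1) % len(MODES)]
    return s

def press_start(s: Appliance) -> Appliance:
    if s["mode"] != "off":
        s["running"] = "yes"
    return s

ACTIONS = [Action("mode", press_mode), Action("start", press_start)]

# Simulated appliance whose first MODE press fails to register.
true_state = {"mode": "off", "running": "no"}
presses: List[str] = []

def press_and_observe(name: str) -> Appliance:
    presses.append(name)
    if not (name == "mode" and presses.count("mode") == 1):
        effect = {a.name: a.effect for a in ACTIONS}[name]
        true_state.update(effect(dict(true_state)))
    return dict(true_state)  # stand-in for reading the display with a VLM

goal = {"mode": "bake", "running": "yes"}
print(plan(true_state, goal, ACTIONS))  # ['mode', 'mode', 'start']
print(run_closed_loop(true_state, goal, ACTIONS, press_and_observe))  # True
```

Because the plan is recomputed from what is observed, the missed first press costs one extra step instead of leaving the appliance in the wrong mode, which is the kind of compounding error the abstract's third challenge refers to.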


'Minecraft' movie mayhem raises alarms for America's youth, 'bad for society': expert

FOX News

"A Minecraft Movie," the big-screen adaptation of the popular video game "Minecraft," has been packing theaters with rowdy kids and teens since its release this month, spurring a social media phenomenon and sparking concern for America's youth. Videos on social media show young theatergoers huge reactions to one key scene, where one of the film's stars, Jack Black, yells out the phrase "Chicken Jockey!" as a small, Frankenstein-looking creature lands on top of a chicken in a boxing ring to face off with co-star Jason Momoa. The scene has prompted excited fans to scream, shout, throw popcorn around, jump up out of their seats, and in one instance in Provo, Utah, toss a live chicken in the air during a screening, according to the Salt Lake Tribune. Springs Cinema & Taphouse in Sandy Springs, Georgia, told FOX 5 Atlanta that its staff has had to clean up popcorn, ICEEs, ketchup and shattered glass. The scene featuring the "Chicken Jockey" in "A Minecraft Movie" has spawned some chaotic movie theater behavior from young audiences. "The movie-going experience has changed a lot since I was younger," Josh Gunderson, director of marketing and events at Oviedo Mall in Florida, told FOX Business.


'Parents left picking popcorn out of their hair': the meme-soaked magic of A Minecraft Movie

The Guardian

This week I took my son, Zac, to see the new Minecraft movie, which is hardly a remarkable statement in the highly video game-branded world of 21st-century cinema – except that what followed was not typical at all. As you may have seen from a number of bewildered news reports over the last few days, A Minecraft Movie has quickly engendered a community of, let's say, highly engaged and enthusiastic fans. Spurred on by TikTok meme posts, vast portions of the film's audience are now yelling out key lines of dialogue as they happen and singing along to the songs. In one key moment where a rare character from the game – the zombie chicken jockey – is introduced, they go absolutely crazy, throwing drinks and popcorn around, and in some US cinemas, getting escorted from the screening by police. The reaction was a little more muted in our tiny independent cinema in Frome, but still, there were rows of teenagers who had clearly seen all the TikTok posts telling them which lines to shout along to and when to throw stuff, and they were extremely excited to be doing so, a few surreptitiously filming their mates' reactions so they could add to the social media carnage.


First-Step Advantage: Importance of Starting Right in Multi-Step Math Reasoning

arXiv.org Artificial Intelligence

Language models can solve complex reasoning tasks better by learning to generate rationales for their predictions. Often these models know how to solve a task, but because decoding is autoregressive, an incorrect start leads to an incorrect result. We observe that smaller models in particular, when corrected, can solve tasks they would otherwise have struggled with. We demonstrate this phenomenon by using a larger model to guide smaller models, which significantly improves performance (up to +24 points on the GSM8K dataset for 7B models). To help smaller models with the starting step, we propose QuestCoT, where a smaller model first asks itself how to start before proceeding with a chain of reasoning. On various multistep mathematical reasoning datasets and multiple smaller models, we show that getting the right start can lead to significant performance gains across all models (gains of up to +6 points on GSM8K, +9 on SVAMP, +5 on ASDiv, and +7 on MultiArith).
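The abstract does not give QuestCoT's prompts, so the following is only a minimal sketch of the two-stage idea it describes: first settle the opening step, then continue the chain of reasoning conditioned on it. The prompt wording, the `Generate` interface, and the `toy_model` stand-in are assumptions for illustration. Passing a different `starter` mimics the setting where a larger model supplies the first step for a smaller one.

```python
from typing import Callable, Optional

Generate = Callable[[str], str]  # hypothetical model interface: prompt in, completion out

def quest_cot(question: str, solver: Generate, starter: Optional[Generate] = None) -> str:
    """Two-stage prompting: decide the first step, then reason from it.

    With starter=None the solver asks itself how to start (the QuestCoT idea);
    a larger model passed as starter mimics the large-model-guidance setting.
    The prompt wording is an assumption, not the paper's template.
    """
    starter = starter or solver
    start_prompt = (
        f"Question: {question}\n"
        "How should I start? Give only the first step.\n"
        "First step:"
    )
    first_step = starter(start_prompt).strip()
    solve_prompt = (
        f"Question: {question}\n"
        "Let's think step by step.\n"
        f"Step 1: {first_step}\n"
        "Step 2:"
    )
    return f"Step 1: {first_step}\nStep 2:{solver(solve_prompt)}"

# Toy stand-in model so the sketch runs without an LLM.
def toy_model(prompt: str) -> str:
    if prompt.endswith("First step:"):
        return " Ann buys 3 more apples: 5 + 3 = 8."
    return " She gives away 2, so 8 - 2 = 6. Answer: 6"

question = "Ann has 5 apples, buys 3 and gives away 2. How many are left?"
print(quest_cot(question, toy_model))
```

Keeping the first step in a separate call makes it easy to swap in a stronger model for just that step, which is how the abstract's larger-model experiment isolates the effect of starting right.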


Teaching-Assistant-in-the-Loop: Improving Knowledge Distillation from Imperfect Teacher Models in Low-Budget Scenarios

arXiv.org Artificial Intelligence

There is increasing interest in distilling task-specific knowledge from large language models (LLMs) into smaller student models. Nonetheless, LLM distillation presents a dual challenge: 1) querying the teacher LLM, such as GPT-4, for an ample number of demonstrations is costly; 2) the teacher LLM may provide imperfect outputs that harm the student's learning process. To improve sample efficiency in resource-constrained, imperfect-teacher scenarios, we propose a three-component framework that leverages three types of signal. The first signal is the student's self-consistency (the consistency of the student's multiple outputs), which serves as a proxy for the student's confidence. In addition, we introduce a "teaching assistant" (TA) model that assesses the uncertainty of both the student's and the teacher's outputs via confidence scoring; these scores provide the other two signals for student training. Furthermore, we propose a two-stage training schema that first warms up the student on a small proportion of the data to make better use of the student's signal. Experiments on four complex reasoning tasks show the superiority of the proposed framework. On average, the two-stage framework brings a relative improvement of up to 20.79% across datasets compared with fine-tuning without any signals.
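The first signal, student self-consistency, can be made concrete as the share of sampled answers that agree with the majority answer. The sketch below shows that computation; it is an illustration, not the paper's code. The `least_confident` helper is a hypothetical use of the score (picking examples where the student is least sure), not the paper's training rule, and how the framework combines this score with the two TA scores is not stated in the abstract.

```python
from collections import Counter
from typing import Dict, List, Tuple

def self_consistency(answers: List[str]) -> Tuple[str, float]:
    """Majority answer among sampled outputs and the fraction that agree with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(a.strip() for a in answers)
    majority, freq = counts.most_common(1)[0]
    return majority, freq / len(answers)

def least_confident(samples: Dict[str, List[str]], k: int) -> List[str]:
    """Hypothetical use: the k examples where the student is least self-consistent."""
    return sorted(samples, key=lambda ex: self_consistency(samples[ex])[1])[:k]

print(self_consistency(["42", "42", "41", "42", "40"]))  # ('42', 0.6)
pool = {"q1": ["7", "7", "7"], "q2": ["3", "5", "3"], "q3": ["1", "2", "4"]}
print(least_confident(pool, 2))  # ['q3', 'q2']
```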


From Text to Motion: Grounding GPT-4 in a Humanoid Robot "Alter3"

arXiv.org Artificial Intelligence

We report the development of Alter3, a humanoid robot capable of generating spontaneous motion using a Large Language Model (LLM), specifically GPT-4. This achievement was realized by integrating GPT-4 into our proprietary android, Alter3, thereby effectively grounding the LLM with Alter3's bodily movement. Typically, low-level robot control is hardware-dependent and falls outside the scope of LLM corpora, presenting challenges for direct LLM-based robot control. However, in the case of humanoid robots like Alter3, direct control is feasible by mapping the linguistic expressions of human actions onto the robot's body through program code. Remarkably, this approach enables Alter3 to adopt various poses, such as a 'selfie' stance or 'pretending to be a ghost,' and generate sequences of actions over time without explicit programming for each body part. This demonstrates the robot's zero-shot learning capabilities. Additionally, verbal feedback can adjust poses, obviating the need for fine-tuning. A video of Alter3's generated motions is available at https://tnoinkwms.github.io/ALTER-LLM/
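The abstract says language is mapped onto the body "through program code" but does not show that code. As a purely hypothetical illustration, suppose the LLM is prompted to emit one `set_axis(axis, value)` call per line; the sketch parses those calls into a pose, clamps values to an assumed 0 to 255 actuator range, and applies a verbal-feedback correction by overriding only the axes it mentions. The command format, the range, and the function names are assumptions, not Alter3's actual interface.

```python
import re
from typing import Dict

AXIS_MIN, AXIS_MAX = 0, 255  # assumed actuator range, not a documented Alter3 spec
# Hypothetical command format an LLM could be prompted to emit, one call per line.
COMMAND = re.compile(r"set_axis\(\s*(\d+)\s*,\s*(-?\d+)\s*\)")

def parse_pose(llm_code: str) -> Dict[int, int]:
    """Extract axis -> value targets, clamping out-of-range values to safe limits."""
    pose: Dict[int, int] = {}
    for axis, value in COMMAND.findall(llm_code):
        pose[int(axis)] = max(AXIS_MIN, min(AXIS_MAX, int(value)))
    return pose

def apply_feedback(pose: Dict[int, int], correction: Dict[int, int]) -> Dict[int, int]:
    """A correction (e.g. parsed from a revised LLM reply) overrides only the axes it names."""
    return {**pose, **correction}

selfie = parse_pose("set_axis(1, 200)\nset_axis(2, 300)\nset_axis(5, -10)")
print(selfie)  # {1: 200, 2: 255, 5: 0}
print(apply_feedback(selfie, parse_pose("set_axis(2, 180)")))  # {1: 200, 2: 180, 5: 0}
```

Parsing and clamping the generated code, rather than executing it directly, is one cautious design choice when an LLM drives physical actuators.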


Theory of Mind Might Have Spontaneously Emerged in Large Language Models

arXiv.org Artificial Intelligence

Abstract: We explore the intriguing possibility that theory of mind (ToM), or the uniquely human ability to impute unobservable mental states to others, might have spontaneously emerged in large language models (LLMs). We designed 40 false-belief tasks, considered a gold standard in testing ToM in humans, and administered them to several LLMs. Each task included a false-belief scenario, three closely matched true-belief controls, and the reversed versions of all four. Smaller and older models solved no tasks; GPT-3-davinci-003 (from November 2022) and ChatGPT-3.5-turbo […] These findings suggest the intriguing possibility that ToM, previously considered exclusive to humans, may have spontaneously emerged as a byproduct of LLMs' improving language skills. LLMs' performance dropped, but the results and the conclusions remain the same. Expanded the discussion to present the results in the context of the ongoing debate on whether AI can be credited with human-like mental properties. Code availability and data: The code used to estimate the results, the false-belief tasks, and the instructions given to research assistants can be accessed at https://osf.io/csdhb. Main Text: Many animals excel at using cues such as vocalization, body posture, gaze, or facial expression to predict other animals' behavior and mental states. Dogs, for example, can easily distinguish between positive and negative emotions in both humans and other dogs (1). Yet, humans do not merely respond to observable cues, but also automatically and effortlessly track others' unobservable mental states, such as their knowledge, intentions, beliefs, and desires (2). This ability, typically referred to as "theory of mind" (ToM), is considered central to human social interactions (3), communication (4), empathy (5), self-consciousness (6), moral judgment (7, 8), and even religious beliefs (9). 
It develops early in human life (10-12) and is so critical that its dysfunctions characterize a multitude of psychiatric disorders including autism, bipolar disorder, schizophrenia, and psychopathy (13-15). Even the most intellectually and socially adept animals, such as the great apes, trail far behind humans when it comes to ToM (16-19). Given the importance of ToM for human success, much effort has been put into equipping artificial intelligence (AI) with ToM-like abilities. Virtual and physical AI agents would be better and safer if they could impute unobservable mental states to others. The safety of self-driving cars, for example, would greatly increase if they could anticipate the intentions of pedestrians and human drivers. Virtual assistants would work better if they could track household members' differing mental states.
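The task design in the abstract implies eight scenarios per task (one false-belief scenario and three true-belief controls, each in original and reversed form), so 40 tasks yield 320 scenarios. The sketch below enumerates them and scores tasks. The all-or-nothing scoring rule (a task counts only if every variant is answered correctly) is an assumption for illustration, since the abstract excerpt does not state how tasks were scored; the condition labels are likewise hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

Variant = Tuple[int, str, str]  # (task id, condition, version)
CONDITIONS = ["false_belief", "true_belief_1", "true_belief_2", "true_belief_3"]
VERSIONS = ["original", "reversed"]

def variants(task_id: int) -> List[Variant]:
    """One false-belief scenario, three true-belief controls, and reversed versions of all four."""
    return [(task_id, c, v) for c, v in product(CONDITIONS, VERSIONS)]

def tasks_solved(correct: Dict[Variant, bool], n_tasks: int) -> int:
    """Assumed all-or-nothing rule: a task counts only if every variant is answered correctly."""
    return sum(all(correct.get(v, False) for v in variants(t)) for t in range(n_tasks))

all_variants = [v for t in range(40) for v in variants(t)]
print(len(all_variants))  # 320 scenarios = 40 tasks x 4 conditions x 2 versions

# Hypothetical model output: everything right except the reversed false-belief story of task 0.
answers = {v: True for v in all_variants}
answers[(0, "false_belief", "reversed")] = False
print(tasks_solved(answers, 40))  # 39
```

A strict rule like this keeps a model from getting credit for a task by answering only the easier true-belief controls correctly.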


Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks

arXiv.org Artificial Intelligence

Intuitive psychology is a pillar of common-sense reasoning. The replication of this reasoning in machine intelligence is an important stepping-stone on the way to human-like artificial intelligence. Several recent tasks and benchmarks for examining this reasoning in Large Language Models have focused in particular on belief attribution in Theory-of-Mind tasks. These tasks have shown both successes and failures. We consider in particular a recent purported success case (1), and show that small variations that maintain the principles of ToM turn the results on their head. We argue that in general, the zero-hypothesis for model evaluation in intuitive psychology should be skeptical, and that outlying failure cases should outweigh average success rates. We also consider what possible future successes on Theory-of-Mind tasks by more powerful LLMs would mean for ToM tasks with people.
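The argument that outlying failures should outweigh average success rates can be illustrated with a small evaluation summary that reports the worst-performing family of task variants next to the mean. The family names and results below are made up for illustration and are not data from the paper.

```python
from statistics import mean
from typing import Dict, List, Tuple

def summarize(results: Dict[str, List[bool]]) -> Tuple[float, str, float]:
    """Mean accuracy across variant families, plus the worst family and its accuracy."""
    acc = {family: mean(map(float, outcomes)) for family, outcomes in results.items()}
    worst = min(acc, key=acc.get)
    return mean(list(acc.values())), worst, acc[worst]

# Made-up per-item correctness for an original task and two altered versions.
results = {
    "original": [True] * 10,
    "transparent_container": [True] * 9 + [False],
    "uninformative_label": [True] * 2 + [False] * 8,
}
avg, worst, worst_acc = summarize(results)
print(f"mean={avg:.2f} worst={worst} ({worst_acc:.2f})")  # mean=0.70 worst=uninformative_label (0.20)
```

A respectable mean of 0.70 hides a family where the model is mostly wrong; under the skeptical stance argued for above, that 0.20 is the number that matters.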


Making Sense of Data Features - DataScienceCentral.com

#artificialintelligence

Spend any time at all in the machine learning space, and pretty soon you will encounter the term "feature". It's a term that may seem self-evident at first, but it very quickly descends into a level of murkiness that can leave most laypeople (and even many programmers) confused, especially when you hear examples of machine learning systems that involve millions or even billions of features. If you take a look at a spreadsheet, you can think of a feature as being roughly analogous to a column of data, along with the metadata that describes that column. This means that each cell in that column (which corresponds to a given "record") becomes one item in an array, not including any header labels for that column. The feature could have potentially thousands of values, but they are all values of the same type and semantics.
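The spreadsheet analogy can be made concrete: each column header acts as the feature's metadata, and the cells below it (one per record) form an array of values that share a type and meaning. The small table here is made-up sample data.

```python
import csv
import io
from typing import Dict, List

# Made-up sample sheet: the header row is metadata, each later row is one record.
SHEET = """name,age,city
Ana,34,Lisbon
Ben,28,Oslo
Chen,41,Lima
"""

def features(csv_text: str) -> Dict[str, List[str]]:
    """Turn a spreadsheet-like table into one array of values per column (feature)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, records = rows[0], rows[1:]
    return {name: [record[i] for record in records] for i, name in enumerate(header)}

cols = features(SHEET)
ages = [int(v) for v in cols["age"]]  # one feature: same type and semantics, header excluded
print(ages)  # [34, 28, 41]
```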


New AI model shows how machines can learn from vision, language and sound together – GeekWire

#artificialintelligence

Most of us have watched television with the sound turned off at one time or another. While it's usually possible to follow the story at least to some degree, the absence of an audio track tends to limit our ability to fully appreciate what's taking place. Similarly, it's easy to miss a lot of information just listening to the sounds coming from another room. The multimodality of combining image, sound and other details greatly enhances our understanding of what's happening, whether it's on TV or in the real world.