Can Deception Detection Go Deeper? Dataset, Evaluation, and Benchmark for Deception Reasoning

arXiv.org Artificial Intelligence

Deception detection has attracted increasing attention due to its importance in real-world scenarios. Its main goal is to detect deceptive behaviors from multimodal clues such as gestures, facial expressions, prosody, etc. However, these bases are usually subjective and related to personal habits. Therefore, we extend deception detection to deception reasoning, further providing objective evidence to support subjective judgment. Specifically, we provide potential lies and basic facts and then analyze why this sentence may be a lie by combining factual inconsistencies and intent behind them. Compared with deception detection, this task is more applicable to real-world scenarios. For example, in interrogation, the police should judge whether a person is lying based on solid evidence. This paper presents our initial attempts at this task, including constructing a dataset and defining evaluation metrics. Meanwhile, this task can serve as a benchmark for evaluating the complex reasoning capability of large language models. Code and data will be made publicly available.


Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching

arXiv.org Artificial Intelligence

Large language models (LLMs) often struggle to provide up-to-date information due to their one-time training and the constantly evolving nature of the world. To keep LLMs current, existing approaches typically involve continued pre-training on new documents. However, they frequently face difficulties in extracting stored knowledge. Motivated by the remarkable success of the Feynman Technique in efficient human learning, we introduce Self-Tuning, a learning framework aimed at improving an LLM's ability to effectively acquire new knowledge from raw documents through self-teaching. Specifically, we develop a Self-Teaching strategy that augments the documents with a set of knowledge-intensive tasks created in a self-supervised manner, focusing on three crucial aspects: memorization, comprehension, and self-reflection. In addition, we introduce three Wiki-Newpages-2023-QA datasets to facilitate an in-depth analysis of an LLM's knowledge acquisition ability concerning memorization, extraction, and reasoning. Extensive experimental results on Llama2 family models reveal that Self-Tuning consistently exhibits superior performance across all knowledge acquisition tasks and excels in preserving previous knowledge.


Engadget Podcast: The fallout from Apple's WWDC 2024 and Summer Game Fest

Engadget

This week has felt like a month's worth of news, now that we've wrapped up Apple's WWDC 2024 and Summer Game Fest in LA. In this episode, Cherlynn and Devindra discuss their final thoughts on Apple Intelligence and the company's upcoming software, and they chat about some of our coverage highlights from the pseudo-E3 Game Fest. Also, we dive into X making likes private (what is Elon hiding?!) and the news around Sony buying the Alamo Drafthouse theater chain. Listen below or subscribe on your podcast app of choice. If you've got suggestions or topics you'd like covered on the show, be sure to email us or drop a note in the comments! And be sure to check out our other podcast, Engadget News!

Summer Game Fest highlights: Kunitsu-Gami: Path of the Goddess, LEGO Horizon Adventures, and an Assassin's Creed finally set in Japan – 25:06
X makes users' likes private – 40:27

Devindra: We are back from Apple's WWDC, and we have thoughts. And I feel like it's just one of those whirlwind things. Both Cherlynn and I got back in from California yesterday. After recording this, I still feel like my body doesn't know, like, where I am, Cherlynn, or what time zone. I don't know how you feel.

Cherlynn: I went to the gym at 8 a.m.

Devindra: I like how you fit in the humble brag there. We're also going to be talking about Summer Game Fest, folks. We weren't there for that and I was trying to get Jess Conditt on, but she's super busy still writing up stuff from that. So we have got a lot of coverage around that and there are some stories I want to highlight that Engadget has done. Also some games that look pretty cool. Also joining us this morning is podcast producer Ben Ellman, who I'm sure has thoughts on Apple and the game stuff. And [00:01:00] as always, folks, if you're enjoying the show, please be sure to subscribe to us on iTunes or your podcatcher of choice, and leave us a review on iTunes. I would love to answer some reader questions. You can also typically join us Thursday mornings around 10:30 a.m. It's just, like, about scheduling, but that's about the time you can carve out in your schedule for us. You could see us on video. Sometimes we'll demo gadgets and we'll just have a great Q&A session too. I do want to point out, if you're just listening to this episode, we did do a bonus episode at Apple's campus and it actually turned out pretty well, because Cherlynn and I were, like, right outside the, was it the Mac Cafe or Cafe Mac? But we were outdoors surrounded by traffic and other noise, but it actually ended up sounding pretty good.


Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

arXiv.org Artificial Intelligence

In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.


GenQA: Generating Millions of Instructions from a Handful of Prompts

arXiv.org Artificial Intelligence

Most public instruction finetuning datasets are relatively small compared to the closed source datasets used to train industry models. To study questions about finetuning at scale, such as curricula and learning rate cooldown schedules, there is a need for industrial-scale datasets. However, this scale necessitates a data generation process that is almost entirely automated. In this work, we study methods for generating large instruction datasets from a single prompt. With little human oversight, we get LLMs to write diverse sets of instruction examples ranging from simple completion tasks to complex multi-turn dialogs across a variety of subject areas. When finetuning a Llama-3 8B base model, our dataset meets or exceeds both WizardLM and Ultrachat on both knowledge-intensive leaderboard tasks as well as conversational evaluations. We release our dataset, the "generator" prompts that created it, and our finetuned model checkpoints.


If Ray Kurzweil Is Right (Again), You'll Meet His Immortal Soul in the Cloud

WIRED

The 76-year-old scientist and engineer has spent much of his time on earth arguing that humans can not only take advantage of yet-to-be-invented medical advances to live longer, but also ultimately merge with machines, become hyperintelligent, and stick around indefinitely. Just minutes before we met, we both learned that Daniel Kahneman, the Nobel Prize–winning psychologist and one of Kurzweil's intellectual jousting partners, had suffered that fate. A few days before that, the science fiction author Vernor Vinge had also passed. Vinge's novels first described the singularity, that moment when superintelligent AI surpasses what humans can do and mere mortals need high-tech augmentation themselves to remain relevant. Kurzweil embraced the name for his own grand vision, and in 2005 wrote a best-selling book called The Singularity Is Near.


ChatISA: A Prompt-Engineered Chatbot for Coding, Project Management, Interview and Exam Preparation Activities

arXiv.org Artificial Intelligence

As generative AI continues to evolve, educators face the challenge of preparing students for a future where AI-assisted work is integral to professional success. This paper introduces ChatISA, an in-house, multi-model chatbot designed to support students in an Information Systems and Analytics department. ChatISA comprises four primary modules (Coding Companion, Project Coach, Exam Ally, and Interview Mentor), each tailored to enhance different aspects of the educational experience. Through iterative development, student feedback, and leveraging open-source frameworks, we created a robust tool that addresses coding inquiries, project management, exam preparation, and interview readiness. The implementation of ChatISA revealed significant insights and challenges, including the necessity of ethical guidelines and balancing AI usage with maintaining student agency. Our findings underscore the importance of adaptive pedagogy and proactive engagement with AI tools to maximize their educational benefits. To support broader adoption and innovation, all code for ChatISA is made publicly available on GitHub, enabling other institutions to customize and integrate similar AI-driven educational tools within their curricula.


Yo'LLaVA: Your Personalized Language and Vision Assistant

arXiv.org Artificial Intelligence

Large Multimodal Models (LMMs) have shown remarkable capabilities across a variety of tasks (e.g., image captioning, visual question answering). While broad, their knowledge remains generic (e.g., recognizing a dog), and they are unable to handle personalized subjects (e.g., recognizing a user's pet dog). Human reasoning, in contrast, typically operates within the context of specific subjects in our surroundings. For example, one might ask, "What should I buy for my dog's birthday?"; as opposed to a generic inquiry about "What should I buy for a dog's birthday?". Similarly, when looking at a friend's image, the interest lies in seeing their activities (e.g., "my friend is holding a cat"), rather than merely observing generic human actions (e.g., "a man is holding a cat"). In this paper, we introduce the novel task of personalizing LMMs, so that they can have conversations about a specific subject. We propose Yo'LLaVA, which learns to embed a personalized subject into a set of latent tokens given a handful of example images of the subject. Our qualitative and quantitative analyses reveal that Yo'LLaVA can learn the concept more efficiently using fewer tokens and more effectively encode the visual attributes compared to strong prompting baselines (e.g., LLaVA).


DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have demonstrated remarkable capabilities, revolutionizing the integration of AI in daily life applications. However, they are prone to hallucinations, generating claims that contradict established facts, deviating from prompts, and producing inconsistent responses when the same prompt is presented multiple times. Addressing these issues is challenging due to the lack of comprehensive and easily assessable benchmark datasets. Most existing datasets are small and rely on multiple-choice questions, which are inadequate for evaluating the generative prowess of LLMs. To measure hallucination in LLMs, this paper introduces a comprehensive benchmark dataset comprising over 75,000 prompts across eight domains. These prompts are designed to elicit definitive, concise, and informative answers. The dataset is divided into two segments: one publicly available for testing and assessing LLM performance and a hidden segment for benchmarking various LLMs. In our experiments, we tested six LLMs (GPT-3.5, Llama 2, Llama 3, Gemini, Mixtral, and Zephyr), revealing that overall factual hallucination ranges from 59% to 82% on the public dataset and 57% to 76% in the hidden benchmark. Prompt misalignment hallucination ranges from 6% to 95% in the public dataset and 17% to 94% in the hidden counterpart. Average consistency ranges from 21% to 61% and 22% to 63%, respectively. Domain-wise analysis shows that LLM performance significantly deteriorates when asked for specific numeric information while performing moderately with person, location, and date queries. Our dataset demonstrates its efficacy and serves as a comprehensive benchmark for LLM performance evaluation. Our dataset and LLM responses are available at https://github.com/ashikiut/DefAn.


Engadget Podcast: Recapping WWDC 2024 from Apple Park

Engadget

There was no new Apple hardware at WWDC 2024, but Apple still had tons of news around AI and its upcoming operating systems. In this bonus episode, Cherlynn and Devindra brave the California heat to discuss Apple Intelligence and how it's different from other AI solutions. And they dive into other new features they're looking forward to, like the iPhone mirroring in macOS Sequoia and iPadOS 18's surprisingly cool Calculator app. Listen below or subscribe on your podcast app of choice. If you've got suggestions or topics you'd like covered on the show, be sure to email us or drop a note in the comments! And be sure to check out our other podcast, Engadget News!

This is Devindra here, and we are live at Apple Park. Cherlynn and I are in the middle of covering Apple's WWDC conference.

Cherlynn: We are, I feel quite zen right now, because even though I have a lot more meetings coming up, we are seated outside, it's nice out, and even though it's really hot, it's not dying.