CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography

arXiv.org Artificial Intelligence

Large language models (LLMs) and multimodal large language models (MLLMs) have significantly advanced artificial intelligence. However, visual reasoning, that is, reasoning over both visual and textual inputs, remains underexplored. Recent reasoning models that accept image inputs, such as OpenAI o1 and Gemini 2.0 Flash Thinking, have opened up this capability. In this ongoing work, we focus specifically on photography-related tasks because a photo is a visual snapshot of the physical world in which the underlying physics (e.g., illumination and blur extent) interacts with the camera parameters. Reasoning from the visual information in a photo to identify these numerical camera settings requires MLLMs to have a deeper understanding of the underlying physics for precise visual comprehension, a challenging capability that is essential for practical applications such as photography assistant agents. We aim to evaluate MLLMs on their ability to distinguish visual differences related to numerical camera settings, extending a methodology previously proposed for vision-language models (VLMs). Our preliminary results demonstrate the importance of visual reasoning in photography-related tasks. Moreover, these results show that no single MLLM consistently dominates across all evaluation tasks, demonstrating ongoing challenges and opportunities in developing MLLMs with better visual reasoning.


Amazon's sneaky LEGO sale drops prices on some popular sets

Popular Science

I was never good at building with LEGO as a kid. My attention span was too short to follow the directions, and my imagination was too boring to make anything cool on my own. I've learned to enjoy them more as an adult, however, and that makes me excited about Amazon's current sale on popular LEGO sets. It's not an advertised sale, but the site has dropped the price on dozens of different kits, including some Star Wars options, which are rarely discounted. These are the original STEM toys, so grab them and have fun.


7 simple ways to protect your credit cards while traveling

FOX News

Travel expert Colleen Kelly shares the hottest travel destinations for this summer and provides tips for travelers planning a cruise. As you rush through busy terminals, juggling bags and boarding passes, your credit cards may be at risk, not just from pickpockets, but from digital thieves using high-tech tools like RFID (radio-frequency identification) skimmers. While today's chip-enabled cards are more secure than old magnetic stripes, it's still wise to take extra precautions, especially in crowded places like airports. Here's how to keep your cards protected while traveling.


Improving LLM Interpretability and Performance via Guided Embedding Refinement for Sequential Recommendation

arXiv.org Artificial Intelligence

The rapid development of Large Language Models (LLMs) offers growing opportunities to further improve sequential recommendation systems. Yet for some practitioners, integrating LLMs into their existing base recommendation systems raises questions about model interpretability, transparency, and related safety. To partly alleviate the challenges these questions pose, we propose guided embedding refinement, a method that uses LLMs in a guided and interpretable way to enhance the embeddings associated with the base recommendation system. Instead of directly using LLMs as the backbone of sequential recommendation systems, we utilize them as auxiliary tools to emulate the sales logic of recommendation and generate guided embeddings that capture domain-relevant semantic information on interpretable attributes. Benefiting from the strong generalization capabilities of the guided embedding, we construct the refined embedding from the guided embedding and a reduced-dimension version of the base embedding. We then integrate the refined embedding into the recommendation module for training and inference. A range of numerical experiments demonstrates that the guided embedding is adaptable to various existing base embedding models and generalizes well across different recommendation tasks. The numerical results show that the refined embedding not only improves recommendation performance, achieving approximately $10\%$ to $50\%$ gains in Mean Reciprocal Rank (MRR), Recall rate, and Normalized Discounted Cumulative Gain (NDCG), but also enhances interpretability, as evidenced by case studies.
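
For readers unfamiliar with the three ranking metrics named in this abstract, the following is a minimal sketch of how they are commonly computed when each test case has exactly one held-out target item. That single-target setup, the cutoff `k`, and all function names are assumptions made for illustration; the abstract does not describe the paper's evaluation protocol.

```python
import math

def reciprocal_rank(ranked, target):
    """Return 1/rank of the target in the ranked list, or 0.0 if absent."""
    for position, item in enumerate(ranked, start=1):
        if item == target:
            return 1.0 / position
    return 0.0

def recall_at_k(ranked, target, k):
    """With one relevant item, Recall@k is 1.0 if it is in the top k, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0

def ndcg_at_k(ranked, target, k):
    """With one relevant item the ideal DCG is 1, so NDCG@k = 1/log2(rank + 1)."""
    for position, item in enumerate(ranked[:k], start=1):
        if item == target:
            return 1.0 / math.log2(position + 1)
    return 0.0

def evaluate(rankings, targets, k):
    """Average the three metrics over all (ranking, target) test cases."""
    n = len(targets)
    pairs = list(zip(rankings, targets))
    return {
        "MRR": sum(reciprocal_rank(r, t) for r, t in pairs) / n,
        "Recall@k": sum(recall_at_k(r, t, k) for r, t in pairs) / n,
        "NDCG@k": sum(ndcg_at_k(r, t, k) for r, t in pairs) / n,
    }

# Hypothetical toy data: two users, three candidate items each.
scores = evaluate([["a", "b", "c"], ["c", "a", "b"]], ["b", "c"], k=2)
```

A relative gain such as the 10% to 50% quoted above would then be the ratio of a metric under the refined embedding to the same metric under the base embedding, minus one.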


GraphicBench: A Planning Benchmark for Graphic Design with Language Agents

arXiv.org Artificial Intelligence

Large Language Model (LLM)-powered agents have unlocked new possibilities for automating human tasks. While prior work has focused on well-defined tasks with specified goals, the capabilities of agents in creative design tasks with open-ended goals remain underexplored. We introduce GraphicBench, a new planning benchmark for graphic design that covers 1,079 user queries and input images across four design types. We further present GraphicTown, an LLM agent framework with three design experts and 46 actions (tools) to choose from for executing each step of the planned workflows in web environments. Experiments with six LLMs demonstrate their ability to generate workflows that integrate both explicit design constraints from user queries and implicit commonsense constraints. However, these workflows often do not lead to successful execution outcomes, primarily due to challenges in: (1) reasoning about spatial relationships, (2) coordinating global dependencies across experts, and (3) retrieving the most appropriate action per step. We envision GraphicBench as a challenging yet valuable testbed for advancing LLM-agent planning and execution in creative design tasks.


RAID: An In-Training Defense against Attribute Inference Attacks in Recommender Systems

arXiv.org Artificial Intelligence

In various networks and mobile applications, users are highly susceptible to attribute inference attacks, which are particularly prevalent in recommender systems. Attackers exploit partially exposed user profiles in recommendation models, such as user embeddings, to infer private attributes of target users, such as gender and political views. The goal of defenders is to mitigate the effectiveness of these attacks while maintaining recommendation performance. Most existing defense methods, such as differential privacy and attribute unlearning, focus on post-training settings, which limits their ability to use training data to preserve recommendation performance. Although adversarial training extends defenses to in-training settings, it often struggles to converge because the training process is unstable. In this paper, we propose RAID, an in-training defense method against attribute inference attacks in recommender systems. In addition to the recommendation objective, we define a defensive objective to ensure that the distribution of protected attributes becomes independent of class labels, making users indistinguishable to attribute inference attacks. Specifically, this defensive objective aims to solve a constrained Wasserstein barycenter problem to identify the centroid distribution that makes the attribute indistinguishable while complying with recommendation performance constraints. To optimize our proposed objective, we use optimal transport to align users with the centroid distribution. We conduct extensive experiments on four real-world datasets to evaluate RAID. The experimental results validate the effectiveness of RAID and demonstrate its significant superiority over existing methods in multiple aspects.


Visual moral inference and communication

arXiv.org Artificial Intelligence

Humans can make moral inferences from multiple sources of input. In contrast, automated moral inference in artificial intelligence typically relies on language models with textual input. However, morality is conveyed through modalities beyond language. We present a computational framework that supports moral inference from natural images, demonstrated in two related tasks: 1) inferring human moral judgment toward visual images and 2) analyzing patterns in moral content communicated via images from public news. We find that models based on text alone cannot capture the fine-grained human moral judgment toward visual stimuli, but language-vision fusion models offer better precision in visual moral inference. Furthermore, applications of our framework to news data reveal implicit biases in news categories and geopolitical discussions. Our work creates avenues for automating visual moral inference and discovering patterns of visual moral communication in public media.


Assessing how hyperparameters impact Large Language Models' sarcasm detection performance

arXiv.org Artificial Intelligence

Sarcasm detection is challenging for both humans and machines. This work explores how model characteristics impact sarcasm detection in OpenAI's GPT and Meta's Llama-2 models, chosen for their strong natural language understanding and popularity. We evaluate fine-tuned and zero-shot models across various sizes, releases, and hyperparameters. Experiments were conducted on the political and balanced (pol-bal) portion of the popular Self-Annotated Reddit Corpus (SARC2.0) sarcasm dataset. Fine-tuned performance improves monotonically with model size within a model family, while hyperparameter tuning also impacts performance. In the fine-tuning scenario, full-precision Llama-2-13b achieves state-of-the-art accuracy and $F_1$-score, both measured at 0.83, comparable to average human performance. In the zero-shot setting, one GPT-4 model achieves performance competitive with prior attempts, yielding an accuracy of 0.70 and an $F_1$-score of 0.75. Furthermore, a model's performance may increase or decline with each release, highlighting the need to reassess performance after each release.
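
The two scores reported in this abstract, accuracy and $F_1$, can be illustrated with a minimal sketch for binary labels. It assumes 1 means "sarcastic" and 0 means "not sarcastic" and treats the sarcastic class as positive; the function name and toy labels are hypothetical, and the abstract does not state how the paper averages $F_1$.

```python
def accuracy_and_f1(y_true, y_pred):
    """Compute accuracy and binary F1 with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1

# Hypothetical toy labels: three sarcastic comments, one not.
acc, f1 = accuracy_and_f1([1, 1, 1, 0], [1, 1, 0, 0])
```

Because $F_1$ ignores true negatives, it can sit above or below accuracy on the same predictions, which is why both numbers are usually reported together.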


Fox News AI Newsletter: Nvidia announces plans to make AI supercomputers in US

FOX News

Tech expert Kurt Knutsson discusses how robots can milk, feed and clean cows on dairy farms, boosting efficiency and comfort. Jensen Huang, co-founder and CEO of Nvidia Corp., gives a talk in Taipei, Taiwan. MADE IN AMERICA: Nvidia on Monday announced plans to manufacture its artificial intelligence supercomputers entirely in the U.S. for the first time. RIDEABLE 4-LEGGED ROBOT: Kawasaki Heavy Industries has introduced something that feels straight out of a video game: CORLEO, a hydrogen-powered, four-legged robot prototype designed to be ridden by humans. JOB-KILLER ROBOT: This semi-humanoid robot combines advanced manipulation capabilities with intelligent delivery features, making it a significant innovation in the service robotics sector.


Windows 11's Snipping Tool tests instant on-screen text extraction feature

PCWorld

According to yesterday's Windows Insider blog post, Microsoft is currently testing a new version of the Snipping Tool in Windows 11, which is now available to Windows Insiders on both Canary and Dev Channels. This new version (11.2503.27.0) is getting built-in Text Extraction. Text Extraction is a feature that uses OCR to "extract" text from an image and convert it into, well, text. Instead of typing up entire paragraphs or pages by hand, you can simply let the Snipping Tool handle it--and it's much faster than doing it by hand. The Snipping Tool has actually had Text Extraction for a while now, but you had to first snap a screenshot and then open the screenshot for editing within the Snipping Tool before you could access the Text Extraction feature.