Multimodal AI


Multimodal AI for Body Fat Estimation: Computer Vision and Anthropometry with DEXA Benchmarks

Aldajani, Rayan

arXiv.org Artificial Intelligence

Tracking body fat percentage is essential for effective weight management, yet gold-standard methods such as DEXA scans remain expensive and inaccessible for most people. This study evaluates the feasibility of artificial intelligence (AI) models as low-cost alternatives using frontal body images and basic anthropometric data. The dataset consists of 535 samples: 253 cases with recorded anthropometric measurements (weight, height, neck, ankle, and wrist) and 282 images obtained via web scraping from Reddit posts with self-reported body fat percentages, including some reported as DEXA-derived by the original posters. Because no public datasets exist for computer-vision-based body fat estimation, this dataset was compiled specifically for this study. Two approaches were developed: (1) ResNet-based image models and (2) regression models using anthropometric measurements. A multimodal fusion framework is also outlined for future expansion once paired datasets become available. The image-based model achieved a Root Mean Square Error (RMSE) of 4.44% and a Coefficient of Determination (R^2) of 0.807. These findings demonstrate that AI-assisted models can offer accessible and low-cost body fat estimates, supporting future consumer applications in health and fitness.
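As a rough illustration of the image-based approach, here is a minimal sketch of a ResNet backbone adapted for body fat regression. This is hypothetical code, not the authors' implementation: the backbone choice (ResNet-50), input size, and loss are assumptions, and the paper's preprocessing and training details may differ.

```python
# Hypothetical sketch: ResNet backbone repurposed for body fat regression.
import torch
import torch.nn as nn
from torchvision import models

class BodyFatRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        # Pretrained ResNet-50; swap the 1000-way classifier head
        # for a single scalar output (predicted body fat percentage).
        self.backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 1)

    def forward(self, x):
        return self.backbone(x).squeeze(-1)  # shape (N,), body fat %

model = BodyFatRegressor()
criterion = nn.MSELoss()  # the reported RMSE is sqrt of this MSE
images = torch.randn(4, 3, 224, 224)              # dummy frontal-image batch
targets = torch.tensor([18.5, 24.0, 31.2, 12.7])  # dummy body fat labels (%)
loss = criterion(model(images), targets)
print(f"batch RMSE: {loss.sqrt().item():.2f}%")
```

The anthropometric branch could analogously be a small regressor over the five measurements (weight, height, neck, ankle, and wrist), with the proposed fusion framework concatenating both feature sets once paired image-measurement data become available.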


Towards deployment-centric multimodal AI beyond vision and language

Liu, Xianyuan, Zhang, Jiayang, Zhou, Shuo, van der Plas, Thijs L., Vijayaraghavan, Avish, Grishina, Anastasiia, Zhuang, Mengdie, Schofield, Daniel, Tomlinson, Christopher, Wang, Yuhan, Li, Ruizhe, van Zeeland, Louisa, Tabakhi, Sina, Demeocq, Cyndie, Li, Xiang, Das, Arunav, Timmerman, Orlando, Baldwin-McDonald, Thomas, Wu, Jinge, Bai, Peizhen, Sahili, Zahraa Al, Alwazzan, Omnia, Do, Thao N., Suvon, Mohammod N. I., Wang, Angeline, Cipolina-Kun, Lucia, Moretti, Luigi A., Farndale, Lucas, Jain, Nitisha, Efremova, Natalia, Ge, Yan, Varela, Marta, Lam, Hak-Keung, Celiktutan, Oya, Evans, Ben R., Coca-Castro, Alejandro, Wu, Honghan, Abdallah, Zahraa S., Chen, Chen, Danchev, Valentin, Tkachenko, Nataliya, Lu, Lei, Zhu, Tingting, Slabaugh, Gregory G., Moore, Roger K., Cheung, William K., Charlton, Peter H., Lu, Haiping

arXiv.org Artificial Intelligence

Multimodal artificial intelligence (AI) integrates diverse types of data via machine learning to improve understanding, prediction, and decision-making across disciplines such as healthcare, science, and engineering. However, most multimodal AI advances focus on models for vision and language data, while their deployability remains a key challenge. We advocate a deployment-centric workflow that incorporates deployment constraints early to reduce the likelihood of undeployable solutions, complementing data-centric and model-centric approaches. We also emphasise deeper integration across multiple levels of multimodality and multidisciplinary collaboration to significantly broaden the research scope beyond vision and language. To facilitate this approach, we identify common multimodal-AI-specific challenges shared across disciplines and examine three real-world use cases: pandemic response, self-driving car design, and climate change adaptation, drawing expertise from healthcare, social science, engineering, science, sustainability, and finance. By fostering multidisciplinary dialogue and open research practices, our community can accelerate deployment-centric development for broad societal impact.


Interview with Flávia Carvalhido: Responsible multimodal AI

AIHub

In this interview series, we're meeting some of the AAAI/SIGAI Doctoral Consortium participants to find out more about their research. In this latest interview, we hear from Flávia Carvalhido, who is a PhD student at the University of Porto. We find out about her work on responsible multimodal AI, what inspired her to study AI, and how she found the Doctoral Consortium experience. My PhD programme is in Informatics Engineering at the Faculty of Engineering of the University of Porto, where I also got both my Bachelor's and Master's in the same field. My thesis research project is focused on responsible multimodal AI, titled "Stress Testing of Image-Text Multimodal Models in Medical Image Report Generation", supervised by Professor Henrique Lopes Cardoso and Professor Vítor Cerqueira and developed in the LIACC research laboratory.


Multimodal AI for Gastrointestinal Diagnostics: Tackling VQA in MEDVQA-GI 2025

Gaihre, Sujata, Magar, Amir Thapa, Pokharel, Prasuna, Tiwari, Laxmi

arXiv.org Artificial Intelligence

This paper describes our approach to Subtask 1 of the ImageCLEFmed MEDVQA 2025 Challenge, which targets visual question answering (VQA) for gastrointestinal endoscopy. We adopt the Florence model, a large-scale multimodal foundation model, as the backbone of our VQA pipeline, pairing a powerful vision encoder with a text encoder to interpret endoscopic images and produce clinically relevant answers. To improve generalization, we apply domain-specific augmentations that preserve medical features while increasing training diversity. Experiments on the Kvasir dataset show that fine-tuning Florence yields accurate responses on the official challenge metrics. Our results highlight the potential of large multimodal models in medical VQA and provide a strong baseline for future work on explainability, robustness, and clinical integration. The code is publicly available at: https://github.com/TiwariLaxuu/VQA-Florence.git
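The abstract does not specify which Florence variant was used; as an assumed illustration, here is a minimal VQA inference sketch using the publicly released Florence-2 checkpoint on Hugging Face. The authors' exact model, prompts, and fine-tuning setup live in their linked repository and may differ.

```python
# Assumed sketch: VQA with a Florence-2 checkpoint via Hugging Face transformers.
# The paper's exact Florence variant and prompting may differ; see their repo.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"  # assumed public checkpoint
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("endoscopy_frame.jpg")               # hypothetical input frame
question = "How many polyps are visible in the image?"  # example clinical query

inputs = processor(text=question, images=image, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=64,
    )
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(answer)
```

In practice, fine-tuning for the challenge would pair such question prompts with gold answers from the endoscopy dataset and train with the model's standard sequence-to-sequence loss.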


An open-source training framework to advance multimodal AI

AIHub

Trying to model physical reality by assembling various modalities: the image shows a couple of oranges seen through the lens of multiple modalities, with each slice showing a different way one might perceive and understand this scene. The modalities from left to right represent surface normals (the color represents surface orientation), depth (distance to the camera; red near, blue far), RGB (the original image), segmentation (distinct objects and image regions), and edges (object or texture boundaries).

Large Language Models such as OpenAI's ChatGPT have already transformed the way many of us go about some of our daily tasks. These generative artificial intelligence chatbots are trained with language -- hundreds of terabytes of text 'scraped' from across the Internet -- and with billions of parameters. Looking ahead, many believe the 'engines' that drive generative artificial intelligence will be multimodal models that are not just trained on text but can also process various other modalities of information, including images, video, sound, and modalities from other domains such as biological or atmospheric data. Yet, until recently, training a single model to handle a wide range of modalities (inputs) and tasks (outputs) faced significant challenges.


OpenAI Poaches 3 Top Engineers From DeepMind

WIRED

OpenAI announced today it has hired three senior computer vision and machine learning engineers from rival Google DeepMind, all of whom will work in a newly opened OpenAI office in Zurich, Switzerland. OpenAI executives told staff in an internal memo on Tuesday that Lucas Beyer, Alexander Kolesnikov, and Xiaohua Zhai will be joining the company to work on multimodal AI, artificial intelligence models capable of performing tasks in different mediums ranging from images to audio. OpenAI has long been at the forefront of multimodal AI and released the first version of its text-to-image platform Dall-E in 2021. Its flagship chatbot ChatGPT, however, was initially only capable of interacting with text inputs. The company later added voice and image features as multimodal functionality became an increasingly important part of its product line and AI research.


The Download: Geoffrey Hinton's Nobel Prize, and multimodal AI

MIT Technology Review

Large language models can do jaw-dropping things. But nobody knows exactly why. Two years ago, Yuri Burda and Harri Edwards, researchers at OpenAI, were trying to find out what it would take to get a large language model to do basic arithmetic. The models memorized the sums they saw but failed to solve new ones. By accident, Burda and Edwards left some of their experiments running for days rather than hours.


Google Project Astra hands-on: Full of potential, but it's going to be a while

Engadget

At I/O 2024, Google's teaser for Project Astra gave us a glimpse at where AI assistants are going in the future. It's a multi-modal feature that combines the smarts of Gemini with the kind of image recognition abilities you get in Google Lens, as well as powerful natural language responses. However, while the promo video was slick, after getting to try it out in person, it's clear there's a long way to go before something like Astra lands on your phone. So here are three takeaways from our first experience with Google's next-gen AI. Currently, most people interact with digital assistants using their voice, so right away Astra's multi-modality (i.e., using sight and sound in addition to text/speech to communicate with an AI) is relatively novel.


Ray-Ban Meta smart glasses do the AI thing without a projector or subscription

Engadget

The Ray-Ban Meta smart glasses have been something of a pleasant surprise. They make videos, take photos, livestream and act as an adequate replacement for headphones, all while looking like a normal pair of sunglasses. However, everyone's been waiting for the addition of multimodal AI after early access testing began in January. What is multimodal AI? Simply put, it's a toolset that allows an AI assistant to process multiple types of information, including photos, videos, text and audio. It's an AI that can view and understand the world around you in real time.


Reports of the Workshops Held at the 2023 AAAI Conference on Artificial Intelligence

Interactive AI Magazine

The Workshop Program of the Association for the Advancement of Artificial Intelligence's 37th Conference on Artificial Intelligence (AAAI-23) was held in Washington, DC, USA on February 13-14, 2023. There were 32 workshops in the program: AI for Agriculture and Food Systems, AI for Behavior Change, AI for Credible Elections: A Call to Action with Trusted AI, AI for Energy Innovation, AI for Web Advertising, AI to Accelerate Science and Engineering, AI4EDU: AI for Education, Artificial Intelligence and Diplomacy, Artificial Intelligence for Cyber Security (AICS), Artificial Intelligence for Social Good (AI4SG), Artificial Intelligence Safety (SafeAI), Creative AI Across Modalities, Deep Learning on Graphs: Methods and Applications (DLG-AAAI'23), DEFACTIFY: Multimodal Fact-Checking and Hate Speech Detection, Deployable AI (DAI), DL-Hardware Co-Design for AI Acceleration, Energy Efficient Training and Inference of Transformer Based Models, Graphs and More Complex Structures for Learning and Reasoning (GCLR), Health Intelligence (W3PHIAI-23), Knowledge-Augmented Methods for Natural Language Processing, Modelling Uncertainty in the Financial World (MUFin'23), Multi-Agent Path Finding, Multimodal AI for Financial Forecasting (Muffin), Privacy-Preserving Artificial Intelligence, Recent Trends in Human-Centric AI, Reinforcement Learning Ready for Production, Scientific Document Understanding, Systems Neuroscience Approach to General Intelligence, Uncertainty Reasoning and Quantification in Decision Making (UDM'23), User-Centric Artificial Intelligence for Assistance in At-Home Tasks, and When Machine Learning Meets Dynamical Systems: Theory and Applications. This report contains summaries of the workshops, which were submitted by some, but not all, of the workshop chairs.

An increasing world population, coupled with finite arable land, changing diets, and the growing expense of agricultural inputs, is poised to stretch our agricultural systems to their limits. By the end of this century, the earth's population is projected to increase by 45% while available arable land decreases by 20%, along with changes in which crops that land can best support; this creates an urgent need to enhance agricultural productivity by 70% before 2050.