Deep Learning-Driven Multimodal Detection and Movement Analysis of Objects in Culinary

Tahoshin Alam Ishat and Mohammad Abdul Qayum

arXiv.org Artificial Intelligence 

Abstract--This research investigates the feasibility of an intelligent, multi-modal AI system that interprets visual, audio, and motion-based data to analyze and comprehend cooking recipes. The system integrates object segmentation, hand-motion classification, and audio-to-text transcription with natural language processing to create a comprehensive pipeline that imitates human-level understanding of kitchen tasks and recipes. The early stages of the project involved experimenting with pre-made datasets, particularly the COCO dataset for object segmentation, which proved suboptimal for the project's use case. To overcome this, a domain-specific dataset was curated by collecting and annotating over 7,000 kitchen-related images, later augmented to 17,000 images. Several YOLOv8 segmentation models were trained on this dataset to detect 16 essential kitchen objects. Additionally, short-duration videos capturing cooking actions were collected and processed with MediaPipe to extract hand, elbow, and shoulder keypoints, which were used to train an LSTM-based model for hand-action classification. The pipeline further incorporates Whisper, an audio-to-text transcription model, and leverages a large language model, TinyLlama, to generate structured cooking recipes from the multi-modal inputs.

A. Background and motivation

In the era of computer vision and automation, artificial intelligence and machines are permeating every crucial task of our day-to-day lives.
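The keypoint-to-LSTM stage summarized above can be sketched as follows. This is a minimal illustration of turning per-frame keypoint vectors into fixed-length sequences of shape (batch, time, features) for an LSTM classifier; the 23-keypoint layout (21 hand landmarks plus elbow and shoulder), the window length, and the stride are assumptions for the sketch, not values from the paper:

```python
import numpy as np

# Hypothetical feature layout: 21 hand landmarks plus elbow and shoulder
# keypoints, each contributing (x, y) coordinates per frame.
N_KEYPOINTS = 23
FEATS_PER_FRAME = N_KEYPOINTS * 2  # 46 features per frame

def frames_to_windows(frames, seq_len=30, stride=15):
    """Slice per-frame keypoint vectors into fixed-length overlapping
    windows shaped (batch, time, features), the usual LSTM input layout."""
    frames = np.asarray(frames, dtype=np.float32)
    windows = [frames[i:i + seq_len]
               for i in range(0, len(frames) - seq_len + 1, stride)]
    if not windows:
        return np.empty((0, seq_len, frames.shape[1]), dtype=np.float32)
    return np.stack(windows)

# Example: a 90-frame clip yields 5 overlapping 30-frame windows.
clip = np.random.rand(90, FEATS_PER_FRAME)
batch = frames_to_windows(clip)
print(batch.shape)  # (5, 30, 46)
```

Each window would then be paired with an action label for supervised training of the LSTM; overlapping windows are a common, inexpensive way to augment short clips.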