itinerary
VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction
Wang, Hao, Murata, Eiki, Zhang, Lingfang, Sato, Ayako, Fukuda, So, Yin, Ziqi, Hu, Wentao, Nakao, Keisuke, Nakamura, Yusuke, Zwirner, Sebastian, Chen, Yi-Chia, Otomo, Hiroyuki, Ouchi, Hiroki, Kawahara, Daisuke
Recent advances in multimodal large language models (MLLMs) have significantly enhanced video understanding capabilities, opening new possibilities for practical applications. Yet current video benchmarks focus largely on indoor scenes or short-range outdoor activities, leaving the challenges associated with long-distance travel largely unexplored. Mastering extended geospatial-temporal trajectories is critical for next-generation MLLMs, underpinning real-world tasks such as embodied-AI planning and navigation. To bridge this gap, we present VIR-Bench, a novel benchmark consisting of 200 travel videos that frames itinerary reconstruction as a challenging task designed to evaluate and push forward MLLMs' geospatial-temporal intelligence. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, struggle to achieve high scores, underscoring the difficulty of handling videos that span extended spatial and temporal scales. Moreover, we conduct an in-depth case study in which we develop a prototype travel-planning agent that leverages the insights gained from VIR-Bench. The agent's markedly improved itinerary recommendations verify that our evaluation protocol not only benchmarks models effectively but also translates into concrete performance gains in user-facing applications.
- Consumer Products & Services > Travel (1.00)
- Transportation > Infrastructure & Services (0.93)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.88)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)
Iti-Validator: A Guardrail Framework for Validating and Correcting LLM-Generated Itineraries
Gadbail, Shravan, Desai, Masumi, Karlapalem, Kamalakar
The rapid advancement of Large Language Models (LLMs) has enabled them to generate complex, multi-step plans and itineraries. However, these generated plans often lack temporal and spatial consistency, particularly in scenarios involving physical travel constraints. This research aims to study the temporal performance of different LLMs and presents a validation framework that evaluates and improves the temporal consistency of LLM-generated travel itineraries. The system employs multiple state-of-the-art LLMs to generate travel plans and validates them against real-world flight duration constraints using the AeroDataBox API. This work contributes to the understanding of LLM capabilities in handling complex temporal reasoning tasks like itinerary generation and provides a framework to rectify any temporal inconsistencies like overlapping journeys or unrealistic transit times in the itineraries generated by LLMs before the itinerary is given to the user. Our experiments reveal that while current LLMs frequently produce temporally inconsistent itineraries, these can be systematically and reliably corrected using our framework, enabling their practical deployment in large-scale travel planning.
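A minimal sketch of the kind of temporal check this abstract describes, not the authors' implementation. The `Leg` class, the `find_issues` function, the 45-minute layover threshold, and the duration table are all hypothetical. The table holds made-up values standing in for the flight durations the paper retrieves from the AeroDataBox API, and all timestamps are assumed to be in UTC.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Leg:
    origin: str
    destination: str
    depart: datetime  # assumed UTC
    arrive: datetime  # assumed UTC


# Illustrative minimum flight durations (made-up values, not real data).
# A real validator would look these up from a flight-data service.
MIN_DURATION = {
    ("HYD", "DXB"): timedelta(hours=3, minutes=30),
    ("DXB", "NRT"): timedelta(hours=9, minutes=30),
}


def find_issues(legs, min_duration, min_layover=timedelta(minutes=45)):
    """Return (leg_index, issue_name) pairs for temporally inconsistent legs."""
    issues = []
    for i, leg in enumerate(legs):
        # A leg shorter than the known minimum for its route is unrealistic.
        floor = min_duration.get((leg.origin, leg.destination))
        if floor is not None and leg.arrive - leg.depart < floor:
            issues.append((i, "unrealistic_duration"))
        if i > 0:
            prev = legs[i - 1]
            if leg.depart < prev.arrive:
                # Departing before the previous leg has landed.
                issues.append((i, "overlap"))
            elif leg.depart - prev.arrive < min_layover:
                # Landed in time, but the connection is too tight.
                issues.append((i, "short_layover"))
    return issues


legs = [
    Leg("HYD", "DXB", datetime(2025, 1, 1, 8, 0), datetime(2025, 1, 1, 10, 0)),
    Leg("DXB", "NRT", datetime(2025, 1, 1, 9, 30), datetime(2025, 1, 1, 20, 0)),
]
print(find_issues(legs, MIN_DURATION))
```

In this example the first leg is flagged because it is shorter than its assumed minimum duration, and the second because it departs before the first one arrives. A correction step, which the abstract also mentions, would then shift or regenerate the flagged legs.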
- Asia > Middle East > UAE > Dubai Emirate > Dubai (0.05)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.05)
- Asia > India > Telangana > Hyderabad (0.05)
- (5 more...)
- Transportation > Air (1.00)
- Consumer Products & Services > Travel (1.00)
- Transportation > Passenger (0.98)
TripTide: A Benchmark for Adaptive Travel Planning under Disruptions
Karmakar, Priyanshu, Chaudhuri, Soumyabrata, Mallick, Shubhojit, Gupta, Manish, Jana, Abhik, Ghosh, Shreya
Recent efforts like TripCraft and TravelPlanner have advanced the use of Large Language Models (LLMs) for personalized, constraint-aware travel itinerary generation. Yet, real travel often faces disruptions. To address this, we present TripTide, the first benchmark evaluating LLMs' ability to revise itineraries under realistic disruptions. TripTide models key dimensions such as disruption severity and traveler tolerance, enabling nuanced assessment of LLM adaptability to events like flight cancellations, weather closures, or overbooked attractions. We conduct a threefold evaluation. First, we introduce automatic metrics including Preservation of Intent (how well the revised plan maintains feasibility and goals), Responsiveness (promptness and appropriateness of disruption handling), and Adaptability (semantic, spatial, and sequential divergence between original and revised plans). Second, we apply an LLM-as-a-judge approach to automatically assess revision quality. Third, we perform manual expert evaluation to verify whether revisions preserve semantic, spatial, sequential, and responsive aspects. Our experiments show that LLMs maintain strong sequential consistency and semantic stability, while spatial deviations are larger for shorter trips but decrease with longer ones, indicating that extended plans encourage better geographic coherence. However, disruption-handling ability declines as plan length increases, highlighting limits in LLM robustness. TripTide establishes a benchmark for evaluating adaptability, personalization, and resilience in LLM-based travel planning under real-world uncertainty.
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Asia > India (0.04)
- (9 more...)
- Transportation > Passenger (1.00)
- Consumer Products & Services > Travel (1.00)
TripScore: Benchmarking and rewarding real-world travel planning with fine-grained evaluation
Qu, Yincen, Xiao, Huan, Li, Feng, Li, Gregory, Zhou, Hui, Dai, Xiangying, Dai, Xiaoru
Travel planning is a valuable yet complex task that poses significant challenges even for advanced large language models (LLMs). While recent benchmarks have advanced in evaluating LLMs' planning capabilities, they often fall short in evaluating feasibility, reliability, and engagement of travel plans. We introduce a comprehensive benchmark for travel planning that unifies fine-grained criteria into a single reward, enabling direct comparison of plan quality and seamless integration with reinforcement learning (RL). Our evaluator achieves moderate agreement with travel-expert annotations (60.75%) and outperforms multiple LLM-as-judge baselines. We further release a large-scale dataset of 4,870 queries including 219 real-world, free-form requests for generalization to authentic user intent. Using this benchmark, we conduct extensive experiments across diverse methods and LLMs, including test-time computation, neuro-symbolic approaches, supervised fine-tuning, and RL via GRPO. Across base models, RL generally improves itinerary feasibility over prompt-only and supervised baselines, yielding higher unified reward scores.
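One simple way to fold several fine-grained criteria into a single scalar reward is a weighted average, sketched below. This is an assumption for illustration only: the abstract does not state how TripScore combines its criteria, and the `unified_reward` function, the weights, and the scores are hypothetical. Only the criterion names (feasibility, reliability, engagement) come from the abstract.

```python
def unified_reward(scores, weights):
    """Weighted average of per-criterion scores, each assumed to lie in [0, 1]."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * scores[name] for name in weights) / total_weight


# Hypothetical weights that favour feasibility over the other criteria.
weights = {"feasibility": 2.0, "reliability": 1.0, "engagement": 1.0}
plan_scores = {"feasibility": 1.0, "reliability": 0.5, "engagement": 0.5}

reward = unified_reward(plan_scores, weights)
print(reward)  # 0.75
```

A single scalar like this is what makes plans directly comparable and usable as an RL training signal, at the cost of hiding which criterion caused a low score.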
- Europe > Austria > Vienna (0.14)
- Asia > Singapore (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (7 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
I asked AI to plan my dream summer holiday. Here's how it turned out
The gothic arches of Santa María de la Asunción crown the hill above the stone harbour where fishermen land their catches of sardines and anchovies. The church in Castro Urdiales, a small seaside town of about 30,000 people in northern Spain, is more than 700 years old. It was the perfect holiday destination for Alan Smith and his family, though he had never heard of it - until he asked ChatGPT. This week a report from the travel association ABTA found an increasing number of people were turning to AI to help with their holidays, from suggesting destinations to planning itineraries once there.
- Europe > Spain (0.26)
- South America > Paraguay > Asunción > Asunción (0.25)
- North America > Central America (0.15)
- (20 more...)
- Leisure & Entertainment (1.00)
- Consumer Products & Services > Travel (0.68)
COMPASS: A Multi-Turn Benchmark for Tool-Mediated Planning & Preference Optimization
Qin, Tian, Bai, Felix, Hu, Ting-Yao, Vemulapalli, Raviteja, Koppula, Hema Swetha, Xu, Zhiyang, Jin, Bowen, Cemri, Mert, Lu, Jiarui, Wang, Zirui, Cao, Meng
Real-world large language model (LLM) agents must master strategic tool use and user preference optimization through multi-turn interactions to assist users with complex planning tasks. We introduce COMPASS (Constrained Optimization through Multi-turn Planning and Strategic Solutions), a benchmark that evaluates agents on realistic travel-planning scenarios. We cast travel planning as a constrained preference optimization problem, where agents must satisfy hard constraints while simultaneously optimizing soft user preferences. To support this, we build a realistic travel database covering transportation, accommodation, and ticketing for 20 U.S. National Parks, along with a comprehensive tool ecosystem that mirrors commercial booking platforms. Evaluating state-of-the-art models, we uncover two critical gaps: (i) an acceptable-optimal gap, where agents reliably meet constraints but fail to optimize preferences, and (ii) a plan-coordination gap, where performance collapses on multi-service (flight and hotel) coordination tasks, especially for open-source models. By grounding reasoning and planning in a practical, user-facing domain, COMPASS provides a benchmark that directly measures an agent's ability to optimize user preferences in realistic tasks, bridging theoretical advances with real-world impact.
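The framing in this abstract, hard constraints that must hold plus soft preferences to optimize, can be illustrated with a small sketch. Everything here is hypothetical: the `Option` class, the lodging data, and the choice of budget and parking as hard constraints with rating as the soft preference. It is not taken from the COMPASS database or tool set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    price: float
    rating: float
    has_parking: bool


def feasible(options, budget, need_parking):
    """Keep only options that satisfy every hard constraint."""
    return [
        o for o in options
        if o.price <= budget and (o.has_parking or not need_parking)
    ]


def best(options, budget, need_parking):
    """Among feasible options, pick the one with the highest rating."""
    candidates = feasible(options, budget, need_parking)
    return max(candidates, key=lambda o: o.rating) if candidates else None


# Made-up lodging options for illustration.
options = [
    Option("Lodge A", price=180, rating=4.8, has_parking=True),
    Option("Inn B", price=120, rating=4.1, has_parking=True),
    Option("Cabin C", price=110, rating=4.6, has_parking=False),
    Option("Motel D", price=140, rating=4.5, has_parking=True),
]

acceptable = feasible(options, budget=150, need_parking=True)
optimal = best(options, budget=150, need_parking=True)
print([o.name for o in acceptable], optimal.name)
```

Here both "Inn B" and "Motel D" meet the constraints, but only "Motel D" is optimal. An agent that returns "Inn B" produces an acceptable plan that is not the best one, which is the kind of acceptable-optimal gap the abstract reports.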
- North America > United States > Utah > Grand County (0.04)
- North America > United States > Virginia (0.04)
- Transportation > Passenger (1.00)
- Transportation > Air (1.00)
- Consumer Products & Services > Travel (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.92)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (0.90)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
Persuasive or Neutral? A Field Experiment on Generative AI in Online Travel Planning
Jirpongopas, Lynna, Lutz, Bernhard, Ebner, Jörg, Vahidov, Rustam, Neumann, Dirk
Generative AI (GenAI) offers new opportunities for customer support in online travel agencies, yet little is known about how its design influences user engagement, purchase behavior, and user experience. We report results from a randomized field experiment in online travel itinerary planning, comparing GenAI that expressed (A) positive enthusiasm, (B) neutral expression, and (C) no tone instructions (control). Users in group A wrote significantly longer prompts than those in groups B and C. At the same time, users in groups A and B were more likely to purchase subscriptions to the web service. We further analyze linguistic cues across experimental groups to explore differences in user experience and explain subscription purchases and affiliate link clicks based on these cues. Our findings provide implications for the design of persuasive and engaging GenAI interfaces in consumer-facing contexts and contribute to understanding how linguistic framing shapes user behavior in AI-mediated decision support.
- Europe > Germany > Baden-Württemberg > Freiburg (0.05)
- Europe > Ireland > Munster > County Kerry > Killarney (0.04)
- Europe > Switzerland > Basel-City > Basel (0.04)
- (16 more...)
- Research Report > Strength High (1.00)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Health & Medicine (1.00)
- Consumer Products & Services > Travel (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.71)
- (2 more...)
TripTailor: A Real-World Benchmark for Personalized Travel Planning
Shen, Yuanzhe, Wang, Kaimin, Lv, Changze, Zheng, Xiaoqing, Huang, Xuanjing
The continuous evolution and enhanced reasoning capabilities of large language models (LLMs) have elevated their role in complex tasks, notably in travel planning, where demand for personalized, high-quality itineraries is rising. However, current benchmarks often rely on unrealistic simulated data, failing to reflect the differences between LLM-generated and real-world itineraries. Existing evaluation metrics, which primarily emphasize constraints, fall short of providing a comprehensive assessment of the overall quality of travel plans. To address these limitations, we introduce TripTailor, a benchmark designed specifically for personalized travel planning in real-world scenarios. This dataset features an extensive collection of over 500,000 real-world points of interest (POIs) and nearly 4,000 diverse travel itineraries, complete with detailed information, providing a more authentic evaluation framework. Experiments show that fewer than 10% of the itineraries generated by the latest state-of-the-art LLMs achieve human-level performance. Moreover, we identify several critical challenges in travel planning, including the feasibility, rationality, and personalized customization of the proposed solutions. We hope that TripTailor will drive the development of travel planning agents capable of understanding and meeting user needs while generating practical itineraries. Our code and dataset are available at https://github.com/swxkfm/TripTailor
- Asia > China > Liaoning Province > Shenyang (0.04)
- Asia > China > Chongqing Province > Chongqing (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
- (7 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)
ChatGPT can plan your dream getaway--if you know how to ask
Planning a trip takes time, and often it's more of a hassle than you'd like. If you don't feel like spending hours researching, you can simply outsource the first draft of your holiday plans to ChatGPT. The chatbot suggests travel destinations, creates daily plans, compares means of transport, reminds you of charging devices, and even virtually packs your suitcase. But how reliable are these suggestions? And can it actually save you money?
Vaiage: A Multi-Agent Solution to Personalized Travel Planning
Liu, Binwen, Ge, Jiexi, Wang, Jiamin
Planning trips is a cognitively intensive task involving conflicting user preferences, dynamic external information, and multi-step temporal-spatial optimization. Traditional platforms often fall short - they provide static results, lack contextual adaptation, and fail to support real-time interaction or intent refinement. Our approach, Vaiage, addresses these challenges through a graph-structured multi-agent framework built around large language models (LLMs) that serve as both goal-conditioned recommenders and sequential planners. LLMs infer user intent, suggest personalized destinations and activities, and synthesize itineraries that align with contextual constraints such as budget, timing, group size, and weather. Through natural language interaction, structured tool use, and map-based feedback loops, Vaiage enables adaptive, explainable, and end-to-end travel planning grounded in both symbolic reasoning and conversational understanding. To evaluate Vaiage, we conducted human-in-the-loop experiments using rubric-based GPT-4 assessments and qualitative feedback. The full system achieved an average score of 8.5 out of 10, outperforming the no-strategy (7.2) and no-external-API (6.8) variants, particularly in feasibility. Qualitative analysis indicated that agent coordination - especially the Strategy and Information Agents - significantly improved itinerary quality by optimizing time use and integrating real-time context. These results demonstrate the effectiveness of combining LLM reasoning with symbolic agent coordination in open-ended, real-world planning tasks.
- Asia > China > Hong Kong (0.05)
- North America > United States > California > San Diego County > San Diego (0.05)
- North America > United States > California > Los Angeles County > Los Angeles (0.05)
- Asia > China > Shanghai > Shanghai (0.04)
- Health & Medicine > Consumer Health (1.00)
- Consumer Products & Services > Travel (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- (2 more...)