Goto

Collaborating Authors

 Pacific Ocean


Typhoon Intensity Prediction with Vision Transformer

arXiv.org Artificial Intelligence

Predicting typhoon intensity accurately across space and time is crucial for issuing timely disaster warnings and facilitating emergency response. This has vast potential for minimizing life losses and property damages as well as reducing economic and environmental impacts. Leveraging satellite imagery for scenario analysis is effective but also introduces additional challenges due to the complex relations among clouds and the highly dynamic context. Existing deep learning methods in this domain rely on convolutional neural networks (CNNs), which suffer from limited per-layer receptive fields. This limitation hinders their ability to capture long-range dependencies and global contextual knowledge during inference. In response, we introduce a novel approach, namely "Typhoon Intensity Transformer" (Tint), which leverages self-attention mechanisms with global receptive fields per layer. Tint adopts a sequence-to-sequence feature representation learning perspective. It begins by cutting a given satellite image into a sequence of patches and recursively employs self-attention operations to extract both local and global contextual relations between all patch pairs simultaneously, thereby enhancing per-patch feature representation learning.


iTransformer: Inverted Transformers Are Effective for Time Series Forecasting

arXiv.org Artificial Intelligence

The recent boom of linear forecasting models questions the ongoing passion for architectural modifications of Transformer-based forecasters. These forecasters leverage Transformers to model the global dependencies over temporal tokens of time series, with each token formed by multiple variates of the same timestamp. However, Transformers are challenged in forecasting series with larger lookback windows due to performance degradation and computation explosion. Besides, the embedding for each temporal token fuses multiple variates that represent potential delayed events and distinct physical measurements, which may fail in learning variate-centric representations and result in meaningless attention maps. In this work, we reflect on the competent duties of Transformer components and repurpose the Transformer architecture without any modification to the basic components. We propose iTransformer that simply applies the attention and feed-forward network on the inverted dimensions. Specifically, the time points of individual series are embedded into variate tokens which are utilized by the attention mechanism to capture multivariate correlations; meanwhile, the feed-forward network is applied for each variate token to learn nonlinear representations. The iTransformer model achieves state-of-the-art on challenging real-world datasets, which further empowers the Transformer family with promoted performance, generalization ability across different variates, and better utilization of arbitrary lookback windows, making it a nice alternative as the fundamental backbone of time series forecasting.


Reconstructing Historical Climate Fields With Deep Learning

arXiv.org Artificial Intelligence

Historical records of climate fields are often sparse due to missing measurements, especially before the introduction of large-scale satellite missions. Several statistical and model-based methods have been introduced to fill gaps and reconstruct historical records. Here, we employ a recently introduced deep-learning approach based on Fourier convolutions, trained on numerical climate model output, to reconstruct historical climate fields. Using this approach we are able to realistically reconstruct large and irregular areas of missing data, as well as reconstruct known historical events such as strong El Ni\~no and La Ni\~na with very little given information. Our method outperforms the widely used statistical kriging method as well as other recent machine learning approaches. The model generalizes to higher resolutions than the ones it was trained on and can be used on a variety of climate fields. Moreover, it allows inpainting of masks never seen before during the model training.


The Case for Scalable, Data-Driven Theory: A Paradigm for Scientific Progress in NLP

arXiv.org Artificial Intelligence

I propose a paradigm for scientific progress in NLP centered around developing scalable, data-driven theories of linguistic structure. The idea is to collect data in tightly scoped, carefully defined ways which allow for exhaustive annotation of behavioral phenomena of interest, and then use machine learning to construct explanatory theories of these phenomena which can form building blocks for intelligible AI systems. After laying some conceptual groundwork, I describe several investigations into data-driven theories of shallow semantic structure using Question-Answer driven Semantic Role Labeling (QA-SRL), a schema for annotating verbal predicate-argument relations using highly constrained question-answer pairs. While this only scratches the surface of the complex language behaviors of interest in AI, I outline principles for data collection and theoretical modeling which can inform future scientific progress. This note summarizes and draws heavily on my PhD thesis.


Knowledge Pursuit Prompting for Zero-Shot Multimodal Synthesis

arXiv.org Artificial Intelligence

Hallucinations and unfaithful synthesis due to inaccurate prompts with insufficient semantic details are widely observed in multimodal generative models. A prevalent strategy to align multiple modalities is to fine-tune the generator with a large number of annotated text-image pairs. However, such a procedure is labor-consuming and resource-draining. The key question we ask is: can we enhance the quality and faithfulness of text-driven generative models beyond extensive text-image pair annotations? To address this question, we propose Knowledge Pursuit Prompting (KPP), a zero-shot framework that iteratively incorporates external knowledge to help generators produce reliable visual content. Instead of training generators to handle generic prompts, KPP employs a recursive knowledge query process to gather informative external facts from the knowledge base, instructs a language model to compress the acquired knowledge for prompt refinement, and utilizes text-driven generators for visual synthesis. The entire process is zero-shot, without accessing the architectures and parameters of generative models. We evaluate the framework across multiple text-driven generative tasks (image, 3D rendering, and video) on datasets of different domains. We further demonstrate the extensibility and adaptability of KPP through varying foundation model bases and instructions. Our results show that KPP is capable of generating faithful and semantically rich content across diverse visual domains, offering a promising solution to improve multimodal generative models.


Hierarchical Joint Graph Learning and Multivariate Time Series Forecasting

arXiv.org Artificial Intelligence

Multivariate time series is prevalent in many scientific and industrial domains. Modeling multivariate signals is challenging due to their long-range temporal dependencies and intricate interactions--both direct and indirect. To confront these complexities, we introduce a method of representing multivariate signals as nodes in a graph with edges indicating interdependency between them. Specifically, we leverage graph neural networks (GNN) and attention mechanisms to efficiently learn the underlying relationships within the time series data. Moreover, we suggest employing hierarchical signal decompositions running over the graphs to capture multiple spatial dependencies. The effectiveness of our proposed model is evaluated across various real-world benchmark datasets designed for long-term forecasting tasks. The results consistently showcase the superiority of our model, achieving an average 23\% reduction in mean squared error (MSE) compared to existing models.


Reinforcement Replaces Supervision: Query focused Summarization using Deep Reinforcement Learning

arXiv.org Artificial Intelligence

Query-focused Summarization (QfS) deals with systems that generate summaries from document(s) based on a query. Motivated by the insight that Reinforcement Learning (RL) provides a generalization to Supervised Learning (SL) for Natural Language Generation, and thereby performs better (empirically) than SL, we use an RL-based approach for this task of QfS. Additionally, we also resolve the conflict of employing RL in Transformers with Teacher Forcing. We develop multiple Policy Gradient networks, trained on various reward signals: ROUGE, BLEU, and Semantic Similarity, which lead to a 10-point improvement over the State-of-the-Art approach on the ROUGE-L metric for a benchmark dataset (ELI5). We also show performance of our approach in zero-shot setting for another benchmark dataset (DebatePedia) -- our approach leads to results comparable to baselines, which were specifically trained on DebatePedia. To aid the RL training, we propose a better semantic similarity reward, enabled by a novel Passage Embedding scheme developed using Cluster Hypothesis. Lastly, we contribute a gold-standard test dataset to further research in QfS and Long-form Question Answering (LfQA).


HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination & Visual Illusion in Large Vision-Language Models

arXiv.org Artificial Intelligence

We introduce HallusionBench, a comprehensive benchmark designed for the evaluation of image-context reasoning. This benchmark presents significant challenges to advanced large visual-language models (LVLMs), such as GPT-4V(Vision) and LLaVA-1.5, by emphasizing nuanced understanding and interpretation of visual data. The benchmark comprises 346 images paired with 1129 questions, all meticulously crafted by human experts. We introduce a novel structure for these visual questions designed to establish control groups. This structure enables us to conduct a quantitative analysis of the models' response tendencies, logical consistency, and various failure modes. In our evaluation on HallusionBench, we benchmarked 13 different models, highlighting a 31.42% question-pair accuracy achieved by the state-of-the-art GPT-4V. Notably, all other evaluated models achieve accuracy below 16%. Moreover, our analysis not only highlights the observed failure modes, including language hallucination and visual illusion, but also deepens an understanding of these pitfalls. Our comprehensive case studies within HallusionBench shed light on the challenges of hallucination and illusion in LVLMs. Based on these insights, we suggest potential pathways for their future improvement. The benchmark and codebase can be accessed at https://github.com/tianyi-lab/HallusionBench.


Evaluating Optimal Reference Translations

arXiv.org Artificial Intelligence

Machine translation (MT) is routinely evaluated using various segment-level similarity metrics against one or more reference translations. At the same time, reference translations acquired in the standard way are often criticized for their flaws of various types. For several high-resourced language pairs, MT quality reaches levels comparable to the quality of the reference translation (Freitag et al. 2022; Hassan et al. 2018) and sometimes MT even significantly surpasses humans in a particular evaluation setting (Popel et al. 2020). Given this, one could conclude that state-of-the-art MT has reached the point where reference-based evaluation is no longer reliable and we have to resort to other methods (such as targeted expert evaluation of particular outputs), even if they are costly, subjective and possibly impossible to automate. The narrow goal of the presented work is to allow for an "extension of the expiry date" for reference-based evaluation methods. In a broader perspective, we want to formulate a methodology for creating reference translations which avoid the often-observed deficiencies of "standard" or "professional" reference translations, be it multiple interfering phenomena, inappropriate expressions, ignorance of topic-focus articulation (information structure) or other abundant shortcomings in the translation, indicating their authors' insensitivity to the topic itself, but above all to the source and target language. To this end, we introduce so-called optimal reference translations (ORT), which are intended to represent optimal (ideal or excellent) human translations (should they be the subject of a translation quality evaluation).


The Chosen One: Consistent Characters in Text-to-Image Diffusion Models

arXiv.org Artificial Intelligence

Recent advances in text-to-image generation models have unlocked vast potential for visual creativity. However, these models struggle with generation of consistent characters, a crucial aspect for numerous real-world applications such as story visualization, game development asset design, advertising, and more. Current methods typically rely on multiple pre-existing images of the target character or involve labor-intensive manual processes. In this work, we propose a fully automated solution for consistent character generation, with the sole input being a text prompt. We introduce an iterative procedure that, at each stage, identifies a coherent set of images sharing a similar identity and extracts a more consistent identity from this set. Our quantitative analysis demonstrates that our method strikes a better balance between prompt alignment and identity consistency compared to the baseline methods, and these findings are reinforced by a user study. To conclude, we showcase several practical applications of our approach. Project page is available at https://omriavrahami.com/the-chosen-one