 Atlantic Ocean


'I find them quite magical': the UK's obsession with weather apps

The Guardian

Several times a day, Francesca Simon, the author of the Horrid Henry children's books, gets out her phone to check the weather – not just for where she is, but where friends and family live, where she has been on holiday, where she was brought up. "I find them quite magical," she said. With about 10 locations logged, her friends make fun of her "weather porn" habit. This week, Simon discovered she shared a weather app fixation with Queen Camilla when the pair discussed a miserable summer's day at a charity event. "[Camilla] said everybody teases her … so we were laughing at our mutual obsession," Simon said. It is an obsession shared by millions. If you are going on holiday, planning a summer barbecue, worrying about your garden or suffering from hay fever, you are likely to check an app at least daily for the latest forecast. The apps give much more localised and detailed information than traditional weather forecasts, including wind speeds and the percentage chance of rain, in ...


Russia is building ground-based kamikaze robots out of old hoverboards

New Scientist

A Russian group is cobbling together hoverboards, a form of personal transport, to create four-wheeled robots capable of carrying out kamikaze attacks, moving supplies or laying a smokescreen. Both sides in the Russia-Ukraine war are using improvised aerial drones by the thousand, while in the Black Sea, Ukraine has deployed an armada of uncrewed vessels developed from Jet Skis and speedboats. Both sides are also developing cheap ground-based robots, and Russia's latest effort is an extreme example.


BenthicNet: A global compilation of seafloor images for deep learning applications

arXiv.org Artificial Intelligence

Advances in underwater imaging enable the collection of extensive seafloor image datasets that are necessary for monitoring important benthic ecosystems. The ability to collect seafloor imagery has outpaced our capacity to analyze it, hindering expedient mobilization of this crucial environmental information. Recent machine learning approaches provide opportunities to increase the efficiency with which seafloor image datasets are analyzed, yet large and consistent datasets necessary to support development of such approaches are scarce. Here we present BenthicNet: a global compilation of seafloor imagery designed to support the training and evaluation of large-scale image recognition models. An initial set of over 11.4 million images was collected and curated to represent a diversity of seafloor environments using a representative subset of 1.3 million images. These are accompanied by 2.6 million annotations translated to the CATAMI scheme, which span 190,000 of the images. A large deep learning model was trained on this compilation and preliminary results suggest it has utility for automating large and small-scale image analysis tasks. The compilation and model are made openly available for use by the scientific community at https://doi.org/10.20383/103.0614.


Constructing Benchmarks and Interventions for Combating Hallucinations in LLMs

arXiv.org Artificial Intelligence

Large language models (LLMs) are prone to hallucinations, which sparked a widespread effort to detect and prevent them. Recent work attempts to mitigate hallucinations by intervening in the model's generation, typically computing representative vectors of hallucinations vs. grounded generations, for steering the model's hidden states away from a hallucinatory state. However, common studies employ different setups and do not properly separate different possible causes of hallucinations, making interventions misguided. In this work, we introduce a method for categorizing examples based on the model's prior knowledge, named WACK. We construct WACK benchmarks that support interventions in two settings: open-book and closed-book question answering. Using the benchmarks, we perform an extensive investigation of the effect of different choices for intervention, such as the intervened components, and how often and how strongly to intervene. We find that intervention success varies depending on the component, with the attention blocks performing well and the residual stream proving detrimental to language modeling capabilities. We also show that interventions can benefit from representative vectors collected before, rather than after, a hallucination occurs. Finally, we introduce a new dynamic intervention, which intervenes only if needed, and thus is more robust than standard static interventions.
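The steering idea in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the difference-of-means direction, the function names, the `alpha` strength and the projection-based trigger for the "dynamic" variant are all assumptions made for illustration, and hidden states are plain Python lists rather than real model activations.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(grounded: List[Vector], hallucinated: List[Vector]) -> Vector:
    """Assumed direction: mean grounded state minus mean hallucinated state."""
    g = mean_vector(grounded)
    h = mean_vector(hallucinated)
    return [gi - hi for gi, hi in zip(g, h)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def static_intervention(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Always shift the hidden state along the steering direction."""
    return [x + alpha * d for x, d in zip(hidden, direction)]


def dynamic_intervention(hidden: Vector, direction: Vector, alpha: float,
                         threshold: float = 0.0) -> Vector:
    """Shift only when the state projects below the threshold (hypothetical trigger)."""
    if dot(hidden, direction) < threshold:
        return static_intervention(hidden, direction, alpha)
    return list(hidden)
```

In this toy setting the dynamic variant leaves a state untouched when it already lies on the grounded side of the direction, which is one way to read "intervenes only if needed".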


Induction Heads as an Essential Mechanism for Pattern Matching in In-context Learning

arXiv.org Artificial Intelligence

Large language models have shown a remarkable ability to learn and perform complex tasks through in-context learning (ICL) (Brown et al., 2020; Touvron et al., 2023b). In ICL, the model receives a demonstration context and a query question as a prompt for prediction. Unlike supervised learning, ICL utilises the pretrained model's capabilities to recognise and replicate patterns within the demonstration context, thereby enabling accurate predictions for the query without the use of gradient updates. As a significant milestone in this area, Elhage et al. (2021) demonstrated the existence of induction heads in Transformer LMs. These heads scan the context for previous instances of the current token using a prefix matching mechanism, which identifies if and where a token has appeared before. If a matching token is found, the head employs a copying mechanism to increase the probability of the subsequent token, facilitating exact or approximate repetition of sequences and embodying ...
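The prefix matching and copying behaviour attributed to induction heads can be mimicked on a plain token list. This is a toy sketch only: a real induction head works through attention weights over learned representations, and the function name `induction_predict` and the vote-counting scheme are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Sequence


def induction_predict(tokens: Sequence[str]) -> Dict[str, float]:
    """Toy induction rule: find earlier copies of the last token, copy what followed."""
    if not tokens:
        return {}
    current = tokens[-1]
    followers = Counter()
    # Prefix matching: inspect every earlier position that holds the current token.
    for i in range(len(tokens) - 1):
        if tokens[i] == current:
            # Copying: vote for the token that came right after the match.
            followers[tokens[i + 1]] += 1
    total = sum(followers.values())
    if total == 0:
        return {}  # the current token never appeared before
    return {tok: count / total for tok, count in followers.items()}
```

For the sequence A B C A B D A, both earlier occurrences of A were followed by B, so the sketch puts all probability on B.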


Fine-grained, Multi-dimensional Summarization Evaluation with LLMs

arXiv.org Artificial Intelligence

Automated evaluation is crucial for streamlining text summarization benchmarking and model development, given the costly and time-consuming nature of human evaluation. Traditional methods like ROUGE do not correlate well with human judgment, while recently proposed LLM-based metrics provide only summary-level assessment using Likert-scale scores. This limits deeper model analysis, e.g., we can only assign one hallucination score at the summary level, while at the sentence level, we can count sentences containing hallucinations. To remedy those limitations, we propose FineSurE, a fine-grained evaluator specifically tailored for the summarization task using large language models (LLMs). It also employs completeness and conciseness criteria, in addition to faithfulness, enabling multi-dimensional assessment. We compare various open-source and proprietary LLMs as backbones for FineSurE. In addition, we conduct extensive benchmarking of FineSurE against SOTA methods including NLI-, QA-, and LLM-based methods, showing improved performance especially on the completeness and conciseness dimensions. The code is available at https://github.com/DISL-Lab/FineSurE-ACL24.
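To make the sentence-level idea concrete, the sketch below turns per-sentence and per-keyfact judgments into three percentage scores. The abstract does not give the formulas, so these definitions, the input format (one error flag per summary sentence, and for each keyfact the set of summary sentence indices that express it) and the function names are assumptions; in practice an LLM would supply the judgments.

```python
from typing import List, Set


def faithfulness(sentence_has_error: List[bool]) -> float:
    """Share of summary sentences judged free of hallucination."""
    if not sentence_has_error:
        raise ValueError("summary has no sentences")
    return 1.0 - sum(sentence_has_error) / len(sentence_has_error)


def completeness(keyfact_to_sentences: List[Set[int]]) -> float:
    """Share of keyfacts expressed by at least one summary sentence."""
    if not keyfact_to_sentences:
        raise ValueError("no keyfacts given")
    covered = sum(1 for sentences in keyfact_to_sentences if sentences)
    return covered / len(keyfact_to_sentences)


def conciseness(keyfact_to_sentences: List[Set[int]], num_sentences: int) -> float:
    """Share of summary sentences that express at least one keyfact."""
    if num_sentences <= 0:
        raise ValueError("summary has no sentences")
    used = set().union(*keyfact_to_sentences)
    return len(used) / num_sentences
```

With four summary sentences of which one is flagged, the faithfulness score is 0.75, which carries more detail than a single summary-level Likert rating.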


KG-FPQ: Evaluating Factuality Hallucination in LLMs with Knowledge Graph-based False Premise Questions

arXiv.org Artificial Intelligence

Recent studies have demonstrated that large language models (LLMs) are susceptible to being misled by false premise questions (FPQs), leading to errors in factual knowledge, known as factuality hallucination. Existing benchmarks that assess this vulnerability primarily rely on manual construction, resulting in limited scale and lack of scalability. In this work, we introduce an automated, scalable pipeline to create FPQs based on knowledge graphs (KGs). The first step is modifying true triplets extracted from KGs to create false premises. Subsequently, utilizing the state-of-the-art capabilities of GPTs, we generate semantically rich FPQs. Based on the proposed method, we present a comprehensive benchmark, the Knowledge Graph-based False Premise Questions (KG-FPQ), which contains approximately 178k FPQs across three knowledge domains, at six levels of confusability, and in two task formats. Using KG-FPQ, we conduct extensive evaluations on several representative LLMs and provide valuable insights. The KG-FPQ dataset and code are available at https://github.com/yanxuzhu/KG-FPQ. Figure 1: Top: LLM correctly answers when faced with a TPQ. Middle: LLM experiences factuality hallucination when faced with a FPQ.
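The first pipeline step, editing a true triplet into a false premise, might look like the following sketch. The `Triple` type, the object-swapping strategy, the `known_true` check and the fixed yes/no template are all illustrative assumptions; the paper uses GPTs to phrase the questions and grades confusability, neither of which is modelled here.

```python
from typing import Iterable, NamedTuple, Set


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


def make_false_triple(true_triple: Triple, candidates: Iterable[str],
                      known_true: Set[Triple]) -> Triple:
    """Swap the object for the first candidate that yields a triple not known to be true."""
    for candidate in candidates:
        edited = true_triple._replace(obj=candidate)
        if edited not in known_true:
            return edited
    raise ValueError("no candidate produces a false triple")


def to_question(triple: Triple) -> str:
    """Hypothetical fixed template standing in for LLM-generated phrasing."""
    return f"Is it true that {triple.subject} {triple.relation} {triple.obj}?"
```

Checking the edited triple against the set of known true triples avoids accidentally producing a premise that is in fact correct.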


Training Task Experts through Retrieval Based Distillation

arXiv.org Artificial Intelligence

One of the most reliable ways to create deployable models for specialized tasks is to obtain an adequate amount of high-quality task-specific data. However, for specialized tasks, such datasets often do not exist. Existing methods address this by creating such data from large language models (LLMs) and then distilling such knowledge into smaller models. However, these methods are limited by the quality of the LLMs' output, and tend to generate repetitive or incorrect data. In this work, we present Retrieval Based Distillation (ReBase), a method that first retrieves data from rich online sources and then transforms them into domain-specific data. This method greatly enhances data diversity. Moreover, ReBase generates Chain-of-Thought reasoning and distills the reasoning capacity of LLMs. We test our method on 4 benchmarks and results show that our method significantly improves performance by up to 7.8% on SQuAD, 1.37% on MNLI, and 1.94% on BigBench-Hard.


Ukraine's navy chief says Russian warships are leaving Crimean hub in Black Sea

FOX News

The Russian navy's Black Sea Fleet has been forced to rebase nearly all its combat-ready warships from occupied Crimea to other locations, and its main naval hub is becoming ineffectual because of attacks by Kyiv, Ukraine's navy chief said. Vice-Admiral Oleksiy Neizhpapa said Ukrainian missile and naval drone strikes had caused heavy damage to the Sevastopol base, a logistics hub for repairs, maintenance, training and ammunition storage among other important functions for Russia. "They were established over many decades, possibly centuries. And clearly they are now losing this hub," Neizhpapa told Reuters in a rare interview in the port city of Odesa ahead of Ukraine Navy Day on Sunday. More than 28 months since Russia's full-scale invasion, Kyiv has dealt a series of stinging blows to Moscow in the Black Sea although Ukrainian ground troops are on the back foot across a sprawling front.


Improving ensemble extreme precipitation forecasts using generative artificial intelligence

arXiv.org Artificial Intelligence

An ensemble post-processing method is developed to improve the probabilistic forecasts of extreme precipitation events across the conterminous United States (CONUS). The method combines a 3-D Vision Transformer (ViT) for bias correction with a Latent Diffusion Model (LDM), a generative Artificial Intelligence (AI) method, to post-process 6-hourly precipitation ensemble forecasts and produce an enlarged generative ensemble that contains spatiotemporally consistent precipitation trajectories. These trajectories are expected to improve the characterization of extreme precipitation events and offer skillful multi-day accumulated and 6-hourly precipitation guidance. The method is tested using the Global Ensemble Forecast System (GEFS) precipitation forecasts out to day 6 and is verified against the Climate-Calibrated Precipitation Analysis (CCPA) data. Verification results indicate that the method generated skillful ensemble members with improved Continuous Ranked Probabilistic Skill Scores (CRPSSs) and Brier Skill Scores (BSSs) over the raw operational GEFS and a multivariate statistical post-processing baseline. It showed skillful and reliable probabilities for events at extreme precipitation thresholds. Explainability studies were further conducted, which revealed the decision-making process of the method and confirmed its effectiveness on ensemble member generation. This work introduces a novel, generative-AI-based approach to address the limitation of small numerical ensembles and the need for larger ensembles to identify extreme precipitation events.
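For readers unfamiliar with the verification scores mentioned, the sketch below computes a Brier Skill Score from ensemble members using the standard definitions: the Brier score is the mean squared difference between forecast probability and binary outcome, and the skill score is one minus the ratio of the forecast's Brier score to a reference's. The function names and the simple member-counting probability estimate are assumptions, not the paper's verification code.

```python
from typing import List, Sequence


def exceedance_probability(members: Sequence[float], threshold: float) -> float:
    """Fraction of ensemble members above a precipitation threshold."""
    return sum(1 for m in members if m > threshold) / len(members)


def brier_score(probs: List[float], outcomes: List[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs: List[float], ref_probs: List[float],
                      outcomes: List[int]) -> float:
    """1 - BS / BS_ref; positive values mean the forecast beats the reference."""
    bs_ref = brier_score(ref_probs, outcomes)
    if bs_ref == 0:
        raise ValueError("reference forecast is perfect; skill score undefined")
    return 1.0 - brier_score(probs, outcomes) / bs_ref
```

Because the probability is estimated by counting members, a larger ensemble gives a finer-grained estimate for rare, high-threshold events, which is the limitation of small ensembles that the abstract describes.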