 Sevastopol


Ukraine says it carried out first-ever underwater drone strike on Russian submarine in Novorossiysk

FOX News


Ukraine's 'Spiderweb' drone assault forces Russia to shelter, move aircraft

Al Jazeera

Russia's increased sense of vulnerability may be the most important result of a recent large-scale Ukrainian drone attack named Operation Spiderweb, experts tell Al Jazeera. The operation destroyed as much as a third of Russia's strategic bomber fleet on the tarmac of four airfields deep inside Russia on June 1. Days later, Russia started to build shelters for its bombers and relocate them. An open source intelligence (OSINT) researcher nicknamed Def Mon posted time-lapse satellite photographs on social media showing major excavations at the Kirovskoe airfield in annexed Crimea as well as in Sevastopol, Gvardiyskoye and Saki, where Russia was constructing shelters for military aircraft. They reported similar work at several airbases in Russia, including the Engels base, which was targeted in Ukraine's attacks on June 1.


Ukraine bombs Russian bases: Here are some of Kyiv's most audacious attacks

Al Jazeera

Ukrainian drones struck multiple military airbases deep inside Russia on Sunday in a major operation a day before the neighbours held peace talks in Istanbul. The Russian Defence Ministry said Ukraine had launched drone strikes targeting Russian military airfields across five regions, causing several aircraft to catch fire. The attacks occurred in the Murmansk, Irkutsk, Ivanovo, Ryazan, and Amur regions. Air defences repelled the assaults in all but two regions – Murmansk and Irkutsk, the ministry said. "In the Murmansk and Irkutsk regions, the launch of FPV drones from an area in close proximity to airfields resulted in several aircraft catching fire," the Defence Ministry said.


SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially?

arXiv.org Artificial Intelligence

Reasoning and strategic behavior in social interactions are a hallmark of intelligence. This form of reasoning is significantly more sophisticated than isolated planning or reasoning tasks in static settings (e.g., math problem solving). In this paper, we present Strategic Planning, Interaction, and Negotiation (SPIN-Bench), a new multi-domain evaluation designed to measure the intelligence of strategic planning and social reasoning. While many existing benchmarks focus on narrow planning or single-agent reasoning, SPIN-Bench combines classical PDDL tasks, competitive board games, cooperative card games, and multi-agent negotiation scenarios in one unified framework. The framework includes both a benchmark and an arena to simulate and evaluate a variety of social settings that test the reasoning and strategic behavior of AI agents. We formulate the benchmark by systematically varying action spaces, state complexity, and the number of interacting agents to simulate a variety of social settings where success depends not only on methodical, step-wise decision making but also on conceptual inference about other (adversarial or cooperative) participants. Our experiments reveal that while contemporary LLMs handle basic fact retrieval and short-range planning reasonably well, they encounter significant performance bottlenecks in tasks requiring deep multi-hop reasoning over large state spaces and socially adept coordination under uncertainty. We envision SPIN-Bench as a catalyst for future research on robust multi-agent planning, social reasoning, and human-AI teaming. Project Website: https://spinbench.github.io/


DSGBench: A Diverse Strategic Game Benchmark for Evaluating LLM-based Agents in Complex Decision-Making Environments

arXiv.org Artificial Intelligence

Large Language Model (LLM) based agents have become increasingly popular for solving complex and dynamic tasks, which calls for proper evaluation systems to assess their capabilities. Nevertheless, existing benchmarks usually either focus on single-objective tasks or use overly broad assessment metrics, failing to provide a comprehensive inspection of the actual capabilities of LLM-based agents in complicated decision-making tasks. To address these issues, we introduce DSGBench, a more rigorous evaluation platform for strategic decision-making. Firstly, it incorporates six complex strategic games, which serve as ideal testbeds due to their long-term and multi-dimensional decision-making demands and their flexibility in customizing tasks of various difficulty levels or with multiple targets. Secondly, DSGBench employs a fine-grained evaluation scoring system that examines decision-making capabilities by measuring performance in five specific dimensions and combining them into a comprehensive assessment. Furthermore, DSGBench incorporates an automated decision-tracking mechanism that enables in-depth analysis of agent behaviour patterns and of changes in their strategies. We demonstrate the advantages of DSGBench by applying it to multiple popular LLM-based agents, and our results suggest that DSGBench provides valuable insights for choosing LLM-based agents as well as improving their future development. DSGBench is available at https://github.com/DeciBrain-Group/DSGBench.


HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs

arXiv.org Artificial Intelligence

An Achilles heel of Large Language Models (LLMs) is their tendency to hallucinate non-factual statements. A response that mixes factual and non-factual statements is hard for humans to verify and to base decisions on accurately. To combat this problem, we propose Highlighted Chain-of-Thought Prompting (HoT), a technique for prompting LLMs to generate responses with XML tags that ground facts to those provided in the query. That is, given an input question, LLMs first re-format the question to add XML tags highlighting key facts, and then generate a response with highlights over the facts referenced from the input. Interestingly, in few-shot settings, HoT outperforms vanilla chain-of-thought prompting (CoT) on a wide range of 17 tasks, from arithmetic and reading comprehension to logical reasoning. When humans are asked to verify LLM responses, highlights help time-limited participants recognize more accurately and efficiently when LLMs are correct. Yet, surprisingly, when LLMs are wrong, HoTs tend to make users believe that an answer is correct.


License Plate Images Generation with Diffusion Models

arXiv.org Artificial Intelligence

Despite the evident practical importance of license plate recognition (LPR), corresponding research is limited by the volume of publicly available datasets due to privacy regulations such as the General Data Protection Regulation (GDPR). To address this challenge, synthetic data generation has emerged as a promising approach. In this paper, we propose to synthesize realistic license plates (LPs) using diffusion models, inspired by recent advances in image and video generation. In our experiments, a diffusion model was successfully trained on a Ukrainian LP dataset, and 1000 synthetic images were generated for detailed analysis. Through manual classification and annotation of the generated images, we performed a thorough study of the model output, including success rate, character distributions, and types of failures. Our contributions include experimental validation of the efficacy of diffusion models for LP synthesis, along with insights into the characteristics of the generated data. Furthermore, we have prepared a synthetic dataset consisting of 10,000 LP images, publicly available at https://zenodo.org/doi/10.5281/zenodo.13342102. Our experiments empirically confirm the usefulness of synthetic data for the LPR task. Despite the initial performance gap between models trained with real and synthetic data, expanding the training data set with pseudolabeled synthetic data improves LPR accuracy by 3% compared to the baseline.


Swift Cross-Dataset Pruning: Enhancing Fine-Tuning Efficiency in Natural Language Understanding

arXiv.org Artificial Intelligence

Dataset pruning aims to select a subset of a dataset for efficient model training. While data efficiency in natural language processing has primarily focused on within-corpus scenarios during model pre-training, efficient dataset pruning for task-specific fine-tuning across diverse datasets remains challenging due to variability in dataset sizes, data distributions, class imbalance and label spaces. Current cross-dataset pruning techniques for fine-tuning often rely on computationally expensive sample ranking processes, typically requiring full dataset training or reference models. We address this gap by proposing Swift Cross-Dataset Pruning (SCDP). Specifically, our approach uses TF-IDF embeddings with geometric median to rapidly evaluate sample importance. We then apply dataset size-adaptive pruning to ensure diversity: for smaller datasets, we retain samples far from the geometric median, while for larger ones, we employ distance-based stratified pruning. Experimental results on six diverse datasets demonstrate the effectiveness of our method, spanning various tasks and scales while significantly reducing computational resources. Source code is available at: https://github.com/he-y/NLP-Dataset-Pruning
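The ranking step described in this abstract can be illustrated with a small sketch. It is not the authors' implementation (see the linked repository for that): the whitespace tokenizer, the smoothed IDF formula, the use of Weiszfeld's iteration to approximate the geometric median, and the function names `tfidf_vectors`, `geometric_median` and `prune_keep_far` are all assumptions made for illustration. Only the small-dataset rule (keep samples far from the geometric median) is shown; the distance-based stratified variant for larger datasets is omitted.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build dense TF-IDF vectors using whitespace tokens (illustrative choice)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        vec = [0.0] * len(vocab)
        for tok, count in Counter(toks).items():
            tf = count / len(toks)
            # Smoothed IDF; an assumption, not taken from the paper.
            idf = math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0
            vec[index[tok]] = tf * idf
        vectors.append(vec)
    return vectors


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def geometric_median(points, iterations=100, eps=1e-9):
    """Approximate the geometric median with Weiszfeld's iteration."""
    dim = len(points[0])
    # Start from the centroid.
    median = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    for _ in range(iterations):
        # Clamp distances to eps to avoid division by zero.
        weights = [1.0 / max(distance(p, median), eps) for p in points]
        total = sum(weights)
        updated = [
            sum(w * p[i] for w, p in zip(weights, points)) / total
            for i in range(dim)
        ]
        if distance(updated, median) < eps:
            return updated
        median = updated
    return median


def prune_keep_far(docs, keep_ratio):
    """Return sorted indices of the samples farthest from the geometric median."""
    vectors = tfidf_vectors(docs)
    median = geometric_median(vectors)
    ranked = sorted(
        range(len(docs)),
        key=lambda i: distance(vectors[i], median),
        reverse=True,
    )
    n_keep = max(1, round(keep_ratio * len(docs)))
    return sorted(ranked[:n_keep])
```

Calling `prune_keep_far(texts, keep_ratio=0.5)` on a list of strings returns the indices of the retained samples. No model training is needed to produce the ranking, which is consistent with the abstract's emphasis on avoiding full-dataset training or reference models.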


BordIRlines: A Dataset for Evaluating Cross-lingual Retrieval-Augmented Generation

arXiv.org Artificial Intelligence

Large language models excel at creative generation but continue to struggle with the issues of hallucination and bias. While retrieval-augmented generation (RAG) provides a framework for grounding LLMs' responses in accurate and up-to-date information, it still raises the question of bias: which sources should be selected for inclusion in the context? And how should their importance be weighted? In this paper, we study the challenge of cross-lingual RAG and present a dataset to investigate the robustness of existing systems at answering queries about geopolitical disputes, which exist at the intersection of linguistic, cultural, and political boundaries. Our dataset is sourced from Wikipedia pages containing information relevant to the given queries and we investigate the impact of including additional context, as well as the composition of this context in terms of language and source, on an LLM's response. Our results show that existing RAG systems continue to be challenged by cross-lingual use cases and suffer from a lack of consistency when they are provided with competing information in multiple languages. We present case studies to illustrate these issues and outline steps for future research to address these challenges. We make our dataset and code publicly available at https://github.com/manestay/bordIRlines.


Ukraine's navy chief says Russian warships are leaving Crimean hub in Black Sea

FOX News

The Russian navy's Black Sea Fleet has been forced to rebase nearly all its combat-ready warships from occupied Crimea to other locations, and its main naval hub is becoming ineffectual because of attacks by Kyiv, Ukraine's navy chief said. Vice-Admiral Oleksiy Neizhpapa said Ukrainian missile and naval drone strikes had caused heavy damage to the Sevastopol base, a logistics hub for repairs, maintenance, training and ammunition storage among other important functions for Russia. "They were established over many decades, possibly centuries. And clearly they are now losing this hub," Neizhpapa told Reuters in a rare interview in the port city of Odesa ahead of Ukraine Navy Day on Sunday. More than 28 months since Russia's full-scale invasion, Kyiv has dealt a series of stinging blows to Moscow in the Black Sea although Ukrainian ground troops are on the back foot across a sprawling front.