AITopics | transf


transf

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

92d1e1eb1cd6f9fba3227870bb6d7f07-Supplemental.pdf

Neural Information Processing Systems, Feb-9-2026, 09:06:36 GMT

overlap, time step, transf, (16 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)


From Image Captioning to Visual Storytelling

Passadakis, Admitos, Song, Yingjin, Gatt, Albert

arXiv.org Artificial Intelligence, Aug-21-2025

Visual Storytelling is a challenging multimodal task at the intersection of Vision & Language, where the goal is to generate a story for a stream of images. Its difficulty lies in the fact that the story should be both grounded in the image sequence and also narrative and coherent. The aim of this work is to balance these aspects by treating Visual Storytelling as a superset of Image Captioning, an approach quite different from most prior studies. This means that we first employ a vision-to-language model to obtain captions of the input images, and then transform these captions into coherent narratives using language-to-language methods. Our multifarious evaluation shows that integrating captioning and storytelling under a unified framework has a positive impact on the quality of the produced stories. In addition, compared to numerous previous studies, this approach shortens training time and makes our framework readily reusable and reproducible by anyone interested. Lastly, we propose a new metric/tool, named ideality, that can be used to simulate how far some results are from an oracle model, and we apply it to emulate human-likeness in visual storytelling.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2508.14045

Country:

North America > United States > Minnesota (0.28)
Asia > Middle East > UAE (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Appendices A Masking distribution

Neural Information Processing Systems, Aug-15-2025, 03:46:36 GMT

For a 15 sec long audio sample, the average mask length is 14.7 time-steps, corresponding to 299 ms. Table 6 summarizes the fine-tuning hyper-parameter settings used for the different labeled data setup. In this section we study the most common errors our models make when fine-tuned on different amounts of labeled data (Table 11). … LV-60k model achieves WER 38.3 on dev-clean and adding a Transformer language model enables … The ten minute models without lexicon and language model tend to spell words phonetically and omit repeated letters, e.g., will … At ten hours, top errors include articles, e.g., a, the which … The "from scratch" 960 hour model has a similar word error rate as the 100 hour pre-trained model … In brackets is the total number of occurrences of each error. The setup for the baseline model is described in 5.4. Both did not lead to meaningful improvements.
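The excerpt equates an average mask length of 14.7 time-steps with 299 ms of audio. The short sketch below shows one way that conversion can be reproduced. It assumes a 20 ms stride between encoder frames and a 25 ms receptive field per frame; neither value is stated in the excerpt, and the function name is hypothetical.

```python
def span_duration_ms(num_steps: float,
                     stride_ms: float = 20.0,
                     receptive_field_ms: float = 25.0) -> float:
    """Audio duration covered by `num_steps` consecutive encoder frames.

    Assumption (not stated in the excerpt): consecutive frames start
    `stride_ms` apart and each frame covers `receptive_field_ms` of audio,
    so a span covers (num_steps - 1) strides plus one receptive field.
    """
    if num_steps < 1:
        raise ValueError("a span needs at least one time-step")
    return (num_steps - 1) * stride_ms + receptive_field_ms


if __name__ == "__main__":
    # Average mask length reported in the excerpt.
    print(f"{span_duration_ms(14.7):.0f} ms")
```

Under these assumed frame parameters the result rounds to the 299 ms quoted in the excerpt; with a different stride or receptive field the figure would change accordingly.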

overlap, time step, transf, (16 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)


Asymptotically Optimal Path Planning With an Approximation of the Omniscient Set

Kříž, Jonáš, Vonásek, Vojtěch

arXiv.org Artificial Intelligence, Mar-20-2025

The asymptotically optimal version of Rapidly-exploring Random Tree (RRT*) is often used to find optimal paths in a high-dimensional configuration space. A well-known issue with RRT* is its slow convergence towards the optimal solution. A possible solution is to draw random samples only from a subset of the configuration space that is known to contain configurations that can improve the cost of the path (omniscient set). A fast convergence rate may be achieved by approximating the omniscient set with a low-volume set. In this letter, we propose new methods to approximate the omniscient set and methods for sampling these approximations effectively. First, we propose to approximate the omniscient set using several (small) hyperellipsoids defined by sections of the current best solution. The second approach approximates the omniscient set by a convex hull computed from the current solution. Both approaches ensure asymptotic optimality and work in a general n-dimensional configuration space. The experiments have shown superior performance of our approaches in multiple scenarios in 3D and 6D configuration spaces.
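To illustrate the idea of sampling from a low-volume set bounded by a path cost, the sketch below draws uniform samples from a single 2D ellipse whose foci are two configurations and whose size is set by a cost bound. This is only the simplest single-ellipse case, not the authors' multi-hyperellipsoid or convex-hull method; the function name and parameters are hypothetical.

```python
import math
import random
from typing import Tuple

Point = Tuple[float, float]


def sample_in_ellipse(focus_a: Point, focus_b: Point, cost: float,
                      rng: random.Random) -> Point:
    """Uniformly sample a point p with |p - a| + |p - b| <= cost (2D only)."""
    focal_dist = math.dist(focus_a, focus_b)
    if cost < focal_dist:
        raise ValueError("cost must be at least the distance between the foci")
    semi_major = cost / 2.0
    semi_minor = math.sqrt(cost ** 2 - focal_dist ** 2) / 2.0

    # Uniform sample in the unit disk (sqrt keeps the density uniform in area).
    radius = math.sqrt(rng.random())
    angle = 2.0 * math.pi * rng.random()
    x = semi_major * radius * math.cos(angle)
    y = semi_minor * radius * math.sin(angle)

    # Rotate so the major axis points from focus_a to focus_b, then translate
    # to the midpoint of the foci.
    rot = math.atan2(focus_b[1] - focus_a[1], focus_b[0] - focus_a[0])
    cx = (focus_a[0] + focus_b[0]) / 2.0
    cy = (focus_a[1] + focus_b[1]) / 2.0
    return (cx + x * math.cos(rot) - y * math.sin(rot),
            cy + x * math.sin(rot) + y * math.cos(rot))


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_in_ellipse((0.0, 0.0), (4.0, 0.0), 5.0, rng))
```

Every returned point can lie on some path between the two foci that is no longer than `cost`, which is why restricting samples to such a region avoids wasting them on configurations that cannot improve the current solution.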

artificial intelligence, planning & scheduling, random sample, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/LRA.2025.3540627

2503.16164

Country:

Europe > Czechia > Prague (0.04)
Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.84)


Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization

Moratelli, Nicholas, Caffagni, Davide, Cornia, Marcella, Baraldi, Lorenzo, Cucchiara, Rita

arXiv.org Artificial Intelligence, Aug-26-2024

The conventional training approach for image captioning involves pre-training a network using teacher forcing and subsequent fine-tuning with Self-Critical Sequence Training to maximize hand-crafted captioning metrics. However, when attempting to optimize modern and higher-quality metrics like CLIP-Score and PAC-Score, this training method often encounters instability and fails to acquire the genuine descriptive capabilities needed to produce fluent and informative captions. In this paper, we propose a new training paradigm termed Direct CLIP-Based Optimization (DiCO). Our approach jointly learns and optimizes a reward model that is distilled from a learnable captioning evaluator with high human correlation. This is done by solving a weighted classification problem directly inside the captioner. At the same time, DiCO prevents divergence from the original model, ensuring that fluency is maintained. DiCO not only exhibits improved stability and enhanced quality in the generated captions but also aligns more closely with human preferences compared to existing methods, especially in modern metrics. Additionally, it maintains competitive performance in traditional metrics. Our source code and trained models are publicly available at https://github.com/aimagelab/DiCO.

background, dico, scst, (15 more...)

arXiv.org Artificial Intelligence

2408.14547

Country:

Europe > Italy > Tuscany > Pisa Province > Pisa (0.04)
Europe > Italy > Emilia-Romagna > Modena Province > Modena (0.04)

Genre: Research Report > Promising Solution (0.48)

Industry:

Transportation > Ground > Rail (1.00)
Leisure & Entertainment > Sports (1.00)
Transportation > Passenger (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Secure short-term load forecasting for smart grids with transformer-based federated learning

Sievers, Jonas, Blank, Thomas

arXiv.org Artificial Intelligence, Oct-26-2023

Electricity load forecasting is an essential task within smart grids to assist demand and supply balance. While advanced deep learning models require large amounts of high-resolution data for accurate short-term load predictions, fine-grained load profiles can expose users' electricity consumption behaviors, which raises privacy and security concerns. One solution to improve data privacy is federated learning, where models are trained locally on private data, and only the trained model parameters are merged and updated on a global server. Therefore, this paper presents a novel transformer-based deep learning approach with federated learning for short-term electricity load prediction. To evaluate our results, we benchmark our federated learning architecture against central and local learning and compare the performance of our model to long short-term memory models and convolutional neural networks. Our simulations are based on a dataset from a German university campus and show that transformer-based forecasting is a promising alternative to state-of-the-art models within federated learning.
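The merge step described above, where only locally trained parameters reach the global server, can be sketched as a weighted average of client parameters. Weighting by local dataset size (a FedAvg-style rule) is an assumption here, since the abstract does not say how parameters are merged; the names and the flat list-of-floats parameter format are hypothetical.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def federated_average(client_weights: List[Weights],
                      client_sizes: List[int]) -> Weights:
    """Merge client parameters into global parameters.

    Assumption: each client's contribution is weighted by the number of
    local training samples. Raw load profiles never leave the clients;
    only the parameter lists are passed to this function.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one size per client and at least one client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")

    merged: Weights = {}
    for name, reference in client_weights[0].items():
        merged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(len(reference))
        ]
    return merged


if __name__ == "__main__":
    clients = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
    print(federated_average(clients, [1, 3]))
```

In a full training loop the server would send the merged parameters back to the clients and repeat this step after each round of local training.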

architecture, forecasting, learning, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ICCEP57914.2023.10247363

2310.17477

Country:

Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.04)
North America > United States > New Jersey (0.04)

Genre:

Research Report > New Finding (0.67)
Research Report > Promising Solution (0.66)

Industry:

Information Technology > Security & Privacy (1.00)
Energy > Power Industry (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Mixture Encoder Supporting Continuous Speech Separation for Meeting Recognition

Vieting, Peter, Berger, Simon, von Neumann, Thilo, Boeddeker, Christoph, Schlüter, Ralf, Haeb-Umbach, Reinhold

arXiv.org Artificial Intelligence, Sep-15-2023

Many real-life applications of automatic speech recognition (ASR) require processing of overlapped speech. A common method involves first separating the speech into overlap-free streams and then performing ASR on the resulting signals. Recently, the inclusion of a mixture encoder in the ASR model has been proposed. This mixture encoder leverages the original overlapped speech to mitigate the effect of artifacts introduced by the speech separation. Previously, however, the method only addressed two-speaker scenarios. In this work, we extend this approach to more natural meeting contexts featuring an arbitrary number of speakers and dynamic overlaps. We evaluate the performance using different speech separators, including the powerful TF-GridNet model. Our experiments show state-of-the-art performance on the LibriCSS dataset and highlight the advantages of the mixture encoder. Furthermore, they demonstrate the strong separation of TF-GridNet, which largely closes the gap between previous methods and oracle separation.

encoder, mixture encoder, separation, (17 more...)

arXiv.org Artificial Intelligence

2309.08454

Country:

Europe > Germany (0.05)
Europe > Austria > Styria > Graz (0.05)
Europe > Greece (0.05)

Genre: Research Report (0.50)

Industry: Government (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.94)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)


Reverse Ordering Techniques for Attention-Based Channel Prediction

Rizzello, Valentina, Böck, Benedikt, Joham, Michael, Utschick, Wolfgang

arXiv.org Artificial Intelligence, May-11-2023

This work aims to predict channels in wireless communication systems based on noisy observations, utilizing sequence-to-sequence models with attention (Seq2Seq-attn) and transformer models. Both models are adapted from natural language processing to tackle the complex challenge of channel prediction. Additionally, a new technique called reverse positional encoding is introduced in the transformer model to improve the robustness of the model against varying sequence lengths. Similarly, the encoder outputs of the Seq2Seq-attn model are reversed before applying attention. Simulation results demonstrate that the proposed ordering techniques allow the models to better capture the relationships between the channel snapshots within the sequence, irrespective of the sequence length, as opposed to existing methods.

machine learning, natural language, prediction, (18 more...)

arXiv.org Artificial Intelligence

2302.00341

Country: