AnnoPage Dataset: Dataset of Non-Textual Elements in Documents with Fine-Grained Categorization

arXiv.org Artificial Intelligence

We introduce the AnnoPage Dataset, a novel collection of 7 550 pages from historical documents, primarily in Czech and German, spanning from 1485 to the present, with a focus on the late 19th and early 20th centuries. The dataset is designed to support research in document layout analysis and object detection. Each page is annotated with axis-aligned bounding boxes (AABB) marking non-textual elements from 25 categories, such as images, maps, decorative elements, or charts, following the Czech Methodology of image document processing. The annotations were created by expert librarians to ensure accuracy and consistency. The dataset also incorporates pages from multiple, mainly historical, document datasets to enhance variability and maintain continuity. The dataset is divided into development and test subsets, with the test set carefully selected to maintain the category distribution. We provide baseline results using YOLO and DETR object detectors, offering a reference point for future research.
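Axis-aligned boxes are commonly compared by intersection over union (IoU) when detector predictions are matched against annotations. The sketch below is illustrative only: the `Box` type, the `(x_min, y_min, x_max, y_max)` corner convention, and the function names are assumptions, not the dataset's annotation format or its evaluation protocol.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box given by two opposite corners (assumed convention)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def area(box: Box) -> float:
    """Area of the box; empty or inverted boxes count as zero."""
    return max(0.0, box.x_max - box.x_min) * max(0.0, box.y_max - box.y_min)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # The overlap of two axis-aligned boxes is itself an axis-aligned box.
    overlap = Box(
        max(a.x_min, b.x_min),
        max(a.y_min, b.y_min),
        min(a.x_max, b.x_max),
        min(a.y_max, b.y_max),
    )
    inter = area(overlap)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


annotation = Box(0, 0, 2, 2)
prediction = Box(1, 1, 3, 3)
print(iou(annotation, prediction))  # overlap area 1, union area 7
```

Because the boxes are axis-aligned, the overlap needs only four min/max operations, which is one reason this annotation style is convenient for detection benchmarks.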


b83bea9688047be30f54034c55716854-Supplemental-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems

In addition, users may become overly dependent on the model's outputs. For the feedback, we ask the person "Please consider the quality of the Given a score (1-5). 1 means its quality is bad, and 5 means its quality is very good". The interface of the user study is shown in Fig. A1. We report the average scores in Tab. We have a total of 1.1M training samples in FIRE. In Fig. A2, we present the curves of AT, ATR, ATR, and RR using different Results show that more data leads to better performance. This experiment again demonstrates the quality of the data in FIRE.


LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation

arXiv.org Artificial Intelligence

We propose a training-free method for open-vocabulary semantic segmentation using Vision-and-Language Models (VLMs). Our approach enhances the initial per-patch predictions of VLMs through label propagation, which jointly optimizes predictions by incorporating patch-to-patch relationships. Since VLMs are primarily optimized for cross-modal alignment and not for intra-modal similarity, we use a Vision Model (VM) that is observed to better capture these relationships. We address resolution limitations inherent to patch-based encoders by applying label propagation at the pixel level as a refinement step, significantly improving segmentation accuracy near class boundaries. Our method, called LPOSS+, performs inference over the entire image, avoiding window-based processing and thereby capturing contextual interactions across the full image. LPOSS+ achieves state-of-the-art performance among training-free methods, across a diverse set of datasets. Code: https://github.com/vladan-stojnic/LPOSS
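To make the idea of label propagation concrete, here is a minimal sketch of a generic iterative propagation update over a small patch graph. It is not the LPOSS implementation: the affinity matrix `W`, the initial scores `Y`, the mixing weight `alpha`, and the iteration count are all made-up illustrative values, and the pixel-level refinement stage is not shown.

```python
import math


def normalize_affinity(W):
    """Symmetric normalization S = D^(-1/2) W D^(-1/2) of a non-negative affinity matrix."""
    n = len(W)
    deg = [sum(row) for row in W]
    return [
        [W[i][j] / math.sqrt(deg[i] * deg[j]) if deg[i] > 0 and deg[j] > 0 else 0.0
         for j in range(n)]
        for i in range(n)
    ]


def propagate(W, Y, alpha=0.9, iters=50):
    """Iterate F <- alpha * S @ F + (1 - alpha) * Y, starting from F = Y."""
    S = normalize_affinity(W)
    n, num_classes = len(Y), len(Y[0])
    F = [row[:] for row in Y]
    for _ in range(iters):
        F = [
            [alpha * sum(S[i][k] * F[k][j] for k in range(n)) + (1 - alpha) * Y[i][j]
             for j in range(num_classes)]
            for i in range(n)
        ]
    return F


def argmax(row):
    return max(range(len(row)), key=row.__getitem__)


# Three mutually similar patches (hypothetical affinities, no self-loops).
W = [[0.0, 1.0, 1.0],
     [1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0]]
# Initial per-patch class scores; patch 2 weakly disagrees with its neighbours.
Y = [[0.9, 0.1],
     [0.8, 0.2],
     [0.4, 0.6]]

initial = [argmax(row) for row in Y]
refined = [argmax(row) for row in propagate(W, Y)]
print(initial, refined)  # [0, 0, 1] [0, 0, 0]
```

In this toy graph the third patch is pulled toward the class of its two similar neighbours, which is the kind of joint correction the abstract attributes to patch-to-patch relationships.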


Solving Sparse & High-Dimensional-Output Regression via Compression

Neural Information Processing Systems

Multi-Output Regression (MOR) has been widely used in scientific data analysis for decision-making. Unlike traditional regression models, MOR aims to simultaneously predict multiple real-valued outputs given an input. However, the increasing dimensionality of the outputs poses significant challenges regarding interpretability and computational scalability for modern MOR applications. As a first step to address these challenges, this paper proposes a Sparse & High-dimensional-Output REgression (SHORE) model by incorporating additional sparsity requirements to resolve the output interpretability, and then designs a computationally efficient two-stage optimization framework capable of solving SHORE with provable accuracy via compression on outputs. Theoretically, we show that the proposed framework is computationally scalable while maintaining the same order of training loss and prediction loss before and after compression under arbitrary or relatively weak sample set conditions. Empirically, numerical results further validate the theoretical findings, showcasing the efficiency and accuracy of the proposed framework.


Supplementary Materials for MAViL: Masked Audio-Video Learners

Neural Information Processing Systems

These results are obtained using the stage-1 MAViL's decoders. In D, we discuss MAViL's societal impact and limitations. Figure 1: Video clip and spectrogram reconstruction on the AudioSet eval set. We sample 4 paired (video, audio) examples as follows: Top left: a puppy video; Top right: a recording from an ambulance's dash camera; Bottom left: a person dialing a phone in a dark room; Bottom right: a singer dancing. In each 3-row group, we show the original video and its audio spectrogram (top), masked input to MAViL (middle), and MAViL's video and audio spectrogram reconstructions (bottom). The spectrogram shape is 1024 × 128; patch size is 16 × 16.
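As a quick worked check of the caption's numbers, assuming they read as a 1024 by 128 spectrogram cut into non-overlapping 16 by 16 patches, the patch grid can be computed as below. The helper name `patch_grid` is hypothetical and is not taken from the MAViL code.

```python
def patch_grid(shape, patch):
    """Return how many non-overlapping patches fit along each axis."""
    height, width = shape
    patch_h, patch_w = patch
    if height % patch_h or width % patch_w:
        raise ValueError("shape must be divisible by the patch size")
    return height // patch_h, width // patch_w


grid = patch_grid((1024, 128), (16, 16))
num_patches = grid[0] * grid[1]
print(grid, num_patches)  # (64, 8) 512
```

Under that reading, each spectrogram yields 512 patches before any masking is applied.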


Doubly Mild Generalization for Offline Reinforcement Learning Yixiu Mao 1, Qi Wang 1, Yun Qu

Neural Information Processing Systems

Offline Reinforcement Learning (RL) suffers from extrapolation error and value overestimation. From a generalization perspective, this issue can be attributed to the over-generalization of value functions or policies towards out-of-distribution (OOD) actions.


Comparative Analysis of Deep Learning Models for Real-World ISP Network Traffic Forecasting

arXiv.org Artificial Intelligence

Traffic monitoring is a cornerstone of effective network management and cybersecurity, providing Internet Service Providers (ISPs) with critical insights to detect anomalies, mitigate congestion, and maintain network performance [1]. The surge in video streaming, cloud computing, and online gaming is driving rapid growth in internet usage, contributing to increasingly complex and less predictable network traffic. Efficient network monitoring allows ISPs to maintain service quality, mitigate security risks, and optimize bandwidth in real time [2]. However, real-time monitoring alone is insufficient for proactively managing network resources. To anticipate variations in demand and prevent service disruptions, ISPs increasingly adopt advanced forecasting techniques to predict traffic patterns and optimize resource allocation in advance [3]. Accurate traffic forecasting allows ISPs to efficiently allocate resources, scale network capacity, and sustain service quality under fluctuating loads [3]. The rise of diverse, high-bandwidth services has significantly increased network traffic variability. Traditional models like ARIMA and exponential smoothing, which assume linearity, struggle with ISP data due to prevalent non-linear and high-frequency fluctuations, especially during peak traffic hours [4]. These limitations have driven the adoption of deep learning models, particularly neural networks, which excel at capturing complex temporal dependencies across various forecasting domains [5].


Asymptotically Optimal Path Planning With an Approximation of the Omniscient Set

arXiv.org Artificial Intelligence

The asymptotically optimal version of Rapidly-exploring Random Tree (RRT*) is often used to find optimal paths in a high-dimensional configuration space. A well-known issue of RRT* is its slow convergence towards the optimal solution. A possible solution is to draw random samples only from a subset of the configuration space that is known to contain configurations that can improve the cost of the path (the omniscient set). A fast convergence rate may be achieved by approximating the omniscient set with a low-volume set. In this letter, we propose new methods to approximate the omniscient set and methods for their effective sampling. First, we propose to approximate the omniscient set using several (small) hyperellipsoids defined by sections of the current best solution. The second approach approximates the omniscient set by a convex hull computed from the current solution. Both approaches ensure asymptotic optimality and work in a general n-dimensional configuration space. The experiments have shown superior performance of our approaches in multiple scenarios in 3D and 6D configuration spaces.
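One way to picture the first approach is sketched below, under loud assumptions: each section of the current path defines a hyperellipsoid whose foci are the section's endpoints and whose "string length" is the section's path length, and samples are drawn by simple rejection from a bounding box. The fixed section size `step`, the uniform choice among ellipsoids, and the rejection sampler are simplifications chosen for illustration; the letter proposes its own sampling methods, which the abstract does not detail.

```python
import math
import random


def path_length(path):
    """Sum of Euclidean edge lengths along a list of waypoints."""
    return sum(math.dist(path[i], path[i + 1]) for i in range(len(path) - 1))


def in_ellipsoid(x, focus_a, focus_b, cost):
    """True if the detour through x between the foci is no longer than cost."""
    return math.dist(x, focus_a) + math.dist(x, focus_b) <= cost


def section_ellipsoids(path, step):
    """One (focus_a, focus_b, cost) triple per section of `step` path edges."""
    ellipsoids = []
    for i in range(0, len(path) - 1, step):
        section = path[i:i + step + 1]
        ellipsoids.append((section[0], section[-1], path_length(section)))
    return ellipsoids


def sample(ellipsoids, rng, max_tries=10_000):
    """Pick an ellipsoid uniformly (not volume-weighted) and rejection-sample in it."""
    focus_a, focus_b, cost = rng.choice(ellipsoids)
    center = [(a + b) / 2 for a, b in zip(focus_a, focus_b)]
    half = cost / 2  # semi-major axis, so this box contains the whole ellipsoid
    for _ in range(max_tries):
        x = [c + rng.uniform(-half, half) for c in center]
        if in_ellipsoid(x, focus_a, focus_b, cost):
            return x
    # A perfectly straight section gives a degenerate (zero-volume) ellipsoid.
    raise RuntimeError("no sample found; the section may be degenerate")


# Hypothetical zig-zag path in 2D; the same code works for n-dimensional points.
path = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0), (3.0, 1.0), (4.0, 0.0)]
ellipsoids = section_ellipsoids(path, step=2)
point = sample(ellipsoids, random.Random(0))
print(len(ellipsoids), point)
```

Any configuration inside one of these ellipsoids can, ignoring obstacles, replace its section with a detour that is no longer than the current one, which is why such sets are natural candidates for focused sampling.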


DMNet: Self-comparison Driven Model for Subject-independent Seizure Detection

Neural Information Processing Systems

Automated seizure detection (ASD) using intracranial electroencephalography (iEEG) is critical for effective epilepsy treatment. However, the significant domain shift of iEEG signals across subjects poses a major challenge, limiting their applicability in real-world clinical scenarios. In this paper, we address this issue by analyzing the primary cause behind the failure of existing iEEG models for subject-independent seizure detection, and identify a critical universal seizure pattern: seizure events consistently exhibit higher average amplitude compared to adjacent normal events. To mitigate the domain shifts and preserve the universal seizure patterns, we propose a novel self-comparison mechanism.
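The amplitude pattern described above can be illustrated with a toy comparison of each segment against its neighbours. This is not DMNet's self-comparison mechanism, which the abstract does not specify; the function names, the `context` parameter, and the sample values are hypothetical.

```python
def mean_abs(segment):
    """Average absolute amplitude of one signal segment."""
    return sum(abs(v) for v in segment) / len(segment)


def relative_amplitude(segments, context=1):
    """Ratio of each segment's amplitude to the mean of its adjacent segments.

    Expects at least two segments so that every segment has a neighbour.
    """
    amps = [mean_abs(s) for s in segments]
    ratios = []
    for i, amp in enumerate(amps):
        neighbours = amps[max(0, i - context):i] + amps[i + 1:i + 1 + context]
        reference = sum(neighbours) / len(neighbours)
        ratios.append(amp / reference if reference > 0 else float("inf"))
    return ratios


# Toy recording: the third segment has a much higher amplitude than its neighbours.
segments = [[1, -1, 1], [1, -1, 1], [5, -5, 5], [1, -1, 1]]
ratios = relative_amplitude(segments)
# The same recording with a 10x gain, standing in for a different subject.
scaled = [[10 * v for v in s] for s in segments]
scaled_ratios = relative_amplitude(scaled)
print(ratios, scaled_ratios)
```

Because each segment is compared with its own surroundings rather than with an absolute threshold, the ratios are unchanged by a global change in signal scale, which hints at why a relative pattern might transfer across subjects.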


WildGaussians: 3D Gaussian Splatting in the Wild Jonas Kulhanek

Neural Information Processing Systems

While the field of 3D scene reconstruction is dominated by NeRFs due to their photorealistic quality, 3D Gaussian Splatting (3DGS) has recently emerged, offering similar quality with real-time rendering speeds. However, both methods primarily excel with well-controlled 3D scenes, while in-the-wild data - characterized by occlusions, dynamic objects, and varying illumination - remains challenging. NeRFs can adapt to such conditions easily through per-image embedding vectors, but 3DGS struggles due to its explicit representation and lack of shared parameters. To address this, we introduce WildGaussians, a novel approach to handle occlusions and appearance changes with 3DGS. By leveraging robust DINO features and integrating an appearance modeling module within 3DGS, our method achieves state-of-the-art results. We demonstrate that WildGaussians matches the real-time rendering speed of 3DGS while surpassing both 3DGS and NeRF baselines in handling in-the-wild data, all within a simple architectural framework.