Differentiable Modal Synthesis for Physical Modeling of Planar String Sound and Motion Simulation
While significant advancements have been made in music generation and differentiable sound synthesis within machine learning and computer audition, the simulation of instrument vibration guided by physical laws has been underexplored. To address this gap, we introduce a novel model for simulating the spatio-temporal motion of nonlinear strings, integrating modal synthesis and spectral modeling within a neural network framework. Our model leverages physical properties and fundamental frequencies as inputs, outputting string states across time and space that solve the partial differential equation characterizing the nonlinear string. Empirical evaluations demonstrate that the proposed architecture achieves superior accuracy in string motion simulation compared to existing baseline architectures. The code and demo are available online.
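At its core, modal synthesis approximates a vibrating string as a sum of damped sinusoidal modes whose frequencies and decay rates follow from the string's physical parameters. The sketch below illustrates that generic idea for a linear, slightly stiff string; it is a minimal illustration under assumed parameter names and values, not the paper's differentiable nonlinear model.

```python
import numpy as np

def modal_string(f0=110.0, n_modes=20, length=1.0, pickup=0.3,
                 decay=2.0, inharmonicity=1e-4, sr=16000, dur=1.0):
    """Minimal linear modal synthesis of a plucked string.

    Each mode m is a damped sinusoid; its frequency follows a simple
    stiff-string dispersion relation and its amplitude is set by the
    spatial mode shape evaluated at the pickup position.
    """
    t = np.arange(int(sr * dur)) / sr
    y = np.zeros_like(t)
    for m in range(1, n_modes + 1):
        # Stiff-string frequency: f_m = m * f0 * sqrt(1 + B * m^2)
        f_m = m * f0 * np.sqrt(1.0 + inharmonicity * m**2)
        # Mode shape sin(m * pi * x / L) at the pickup point, rolled off by 1/m
        amp = np.sin(m * np.pi * pickup / length) / m
        # Higher modes decay faster (a common simple choice)
        tau = decay / m
        y += amp * np.exp(-t / tau) * np.sin(2 * np.pi * f_m * t)
    return y / np.max(np.abs(y))

if __name__ == "__main__":
    signal = modal_string()
    print(signal.shape)  # (16000,): one second of audio at 16 kHz
```

Nonlinear effects such as pitch glide and mode coupling are not captured by this linear sum, which is exactly what the nonlinear-string PDE targeted in the paper addresses.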
Supplementary Materials
Finally, the data was subsampled by a factor of 2. Data augmentation: TX features were augmented by adding two types of artificial noise. Subsequently, random constant offsets (mean = 0, std = 0.6) were added to the feature means. Each session day has its own affine transform layer. RNN training hyperparameters: the hyperparameters for RNN training are listed in Table 1. It used a 130,000-word vocabulary taken from the CMU Pronouncing Dictionary [1]; out-of-vocabulary words were mapped to a special token.
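As a rough sketch of the augmentation described above, white noise and a random constant per-channel offset can be applied per trial as follows; the feature shapes, white-noise level, and function name are assumptions for illustration, and only the offset statistics (mean = 0, std = 0.6) are taken from the text.

```python
import numpy as np

def augment_features(x, noise_std=0.2, offset_std=0.6, rng=None):
    """Augment a (time, channels) feature matrix with two kinds of noise:
    i.i.d. white noise on every sample, plus a random constant offset
    (mean = 0, std = 0.6) added to each channel's mean.
    """
    rng = np.random.default_rng() if rng is None else rng
    white = rng.normal(0.0, noise_std, size=x.shape)
    offsets = rng.normal(0.0, offset_std, size=(1, x.shape[1]))
    return x + white + offsets

# Example: one trial of 100 time steps x 256 feature channels
trial = np.zeros((100, 256))
print(augment_features(trial).shape)  # (100, 256)
```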
AnnoPage Dataset: Dataset of Non-Textual Elements in Documents with Fine-Grained Categorization
Kišš, Martin, Hradiš, Michal, Dvořáková, Martina, Jiroušek, Václav, Kersch, Filip
We introduce the AnnoPage Dataset, a novel collection of 7,550 pages from historical documents, primarily in Czech and German, spanning from 1485 to the present and focusing on the late 19th and early 20th centuries. The dataset is designed to support research in document layout analysis and object detection. Each page is annotated with axis-aligned bounding boxes (AABBs) covering 25 categories of non-textual elements, such as images, maps, decorative elements, or charts, following the Czech Methodology of image document processing. The annotations were created by expert librarians to ensure accuracy and consistency. The dataset also incorporates pages from multiple, mainly historical, document datasets to enhance variability and maintain continuity. The dataset is divided into development and test subsets, with the test set carefully selected to maintain the category distribution. We provide baseline results using YOLO and DETR object detectors, offering a reference point for future research.
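As context for the YOLO and DETR baselines, detection quality against axis-aligned boxes is conventionally measured with intersection-over-union (IoU); the snippet below is a generic illustration, and the [x_min, y_min, x_max, y_max] box convention is an assumption rather than the dataset's published annotation format.

```python
def aabb_iou(a, b):
    """Intersection-over-union of two axis-aligned bounding boxes.

    Boxes are given as [x_min, y_min, x_max, y_max] in pixels.
    """
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(aabb_iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 0.142857...
```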
b83bea9688047be30f54034c55716854-Supplemental-Datasets_and_Benchmarks_Track.pdf
In addition, users may become overly dependent on the model's outputs. For the feedback, we ask the participant: "Please consider the quality of the model's output and give a score (1-5); 1 means its quality is bad, and 5 means its quality is very good." The interface of the user study is shown in Fig. A1, and we report the average scores in the accompanying table. We have a total of 1.1M training examples in FIRE. In Fig. A2, we present the curves of AT, ATR, and RR obtained with different amounts of training data. The results show that more data leads to better performance, which again demonstrates the quality of the data in FIRE.
LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation
Stojnić, Vladan, Kalantidis, Yannis, Matas, Jiří, Tolias, Giorgos
We propose a training-free method for open-vocabulary semantic segmentation using Vision-and-Language Models (VLMs). Our approach enhances the initial per-patch predictions of VLMs through label propagation, which jointly optimizes predictions by incorporating patch-to-patch relationships. Since VLMs are primarily optimized for cross-modal alignment and not for intra-modal similarity, we use a Vision Model (VM) that is observed to better capture these relationships. We address resolution limitations inherent to patch-based encoders by applying label propagation at the pixel level as a refinement step, significantly improving segmentation accuracy near class boundaries. Our method, called LPOSS+, performs inference over the entire image, avoiding window-based processing and thereby capturing contextual interactions across the full image. LPOSS+ achieves state-of-the-art performance among training-free methods, across a diverse set of datasets. Code: https://github.com/vladan-stojnic/LPOSS
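Label propagation in this setting follows the classical graph-diffusion recipe: per-patch class scores are repeatedly smoothed over a normalized patch-to-patch affinity matrix while being anchored to the initial VLM predictions. The sketch below shows that generic iteration; the affinity construction, the alpha value, and the function name are assumptions, not LPOSS+'s exact formulation.

```python
import numpy as np

def label_propagation(affinity, init_scores, alpha=0.9, n_iter=50):
    """Propagate per-node class scores over a similarity graph.

    affinity    : (N, N) non-negative patch-to-patch similarities
    init_scores : (N, C) initial per-patch class predictions (e.g. from a VLM)
    alpha       : how much to trust the graph vs. the initial scores
    """
    # Symmetrically normalize the affinity matrix: S = D^{-1/2} W D^{-1/2}
    d = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    s = affinity * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    scores = init_scores.copy()
    for _ in range(n_iter):
        scores = alpha * s @ scores + (1.0 - alpha) * init_scores
    return scores.argmax(axis=1)

# Toy example: 6 patches, 3 classes, random symmetric similarities
rng = np.random.default_rng(0)
w = rng.random((6, 6)); w = (w + w.T) / 2
y0 = rng.random((6, 3))
print(label_propagation(w, y0))
```

The same update can be run at pixel resolution with pixel-level affinities, which is the refinement role that pixel-level propagation plays in the method described above.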
Solving Sparse & High-Dimensional-Output Regression via Compression
Multi-Output Regression (MOR) has been widely used in scientific data analysis for decision-making. Unlike traditional regression models, MOR aims to simultaneously predict multiple real-valued outputs given an input. However, the increasing dimensionality of the outputs poses significant challenges regarding interpretability and computational scalability for modern MOR applications. As a first step to address these challenges, this paper proposes a Sparse & High-dimensional-Output REgression (SHORE) model that incorporates additional sparsity requirements to improve output interpretability, and then designs a computationally efficient two-stage optimization framework capable of solving SHORE with provable accuracy via compression on outputs. Theoretically, we show that the proposed framework is computationally scalable while maintaining the same order of training loss and prediction loss before and after compression under arbitrary or relatively weak sample set conditions. Empirically, numerical results further validate the theoretical findings, showcasing the efficiency and accuracy of the proposed framework.
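The compress-then-regress idea can be illustrated with a generic random-projection sketch: projecting the high-dimensional targets to a much lower dimension before fitting leaves the least-squares training loss of the same order. The dimensions, projection choice, and omission of the sparse-recovery step below are assumptions for illustration, not the SHORE algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, m = 200, 30, 1000, 64      # samples, input dim, output dim, compressed dim

# Synthetic data with sparse high-dimensional outputs
X = rng.normal(size=(n, p))
W_true = np.zeros((p, q))
cols = rng.choice(q, 20, replace=False)
W_true[:, cols] = rng.normal(size=(p, 20))
Y = X @ W_true + 0.01 * rng.normal(size=(n, q))

# Stage 1: compress outputs with a Johnson-Lindenstrauss-style random projection
Phi = rng.normal(size=(m, q)) / np.sqrt(m)
Y_c = Y @ Phi.T                                    # (n, m) compressed targets

# Stage 2: fit the (much smaller) regression on compressed targets;
# a sparse recovery / decoding step on the outputs would follow in practice.
W_full, *_ = np.linalg.lstsq(X, Y, rcond=None)     # reference: full-output fit
W_comp, *_ = np.linalg.lstsq(X, Y_c, rcond=None)   # compressed-output fit

loss_full = np.linalg.norm(X @ W_full - Y) ** 2 / n
loss_comp = np.linalg.norm(X @ W_comp - Y_c) ** 2 / n
print(loss_full, loss_comp)  # compressed training loss tracks the full one in order of magnitude
```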
Supplementary Materials for MAViL: Masked Audio-Video Learners
These results are obtained using the stage-1 MAViL decoders. In Appendix D, we discuss MAViL's societal impact and limitations. Figure 1: Video clip and spectrogram reconstruction on the AudioSet eval set. We sample 4 paired (video, audio) examples as follows: top left: a puppy video; top right: a recording from an ambulance's dash camera; bottom left: a person dialing a phone in a dark room; bottom right: a singer dancing. In each 3-row group, we show the original video and its audio spectrogram (top), the masked input to MAViL (middle), and MAViL's video and audio spectrogram reconstructions (bottom). The spectrogram shape is 1024 × 128; the patch size is 16 × 16.
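The masking setup referenced above (a 1024 × 128 spectrogram cut into 16 × 16 patches, most of which are hidden from the encoder) can be sketched generically as follows; the masking ratio and the helper name are illustrative assumptions rather than MAViL's exact recipe.

```python
import numpy as np

def patchify_and_mask(spec, patch=16, mask_ratio=0.8, rng=None):
    """Split a (T, F) spectrogram into non-overlapping patch x patch tiles
    and randomly hide a fraction of them, as in masked-autoencoder training.
    Returns flattened patches, indices of visible patches, and the mask.
    """
    rng = np.random.default_rng() if rng is None else rng
    t, f = spec.shape
    nt, nf = t // patch, f // patch
    patches = (spec[:nt * patch, :nf * patch]
               .reshape(nt, patch, nf, patch)
               .transpose(0, 2, 1, 3)
               .reshape(nt * nf, patch * patch))
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep = rng.permutation(n)[:n_keep]
    mask = np.ones(n, dtype=bool)
    mask[keep] = False            # True where the patch is masked out
    return patches, keep, mask

spec = np.random.randn(1024, 128)             # 1024 time frames x 128 mel bins
patches, keep, mask = patchify_and_mask(spec)
print(patches.shape, keep.shape, mask.sum())  # (512, 256) (102,) 410
```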
Comparative Analysis of Deep Learning Models for Real-World ISP Network Traffic Forecasting
Koumar, Josef, Smoleň, Timotej, Jeřábek, Kamil, Čejka, Tomáš
Traffic monitoring is a cornerstone of effective network management and cybersecurity, providing Internet Service Providers (ISPs) with critical insights to detect anomalies, mitigate congestion, and maintain network performance [1]. The surge in video streaming, cloud computing, and online gaming is driving rapid growth in internet usage, contributing to increasingly complex and less predictable network traffic. Efficient network monitoring allows ISPs to maintain service quality, mitigate security risks, and optimize bandwidth in real time [2]. However, real-time monitoring alone is insufficient for proactively managing network resources. To anticipate variations in demand and prevent service disruptions, ISPs increasingly adopt advanced forecasting techniques to predict traffic patterns and optimize resource allocation in advance [3]. Accurate traffic forecasting allows ISPs to efficiently allocate resources, scale network capacity, and sustain service quality under fluctuating loads [3]. The rise of diverse, high-bandwidth services has significantly increased network traffic variability. Traditional models like ARIMA and exponential smoothing, which assume linearity, struggle with ISP data due to prevalent non-linear and high-frequency fluctuations, especially during peak traffic hours [4]. These limitations have driven the adoption of deep learning models, particularly neural networks, which excel at capturing complex temporal dependencies across various forecasting domains [5].
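As a generic illustration of the kind of neural forecaster evaluated in such comparisons (the architecture, window length, and hyperparameters below are illustrative assumptions, not this paper's configuration), a minimal sliding-window LSTM forecaster in PyTorch looks as follows.

```python
import torch
import torch.nn as nn

class TrafficLSTM(nn.Module):
    """Predict the next traffic value from a fixed window of past values."""
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, window, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])   # one-step-ahead forecast

# Toy usage: windows of 48 past samples -> next-step forecast
model = TrafficLSTM()
window = torch.randn(32, 48, 1)           # batch of 32 normalized traffic windows
forecast = model(window)
print(forecast.shape)                     # torch.Size([32, 1])
```

In practice such a model is trained on sliding windows cut from the (normalized) traffic series, which is how recurrent forecasters capture the non-linear, high-frequency fluctuations that linear models like ARIMA miss.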