time axis
… about real-world experiments and deep density models, and then answer detailed comments and questions.
We thank the reviewers for their very helpful comments and suggestions. The 100% recall but very poor precision (i.e., it always predicts a shift) is expected. Marginal-KS performs very badly because the attack model is very strong, i.e., it mimics the marginal distribution of the data; marginal KS will therefore naturally fail, highlighting the limitation of prior work under this adversarial attack. For bootstrapping, does the model need to be fit multiple times? For a Gaussian, this is fairly simple. For the detection stage, the FDR was controlled below 0.05 in all settings. See also Tables 6 and 7 in the appendix.
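As context for the Marginal-KS discussion above, here is a minimal sketch of a per-feature two-sample KS shift detector with Benjamini-Hochberg FDR control at 0.05. All names are illustrative and this is not the paper's exact procedure; the toy usage shows why an attack that matches the marginals defeats such a test.

```python
import numpy as np
from scipy.stats import ks_2samp

def marginal_ks_shift_test(X_ref, X_test, alpha=0.05):
    """Two-sample KS test per feature with Benjamini-Hochberg FDR control.

    Returns True if any feature is flagged as shifted after correction.
    X_ref, X_test: arrays of shape (n_samples, n_features).
    """
    pvals = np.array([ks_2samp(X_ref[:, j], X_test[:, j]).pvalue
                      for j in range(X_ref.shape[1])])
    m = len(pvals)
    # Benjamini-Hochberg: reject if any sorted p_(k) <= (k/m) * alpha.
    thresholds = alpha * np.arange(1, m + 1) / m
    return bool((np.sort(pvals) <= thresholds).any())

# Toy usage: an attack that preserves marginals but breaks the dependence
# structure is invisible to marginal KS.
rng = np.random.default_rng(0)
X_ref = rng.normal(size=(500, 2))
X_adv = np.column_stack([X_ref[:, 0], rng.permutation(X_ref[:, 1])])
print(marginal_ks_shift_test(X_ref, X_adv))  # False: the shift goes undetected
```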
ROSAnnotator: A Web Application for ROSBag Data Analysis in Human-Robot Interaction
Zhang, Yan, Li, Haoqi, Tabatabaei, Ramtin, Johal, Wafa
Human-robot interaction (HRI) is an interdisciplinary field that utilises both quantitative and qualitative methods. While ROSBags, a file format within the Robot Operating System (ROS), offer an efficient means of collecting temporally synchronized multimodal data in empirical studies with real robots, there is a lack of tools specifically designed to integrate qualitative coding and analysis functions with ROSBags. To address this gap, we developed ROSAnnotator, a web-based application that incorporates a multimodal Large Language Model (LLM) to support both manual and automated annotation of ROSBag data. ROSAnnotator currently facilitates video, audio, and transcription annotations and provides an open interface for custom ROS messages and tools. By using ROSAnnotator, researchers can streamline the qualitative analysis process, create a more cohesive analysis pipeline, and quickly access statistical summaries of annotations, thereby enhancing the overall efficiency of HRI data analysis. https://github.com/CHRI-Lab/ROSAnnotator
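As a pointer to the kind of input ROSAnnotator handles, here is a minimal sketch that iterates over a recorded bag with the standard ROS 1 rosbag Python API. This is not ROSAnnotator code; the bag path and topic names are placeholders.

```python
import rosbag  # ROS 1 Python API; requires a ROS environment

# Iterate over timestamped messages on selected topics in a recorded bag.
# 'session.bag' and the topic names are placeholders.
with rosbag.Bag('session.bag') as bag:
    for topic, msg, t in bag.read_messages(topics=['/audio', '/camera/image_raw']):
        print(topic, t.to_sec())  # temporally synchronized multimodal streams
```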
TripCast: Pre-training of Masked 2D Transformers for Trip Time Series Forecasting
Liao, Yuhua, Wang, Zetian, Wei, Peng, Nie, Qiangqiang, Zhang, Zhenhua
Deep learning and pre-trained models have shown great success in time series forecasting. However, in the tourism industry, time series data often exhibit a leading-time property, giving them a 2D structure. This introduces unique challenges for forecasting in this sector. In this study, we propose a novel modelling paradigm, TripCast, which treats trip time series as 2D data and learns representations through masking and reconstruction processes. Pre-trained on large-scale real-world data, TripCast notably outperforms other state-of-the-art baselines in in-domain forecasting scenarios and demonstrates strong scalability and transferability in out-domain forecasting scenarios.
- North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
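A minimal sketch of the masking-and-reconstruction pre-training idea on a 2D trip time series, assuming grids of event dates by leading times; the tiny MLP below merely stands in for TripCast's 2D Transformer, and all shapes are illustrative.

```python
import torch

# 32 grids of 64 event dates x 28 leading times (values are synthetic).
x = torch.randn(32, 64, 28)
mask = torch.rand_like(x) < 0.3        # randomly mask 30% of cells
x_in = x.masked_fill(mask, 0.0)        # masked input fed to the encoder

model = torch.nn.Sequential(           # placeholder for the real 2D Transformer
    torch.nn.Linear(28, 128), torch.nn.ReLU(), torch.nn.Linear(128, 28))
x_hat = model(x_in)

# Reconstruction loss is computed on the masked positions only.
loss = ((x_hat - x)[mask] ** 2).mean()
loss.backward()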
Improving Real-Time Music Accompaniment Separation with MMDenseNet
Wang, Chun-Hsiang, Wang, Chung-Che, Wang, Jun-You, Jang, Jyh-Shing Roger, Chu, Yen-Hsun
Music source separation aims to separate polyphonic music into different types of sources. Most existing methods focus on enhancing the quality of separated results by using larger model structures, rendering them unsuitable for deployment on edge devices. Moreover, these methods may produce low-quality output when the input duration is short, making them impractical for real-time applications. Therefore, the goal of this paper is to enhance a lightweight model, MMDenseNet, to strike a balance between separation quality and latency for real-time applications. Several directions of improvement are explored or proposed, including the complex ideal ratio mask, self-attention, a band-merge-split method, and feature look-back. Source-to-distortion ratio, real-time factor, and optimal latency are employed to evaluate performance. To align with our application requirements, the evaluation focuses on the separation performance of the accompaniment part. Experimental results demonstrate that our improvements achieve a low real-time factor and optimal latency while maintaining acceptable separation quality.
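For the complex ideal ratio mask mentioned above, here is a minimal numpy sketch of its standard definition and the key property that applying it to the mixture recovers the target. In practice a network predicts (a compressed form of) the mask rather than computing it from ground truth; the shapes are illustrative.

```python
import numpy as np

def complex_ideal_ratio_mask(S, X, eps=1e-8):
    """Complex ideal ratio mask: M = S / X in the complex STFT domain.

    S: target (e.g., accompaniment) STFT; X: mixture STFT, same shape.
    """
    return S * np.conj(X) / (np.abs(X) ** 2 + eps)

# Toy check on random complex spectrograms: the mask reconstructs the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(513, 100)) + 1j * rng.normal(size=(513, 100))
S = 0.5 * X + 0.1 * (rng.normal(size=X.shape) + 1j * rng.normal(size=X.shape))
M = complex_ideal_ratio_mask(S, X)
print(np.allclose(M * X, S, atol=1e-4))  # True
```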
Spectral Mapping of Singing Voices: U-Net-Assisted Vocal Segmentation
Separating vocal elements from musical tracks is a longstanding challenge in audio signal processing. This study tackles the separation of vocal components from musical spectrograms. We employ the Short-Time Fourier Transform (STFT) to convert audio waveforms into detailed frequency-time spectrograms, utilizing the benchmark MUSDB18 dataset for music separation. We then implement a U-Net neural network to segment the spectrogram image, aiming to delineate and extract singing-voice components accurately. We achieved noteworthy results in audio source separation using our U-Net-based models. The combination of frequency-axis normalization with Min/Max scaling and the Mean Absolute Error (MAE) loss function achieved the highest Source-to-Distortion Ratio (SDR) of 7.1 dB, indicating a high level of accuracy in preserving the quality of the original signal during separation. This setup also recorded impressive Source-to-Interference Ratio (SIR) and Source-to-Artifact Ratio (SAR) scores of 25.2 dB and 7.2 dB, respectively. These values significantly outperformed other configurations, particularly those using quantile-based normalization or a Mean Squared Error (MSE) loss function. Our source code, model weights, and demo material can be found at the project's GitHub repository: https://github.com/mbrotos/SoundSeg
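A sketch of the preprocessing described above: an STFT magnitude spectrogram with per-bin Min/Max scaling, which is one plausible reading of "frequency-axis normalization". Parameter values and the test signal are illustrative, not the paper's configuration.

```python
import numpy as np
import librosa

sr = 22050
y = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # 1 s test tone
spec = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

# Min/Max scaling per frequency bin, so each bin's values lie in [0, 1].
lo = spec.min(axis=1, keepdims=True)
hi = spec.max(axis=1, keepdims=True)
spec_norm = (spec - lo) / (hi - lo + 1e-8)

pred = spec_norm * 0.9   # stands in for the U-Net's predicted vocal spectrogram
mae = np.mean(np.abs(pred - spec_norm))                   # MAE training loss
```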
Timeline-based Process Discovery
Kaur, Harleen, Mendling, Jan, Rubensson, Christoffer, Kampik, Timotheus
A key concern of automatic process discovery is to provide insights into performance aspects of business processes. Waiting times are of particular importance in this context. For that reason, it is surprising that current techniques for automatic process discovery generate directly-follows graphs and comparable process models, but often miss the opportunity to explicitly represent the time axis. In this paper, we present an approach for automatically constructing process models that explicitly align with a time axis. We exemplify our approach for directly-follows graphs. Our evaluation using two BPIC datasets and a proprietary dataset highlights the benefits of this representation in comparison to standard layout techniques.
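A minimal sketch of the core idea, assuming an event log with case, activity, and timestamp columns: each activity node is placed at its mean elapsed time from case start, so horizontal distances in the resulting directly-follows graph reflect waiting times. The log below is a toy example, not one of the evaluated datasets.

```python
import pandas as pd

log = pd.DataFrame({
    "case": [1, 1, 1, 2, 2, 2],
    "activity": ["register", "review", "pay"] * 2,
    "timestamp": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 10:30", "2024-01-02 09:00",
        "2024-01-03 08:00", "2024-01-03 12:00", "2024-01-05 08:00"]),
}).sort_values(["case", "timestamp"])

# Elapsed hours from the start of each case.
start = log.groupby("case")["timestamp"].transform("min")
log["elapsed_h"] = (log["timestamp"] - start).dt.total_seconds() / 3600

# Time-axis position of each node: mean elapsed time of the activity.
positions = log.groupby("activity")["elapsed_h"].mean()

# Directly-follows edges within each case.
nxt = log.groupby("case")["activity"].shift(-1)
edges = {(a, b) for a, b in zip(log["activity"], nxt) if pd.notna(b)}
print(positions.sort_values(), edges)
```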
V2CE: Video to Continuous Events Simulator
Zhang, Zhongyang, Cui, Shuyang, Chai, Kaidong, Yu, Haowen, Dasgupta, Subhasis, Mahbub, Upal, Rahman, Tauhidur
Dynamic Vision Sensor (DVS)-based solutions have recently garnered significant interest across various computer vision tasks, offering notable benefits in terms of dynamic range, temporal resolution, and inference speed. However, as a relatively nascent vision sensor compared to Active Pixel Sensor (APS) devices such as RGB cameras, DVS suffers from a dearth of ample labeled datasets. Prior efforts to convert APS data into events often grapple with issues such as a considerable domain shift from real events, the absence of quantified validation, and layering problems within the time axis. In this paper, we present a novel method for video-to-events stream conversion from multiple perspectives, considering the specific characteristics of DVS. A series of carefully designed losses helps enhance the quality of generated event voxels significantly. We also propose a novel local dynamic-aware timestamp inference strategy to accurately recover event timestamps from event voxels in a continuous fashion and eliminate the temporal layering problem. Results from rigorous validation through quantified metrics at all stages of the pipeline establish our method as the current state-of-the-art (SOTA).
- North America > United States > Massachusetts > Hampshire County > Amherst (0.14)
- North America > United States > California > San Diego County > San Diego (0.05)
- North America > United States > California > Santa Clara County > San Jose (0.04)
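To illustrate the timestamp-recovery task above, here is a naive baseline that spreads each voxel bin's events uniformly in time. V2CE's local dynamic-aware strategy is more sophisticated than this sketch; the function and its parameters are illustrative.

```python
import numpy as np

def naive_voxel_to_events(voxel, t0=0.0, dt=1e-3, rng=None):
    """Recover per-event timestamps from an event voxel grid.

    voxel: int array (T_bins, H, W) of event counts per time bin and pixel.
    Events are jittered uniformly inside their bin; this avoids hard
    layering at bin boundaries but ignores local dynamics.
    Returns an (N, 3) array of (t, y, x) events.
    """
    rng = rng or np.random.default_rng(0)
    events = []
    for b, y, x in zip(*np.nonzero(voxel)):
        n = voxel[b, y, x]
        ts = t0 + (b + rng.random(n)) * dt   # uniform within the bin
        events.extend((t, y, x) for t in np.sort(ts))
    return np.array(events)

voxel = np.random.default_rng(1).poisson(0.2, size=(10, 4, 4))
print(naive_voxel_to_events(voxel).shape)
```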
Automatic Piano Transcription with Hierarchical Frequency-Time Transformer
Toyama, Keisuke, Akama, Taketo, Ikemiya, Yukara, Takida, Yuhta, Liao, Wei-Hsiang, Mitsufuji, Yuki
Taking long-term spectral and temporal dependencies into account is essential for automatic piano transcription. This is especially helpful when determining the precise onset and offset for each note in polyphonic piano content. In this case, we may rely on the capability of the self-attention mechanism in Transformers to capture these long-term dependencies in the frequency and time axes. In this work, we propose hFT-Transformer, an automatic music transcription method that uses a two-level hierarchical frequency-time Transformer architecture. The first hierarchy includes a convolutional block in the time axis, a Transformer encoder in the frequency axis, and a Transformer decoder that converts the dimension in the frequency axis. The output is then fed into the second hierarchy, which consists of another Transformer encoder in the time axis. We evaluated our method on the widely used MAPS and MAESTRO v3.0.0 datasets, and it achieved state-of-the-art F1-scores across all metrics: Frame, Note, Note with Offset, and Note with Offset and Velocity estimation.
- Media > Music (0.94)
- Leisure & Entertainment (0.94)
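A compact PyTorch skeleton of the two-level hierarchy described above. All dimensions are illustrative; positional encodings, the time-axis convolutional block, and the output heads are stubbed or omitted.

```python
import torch
import torch.nn as nn

B, T, F, D = 2, 128, 256, 64           # batch, time frames, freq bins, model dim
x = torch.randn(B, T, F, D)            # stands in for time-axis conv-block output

# First hierarchy: Transformer encoder along the frequency axis.
enc_f = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
h = enc_f(x.reshape(B * T, F, D))      # attend across frequency per frame

# Decoder that converts the frequency dimension (e.g., bins -> 88 pitches).
pitch_queries = torch.randn(B * T, 88, D)
dec = nn.TransformerDecoderLayer(d_model=D, nhead=4, batch_first=True)
h = dec(pitch_queries, h)              # (B*T, 88, D)

# Second hierarchy: Transformer encoder along the time axis.
enc_t = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
out = enc_t(h.reshape(B, T, 88, D).permute(0, 2, 1, 3).reshape(B * 88, T, D))
out = out.reshape(B, 88, T, D)         # per-pitch, per-frame representations
```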
fMRI Multiple Missing Values Imputation Regularized by a Recurrent Denoiser
Functional Magnetic Resonance Imaging (fMRI) is a neuroimaging technique of pivotal importance due to its scientific and clinical applications. As with any widely used imaging modality, there is a need to ensure its data quality, and missing values are highly frequent due to the presence of artifacts or sub-optimal imaging resolutions. Our work focuses on missing-value imputation for multivariate signal data. To this end, a new imputation method is proposed, consisting of two major steps: spatially dependent signal imputation and time-dependent regularization of the imputed signal. A novel layer, to be used in deep learning architectures, is proposed in this work, bringing back the concept of chained equations for multiple imputation. Finally, a recurrent layer is applied to tune the signal so that it captures its true patterns. Together, both operations yield improved robustness against state-of-the-art alternatives.
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Portugal > Lisbon > Lisbon (0.04)
- Asia > Middle East > Jordan (0.04)
- Health & Medicine > Health Care Technology (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
- Health & Medicine > Therapeutic Area > Neurology (0.66)
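A minimal sketch of the two-step shape of the approach: chained-equations imputation across regions, followed by time-dependent smoothing. scikit-learn's IterativeImputer stands in for the proposed chained-equations layer, and a moving average stands in for the learned recurrent denoiser; the data is synthetic.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic time x regions signal matrix with 10% missing values.
X = np.random.default_rng(0).normal(size=(200, 30))
X[np.random.default_rng(1).random(X.shape) < 0.1] = np.nan

# Step 1: spatially dependent imputation via chained equations.
X_imp = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)

# Step 2: time-dependent regularization (placeholder for the recurrent denoiser).
kernel = np.ones(5) / 5
X_smooth = np.apply_along_axis(
    lambda s: np.convolve(s, kernel, mode="same"), 0, X_imp)
```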
Mel-spectrogram augmentation for sequence to sequence voice conversion
Hwang, Yeongtae, Cho, Hyemin, Yang, Hongsun, Oh, Insoo, Lee, Seong-Whan
When training a sequence-to-sequence voice conversion model, we must handle the issue of insufficient data, i.e., the limited number of speech tuples containing the same utterance. This study experimentally investigated the effects of Mel-spectrogram augmentation on the sequence-to-sequence voice conversion model. For Mel-spectrogram augmentation, we adopted the policies proposed in SpecAugment. In addition, we propose new policies for more data variation. To find the optimal hyperparameters of augmentation policies for voice conversion, we experimented with a new metric, the deformation per deteriorating ratio. We observed their effects through experiments with various training-set sizes and combinations of augmentation policies. In the experimental results, the time-axis warping based policies showed better performance than the other policies.
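A minimal numpy sketch of SpecAugment-style policies on a mel-spectrogram, including a simplified global time-axis stretch (SpecAugment's actual warp is a local sparse-image warp). All policy parameters are illustrative, not the paper's tuned values.

```python
import numpy as np

def time_warp(mel, rng, max_ratio=0.1):
    """Naive time-axis warp: resample frames with a random stretch factor."""
    T = mel.shape[1]
    ratio = 1.0 + rng.uniform(-max_ratio, max_ratio)
    src = np.linspace(0, T - 1, num=int(T * ratio))
    return np.stack([np.interp(src, np.arange(T), row) for row in mel])

def mask(mel, rng, max_t=20, max_f=8):
    """Time and frequency masking on a (freq_bins x frames) mel-spectrogram."""
    mel = mel.copy()
    t0 = rng.integers(0, mel.shape[1] - max_t)
    mel[:, t0:t0 + rng.integers(1, max_t)] = mel.mean()   # time mask
    f0 = rng.integers(0, mel.shape[0] - max_f)
    mel[f0:f0 + rng.integers(1, max_f), :] = mel.mean()   # frequency mask
    return mel

rng = np.random.default_rng(0)
mel = rng.random((80, 200))                                # 80 mel bins, 200 frames
aug = mask(time_warp(mel, rng), rng)
```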