Takida, Yuhta
Automatic Piano Transcription with Hierarchical Frequency-Time Transformer
Toyama, Keisuke, Akama, Taketo, Ikemiya, Yukara, Takida, Yuhta, Liao, Wei-Hsiang, Mitsufuji, Yuki
Taking long-term spectral and temporal dependencies into account is essential for automatic piano transcription. This is especially helpful when determining the precise onset and offset of each note in polyphonic piano content. In this case, we may rely on the capability of the self-attention mechanism in Transformers to capture these long-term dependencies along the frequency and time axes. In this work, we propose hFT-Transformer, an automatic music transcription method that uses a two-level hierarchical frequency-time Transformer architecture. The first hierarchy includes a convolutional block in the time axis, a Transformer encoder in the frequency axis, and a Transformer decoder that converts the dimension in the frequency axis. The output is then fed into the second hierarchy, which consists of another Transformer encoder in the time axis. We evaluated our method on the widely used MAPS and MAESTRO v3.0.0 datasets, and it demonstrated state-of-the-art performance on the F1-scores of all metrics: Frame, Note, Note with Offset, and Note with Offset and Velocity.
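The two-level data flow described in this abstract can be pictured by tracing tensor shapes through each stage. The following is only a rough sketch: the function name `hft_shape_trace`, all default sizes, the per-frame context window, the use of one decoder query per pitch, and the final output head are illustrative assumptions, not details given in the abstract.

```python
def hft_shape_trace(n_frames=128, n_bins=256, context=65, d_model=256, n_pitches=88):
    """Return (stage, shape) pairs for a hypothetical hFT-style pipeline."""
    stages = []

    # Input: one spectrogram patch per output frame, with `context` neighbouring
    # frames on the time axis (assumed layout).
    stages.append(("input patches", (n_frames, n_bins, context)))

    # Hierarchy 1, step 1: a convolutional block along the time axis turns each
    # bin's context window into a d_model-dimensional feature vector.
    stages.append(("time-axis conv block", (n_frames, n_bins, d_model)))

    # Hierarchy 1, step 2: Transformer encoder; self-attention runs over the
    # n_bins positions of each frame, so the shape is unchanged.
    stages.append(("frequency-axis encoder", (n_frames, n_bins, d_model)))

    # Hierarchy 1, step 3: Transformer decoder; assuming one query per pitch,
    # the frequency dimension changes from n_bins to n_pitches.
    stages.append(("frequency-axis decoder", (n_frames, n_pitches, d_model)))

    # Hierarchy 2: Transformer encoder; self-attention runs over the n_frames
    # positions of each pitch, so the sequence axis is now time.
    stages.append(("time-axis encoder", (n_pitches, n_frames, d_model)))

    # Assumed output head: one score per (frame, pitch) cell.
    stages.append(("per-frame pitch scores", (n_frames, n_pitches)))
    return stages


for name, shape in hft_shape_trace():
    print(f"{name:28s} {shape}")
```

The point of the trace is that attention is applied along one axis at a time: first across frequency within a frame, then across time within a pitch.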
GibbsDDRM: A Partially Collapsed Gibbs Sampler for Solving Blind Inverse Problems with Denoising Diffusion Restoration
Murata, Naoki, Saito, Koichi, Lai, Chieh-Hsin, Takida, Yuhta, Uesaka, Toshimitsu, Mitsufuji, Yuki, Ermon, Stefano
Pre-trained diffusion models have been successfully used as priors in a variety of linear inverse problems, where the goal is to reconstruct a signal from noisy linear measurements. However, existing approaches require knowledge of the linear operator. In this paper, we propose GibbsDDRM, an extension of Denoising Diffusion Restoration Models (DDRM) to a blind setting in which the linear measurement operator is unknown. GibbsDDRM constructs a joint distribution of the data, measurements, and linear operator by using a pre-trained diffusion model for the data prior, and it solves the problem by posterior sampling with an efficient variant of a Gibbs sampler. The proposed method is problem-agnostic, meaning that a pre-trained diffusion model can be applied to various inverse problems without fine-tuning. In experiments, it achieved high performance on both blind image deblurring and vocal dereverberation tasks, despite the use of simple generic priors for the underlying linear operators.
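To make the idea of alternately sampling the data and the unknown operator concrete, here is a toy sketch of a plain two-block Gibbs sampler. It is not the partially collapsed sampler of the paper: the operator is assumed to be a single scalar gain `h`, the diffusion prior is replaced by a simple Gaussian prior on each `x_i`, and the function name and all parameter values are hypothetical.

```python
import math
import random


def gibbs_blind_scalar(y, n_iters=200, noise_var=0.01, x_var=1.0,
                       h_mean=1.0, h_var=0.25, seed=0):
    """Toy blind inverse problem: y_i = h * x_i + noise, with h and x unknown.

    Assumed priors: x_i ~ N(0, x_var), h ~ N(h_mean, h_var).
    Returns the last sample of x and the full trace of h samples.
    """
    rng = random.Random(seed)
    h = h_mean
    x = [0.0] * len(y)
    h_trace = []
    for _ in range(n_iters):
        # Block 1: sample each x_i from its Gaussian conditional given h and y_i.
        prec_x = 1.0 / x_var + h * h / noise_var
        std_x = math.sqrt(1.0 / prec_x)
        x = [rng.gauss((h * yi / noise_var) / prec_x, std_x) for yi in y]

        # Block 2: sample h from its Gaussian conditional given x and y.
        prec_h = 1.0 / h_var + sum(xi * xi for xi in x) / noise_var
        mean_h = (h_mean / h_var
                  + sum(xi * yi for xi, yi in zip(x, y)) / noise_var) / prec_h
        h = rng.gauss(mean_h, math.sqrt(1.0 / prec_h))
        h_trace.append(h)
    return x, h_trace
```

Calling `gibbs_blind_scalar([0.5, -1.0, 2.0, 0.3])` returns one posterior sample of `x` and the sampled values of `h`. In this toy model both conditionals are Gaussian and available in closed form; the prior on `h` centered at 1 is what softens the scale ambiguity between `h` and `x`.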
FP-Diffusion: Improving Score-based Diffusion Models by Enforcing the Underlying Score Fokker-Planck Equation
Lai, Chieh-Hsin, Takida, Yuhta, Murata, Naoki, Uesaka, Toshimitsu, Mitsufuji, Yuki, Ermon, Stefano
Score-based generative models (SGMs) learn a family of noise-conditional score functions corresponding to the data density perturbed with increasingly large amounts of noise. These perturbed data densities are linked together by the Fokker-Planck equation (FPE), a partial differential equation (PDE) governing the spatial-temporal evolution of a density undergoing a diffusion process. In this work, we derive a corresponding equation called the score FPE that characterizes the noise-conditional scores of the perturbed data densities (i.e., their gradients). Surprisingly, despite the impressive empirical performance, ...
An SGM involves a stochastic forward and backward process. In the forward process, also known as the diffusion process, noise with gradually increasing variances is added to each data point until the original structure is lost, transforming data into pure noise. The backward process attempts to reverse the diffusion process by using a neural network (called a noise-conditional score model) that is trained to gradually denoise the data, effectively transforming pure noise into clean data samples. The neural network is trained with a denoising score matching objective (Hyvärinen & Dayan, 2005; Vincent, 2011) to estimate the score (i.e., the gradient of the log-likelihood function) of the data density perturbed with various amounts of noise (as in forward process).
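For reference, the Fokker-Planck equation and the score equation it implies can be written out as follows. This is a sketch under an assumed forward SDE of the form $dx = f(x,t)\,dt + g(t)\,dw$; the symbols $f$, $g$, $q_t$, and $s$ are notation introduced here and do not appear in the text above.

```latex
% Assumed forward SDE: dx = f(x,t) dt + g(t) dw, with marginal density q_t(x).
\begin{align}
  % Fokker-Planck equation for the perturbed density
  \partial_t q_t(x)
    &= -\nabla_x \cdot \bigl( f(x,t)\, q_t(x) \bigr)
       + \tfrac{1}{2}\, g(t)^2\, \Delta_x q_t(x), \\
  % Divide by q_t and use  \Delta q / q = \Delta \log q + \|\nabla \log q\|^2
  \partial_t \log q_t(x)
    &= \tfrac{1}{2}\, g(t)^2 \bigl( \nabla_x \cdot s + \| s \|_2^2 \bigr)
       - \langle f, s \rangle - \nabla_x \cdot f,
       \qquad s(x,t) := \nabla_x \log q_t(x), \\
  % Taking the gradient in x gives an equation in the score alone
  \partial_t s(x,t)
    &= \nabla_x \Bigl[ \tfrac{1}{2}\, g(t)^2\, \nabla_x \cdot s
       + \tfrac{1}{2}\, g(t)^2\, \| s \|_2^2
       - \langle f, s \rangle - \nabla_x \cdot f \Bigr].
\end{align}
```

The last line involves only the score $s$ and the SDE coefficients, which is what allows it to be checked against a learned noise-conditional score model.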
On the Equivalence of Consistency-Type Models: Consistency Models, Consistent Diffusion Models, and Fokker-Planck Regularization
Lai, Chieh-Hsin, Takida, Yuhta, Uesaka, Toshimitsu, Murata, Naoki, Mitsufuji, Yuki, Ermon, Stefano
The emergence of various notions of "consistency" in diffusion models has garnered considerable attention and helped achieve improved sample quality, likelihood estimation, and accelerated sampling. Although similar concepts have been proposed in the literature, the precise relationships among them remain unclear. In this ...
"Consistency" here refers to a (diffusion) model that is explicitly designed to align with the underlying trajectory defined by an ordinary differential equation (ODE), stochastic differential equation (SDE), or partial differential equation (PDE). In this study, we aim to provide a theoretical investigation into the relationships between these three consistency-type models. Under certain mild assumptions, we will rigorously establish the equivalence of these independently developed concepts.