Generating musical audio directly with neural networks is notoriously difficult because it requires coherently modeling structure at many different timescales. Fortunately, most music is also highly structured and can be represented as discrete note events played on musical instruments. Herein, we show that by using notes as an intermediate representation, we can train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude (about 0.1 ms to 100 s), a process we call Wave2Midi2Wave. This large advance in the state of the art is enabled by our release of the new MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset, composed of over 172 hours of virtuosic piano performances captured with fine alignment (about 3 ms) between note labels and audio waveforms. The networks and the dataset together present a promising approach toward creating new expressive and interpretable neural models of music.

Since the beginning of the recent wave of deep learning research, there have been many attempts to create generative models of expressive musical audio de novo. These models would ideally generate audio that is both musically and sonically realistic, to the point of being indistinguishable to a listener from music composed and performed by humans. However, modeling music has proven extremely difficult because of dependencies across a wide range of timescales: those that give rise to pitch and timbre (short-term), those of rhythm (medium-term), and those of song structure (long-term). On the other hand, much of music has a large hierarchy of discrete structure embedded in its generative process: a composer creates songs, sections, and notes, and a performer realizes those notes with discrete events on their instrument, creating sound.
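
To make the note-level view concrete, here is a minimal Python sketch of a discrete note event and of the timescale arithmetic quoted above. The `NoteEvent` class, its field names, and the helper `orders_of_magnitude` are hypothetical illustrations chosen for this example; they are not the dataset's schema or any model's actual interface.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NoteEvent:
    """One discrete note event (hypothetical layout, for illustration only)."""
    pitch: int       # MIDI pitch number, 0-127
    velocity: int    # MIDI key velocity, 0-127
    onset_s: float   # note start time in seconds
    offset_s: float  # note end time in seconds

    @property
    def duration_s(self) -> float:
        return self.offset_s - self.onset_s


def orders_of_magnitude(shortest_s: float, longest_s: float) -> float:
    """Base-10 log of the ratio between the longest and shortest timescale."""
    return math.log10(longest_s / shortest_s)


# Three notes of a C major arpeggio, written as discrete events.
notes = [
    NoteEvent(pitch=60, velocity=80, onset_s=0.0, offset_s=0.5),
    NoteEvent(pitch=64, velocity=72, onset_s=0.5, offset_s=1.0),
    NoteEvent(pitch=67, velocity=90, onset_s=1.0, offset_s=2.0),
]

# Timescales quoted in the text: about 0.1 ms up to about 100 s.
span = orders_of_magnitude(0.1e-3, 100.0)
print(f"{len(notes)} notes, timescale span of about {span:.1f} orders of magnitude")
```

A few numbers per note are enough to describe what was played, whereas the corresponding waveform must be modeled sample by sample across the whole span computed in the last step.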