DocumentNet: Bridging the Data Gap in Document Pre-Training