In this example we are faced with thousands of text articles scraped from both HTML and PDF files. The quality of the text returned depends heavily on the scraping process: from sample-checking some of the results we know there are issues ranging from bad links and unreadable PDFs to items that were read in successfully but whose content is complete garbage. The articles are company Modern Slavery returns from this database: https://www.modernslaveryregistry.org/ They now reside in a Pandas DataFrame with metadata on each item, such as the company name and year of publication, alongside the text scraped from the return. For spotting the gaps, the Python missingno package is super-useful.
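A minimal sketch of what such a frame and a first missing-data check might look like. The company names, column names, and sample values here are hypothetical, invented for illustration; the real frame holds thousands of rows.

```python
import pandas as pd
import numpy as np

# Hypothetical sample mirroring the scraped returns: metadata columns plus
# the scraped text, with failed scrapes recorded as NaN.
df = pd.DataFrame({
    "company": ["Acme Ltd", "Globex plc", "Initech"],
    "year": [2019, 2020, 2020],
    "text": ["We are committed to ...", np.nan, ""],  # NaN = scrape failed outright
})

# Outright-missing values per column; empty strings need a separate check,
# since a "successful" scrape can still return no usable text.
missing_counts = df.isna().sum()
empty_text = (df["text"].fillna("").str.strip() == "").sum()

print(missing_counts["text"])  # 1 row where the scrape returned nothing at all
print(empty_text)              # 2 rows with no usable text

# missingno renders the same nullity information visually, e.g.:
# import missingno as msno
# msno.matrix(df)
```

Counting empty strings alongside NaNs matters here because the worst failures are not missing cells but cells filled with garbage, which simple nullity checks will not catch.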
First, a couple of pointers to keep in mind when searching for datasets.

Kaggle: a data science site that hosts a variety of externally contributed, interesting datasets. You can find all kinds of niche datasets in its master list, from ramen ratings to basketball data and even Seattle pet licenses. Although the datasets are user-contributed, and thus vary in cleanliness, the vast majority are clean.

VisualData: discover computer vision datasets by category; it supports searchable queries.
Deep Learning is a revolutionary field, but for it to work as intended, it requires data. The field concerned with such large datasets is known as Big Data, a term for the sheer abundance of digital data now available. Data is as important to Deep Learning algorithms as the architecture of the network itself, i.e., the software. Acquiring and cleaning the data is one of the most valuable parts of the work: without data, neural networks cannot learn.