How Elsevier Accelerated COVID-19 research using Dask on Saturn Cloud -- Elsevier Labs
The version of CORD-19 that we used yielded 3,389,064 paragraphs and 16,952,279 sentences. Each sentence is sent to each model and yields zero or more entities. Generating entities from sentences is embarrassingly parallel, because each sentence can be processed independently of the others, so processing multiple sentences at once reduces total processing time. To process the dataset, we used Dask, an open source library for parallel computing in Python. Dask provides convenient abstractions that mimic familiar APIs such as NumPy arrays and pandas DataFrames, and these abstractions can operate on datasets that do not fit in main memory.
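The per-sentence pattern described above can be sketched with a Dask bag. This is a minimal illustration, not the pipeline used in the article: `TOY_LEXICON` and `extract_entities` are hypothetical stand-ins for a real entity recognition model, and the three sentences are invented sample input. The threaded scheduler is chosen only so the sketch runs on a single machine.

```python
import dask.bag as db

# Hypothetical stand-in for a real entity recognition model:
# a tiny dictionary lookup, used only to keep the sketch self-contained.
TOY_LEXICON = {"remdesivir": "DRUG", "sars-cov-2": "VIRUS", "fever": "SYMPTOM"}


def extract_entities(sentence):
    """Return the (text, label) pairs found in one sentence; the list may be empty."""
    entities = []
    for token in sentence.lower().split():
        token = token.strip(".,;:()")
        if token in TOY_LEXICON:
            entities.append((token, TOY_LEXICON[token]))
    return entities


# Invented sample input; the real dataset has millions of sentences.
sentences = [
    "Remdesivir was evaluated against SARS-CoV-2.",
    "Patients presented with fever.",
    "No entities appear here.",
]

# Each sentence is independent, so partitions can be processed in parallel.
bag = db.from_sequence(sentences, npartitions=3)

# map() applies the model to every sentence; flatten() merges the
# zero-or-more entities per sentence into one flat collection.
entities = bag.map(extract_entities).flatten().compute(scheduler="threads")

for text, label in entities:
    print(text, label)
```

Because no sentence depends on another, the same code would in principle scale from local threads to a cluster by changing how the bag is partitioned and which scheduler runs it.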
Oct-27-2020, 22:45:45 GMT