python file-processing framework
Introducing Wildebeest, a Python File-Processing Framework
ShopRunner has more than ten million product images that we use to train computer vision classification models. Moving those files around and processing them is a pain without good tooling. Just downloading them serially takes many days, and the occasional corrupted image can bring the whole process to a halt. Without good logging and error handling, it might then be necessary to start the process over until the next error is raised. Over time we built up techniques for parallelizing over files, handling errors, and skipping files that had already been processed.