Earlier this year at QCon.ai, I gave a presentation on using R and the Microsoft Cognitive Services APIs to automatically generate captions for images, along with an interactive workshop on building a tool that recognizes images of hot dogs. Video of both the presentation and the workshop (which starts at the 10:00 mark) is now available to view from QCon.ai. You can find the slides for the presentation here. The R code for the "Not Hotdog" workshop is available as an Azure Notebook, which you can clone and use to follow along with the workshop.
Our long national nightmare is over: thanks to HBO and Silicon Valley, there's finally an app that will tell you whether the object you pointed your phone's camera at is a hot dog or not. For fans of the show, it's a cute joke, but everyone else might be a little puzzled. As a brief bit of background, T.J. Miller's character Erlich Bachman accidentally invested in an app he thought had something to do with Oculus when, in actuality, it was an application with recipes for preparing octopus, nothing to do with virtual reality. A common mistake, to be sure. That led to pivoting the app to become the "Shazam of food."
That would have beaten what I ended up shipping, but the problem, of course, was the size of those networks. So really, if we're comparing apples to apples, I'll say none of the "small", mobile-friendly neural nets (e.g. SqueezeNet, MobileNet) I tried to retrain did anywhere near as well as my DeepDog network trained from scratch. The training runs were erratic and never asymptotically approached any sort of upper bound as they should have. I think this has to do with the fact that these very small networks encode information about a lot of ImageNet classes, and it's very hard to tune what they should retain versus what they should forget, so picking your learning rate (and possibly adjusting it on the fly) ends up being critical.
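Adjusting the learning rate "on the fly" usually means reducing it when validation loss stalls, which is what deep-learning frameworks offer as a reduce-on-plateau schedule. Below is a minimal, self-contained sketch of that logic; the class name, parameters, and simulated losses are all illustrative, not code from the Not Hotdog project.

```python
# Illustrative sketch of a reduce-on-plateau learning-rate schedule:
# halve the learning rate whenever validation loss fails to improve
# for `patience` consecutive epochs. All names here are hypothetical.

class PlateauScheduler:
    def __init__(self, lr=0.01, patience=3, factor=0.5, min_lr=1e-6):
        self.lr = lr
        self.patience = patience   # epochs to wait before reducing
        self.factor = factor       # multiplicative decay (0.5 = halve)
        self.min_lr = min_lr       # floor on the learning rate
        self.best = float("inf")   # best validation loss seen so far
        self.bad_epochs = 0        # epochs since last improvement

    def step(self, val_loss):
        """Record one epoch's validation loss; return the (possibly reduced) lr."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr

# Simulated validation losses from an erratic fine-tuning run
losses = [1.0, 0.8, 0.79, 0.81, 0.80, 0.82, 0.78, 0.79, 0.80, 0.81]
sched = PlateauScheduler(lr=0.01, patience=3)
for loss in losses:
    lr = sched.step(loss)
```

After the two plateaus in this simulated run, the rate has been halved twice, from 0.01 down to 0.0025. Production frameworks (e.g. Keras's `ReduceLROnPlateau` callback or PyTorch's `torch.optim.lr_scheduler.ReduceLROnPlateau`) implement the same idea with more bells and whistles, such as cooldown periods and relative-improvement thresholds.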