To learn more about cutting-edge data science tools like Apache Kafka, check out the Strata Data Conference in San Jose, March 5-8, 2018; registration is now open.

Machine learning has become mainstream, and suddenly businesses everywhere are looking to build systems that use it to optimize aspects of their products, processes, or customer experience. The cartoon version of machine learning sounds easy: you feed in training data made up of examples of good and bad outcomes, and the computer automatically learns from these examples and spits out a model that can make similar predictions on data it has not seen before. What could be easier, right?

Those with real experience building and deploying production systems built around machine learning know that these systems are, in fact, shockingly hard to build. The difficulty lies, for the most part, not in the algorithmic or mathematical complexity of machine learning algorithms. Creating such algorithms is difficult, to be sure, but that work is mostly done by academic researchers.
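To make the "cartoon version" concrete, here is a minimal sketch of that learn-then-predict loop, using a toy nearest-neighbor classifier in place of a real training algorithm; the function names and data are purely illustrative.

```python
# The "cartoon" machine-learning workflow: store labeled examples,
# then label new points by their closest training example.

def fit(examples):
    """'Training' here just stores the labeled examples."""
    return list(examples)

def predict(model, point):
    """Label a new point by its nearest training example."""
    def distance(example):
        features, _label = example
        return sum((a - b) ** 2 for a, b in zip(features, point))
    _features, label = min(model, key=distance)
    return label

# Training data: (features, outcome) pairs of good and bad examples.
training_data = [
    ((0.0, 0.0), "bad"),
    ((0.1, 0.2), "bad"),
    ((1.0, 1.0), "good"),
    ((0.9, 0.8), "good"),
]

model = fit(training_data)
print(predict(model, (0.95, 0.9)))  # → good
```

The hard part in production is everything this sketch leaves out: data collection, feature pipelines, evaluation, deployment, and monitoring.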
Do you fully understand how your systems operate? As an engineer, there is a lot you can do to aid the person who is going to manage your application in the future. In a previous post, we covered how exposing the tuning knobs of the underlying technologies to operations goes a long way toward making your application successful. Your application is a unique project: it's easier for you to learn the operational aspects of the underlying technologies than for others to learn the specifics of every application. Notice I said "the person who is going to manage your application in the future" and not "operations."
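One simple way to expose those tuning knobs is to ship sensible defaults but let any setting be overridden from an external config source, so an operator can tune the underlying technology without touching application code. The sketch below assumes JSON overrides; the setting names are illustrative, not any real product's API.

```python
# Sketch: surface the underlying system's tuning knobs as overridable
# application settings instead of hard-coding them.

import json

DEFAULTS = {
    "producer.batch.size": 16384,      # knobs of the underlying system,
    "producer.linger.ms": 5,           # surfaced rather than hard-coded
    "consumer.max.poll.records": 500,
}

def load_settings(overrides_json):
    """Merge operator-supplied overrides on top of application defaults."""
    settings = dict(DEFAULTS)
    settings.update(json.loads(overrides_json))
    return settings

# An operator tunes one knob without a code change or redeploy of logic:
settings = load_settings('{"producer.linger.ms": 50}')
```

Documenting what each exposed knob does, and its safe ranges, is what turns this from a config file into genuine operational support.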
Operating Kafka at Uber's scale, with near-real-time delivery to many downstream consumers, is difficult. We batch aggressively and rely on asynchronous processing wherever possible to achieve high throughput. Services use in-house client libraries to publish messages to Kafka proxies, which batch the messages and forward them to regional Kafka clusters. Some Kafka topics are consumed directly from the regional clusters, while many others are combined with data from other data centers into an aggregate Kafka cluster via uReplicator for scalable stream or batch processing.
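The batching step in the client path can be sketched as follows: messages are buffered and flushed downstream once the batch reaches a size limit. This is a minimal illustration, not Uber's actual client library; real implementations also flush on a time interval and handle failures and retries.

```python
# Sketch of client-side batching: buffer messages, flush to a
# downstream sink (e.g., a Kafka proxy) once the batch is full.

class BatchingPublisher:
    def __init__(self, send_batch, max_batch=3):
        self.send_batch = send_batch  # callable that forwards one batch
        self.max_batch = max_batch
        self.buffer = []

    def publish(self, message):
        self.buffer.append(message)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send_batch(self.buffer)
            self.buffer = []

# Simulate the proxy by collecting batches in a list.
sent = []
publisher = BatchingPublisher(sent.append, max_batch=3)
for i in range(7):
    publisher.publish(f"msg-{i}")
publisher.flush()  # drain the partial final batch
# sent now holds three batches: two full, one partial
```

Trading a little latency for fewer, larger requests is the core idea; the batch size and flush interval become the throughput/latency tuning knobs.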