It was the second half of the 18th century, and there was no branch of the mathematical sciences called "Probability Theory". The subject was known simply by the rather odd-sounding name "Doctrine of Chances", after a book by Abraham de Moivre. An article called "An Essay towards solving a Problem in the Doctrine of Chances", first formulated by Bayes but edited and amended by his friend Richard Price, was read to the Royal Society and published in the Philosophical Transactions of the Royal Society of London in 1763. In this essay, Bayes described, in a rather frequentist manner, the simple theorem concerning joint probability which gives rise to the calculation of inverse probability.
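In modern notation, the theorem is usually written as below. This is a sketch in today's symbols rather than the form used in the essay: the event labels A and B are chosen here purely for illustration, and P(B) is assumed to be nonzero. The first line is the statement about joint probability, and the second is the inverse probability that follows from it.

```latex
% Joint probability of A and B, factored two ways
P(A \cap B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A)

% Dividing by P(B) gives the inverse probability of A given B
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}, \qquad P(B) > 0
```

Read this way, P(B | A) is the "forward" probability of the observation given the hypothesis, and P(A | B) is the "inverse" probability of the hypothesis given the observation.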

I owe thanks to my CS7641 class at Georgia Tech, part of my MS Analytics program, where I discovered this concept and was inspired to write about it. It is somewhat surprising that among all the high-flying buzzwords of machine learning, we don't hear much about the one phrase which fuses some of the core concepts of statistical learning, information theory, and natural philosophy into a single three-word combo. Moreover, it is not just an obscure and pedantic phrase meant for machine learning (ML) PhDs and theoreticians. It has a precise and easily accessible meaning for anyone interested in exploring it, and a practical pay-off for practitioners of ML and data science. I am talking about Minimum Description Length.

Presbyterian reverend Thomas Bayes had no reason to suspect he'd make any lasting contribution to humankind. Born in England at the beginning of the 18th century, Bayes was a quiet and questioning man. He published only two works in his lifetime. In 1731, he wrote a defense of God's--and the British monarchy's--"divine benevolence," and in 1736, an anonymous defense of the logic of Isaac Newton's calculus. Yet an argument he wrote before his death in 1761 would shape the course of history.

Let's peel the layers off and see how useful Minimum Description Length is. We start (not chronologically) with Reverend Thomas Bayes, who, by the way, never published his idea about how to do statistical inference, but was later immortalized by the eponymous theorem.