r/MachineLearning - [1905.11786] Putting An End to End-to-End: Gradient-Isolated Learning of Representations


In general I agree, but in machine learning mutual information seems to be a case where approximation can sometimes help rather than hurt. In another discussion this week about the Tishby information bottleneck, cameldrv correctly pointed out that the mutual information between a signal and its encrypted version should be high, but that in practice no algorithm will discover this. Turn that around, though: when used in a complex DNN, a learning algorithm that seeks to maximize mutual information (such as today's Putting An End to End-to-End) could in theory produce something like weak encryption: the desired information is extracted, but in such a complex form that _another_ DNN classifier would be needed to extract it! So the fact that mutual information can only be approximated can be a good thing, because optimizing objectives that cannot "see" complex relationships prevents this failure mode. A radical example is the HSIC bottleneck paper, where an approximation that is only monotonically related to mutual information spontaneously produced one-hot classifications without any guidance.
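As a rough illustration of the kind of dependence measure being discussed, here is a minimal sketch of the standard empirical HSIC estimator with Gaussian (RBF) kernels. This is a generic textbook form, not code from either paper; the function names and the fixed bandwidth `sigma` are my own choices for the sketch:

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    # Gaussian kernel matrix from pairwise squared distances.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    # Biased empirical HSIC: tr(K H L H) / (n-1)^2,
    # where H centers the kernel matrices.
    n = X.shape[0]
    K = rbf_kernel(X, sigma)
    L = rbf_kernel(Y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```

The point of the comment applies here: HSIC is not mutual information, only a dependence score that grows with statistical dependence under these kernels, yet optimizing it (rather than MI itself) is exactly what produced the surprising behavior in the HSIC bottleneck paper.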