Reviews: Object based Scene Representations using Fisher Scores of Local Subspace Projections

Neural Information Processing Systems 

There is no theoretical justification for why MFA outperforms FV in transfer learning from object-level to holistic scene descriptors. The paper's main argument about the "... inability of the standard GMM ... to provide good approximation ..." in L73-75 needs a proof or a reference to the appropriate literature rather than experimental results alone. The paper should clarify why the full covariance in MFA is the key to the transfer learning problem on CNN features. I consider this a weak argument, even though it is presented as the second contribution of the paper, because by the authors' own reasoning any other dictionary learning method with full covariance should yield the same improvement as MFA. An experienced reader is already familiar with these formulations; I therefore expected the formulation to focus on the main claims, which I could not find there.