Review for NeurIPS paper: Every View Counts: Cross-View Consistency in 3D Object Detection with Hybrid-Cylindrical-Spherical Voxelization
–Neural Information Processing Systems
There are many small language mistakes, mostly in the technical section (Section 3), but they are not the main problem. The proposed method is simple (which, again, is a good thing), but it is somehow difficult to understand from the text. I detail below what could be changed to improve the clarity of the text:
- Calling the mapping functions used in the constraint term "cross-view transformers" is confusing, as "transformer" means other things in deep learning (Transformers in NLP, spatial transformers).
- Section 3.4 (about the transformers) mentions features, while in fact it is the final outputs that are "transformed".
- It is not said explicitly that the weights in Eq. (1) are learned in Section 3.4.
- Eqs. (3) to (6) seem to use the Euclidean(?) norm, while the authors probably meant some similarity functions.
- Eq. (6) is disconnected from the text.
- Figure 1 is very dense and it is difficult to understand the method from it, while it should be possible to convey the method visually in a simple way.
- Mentioning the Hough transform to explain the method did not make the presentation more intuitive for me.
Feb-8-2025, 03:43:45 GMT
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (0.63)
- Vision (0.40)