Review for NeurIPS paper: Every View Counts: Cross-View Consistency in 3D Object Detection with Hybrid-Cylindrical-Spherical Voxelization

Feb-8-2025, 03:43:45 GMT–Neural Information Processing Systems

There are many small language mistakes, mostly in the technical section (Section 3), but they are not the main problem. The proposed method is simple (which, again, is a good thing), but it is somehow difficult to understand from the text. Below I detail what could be changed to improve the clarity of the text:
- Calling the mapping functions used in the constraint term "cross-view transformers" is confusing, as "transformer" already means other things in deep learning (Transformers in NLP, spatial transformers).
- Section 3.4 (about the transformers) mentions features, while in fact it is the final outputs that are "transformed".
- It is not said explicitly that the weights in Eq. (1) are learned in Section 3.4.
- Eqs. (3) to (6) seem to use the Euclidean(?) norm, while the authors probably meant some similarity functions.
- Eq. (6) is disconnected from the text.
- Figure 1 is very dense, and it is difficult to understand the method from it, although it should be possible to convey the method visually in a simple way.
- Mentioning the Hough transform to explain the method did not make the presentation more intuitive for me.

cross-view consistency, hybrid-cylindrical-spherical voxelization, transformer
