I was criticized, badly.
Here are the reviews:
Review 1:
Comments to author(s)
REVIEWERS COMMENTS TO AUTHORS:
This paper is an obvious reject. Basically, this has been done before, and better, by others. In addition, the math, while trivial, is explained in too much detail and in a convoluted manner. Finally, the evaluation results are way below par compared with what is currently the norm in robotics/SLAM.
NOVELTY:
The authors would better serve the community by doing a thorough literature search before embarking on new methods and submitting to major conferences. Here is just a sample of what has been done before:
J. Zheng and S. Tsuji, “Panoramic representation for route recognition by a mobile robot,” International Journal of Computer Vision, vol. 9, no. 1, pp. 55–76, October 1992
Using the Condensation Algorithm for Robust, Vision-based Mobile Robot Localization, Frank Dellaert, Dieter Fox, Wolfram Burgard, and Sebastian Thrun, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1999
Mosaicing a Large Number of Widely Dispersed, Noisy, and Distorted Images: A Bayesian Approach, F. Dellaert, S. Thrun, and C. Thorpe, tech. report CMU-RI-TR-99-34, Robotics Institute, Carnegie Mellon University, March, 1999.
Underwater video mosaics as visual navigation maps, Nuno Gracias, José Santos-Victor, VisLab-TR 07/2000 - Computer Vision and Image Understanding, vol. 79(1), pp. 66-91, July 2000
Kelly A., “Pose Determination and Tracking in Image Mosaic-Based Vehicle Position Estimation”, International Conference on Intelligent Robots and Systems (IROS00), October, 2000.
Mosaicing Large Cyclic Environments for Visual Navigation in Autonomous Vehicles, R. Unnikrishnan and A. Kelly, IEEE International Conference on Robotics and Automation, 2002 (ICRA '02), Vol. 4, May, 2002, pp. 4299-4306.
Mosaic-based navigation for autonomous underwater vehicles, N. R. Gracias, S. van der Zwaan, A. Bernardino, and J. Santos-Victor, Instituto de Sistemas e Robotica, Instituto Superior Tecnico, Lisbon, Portugal, IEEE Journal of Oceanic Engineering, vol. 28, no. 4, pp. 609-624, October 2003
RELEVANCE: if novel, this paper would be very relevant.
SIGNIFICANCE: It is not novel, hence not significant.
CLARITY: The math is made way too complicated! It can be explained in one line: every observed point is related to a ground point by
o = H T g
where H is a homography and T a 2D rigid transform. H can be calibrated in *many* ways, the simplest of which is simply using 4 calibrated points on the ground. Localization is finding the three DOF of T, and SLAM is finding all of them.
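To make the reviewer's one-line model concrete, here is a minimal sketch in plain Python of calibrating H from four known ground points and then composing it with a rigid transform T to get o = H T g. It is only an illustration under assumptions: the function names (solve, homography_from_4_points, rigid, matmul, project) are my own and come from neither the paper nor the review, H is normalized so its bottom-right entry is 1, and the four points must be in general position (no three collinear).

```python
import math


def solve(a, b):
    """Solve the square linear system a x = b by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography_from_4_points(ground, image):
    """Estimate H (3x3, H[2][2] = 1) from four ground -> image point pairs."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(ground, image):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    h = solve(rows, rhs)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def rigid(theta, tx, ty):
    """2D rigid transform T with three degrees of freedom."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]]


def matmul(a, b):
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def project(m, point):
    """Apply a 3x3 projective transform to a 2D point."""
    x, y = point
    u, v, w = (m[i][0] * x + m[i][1] * y + m[i][2] for i in range(3))
    return (u / w, v / w)


if __name__ == "__main__":
    # Made-up calibration data: a unit square on the ground and its pixels.
    ground = [(0, 0), (1, 0), (1, 1), (0, 1)]
    image = [(10, 20), (30, 20), (30, 40), (10, 40)]
    H = homography_from_4_points(ground, image)
    T = rigid(math.pi / 2, 1.0, 0.0)
    print(project(matmul(H, T), (1.0, 0.0)))  # o = H T g
```

In this picture, localization means searching for the (theta, tx, ty) passed to rigid that best explains the observed image, with H fixed after calibration.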
TECHNICAL SOUNDNESS: The math in the paper is sound, if mostly trivial. The calibration method, however, is overly complex, as discussed above. And true SLAM would work directly on the translation and angles.
QUALITY OF EVALUATION: The small tracks are useless in evaluating the method. Compare with Al Kelly's papers and the Portuguese underwater SLAM for a sample of what constitutes a *real* evaluation.
Review 2:
Comments to author(s)
REVIEWERS COMMENTS TO AUTHORS:
The authors present a SLAM method for robotic navigation on a planar surface using monocular vision. Rather than relying on point features and correspondences or computing 3D positions of features (as in stereo vision), their approach is to transform each image from a front-facing camera into an overhead view through virtual rotation. The authors show that translation and rotation of these rotated images corresponds exactly to changes in position and orientation of the robot. Therefore, localization corresponds directly to image matching between frames.
The technique is described in a thorough and rigorous fashion with compelling results from a robot platform in an indoor environment. The two obvious limitations of this approach are that it is only suited for planar environments and relies on distinguishing characteristics on the floor (especially since only the bottom 1/3 of each image is used by the authors in their experimentation). The floor in their experiments is certainly more feature-rich than those of most indoor environments would be.
I would like to suggest one improvement to the ‘Localization by Ground Matching’ section of the paper. The mentioned approach relies on an exhaustive global search in 3 dimensions (x, y, theta), which is not only computationally expensive but can also select incorrect matchings on consistently patterned floors. Since rotations and translations between images result directly from the motion of the robot, the robot’s motion model can be used not only to focus the search for the best ground image matching more intelligently (for example by using a particle filter approach to sample some distribution of positions around the odometry-based estimate of the new pose of the robot), but also to help decide on the most likely match among similar-quality matchings (for example in an environment with a square tiling pattern). I’m not sure, but you may already be using such an approach, and if so you might want to mention it explicitly.
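The suggestion above can be sketched roughly as follows: instead of scanning all of (x, y, theta), draw candidate poses from a Gaussian around the odometry estimate and keep the one with the lowest image-matching cost. This is a hedged illustration, not the paper's method. The names sample_candidates, best_match and tiled_floor_cost are hypothetical, the noise levels are made up, and the cost function is a toy stand-in for a real image comparison on a floor that repeats every 1.0 unit.

```python
import math
import random


def sample_candidates(odom_pose, sigma_xy, sigma_theta, n, rng):
    """Draw n candidate poses from a Gaussian around the odometry estimate."""
    x, y, th = odom_pose
    return [(rng.gauss(x, sigma_xy),
             rng.gauss(y, sigma_xy),
             rng.gauss(th, sigma_theta)) for _ in range(n)]


def best_match(odom_pose, match_cost, sigma_xy=0.05, sigma_theta=0.05,
               n=200, seed=0):
    """Return the sampled pose (or the odometry pose itself) with lowest cost.

    match_cost is a caller-supplied function (x, y, theta) -> float that
    scores how badly the current ground image matches the map at that pose.
    """
    rng = random.Random(seed)
    candidates = [odom_pose]
    candidates += sample_candidates(odom_pose, sigma_xy, sigma_theta, n, rng)
    return min(candidates, key=lambda p: match_cost(*p))


def tiled_floor_cost(x, y, theta):
    """Toy cost for a square-tiled floor: equally good minima repeat every
    1.0 unit, so a global search cannot tell the tiles apart."""
    return (math.sin(math.pi * (x - 0.3)) ** 2
            + math.sin(math.pi * (y - 0.7)) ** 2
            + theta ** 2)


if __name__ == "__main__":
    # Odometry says the robot is near (1.25, 2.72); the matching minima
    # sit at (0.3 + i, 0.7 + j) for every integer i, j.
    print(best_match((1.25, 2.72, 0.01), tiled_floor_cost))
```

Because the samples stay close to the odometry estimate, the search settles on the nearby minimum around (1.3, 2.7) rather than an equally good match one or more tiles away, and it evaluates a few hundred poses instead of a full 3D grid.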
Overall, I think this is a logical and well-presented approach to dealing with a specific class of domains a robot may encounter in a cheap and efficient manner. My main concern, however, is that this approach has been applied already in the past.
RELEVANCE:
Presents an approach for SLAM on a planar surface from monocular vision. Such an approach would be of interest to anyone considering low-cost camera solutions for SLAM.
SIGNIFICANCE:
Has noticeable limitations (requires planar surface, distinguishing features on floor), but could be a very effective approach if used in the right domains.
TECHNICAL SOUNDNESS:
Thorough derivations of techniques for image rotation (although I had trouble following portions of the ‘Transformation to the Ground Plane’ section). Ground matching technique seems to be sub-optimal.
NOVELTY:
While this is a nice paper, such work has to some extent been done before (although I'm not the most knowledgeable person in this domain). This is my biggest concern in the decision to accept / reject this paper.
QUALITY OF EVALUATION:
Shows very good results from robot testing in indoor environment. Environment floor was very conducive to this technique (very irregular and discernible patterns on floor).
CLARITY:
Good organization and well-written. Convincing results.
Review 3:
Comments to author(s)
The main focus of this paper is the use of dense texture information from monocular cameras to perform robot localization in a previously unvisited environment. The primary effort is concentrated on extraction of the ground plane.
This idea is not a novel one for robotic localization. Many other researchers have pointed a camera at the ground, and used methods such as optical flow, image registration or the like to track the incremental motion of the robot. As such, I would like to see some comparison or differentiation for this method.
The significance of this paper is low, as the motion of the camera is restricted to the plane. No pitch, roll, or height changes are considered. The method relies on flat, textured surfaces. Given the prevalence of other methods which operate in this type of an environment, significant improvement over existing methods would need to be shown.
One concern that I have relates to the "orientation estimation", as described around figure 6. First, I was worried about the level of detail given to explaining this critical step. From the text, it appears that the extraction of vertical lines requires human input. The method for using a horizontal infinite line is completely skipped over, and the term itself should either be cited or defined. My second thought was whether it would always be the case that either vertical lines or an infinite line would be visible. Even in an office environment, this is often not the case, particularly when the robot is approaching a wall. The last concern that I had was, if these vertical lines are omnipresent as indicated, and they are applicable for correcting orientation, why not use these lines as landmarks?
Nearly all current SLAM methods use some form of probabilistic tracking to handle uncertainty in the system, often with a particle filter or some form of a Kalman filter. This is often the primary focus of new research in SLAM. I was surprised to see this paper use a greedy/maximum-likelihood approach, which has often been shown to be insufficient for SLAM or localization. If this method can be used without probabilistic tracking methods, it would be interesting to see a discussion of why this is true. If this is not true, I think that this method should be extended with something like a particle filter, and evaluated in that context. As it currently is described, I feel like this could be the beginning of a good method, but this paper is not yet fully developed. I would encourage the authors to perform more research to extend the system to handle larger areas and demonstrate a general applicability, and then resubmit the paper at a later time.
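For reference, the kind of particle filter the reviewer has in mind boils down to three steps per frame: predict each pose hypothesis with noisy odometry, weight it by how well the ground image matches, and resample. The sketch below is a generic textbook-style outline, not anything from the paper. The function names are my own, and the likelihood passed to normalize_weights is assumed to be supplied by the caller, for example from an image-matching score.

```python
import math
import random


def predict(particles, odom, noise, rng):
    """Move every particle by the odometry increment (dx, dy, dtheta, given
    in the robot frame), perturbed by Gaussian noise with the given sigmas."""
    dx, dy, dth = odom
    moved = []
    for x, y, th in particles:
        ndx = dx + rng.gauss(0.0, noise[0])
        ndy = dy + rng.gauss(0.0, noise[1])
        ndth = dth + rng.gauss(0.0, noise[2])
        moved.append((x + ndx * math.cos(th) - ndy * math.sin(th),
                      y + ndx * math.sin(th) + ndy * math.cos(th),
                      th + ndth))
    return moved


def normalize_weights(raw):
    """Scale non-negative likelihoods so they sum to 1 (uniform if all zero)."""
    total = sum(raw)
    if total <= 0.0:
        return [1.0 / len(raw)] * len(raw)
    return [w / total for w in raw]


def resample(particles, weights, rng):
    """Systematic (low-variance) resampling: particles survive in proportion
    to their normalized weights."""
    n = len(particles)
    step = 1.0 / n
    start = rng.random() * step
    out, i, cumulative = [], 0, weights[0]
    for k in range(n):
        target = start + k * step
        while target >= cumulative and i < n - 1:
            i += 1
            cumulative += weights[i]
        out.append(particles[i])
    return out


def filter_step(particles, odom, noise, likelihood, rng):
    """One predict / weight / resample cycle."""
    moved = predict(particles, odom, noise, rng)
    weights = normalize_weights([likelihood(p) for p in moved])
    return resample(moved, weights, rng)
```

Unlike a greedy best-match update, the particle set keeps several pose hypotheses alive, so one ambiguous frame does not commit the whole trajectory to a wrong match.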
Empirical testing and evaluation were fairly limited, and the domains described were fairly small, making it difficult to really get a sense of the method's general performance. Also, there was no method available to evaluate the correctness of the algorithm, either through ground truth, or some event like loop closing which can demonstrate the correct accumulated displacement over time.
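The loop-closing check mentioned here is easy to state in code: drive a closed loop, chain the per-frame motion estimates, and measure how far the final pose lands from the start. The sketch below assumes each estimate is an increment (dx, dy, dtheta) in the robot frame. The function names and the example numbers are illustrative only and are not results from the paper.

```python
import math


def compose(pose, delta):
    """Apply a robot-frame increment (dx, dy, dtheta) to a world pose."""
    x, y, th = pose
    dx, dy, dth = delta
    return (x + dx * math.cos(th) - dy * math.sin(th),
            y + dx * math.sin(th) + dy * math.cos(th),
            th + dth)


def wrap_angle(a):
    """Wrap an angle to the range [-pi, pi]."""
    return math.atan2(math.sin(a), math.cos(a))


def loop_closure_error(deltas, start=(0.0, 0.0, 0.0)):
    """Chain the increments and return (position error, heading error)
    between the final pose and the start of the loop."""
    pose = start
    for d in deltas:
        pose = compose(pose, d)
    position_error = math.hypot(pose[0] - start[0], pose[1] - start[1])
    heading_error = abs(wrap_angle(pose[2] - start[2]))
    return position_error, heading_error


if __name__ == "__main__":
    # Ideal square loop: four unit-length sides, each followed by a 90 degree turn.
    ideal = [(1.0, 0.0, math.pi / 2)] * 4
    # Same loop with every turn under-estimated by 2 degrees (made-up bias).
    biased = [(1.0, 0.0, math.radians(88))] * 4
    print(loop_closure_error(ideal))
    print(loop_closure_error(biased))
```

A small closure error does not prove the whole trajectory is right, but a large one is direct evidence of accumulated drift, which is the kind of check the reviewer is asking for when no ground truth is available.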