Local features have played a key role in a wide range of computer vision applications throughout the past 20 years, particularly since the introduction of SIFT. Despite the drastic advancements resulting from deep learning techniques, 3D reconstruction under challenging conditions remains somewhat of an outlier, as performance in small, constrained benchmarks does not necessarily translate to real-world scenarios. In practice, keypoint-based methods remain the most common solution to this problem, and techniques such as SIFT or RANSAC are still very much in use.
Historically, machine learning research on local features has focused on learning patch descriptors, for which training data is relatively easy to obtain. However, performance on patch matching benchmarks is not always meaningful, as descriptors are tightly coupled with the keypoints they work on and with image properties that can vary greatly from one domain or dataset to another. More representative metrics can be extracted further down the chain, for instance at the 3D reconstruction level, but this requires better ground truth.
In parallel, there has been a strong push over the last few years towards tackling the image matching problem with dense methods, that is, doing away with keypoints altogether. While promising results have been demonstrated under narrow baselines, particularly with dense, deep networks, the general wide-baseline scenario remains unsolved. In order to enable research in this area, large-scale benchmarks with training data are required. However, current datasets are constrained in terms of size, photometric variations, and viewpoint changes.
There is thus a clear need for new, large-scale, challenging benchmarks to both train and evaluate new methods for image matching. To this end, we propose a new challenge with two datasets. Both contain training data with accurate ground truth poses, along with other cues that can be used for further supervision.
In order to learn and evaluate models that can perform well under a wide range of situations, it is of paramount importance to collect information from multiple sensors obtained at different times. A natural solution is thus to turn to photo-tourism data. In this dataset we rely on 26 photo-tourism image collections of popular landmarks originally collected by the Yahoo Flickr Creative Commons 100M (YFCC) dataset and Reconstructing the world in six days. The sequences range from 75 to almost 4,000 images each.
We can obtain dense 3D reconstructions from these collections of images with off-the-shelf Structure from Motion (SfM) algorithms. We rely on COLMAP, a state-of-the-art method. In addition to a sparse point cloud, COLMAP is able to densify the estimates to produce noisy but useful depth maps for every image. We post-process these depth maps by projecting each image pixel to 3D space at the estimated depth, and mark it as invalid if the closest 3D point from the reconstruction is farther away than a threshold. The resulting depth maps are still noisy, but many occluded pixels are filtered out. While not perfect, these estimates can be used to project points across images and train keypoint detectors and descriptors, as done for example by LF-Net. We provide these ‘clean’ depth maps along with the images.
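The depth-map post-processing step can be illustrated with a minimal sketch. It is not the code used to build the dataset: the function name `filter_depth_map`, the pinhole camera model, the world-to-camera pose convention, the use of 0 to mark invalid pixels, and the brute-force nearest-neighbour search are all assumptions made for illustration.

```python
import numpy as np

def filter_depth_map(depth, K, R, t, points_3d, max_dist):
    """Invalidate depth pixels that lie far from every reconstructed 3D point.

    Assumptions (not taken from the dataset tools):
      depth     : (H, W) depth along the camera z-axis, 0 = no estimate.
      K         : (3, 3) pinhole intrinsics.
      R, t      : world-to-camera pose, x_cam = R @ x_world + t.
      points_3d : (N, 3) reconstructed points in world coordinates.
      max_dist  : distance threshold in the units of the reconstruction.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u.ravel(), v.ravel(), np.ones(h * w)], axis=0)

    # Back-project every pixel to camera space at its estimated depth.
    x_cam = (np.linalg.inv(K) @ pixels) * depth.ravel()
    # Move the points to world space: x_world = R^T (x_cam - t).
    x_world = (R.T @ (x_cam - np.reshape(t, (3, 1)))).T

    # Brute-force distance to the closest reconstructed point.
    # Fine for a small example; a k-d tree would be used at scale.
    diff = x_world[:, None, :] - points_3d[None, :, :]
    nearest = np.sqrt((diff ** 2).sum(axis=2).min(axis=1)).reshape(h, w)

    valid = (depth > 0) & (nearest <= max_dist)
    return np.where(valid, depth, 0.0)
```

For full-resolution images and large point clouds the pairwise distance array would not fit in memory, so a spatial index would be the natural replacement for the brute-force search.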
In order to guarantee a reasonable degree of overlap for each image pair, we perform a visibility check with the SfM points visible over both images. Our metric is based on the size of the bounding box containing all of the points visible in either image, and applied over both views. We use this criterion to select valid image subsets for testing, and provide the entire visibility matrix for training, which can be easily thresholded to generate a list of valid pairs.
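As a rough illustration of the visibility check and of thresholding the visibility matrix, consider the sketch below. The exact score and file format are not specified here, so everything in it is hypothetical: the score is assumed to be the bounding-box area of the shared points divided by the image area, and the matrix is assumed to be a square nested list where entry `[i][j]` scores the overlap as seen from image `i`.

```python
def bbox_fraction(points_2d, width, height):
    """Fraction of the image covered by the bounding box of the given 2D points.

    Hypothetical score: the text only states that the metric is based on
    the size of the bounding box.
    """
    if not points_2d:
        return 0.0
    xs = [p[0] for p in points_2d]
    ys = [p[1] for p in points_2d]
    box_area = (max(xs) - min(xs)) * (max(ys) - min(ys))
    return box_area / (width * height)


def valid_pairs(visibility, threshold):
    """List index pairs (i, j), i < j, that pass the threshold in both views."""
    n = len(visibility)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            # The check is applied over both views, so both entries must pass.
            if visibility[i][j] >= threshold and visibility[j][i] >= threshold:
                pairs.append((i, j))
    return pairs
```

Raising the threshold yields fewer but easier pairs, while lowering it admits pairs with less overlap.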
We provide 15 sequences for training and validation and 11 for testing.
|Training sequences||Num. images||Num. 3D points|
|Test sequences||Num. images||Num. 3D points|
The challenge consists of two tracks: stereo matching, and multi-view reconstruction with small subsets of images. New tasks and data modalities will be added in future editions.
In this task we match two images across wide baselines. Image pairs are selected according to visibility constraints so that at least part of the scene is guaranteed to overlap. The input can be a set of keypoints, or keypoints and correspondences. We primarily use two metrics:
Submissions can contain up to 8000 keypoints. For the purposes of the challenge, we rank entries by mAP15°, which we have found empirically to be an adequate proxy for wide-baseline stereo matching performance. Note that we also plan to allow submissions of the estimated poses in the future, to include dense methods based on deep networks.
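The computation of the ranking metric is not spelled out in this section, so the following is only a sketch under stated assumptions: the metric is read as mean average precision up to a 15-degree pose error, the pose error of a pair is taken as the larger of the rotation error and the translation-direction error, and accuracy is averaged over thresholds in 1-degree steps. The function names and the threshold spacing are illustrative and may differ from the official evaluation.

```python
import math
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Angle in degrees of the relative rotation between two rotation matrices."""
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return math.degrees(math.acos(float(np.clip(cos, -1.0, 1.0))))

def translation_error_deg(t_est, t_gt):
    """Angle in degrees between translation directions (stereo scale is unknown)."""
    denom = float(np.linalg.norm(t_est) * np.linalg.norm(t_gt))
    if denom == 0.0:
        return 180.0
    cos = float(np.dot(t_est, t_gt)) / denom
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

def pose_error_deg(R_est, t_est, R_gt, t_gt):
    """Assumed pose error: the worse of the two angular errors."""
    return max(rotation_error_deg(R_est, R_gt), translation_error_deg(t_est, t_gt))

def mean_average_precision(errors_deg, max_threshold=15):
    """Average, over thresholds 1..max_threshold degrees, of the fraction of
    pairs whose pose error is within the threshold (assumed 1-degree steps)."""
    errors = np.asarray(errors_deg, dtype=float)
    if errors.size == 0:
        return 0.0
    accuracies = [(errors <= th).mean() for th in range(1, max_threshold + 1)]
    return float(np.mean(accuracies))
```

A pair for which no pose can be estimated can be given an infinite error, so that it counts as a failure at every threshold.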
While modern solutions have shown very promising results in stereo, it is not clear how much of this improvement remains after large-scale reconstruction with Bundle Adjustment. An alternative approach is thus to evaluate local features directly for SfM, as done by the Comparative Evaluation Benchmark. Unfortunately, it is not feasible to obtain truly accurate depth measurements for large image sequences collected from heterogeneous sensors, so under most circumstances the best we can do is collect statistics such as the number of observations obtained with the reconstruction, the track length, or the reprojection error. While this is informative, most methods seem to perform similarly under this scenario.
By contrast, we propose to build SfM reconstructions from small (3, 5, 10, 25) subsets of images and use the poses obtained from the entire (much larger) set as ground truth. We believe this can provide a better proxy for learning and evaluating feature extractors and matching algorithms for the task of pose estimation.
Specifically, we subsample the test sets to 100 images and, from them, generate 100 different subsets of 3, 5, 10, and 25 images. The subsets are sampled randomly from each dataset, accounting for visibility constraints. In order to compute the mAP, we use the same procedure as for stereo, with every possible combination of two images (i.e. 3 combinations for 3 images, 10 for 5 images, etc.) and average the results. Note that this penalizes reconstructions that fail to register images. If COLMAP generates multiple 3D models which cannot be co-registered, we consider the largest one (the one with the most images). As for stereo, we use mAP15° to rank challenge entries.
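The subset protocol can be sketched as follows. This is an illustration rather than the official evaluation code: `sample_subsets` ignores the visibility constraints for brevity, `pose_error` is a hypothetical callback returning the pose error in degrees for a pair of registered images, and pairs involving an unregistered image are assumed to receive an infinite error so that they count as failures at any threshold.

```python
import itertools
import random

def sample_subsets(image_ids, size, count, seed=0):
    """Draw `count` random subsets of `size` images (visibility checks omitted)."""
    rng = random.Random(seed)
    return [rng.sample(image_ids, size) for _ in range(count)]

def pairwise_errors(subset, registered, pose_error):
    """Pose error for every pair of images in a subset.

    subset     : list of image ids used to build the small reconstruction.
    registered : set of image ids that ended up in the (largest) 3D model.
    pose_error : hypothetical callback, pose_error(a, b) -> error in degrees.
    """
    errors = []
    for a, b in itertools.combinations(subset, 2):
        if a in registered and b in registered:
            errors.append(pose_error(a, b))
        else:
            # An unregistered image makes every pair it belongs to a failure.
            errors.append(float("inf"))
    return errors
```

The resulting list of errors can then be scored in the same way as the stereo pairs, and the scores averaged over all subsets.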
The data can be downloaded here:
Code to parse the training data and format challenge submissions can be found here:
Results can be submitted with this link:
The submission website is password-protected to prevent abuse. Please contact the organizers at email@example.com for the password (please account for short delays in answering and uploading close to the deadline).
The Scape Imperial Localisation Dataset (SILDa) focuses on several localisation-related tasks, one of which is concerned with image matching under very significant changes in camera pose and environmental conditions. The data were collected around Imperial College London over a period of one year. For the interested reader, a more detailed description of the dataset can be found here.
The SILDa Image Matching task focuses on evaluating matching performance in pairs of images with significant difficulty. The ground truth models for the accurate camera pose estimation were built using an SfM pipeline. Given the nature of the data, some pairs are quite challenging because the possible matching area is constrained to a small region of an image.
The results will be based on evaluating the matching accuracy, in terms of pixel-level matching. Each method will produce a list of pixel-to-pixel matches for each image pair, which will then be evaluated in terms of their accuracy with respect to the underlying epipolar geometry.
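One common way to score pixel-to-pixel matches against epipolar geometry is the symmetric epipolar distance, sketched below. The exact measure and threshold used for SILDa are not stated here, so the choice of distance, the function names, and the `max_px` parameter are assumptions. `F` denotes a ground-truth fundamental matrix satisfying `x2^T F x1 = 0` for correct matches.

```python
import numpy as np

def symmetric_epipolar_distance(pts1, pts2, F):
    """Mean of the point-to-epipolar-line distances in both images, in pixels.

    pts1, pts2 : (N, 2) matched pixel coordinates in image 1 and image 2.
    F          : (3, 3) fundamental matrix with x2^T F x1 = 0.
    """
    pts1 = np.asarray(pts1, dtype=float)
    pts2 = np.asarray(pts2, dtype=float)
    ones = np.ones((len(pts1), 1))
    x1 = np.hstack([pts1, ones])
    x2 = np.hstack([pts2, ones])

    lines2 = x1 @ F.T  # epipolar lines in image 2 (F x1), one per row
    lines1 = x2 @ F    # epipolar lines in image 1 (F^T x2), one per row
    algebraic = np.abs(np.sum(x2 * lines2, axis=1))  # |x2^T F x1|

    d2 = algebraic / np.hypot(lines2[:, 0], lines2[:, 1])
    d1 = algebraic / np.hypot(lines1[:, 0], lines1[:, 1])
    return 0.5 * (d1 + d2)

def match_accuracy(pts1, pts2, F, max_px):
    """Fraction of matches within `max_px` pixels of their epipolar lines."""
    distances = symmetric_epipolar_distance(pts1, pts2, F)
    return float((distances <= max_px).mean())
```

Note that a match can lie on the correct epipolar line and still be wrong, so this kind of measure is a necessary rather than sufficient check on correctness.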
The data can be downloaded here:
Code to parse the test data and format challenge submissions can be found here:
Results can be submitted directly with this link: Please note that the submission process is different from the one for the Phototourism dataset.