IMW CVPR 2019: Challenge


Local features have played a key role in a wide range of computer vision applications throughout the past 20 years, particularly since the introduction of SIFT. Despite the drastic advancements resulting from deep learning techniques, 3D reconstruction under challenging conditions remains somewhat of an outlier, as performance in small, constrained benchmarks does not necessarily translate to real-world scenarios. In practice, keypoint-based methods remain the most common solution to this problem, and techniques such as SIFT or RANSAC are still very much in use.

Historically, machine learning research on local features has focused on learning patch descriptors, for which training data is relatively easy to obtain. However, performance on patch matching benchmarks is not always meaningful, as descriptors are tightly coupled with the keypoints they operate on and with image properties that can vary greatly from one domain or dataset to another. More representative metrics can be extracted further down the chain, for instance at the 3D reconstruction level, but this requires better ground truth.

In parallel, there has been a strong push over the last few years towards tackling the image matching problem with dense methods, that is, doing away with keypoints altogether. While promising results have been demonstrated under narrow baselines, particularly with dense, deep networks, the general wide-baseline scenario remains unsolved. In order to enable research in this area, large-scale benchmarks with training data are required. However, current datasets are constrained in terms of size, photometric variations, and viewpoint changes.

There is thus a clear need for new, large-scale, challenging benchmarks to both train and evaluate new methods for image matching. To this end, we propose a new challenge with two datasets. Both contain training data with accurate ground truth poses, along with other cues that can be used for further supervision.

Dataset 1: Phototourism

In order to learn and evaluate models that can perform well under a wide range of situations, it is of paramount importance to collect information from multiple sensors obtained at different times. A natural solution is thus to turn to photo-tourism data. In this dataset we rely on 26 photo-tourism image collections of popular landmarks, drawn from the Yahoo Flickr Creative Commons 100M (YFCC) dataset and "Reconstructing the world in six days". The sequences range from 75 to almost 4k images each.

Examples from brandenburg_gate

We can obtain dense 3D reconstructions from these collections of images with off-the-shelf Structure from Motion (SfM) algorithms. We rely on COLMAP, a state-of-the-art method. In addition to a sparse point cloud, COLMAP is able to densify the estimates to produce noisy but useful depth maps for every image. We post-process these depth maps by projecting each image pixel to 3D space at the estimated depth, and mark it as invalid if the closest 3D point from the reconstruction is further away than a threshold. The resulting depth maps are still noisy, but many occluded pixels are filtered out. While not perfect, these estimates can be used to project points across images and train keypoint detectors and descriptors, as done for example by LF-Net. We provide these ‘clean’ depth maps along with the images.
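Projecting a point across images with a depth map can be sketched as follows. This is a minimal illustration, not the challenge's own code: the function name `reproject`, the pinhole intrinsics `K`, and the world-to-camera pose convention (x_cam = R x_world + t) are assumptions, and the released data may use a different format.

```python
import numpy as np


def reproject(uv, depth, K1, R1, t1, K2, R2, t2):
    """Project pixel `uv` of image 1, at z-depth `depth`, into image 2.

    Assumed conventions (illustrative only): pinhole intrinsics K, and
    world-to-camera poses such that x_cam = R @ x_world + t.
    Returns the pixel in image 2 and the point's depth in camera 2.
    """
    u, v = uv
    # Back-project the pixel to a 3D point in the frame of camera 1.
    x_cam1 = depth * (np.linalg.inv(K1) @ np.array([u, v, 1.0]))
    # Camera 1 frame -> world frame.
    x_world = R1.T @ (x_cam1 - t1)
    # World frame -> camera 2 frame.
    x_cam2 = R2 @ x_world + t2
    # Perspective projection into image 2.
    x_img2 = K2 @ x_cam2
    return x_img2[:2] / x_img2[2], x_cam2[2]
```

The second return value can be compared against the depth map of image 2 at the projected pixel; a large disagreement suggests that the point is occluded in that view.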

In order to guarantee a reasonable degree of overlap for each image pair, we perform a visibility check with the SfM points visible over both images. Our metric is based on the size of the bounding box containing all of the points visible in either image, and applied over both views. We use this criterion to select valid image subsets for testing, and provide the entire visibility matrix for training, which can be easily thresholded to generate a list of valid pairs.
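Thresholding the visibility matrix might look like the sketch below. The layout is an assumption made for illustration: a square N x N array whose entry (i, j) scores the overlap of the pair as measured in image i, so that the matrix may be asymmetric. The function name `valid_pairs` and the threshold value are hypothetical.

```python
import numpy as np


def valid_pairs(visibility, threshold):
    """Return index pairs (i, j), i < j, whose visibility score passes
    `threshold` in both directions."""
    vis = np.asarray(visibility, dtype=float)
    # Require the criterion to hold over both views.
    ok = (vis >= threshold) & (vis.T >= threshold)
    # Keep the upper triangle only, so that each pair is listed once.
    i, j = np.nonzero(np.triu(ok, k=1))
    return list(zip(i.tolist(), j.tolist()))


# Toy 3-image example with made-up scores.
example = [[1.0, 0.5, 0.1],
           [0.6, 1.0, 0.4],
           [0.3, 0.2, 1.0]]
pairs = valid_pairs(example, threshold=0.3)
```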

We provide 15 sequences for training and validation and 11 for testing.

Training sequences        Num. images   Num. 3D points
brandenburg_gate                 1363           100040
buckingham_palace                1676           234052
colosseum_exterior               2063           259807
grand_place_brussels             1083           229788
hagia_sophia_interior             888           235541
notre_dame_front_facade          3765           488895
palace_of_westminster             983           115868
pantheon_exterior                1401           166923
prague_old_town_square           2316           558600
sacre_coeur                      1179           140659
st_peters_square                 2504           232329
taj_mahal                        1312            94121
temple_nara_japan                 904            92131
trevi_fountain                   3191           580673
westminster_abbey                1061           198222
Total                           25.7k             3.7M

Test sequences            Num. images   Num. 3D points
british_museum                    660            73569
florence_cathedral_side           108            44143
lincoln_memorial_statue           850            58661
london_bridge                     629            72235
milan_cathedral                   124            33905
mount_rushmore                    138            45350
piazza_san_marco                  249            95895
reichstag                          75            17823
sagrada_familia                   401           120723
st_pauls_cathedral                615            98872
united_states_capitol             258            35095
Total                            4107             696k

The challenge consists of two tracks: stereo matching, and multi-view reconstruction with small subsets of images. New tasks and data modalities will be added in future editions.

Task 1: Wide-baseline stereo matching

In this task we match two images across wide baselines. Image pairs are selected according to visibility constraints so that at least part of the scene is guaranteed to overlap. The input can be a set of keypoints, or keypoints and correspondences. We primarily use two metrics:

  • Matching score: The ratio of ground truth correspondences that can be recovered with nearest neighbour matching (and optionally the ratio test), without a robust matching strategy. For this we rely on ground truth depth to translate pixel coordinates from one image to another and a fixed threshold to decide if two keypoints match. Occluded points that cannot be matched are excluded from the calculation.
  • Pose estimation: We apply robust matching with RANSAC and use the surviving inliers to retrieve the relative pose between the two cameras. We measure performance with the angular difference between the estimated and ground truth vectors for both rotation and translation. To reduce this to one value, we use a variable threshold (the same value for rotation and translation) to determine each pose as correct or not, and compute the area under the curve up to the angular threshold x. This value is thus the mean average precision up to x, or mAPx. We consider 5°, 10°, 15°, 20°, and 25°.
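The pose metric can be sketched as below. This is a hedged approximation rather than the official evaluation code: it assumes that rotations are given as 3x3 matrices, that the translation error is the angle between direction vectors, that the accuracy curve is sampled every 5 degrees, and that each pair is scored by the larger of its two errors. All function names are hypothetical.

```python
import numpy as np


def rotation_error_deg(R_est, R_gt):
    """Angle, in degrees, of the relative rotation between two matrices."""
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def translation_error_deg(t_est, t_gt):
    """Angle, in degrees, between two translation directions.

    Only directions are compared, because the scale of the translation
    cannot be recovered from two views.
    """
    cos = np.dot(t_est, t_gt) / (np.linalg.norm(t_est) * np.linalg.norm(t_gt))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def mean_average_precision(errors_deg, max_threshold, step=5):
    """Mean accuracy over thresholds step, 2*step, ..., max_threshold.

    `errors_deg` holds one value per image pair: the larger of its
    rotation and translation errors, so that a pose counts as correct
    only if both are below the threshold.
    """
    errors = np.asarray(errors_deg, dtype=float)
    thresholds = np.arange(step, max_threshold + step, step)
    accuracies = [(errors <= th).mean() for th in thresholds]
    return float(np.mean(accuracies))
```

For example, four pairs with errors of 3, 7, 12 and 40 degrees are correct at rates of 0.25, 0.5 and 0.75 for thresholds of 5, 10 and 15 degrees, so `mean_average_precision([3, 7, 12, 40], 15)` averages these three accuracies.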

Submissions can contain up to 8000 keypoints. For the purposes of the challenge, we rank entries by mAP15°, which we have found empirically to be an adequate proxy for wide-baseline stereo matching performance. Note that we also plan to allow submissions of the estimated poses in the future, to include dense methods based on deep networks.

Task 2: SfM from small subsets

While modern solutions have shown very promising results in stereo, it is not clear how much of these improvements remain after large-scale reconstruction with Bundle Adjustment. An alternative approach is thus to evaluate local features directly for SfM, as done by the Comparative Evaluation Benchmark. Unfortunately, it is not feasible to obtain truly accurate depth measurements for large image sequences collected from heterogeneous sensors, so under most circumstances the best we can do is collect statistics such as the number of observations obtained with the reconstruction, the track length, or the reprojection error. While this is informative, most methods seem to perform similarly under this scenario.

By contrast, we propose to build SfM reconstructions from small (3, 5, 10, 25) subsets of images and use the poses obtained from the entire (much larger) set as ground truth. We believe this can provide a better proxy for learning and evaluating feature extractors and matching algorithms for the task of pose estimation.

Specifically, we subsample the test sets to 100 images and, from them, generate 100 different subsets of 3, 5, 10, and 25 images. The subsets are sampled randomly from each dataset, accounting for visibility constraints. In order to compute the mAP, we use the same procedure as for stereo, with every possible combination of two images (i.e. 3 combinations for 3 images, 10 for 5 images, etc.) and average the results. Note that this penalizes reconstructions that fail to register images. If COLMAP generates multiple 3D models which cannot be co-registered, we consider the largest one (the one with the most images). As for stereo, we use mAP15° to rank challenge entries.
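The pair enumeration can be illustrated with a short sketch. The callback `pair_error` is hypothetical, and scoring a pair with an unregistered image as an infinite error is one assumed way to express the penalty for images that fail to register.

```python
from itertools import combinations
from math import comb


def pairwise_errors(subset, pair_error):
    """Collect one pose error per image pair in `subset`.

    `pair_error(a, b)` is a hypothetical callback returning the pose
    error in degrees for a pair, or None when either image was not
    registered in the reconstruction.
    """
    errors = []
    for a, b in combinations(subset, 2):
        err = pair_error(a, b)
        # A pair with an unregistered image can never pass a threshold.
        errors.append(float("inf") if err is None else err)
    return errors


# Number of image pairs evaluated for each subset size.
pairs_per_size = {n: comb(n, 2) for n in (3, 5, 10, 25)}
```

Here `pairs_per_size` maps subset sizes 3, 5, 10 and 25 to 3, 10, 45 and 300 pairs respectively.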

The data can be downloaded here:

Code to parse the training data and format challenge submissions can be found here:

Results can be submitted with this link:

The submission website is password-protected to prevent abuse. Please contact the organizers for the password, and allow for short delays in answering and uploading close to the deadline.

Dataset 2: SILDa Image Matching

The Scape Imperial Localisation Dataset (SILDa) focuses on several localisation-related tasks, one of which is concerned with image matching under very significant changes in camera pose and environmental conditions. The data were collected around Imperial College London over a period of one year. For the interested reader, a more detailed description of the dataset can be found here.

The Royal School of Mines at Imperial College London, across different conditions.

Matching Task Description

The SILDa Image Matching task focuses on evaluating matching performance on pairs of images with significant difficulty. The ground truth models for accurate camera pose estimation were built using an SfM pipeline. Given the nature of the data, some pairs are quite challenging because the possible matching area is constrained to a small region of an image.

Example of the pairs available in the SILDa Image Matching dataset.

The results will be based on evaluating the matching accuracy at the pixel level. Each method will produce a list of pixel-to-pixel matches for each image pair, which will then be evaluated in terms of their accuracy with respect to the underlying epipolar geometry.
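One common way to score matches against epipolar geometry is the symmetric epipolar distance, sketched below. The exact measure used by the task is not specified here, so this choice is an assumption, as are the function name and the convention that the fundamental matrix F satisfies x2^T F x1 = 0 for a true match.

```python
import numpy as np


def symmetric_epipolar_distance(F, pts1, pts2):
    """Symmetric point-to-epipolar-line distance, in pixels, per match.

    F is the fundamental matrix with x2^T F x1 = 0 for a true match;
    pts1 and pts2 are (N, 2) arrays of matched pixel coordinates.
    """
    pts1 = np.asarray(pts1, dtype=float)
    pts2 = np.asarray(pts2, dtype=float)
    ones = np.ones((len(pts1), 1))
    x1 = np.hstack([pts1, ones])  # homogeneous coordinates, (N, 3)
    x2 = np.hstack([pts2, ones])
    lines2 = x1 @ F.T  # epipolar lines in image 2 (rows are F x1)
    lines1 = x2 @ F    # epipolar lines in image 1 (rows are F^T x2)
    # Algebraic residual |x2^T F x1| for every match.
    residual = np.abs(np.sum(x2 * lines2, axis=1))
    d2 = residual / np.linalg.norm(lines2[:, :2], axis=1)
    d1 = residual / np.linalg.norm(lines1[:, :2], axis=1)
    return 0.5 * (d1 + d2)
```

A match could then be counted as correct when its distance falls below a pixel threshold, with the threshold itself being a further assumption.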

The data can be downloaded here:

Code to parse the test data and format challenge submissions can be found here:

Results can be submitted directly with this link: Please note that the submission process is different from the one for the Phototourism dataset.