IMW CVPR 2019: Challenge

Introduction

Local features have played a key role in a wide range of computer vision applications throughout the past 20 years, particularly since the introduction of SIFT.

Despite the drastic advancements resulting from deep learning techniques, 3D reconstruction under challenging conditions remains somewhat of an outlier, as performance in small, constrained benchmarks does not necessarily translate to real-world scenarios. In practice, keypoint-based methods remain the most common solution to this problem, and techniques such as SIFT or RANSAC are still very much in use.

Historically, machine learning research on local features has focused on learning patch descriptors, for which training data is relatively easy to obtain. However, performance on patch matching benchmarks is not always meaningful, as descriptors are tightly coupled with the keypoints they work on and image properties which can greatly vary from one domain or dataset to another. More representative metrics can be extracted further down the chain, as for instance at the 3D reconstruction level, but this requires better ground truth.

In parallel, there has been a strong push over the last few years towards tackling the image matching problem with dense methods, that is, doing away with keypoints altogether. While promising results have been demonstrated under narrow baselines, particularly with dense, deep networks, the general wide-baseline scenario remains unsolved. In order to enable research in this area, large-scale benchmarks with training data are required. However, current datasets are constrained in terms of size, photometric variations, and viewpoint changes.

There is thus a clear need for new, large-scale, challenging benchmarks to both train and evaluate new methods for image matching. To this end, we propose a new challenge with two datasets. Both contain training data with accurate ground truth poses, along with other cues that can be used for further supervision.

Dataset 1: Phototourism

In order to learn and evaluate models that can perform well under a wide range of situations, it is of paramount importance to collect data from multiple sensors obtained at different times. A natural solution is thus to turn to photo-tourism data. In this dataset we rely on 26 photo-tourism image collections of popular landmarks, originally collected for the Yahoo Flickr Creative Commons 100M (YFCC) dataset and by "Reconstructing the World in Six Days". The sequences range from 75 to nearly 4,000 images each.


Examples from brandenburg_gate

We can obtain dense 3D reconstructions from these collections of images with off-the-shelf Structure from Motion (SfM) algorithms. We rely on COLMAP, a state-of-the-art method. In addition to a sparse point cloud, COLMAP can densify its estimates to produce noisy but useful depth maps for every image. We post-process these depth maps by projecting each image pixel to 3D space at the estimated depth and marking it as invalid if the closest 3D point from the reconstruction is further than a threshold. The resulting depth maps are still noisy, but many occluded pixels are filtered out. While not perfect, these estimates can be used to project points across images and train keypoint detectors and descriptors, as done for example by LF-Net. We provide these ‘clean’ depth maps along with the images.
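
As a rough illustration of this cleaning step, the sketch below back-projects every pixel with a valid depth estimate, finds the nearest sparse SfM point, and invalidates pixels that lie too far from the cloud. This is only a sketch, not the exact procedure used to generate the released depth maps: the array layouts, the world-to-camera convention (x_cam = R x_world + t, as used by COLMAP), and the threshold value are assumptions.

    import numpy as np
    from scipy.spatial import cKDTree

    def clean_depth_map(depth, K, R, t, sparse_points_world, thresh=0.10):
        # depth:               (H, W) dense depth map, 0 where no estimate is available
        # K:                   (3, 3) camera intrinsics
        # R, t:                world-to-camera rotation (3, 3) and translation (3,)
        # sparse_points_world: (N, 3) sparse 3D points from the SfM reconstruction
        # thresh:              distance (scene units) beyond which a pixel is invalidated
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        valid = depth > 0

        # Back-project every valid pixel to a 3D point in camera coordinates.
        pix = np.stack([u[valid], v[valid], np.ones(valid.sum())])      # (3, M)
        pts_cam = (np.linalg.inv(K) @ pix) * depth[valid]               # (3, M)

        # Move to world coordinates, assuming x_cam = R @ x_world + t.
        pts_world = (R.T @ (pts_cam - t.reshape(3, 1))).T               # (M, 3)

        # Distance from each back-projected pixel to the nearest sparse point;
        # pixels far from the cloud are likely occluded or badly estimated.
        dist, _ = cKDTree(sparse_points_world).query(pts_world, k=1)

        cleaned = depth.copy()
        cleaned.flat[np.flatnonzero(valid)[dist > thresh]] = 0
        return cleaned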

In order to guarantee a reasonable degree of overlap for each image pair, we perform a visibility check using the SfM points visible in both images. Our metric is based on the size of the bounding box containing all of the points visible in either image, and is applied over both views. We use this criterion to select valid image subsets for testing, and provide the entire visibility matrix for training, which can be easily thresholded to generate a list of valid pairs.
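
A minimal way to turn the provided visibility matrix into a list of training pairs might look like the following. The exact semantics of the matrix entries, the file name, and a sensible threshold should be checked against the released data; all three are assumptions here.

    import numpy as np
    from itertools import combinations

    def select_pairs(visibility, min_overlap=0.1):
        # visibility:  (N, N) matrix where entry (i, j) scores how well the points
        #              seen in image i are covered in image j (not necessarily symmetric)
        # min_overlap: hypothetical threshold on that score
        n = visibility.shape[0]
        pairs = []
        for i, j in combinations(range(n), 2):
            # Require sufficient overlap in both directions.
            if visibility[i, j] >= min_overlap and visibility[j, i] >= min_overlap:
                pairs.append((i, j))
        return pairs

    # e.g. pairs = select_pairs(np.load('visibility.npy'))   # hypothetical file name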

We provide 15 sequences for training and validation and 11 for testing.

Training sequences Num. images Num. 3D points
brandenburg_gate 1363 100040
buckingham_palace 1676 234052
colosseum_exterior 2063 259807
grand_place_brussels 1083 229788
hagia_sophia_interior 888 235541
notre_dame_front_facade 3765 488895
palace_of_westminster 983 115868
pantheon_exterior 1401 166923
prague_old_town_square 2316 558600
sacre_coeur 1179 140659
st_peters_square 2504 232329
taj_mahal 1312 94121
temple_nara_japan 904 92131
trevi_fountain 3191 580673
westminster_abbey 1061 198222
Total 25.6k 3.7M

Test sequences Num. images Num. 3D points
british_museum 660 73569
florence_cathedral_side 108 44143
lincoln_memorial_statue 850 58661
london_bridge 629 72235
milan_cathedral 124 33905
mount_rushmore 138 45350
piazza_san_marco 249 95895
reichstag 75 17823
sagrada_familia 401 120723
st_pauls_cathedral 615 98872
united_states_capitol 258 35095
Total 4107 696k

The challenge consists of two tracks: stereo matching, and multi-view reconstruction with small subsets of images. New tasks and data modalities will be added in future editions.

Task 1: Wide-baseline stereo matching

In this task we match two images across wide baselines. Image pairs are selected according to visibility constraints so that at least part of the scene is guaranteed to overlap. The input can be a set of keypoints, or keypoints and correspondences. We primarily use two metrics:

  • Matching score: The ratio of ground truth correspondences that can be recovered with nearest neighbour matching (and optionally the ratio test), without a robust matching strategy. For this we rely on ground truth depth to translate pixel coordinates from one image to another and a fixed threshold to decide if two keypoints match. Occluded points that cannot be matched are excluded from the calculation.
  • Pose estimation: We apply robust matching with RANSAC and use the surviving inliers to retrieve the relative pose between the two cameras. We measure performance with the angular difference between the estimated and ground truth vectors, for both rotation and translation. To reduce this to a single value, we use a variable threshold (the same value for rotation and translation) to classify each pose as correct or incorrect, and compute the area under the curve up to the angular threshold x. This value is thus the mean average precision up to x, or mAPx. We consider thresholds of 5, 10, 15, 20, and 25°. A sketch of this computation follows the list.
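
The sketch below shows one way to compute such a pose metric: angular errors between the estimated and ground-truth rotations and translation directions, followed by the area under the accuracy-vs-threshold curve. The error conventions (e.g. treating t and -t as equivalent) and the threshold discretisation are assumptions and may differ from the official evaluation code.

    import numpy as np

    def pose_errors(R_gt, t_gt, R_est, t_est):
        # Rotation error: angle (degrees) of the residual rotation R_est^T R_gt.
        cos_r = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
        err_r = np.degrees(np.arccos(np.clip(cos_r, -1.0, 1.0)))
        # Translation error: angle between the translation directions. The scale is
        # unobservable in stereo, and this sketch treats t and -t as equivalent.
        cos_t = np.dot(t_gt, t_est) / (np.linalg.norm(t_gt) * np.linalg.norm(t_est))
        err_t = np.degrees(np.arccos(np.clip(abs(cos_t), 0.0, 1.0)))
        return err_r, err_t

    def pose_map(errors, max_deg=15.0, step=1.0):
        # errors: list of (err_r, err_t) per image pair. A pose is correct at
        # threshold x if both errors are below x; the score is the mean accuracy
        # over thresholds up to max_deg, i.e. the area under the curve.
        errors = np.asarray(errors, dtype=float)
        worst = np.maximum(errors[:, 0], errors[:, 1])
        thresholds = np.arange(step, max_deg + step, step)
        return float(np.mean([(worst <= x).mean() for x in thresholds]))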

Submissions can contain up to 8000 keypoints. For the purposes of the challenge, we rank entries by mAP at 15°, which we have found empirically to be an adequate proxy for wide-baseline stereo matching performance. Note that we also plan to allow submissions of the estimated poses in the future, in order to include dense methods based on deep networks.

Task 2: SfM from small subsets

While modern solutions have shown very promising results in stereo, it is not clear how much of this improvement remains after large-scale reconstruction with Bundle Adjustment. An alternative approach is thus to evaluate local features directly for SfM, as done by the Comparative Evaluation Benchmark. Unfortunately, it is not feasible to obtain truly accurate depth measurements for large image sequences collected from heterogeneous sensors, so under most circumstances the best we can do is collect statistics such as the number of observations in the reconstruction, the track length, or the reprojection error. While these statistics are informative, most methods seem to perform similarly under this evaluation.

By contrast, we propose to build SfM reconstructions from small (3, 5, 10, 25) subsets of images and use the poses obtained from the entire (much larger) set as ground truth. We believe this can provide a better proxy for learning and evaluating feature extractors and matching algorithms for the task of pose estimation.

Specifically, we subsample the test sets to 100 images and, from them, generate 100 different subsets of 3, 5, 10, and 25 images. The subsets are sampled randomly from each dataset, accounting for visibility constraints. In order to compute the mAP, we use the same procedure as for stereo over every possible combination of two images (i.e. 3 combinations for 3 images, 10 for 5 images, etc.), and average the results. Note that this penalizes reconstructions that fail to register images. If COLMAP generates multiple 3D models which cannot be co-registered, we consider the largest one (the one with the most images). As for stereo, we use mAP at 15° to rank challenge entries.
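
Reusing pose_errors and pose_map from the stereo sketch above, the multi-view scoring could be approximated as follows. The pose convention (world-to-camera [R | t]) and the handling of unregistered images are assumptions, intended only to illustrate how failed registrations penalise the score.

    import numpy as np
    from itertools import combinations

    def relative_pose(R_i, t_i, R_j, t_j):
        # Relative pose taking camera i to camera j, assuming x_cam = R @ x_world + t.
        R_rel = R_j @ R_i.T
        t_rel = t_j - R_rel @ t_i
        return R_rel, t_rel

    def multiview_map(registered, ground_truth, max_deg=15.0):
        # registered:   {image_id: (R, t)} for images COLMAP managed to register
        # ground_truth: {image_id: (R, t)} for every image in the subset
        # pose_errors and pose_map are defined in the stereo sketch above.
        errors = []
        for i, j in combinations(sorted(ground_truth), 2):
            if i in registered and j in registered:
                R_e, t_e = relative_pose(*registered[i], *registered[j])
                R_g, t_g = relative_pose(*ground_truth[i], *ground_truth[j])
                errors.append(pose_errors(R_g, t_g, R_e, t_e))
            else:
                # Pairs involving unregistered images count as failures.
                errors.append((np.inf, np.inf))
        return pose_map(errors, max_deg)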

The data can be downloaded here:

Code to parse the training data and format challenge submissions can be found here:

Results can be submitted with this link:

The submission website is password-protected to prevent abuse; please contact the organizers at imagematching@uvic.ca for the password (and please account for short delays in answering and uploading close to the deadline).

Dataset 2: SILDa Image Matching

The Scape Imperial Localisation Dataset (SILDa) covers several localisation-related tasks; the one included here concerns image matching under very significant changes in camera pose and environmental conditions. The data were collected around Imperial College London over a period of one year. For the interested reader, a more detailed description of the dataset can be found here.


The Royal School of Mines at Imperial College London, across different conditions.

Matching Task Description

The SILDa Image Matching task focuses on evaluating matching performance on pairs of images with significant difficulty. The ground truth models used for accurate camera pose estimation were built with an SfM pipeline. Given the nature of the data, some pairs are quite challenging because the area where matches are possible is constrained to a small region of the image.






Example of the pairs available in the SILDa Image Matching dataset.

The results are based on matching accuracy at the pixel level. Each method produces a list of pixel-to-pixel matches for each image pair, which are then evaluated for accuracy with respect to the underlying epipolar geometry.
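
One standard way to score matches against epipolar geometry, assuming the ground truth is available as a fundamental matrix F per pair, is the symmetric epipolar distance sketched below. Whether the organisers use this exact metric or threshold is not specified here; this is an illustration only.

    import numpy as np

    def symmetric_epipolar_distance(pts1, pts2, F):
        # pts1, pts2: (N, 2) pixel coordinates of the putative matches
        # F:          (3, 3) fundamental matrix such that x2^T F x1 = 0 for true matches
        ones = np.ones((pts1.shape[0], 1))
        x1 = np.hstack([pts1, ones])              # homogeneous coordinates, (N, 3)
        x2 = np.hstack([pts2, ones])
        l2 = x1 @ F.T                             # epipolar lines in image 2 (F x1)
        l1 = x2 @ F                               # epipolar lines in image 1 (F^T x2)
        alg = np.abs(np.sum(x2 * l2, axis=1))     # |x2^T F x1| for each match
        d2 = alg / np.hypot(l2[:, 0], l2[:, 1])   # distance of x2 to its epipolar line
        d1 = alg / np.hypot(l1[:, 0], l1[:, 1])   # distance of x1 to its epipolar line
        return d1 + d2

    # A match could then be counted as correct if its distance falls below a pixel
    # threshold, e.g. accuracy = (symmetric_epipolar_distance(p1, p2, F) < 3.0).mean()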

The data can be downloaded here:

Code to parse the test data and format challenge submissions can be found here:

Results can be submitted directly with this link: Please note that the submission process is different from the one for the Phototourism dataset.