Semapolis project: Results
https://project.inria.fr/semapolis/results/ (20 Nov 2018)

The main results of the Semapolis project can be summarized as follows.

Some address generic problems, not specific to urban data, and were tested on general benchmarks.

Extraction and Understanding of Visual Features

Extracting relevant visual features is a basic but key ingredient of many computer vision problems, including recognition, object detection, semantization, etc. One issue is that deep neural networks, which are nowadays often used for these tasks, are black boxes whose behavior is hard to understand. We have proposed a method to understand deep features using computer-generated imagery [Aubry & Russell ICCV 2015]. Another problem is that deep learning often requires large annotated datasets, which are difficult and expensive to create. We have developed an unsupervised method to learn relevant visual features [Gidaris et al. ICLR 2018].

Understanding Deep Features with Computer-Generated Imagery. We introduce an approach for analyzing the variation of features generated by convolutional neural networks (CNNs) with respect to scene factors that occur in natural images. Such factors may include object style, 3D viewpoint, color, and scene lighting configuration. Our approach [Aubry & Russell ICCV 2015] analyzes CNN feature responses corresponding to different scene factors by controlling for them via rendering using a large database of 3D CAD models. The rendered images are presented to a trained CNN and responses for different layers are studied with respect to the input scene factors. We perform a decomposition of the responses based on knowledge of the input scene factors and analyze the resulting components. In particular, we quantify their relative importance in the CNN responses and visualize them using principal component analysis. We show qualitative and quantitative results of our study on three CNNs trained on large image datasets: AlexNet, Places, and Oxford VGG. We observe important differences across the networks and CNN layers for different scene factors and object categories. Finally, we demonstrate that our analysis based on computer-generated imagery translates to the network representation of natural images. See also the project page.
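As a toy illustration of this kind of decomposition (in numpy, with made-up factor names and sizes, not the paper's actual setup), feature responses on a rendered grid of styles and viewpoints can be split into per-factor components whose relative importance is then measured:

```python
import numpy as np

rng = np.random.default_rng(0)
S, V, D = 4, 6, 8                      # hypothetical: styles, viewpoints, feature dims
F = rng.normal(size=(S, V, D))         # toy CNN responses for rendered views

mean = F.mean(axis=(0, 1), keepdims=True)      # grand mean
style = F.mean(axis=1, keepdims=True) - mean   # style component (S, 1, D)
view = F.mean(axis=0, keepdims=True) - mean    # viewpoint component (1, V, D)
residual = F - mean - style - view             # interaction / noise

# Relative importance of each factor (shares sum to 1 by construction):
total = ((F - mean) ** 2).sum()
share = {
    "style": (style ** 2).sum() * V / total,
    "viewpoint": (view ** 2).sum() * S / total,
    "residual": (residual ** 2).sum() / total,
}
```

Because the cross terms between components cancel exactly, the three shares always sum to one, which is what makes the "relative importance" reading well defined.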

Unsupervised Representation Learning by Predicting Image Rotations. In recent years, deep convolutional neural networks (ConvNets) have transformed the field of computer vision thanks to their unparalleled capacity to learn high-level semantic image features. However, in order to successfully learn those features, they usually require massive amounts of manually labeled data, which is both expensive and impractical to scale. Unsupervised semantic feature learning, i.e., learning without manual annotation effort, is therefore of crucial importance in order to successfully harvest the vast amount of visual data available today. In [Gidaris et al. ICLR 2018] we propose to learn image features by training ConvNets to recognize the 2D rotation applied to their input image. We demonstrate both qualitatively and quantitatively that this apparently simple task actually provides a very powerful supervisory signal for semantic feature learning. We exhaustively evaluate our method on various unsupervised feature learning benchmarks and achieve state-of-the-art performance in all of them, demonstrating dramatic improvements over prior state-of-the-art approaches in unsupervised representation learning and thus significantly closing the gap with supervised feature learning. For instance, on the PASCAL VOC 2007 detection task, our unsupervised pre-trained AlexNet model achieves a state-of-the-art (among unsupervised methods) mAP of 54.4%, only 2.4 points lower than the supervised case. We obtain similarly striking results when transferring our unsupervised learned features to various other tasks, such as ImageNet classification, PASCAL classification, PASCAL segmentation, and CIFAR-10 classification. See also the project page.
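The pretext task itself is simple to reproduce; a minimal sketch of the batch construction (numpy, with dummy images, purely illustrative):

```python
import numpy as np

def rotation_batch(images):
    """Build a self-supervision batch: each image is rotated by 0, 90,
    180 and 270 degrees, labeled by the rotation index to predict."""
    rotated, labels = [], []
    for img in images:
        for k in range(4):                  # k * 90 degrees
            rotated.append(np.rot90(img, k))
            labels.append(k)
    return np.stack(rotated), np.array(labels)

imgs = np.random.rand(2, 32, 32, 3)         # two dummy RGB images
x, y = rotation_batch(imgs)                  # 8 rotated images, 8 labels
```

A ConvNet trained to classify `y` from `x` has to attend to object shape and orientation cues, which is why this cheap signal yields semantically useful features.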

Place and Architecture Style Recognition

Large-scale visual place recognition is a challenging problem, especially due to ambiguities, changes of appearance (day/night, season, aging, structural evolution) and the scarcity of samples. We address it in various ways, leveraging repetitive structures [Torii et al. PAMI 2015], novel view synthesis [Torii et al. CVPR 2015, PAMI 2017], and weakly-supervised learning [Arandjelovic et al. CVPR 2016, PAMI 2017]. We also propose a learning-based method for inferring approximate construction dates and architecture styles [Lee et al. ICCP 2015].

Visual place recognition with repetitive structures. Repeated structures such as building facades, fences or road markings often represent a significant challenge for place recognition. Repeated structures are notoriously hard for establishing correspondences using multi-view geometry. They violate the feature independence assumed in the bag-of-visual-words representation which often leads to over-counting evidence and significant degradation of retrieval performance. We show [Torii et al. PAMI 2015] that repeated structures are not a nuisance but, when appropriately represented, they form an important distinguishing feature for many places. We describe a representation of repeated structures suitable for scalable retrieval and geometric verification. The retrieval is based on robust detection of repeated image structures and a suitable modification of weights in the bag-of-visual-word model. We also demonstrate that the explicit detection of repeated patterns is beneficial for robust visual word matching for geometric verification. Place recognition results are shown on datasets of street-level imagery from Pittsburgh and San Francisco demonstrating significant gains in recognition performance compared to the standard bag-of-visual-words baseline as well as the more recently proposed burstiness weighting and Fisher vector encoding. See also the project page.
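To give the flavor of down-weighting repeated visual words, here is a minimal sketch using square-root weighting as a stand-in; the paper's actual scheme, based on explicitly detected repetition groups, is more elaborate:

```python
import numpy as np

def soft_bow(word_counts):
    """Down-weight repeated visual words so that a burst of identical
    features (e.g. a facade of identical windows) does not dominate
    the bag-of-visual-words histogram. Square-root weighting is a
    standard remedy for this over-counting."""
    w = np.sqrt(np.asarray(word_counts, dtype=float))
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w

# A burst of 25 identical window-like features vs. a few distinctive ones:
h = soft_bow([25, 1, 4, 0])
```

A word seen 25 times now weighs only 5 times as much as a word seen once, instead of 25 times, so distinctive features keep their say in retrieval.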

24/7 Place Recognition by View Synthesis. We address the problem of large-scale visual place recognition for situations where the scene undergoes a major change in appearance, for example, due to illumination (day/night), change of seasons, aging, or structural modifications over time such as buildings being built or destroyed. Such situations represent a major challenge for current large-scale place recognition methods. This work [Torii et al. CVPR 2015, PAMI 2017] has the following three principal contributions. First, we demonstrate that matching across large changes in the scene appearance becomes much easier when both the query image and the database image depict the scene from approximately the same viewpoint. Second, based on this observation, we develop a new place recognition approach that combines (i) an efficient synthesis of novel views with (ii) a compact indexable image representation. Third, we introduce a new challenging dataset of 1,125 camera-phone query images of Tokyo that contain major changes in illumination (day, sunset, night) as well as structural changes in the scene. We demonstrate that the proposed approach significantly outperforms other large-scale place recognition techniques on this challenging data. See also the project page.
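The core retrieval strategy, matching the query against synthesized views of each database place and keeping the place whose best view is closest, can be sketched as follows (numpy, with random stand-in descriptors):

```python
import numpy as np

rng = np.random.default_rng(1)
P, V, D = 5, 4, 16   # places, synthesized views per place, descriptor dim
db = rng.normal(size=(P, V, D))
db /= np.linalg.norm(db, axis=-1, keepdims=True)   # unit-norm descriptors

def localize(query):
    """Match the query descriptor against every synthesized view of every
    database place; return the place whose best view is closest."""
    q = query / np.linalg.norm(query)
    d = np.linalg.norm(db - q, axis=-1)   # (P, V) distances
    return int(d.min(axis=1).argmin())

# A query near view 2 of place 3 should retrieve place 3:
query = db[3, 2] + 0.01 * rng.normal(size=D)
```

Synthesizing views trades extra database-side computation for a much easier matching problem, since query and database then depict the scene from roughly the same viewpoint.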

NetVLAD: CNN architecture for weakly supervised place recognition. We tackle the problem of large scale visual place recognition, where the task is to quickly and accurately recognize the location of a given query photograph [Arandjelovic et al. CVPR 2016, PAMI 2017]. We present the following four principal contributions. First, we develop a convolutional neural network (CNN) architecture that is trainable in an end-to-end manner directly for the place recognition task. The main component of this architecture, NetVLAD, is a new generalized VLAD layer, inspired by the “Vector of Locally Aggregated Descriptors” image representation commonly used in image retrieval. The layer is readily pluggable into any CNN architecture and amenable to training via backpropagation. Second, we create a new weakly supervised ranking loss, which enables end-to-end learning of the architecture’s parameters from images depicting the same places over time downloaded from Google Street View Time Machine. Third, we develop an efficient training procedure which can be applied on very large-scale weakly labelled tasks. Finally, we show that the proposed architecture and training procedure significantly outperform non-learnt image representations and off-the-shelf CNN descriptors on challenging place recognition and image retrieval benchmarks. See also the project page.
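A simplified numpy sketch of the aggregation at the heart of the NetVLAD layer, soft assignment plus residual pooling and normalization; the real layer learns its assignment and center parameters end-to-end by backpropagation:

```python
import numpy as np

def netvlad_pool(features, centers, alpha=10.0):
    """Soft-assignment VLAD pooling: each local feature contributes its
    residual to every cluster center, weighted by a softmax over its
    distances to the centers."""
    # features: (N, D) local descriptors; centers: (K, D) cluster centers
    d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (N, K)
    a = np.exp(-alpha * d2)
    a /= a.sum(axis=1, keepdims=True)                                  # soft assignment
    resid = features[:, None, :] - centers[None, :, :]                 # (N, K, D)
    vlad = (a[:, :, None] * resid).sum(axis=0)                         # (K, D)
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12        # intra-normalization
    vlad = vlad.ravel()
    return vlad / (np.linalg.norm(vlad) + 1e-12)                       # final L2 norm

feats = np.random.rand(100, 8)
centers = np.random.rand(4, 8)
v = netvlad_pool(feats, centers)   # 4 * 8 = 32-dim global descriptor
```

Every operation here is differentiable, which is what makes the layer pluggable into a CNN and trainable with a ranking loss.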

Linking Past to Present: Discovering Style in Two Centuries of Architecture. With vast quantities of imagery now available online, researchers have begun to explore whether visual patterns can be discovered automatically. In [Lee et al. ICCP 2015], we consider the particular domain of architecture, using huge collections of street-level imagery to find visual patterns that correspond to semantic-level architectural elements distinctive to particular time periods. We use this analysis both to date buildings and to discover how functionally-similar architectural elements (e.g. windows, doors, balconies, etc.) have changed over time due to evolving styles. We validate the methods by combining a large dataset of nearly 150,000 Google Street View images from Paris with a cadastre map to infer an approximate construction date for each facade. Not only could our analysis be used for dating or geo-localizing buildings based on architectural features, but it could also give architects and historians new tools for confirming known theories or even discovering new ones. See also the supplementary material, the project page, and the interactive online demo.

2D-2D and 2D-3D Alignment and Detection

To go beyond the recognition of real photographs, we have also studied the alignment of arbitrary 2D depictions (drawings, paintings, historical photographs) with 3D models, and the use of large collections of 3D CAD models for object detection in images.

Visual geo-localization of non-photographic depictions via 2D-3D alignment. We propose a technique that can geo-localize arbitrary 2D depictions of architectural sites, including drawings, paintings and historical photographs [Aubry et al. SIGGRAPH 2014 / TOG 2014, Springer book 2015]. This is achieved by aligning the input depiction with a 3D model of the corresponding site. The task is very difficult as the appearance and scene structure in the 2D depictions can be very different from the appearance and geometry of the 3D model, e.g., due to the specific rendering style, drawing error, age, lighting or change of seasons. In addition, we face a hard search problem: the number of possible alignments of the depiction to a set of 3D models from different architectural sites is huge. To address these issues, we develop a compact representation of complex 3D scenes. 3D models of several scenes are represented by a set of discriminative visual elements that are automatically learnt from rendered views. Similar to object detection, the set of visual elements, as well as the weights of individual features for each element, are learnt in a discriminative fashion. We show that the learnt visual elements are reliably matched in 2D depictions of the scene despite large variations in rendering style (e.g. watercolor, sketch, historical photograph) and structural changes (e.g. missing scene parts, large occluders) of the scene. We demonstrate that the proposed approach can automatically identify the correct architectural site as well as recover an approximate viewpoint of historical photographs and paintings with respect to the 3D model of the site. See also the project page.

Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models. We pose object category detection in images as a type of 2D-to-3D alignment problem, utilizing the large quantities of 3D CAD models that have been made publicly available online [Aubry et al. CVPR 2014]. Using the “chair” class as a running example, we propose an exemplar-based 3D category representation, which can explicitly model chairs of different styles as well as the large variation in viewpoint. We develop an approach to establish part-based correspondences between 3D CAD models and real photographs. This is achieved by (i) representing each 3D model using a set of view-dependent mid-level visual elements learned from synthesized views in a discriminative fashion, (ii) carefully calibrating the individual element detectors on a common dataset of negative images, and (iii) matching visual elements to the test image allowing for small mutual deformations but preserving the viewpoint and style constraints. We demonstrate the ability of our system to align 3D models with 2D objects in the challenging PASCAL VOC images, which depict a wide variety of chairs in complex scenes. See also the project page.

Deep Exemplar 2D-3D Detection by Adapting from Real to Rendered Views. Francisco Massa, Bryan Russell, Mathieu Aubry. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016). See also the project page.

Object Recognition, Detection and Pose Estimation

Is object localization for free? – Weakly-supervised learning with convolutional neural networks. Maxime Oquab, Léon Bottou, Ivan Laptev, Josef Sivic. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015).

Convolutional Neural Network Architecture for Geometric Matching. We address the problem of determining correspondences between two images in agreement with a geometric model such as an affine or thin-plate spline transformation, and estimating its parameters [Rocco et al. CVPR 2017]. The contributions of this work are three-fold. First, we propose a convolutional neural network architecture for geometric matching. The architecture is based on three main components that mimic the standard steps of feature extraction, matching and simultaneous inlier detection and model parameter estimation, while being trainable end-to-end. Second, we demonstrate that the network parameters can be trained from synthetically generated imagery without the need for manual annotation and that our matching layer significantly increases generalization capabilities to never seen before images. Finally, we show that the same model can perform both instance-level and category-level matching giving state-of-the-art results on the challenging Proposal Flow dataset. See also the project page.
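The first of the three components, dense feature matching, can be sketched as a normalized-correlation layer between two feature maps (numpy, with illustrative sizes):

```python
import numpy as np

def correlation_layer(fA, fB):
    """Dense matching layer: normalized correlation between every spatial
    location of feature map A and every location of feature map B,
    producing a correlation volume for the downstream regressor."""
    # fA, fB: (H, W, D) feature maps, e.g. from a shared CNN trunk
    H, W, D = fA.shape
    a = fA.reshape(-1, D)
    b = fB.reshape(-1, D)
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-12)
    return (a @ b.T).reshape(H, W, H * W)   # correlation volume

fA = np.random.rand(7, 7, 16)
fB = np.random.rand(7, 7, 16)
c = correlation_layer(fA, fB)               # shape (7, 7, 49)
```

In the full architecture a small regression network then maps this volume to the parameters of an affine or thin-plate spline transformation, with everything trained end-to-end.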

Object Detection via a Multi-region and Semantic Segmentation-aware CNN Model. Spyros Gidaris, Nikos Komodakis. IEEE International Conference on Computer Vision (ICCV 2015). See also the extended technical report and code repository.


LocNet: Improving Localization Accuracy for Object Detection. Spyros Gidaris, Nikos Komodakis. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016). See also the extended technical report and code repository.

Crafting a multi-task CNN for viewpoint estimation. Francisco Massa, Renaud Marlet, Mathieu Aubry. 27th British Machine Vision Conference (BMVC 2016). See also the project page.

Attend Refine Repeat: Active Box Proposal Generation via In-Out Localization. Spyros Gidaris, Nikos Komodakis. 27th British Machine Vision Conference (BMVC 2016). See also the code repository.

Dynamic Few-Shot Visual Learning without Forgetting. Spyros Gidaris, Nikos Komodakis. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018). See also the project page.

High-level 3D Reconstruction

Statistical criteria for shape fusion and selection. Alexandre Boulch, Renaud Marlet. 22nd International Conference on Pattern Recognition (ICPR 2014). See also the code repository.

Piecewise-Planar 3D Reconstruction with Edge and Corner Regularization. Alexandre Boulch, Martin de La Gorce, Renaud Marlet. Computer Graphics Forum (CGF 2014), 33(5):55-64. Also in 12th Eurographics Symposium on Geometry Processing (SGP 2014).

Patchwork Stereo: Scalable, Structure-aware 3D Reconstruction in Man-made Environments. Amine Bourki, Martin de La Gorce, Renaud Marlet, Nikos Komodakis. IEEE Winter Conference on Applications of Computer Vision (WACV 2017). See also the supplementary material.

Semantic Segmentation of Urban Images, 3D Point Clouds and Meshes

Image Parsing with Graph Grammars and Markov Random Fields. Mateusz Koziński, Renaud Marlet. IEEE Winter Conference on Applications of Computer Vision (WACV 2014).

Beyond procedural facade parsing: Bidirectional alignment via linear programming. Mateusz Koziński, Guillaume Obozinski, Renaud Marlet. In 12th Asian Conference on Computer Vision (ACCV 2014). See also the supplementary material.

Efficient Facade Segmentation using Auto-Context. Varun Jampani, Raghudeep Gadde, Peter V. Gehler. IEEE Winter Conference on Applications of Computer Vision (WACV 2015). See also the project page.

A MRF Shape Prior for Facade Parsing with Occlusions. Mateusz Koziński, Raghudeep Gadde, Sergey Zagoruyko, Renaud Marlet, Guillaume Obozinski. In 28th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015). See also the supplementary material.

An Adversarial Regularisation for Semi-Supervised Training of Structured Output Neural Networks. Mateusz Koziński, Loïc Simon, Frédéric Jurie. Neural Information Processing Systems (NIPS 2017).

Learning grammars for architecture-specific facade parsing. Raghudeep Gadde, Renaud Marlet, Nikos Paragios. International Journal of Computer Vision (IJCV 2016), 117(3):290-316, March 2016. See also the used dataset.

Efficient 2D and 3D Facade Segmentation using Auto-Context. Raghudeep Gadde, Varun Jampani, Renaud Marlet, Peter Gehler. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI 2017). See also the project page and the supplementary material.

Cut Pursuit: fast algorithms to learn piecewise constant functions on general weighted graphs. Loïc Landrieu, Guillaume Obozinski. SIAM Journal on Imaging Sciences (SIIMS 2017), 10(4), 1724–1766, 2017. See also the project page.

Detect, Replace, Refine: Deep Structured Prediction For Pixel Wise Labeling. Spyros Gidaris, Nikos Komodakis. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017). See also the extended technical report and result repository.

Urban Procedural Models

Interactive Sketching of Urban Procedural Models. Gen Nishida, Ignacio Garcia-Dorado, Daniel G. Aliaga, Bedrich Benes, Adrien Bousseau. ACM Transactions on Graphics (TOG 2016). Also in SIGGRAPH Conference (SIGGRAPH 2016). See also the short project page and complete project page.

Image-Based Rendering (IBR) and Applications

Scalable Inside-Out Image-Based Rendering. Peter Hedman, Tobias Ritschel, George Drettakis, Gabriel Brostow. ACM Transactions on Graphics (TOG 2016), 35(6). Also in SIGGRAPH Asia Conference (SIGGRAPH Asia 2016). See also the short project page and complete project page.

Thin Structures in Image Based Rendering. Theo Thonat, Abdelaziz Djelouah, Frédo Durand, George Drettakis. Computer Graphics Forum (CGF 2018), 37(4), 2018. Also in Eurographics Symposium on Rendering (EGSR 2018). See also the project page.

A Bayesian Approach for Selective Image-Based Rendering using Superpixels. Rodrigo Ortiz-Cayon, Abdelaziz Djelouah, George Drettakis. International Conference on 3D Vision (3DV 2015). See also the project page.

Automatic 3D Car Model Alignment for Mixed Image-Based Rendering. Rodrigo Ortiz-Cayon, Abdelaziz Djelouah, Francisco Massa, Mathieu Aubry, George Drettakis. International Conference on 3D Vision (3DV 2016). See also the project page.

Cotemporal Multi-View Video Segmentation. Abdelaziz Djelouah, Jean-Sébastien Franco, Edmond Boyer, Patrick Pérez, George Drettakis. International Conference on 3D Vision (3DV 2016). See also the project page.

Multi-View Inpainting for Image-Based Scene Editing and Rendering. Theo Thonat, Eli Shechtman, Sylvain Paris, George Drettakis. International Conference on 3D Vision (3DV 2016). See also the project page.

Plane-Based Multi-View Inpainting for Image-Based Rendering in Large Scenes. Julien Philip, George Drettakis. ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (I3D 2018). See also the project page.

