The municipality of Amsterdam is responsible for 1,727km of roads. Stretched out in a row, this would take you from Amsterdam to the tip of Algeria, and every year these roads need inspecting and maintaining. So how should the Mobility and Public Space Department prioritise roads for manual inspections? Typically, the answer might be to work through the roads systematically, inspect them at random intervals, or rely on citizen sensing. The first two approaches are likely to be very inefficient, while citizen sensing biases the process towards the neighbourhoods with the most engaged citizens.
Meanwhile, the city also possesses thousands of images of Amsterdam roads as part of its street view images dataset. Using image recognition to detect road damage could massively expedite the manual inspection process, detecting areas of concern and guiding the Mobility Department to the most likely sites of damage more cheaply and effectively. In this blog, I summarise my research into using street-level images, trained using road-level annotations, to detect cracks in Amsterdam’s roads.
Convolutional Neural Networks
The vast majority of image recognition techniques now use Convolutional Neural Networks (CNNs)1. A CNN is a form of artificial neural network that works by applying “convolutions” to an image over multiple layers. Each convolution essentially acts as a filter, looking for certain patterns. Typically, multiple filters are applied at each layer before the image is shrunk. By the end of the network, the image has therefore been converted into a set of features, each representing certain patterns in the image. Because the filters are learned rather than hand-engineered, CNNs require less preprocessing than other image recognition algorithms and have been shown to be effective in a wide range of domains.
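To make the “convolution as a filter” idea concrete, here is a minimal pure-Python sketch. The vertical-edge kernel is hand-crafted purely for illustration; in a real CNN these weights are learned from data:

```python
def convolve2d(image, kernel):
    """Slide a kernel over an image; the response is strong wherever the
    image patch matches the pattern encoded by the kernel."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = [[0.0] * (w - kw + 1) for _ in range(h - kh + 1)]
    for i in range(len(out)):
        for j in range(len(out[0])):
            out[i][j] = sum(image[i + di][j + dj] * kernel[di][dj]
                            for di in range(kh) for dj in range(kw))
    return out

# A hand-crafted vertical-edge kernel; a CNN learns such weights itself.
edge_kernel = [[1, 0, -1],
               [1, 0, -1],
               [1, 0, -1]]

# Toy 5x5 image: dark left half, bright right half.
image = [[0, 0, 0, 1, 1] for _ in range(5)]

response = convolve2d(image, edge_kernel)
# Strong (negative) responses mark the dark-to-bright vertical edge.
```

A real network stacks many such filters per layer, shrinking the image between layers, so the final features describe progressively larger and more abstract patterns.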
One drawback of CNNs is that they typically require large amounts of training data. The best CNNs are trained using tens of thousands of labelled images. This is especially important when trying to detect “fine-grained” objects, such as road damage on a road surface, within an image of a whole street.
Unfortunately, in Amsterdam’s case, we did not have labelled images. Instead, the data we used was based on 23,919 existing manual inspections, which we then matched up to 36,447 street view images on the same road, taken within a short time of the inspection. This was not a perfect dataset. Due to the size of roads, most inspections matched multiple images, with varying road conditions. These images also contained large quantities of non-road “noise” – cars, buildings, people, etc. As a result, we needed to look at innovative ways of making our model “context-aware”, removing unnecessary information and directing the model’s attention towards relevant details.
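The matching step can be sketched as a simple join on road identifier and time. The field names and the 90-day window below are hypothetical, for illustration only:

```python
from datetime import datetime, timedelta

def match_images(inspections, images, window_days=90):
    """Pair each manual inspection with the street view images taken on
    the same road within a given time window of the inspection."""
    matches = {}
    for insp in inspections:
        matches[insp["id"]] = [
            img["id"] for img in images
            if img["road_id"] == insp["road_id"]
            and abs(img["taken"] - insp["date"]) <= timedelta(days=window_days)
        ]
    return matches

inspections = [{"id": "I1", "road_id": "R7", "date": datetime(2020, 5, 1)}]
images = [{"id": "P1", "road_id": "R7", "taken": datetime(2020, 4, 20)},
          {"id": "P2", "road_id": "R7", "taken": datetime(2019, 1, 1)},
          {"id": "P3", "road_id": "R9", "taken": datetime(2020, 5, 2)}]

# Only P1 is on the same road AND close enough in time.
print(match_images(inspections, images))  # {'I1': ['P1']}
```

Note how the one-to-many matching shows up immediately: a single inspection can pull in several panoramas, each a weak, noisy proxy for the inspected road condition.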
Context Aware Stacked CNNs
To do this, we drew on techniques from medical imaging. Researchers2 investigating breast lesion diagnostics noticed that it was possible to train effective models for tumour classification using low-resolution images of large areas of tissue, and to train models that could distinguish benign lesions from carcinoma using high-resolution images of tumour cells. However, it would be too computationally expensive to give a single model images at the resolution needed to do both tasks.
Their innovative solution was “stacked CNNs”. One CNN is trained using high-resolution images of small areas. This model is then “frozen” and applied across the larger images, outputting a heatmap of tumour types across each image. That heatmap is then fed to a second CNN, trained to recognise the structure of tumours. Together, the stacked model learns the fine-grained features of carcinoma as well as the global structure of tumours, without ever seeing a full-sized high-resolution image.
We applied a similar technique to our problem. First, a CNN is trained using known examples of road damage from the Global Road Damage Detection dataset. This CNN is then applied across each whole panorama to produce a damage heatmap, which the second-stage network uses for whole-image classification. We hoped that this technique would draw the model’s attention to examples of damage, while filtering out superfluous information.
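The first stage of the stack can be sketched without any deep-learning framework: a frozen patch scorer is slid over the panorama, and its per-patch scores form a much smaller heatmap that becomes the input to the second-stage classifier. Here `toy_score` is a stand-in for the frozen crack-detection CNN:

```python
def patch_heatmap(image, patch_score, patch=4, stride=4):
    """Apply a frozen patch classifier across a large image, producing a
    coarse heatmap of per-patch damage scores (stage one of the stack)."""
    h, w = len(image), len(image[0])
    heatmap = []
    for i in range(0, h - patch + 1, stride):
        row = []
        for j in range(0, w - patch + 1, stride):
            window = [r[j:j + patch] for r in image[i:i + patch]]
            row.append(patch_score(window))
        heatmap.append(row)
    return heatmap

# Stand-in for the frozen patch CNN: here, just mean patch intensity.
def toy_score(window):
    return sum(sum(r) for r in window) / (len(window) * len(window[0]))

# 8x8 toy panorama with a "damaged" bright region in the top-left quadrant.
panorama = [[1 if i < 4 and j < 4 else 0 for j in range(8)] for i in range(8)]
heatmap = patch_heatmap(panorama, toy_score)
# The 8x8 panorama collapses to a 2x2 heatmap for the second-stage model.
```

The key design point is the compression: the second-stage network only ever sees the low-resolution heatmap, so the full panorama never has to fit through a single high-resolution model.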
A second technique we used was semantic segmentation. Semantic segmentation uses CNNs to identify the high-level classes in images – that is, roads, cars, people, etc. We used a wholly pre-trained model3 to detect where road surfaces were in our images and to remove all non-road pixels (a Region of Interest or ROI mask). This helps prevent the model from detecting “road damage” anywhere other than the road surface, for example in cracks in buildings.
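Once the segmentation model has produced a per-pixel road mask, the ROI masking step itself is simple: every non-road pixel is zeroed out. A minimal sketch, with a hard-coded mask standing in for the segmentation model's output:

```python
def apply_roi_mask(image, road_mask):
    """Zero out every pixel the segmentation model did not label as road,
    so the classifier cannot fire on cracks in buildings or walls."""
    return [[pixel if is_road else 0
             for pixel, is_road in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, road_mask)]

# Toy single-channel image (3x3 pixel intensities).
image = [[5, 5, 5],
         [7, 7, 7],
         [9, 9, 9]]
# Per-pixel mask from the segmentation model: 1 = road surface.
road_mask = [[0, 0, 0],
             [1, 1, 0],
             [1, 1, 1]]

masked = apply_roi_mask(image, road_mask)
# Only pixels flagged as road survive; everything else becomes 0.
```

In the real pipeline the same operation is applied per colour channel across the full panorama, with the mask coming from the pre-trained panoptic segmentation model.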
The images below show what the inputs of our final models looked like. The top left is a typical panorama, while underneath is the same panorama with non-road pixels removed. On the right are the corresponding heatmaps, based on the model trained on known examples of road damage. These are just examples from one image, but they show how the different models produce different results. For example, the non-segmented heatmap highlights the bridge joint in the middle of the road, while the segmented heatmap seems to show the pavement joint on the left more prominently.
Unfortunately, our results on the final test set were quite limited. For evaluation, we used the ROC curve, which compares the true positive and false positive rates at different classification thresholds. Once the curve has been calculated, we use the area under the curve (AUC) as our evaluation metric. A perfect model has an AUC of 1, indicating that all of the true positives are identified with zero false positives. An AUC of 0.5 means the model has “no skill”: false positives increase at the same rate as true positives.
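AUC has an equivalent rank interpretation that is easy to compute directly: it is the probability that a randomly chosen damaged road receives a higher score than a randomly chosen undamaged one. A minimal sketch with toy scores:

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive example outscores a
    negative one (ties count half) - equivalent to the area under
    the ROC curve."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
# A perfect model ranks every damaged road above every undamaged one.
assert roc_auc(labels, [0.9, 0.8, 0.3, 0.1]) == 1.0
# A model that scores everything identically has "no skill".
assert roc_auc(labels, [0.5, 0.5, 0.5, 0.5]) == 0.5
```

On this scale, our best score of 0.665 (below) means a damaged road outscores an undamaged one only about two times in three.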
The best-performing model was the Stacked Network using an ROI mask, which achieved an AUC of 0.665, indicating only a small improvement over a “no skill” model. The stacked network outperformed all of the other models tested, including pre-trained CNNs without stacking, which shows some potential for the technique; however, all of the models considered showed only small improvements over a totally random approach.
What next for the Municipality of Amsterdam?
The overriding lesson of our research was the importance of labelled data. With our limited labels, it was hard both to train effective models and to evaluate their “true” effectiveness. We believe the techniques used, with some tweaking, might still prove effective for road crack detection, but this would require a significant number of images to be labelled by a human.
One possible area for expansion would be a “human-in-the-loop” system. This would mean training the model while incorporating evaluation by a domain expert: the expert is shown images that the model is unsure about and provides an informed label. In other domains, this approach has been shown to lead to big increases in effectiveness, even when the expert labels only a small number of images.
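One common way to implement the “unsure” criterion is uncertainty sampling: show the expert the images whose predicted damage probability lies closest to the decision boundary. A minimal sketch, with hypothetical model outputs:

```python
def most_uncertain(probabilities, k=2):
    """Return the indices of the k images whose predicted damage
    probability lies closest to 0.5 - the ones worth an expert's time."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]

# Predicted damage probabilities for five candidate panoramas.
probs = [0.97, 0.51, 0.03, 0.48, 0.85]
print(most_uncertain(probs))  # [1, 3] -> send these two to the inspector
```

Each expert label on a boundary case is far more informative than a label the model was already confident about, which is why small labelling budgets can still move the needle.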
If the stacked network were shown to be effective using labelled images, it could also be extended to other problems, for example different forms of road damage, or other “fine-grained” issues such as graffiti or building facade damage.
The code of the project can be found on my GitHub page.
1. Navdeep Kumar, Nirmal Kaur, and Deepti Gupta. Major Convolutional Neural Networks in Image Classification: A Survey. Lecture Notes in Networks and Systems, 116:243–258, 2020.
2. Babak Ehteshami Bejnordi et al. Context-Aware Stacked Convolutional Neural Networks for Classification of Breast Carcinomas in Whole-Slide Histopathology Images. Journal of Medical Imaging, 4(4), 2017.
3. Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. Panoptic Feature Pyramid Networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6392–6401, 2019.