In 2016, Phillip Isola, Jun-Yan Zhu, Tinghui Zhou and Alexei A. Efros proposed a general method for solving a wide range of image-to-image translation tasks. Their paper, commonly known as pix2pix, introduced a pipeline based on Conditional Generative Adversarial Networks (cGANs). Among the tasks they solved are generating street scenes from semantic labels, generating building facades from colorful sketches, and generating maps from satellite images. The latter task inspired me to create Sketch2Map: generating satellite images from simple colorful sketches.
I used the same architecture as the pix2pix authors propose: two deep neural networks that play the adversarial game introduced by Goodfellow et al. in 2014. The first model (the Generator) is based on the classic U-Net architecture originally designed for cell segmentation. The second model (the Discriminator) uses a PatchGAN approach, classifying each image patch as real or fake rather than judging the image as a whole. Both implementations were inspired by the following tutorial.
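To make the Discriminator side concrete, below is a minimal PyTorch-style sketch of a 70×70 PatchGAN conditioned on the input sketch. It is an illustrative reconstruction of the idea rather than the exact code used in this project; the layer sizes follow the defaults from the pix2pix paper.

```python
import torch
import torch.nn as nn

class PatchGANDiscriminator(nn.Module):
    """Minimal 70x70 PatchGAN: outputs a grid of real/fake logits,
    one per overlapping patch, instead of a single score per image."""

    def __init__(self, in_channels=6):  # sketch (3) + satellite image (3), concatenated
        super().__init__()

        def block(c_in, c_out, stride=2, norm=True):
            layers = [nn.Conv2d(c_in, c_out, kernel_size=4, stride=stride, padding=1)]
            if norm:
                layers.append(nn.BatchNorm2d(c_out))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return layers

        self.model = nn.Sequential(
            *block(in_channels, 64, norm=False),
            *block(64, 128),
            *block(128, 256),
            *block(256, 512, stride=1),
            nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1),  # patch-wise logits
        )

    def forward(self, sketch, image):
        # Conditional discriminator: judge the (sketch, image) pair jointly.
        return self.model(torch.cat([sketch, image], dim=1))
```

For a 256×256 input pair this produces roughly a 30×30 grid of logits, each with a 70×70 receptive field, which is what pushes the discriminator to focus on local texture rather than global structure.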
My contributions are the following:
- Created a dataset consisting of semantically segmented satellite-view elements (houses, trees, roads, etc.)
- Generated realistic satellite images
That is it. Unfortunately, it is quite difficult to compare my work to other methods for generating satellite imagery, since image generation is mostly judged by whether the images can trick humans into believing they are real, not by how well machines can distinguish them.
By drawing a sequence of similar sketches and feeding them to the model one after another, an animation can be produced. The following animation consists of 61 frames of hand-drawn sketches:
Below you can see the structure of one of these input frames (the dataset is described in detail below):
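As a rough illustration of how this frame-by-frame generation can be stitched together, the snippet below runs each sketch through a trained generator and saves the outputs as a GIF. The `generator` variable, file paths, and frame duration are assumptions for the sake of the example, not my exact script.

```python
import glob

import imageio.v2 as imageio
import numpy as np
import torch

frames = []
generator.eval()  # trained pix2pix generator, assumed to be defined elsewhere
with torch.no_grad():
    for path in sorted(glob.glob("sketch_frames/*.png")):  # the hand-drawn sketch frames
        sketch = imageio.imread(path)[..., :3].astype(np.float32) / 127.5 - 1.0  # scale to [-1, 1]
        sketch = torch.from_numpy(sketch).permute(2, 0, 1).unsqueeze(0)          # HWC -> NCHW
        fake = generator(sketch)[0].permute(1, 2, 0).numpy()                     # NCHW -> HWC
        frames.append(((fake + 1.0) * 127.5).clip(0, 255).astype(np.uint8))

# Stitch the generated frames into a GIF (duration is seconds per frame).
imageio.mimsave("animation.gif", frames, duration=0.1)
```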
The dataset consists of 125 semantically segmented Google Earth images. The pictures were taken around Vancouver, Canada (residential and downtown areas) at a fixed zoom level of 141 feet (~43 meters). Each image is 1200×800 pixels.
The data is not publicly available. For access to the data, email me at [email protected] and state the purpose for which you intend to use it.
In the following table, all classes considered for the segmentation are listed together with their color code (in hexadecimal) from the mask.
| Class | Color (hex) |
|---|---|
| Background | #0000de |
| Car | #01cfff |
| Bus | #eeff00 |
| Van | #fd0002 |
| Truck | #654321 |
| Train | #808040 |
| House (residential home) | #ae0001 |
| Commercial building | #ed4dff |
| Garage | #98839a |
| Road | #ff9f00 |
| Train track | #aea190 |
| Sidewalk/Pathway | #ffb1a6 |
| Foliage (trees, bushes) | #bffd45 |
| Grass | #217717 |
| Water | #a349f1 |
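As a side note, here is a small sketch of how a mask using these hex codes can be converted into a class-index map for training or inspection; the class ordering and helper names are my own illustration, not part of the released dataset.

```python
import numpy as np
from PIL import Image

# Hex color -> class name, taken from the table above.
PALETTE = {
    "#0000de": "background",
    "#01cfff": "car",
    "#eeff00": "bus",
    "#fd0002": "van",
    "#654321": "truck",
    "#808040": "train",
    "#ae0001": "house",
    "#ed4dff": "commercial_building",
    "#98839a": "garage",
    "#ff9f00": "road",
    "#aea190": "train_track",
    "#ffb1a6": "sidewalk",
    "#bffd45": "foliage",
    "#217717": "grass",
    "#a349f1": "water",
}

def hex_to_rgb(code):
    code = code.lstrip("#")
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))

def mask_to_class_indices(mask_path):
    """Convert an RGB mask into an (H, W) array of class indices."""
    rgb = np.array(Image.open(mask_path).convert("RGB"))
    indices = np.zeros(rgb.shape[:2], dtype=np.uint8)
    for idx, code in enumerate(PALETTE):
        indices[np.all(rgb == hex_to_rgb(code), axis=-1)] = idx
    return indices
```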
Training on 120 images (with 5 images held out for validation) required around 400 epochs to produce "satisfying" images.
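For context, the pix2pix objective combines an adversarial term with an L1 reconstruction term weighted by λ = 100. A schematic PyTorch-style training step is sketched below; the optimizer settings follow the original paper, and the `generator`/`discriminator` objects are assumed to be defined elsewhere (e.g. as in the PatchGAN sketch above), so this is not necessarily my exact training loop.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # adversarial loss on the PatchGAN logits
l1 = nn.L1Loss()              # reconstruction term
LAMBDA = 100                  # L1 weight from the pix2pix paper

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))

def train_step(sketch, real_image):
    # --- Discriminator: real pairs -> 1, generated pairs -> 0 ---
    fake_image = generator(sketch)
    d_real = discriminator(sketch, real_image)
    d_fake = discriminator(sketch, fake_image.detach())
    d_loss = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # --- Generator: fool the discriminator while staying close to the target ---
    d_fake = discriminator(sketch, fake_image)
    g_loss = bce(d_fake, torch.ones_like(d_fake)) + LAMBDA * l1(fake_image, real_image)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

    return d_loss.item(), g_loss.item()
```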
ControlNet is a neural network architecture that adds finer control over image generation with Stable Diffusion. It consists of two copies of the pretrained network: a locked copy that preserves the original weights, and a trainable copy that is fine-tuned on custom data to solve a particular task. In our case, we trained it to generate satellite images from the sketches. Some of the results are displayed below:
There is a clear improvement in the quality of the results with the new technique. Some artifacts are still visible, such as flickering parked cars and road lines between frames. Training on more images is expected to reduce these artifacts.
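For readers who want to experiment with this step, inference with a fine-tuned ControlNet can be run through the `diffusers` library roughly as sketched below. The checkpoint path, base model ID, input file and prompt are placeholders, not the exact configuration used here.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Hypothetical checkpoint for a ControlNet fine-tuned on the sketch/satellite pairs.
controlnet = ControlNetModel.from_pretrained("path/to/sketch2map-controlnet", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # base Stable Diffusion weights (assumed)
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

sketch = load_image("sketch_frames/frame_000.png")  # the colorful sketch as conditioning
result = pipe(
    "aerial satellite photograph of a residential neighborhood",
    image=sketch,
    num_inference_steps=30,
).images[0]
result.save("generated_satellite.png")
```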