Showing 1,869 changed files with 66,760 additions and 31 deletions.
@@ -1,25 +1 @@
<!-- Please make sure you are opening a pull request against the `accepted` branch (not master!) of the STAGING repo (not 2023!) -->

## OpenReview Submission Thread

<!-- link to your OpenReview submission -->

## Checklist before requesting a review

<!-- To tick a box, put an 'x' inside it (e.g. [x]) -->

- [ ] I am opening a pull request against the `accepted` branch of the `staging` repo.
- [ ] I have de-anonymized my post, added author lists, etc.
- [ ] My post matches the formatting requirements
- [ ] I have a short 2-3 sentence abstract in the `description` field of my front-matter ([example](https://github.com/iclr-blogposts/staging/blob/aa15aa3797b572e7b7bb7c8881fd350d5f76fcbd/_posts/2022-12-01-distill-example.md?plain=1#L4-L5))
- [ ] I have a table of contents, formatted using the `toc` field of my front-matter ([example](https://github.com/iclr-blogposts/staging/blob/aa15aa3797b572e7b7bb7c8881fd350d5f76fcbd/_posts/2022-12-01-distill-example.md?plain=1#L33-L42))
- [ ] My bibliography is correctly formatted, using a `.bibtex` file as per the sample post

## Changes implemented in response to reviewer feedback

- [ ] Tick this box if you received a conditional accept
- [ ] I have implemented the necessary changes in response to reviewer feedback (if any)

<!-- briefly add your changes in response to reviewer feedback -->

## Any other comments
Binary file not shown.

199 changes: 199 additions & 0 deletions
...lace-recognition-satellite/2023-11-09-dof-visual-place-recognition-satellite.md
@@ -0,0 +1,199 @@
---
layout: distill
title: 6-DOF estimation through visual place recognition
description: A neural Visual Place Recognition solution is proposed that could help an agent with a downward-facing camera (such as a drone) geolocate based on prior satellite imagery of terrain. The neural encoder infers extrinsic camera parameters from camera images, enabling estimation of 6 degrees of freedom (6-DOF), namely 3-space position and orientation. Because priors about the satellite imagery are encoded in the network weights, the agent does not need to carry a satellite-imagery dataset onboard.
date: 2023-11-09
htmlwidgets: true

# Anonymize when submitting
# authors:
#   - name: Anonymous

authors:
  - name: Andrew Feldman
    url: "https://andrew-feldman.com/"
    affiliations:
      name: MIT

# must be the exact same name as your blogpost
bibliography: 2023-11-09-dof-visual-place-recognition-satellite.bib

# Add a table of contents to your post.
#   - make sure that TOC names match the actual section names
#     for hyperlinks within the post to work correctly.
toc:
  - name: Introduction
  - name: Background
  # - name: Images and Figures
  #   subsections:
  #   - name: Interactive Figures
  - name: Proposed solution
    subsections:
    - name: Image-to-extrinsics encoder architecture
    - name: Data sources for offline training
  - name: Training and evaluation
    subsections:
    - name: Data pipeline
    - name: Training
    - name: Hyperparameters
    - name: Evaluation
  - name: Feasibility

# Below is an example of injecting additional post-specific styles.
# This is used in the 'Layouts' section of this post.
# If you use this post as a template, delete this _styles block.
_styles: >
  .fake-img {
    background: #bbb;
    border: 1px solid rgba(0, 0, 0, 0.1);
    box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1);
    margin-bottom: 12px;
  }
  .fake-img p {
    font-family: monospace;
    color: white;
    text-align: left;
    margin: 12px 0;
    text-align: center;
    font-size: 16px;
  }
---

# Introduction

The goal of this project is to demonstrate how a drone or other platform with a downward-facing camera could perform approximate geolocation through visual place recognition, using a neural scene representation of existing satellite imagery.

Visual place recognition<d-cite key="Schubert_2023"></d-cite> refers to the ability of an agent to recognize a location it has not previously visited, by cross-referencing live camera footage against some ground-truth of prior image data.

In this work, the goal is to compress that ground-truth image data into a neural model which maps live camera footage to geolocation coordinates.

Twitter user Stephan Sturges demonstrates his solution<d-cite key="Sturges_2023"></d-cite> for allowing a drone with a downward-facing camera to geolocate by cross-referencing against a database of satellite images:

<div class="row mt-3">
    <div class="col-sm mt-3 mt-md-0">
        {% include figure.html path="assets/img/2023-11-09-dof-visual-place-recognition-satellite/sturges_satellite_vpr.jpeg" class="img-fluid rounded z-depth-1" %}
    </div>
</div>
<div class="caption">
    Twitter user Stephan Sturges shows the results<d-cite key="Sturges_2023"></d-cite> of geolocation based on Visual Place Recognition.
</div>

The author of the above tweet relies on a reference database of raw images; it would be interesting to eliminate the need to carry such a dataset.

Thus, this work seeks to develop a neural network which maps a terrain image from the agent's downward-facing camera to a 6-DOF (position/rotation) representation of the agent in 3-space. Ideally the neural network is more compact than the dataset itself, although aggressive DNN compression will not be a focus of this work.

# Background

The goal statement - relating a camera image to a location and orientation in the world - has been deeply studied in computer vision and rendering<d-cite key="Anwar_2022"></d-cite>:

<div class="row mt-3">
    <div class="col-sm mt-3 mt-md-0">
        {% include figure.html path="assets/img/2023-11-09-dof-visual-place-recognition-satellite/camera_intrinsic_extrinsic.png" class="img-fluid rounded z-depth-1" %}
    </div>
</div>
<div class="caption">
    Camera parameters, as described in<d-cite key="Anwar_2022"></d-cite>.
</div>

Formally<d-cite key="Anwar_2022"></d-cite>,
* The image-formation problem is modeled as a camera forming an image of the world using a planar sensor.
* **World coordinates** refer to 3-space coordinates in the Earth or world reference frame.
* **Image coordinates** refer to 2-space planar coordinates in the camera image plane.
* **Pixel coordinates** refer to 2-space coordinates in the final image output from the image sensor, taking into account any translation or skew of pixel coordinates with respect to the image coordinates.

The mapping from world coordinates to pixel coordinates is framed as two composed transformations, each described by a set of parameters<d-cite key="Anwar_2022"></d-cite>:
* **Extrinsic camera parameters** - the transformation from world coordinates to image coordinates (affected by factors "extrinsic" to the camera internals, i.e. position and orientation).
* **Intrinsic camera parameters** - the transformation from image coordinates to pixel coordinates (affected by factors "intrinsic" to the camera's design).

Broadly speaking, then, this work strives to design a neural network that maps an image (taken by the agent's downward-facing camera) to the camera parameters of the agent's camera. With camera parameters in hand, geolocation follows directly from the translation component of the extrinsic parameters.

To simplify the task, assume that camera intrinsic characteristics are consistent from image to image, and thus could easily be calibrated out in any application use-case. Therefore, this work focuses on inferring **extrinsic camera parameters** from an image; we assume that pixels map directly into image space.

The structure of the extrinsic camera parameters is as follows<d-cite key="Anwar_2022"></d-cite>:

$$
\mathbf{E}_{4 \times 4} = \begin{bmatrix} \mathbf{R}_{3 \times 3} & \mathbf{t}_{3 \times 1} \\ \mathbf{0}_{1 \times 3} & 1 \end{bmatrix}
$$

where $$\mathbf{R}_{3 \times 3} \in \mathbb{R}^{3 \times 3}$$ is a rotation matrix representing the rotation from the world reference frame to the camera reference frame, and $$\mathbf{t}_{3 \times 1} \in \mathbb{R}^{3 \times 1}$$ is a translation vector from the world origin to the image/camera origin.

Then the image coordinates (a.k.a. camera coordinates) $$\mathbf{P_c}$$ of a world point $$\mathbf{P_w}$$ (both expressed as homogeneous coordinates) can be computed as<d-cite key="Anwar_2022"></d-cite>:

$$
\mathbf{P_c} = \mathbf{E}_{4 \times 4} \cdot \mathbf{P_w}
$$
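
To make the notation concrete, here is a minimal NumPy sketch (not from the original proposal; the rotation, translation, and world point are made-up example values) that assembles $$\mathbf{E}$$ and applies it to a homogeneous world point:

```python
import numpy as np

# Hypothetical example values: a 90-degree yaw about the world z-axis,
# and a camera translated 100 units along z.
R = np.array([[0., -1., 0.],
              [1.,  0., 0.],
              [0.,  0., 1.]])
t = np.array([0., 0., 100.])

# Assemble the 4x4 extrinsic matrix E = [[R, t], [0, 1]].
E = np.eye(4)
E[:3, :3] = R
E[:3, 3] = t

# A world point on the ground plane (z = 0), in homogeneous coordinates.
P_w = np.array([25., 10., 0., 1.])

# Camera-frame coordinates of the point.
P_c = E @ P_w
print(P_c)  # -> [-10.  25. 100.   1.]
```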

# Proposed solution

## Image-to-extrinsics encoder architecture

The goal of this work is to train a neural network which maps an image drawn from $$\mathbb{R}^{3 \times S \times S}$$ (where $$S$$ is the pixel side-length of the image) to the pair of camera extrinsic parameters $$\mathbf{R}_{3 \times 3}$$ and $$\mathbf{t}_{3 \times 1}$$:

$$
\mathbb{R}^{3 \times S \times S} \rightarrow \mathbb{R}^{3 \times 3} \times \mathbb{R}^{3}
$$

The proposed solution is a CNN-based encoder which maps the image into a length-12 vector (the flattened extrinsic parameters); a hypothetical architecture sketch is shown below:

<div class="row mt-3">
    <div class="col-sm mt-3 mt-md-0">
        {% include figure.html path="assets/img/2023-11-09-dof-visual-place-recognition-satellite/nn.svg" class="img-fluid rounded z-depth-1" %}
    </div>
</div>
<div class="caption">
    Image encoder architecture.
</div>
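
A minimal PyTorch sketch of such an encoder is given below. The layer sizes, the small convolutional stack, and the `ExtrinsicsEncoder` name are placeholder choices for illustration, not the exact architecture in the figure:

```python
import torch
import torch.nn as nn

class ExtrinsicsEncoder(nn.Module):
    """Map a (3, S, S) camera image to flattened extrinsics:
    9 rotation entries plus 3 translation entries."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # (B, 128, 1, 1)
        )
        self.head = nn.Linear(128, 12)        # 9 entries for R, 3 for t

    def forward(self, x: torch.Tensor):
        z = self.features(x).flatten(1)       # (B, 128)
        out = self.head(z)                    # (B, 12)
        R = out[:, :9].reshape(-1, 3, 3)      # estimated rotation (not orthonormalized)
        t = out[:, 9:]                        # estimated translation
        return R, t

# Example: a batch of four 128x128 images.
enc = ExtrinsicsEncoder()
R_hat, t_hat = enc(torch.randn(4, 3, 128, 128))
print(R_hat.shape, t_hat.shape)  # torch.Size([4, 3, 3]) torch.Size([4, 3])
```

Note that a raw 9-dimensional output is not guaranteed to be a valid rotation matrix; projecting onto SO(3) or predicting a quaternion are common alternatives, but the simple flattened-parameter output described in the post is kept here.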

## Data sources for offline training

Online sources<d-cite key="Geller_2022"></d-cite> provide downloadable satellite terrain images.

# Training and evaluation

The model will be trained and evaluated on aerial views of a single constrained area, e.g. Atlantic City, New Jersey; this constrained area will be referred to as the "area of interest."

## Data pipeline

The input to the data pipeline is a single aerial image of the area of interest. The output of the pipeline is a dataloader which generates augmented images.

The image of the area of interest is $$\mathbb{R}^{3 \times T \times T}$$, where $$T$$ is the image side-length in pixels.

Camera images will be of the form $$\mathbb{R}^{3 \times S \times S}$$, where $$S$$ is the image side-length in pixels, which may differ from $$T$$.

* **Generate an image from the agent camera's vantage-point**
    * Convert the area-of-interest image tensor ($$\mathbb{R}^{3 \times T \times T}$$) to a matrix of homogeneous world coordinates ($$\mathbb{R}^{pixels \times 4}$$) and an associated matrix of RGB values for each point ($$\mathbb{R}^{pixels \times 3}$$)
        * For simplicity, assume that all features in the image have an altitude of zero
        * Thus, all of the pixel world coordinates will lie in a plane
    * Generate random extrinsic camera parameters $$\mathbf{R}_{3 \times 3}$$ and $$\mathbf{t}_{3 \times 1}$$
    * Transform the world coordinates into image coordinates ($$\mathbb{R}^{pixels \times 3}$$) (note, this does not affect the RGB matrix)
        * Note - this implicitly accomplishes commonly-used image augmentations such as shrink/expand, crop, rotate, and skew
* **Additional data augmentation** - to prevent overfitting
    * Added noise
    * Color/brightness adjustment
    * TBD
* **Convert the image coordinates and the RGB matrix into a camera image tensor ($$\mathbb{R}^{3 \times S \times S}$$)**

Each element of a batch from this dataloader will be a tuple of (extrinsic parameters, camera image); a rough sketch of this projection pipeline follows.
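
The sketch below is illustrative only: the function names, sampling ranges, output sizes, nadir-only camera orientation, and the crude nearest-neighbor rasterization are placeholder assumptions, and the noise/color augmentations are omitted.

```python
import numpy as np

def image_to_world(area_img: np.ndarray):
    """Flatten a (3, T, T) area-of-interest image into homogeneous world
    coordinates (N, 4) on the z=0 plane, plus an RGB matrix (N, 3)."""
    _, T, _ = area_img.shape
    ys, xs = np.meshgrid(np.arange(T), np.arange(T), indexing="ij")
    world = np.stack([xs.ravel(), ys.ravel(),
                      np.zeros(T * T), np.ones(T * T)], axis=1)
    rgb = area_img.reshape(3, -1).T
    return world, rgb

def random_extrinsics(rng: np.random.Generator) -> np.ndarray:
    """Sample extrinsics for a roughly nadir (downward-facing) camera: random
    yaw about z plus a random offset/altitude. Distributions are placeholders,
    and axis conventions / off-nadir tilt are glossed over."""
    yaw = rng.uniform(0.0, 2.0 * np.pi)
    R = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                  [np.sin(yaw),  np.cos(yaw), 0.0],
                  [0.0,          0.0,         1.0]])
    t = np.array([rng.uniform(-10, 10), rng.uniform(-10, 10), rng.uniform(100, 300)])
    E = np.eye(4)
    E[:3, :3], E[:3, 3] = R, t
    return E

def render_view(world, rgb, E, S=128):
    """Project world points through E and splat their RGB values into a
    (3, S, S) image via nearest-neighbor rasterization. Identity intrinsics,
    no occlusion handling - a crude stand-in for a real renderer."""
    cam = (E @ world.T).T                                      # (N, 4) camera-frame coords
    uv = cam[:, :2] / cam[:, 2:3]                              # perspective divide
    uv = (uv - uv.min(axis=0)) / (np.ptp(uv, axis=0) + 1e-9)   # fit into the pixel grid
    px = (uv * (S - 1)).round().astype(int)
    img = np.zeros((3, S, S))
    img[:, px[:, 1], px[:, 0]] = rgb.T
    return img

# Example: one (extrinsic parameters, camera image) sample.
rng = np.random.default_rng(0)
area = rng.random((3, 64, 64))                 # stand-in for the satellite image
world, rgb = image_to_world(area)
E = random_extrinsics(rng)
view = render_view(world, rgb, E)              # (3, 128, 128)
sample = (E[:3, :], view)
```

A torch `Dataset` wrapping these functions would then yield the (extrinsic parameters, camera image) tuples described above.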

## Training

* For each epoch, and each mini-batch...
    * Unpack the batch elements into camera images and ground-truth extrinsic parameters
    * Apply the encoder to the camera images
    * Loss: MSE between the encoder's estimates of the extrinsic parameters and the ground-truth values (see the training-loop sketch below)
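
A minimal PyTorch training-loop sketch, reusing the hypothetical `ExtrinsicsEncoder` from the earlier section; the batch format (a `(B, 3, 4)` extrinsics tensor paired with images), learning rate, and epoch count are placeholder assumptions:

```python
import torch
import torch.nn.functional as F

def train(model, dataloader, epochs: int = 10, lr: float = 1e-3, device: str = "cpu"):
    """Train the encoder with MSE on the flattened extrinsic parameters."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        running = 0.0
        for extrinsics, images in dataloader:            # (B, 3, 4), (B, 3, S, S)
            extrinsics, images = extrinsics.to(device), images.to(device)
            R_true, t_true = extrinsics[:, :, :3], extrinsics[:, :, 3]
            R_hat, t_hat = model(images)
            loss = F.mse_loss(R_hat, R_true) + F.mse_loss(t_hat, t_true)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            running += loss.item()
        print(f"epoch {epoch}: mean batch loss = {running / len(dataloader):.4f}")
```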

## Hyperparameters
* Architecture
    * Encoder architecture - CNN vs. MLP vs. ViT(?) vs. ..., number of layers, ...
    * Output normalizations
    * Nonlinearities - ReLU, tanh, ...
* Learning rate
* Optimizer - Adam, etc.
* Regularizations - dropout, L1, L2, ...
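
One way to keep these choices in one place is a small config object; the particular fields and defaults below are illustrative placeholders only:

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Placeholder hyperparameter choices for the experiments sketched above.
    encoder: str = "cnn"          # "cnn", "mlp", "vit", ...
    num_layers: int = 3
    nonlinearity: str = "relu"    # "relu", "tanh", ...
    learning_rate: float = 1e-3
    optimizer: str = "adam"
    dropout: float = 0.0
    weight_decay: float = 0.0     # L2 regularization
```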

## Evaluation

For a single evaluation epoch, measure the total MSE of the model's extrinsic-parameter estimates relative to the ground truth (a short sketch follows).
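
A corresponding evaluation sketch, under the same placeholder batch format as the training loop above:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def evaluate(model, dataloader, device: str = "cpu") -> float:
    """Total MSE of the extrinsic-parameter estimates over one epoch."""
    model.eval()
    total = 0.0
    for extrinsics, images in dataloader:
        extrinsics, images = extrinsics.to(device), images.to(device)
        R_true, t_true = extrinsics[:, :, :3], extrinsics[:, :, 3]
        R_hat, t_hat = model(images)
        total += F.mse_loss(R_hat, R_true, reduction="sum").item()
        total += F.mse_loss(t_hat, t_true, reduction="sum").item()
    return total
```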

# Feasibility

Note that I am concurrently taking 6.s980, "Machine learning for inverse graphics," so I already have background in working with camera parameters, which should help me complete this project on time.
42 changes: 42 additions & 0 deletions
...ition-satellite/assets/bibliography/2023-11-09-dof-visual-place-recognition-satellite.bib
@@ -0,0 +1,42 @@
@article{Schubert_2023,
  title     = {Visual Place Recognition: A Tutorial},
  author    = {Stefan Schubert and Peer Neubert and Sourav Garg and Michael Milford and Tobias Fischer},
  journal   = {{IEEE} Robotics \& Automation Magazine},
  publisher = {Institute of Electrical and Electronics Engineers ({IEEE})},
  year      = 2023,
  pages     = {2--16},
  doi       = {10.1109/mra.2023.3310859},
  url       = {https://doi.org/10.1109%2Fmra.2023.3310859}
}

@misc{Sturges_2023,
  title     = {Demo: One minute of UAV visual navigation from satellite imagery, where GPS coordinates are derived by comparing an image from a downward-facing camera to a prior satellite image},
  author    = {Sturges, Stephan},
  publisher = {Twitter},
  year      = {2023},
  month     = {Nov},
  url       = {https://twitter.com/StephanSturges/status/1722236377990082632?t=1Yf69YRUIkydn8lu0g8Epw}
}

@misc{Anwar_2022,
  title   = {What are Intrinsic and Extrinsic Camera Parameters in Computer Vision?},
  author  = {Anwar, Aqeel},
  journal = {Medium},
  year    = {2022},
  month   = {Feb},
  url     = {https://towardsdatascience.com/what-are-intrinsic-and-extrinsic-camera-parameters-in-computer-vision-7071b72fb8ec}
}

@misc{Geller_2022,
  title     = {Downloading satellite images made ``easy''},
  author    = {Geller, Aaron},
  journal   = {Research Computing and Data Services Updates},
  publisher = {Northwestern University},
  year      = {2022},
  month     = {Nov},
  url       = {https://sites.northwestern.edu/researchcomputing/2021/11/19/downloading-satellite-images-made-easy/}
}

@misc{Taylor_2020,
  title     = {New Jersey: Images of the Garden State},
  author    = {Taylor, Alan},
  journal   = {The Atlantic},
  publisher = {Atlantic Media Company},
  year      = {2020},
  month     = {Aug},
  url       = {https://www.theatlantic.com/photo/2020/08/new-jersey-photos/614872/}
}

@misc{sitzmann2020implicit,
  title         = {Implicit Neural Representations with Periodic Activation Functions},
  author        = {Vincent Sitzmann and Julien N. P. Martel and Alexander W. Bergman and David B. Lindell and Gordon Wetzstein},
  year          = {2020},
  eprint        = {2006.09661},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}

@misc{xiang2018posecnn,
  title         = {PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes},
  author        = {Yu Xiang and Tanner Schmidt and Venkatraman Narayanan and Dieter Fox},
  year          = {2018},
  eprint        = {1711.00199},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}
Binary file added
BIN +408 KB
...023-11-09-dof-visual-place-recognition-satellite/camera_intrinsic_extrinsic.png