diff --git a/tutorials/Training_and_Inference_tutorial.ipynb b/tutorials/Training_and_Inference_tutorial.ipynb index f0f0afccfc..51f1234bb1 100644 --- a/tutorials/Training_and_Inference_tutorial.ipynb +++ b/tutorials/Training_and_Inference_tutorial.ipynb @@ -33,7 +33,7 @@ { "cell_type": "markdown", "source": [ - "# Training GaMPEN \n", + "# Training GaMPEN\n", "\n", "In this Jupyter Notebook, we will demonstrate how you can train a GaMPEN model from scratch and perform inference with it on galaxy images. For an extensive documentation on GaMPEN, please refer to https://gampen.readthedocs.io/en/latest/index.html\n", "\n", @@ -68,12 +68,242 @@ { "cell_type": "markdown", "source": [ - "### Installing GaMPEN\n", + "### Colab-Specific Steps\n", "\n", - "First, let's install GaMPEN. \n", + "We want to make some changes to Colab's default environment. So, let's use miniconda to create our own custom environment.\n", "\n", "Some of these commands are specifically for Google Colab. If doing this on your own machine, please follow the steps outlined [here](https://gampen.readthedocs.io/en/latest/Getting_Started.html)" ], + "metadata": { + "id": "Ig3b2OBGZ-n2" + } + }, + { + "cell_type": "code", + "source": [ + "!pip install virtualenv\n", + "!virtualenv myenv\n", + "!wget https://repo.anaconda.com/miniconda/Miniconda3-py37_23.1.0-1-Linux-x86_64.sh\n", + "!chmod +x Miniconda3-py37_23.1.0-1-Linux-x86_64.sh\n", + "!./Miniconda3-py37_23.1.0-1-Linux-x86_64.sh -b -f -p /usr/local" + ], + "metadata": { + "id": "d8FcOYqBZ-Nd", + "outputId": "1b90fa0d-476b-48b2-8263-2050151639d5", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Collecting virtualenv\n", + " Downloading virtualenv-20.26.2-py3-none-any.whl (3.9 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.9/3.9 MB\u001b[0m \u001b[31m14.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCollecting distlib<1,>=0.3.7 (from virtualenv)\n", + " Downloading distlib-0.3.8-py2.py3-none-any.whl (468 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m468.9/468.9 kB\u001b[0m \u001b[31m23.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hRequirement already satisfied: filelock<4,>=3.12.2 in /usr/local/lib/python3.10/dist-packages (from virtualenv) (3.14.0)\n", + "Requirement already satisfied: platformdirs<5,>=3.9.1 in /usr/local/lib/python3.10/dist-packages (from virtualenv) (4.2.1)\n", + "Installing collected packages: distlib, virtualenv\n", + "Successfully installed distlib-0.3.8 virtualenv-20.26.2\n", + "created virtual environment CPython3.10.12.final.0-64 in 1219ms\n", + " creator CPython3Posix(dest=/content/myenv, clear=False, no_vcs_ignore=False, global=False)\n", + " seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)\n", + " added seed packages: pip==24.0, setuptools==69.5.1, wheel==0.43.0\n", + " activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator\n", + "--2024-05-16 03:20:08-- https://repo.anaconda.com/miniconda/Miniconda3-py37_23.1.0-1-Linux-x86_64.sh\n", + "Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.191.158, 104.16.32.241, 2606:4700::6810:bf9e, ...\n", + "Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.191.158|:443... 
connected.\n", + "HTTP request sent, awaiting response... 200 OK\n", + "Length: 90665082 (86M) [application/x-sh]\n", + "Saving to: ‘Miniconda3-py37_23.1.0-1-Linux-x86_64.sh’\n", + "\n", + "Miniconda3-py37_23. 100%[===================>] 86.46M 160MB/s in 0.5s \n", + "\n", + "2024-05-16 03:20:09 (160 MB/s) - ‘Miniconda3-py37_23.1.0-1-Linux-x86_64.sh’ saved [90665082/90665082]\n", + "\n", + "PREFIX=/usr/local\n", + "Unpacking payload ...\n", + " \n", + "Installing base environment...\n", + "\n", + "\n", + "Downloading and Extracting Packages\n", + "\n", + "\n", + "Downloading and Extracting Packages\n", + "\n", + "Preparing transaction: - \b\b\\ \b\b| \b\bdone\n", + "Executing transaction: - \b\b\\ \b\b| \b\b/ \b\b- \b\b\\ \b\b| \b\b/ \b\b- \b\b\\ \b\b| \b\b/ \b\b- \b\b\\ \b\b| \b\b/ \b\b- \b\b\\ \b\b| \b\b/ \b\bdone\n", + "installation finished.\n", + "WARNING:\n", + " You currently have a PYTHONPATH environment variable set. This may cause\n", + " unexpected behavior when running the Python interpreter in Miniconda3.\n", + " For best results, please verify that your PYTHONPATH only points to\n", + " directories of packages that are compatible with the Python interpreter\n", + " in Miniconda3: /usr/local\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "!conda install -q -y --prefix /usr/local python=3.7 ujson" + ], + "metadata": { + "id": "wFkANdj9ageI", + "outputId": "b0c91cee-632f-4fa0-c914-60055eb6a4c1", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Collecting package metadata (current_repodata.json): ...working... done\n", + "Solving environment: ...working... done\n", + "\n", + "## Package Plan ##\n", + "\n", + " environment location: /usr/local\n", + "\n", + " added / updated specs:\n", + " - python=3.7\n", + " - ujson\n", + "\n", + "\n", + "The following packages will be downloaded:\n", + "\n", + " package | build\n", + " ---------------------------|-----------------\n", + " ca-certificates-2024.3.11 | h06a4308_0 127 KB\n", + " openssl-1.1.1w | h7f8727e_0 3.7 MB\n", + " ujson-5.4.0 | py37h6a678d5_0 44 KB\n", + " ------------------------------------------------------------\n", + " Total: 3.9 MB\n", + "\n", + "The following NEW packages will be INSTALLED:\n", + "\n", + " ujson pkgs/main/linux-64::ujson-5.4.0-py37h6a678d5_0 \n", + "\n", + "The following packages will be UPDATED:\n", + "\n", + " ca-certificates 2023.01.10-h06a4308_0 --> 2024.3.11-h06a4308_0 \n", + " openssl 1.1.1s-h7f8727e_0 --> 1.1.1w-h7f8727e_0 \n", + "\n", + "\n", + "Preparing transaction: ...working... done\n", + "Verifying transaction: ...working... done\n", + "Executing transaction: ...working... 
done\n" ] } ] }, { "cell_type": "code", "source": [ "import sys\n", "sys.path.append('/usr/local/lib/python3.7/site-packages/')" ], "metadata": { "id": "USBm4kXcanWm" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# activate conda environment\n", "import os\n", "os.environ['CONDA_PREFIX'] = '/usr/local/envs/myenv'" ], "metadata": { "id": "qfksN1jxanTl" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "!python --version" ], "metadata": { "id": "26OUWq5ianIQ", "outputId": "65f94a86-eb42-4f1b-8f78-cf49986969a1", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Python 3.7.16\n" ] } ] }, { "cell_type": "code", "source": [ "!pip install h5py\n", "!pip install typing-extensions\n", "!pip install wheel" ], "metadata": { "id": "Kh8MhdADauQ6", "outputId": "1c0bf9e1-1aea-4e61-9f96-e4d414b15e6d", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Collecting h5py\n", " Downloading h5py-3.8.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.3 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.3/4.3 MB\u001b[0m \u001b[31m55.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hCollecting numpy>=1.14.5\n", " Downloading numpy-1.21.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m15.7/15.7 MB\u001b[0m \u001b[31m79.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hInstalling collected packages: numpy, h5py\n", "Successfully installed h5py-3.8.0 numpy-1.21.6\n", "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", "\u001b[0mRequirement already satisfied: typing-extensions in /usr/local/lib/python3.7/site-packages (4.4.0)\n", "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", "\u001b[0mRequirement already satisfied: wheel in /usr/local/lib/python3.7/site-packages (0.37.1)\n", "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", "\u001b[0m" ] } ] }, { "cell_type": "markdown", "source": [ "### Installing GaMPEN\n", "\n", "First, let's install GaMPEN.\n", "\n", "Some of these commands are specifically for Google Colab. 
If doing this on your own machine, please follow the steps outlined [here](https://gampen.readthedocs.io/en/latest/Getting_Started.html)\n", "\n", "**WARNING ⚠: If this cell triggers a \"Restart Session\" suggestion from Colab, remember to re-run this cell after restarting the session.**" ], "metadata": { "id": "68DIPdkdhXCE" } } @@ -112,7 +342,7 @@ "source": [ "%cd /content/GaMPEN/\n", "\n", - "!pip install -r requirements.txt " + "!pip install -r requirements.txt" ], "metadata": { "colab": { @@ -651,9 +881,9 @@ { "cell_type": "markdown", "source": [ - "As long as the tests do not produce any errors, you are good to go! \n", + "As long as the tests do not produce any errors, you are good to go!\n", "\n", - "Note that warnings and and tests being skipped are ok! " + "Note that warnings and tests being skipped are ok!" ], "metadata": { "id": "a3rNg_EYfcdO" } }, @@ -664,9 +894,9 @@ { "cell_type": "markdown", "source": [ "## Getting Images & Training Data\n", "\n", - "Tto train a GaMPEN model from scratch, we need both images as well as training labels. \n", + "To train a GaMPEN model from scratch, we need both images as well as training labels.\n", "\n", - "*In* order for GaMPEN to detect training images and associated training labels, it requires a target data directory with a specific structure as outlined below. \n", + "In order for GaMPEN to detect training images and associated training labels, it requires a target data directory with a specific structure, as outlined below.\n", "\n", "```\n", "- data_directory\n", @@ -677,7 +907,7 @@ "\n", " * a unique identifier for each image\n", " * the filename for each image\n", - " * target labels/variables for every image to be used for training. \n", + " * target labels/variables for every image to be used for training.\n", "\n", "For this tutorial, let's use some simulated Hyper Suprime-Cam images that can be downloaded from the Yale FTP servers. To do so, we will use the pre-defined `make hsc_demo` command from the root `GaMPEN` directory" ], "metadata": { @@ -1465,7 +1695,7 @@ { "cell_type": "markdown", "source": [ - "As the code block below shows we have now created a data-directory named `hsc/` with an `info.csv` file and a `cutouts/` folder within `hsc/` that contains all our images. " + "As the code block below shows, we have now created a data directory named `hsc/` with an `info.csv` file and a `cutouts/` folder within `hsc/` that contains all our images." ], "metadata": { "id": "t6aNqfD4fuEq" } }, @@ -1832,12 +2062,12 @@ { "cell_type": "markdown", "source": [ - "As can be seen, `info.csv` has information about all 67 images in the cutouts folder. The `bt`, `R_e`, and `total_flux` columns refer to the bulge-to-total light ratio, effective radius, and flux of the simulated images. We also have columns containing the logit or log transforms of these quantities. For the purposes of this demo, we will use `custom_logit_bt`, `ln_R_e`, and `ln_total_flux` as the three variables we are trying to train the GaMPEN model to predict.\n",
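"\n", "For orientation, here is a minimal sketch of how such transformed columns could be generated (this assumes `numpy` and `pandas`; the exact code used to produce the demo `info.csv` may differ):\n", "\n", "```python\n", "import numpy as np\n", "import pandas as pd\n", "\n", "df = pd.read_csv('hsc/info.csv')\n", "df['ln_R_e'] = np.log(df['R_e'])  # natural log of the effective radius\n", "df['ln_total_flux'] = np.log(df['total_flux'])  # natural log of the total flux\n", "df['custom_logit_bt'] = logit_custom(df['bt'])  # custom logit, defined below\n", "```\n",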
"\n", "**⚠ NOTE:** The `ln_R_e` and `ln_total_flux` are simply the `log_e` transformations applied to the effective radius and flux. The `custom_logit_bt` is a logit transformation applied to the bulge-to-total light ratio. The only way this is different from the standard logit transformation is that we prevent the function from blowing up for values close to zero or one. The custom function we use is mentioned below and also in the [GaMPEN Public Data Release Handbook](https://gampen.readthedocs.io/en/latest/Public_data.html#custom-scaling-function)\n", "\n", "```python\n", - "from scipy.special import logit \n", + "import numpy as np\n", + "from scipy.special import logit\n", "\n", "def logit_custom(x_input):\n", "    \n", "    '''Handling zeros and ones while doing a\n", "    logit transformation\n", "    \n", "    x_input should be the entire column/array\n", - "    in info.csv over which you are applying \n", + "    in info.csv over which you are applying\n", "    the transformation'''\n", "    \n", "    x = np.array(x_input)\n", @@ -1869,7 +2099,7 @@ "\n", "\n", "\n", - "**⚠ NOTE:** In order for GaMPEN to work correctly, you need to name the columns `object_id` and `file_name` exactly as it is done here. You can have as many additional/redundant columns in the `.csv` as you want to. But without these two, GaMPEN won't work properly. " + "**⚠ NOTE:** In order for GaMPEN to work correctly, you need to name the columns `object_id` and `file_name` exactly as is done here. You can have as many additional/redundant columns in the `.csv` as you want. But without these two, GaMPEN won't work properly." ], "metadata": { "id": "jZ9O0WjJjUMx" } }, @@ -1983,11 +2213,11 @@ "\n", "### Splitting the dataset\n", "\n", - "We first use the `make_splits` module of GaMPEN to split the dataset into training, testing and devel (validation) sets. \n", + "We first use the `make_splits` module of GaMPEN to split the dataset into training, testing, and devel (validation) sets.\n", "\n", "The two arguments for `make_splits` are:\n", "* `data_dir`: This should point to the data directory with `info.csv`\n", - "* `target_matric`: `make_splits` splits the data-set into training/test/devel into unbalanced splits, which randomly picks images for each split. These are called `unbalanced` splits. It also creates some `balanced` splits, where it first splits the dataset into 4 partitions based on the `target_metric` variable; and then draws samples such that the samples used for trianing are balanced across these 4 partitions. " + "* `target_metric`: `make_splits` splits the dataset into training/test/devel sets by randomly picking images for each split. These are called `unbalanced` splits. It also creates some `balanced` splits, where it first splits the dataset into 4 partitions based on the `target_metric` variable, and then draws samples such that the samples used for training are balanced across these 4 partitions. An example invocation is sketched below.\n",
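"\n", "A minimal sketch of what running the module looks like (the module path and flag spellings here are assumptions -- check the module's `--help` output; the actual cell used in this notebook follows below):\n", "\n", "```bash\n", "python /content/GaMPEN/ggt/data/make_splits.py --data_dir=/content/GaMPEN/data/hsc/ --target_metric=bt\n", "```"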
 ], "metadata": { "id": "nwg46dyps3Li" } @@ -2053,7 +2283,7 @@ { "cell_type": "markdown", "source": [ - "As explained above, the `balanced` files refer to the files with balanced splits and the `unbalanced` files refer to the files with unbalanced splits. \n", + "As explained above, the `balanced` files refer to the files with balanced splits and the `unbalanced` files refer to the files with unbalanced splits.\n", "\n", "As you can see there are also some labels (`dev2`,`dev`,`lg`,`md`,`sm`,`xl`,`xs`) assigned to the different files. These indicate a set of different kinds of splits where different fractions of data have been assigned to the train/test/devel splits.\n", "\n", "You can change these or define your own splits. Simply alter the `split_types` dictionary at the top of the `make_splits.py` file.\n", "\n", - "Finally, each split has a train,test, and devel (validation) portion. \n", + "Finally, each split has a train, test, and devel (validation) portion.\n", "\n", - "Let's choose the `balanced-dev2` split for ourwork. As can be seen below the training file is simply a copy of the `info.csv` file with only the selected galaxies. \n", + "Let's choose the `balanced-dev2` split for our work. As can be seen below, the training file is simply a copy of the `info.csv` file with only the selected galaxies.\n", "\n", @@ -2891,7 +3121,7 @@ "\n", "Now, in order to train GaMPEN, we have to use the `train` module. There are lots of different arguments that can be set for the training module.\n", "\n", - "**⚠ NOTE: We strongly recommend that while this model trains, you go and read what these arguments refer to [on this page](https://gampen.readthedocs.io/en/latest/Using_GaMPEN.html#running-the-trainer)** Additionally, you can also run \n", + "**⚠ NOTE: We strongly recommend that while this model trains, you go and read what these arguments refer to [on this page](https://gampen.readthedocs.io/en/latest/Using_GaMPEN.html#running-the-trainer)** Additionally, you can also run\n", "```bash\n", "python /content/GaMPEN/ggt/train/train.py --help\n", "```\n",
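"\n", "To give a flavor of what a full training command looks like, here is an illustrative sketch only -- the flag names are assumptions based on the parameters discussed in this tutorial, so verify them against the `--help` output (the notebook's actual training cell follows below):\n", "\n", "```bash\n", "python /content/GaMPEN/ggt/train/train.py --experiment_name=demo --data_dir=/content/GaMPEN/data/hsc/ --split_slug=balanced-dev2 --label_cols=custom_logit_bt,ln_R_e,ln_total_flux --epochs=5\n", "```"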
 ], "metadata": { @@ -3085,7 +3315,7 @@ { "cell_type": "markdown", "source": [ - "### Monitoring Training Metrics \n", + "### Monitoring Training Metrics\n", "\n", "In order to make it easy for you to keep track of models during and after training, GaMPEN uses [MLFlow](https://www.mlflow.org/docs/latest/tracking.html)." ], "metadata": { @@ -3099,13 +3329,13 @@ "\n", "#### MLFlow UI\n", "\n", - "ML Flow provides a visual interface where you can track your models being trained and also compare models after training. \n", + "MLFlow provides a visual interface where you can track your models being trained and also compare models after training.\n", "\n", - "This is fairly easy when training GaMPEN on your local machine / a cluster / the cloud as after initializing ML Flow, all you need to do is to navigate to `localhost:5000` of the machine where you are running MLFlow. \n", + "This is fairly easy when training GaMPEN on your local machine / a cluster / the cloud: after initializing MLFlow, all you need to do is navigate to `localhost:5000` on the machine where you are running MLFlow.\n", "\n", "However, because we are doing this demo on Google Colab, we have to use an external service called [ngrok](https://ngrok.com/) to access port 5000 of the Colab instance. **You do NOT need ngrok when doing this on your own machine.**\n", "\n", "Before you run the following snippet, make an account on [ngrok](https://ngrok.com/) and copy your *authtoken* (Left Sidebar --> Getting Started --> Your Authtoken), which you will need after executing the following snippet. You will be prompted by the code-block below to enter this token.\n" ], "metadata": { "id": "Lvn_jUq0-phh" } }, @@ -3177,14 +3407,14 @@ "source": [ "Now, navigate to the URL mentioned above and you should be able to access the MLFlow UI.\n", "\n", - "In the ML Flow UI, click on the experiment titled \"demo\" and the current run should now be shown. If you click on that, you should be able to access a panel which shows you the parameters used to train the model, current training metrics (such as loss, mean absolute error etc.) on the training/devel set. \n", + "In the MLFlow UI, click on the experiment titled \"demo\" and the current run should now be shown. If you click on that, you should be able to access a panel which shows you the parameters used to train the model, as well as current training metrics (such as loss and mean absolute error) on the training/devel sets.\n", "\n", "**Finally, most importantly, the bottom of this page will contain the Artifacts section, the first entry of which will show the location of the saved model.**\n", "\n", "\n", - "#### If Running MLFlow on your own machine \n", + "#### If Running MLFlow on your own machine\n", "\n", - "Navigate to the directory from where you initiated your GaMPEN run. Then execute the following command \n", + "Navigate to the directory from where you initiated your GaMPEN run. Then execute the following command\n", "\n", "```bash\n", "mlflow ui\n", "```\n", "\n", "\n", "#### If Running MLFlow on a Server/HPC Cluster\n", "\n", - "First, on the server/HPC, navigate to the directory from where you initiated your GaMPEN run (you can do this on separate machine as well -- only the filesystem needs to tbe same). Then execute the following command \n", + "First, on the server/HPC, navigate to the directory from where you initiated your GaMPEN run (you can do this on a separate machine as well -- only the filesystem needs to be the same). Then execute the following command\n", "\n", "```bash\n", "mlflow ui --host 0.0.0.0\n", "```\n", "\n", - "The `--host` option is important to make the MLFlow server accept connections from other machines. \n", + "The `--host` option is important to make the MLFlow server accept connections from other machines.\n", "\n", "Now, from your local machine, tunnel into port `5000` of the server where you ran the above command. For example, let's say you are in an HPC environment, where the machine where you ran the above command is named `server1`, the login node to your HPC is named `hpc.university.edu`, and you have the username `astronomer`. Then to forward the port, you should type the following command on your local machine\n",
"\n", "```bash\n", "ssh -N -L 5000:server1:5000 astronomer@hpc.university.edu\n", "```\n", "\n", - "If performing the above step without a login node (e.g., a server whhich has the IP `server1.university.edu`), you should be able to do \n", + "If performing the above step without a login node (e.g., a server which has the IP `server1.university.edu`), you should be able to do\n", "\n", "```bash\n", "ssh -N -L 5000:localhost:5000 astronomer@server1.university.edu\n", "```\n", @@ -3242,7 +3472,7 @@ "\n", "$$ \\left( \\log\\frac{L_B/L_T}{1-L_B/L_T}, \\log R_e, \\log \\mathrm{Flux} \\right) $$\n", "\n", - "We will perform the inverse transformation using some additional scripts later. \n", + "We will perform the inverse transformation using some additional scripts later.\n", "\n", "\n", "\n" ], "metadata": { @@ -3258,13 +3488,13 @@ "\n", "The backbone of performing inference is the `inference.py` file at `/GaMPEN/ggt/modules/`.\n", "\n", - "To use this file, we run it by passing different variables to the inferece file. In order to understand the various options that can be specified while running inference you can type `!python GaMPEN/ggt/modules/inference.py --help` in a Google Colab code cell or consult the documentation [here](https://gampen.readthedocs.io/en/latest/Using_GaMPEN.html#inference). \n", + "To use this file, we run it by passing different options to the inference module. In order to understand the various options that can be specified while running inference, you can type `!python GaMPEN/ggt/modules/inference.py --help` in a Google Colab code cell or consult the documentation [here](https://gampen.readthedocs.io/en/latest/Using_GaMPEN.html#inference).\n", "\n", "**⚠ STOP: We strongly recommend that you go through the page linked above to understand the various options we have used for performing inference below.**\n", "\n", - "The `data_dir`,`cutout_size`, `slug`, `normalize`, `parallel`, `label_cols`, `model_type`, `channels`, `label_scaling`, `repeat_dims`, `dropout_rate` must all be set to the values that were used during training the model. \n", + "The `data_dir`, `cutout_size`, `slug`, `normalize`, `parallel`, `label_cols`, `model_type`, `channels`, `label_scaling`, `repeat_dims`, and `dropout_rate` must all be set to the values that were used while training the model.\n", "\n", - "The `--mc-dropout` and `--cov-errors` options specify that we want to perform both Monte Carlo dropout during inference as well include aleatoric errors in each of the Monte Carlo runs. The `n_runs` parameter controls the number of different Monte Carlo models generated for prediction. For a robust analysis, we recommend setting this to `500` or `1000`. We set this to `50` here just for demonstrative purposes. " + "The `--mc-dropout` and `--cov-errors` options specify that we want to perform Monte Carlo dropout during inference, as well as include aleatoric errors in each of the Monte Carlo runs. The `n_runs` parameter controls the number of different Monte Carlo models generated for prediction. For a robust analysis, we recommend setting this to `500` or `1000`. We set this to `50` here just for demonstrative purposes. A sketch of the invocation is shown below.\n",
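"\n", "For illustration only -- the exact flag spellings here are assumptions (check the `--help` output), and the notebook's actual inference cell appears further down:\n", "\n", "```bash\n", "python /content/GaMPEN/ggt/modules/inference.py --model_path='/path/to/trained/model.pt' --data_dir=/content/GaMPEN/data/hsc/ --label_cols=custom_logit_bt,ln_R_e,ln_total_flux --mc-dropout --cov-errors --n-runs=50\n", "```"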
 ], "metadata": { "id": "1umWWY934xyL" } }, @@ -3273,7 +3503,7 @@ { "cell_type": "markdown", "source": [ - "We need a directory to store all the output files containing the predictions. Let's create a directory named `bayesian_inference_runs` to store these output \n", + "We need a directory to store all the output files containing the predictions. Let's create a directory named `bayesian_inference_runs` to store these output\n", "files" ], "metadata": { @@ -3296,7 +3526,7 @@ "source": [ "**⚠ STOP: In the code block below, put the full path to the trained model in the `model_path` variable. Additionally, enclose the path in single quotes and NOT double quotes.**\n", "\n", - "You can use the path printed at the end of training output along with the \n", + "You can use the path printed at the end of the training output along with the\n", "directory path (e.g., `'/content/models/demo-balanced-dev2-xxxxxxxxx.pt'`). Alternatively, you can also get the path by copy-pasting the full path from the MLFlow Artifacts section for the trained model." ], "metadata": { @@ -4160,7 +4390,7 @@ "$$ \\left( \\log\\frac{L_B/L_T}{1-L_B/L_T}, \\log R_e, \\log \\mathrm{Flux} \\right) $$\n", "\n", "\n", - "Now, we will use the [`result_aggregator.py`](https://github.com/aritraghsh09/GaMPEN/blob/master/ggt/modules/result_aggregator.py) file in GaMPEN to collate all these .csvs. The `result_aggregator` module of GaMPEN will collect all the csvs; scale the values back to $L_B/L_T$, $R_e$, Flux; produce summary statistics; as well as produce PDFs of the output variables for each image. \n", + "Now, we will use the [`result_aggregator.py`](https://github.com/aritraghsh09/GaMPEN/blob/master/ggt/modules/result_aggregator.py) file in GaMPEN to collate all these .csvs. The `result_aggregator` module of GaMPEN will collect all the csvs; scale the values back to $L_B/L_T$, $R_e$, and Flux; produce summary statistics; and produce PDFs of the output variables for each image.\n", "\n", "**For an understanding of all the options available in the `result_aggregator` module, please refer to [this page.](https://gampen.readthedocs.io/en/latest/Using_GaMPEN.html#result-aggregator)**\n", "\n", @@ -4294,7 +4524,7 @@ { "cell_type": "markdown", "source": [ - "Now, let's inspect the `summary.csv` file as well as the predicted PDFs. " + "Now, let's inspect the `summary.csv` file as well as the predicted PDFs." ], "metadata": { "id": "UNTw5bRep-ZN" } }, @@ -5028,7 +5258,7 @@ { "cell_type": "markdown", "source": [ - "As can be seen, the summary file contains all the galaxies in our test-set, and for every prediction column we have the \n", + "As can be seen, the summary file contains all the galaxies in our test set, and for every prediction column we have the\n", "\n", " * mean (_mean)\n", " * median (_median)\n", " * mode (_mode)\n", " * standard deviation (_std)\n", " * skewness (_skew)\n", " * kurtosis (_kurtosis)\n", " * $1\\sigma$ confidence interval (_sig_ci)\n", " * $2\\sigma$ confidence interval (_twosig_ci)\n", " * $3\\sigma$ confidence interval (_threesig_ci)\n", "\n", - "for the predicted distribution. \n", + "for the predicted distribution.\n", "\n", "**⚠ STOP: Note that the `result_aggregator` module also converts flux to magnitudes; however, this conversion assumes a photometric zeropoint that is only true for HSC. If you are using the `result_aggregator` module for some other survey, you should change this.**\n",
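"\n", "For reference, the conversion has this shape (a sketch with a hypothetical helper name; the zeropoint of 27 is what this demo assumes for HSC -- substitute your own survey's zeropoint):\n", "\n", "```python\n", "import numpy as np\n", "\n", "def flux_to_mag(flux, zeropoint=27.0):\n", "    # magnitude from flux, using the survey's photometric zeropoint\n", "    return -2.5 * np.log10(flux) + zeropoint\n", "```\n",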
"\n", @@ -5085,7 +5315,7 @@ "import random\n", "from matplotlib.patches import Rectangle\n", "from matplotlib.ticker import FormatStrFormatter, ScalarFormatter\n", - "from astropy.io import fits \n", + "from astropy.io import fits\n", "LOGMIN=1e-4" ], "metadata": { @@ -5121,7 +5351,7 @@ "                     font_size=15,\n", "                     cutout_size=239,\n", "                     imgs_to_print=10):\n", - "    \n", + "\n", "    summary_df = pd.read_csv(summary_file_path,nrows=imgs_to_print)\n", "\n", "    fig,ax1 = plt.subplots(len(summary_df),4,figsize=(4*5.3,len(summary_df)*4),\n", "\n", "    row_counter = 0\n", "\n", "    for i, img_num in enumerate(summary_df[\"object_id\"]):\n", - "    \n", + "\n", "        ax = ax1[row_counter]\n", "        ax[0].set_xticks([])\n", "        ax[0].set_yticks([])\n", - "    \n", + "\n", "        img_data = fits.getdata(imgdir\n", "                                + str(img_num) + \".fits\")\n", "        ax[0].imshow(img_data,norm=mpl.colors.LogNorm(vmin=max(img_data.min(),LOGMIN)))\n", "\n", "        pred_arr = np.load(pdf_dir + str(img_num) + \".npy\")\n", - "    \n", + "\n", "        pred_cols = [\"preds_bt\",\"preds_R_e_asec\",\"preds_total_mag\"]\n", "        true_cols = [\"bt\",\"R_e\",\"total_flux\"]\n", "        pred_arr_idxs = [2,0,3] # indexes of columns in pred_arr\n", "\n", "        for j, column_name in enumerate(pred_cols):\n", "\n", "            ax[j+1].plot(pred_arr[pred_arr_idxs[j]],\n", "                         pred_arr[pred_arr_idxs[j]+4],\n", "                         label=\"PDF\",lw=3)\n", - "    \n", + "\n", "\n", "            mode = summary_df[column_name + \"_mode\"][i]\n", "            sig_ci = summary_df[column_name + \"_sig_ci\"][i]\n", "            twosig_ci = summary_df[column_name + \"_twosig_ci\"][i]\n", "            threesig_ci = summary_df[column_name + \"_threesig_ci\"][i]\n", "            sig_ci = (float(sig_ci.split(',')[0][1:]),\n", "                      float(sig_ci.split(',')[1][:-1]))\n", "            twosig_ci = (float(twosig_ci.split(',')[0][1:]),\n", "                         float(twosig_ci.split(',')[1][:-1]))\n", "            threesig_ci = (float(threesig_ci.split(',')[0][1:]),\n", "                           float(threesig_ci.split(',')[1][:-1]))\n", "            n_out = pred_arr[pred_arr_idxs[j]+4]\n", - "    \n", + "\n", "            ax[j+1].plot([mode,mode],[0,np.max(pred_arr[pred_arr_idxs[j]+4])],c='r',\n", "                         ls='solid',label=\"Mode\", lw =3) # plotting an x = Mode line\n", "\n", "            if 'mag' in column_name.split('_'):\n", "                flux = summary_df[true_cols[j]][i]\n", "                true_value = -2.5*np.log10(flux) + 27  # magnitude with the HSC zeropoint of 27\n", - "        else: \n", + "            else:\n", "                true_value = summary_df[true_cols[j]][i]\n", "\n", "            ax[j+1].plot([true_value,true_value],[0,np.max(pred_arr[pred_arr_idxs[j]+4])],\n", "                         c='blue',ls='--',label=\"True Value\", lw =3) # plotting an x = True Value line\n", - "    \n", + "\n", "            rect = Rectangle((sig_ci[0], 0), sig_ci[1]-sig_ci[0], 0.25*np.max(n_out),color='coral',\n", "                             alpha=0.5,label=\"68.27 %ile\",lw=1)\n", "            border = Rectangle((sig_ci[0], 0), sig_ci[1]-sig_ci[0], 0.25*np.max(n_out),ec='coral',\n", "                               lw=3,fill=False)\n", - "            ax[j+1].add_patch(rect) \n", + "            ax[j+1].add_patch(rect)\n", "            ax[j+1].add_patch(border)\n", - "    \n", + "\n", "            rect = Rectangle((twosig_ci[0], 0), twosig_ci[1]-twosig_ci[0], 0.175*np.max(n_out),color='goldenrod',\n", "                             alpha=0.5,label=\"95.45 %ile\")\n", "            border = Rectangle((twosig_ci[0], 0), twosig_ci[1]-twosig_ci[0], 0.175*np.max(n_out),ec='goldenrod',\n", "                               lw=3,fill=False)\n", "            ax[j+1].add_patch(rect)\n", "            ax[j+1].add_patch(border)\n", - "    \n", + "\n", "            rect = Rectangle((threesig_ci[0], 0), threesig_ci[1]-threesig_ci[0], 0.10*np.max(n_out),\n", "                             color='seagreen',alpha=0.5,label=\"99.73 %ile\")\n", "            border = Rectangle((threesig_ci[0], 0), threesig_ci[1]-threesig_ci[0], 0.10*np.max(n_out),ec='seagreen',\n", "                               lw=3,fill=False)\n", "            ax[j+1].add_patch(rect)\n", "            ax[j+1].add_patch(border)\n", - "    \n", - "    \n", - "    \n", + "\n", + "\n", + "\n", "            #ax[j+1].ticklabel_format(axis='both',style='sci',scilimits=(0,0))\n", "            ax[j+1].set_yticks([])\n", "            
ax[j+1].tick_params(axis='x',labelsize=font_size-3)\n", - "    \n", - "    \n", + "\n", + "\n", "        if row_counter == 0:\n", "            ax[1].legend(loc='upper right',prop={'size': font_size-4})\n", - "    \n", - "    \n", - "    \n", + "\n", + "\n", + "\n", "        row_counter += 1\n", "\n", "\n", @@ -5229,7 +5459,7 @@ "source": [ "### Making Plots\n", "\n", - "Let's plot the predicted distributions for the first four galaxies in our test set. For this we use the `plot_hists` function we defined above. " + "Let's plot the predicted distributions for the first four galaxies in our test set. For this, we use the `plot_hists` function we defined above." ], "metadata": { "id": "DDkGGpViv8I4" } }, @@ -5273,9 +5503,9 @@ "source": [ "**As can be seen, our trained model does NOT seem to be doing very well!**\n", "\n", - "This is primarily because we only used about 50 images for trianing and during inference we used only 50 runs. For realistic results, you would need to use thousands of images for training and run inference with about 500/1000 runs,. \n", + "This is primarily because we only used about 50 images for training, and during inference we used only 50 runs. For realistic results, you would need to use thousands of images for training and run inference with about 500-1000 runs.\n", "\n", - "The primary goal of this demo is to get you acquainted with how to train a GaMPEN model. In order to see a fully trained GaMPEN model at work, please take a look at the [Making Predictions Tutorial](https://gampen.readthedocs.io/en/latest/Tutorials.html#making-predictions) " + "The primary goal of this demo is to get you acquainted with how to train a GaMPEN model. In order to see a fully trained GaMPEN model at work, please take a look at the [Making Predictions Tutorial](https://gampen.readthedocs.io/en/latest/Tutorials.html#making-predictions)" ], "metadata": { "id": "g8Jh2eMMsSzR"