Merge branch 'main' into documentation

lincc-frameworks · Jan 2, 2025 · bed3304 · bed3304
2 parents 3fc7421 + 6eaa1ca
commit bed3304
Show file tree

Hide file tree

Showing 4 changed files with 492 additions and 6 deletions.
diff --git a/docs/notebooks/opsim_notebook.ipynb b/docs/notebooks/opsim_notebook.ipynb
@@ -0,0 +1,318 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Using Rubin OpSims in Simulations\n",
+    "\n",
+    "A critical aspect to producing realistic simulations of time varying phenomena is correctly modeling the cadence at which the objects will be observed and capturing the relevant physical details about the observation. TDAstro provides an `OpSim` module that can load a Rubin opsim file and use that to filter observations based on location, associate observations with their filter, apply detector noise, etc.\n",
+    "\n",
+    "This notebook provides an introduction to using opsim data in the simulations."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## OpSim Class\n",
+    "\n",
+    "The `OpSim` class provides a wrapper for loading and querying survey-specific information including pointings and survey information. The required information is:\n",
+    "  * the pointing data (RA, dec, and times for each pointing), and\n",
+    "  * the zero point information for each pointing (a zero point column or the information needed to derive it)\n",
+    "\n",
+    "Other common information includes the airmass, exposure time, and filter used (the information needed to derive the ero points).\n",
+    "\n",
+    "Internally, the `OpSim` class uses a table with the column names are given by the imported data. For Rubin opsim files, this means columns will have names such as \"observationStartMJD\" or \"fieldRA\". Since different inputs may have different columns, the class provides the ability to map simple column names such as \"ra\" and \"time\" to the corresponding table names. The constructor allows the user to pass a column-mapping dictionary that maps the short column name to the name used within the database. The default column-mapper corresponds to the Rubin opsim format:\n",
+    "  * \"airmass\" -> \"airmass\"\n",
+    "  * \"dec\" -> \"fieldDec\"\n",
+    "  * \"exptime\" -> \"visitExposureTime\"\n",
+    "  * \"filter\" -> \"filter\"\n",
+    "  * \"ra\" -> \"fieldRA\"\n",
+    "  * \"time\" -> \"observationStartMJD\"\n",
+    "By default the `OpSim` class will look for Rubin column names, such as \"observationStartMJD\" and \"fieldRA\".\n",
+    "\n",
+    "We can create a simple `OpSim` by manually specifying the data as a dict or a pandas dataframe."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "\n",
+    "from tdastro.astro_utils.opsim import OpSim\n",
+    "\n",
+    "input_data = {\n",
+    "    \"observationStartMJD\": np.array([0.0, 1.0, 2.0, 3.0, 4.0]),\n",
+    "    \"fieldRA\": np.array([15.0, 30.0, 15.0, 0.0, 60.0]),\n",
+    "    \"fieldDec\": np.array([-10.0, -5.0, 0.0, 5.0, 10.0]),\n",
+    "    \"zp_nJy\": np.ones(5),\n",
+    "}\n",
+    "ops_data = OpSim(input_data)\n",
+    "print(ops_data.table)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Once we have an `OpSim` object, we can use the `[]` notation to access data directly. `OpSim` has the same mechanics as a pandas data frame. The columns are named and correspond to input attributes. Each row provides the information for a single pointing.\n",
+    "\n",
+    "As noted above, the column mapping allows us to access column by either their given name (\"observationStartMJD\") or their shortened name (\"time\")."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ops_data[\"observationStartMJD\"]  # Same as ops_data[\"time\"]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Loading an OpSim\n",
+    "\n",
+    "Most users will want to load a pre-existing `OpSim` from a database file using the `from_db()` function. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pathlib import Path\n",
+    "\n",
+    "from tdastro import _TDASTRO_TEST_DATA_DIR\n",
+    "\n",
+    "opsim_file = Path(_TDASTRO_TEST_DATA_DIR) / \"opsim_shorten.db\"\n",
+    "ops_data = OpSim.from_db(opsim_file)\n",
+    "\n",
+    "print(f\"Loaded an opsim database with {len(ops_data)} entries.\")\n",
+    "print(f\"Columns: {ops_data.columns}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Spatial Matching\n",
+    "\n",
+    "The `OpSim` class provides a framework for efficiently determining when an object was observed given its (RA, dec). We use the `range_search()` function to the indices of all pointings that are within a given radius of the query point.\n",
+    "\n",
+    "We start by taking the (RA, dec) of the first observation in the table and using that to determine all times this position was observed."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "query_ra = ops_data[\"ra\"][0]\n",
+    "query_dec = ops_data[\"dec\"][0]\n",
+    "print(f\"Searching for ({query_ra}, {query_dec}).\")\n",
+    "\n",
+    "# Find everything within 0.5 degrees of the query point.\n",
+    "matches = ops_data.range_search(query_ra, query_dec, 0.5)\n",
+    "print(f\"Found {len(matches)} matches at {matches}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Once we have the indices, we can use those to find the other information about the pointings."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ops_data[\"time\"][matches]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can run the spatial search in batch mode by providing a lists of RA and dec. The `range_search()` will return a list of numpy arrays where each element in the top-level list represents the matches for a single query (RA, dec). "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "num_queries = 10\n",
+    "query_ra = ops_data[\"ra\"][0:num_queries]\n",
+    "query_dec = ops_data[\"dec\"][0:num_queries]\n",
+    "\n",
+    "matches = ops_data.range_search(query_ra, query_dec, 0.5)\n",
+    "for idx, m_ids in enumerate(matches):\n",
+    "    print(f\"{idx}: ({query_ra[idx]}, {query_dec[idx]}) matched {m_ids}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Applying Noise\n",
+    "\n",
+    "The `OpSim` object use the information provided in its table to compute the noise properties for each observation and use those to inject synthetic noise into the observations. The `bandflux_error_point_source` function takes the pre-noise flux densities (bandflux of the point source in nJy) and returns the simulated bandflux noise in nJy.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fluxes = np.array([100.0, 200.0, 300.0])\n",
+    "flux_err = ops_data.bandflux_error_point_source(fluxes, [0, 1, 2])\n",
+    "flux_err"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This can then be feed into the `apply_noise` utility function to apply it to the flux."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from tdastro.astro_utils.noise_model import apply_noise\n",
+    "\n",
+    "noisy_fluxes = apply_noise(fluxes, flux_err)\n",
+    "noisy_fluxes"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Filtering\n",
+    "\n",
+    "Depending on the analysis we run, we might want to need to use only a subset of the data.  The `OpSim` class provides a helper function `filter_rows()` that can be used to filter the data in an `OpSim` and thus speed up subsequent operations."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "input_data[\"filter\"] = np.array([\"r\", \"g\", \"g\", \"r\", \"g\"])\n",
+    "ops_data2 = OpSim(input_data)\n",
+    "\n",
+    "print(\"BEFORE:\")\n",
+    "print(ops_data2.table)\n",
+    "\n",
+    "ops_data3 = ops_data2.filter_rows(ops_data2[\"filter\"] == \"g\")\n",
+    "\n",
+    "print(\"\\n\\nAFTER:\")\n",
+    "print(ops_data3.table)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In addition to filtering the rows in the table, the function will also update the cached data structures within the `OpSim` object, such as the KD-tree."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Scalability and Testing\n",
+    "\n",
+    "To test the scalability of the OpSim we can user the `create_random_opsim()` helper function to generate completely random data by sampling uniformly over the surface of a sphere. This will not be a realistic survey since it will cover both the Northern and Southern hemisphere, but can provide a good test set for timing."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import timeit\n",
+    "from tdastro.astro_utils.opsim import create_random_opsim\n",
+    "\n",
+    "# Create an opsim with 1,000,000 observations.\n",
+    "num_obs = 1_000_000\n",
+    "opsim_data2 = create_random_opsim(num_obs)\n",
+    "\n",
+    "# Use the first 10,000 samples as queries.\n",
+    "num_queries = 100_000\n",
+    "query_ra = np.asarray(opsim_data2[\"ra\"][0:num_queries])\n",
+    "query_dec = np.asarray(opsim_data2[\"dec\"][0:num_queries])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Using the large OpSim data, we can test where batching speeds up the range search."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%timeit\n",
+    "for i in range(num_queries):\n",
+    "    _ = opsim_data2.range_search(query_ra[i], query_dec[i], 0.5)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%timeit\n",
+    "_ = opsim_data2.range_search(query_ra, query_dec, 0.5)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "tdastro",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}