AstronomicalData/_sources/06_photo.ipynb
2020-11-18 19:32:36 -05:00


{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Chapter 6\n",
"\n",
"This is the sixth in a series of notebooks related to astronomy data.\n",
"\n",
"As a continuing example, we will replicate part of the analysis in a recent paper, \"[Off the beaten path: Gaia reveals GD-1 stars outside of the main stream](https://arxiv.org/abs/1805.00425)\" by Adrian M. Price-Whelan and Ana Bonaca.\n",
"\n",
"In the previous lesson we downloaded photometry data from Pan-STARRS, which is available from the same server we've been using to get Gaia data. \n",
"\n",
"The next step in the analysis is to select candidate stars based on the photometry data. The following figure from the paper is a color-magnitude diagram for the stars selected based on proper motion:\n",
"\n",
"<img width=\"300\" src=\"https://github.com/datacarpentry/astronomy-python/raw/gh-pages/fig/gd1-3.png\">\n",
"\n",
"In red is a theoretical isochrone, showing where we expect the stars in GD-1 to fall based on the metallicity and age of their original globular cluster. \n",
"\n",
"By selecting stars in the shaded area, we can further distinguish the main sequence of GD-1 from younger background stars."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Outline\n",
"\n",
"Here are the steps in this notebook:\n",
"\n",
"1. We'll reload the data from the previous notebook and make a color-magnitude diagram.\n",
"\n",
"2. Then we'll specify a polygon in the diagram that contains stars with the photometry we expect.\n",
"\n",
"3. Then we'll merge the photometry data with the list of candidate stars, storing the result in a Pandas `DataFrame`.\n",
"\n",
"After completing this lesson, you should be able to\n",
"\n",
"* Use Matplotlib to specify a polygon as a `Path` and determine which points fall inside it.\n",
"\n",
"* Use Pandas to merge data from multiple `DataFrame` objects, much like a database `JOIN` operation."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": [
"remove-cell"
]
},
"source": [
"## Installing libraries\n",
"\n",
"If you are running this notebook on Colab, you can run the following cell to install Astroquery and the other libraries we'll use.\n",
"\n",
"If you are running this notebook on your own computer, you might have to install these libraries yourself. See the instructions in the preface."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"# If we're running on Colab, install libraries\n",
"\n",
"import sys\n",
"IN_COLAB = 'google.colab' in sys.modules\n",
"\n",
"if IN_COLAB:\n",
"    !pip install astroquery astro-gala pyia python-wget"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reload the data\n",
"\n",
"The following cell downloads the photometry data we created in the previous notebook."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from wget import download\n",
"\n",
"filename = 'gd1_photo.fits'\n",
"filepath = 'https://github.com/AllenDowney/AstronomicalData/raw/main/data/'\n",
"\n",
"if not os.path.exists(filename):\n",
"    print(download(filepath+filename))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can read the data back into an Astropy `Table`."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from astropy.table import Table\n",
"\n",
"photo_table = Table.read(filename)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Plotting photometry data\n",
"\n",
"Now that we have photometry data from Pan-STARRS, we can replicate the [color-magnitude diagram](https://en.wikipedia.org/wiki/Galaxy_color%E2%80%93magnitude_diagram) from the original paper:\n",
"\n",
"<img width=\"300\" src=\"https://github.com/datacarpentry/astronomy-python/raw/gh-pages/fig/gd1-3.png\">\n",
"\n",
"The y-axis shows the apparent magnitude of each source with the [g filter](https://en.wikipedia.org/wiki/Photometric_system).\n",
"\n",
"The x-axis shows the difference in apparent magnitude between the g and i filters, which indicates color.\n",
"\n",
"Stars with lower values of (g-i) are brighter in g-band than in i-band, compared to other stars, which means they are bluer.\n",
"\n",
"Stars in the lower-left quadrant of this diagram are less bright and less metallic than the others, which means they are [likely to be older](http://spiff.rit.edu/classes/ladder/lectures/ordinary_stars/ordinary.html).\n",
"\n",
"Since we expect the stars in GD-1 to be older than the background stars, the stars in the lower-left are more likely to be in GD-1."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"def plot_cmd(table):\n",
"    \"\"\"Plot a color magnitude diagram.\n",
"    \n",
"    table: Table or DataFrame with photometry data\n",
"    \"\"\"\n",
"    y = table['g_mean_psf_mag']\n",
"    x = table['g_mean_psf_mag'] - table['i_mean_psf_mag']\n",
"\n",
"    plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3)\n",
"\n",
"    plt.xlim([0, 1.5])\n",
"    plt.ylim([14, 22])\n",
"    plt.gca().invert_yaxis()\n",
"\n",
"    plt.ylabel('$g_0$')\n",
"    plt.xlabel('$(g-i)_0$')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`plot_cmd` uses a new function, `invert_yaxis`, to invert the `y` axis, which is conventional when plotting magnitudes, since lower magnitude indicates higher brightness.\n",
"\n",
"`invert_yaxis` is a little different from the other functions we've used. You can't call it like this:\n",
"\n",
"```\n",
"plt.invert_yaxis() # doesn't work\n",
"```\n",
"\n",
"You have to call it like this:\n",
"\n",
"```\n",
"plt.gca().invert_yaxis() # works\n",
"```\n",
"\n",
"`gca` stands for \"get current axes\". It returns the `Axes` object that represents the plotting area of the current figure, and that object provides `invert_yaxis`.\n",
"\n",
"**In case anyone asks:** The most likely reason for this inconsistency in the interface is that `invert_yaxis` is a lesser-used function, so it's not made available at the top level of the interface."
]
},
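{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note, the same inversion can be written in Matplotlib's object-oriented style, where the `Axes` object is created explicitly instead of being fetched with `gca`. The following is a minimal sketch, not part of the original analysis; the three plotted points are made-up values.\n",
"\n",
"```python\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Create the figure and its Axes explicitly instead of relying on plt.gca()\n",
"fig, ax = plt.subplots()\n",
"\n",
"# Made-up (color, magnitude) points, only for illustration\n",
"ax.plot([0.3, 0.5, 0.7], [18, 20, 21], 'ko')\n",
"\n",
"# invert_yaxis is a method of the Axes object\n",
"ax.invert_yaxis()\n",
"\n",
"# yaxis_inverted reports whether the y axis now runs from high to low\n",
"is_inverted = ax.yaxis_inverted()\n",
"print(is_inverted)\n",
"```\n",
"\n",
"Either style works; `plt.gca()` is convenient when the figure was created implicitly by `plt.plot`."
]
},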
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's what the results look like."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plot_cmd(photo_table)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our figure does not look exactly like the one in the paper because we are working with a smaller region of the sky, so we don't have as many stars. But we can see an overdense region in the lower left that contains stars with the photometry we expect for GD-1.\n",
"\n",
"The authors of the original paper derive a detailed polygon that defines a boundary between stars that are likely to be in GD-1 or not.\n",
"\n",
"As a simplification, we'll choose a boundary by eye that seems to contain the overdense region."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Drawing a polygon\n",
"\n",
"Matplotlib provides a function called `ginput` that lets us click on the figure and make a list of coordinates.\n",
"\n",
"It's a little tricky to use `ginput` in a Jupyter notebook. \n",
"Before calling `plt.ginput` we have to tell Matplotlib to use `TkAgg` to draw the figure in a new window.\n",
"\n",
"When you run the following cell, a figure should appear in a new window. Click on it 10 times to draw a polygon around the overdense area. A red cross should appear where you click."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib as mpl\n",
"\n",
"coords = None\n",
"\n",
"if not IN_COLAB:\n",
"    mpl.use('TkAgg')\n",
"    plot_cmd(photo_table)\n",
"    coords = plt.ginput(10)\n",
"    mpl.use('agg')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The argument to `ginput` is the number of times the user has to click on the figure.\n",
"\n",
"The result from `ginput` is a list of coordinate pairs."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(0.2643369175627239, 17.84253127299485),\n",
" (0.3539426523297491, 18.799116997792495),\n",
" (0.47491039426523296, 19.682119205298015),\n",
" (0.6317204301075269, 20.454746136865342),\n",
" (0.7661290322580645, 20.785871964679913),\n",
" (0.8064516129032258, 21.41133186166299),\n",
" (0.5869175627240143, 21.300956585724798),\n",
" (0.39426523297491034, 20.565121412803535),\n",
" (0.22401433691756267, 19.240618101545255),\n",
" (0.19713261648745517, 18.02649006622517)]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"coords"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If `ginput` doesn't work for you, you could use the following coordinates."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"if coords is None:\n",
"    coords = [(0.2, 17.5), \n",
"              (0.2, 19.5), \n",
"              (0.65, 22),\n",
"              (0.75, 21),\n",
"              (0.4, 19),\n",
"              (0.4, 17.5)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next step is to convert the coordinates to a format we can use to plot them, which is a sequence of `x` coordinates and a sequence of `y` coordinates. The NumPy function `transpose` does what we want. "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(array([0.26433692, 0.35394265, 0.47491039, 0.63172043, 0.76612903,\n",
" 0.80645161, 0.58691756, 0.39426523, 0.22401434, 0.19713262]),\n",
" array([17.84253127, 18.799117 , 19.68211921, 20.45474614, 20.78587196,\n",
" 21.41133186, 21.30095659, 20.56512141, 19.2406181 , 18.02649007]))"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"\n",
"xs, ys = np.transpose(coords)\n",
"xs, ys"
]
},
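{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what `np.transpose` does here, it can help to try it on a shorter list. The following sketch uses three made-up coordinate pairs (the names `pairs`, `xs_demo`, and `ys_demo` are only for illustration) and also shows that `zip(*pairs)` regroups the values the same way using plain Python tuples.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Made-up (color, magnitude) pairs in the same layout ginput returns\n",
"pairs = [(0.2, 17.5), (0.2, 19.5), (0.65, 22.0)]\n",
"\n",
"# Transposing N pairs gives a 2 x N array, which unpacks into two rows\n",
"xs_demo, ys_demo = np.transpose(pairs)\n",
"\n",
"# zip(*pairs) performs the same regrouping and yields two tuples\n",
"xs_zip, ys_zip = zip(*pairs)\n",
"\n",
"print(xs_demo, ys_demo)\n",
"print(xs_zip, ys_zip)\n",
"```\n",
"\n",
"The NumPy version returns arrays, which is convenient if we want to do arithmetic on the coordinates later."
]
},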
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To display the polygon, we'll draw the figure again and use `plt.plot` to draw the polygon."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plot_cmd(photo_table)\n",
"plt.plot(xs, ys);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If it looks like your polygon does a good job surrounding the overdense area, go on to the next section. Otherwise you can try again.\n",
"\n",
"If you want a polygon with more points (or fewer), you can change the argument to `ginput`.\n",
"\n",
"The polygon does not have to be \"closed\". When we use this polygon in the next section, the last and first points will be connected by a straight line.\n"
]
},
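{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because `plt.plot` only connects consecutive points, the outline in the figure stays open between the last and first vertices. If you would like the figure to show the closing edge as well, one option is to append the first vertex to the end before plotting. This is a small sketch with made-up vertices; `xs_open`, `ys_open`, `xs_closed`, and `ys_closed` are illustrative names, not part of the notebook.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Made-up open polygon: the last vertex differs from the first\n",
"xs_open = np.array([0.2, 0.2, 0.65])\n",
"ys_open = np.array([17.5, 19.5, 22.0])\n",
"\n",
"# Append the first vertex so a line plot returns to its starting point\n",
"xs_closed = np.append(xs_open, xs_open[0])\n",
"ys_closed = np.append(ys_open, ys_open[0])\n",
"\n",
"print(xs_closed, ys_closed)\n",
"```\n",
"\n",
"Passing `xs_closed` and `ys_closed` to `plt.plot` would then draw all of the edges. This only affects the drawing; the selection in the next section does not need it."
]
},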
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Which points are in the polygon?\n",
"\n",
"Matplotlib provides a `Path` object that we can use to check which points fall in the polygon we selected.\n",
"\n",
"Here's how we make a `Path` using a list of coordinates."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Path(array([[ 0.26433692, 17.84253127],\n",
" [ 0.35394265, 18.799117 ],\n",
" [ 0.47491039, 19.68211921],\n",
" [ 0.63172043, 20.45474614],\n",
" [ 0.76612903, 20.78587196],\n",
" [ 0.80645161, 21.41133186],\n",
" [ 0.58691756, 21.30095659],\n",
" [ 0.39426523, 20.56512141],\n",
" [ 0.22401434, 19.2406181 ],\n",
" [ 0.19713262, 18.02649007]]), None)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from matplotlib.path import Path\n",
"\n",
"path = Path(coords)\n",
"path"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`Path` provides `contains_points`, which figures out which points are inside the polygon.\n",
"\n",
"To test it, we'll create a list with two points, one inside the polygon and one outside."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"points = [(0.4, 20), \n",
"          (0.4, 30)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can make sure `contains_points` does what we expect."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ True, False])"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"inside = path.contains_points(points)\n",
"inside"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is an array of Boolean values.\n",
"\n",
"We are almost ready to select stars whose photometry data falls in this polygon. But first we need to do some data cleaning."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reloading the data\n",
"\n",
"Now we need to combine the photometry data with the list of candidate stars we identified in a previous notebook. The following cell downloads it:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from wget import download\n",
"\n",
"filename = 'gd1_candidates.hdf5'\n",
"filepath = 'https://github.com/AllenDowney/AstronomicalData/raw/main/data/'\n",
"\n",
"if not os.path.exists(filename):\n",
"    print(download(filepath+filename))"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"candidate_df = pd.read_hdf(filename, 'candidate_df')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`candidate_df` is the Pandas `DataFrame` that contains the results from Notebook XX, which selects stars likely to be in GD-1 based on proper motion. In addition to the ICRS coordinates from Gaia, it includes position and proper motion transformed to the GD-1 frame (`phi1`, `phi2`, `pm_phi1`, `pm_phi2`)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Merging photometry data\n",
"\n",
"Before we select stars based on photometry data, we have to solve two problems:\n",
"\n",
"1. We only have Pan-STARRS data for some stars in `candidate_df`.\n",
"\n",
"2. Even for the stars where we have Pan-STARRS data in `photo_table`, some photometry data is missing.\n",
"\n",
"We will solve these problems in two steps:\n",
"\n",
"1. We'll merge the data from `candidate_df` and `photo_table` into a single Pandas `DataFrame`.\n",
"\n",
"2. We'll use Pandas functions to deal with missing data.\n",
"\n",
"`candidate_df` is already a `DataFrame`, but `photo_table` is an Astropy `Table`. Let's convert it to Pandas:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"source_id\n",
"g_mean_psf_mag\n",
"i_mean_psf_mag\n"
]
}
],
"source": [
"photo_df = photo_table.to_pandas()\n",
"\n",
"for colname in photo_df.columns:\n",
"    print(colname)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we want to combine `candidate_df` and `photo_df` into a single table, using `source_id` to match up the rows.\n",
"\n",
"You might recognize this task; it's the same as the JOIN operation in ADQL/SQL.\n",
"\n",
"Pandas provides a function called `merge` that does what we want. Here's how we use it."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>source_id</th>\n",
" <th>ra</th>\n",
" <th>dec</th>\n",
" <th>pmra</th>\n",
" <th>pmdec</th>\n",
" <th>parallax</th>\n",
" <th>parallax_error</th>\n",
" <th>radial_velocity</th>\n",
" <th>phi1</th>\n",
" <th>phi2</th>\n",
" <th>pm_phi1</th>\n",
" <th>pm_phi2</th>\n",
" <th>g_mean_psf_mag</th>\n",
" <th>i_mean_psf_mag</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>635559124339440000</td>\n",
" <td>137.586717</td>\n",
" <td>19.196544</td>\n",
" <td>-3.770522</td>\n",
" <td>-12.490482</td>\n",
" <td>0.791393</td>\n",
" <td>0.271754</td>\n",
" <td>NaN</td>\n",
" <td>-59.630489</td>\n",
" <td>-1.216485</td>\n",
" <td>-7.361363</td>\n",
" <td>-0.592633</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>635860218726658176</td>\n",
" <td>138.518707</td>\n",
" <td>19.092339</td>\n",
" <td>-5.941679</td>\n",
" <td>-11.346409</td>\n",
" <td>0.307456</td>\n",
" <td>0.199466</td>\n",
" <td>NaN</td>\n",
" <td>-59.247330</td>\n",
" <td>-2.016078</td>\n",
" <td>-7.527126</td>\n",
" <td>1.748779</td>\n",
" <td>17.8978</td>\n",
" <td>17.517401</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>635674126383965568</td>\n",
" <td>138.842874</td>\n",
" <td>19.031798</td>\n",
" <td>-3.897001</td>\n",
" <td>-12.702780</td>\n",
" <td>0.779463</td>\n",
" <td>0.223692</td>\n",
" <td>NaN</td>\n",
" <td>-59.133391</td>\n",
" <td>-2.306901</td>\n",
" <td>-7.560608</td>\n",
" <td>-0.741800</td>\n",
" <td>19.2873</td>\n",
" <td>17.678101</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>635535454774983040</td>\n",
" <td>137.837752</td>\n",
" <td>18.864007</td>\n",
" <td>-4.335041</td>\n",
" <td>-14.492309</td>\n",
" <td>0.314514</td>\n",
" <td>0.102775</td>\n",
" <td>NaN</td>\n",
" <td>-59.785300</td>\n",
" <td>-1.594569</td>\n",
" <td>-9.357536</td>\n",
" <td>-1.218492</td>\n",
" <td>16.9238</td>\n",
" <td>16.478100</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>635497276810313600</td>\n",
" <td>138.044516</td>\n",
" <td>19.009471</td>\n",
" <td>-7.172931</td>\n",
" <td>-12.291499</td>\n",
" <td>0.425404</td>\n",
" <td>0.337689</td>\n",
" <td>NaN</td>\n",
" <td>-59.557744</td>\n",
" <td>-1.682147</td>\n",
" <td>-9.000831</td>\n",
" <td>2.334407</td>\n",
" <td>19.9242</td>\n",
" <td>18.334000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" source_id ra dec pmra pmdec parallax \\\n",
"0 635559124339440000 137.586717 19.196544 -3.770522 -12.490482 0.791393 \n",
"1 635860218726658176 138.518707 19.092339 -5.941679 -11.346409 0.307456 \n",
"2 635674126383965568 138.842874 19.031798 -3.897001 -12.702780 0.779463 \n",
"3 635535454774983040 137.837752 18.864007 -4.335041 -14.492309 0.314514 \n",
"4 635497276810313600 138.044516 19.009471 -7.172931 -12.291499 0.425404 \n",
"\n",
" parallax_error radial_velocity phi1 phi2 pm_phi1 pm_phi2 \\\n",
"0 0.271754 NaN -59.630489 -1.216485 -7.361363 -0.592633 \n",
"1 0.199466 NaN -59.247330 -2.016078 -7.527126 1.748779 \n",
"2 0.223692 NaN -59.133391 -2.306901 -7.560608 -0.741800 \n",
"3 0.102775 NaN -59.785300 -1.594569 -9.357536 -1.218492 \n",
"4 0.337689 NaN -59.557744 -1.682147 -9.000831 2.334407 \n",
"\n",
" g_mean_psf_mag i_mean_psf_mag \n",
"0 NaN NaN \n",
"1 17.8978 17.517401 \n",
"2 19.2873 17.678101 \n",
"3 16.9238 16.478100 \n",
"4 19.9242 18.334000 "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"merged = pd.merge(candidate_df, \n",
"                  photo_df, \n",
"                  on='source_id', \n",
"                  how='left')\n",
"merged.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first argument is the \"left\" table, the second argument is the \"right\" table, and the keyword argument `on='source_id'` specifies a column to use to match up the rows.\n",
"\n",
"The argument `how='left'` means that the result should have all rows from the left table, even if some of them don't match up with a row in the right table.\n",
"\n",
"If you are interested in the other options for `how`, you can [read the documentation of `merge`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html).\n",
"\n",
"You can also do different types of join in ADQL/SQL; [you can read about that here](https://www.w3schools.com/sql/sql_join.asp).\n",
"\n",
"The result is a `DataFrame` that contains the same number of rows as `candidate_df`. "
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(7346, 3724, 7346)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(candidate_df), len(photo_df), len(merged)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And all columns from both tables."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"source_id\n",
"ra\n",
"dec\n",
"pmra\n",
"pmdec\n",
"parallax\n",
"parallax_error\n",
"radial_velocity\n",
"phi1\n",
"phi2\n",
"pm_phi1\n",
"pm_phi2\n",
"g_mean_psf_mag\n",
"i_mean_psf_mag\n"
]
}
],
"source": [
"for colname in merged.columns:\n",
"    print(colname)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Detail** You might notice that Pandas also provides a function called `join`; it does almost the same thing, but the interface is slightly different. We think `merge` is a little easier to use, so that's what we chose. It's also more consistent with JOIN in SQL, so if you learn how to use `pd.merge`, you are also learning how to use SQL JOIN.\n",
"\n",
"Also, someone might ask why we have to use Pandas to do this join; why didn't we do it in ADQL. The answer is that we could have done that, but since we already have the data we need, we should probably do the computation locally rather than make another round trip to the Gaia server."
]
},
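{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the effect of `how` more concrete, here is a small sketch with two made-up tables. The column names echo the real ones, but the values and the names `left`, `right`, `left_join`, and `inner_join` are invented for illustration.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Made-up candidate list: three sources\n",
"left = pd.DataFrame({'source_id': [1, 2, 3],\n",
"                     'pmra': [-3.8, -5.9, -3.9]})\n",
"\n",
"# Made-up photometry: source 1 is missing, source 4 is not a candidate\n",
"right = pd.DataFrame({'source_id': [2, 3, 4],\n",
"                      'g_mean_psf_mag': [17.9, 19.3, 16.9]})\n",
"\n",
"# A left join keeps every row of the left table and fills gaps with NaN\n",
"left_join = pd.merge(left, right, on='source_id', how='left')\n",
"\n",
"# An inner join keeps only the rows that appear in both tables\n",
"inner_join = pd.merge(left, right, on='source_id', how='inner')\n",
"\n",
"print(left_join)\n",
"print(inner_join)\n",
"```\n",
"\n",
"With `how='left'`, source 1 survives with `NaN` photometry, which is the behavior we rely on when we count missing data in the next section."
]
},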
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Missing data\n",
"\n",
"Let's add columns to the merged table for magnitude and color."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"merged['mag'] = merged['g_mean_psf_mag']\n",
"merged['color'] = merged['g_mean_psf_mag'] - merged['i_mean_psf_mag']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These columns contain the special value `NaN` where we are missing data.\n",
"\n",
"We can use `notnull` to see which rows contain valid data, that is, non-null values."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 False\n",
"1 True\n",
"2 True\n",
"3 True\n",
"4 True\n",
" ... \n",
"7341 True\n",
"7342 False\n",
"7343 False\n",
"7344 True\n",
"7345 False\n",
"Name: color, Length: 7346, dtype: bool"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"merged['color'].notnull()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And `sum` to count the number of valid values."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3724"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"merged['color'].notnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For scientific purposes, it's not obvious what we should do with candidate stars if we don't have photometry data. Should we give them the benefit of the doubt or leave them out?\n",
"\n",
"In part the answer depends on the goal: are we trying to identify more stars that might be in GD-1, or a smaller set of stars that have higher probability?\n",
"\n",
"In the next section, we'll leave them out, but you can experiment with the alternative."
]
},
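{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two choices described above can be written as Boolean masks. The following is a sketch on a made-up three-row table, where a simple range test on `color` stands in for the polygon test we'll apply in the next section; `toy`, `has_photo`, `in_region`, `strict`, and `lenient` are illustrative names.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Made-up table; the second row has no photometry\n",
"toy = pd.DataFrame({'color': [0.3, np.nan, 1.6],\n",
"                    'mag': [18.0, np.nan, 19.3]})\n",
"\n",
"# True where photometry is available\n",
"has_photo = toy['color'].notnull()\n",
"\n",
"# Stand-in for the polygon test; comparisons with NaN evaluate to False\n",
"in_region = toy['color'].between(0.2, 0.8)\n",
"\n",
"# Leave out stars without photometry\n",
"strict = has_photo & in_region\n",
"\n",
"# Give stars without photometry the benefit of the doubt\n",
"lenient = ~has_photo | in_region\n",
"\n",
"print(strict.sum(), lenient.sum())\n",
"```\n",
"\n",
"The strict mask gives a smaller, purer sample; the lenient mask keeps more candidates at the cost of more contamination."
]
},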
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Selecting based on photometry\n",
"\n",
"Now let's see how many of these points are inside the polygon we chose.\n",
"\n",
"We can use a list of column names to select `color` and `mag`."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>color</th>\n",
" <th>mag</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.3804</td>\n",
" <td>17.8978</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1.6092</td>\n",
" <td>19.2873</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.4457</td>\n",
" <td>16.9238</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1.5902</td>\n",
" <td>19.9242</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" color mag\n",
"0 NaN NaN\n",
"1 0.3804 17.8978\n",
"2 1.6092 19.2873\n",
"3 0.4457 16.9238\n",
"4 1.5902 19.9242"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"points = merged[['color', 'mag']]\n",
"points.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is a `DataFrame` that can be treated as a sequence of coordinates, so we can pass it to `contains_points`:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([False, False, False, ..., False, False, False])"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"inside = path.contains_points(points)\n",
"inside"
]
},
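{
"cell_type": "markdown",
"metadata": {},
"source": [
"The table we passed to `contains_points` includes rows where `color` and `mag` are `NaN`. If you prefer not to depend on how `contains_points` treats those rows, one option is to test only the rows with complete photometry and mark the rest as outside. This is a sketch under that assumption, using a made-up unit-square polygon and a made-up table; `square`, `toy`, `valid`, and `inside_demo` are illustrative names.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"from matplotlib.path import Path\n",
"\n",
"# Made-up polygon: the unit square\n",
"square = Path([(0, 0), (1, 0), (1, 1), (0, 1)])\n",
"\n",
"# Made-up table; the second row has no photometry\n",
"toy = pd.DataFrame({'color': [0.5, np.nan, 2.0],\n",
"                    'mag': [0.5, np.nan, 0.5]})\n",
"\n",
"# True only for rows where both coordinates are present\n",
"valid = toy[['color', 'mag']].notnull().all(axis=1).to_numpy()\n",
"\n",
"# Start with every row marked as outside, then test only the valid rows\n",
"inside_demo = np.zeros(len(toy), dtype=bool)\n",
"valid_points = toy.loc[valid, ['color', 'mag']].to_numpy()\n",
"inside_demo[valid] = square.contains_points(valid_points)\n",
"\n",
"print(inside_demo)\n",
"```\n",
"\n",
"The resulting array has one entry per row of the table, so it can be used as a mask in the same way as `inside`."
]
},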
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is a Boolean array. We can use `sum` to see how many stars fall in the polygon."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"481"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"inside.sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can use `inside` as a mask to select stars that fall inside the polygon."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"selected2 = merged[inside]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's make a color-magnitude plot one more time, highlighting the selected stars with green `x` marks."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plot_cmd(photo_table)\n",
"plt.plot(xs, ys)\n",
"\n",
"plt.plot(selected2['color'], selected2['mag'], 'gx');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It looks like the selected stars are, in fact, inside the polygon, which means they have photometry data consistent with GD-1.\n",
"\n",
"Finally, we can plot the coordinates of the selected stars:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 720x180 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(10,2.5))\n",
"\n",
"x = selected2['phi1']\n",
"y = selected2['phi2']\n",
"\n",
"plt.plot(x, y, 'ko', markersize=0.7, alpha=0.9)\n",
"\n",
"plt.xlabel('phi1 (degree GD1)')\n",
"plt.ylabel('phi2 (degree GD1)')\n",
"\n",
"plt.axis('equal');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This example includes two new Matplotlib commands:\n",
"\n",
"* `figure` creates the figure. In previous examples, we didn't have to use this function; the figure was created automatically. But when we call it explicitly, we can provide arguments like `figsize`, which sets the size of the figure.\n",
"\n",
"* `axis` with the parameter `equal` sets up the axes so a unit is the same size along the `x` and `y` axes.\n",
"\n",
"In an example like this, where `x` and `y` represent coordinates in space, equal axes ensures that the distance between points is represented accurately. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Write the data\n",
"\n",
"Let's write the merged DataFrame to a file."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"filename = 'gd1_merged.hdf5'\n",
"\n",
"merged.to_hdf(filename, key='merged')\n",
"selected2.to_hdf(filename, key='selected2')"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-rw-rw-r-- 1 downey downey 2.0M Nov 18 19:28 gd1_merged.hdf5\r\n"
]
}
],
"source": [
"!ls -lh gd1_merged.hdf5"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are using Windows, `ls` might not work; in that case, try:\n",
"\n",
"```\n",
"!dir gd1_merged.hdf5\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Save the polygon\n",
"\n",
"[Reproducible research](https://en.wikipedia.org/wiki/Reproducibility#Reproducible_research) is \"the idea that ... the full computational environment used to produce the results in the paper such as the code, data, etc. can be used to reproduce the results and create new work based on the research.\"\n",
"\n",
"This Jupyter notebook is an example of reproducible research because it contains all of the code needed to reproduce the results, including the database queries that download the data and the analysis.\n",
"\n",
"However, when we used `ginput` to define a polygon by hand, we introduced a non-reproducible element to the analysis. If someone running this notebook chooses a different polygon, they will get different results. So it is important to record the polygon we chose as part of the data analysis pipeline.\n",
"\n",
"Since `coords` is a Python list of coordinate pairs, not a `DataFrame`, we can't use `to_hdf` to save it in a file directly. But we can convert it to a Pandas `DataFrame` and save that.\n",
"\n",
"As an alternative, we could use [PyTables](http://www.pytables.org/index.html), which is the library Pandas uses to read and write HDF5 files. It is a powerful library, but not easy to use directly. So let's take advantage of Pandas."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"coords_df = pd.DataFrame(coords)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"filename = 'gd1_polygon.hdf5'\n",
"coords_df.to_hdf(filename, 'coords_df')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can read it back like this."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"coords2_df = pd.read_hdf(filename, 'coords_df')\n",
"coords2 = coords2_df.to_numpy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And verify that the data we read back is the same."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.all(coords2 == coords)"
]
},
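{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want the saved polygon to be easier to interpret later, one option is to name the columns when creating the `DataFrame`. The following sketch shows the conversion in both directions without writing a file; the vertices are made up, and `poly`, `poly_df`, `poly_back`, and `same` are illustrative names. It compares with `np.allclose`, which tolerates small floating-point differences.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Made-up polygon vertices as (color, magnitude) pairs\n",
"poly = [(0.2, 17.5), (0.2, 19.5), (0.65, 22.0)]\n",
"\n",
"# Naming the columns records which axis each value belongs to\n",
"poly_df = pd.DataFrame(poly, columns=['color', 'mag'])\n",
"\n",
"# to_numpy recovers an N x 2 array of vertices\n",
"poly_back = poly_df.to_numpy()\n",
"\n",
"# Compare the recovered array with the original list of pairs\n",
"same = np.allclose(poly_back, poly)\n",
"print(same)\n",
"```\n",
"\n",
"The recovered array can be passed to `Path` in the same way as the original list of coordinates."
]
},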
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary\n",
"\n",
"In this notebook, we worked with two datasets: the list of candidate stars from Gaia and the photometry data from Pan-STARRS.\n",
"\n",
"We drew a color-magnitude diagram and used it to identify stars we think are likely to be in GD-1.\n",
"\n",
"Then we used a Pandas `merge` operation to combine the data into a single `DataFrame`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Best practices\n",
"\n",
"* If you want to perform something like a database `JOIN` operation with data that is in a Pandas `DataFrame`, you can use the `join` or `merge` function. In many cases, `merge` is easier to use because the arguments are more like SQL.\n",
"\n",
"* Use Matplotlib options to control the size and aspect ratio of figures to make them easier to interpret. In this example, we scaled the axes so the size of a degree is equal along both axes.\n",
"\n",
"* Matplotlib also provides operations for working with points, polygons, and other geometric entities, so it's not just for making figures.\n",
"\n",
"* Be sure to record every element of the data analysis pipeline that would be needed to replicate the results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}