diff --git a/01_query.html b/01_query.html index 69d289a..645405e 100644 --- a/01_query.html +++ b/01_query.html @@ -5,7 +5,7 @@ - Queries — Astronomical Data in Python + 1. Queries — Astronomical Data in Python @@ -57,7 +57,7 @@ - + @@ -99,22 +99,22 @@ -
  • - - Cleaning up - -
  • Formatting queries @@ -350,7 +345,9 @@
    -

    Queries

    +

    1. Queries

    +

    This is the first in a series of lessons about working with astronomical data.

    +

    As a running example, we will replicate parts of the analysis in a recent paper, “Off the beaten path: Gaia reveals GD-1 stars outside of the main stream” by Adrian Price-Whelan and Ana Bonaca.

    Outline

    This lesson demonstrates the steps for selecting and downloading data from the Gaia Database:

    @@ -378,19 +375,19 @@ But you might find it easier to learn from
  • Notebooks are made up of code cells and text cells (and a few other less common kinds). Code cells contain code; text cells, like this one, contain explanatory text written in Markdown.

  • To run a code cell, click the cell to select it and press Shift-Enter. The output of the code should appear below the cell.

  • In general, notebooks only run correctly if you run every code cell in order from top to bottom. If you run cells out of order, you are likely to get errors.

  • You can modify existing cells, but then you have to run them again to see the effect.

  • -
  • You can add new cells, but again, you might have to be careful about the order you run them in.

  • +
  • You can add new cells, but again, you have to be careful about the order you run them in.

  • If you have added or modified cells and the behavior of the notebook seems strange, you can restart the “kernel”, which clears all of the variables and functions you have defined, and run the cells again from the beginning.

  • Before you go on, you might want to explore the other menus and the toolbar to see what else you can do.

    @@ -420,7 +417,7 @@ Created TAP+ (v1.2.1) - Connection: -

    Running this import statement has the effect of creating a TAP+ connection; TAP stands for “Table Access Protocol”, which is a network protocol for sending queries to the database and getting back the results.

    +

    This import statement creates a TAP+ connection; TAP stands for “Table Access Protocol”, which is a network protocol for sending queries to the database and getting back the results.

    Databases and Tables

    @@ -444,6 +441,7 @@ INFO: Done. [astroquery.utils.tap.core]
    +

    The following for loop prints the names of the tables.

    for table in tables:
    @@ -454,6 +452,7 @@ INFO: Done. [astroquery.utils.tap.core]
     
    external.apassdr9
     external.gaiadr2_geometric_distance
    +external.gaiaedr3_distance
     external.galex_ais
     external.ravedr5_com
     external.ravedr5_dr5
    @@ -482,17 +481,31 @@ gaiaedr3.agn_cross_id
     gaiaedr3.commanded_scan_law
     gaiaedr3.dr2_neighbourhood
     gaiaedr3.frame_rotator_source
    +gaiaedr3.allwise_best_neighbour
    +gaiaedr3.allwise_neighbourhood
    +gaiaedr3.apassdr9_best_neighbour
    +gaiaedr3.apassdr9_join
    +gaiaedr3.apassdr9_neighbourhood
    +gaiaedr3.gsc23_best_neighbour
    +gaiaedr3.gsc23_join
    +gaiaedr3.gsc23_neighbourhood
     gaiaedr3.hipparcos2_best_neighbour
     gaiaedr3.hipparcos2_neighbourhood
     gaiaedr3.panstarrs1_best_neighbour
     gaiaedr3.panstarrs1_join
     gaiaedr3.panstarrs1_neighbourhood
    +gaiaedr3.ravedr5_best_neighbour
    +gaiaedr3.ravedr5_join
    +gaiaedr3.ravedr5_neighbourhood
     gaiaedr3.sdssdr13_best_neighbour
     gaiaedr3.sdssdr13_join
     gaiaedr3.sdssdr13_neighbourhood
     gaiaedr3.skymapperdr2_best_neighbour
     gaiaedr3.skymapperdr2_join
     gaiaedr3.skymapperdr2_neighbourhood
    +gaiaedr3.tmass_psc_xsc_best_neighbour
    +gaiaedr3.tmass_psc_xsc_join
    +gaiaedr3.tmass_psc_xsc_neighbourhood
     gaiaedr3.tycho2tdsc_merge_best_neighbour
     gaiaedr3.tycho2tdsc_merge_neighbourhood
     gaiaedr3.urat1_best_neighbour
    @@ -595,7 +608,7 @@ Parsing table 'gaiadr2.gaia_source'...
     Done.
     
    -
    <astroquery.utils.tap.model.taptable.TapTableMeta at 0x7f2e23f089d0>
    +
    <astroquery.utils.tap.model.taptable.TapTableMeta at 0x7f50edd2aeb0>
     
    @@ -732,7 +745,7 @@ epoch_photometry_url
    -

    You can probably guess what many of these columns are by looking at the names, but you should resist the temptation to guess. +

    You can probably infer what many of these columns are by looking at the names, but you should resist the temptation to guess. To find out what the columns mean, read the documentation.

    If you want to know what can go wrong when you don’t read the documentation, you might like this article.

    @@ -855,15 +868,16 @@ quality_flag

    Writing queries

    -

    By now you might be wondering how we download the actual data. With tables this big, you generally don’t. Instead, you use queries to select only the data you want.

    +

    By now you might be wondering how we download these tables. With tables this big, you generally don’t. Instead, you use queries to select only the data you want.

    A query is a string written in a query language like SQL; for the Gaia database, the query language is a dialect of SQL called ADQL.

    Here’s an example of an ADQL query.

    query1 = """SELECT 
     TOP 10
    -source_id, ref_epoch, ra, dec, parallax 
    -FROM gaiadr2.gaia_source"""
    +source_id, ra, dec, parallax 
    +FROM gaiadr2.gaia_source
    +"""
     
    @@ -877,25 +891,26 @@ quality_flag

    The third line is a list of column names, indicating which columns we want.

    In this example, the keywords are capitalized and the column names are lowercase. This is a common style, but it is not required. ADQL and SQL are not case-sensitive.

    +

    Also, the query is broken into multiple lines to make it more readable. This is a common style, but not required. Line breaks don’t affect the behavior of the query.

    To run this query, we use the Gaia object, which represents our connection to the Gaia database, and invoke launch_job:

    -
    job1 = Gaia.launch_job(query1)
    -job1
    +
    job = Gaia.launch_job(query1)
    +job
     
    -
    <astroquery.utils.tap.model.job.Job at 0x7f2e23f2afa0>
    +
    <astroquery.utils.tap.model.job.Job at 0x7f50edd2adc0>
     

    The result is an object that represents the job running on a Gaia server.

    -

    If you print it, it displays metadata for the forthcoming table.

    +

    If you print it, it displays metadata for the forthcoming results.

    -
    print(job1)
    +
    print(job)
     
    @@ -904,14 +919,13 @@ quality_flag name dtype unit description n_bad --------- ------- ---- ------------------------------------------------------------------ ----- source_id int64 Unique source identifier (unique within a particular Data Release) 0 -ref_epoch float64 yr Reference epoch 0 ra float64 deg Right ascension 0 dec float64 deg Declination 0 parallax float64 mas Parallax 2 Jobid: None Phase: COMPLETED Owner: None -Output file: sync_20201229114647.xml.gz +Output file: sync_20210315090602.xml.gz Results: None
    @@ -921,8 +935,8 @@ Results: None

    However, Phase: COMPLETED indicates that the job is complete, so we can get the results like this:

    -
    results1 = job1.get_results()
    -type(results1)
    +
    results = job.get_results()
    +type(results)
     
    @@ -932,8 +946,9 @@ Results: None
    +

    The type function indicates that the result is an Astropy Table.

    Optional detail: Why is table repeated three times? The first is the name of the module, the second is the name of the submodule, and the third is the name of the class. Most of the time we only care about the last one. It’s like the Linnean name for gorilla, which is Gorilla gorilla gorilla.

    -

    The result is an Astropy Table, which is similar to a table in an SQL database except:

    +

    An Astropy Table is similar to a table in an SQL database except:

    • SQL databases are stored on disk drives, so they are persistent; that is, they “survive” even if you turn off the computer. An Astropy Table is stored in memory; it disappears when you turn off the computer (or shut down this Jupyter notebook).

    • SQL databases are designed to process queries. An Astropy Table can perform some query-like operations, like selecting columns and rows. But these operations use Python syntax, not SQL.

    • @@ -941,37 +956,53 @@ Results: None

      Jupyter knows how to display the contents of a Table.

      -
      results1
      +
      results
       
      Table length=10 - - - - - - - - - - - - - - +
      source_idref_epochradecparallax
      yrdegdegmas
      int64float64float64float64float64
      67585097575941414402015.5290.4899010727352-30.343172184207830.48023816159705535
      67585086924379769602015.5290.64176055082714-30.289720131422252.2625971293368154
      67585270362508492802015.5290.35703708993327-30.193657826181596-0.2763960334229464
      67585644582982429442015.5290.77376294142846-29.7653684392252380.5907906528352993
      67585586128421557762015.5290.67550630386245-29.921133960781690.2858563565989917
      67585563794599219202015.5290.5454006212404-29.900709054816964-1.0012355835832834
      67585375202603851522015.5290.7341683169994-30.158181298418626--
      67585063559800106242015.5290.586950869878-30.371376421675690.3769870991981157
      67585616064339681282015.5290.82697136098506-29.75247697212053--
      67585641834144084482015.5290.7133096669958-29.7817436736796070.9376387942856869
      + + + + + + + + + + + + +
      source_idradecparallax
      degdegmas
      int64float64float64float64
      5887983246081387776227.978818386372-53.649969624501031.0493172163332998
      5887971250213117952228.32280834041364-53.662707262037260.29455652682279093
      5887991866047288704228.1582047014091-53.454724911639794-0.5789179941669236
      5887968673232040832228.07420888099884-53.80646128959610.41030970779603076
      5887979844465854720228.42547805195946-53.48882284470035-0.23379683441525864
      5887978607515442688228.23831627636855-53.56328249482688-0.9252161956789068
      5887978298278520704228.26015640396173-53.607284412896476--
      5887995581231772928228.12871598211902-53.373625663608316-0.3325818206439385
      5887982043490374016227.985260087594-53.6834444990555750.02878111976456593
      5887982971205433856227.89884570686218-53.67430215342567--

      Each column has a name, units, and a data type.

      -

      For example, the units of ra and dec are degrees, and their data type is float64, which is a 64-bit floating-point number, used to store measurements with a fraction part.

      +

      For example, the units of ra and dec are degrees, and their data type is float64, which is a 64-bit floating-point number, used to store measurements with a fraction part.

      This information comes from the Gaia database, and has been stored in the Astropy Table by Astroquery.

      Exercise

      -

      Read the documentation of this table and choose a column that looks interesting to you. Add the column name to the query and run it again. What are the units of the column you selected? What is its data type?

      +

      Read the documentation of this table and choose a column that looks interesting to you. Add the column name to the query and run it again. What are the units of the column you selected? What is its data type?

      # Solution
      +
      +# Let's add
      +#
      +# radial_velocity : Radial velocity (double, Velocity[km/s] )
      +#
      +# Spectroscopic radial velocity in the solar barycentric 
      +# reference frame.
      +#
      +# The radial velocity provided is the median value of the 
      +# radial velocity measurements at all epochs.
      +
      +query = """SELECT 
      +TOP 10
      +source_id, ra, dec, parallax, radial_velocity
      +FROM gaiadr2.gaia_source
      +"""
       
      @@ -981,18 +1012,20 @@ Results: None

      Asynchronous queries

      launch_job asks the server to run the job “synchronously”, which normally means it runs immediately. But synchronous jobs are limited to 2000 rows. For queries that return more rows, you should run “asynchronously”, which mean they might take longer to get started.

      -

      If you are not sure how many rows a query will return, you can use the SQL command COUNT to find out how many rows are in the result without actually returning them. We’ll see an example of this later.

      -

      The results of an asynchronous query are stored in a file on the server, so you can start a query and come back later to get the results.

      -

      For anonymous users, files are kept for three days.

      -

      As an example, let’s try a query that’s similar to query1, with two changes:

      +

      If you are not sure how many rows a query will return, you can use the SQL command COUNT to find out how many rows are in the result without actually returning them. We’ll see an example in the next lesson.

      +

      The results of an asynchronous query are stored in a file on the server, so you can start a query and come back later to get the results. +For anonymous users, files are kept for three days.

      +

      As an example, let’s try a query that’s similar to query1, with these changes:

      • It selects the first 3000 rows, so it is bigger than we should run synchronously.

      • +
      • It selects two additional columns, pmra and pmdec, which are proper motions along the axes of ra and dec.

      • It uses a new keyword, WHERE.

      -
      query2 = """SELECT TOP 3000
      -source_id, ref_epoch, ra, dec, parallax
      +
      query2 = """SELECT 
      +TOP 3000
      +source_id, ra, dec, pmra, pmdec, parallax
       FROM gaiadr2.gaia_source
       WHERE parallax < 1
       """
      @@ -1000,31 +1033,22 @@ Results: None
       
      -

      A WHERE clause indicates which rows we want; in this case, the query selects only rows “where” parallax is less than 1. This has the effect of selecting stars with relatively low parallax, which are farther away. We’ll use this clause to exclude nearby stars that are unlikely to be part of GD-1.

      -

      WHERE is one of the most common clauses in ADQL/SQL, and one of the most useful, because it allows us to select only the rows we need from the database.

      +

      A WHERE clause indicates which rows we want; in this case, the query selects only rows “where” parallax is less than 1. This has the effect of selecting stars with relatively low parallax, which are farther away. +We’ll use this clause to exclude nearby stars that are unlikely to be part of GD-1.

      +

      WHERE is one of the most common clauses in ADQL/SQL, and one of the most useful, because it allows us to download only the rows we need from the database.

      We use launch_job_async to submit an asynchronous query.

      -
      job2 = Gaia.launch_job_async(query2)
      -print(job2)
      +
      job = Gaia.launch_job_async(query)
      +job
       
      INFO: Query finished. [astroquery.utils.tap.core]
      -<Table length=3000>
      -   name    dtype  unit                            description                            
      ---------- ------- ---- ------------------------------------------------------------------
      -source_id   int64      Unique source identifier (unique within a particular Data Release)
      -ref_epoch float64   yr                                                    Reference epoch
      -       ra float64  deg                                                    Right ascension
      -      dec float64  deg                                                        Declination
      - parallax float64  mas                                                           Parallax
      -Jobid: 1609260407863O
      -Phase: COMPLETED
      -Owner: None
      -Output file: async_20201229114648.vot
      -Results: None
      +
      +
      +
      <astroquery.utils.tap.model.job.Job at 0x7f50edd40f40>
       
      @@ -1032,44 +1056,34 @@ Results: None

      And here are the results.

      -
      results2 = job2.get_results()
      -results2
      +
      results = job.get_results()
      +results
       
      -
      Table length=3000 - - - +
      Table length=10 +
      source_idref_epochradecparallax
      yrdegdegmas
      + + - - - - - - - - - - - - - - - - - - - - + + + + + + + + + +
      source_idradecparallaxradial_velocity
      degdegmaskm / s
      int64float64float64float64float64
      67585097575941414402015.5290.4899010727352-30.343172184207830.48023816159705535
      67585270362508492802015.5290.35703708993327-30.193657826181596-0.2763960334229464
      67585644582982429442015.5290.77376294142846-29.7653684392252380.5907906528352993
      67585586128421557762015.5290.67550630386245-29.921133960781690.2858563565989917
      67585563794599219202015.5290.5454006212404-29.900709054816964-1.0012355835832834
      67585063559800106242015.5290.586950869878-30.371376421675690.3769870991981157
      67585641834144084482015.5290.7133096669958-29.7817436736796070.9376387942856869
      67585297893226531842015.5290.35449465190385-30.10231481465004-1.3305133038319952
      67585074855553360642015.5290.6745961953205-30.359651480610280.057302711920868686
      ...............
      46604648940382954242015.579.74509945051724-66.496560127377590.1631257821260955
      46604635196504801282015.580.06150680482719-66.42891601595986-0.5822669736733328
      46604876745540710402015.580.03536662790846-66.33136792885134-1.4978065466579966
      46604947183008867842015.579.6181997343967-66.174723353182120.1488652133861382
      46604764732741593602015.579.12508291180023-66.356448447572750.10671540523101529
      46604710100939402242015.579.1908378165463-66.489168402940960.8152150079365807
      46604983948071677442015.580.23431532382547-66.08688681815092-0.14060596426163252
      46604987383917228802015.580.20205898946918-66.043923132951260.1894642263668475
      46604740680910969602015.578.93686107060341-66.457955767458960.4665513634059366
      46604373031626977282015.580.63872120225027-66.46144560233674-1.1319163320535979
      5895270396817359872213.08433715252883-56.641047010056942.041947005434917--
      5895272561481374080213.2606587905109-56.550444015357150.15693467895110133--
      5895247410183786368213.38479712976664-56.97008551185148-0.19017525742552605--
      5895249226912448000213.41587389088238-56.849596577635786----
      5895261875598904576213.5508930114549-56.61037780154348-0.29471722363529257--
      5895258302187834624213.87631129557286-56.6785372590399060.6468437015289753--
      5895247444506644992213.33215109206796-56.9753477593809950.390215490234287--
      5895259470417635968213.78815034206346-56.645850474515940.953377710788918--
      5895264899260932352213.21521027193236-56.78420864489118----
      5895265925746051584213.17082359534547-56.745408851077540.2986918215101751--
      -

      You might notice that some values of parallax are negative. As this FAQ explains, “Negative parallaxes are caused by errors in the observations.” Negative parallaxes have “no physical meaning,” but they can be a “useful diagnostic on the quality of the astrometric solution.”

      +

      You might notice that some values of parallax are negative. As this FAQ explains, “Negative parallaxes are caused by errors in the observations.” They have “no physical meaning,” but they can be a “useful diagnostic on the quality of the astrometric solution.”

      Exercise

      -

      The clauses in a query have to be in the right order. Go back and change the order of the clauses in query2 and run it again.

      -

      The query should fail, but notice that you don’t get much useful debugging information.

      +

      The clauses in a query have to be in the right order. Go back and change the order of the clauses in query2 and run it again. +The modified query should fail, but notice that you don’t get much useful debugging information.

      For this reason, developing and debugging ADQL queries can be really hard. A few suggestions that might help:

      • Whenever possible, start with a working query, either an example you find online or a query you have used in the past.

      • @@ -1080,6 +1094,15 @@ Results: None
        # Solution
        +
        +# In this example, the WHERE clause is in the wrong place
        +
        +query = """SELECT 
        +TOP 3000
        +WHERE parallax < 1
        +source_id, ref_epoch, ra, dec, parallax
        +FROM gaiadr2.gaia_source
        +"""
         
        @@ -1127,14 +1150,14 @@ Be careful to keep your Python out of your ADQL!

        Exercise

        Read about SQL operators here and then modify the previous query to select rows where bp_rp is between -0.75 and 2.

        -

        You can read about this variable here.

        # Solution
         
         # Here's a solution using > and < operators
         
        -query = """SELECT TOP 10
        +query = """SELECT 
        +TOP 10
         source_id, ref_epoch, ra, dec, parallax
         FROM gaiadr2.gaia_source
         WHERE parallax < 1 
        @@ -1143,7 +1166,8 @@ Be careful to keep your Python out of your ADQL!

        # And here's a solution using the BETWEEN operator -query = """SELECT TOP 10 +query = """SELECT +TOP 10 source_id, ref_epoch, ra, dec, parallax FROM gaiadr2.gaia_source WHERE parallax < 1 @@ -1153,51 +1177,23 @@ Be careful to keep your Python out of your ADQL!

        +

        bp_rp contains BP-RP color, which is the difference between two other columns, phot_bp_mean_mag and phot_rp_mean_mag. +You can read about this variable here.

        This Hertzsprung-Russell diagram shows the BP-RP color and luminosity of stars in the Gaia catalog (Copyright: ESA/Gaia/DPAC, CC BY-SA 3.0 IGO).

        https://github.com/AllenDowney/AstronomicalData/raw/main/images/1567214809100-ESA_Gaia_DR2_HRD_Gaia_625.jpg

        Selecting stars with bp-rp less than 2 excludes many class M dwarf stars, which are low temperature, low luminosity. A star like that at GD-1’s distance would be hard to detect, so if it is detected, it it more likely to be in the foreground.

        -
        -

        Cleaning up

        -

        Asynchronous jobs have a jobid.

        -
        -
        -
        job1.jobid, job2.jobid
        -
        -
        -
        -
        -
        (None, '1609260407863O')
        -
        -
        -
        -
        -

        Which you can use to remove the job from the server.

        -
        -
        -
        Gaia.remove_jobs([job2.jobid])
        -
        -
        -
        -
        -
        Removed jobs: '['1609260407863O']'.
        -
        -
        -
        -
        -

        If you don’t remove it job from the server, it will be removed eventually, so don’t feel too bad if you don’t clean up after yourself.

        -

        Formatting queries

        -

        So far the queries have been string “literals”, meaning that the entire string is part of the program. +

        The queries we have written so far are string “literals”, meaning that the entire string is part of the program. But writing queries yourself can be slow, repetitive, and error-prone.

        -

        It is often a good idea to write Python code that assembles a query for you. One useful tool for that is the string format method.

        +

        It is often better to write Python code that assembles a query for you. One useful tool for that is the string format method.

        As an example, we’ll divide the previous query into two parts; a list of column names and a “base” for the query that contains everything except the column names.

        Here’s the list of columns we’ll select.

        -
        columns = 'source_id, ra, dec, pmra, pmdec, parallax, radial_velocity'
        +
        columns = 'source_id, ra, dec, pmra, pmdec, parallax'
         
        @@ -1205,7 +1201,8 @@ But writing queries yourself can be slow, repetitive, and error-prone.

        And here’s the base; it’s a string that contains at least one format specifier in curly brackets (braces).

        -
        query3_base = """SELECT TOP 10 
        +
        query3_base = """SELECT 
        +TOP 10 
         {columns}
         FROM gaiadr2.gaia_source
         WHERE parallax < 1
        @@ -1224,6 +1221,8 @@ But writing queries yourself can be slow, repetitive, and error-prone.

        +

        In this example, the variable that contains the column names and the variable in the format specifier have the same name. +That’s not required, but it is a common style.

        The result is a string with line breaks. If you display it, the line breaks appear as \n.

        @@ -1232,7 +1231,7 @@ But writing queries yourself can be slow, repetitive, and error-prone.

        -
        'SELECT TOP 10 \nsource_id, ra, dec, pmra, pmdec, parallax, radial_velocity\nFROM gaiadr2.gaia_source\nWHERE parallax < 1\n  AND bp_rp BETWEEN -0.75 AND 2\n'
        +
        'SELECT \nTOP 10 \nsource_id, ra, dec, pmra, pmdec\nFROM gaiadr2.gaia_source\nWHERE parallax < 1\n  AND bp_rp BETWEEN -0.75 AND 2\n'
         
        @@ -1245,8 +1244,9 @@ But writing queries yourself can be slow, repetitive, and error-prone.

        -
        SELECT TOP 10 
        -source_id, ra, dec, pmra, pmdec, parallax, radial_velocity
        +
        SELECT 
        +TOP 10 
        +source_id, ra, dec, pmra, pmdec
         FROM gaiadr2.gaia_source
         WHERE parallax < 1
           AND bp_rp BETWEEN -0.75 AND 2
        @@ -1258,26 +1258,24 @@ WHERE parallax < 1
         

        Let’s run it and see if it works:

        -
        job3 = Gaia.launch_job(query3)
        -print(job3)
        +
        job = Gaia.launch_job(query3)
        +print(job)
         
        <Table length=10>
        -      name       dtype    unit                              description                             n_bad
        ---------------- ------- -------- ------------------------------------------------------------------ -----
        -      source_id   int64          Unique source identifier (unique within a particular Data Release)     0
        -             ra float64      deg                                                    Right ascension     0
        -            dec float64      deg                                                        Declination     0
        -           pmra float64 mas / yr                         Proper motion in right ascension direction     0
        -          pmdec float64 mas / yr                             Proper motion in declination direction     0
        -       parallax float64      mas                                                           Parallax     0
        -radial_velocity float64   km / s                                                    Radial velocity    10
        +   name    dtype    unit                              description                            
        +--------- ------- -------- ------------------------------------------------------------------
        +source_id   int64          Unique source identifier (unique within a particular Data Release)
        +       ra float64      deg                                                    Right ascension
        +      dec float64      deg                                                        Declination
        +     pmra float64 mas / yr                         Proper motion in right ascension direction
        +    pmdec float64 mas / yr                             Proper motion in declination direction
         Jobid: None
         Phase: COMPLETED
         Owner: None
        -Output file: sync_20201229114653.xml.gz
        +Output file: sync_20210315091929.xml.gz
         Results: None
         
        @@ -1285,27 +1283,27 @@ Results: None
        -
        results3 = job3.get_results()
        -results3
        +
        results = job.get_results()
        +results
         
        Table length=10 - - - - - - - - - - - - - - +
        source_idradecpmrapmdecparallaxradial_velocity
        degdegmas / yrmas / yrmaskm / s
        int64float64float64float64float64float64float64
        466046637150777484879.49100199261952-66.414117776726682.6073927760139983-1.1968907801968522-0.13534455558687877--
        466049870403198400080.19346436358076-66.049076752328561.644062563874402-1.954975321006136-0.1774586376397--
        466045894982481702479.68666778641992-66.522209596625571.47776972758805810.3328961930362448-0.030149510331454386--
        466045197477392652879.89972073493868-66.750378580711912.4635048563021065-0.7474574729840501-0.005219591141134416--
        466047403373135705678.93214036467484-66.464836445916671.2369118132455594-1.71871430346386770.6018032937392243--
        466045410507787468879.91045649235578-66.681068461359711.47478554075336770.022464361420165786-0.19949109843083218--
        466049485575374054479.63101881952036-66.150633158135361.62278632632221130.204908932619051520.0013030724292998944--
        466043953657130816080.24817961417894-66.463032706640921.8652105429518493-0.5618584442373111-0.10578442360531497--
        466047018544046656079.25942306061864-66.511207775496053.3735611720039890.25326243152876704-0.6932631338638376--
        466044146069036556880.38862155241748-66.344044941350461.6588478675146066-0.09646951910977163-0.2100859303087445--
        + + + + + + + + + + + + +
        source_idradecpmrapmdec
        degdegmas / yrmas / yr
        int64float64float64float64float64
        5895272561481374080213.2606587905109-56.550444015357150.38944388983017151.2299266281737415
        5895261875598904576213.5508930114549-56.610377801543480.16203641364393007-4.672602679543312
        5895247444506644992213.33215109206796-56.975347759380995-7.474003156859284-3.538080792097856
        5895259470417635968213.78815034206346-56.64585047451594-5.287202255231853-0.8163762113468646
        5895265925746051584213.17082359534547-56.74540885107754-7.880749306158471-4.8585444120179595
        5895260913525974528213.66936020541976-56.66655190442016-4.7820929042428215-1.5566420086447643
        5895264212062283008213.7755742121852-56.51570859067397-6.657690998559842-1.7616494482071872
        5895253457497979136213.30929960610283-56.78849448744587-5.242106718924749-0.18655636353898095
        4143614130253524096269.1749117455479-18.534151399721172.61642745108048261.3244248889980894
        4065443904433108992273.26868565443743-24.421651815402857-1.663096652191022-2.6514745376067683

        Good so far.

        @@ -1317,25 +1315,27 @@ Results: None
        # Solution
         
        -query4_base = """SELECT TOP 10
        +query_base = """SELECT 
        +TOP 10
         {columns}
         FROM gaiadr2.gaia_source
         WHERE parallax < {max_parallax} AND 
         bp_rp BETWEEN -0.75 AND 2
         """
         
        -query4 = query4_base.format(columns=columns,
        -                            max_parallax=0.5)
        +query = query_base.format(columns=columns,
        +                          max_parallax=0.5)
         print(query)
         
        -
        SELECT TOP 10
        -source_id, ref_epoch, ra, dec, parallax
        +
        SELECT 
        +TOP 10
        +source_id, ra, dec, pmra, pmdec
         FROM gaiadr2.gaia_source
        -WHERE parallax < 1 
        -  AND bp_rp BETWEEN -0.75 AND 2
        +WHERE parallax < 0.5 AND 
        +bp_rp BETWEEN -0.75 AND 2
         
        @@ -1360,22 +1360,16 @@ WHERE parallax < 1
      • Read the metadata and the documentation to make sure you understand the tables, their columns, and what they mean.

      • Develop queries incrementally: start with something simple, test it, and add a little bit at a time.

      • Use ADQL features like TOP and COUNT to test before you run a query that might return a lot of data.

      • -
      • If you know your query will return fewer than 3000 rows, you can run it synchronously, which might complete faster (but it doesn’t seem to make much difference). If it might return more than 3000 rows, you should run it asynchronously.

      • +
      • If you know your query will return fewer than 2000 rows, you can run it synchronously, which might complete faster. If it might return more than 2000 rows, you should run it asynchronously.

      • ADQL and SQL are not case-sensitive, so you don’t have to capitalize the keywords, but you should.

      • ADQL and SQL don’t require you to break a query into multiple lines, but you should.

      Jupyter notebooks can be good for developing and testing code, but they have some drawbacks. In particular, if you run the cells out of order, you might find that variables don’t have the values you expect.

      -

      There are a few things you can do to mitigate these problems:

      +

      To mitigate these problems:

      • Make each section of the notebook self-contained. Try not to use the same variable name in more than one section.

      • Keep notebooks short. Look for places where you can break your analysis into phases with one notebook per phase.

      -
      -

      One of the other tables we’ll use is -gaiadr2.panstarrs1_original_valid. Use load_table to get the -metadata for this table. How many columns are there and what are -their names?

      -
      @@ -1408,7 +1402,7 @@ their names?

      -
      -

      Selecting a region

      -

      One of the most common ways to restrict a query is to select stars in a particular region of the sky.

      -

      For example, here’s a query from the Gaia archive documentation that selects “all the objects … in a circular region centered at (266.41683, -29.00781) with a search radius of 5 arcmin (0.08333 deg).”

      -
      -
      -
      query = """
      -SELECT 
      -TOP 10 source_id
      -FROM gaiadr2.gaia_source
      -WHERE 1=CONTAINS(
      -  POINT(ra, dec),
      -  CIRCLE(266.41683, -29.00781, 0.08333333))
      -"""
      -
      -
      -
      -
      -

      This query uses three keywords that are specific to ADQL (not SQL):

      -
        -
      • POINT: a location in ICRS coordinates, specified in degrees of right ascension and declination.

      • -
      • CIRCLE: a circle where the first two values are the coordinates of the center and the third is the radius in degrees.

      • -
      • CONTAINS: a function that returns 1 if a POINT is contained in a shape and 0 otherwise.

      • -
      -

      Here is the documentation of CONTAINS.

      -

      A query like this is called a cone search because it selects stars in a cone.

      -

      Here’s how we run it.

      -
      -
      -
      from astroquery.gaia import Gaia
      -
      -job = Gaia.launch_job(query)
      -job
      -
      -
      -
      -
      -
      Created TAP+ (v1.2.1) - Connection:
      -	Host: gea.esac.esa.int
      -	Use HTTPS: True
      -	Port: 443
      -	SSL Port: 443
      -Created TAP+ (v1.2.1) - Connection:
      -	Host: geadata.esac.esa.int
      -	Use HTTPS: True
      -	Port: 443
      -	SSL Port: 443
      -
      -
      -
      <astroquery.utils.tap.model.job.Job at 0x7f59bd93e490>
      -
      -
      -
      -
      -
      -
      -
      result = job.get_results()
      -result
      -
      -
      -
      -
      -
      Table length=10 - - - - - - - - - - - - - -
      source_id
      int64
      4057468321929794432
      4057468287575835392
      4057482027171038976
      4057470349160630656
      4057470039924301696
      4057469868125641984
      4057468351995073024
      4057469661959554560
      4057470520960672640
      4057470555320409600
      -
      -
      -

      Exercise

      -

      When you are debugging queries like this, you can use TOP to limit the size of the results, but then you still don’t know how big the results will be.

      -

      An alternative is to use COUNT, which asks for the number of rows that would be selected, but it does not return them.

      -

      In the previous query, replace TOP 10 source_id with COUNT(source_id) and run the query again. How many stars has Gaia identified in the cone we searched?

      -
      -
      -
      # Solution
      -
      -query = """
      -SELECT 
      -COUNT(source_id)
      -FROM gaiadr2.gaia_source
      -WHERE 1=CONTAINS(
      -  POINT(ra, dec),
      -  CIRCLE(266.41683, -29.00781, 0.08333333))
      -"""
      -
      -
      -
      -
      -
      -
      -
      -

      Getting GD-1 Data

      -

      From the Price-Whelan and Bonaca paper, we will try to reproduce Figure 1, which includes this representation of stars likely to belong to GD-1:

      -https://github.com/datacarpentry/astronomy-python/raw/gh-pages/fig/gd1-4.png -

      The axes of this figure are defined so the x-axis is aligned with the stars in GD-1, and the y-axis is perpendicular.

      -
        -
      • Along the x-axis (\(\phi_1\)) the figure extends from -100 to 20 degrees.

      • -
      • Along the y-axis (\(\phi_2\)) the figure extends from about -8 to 4 degrees.

      • -
      -

      Ideally, we would select all stars from this rectangle, but there are more than 10 million of them, so

      -
        -
      • That would be difficult to work with,

      • -
      • As anonymous Gaia users, we are limited to 3 million rows in a single query, and

      • -
      • While we are developing and testing code, it will be faster to work with a smaller dataset.

      • -
      -

      So we’ll start by selecting stars in a smaller rectangle near the center of GD-1, from -55 to -45 degrees \(\phi_1\) and -8 to 4 degrees \(\phi_2\).

      -

      But first we let’s see how to represent quantities with units like degrees.

      -
      -
      -

      Working with coordinates

      -

      Coordinates are physical quantities, which means that they have two parts, a value and a unit.

      -

      For example, the coordinate \(30^{\circ}\) has value 30 and its units are degrees.

      +
      +

      Working with Units

      +

      The measurements we will work with are physical quantities, which means that they have two parts, a value and a unit. +For example, the coordinate \(30^{\circ}\) has value 30 and its units are degrees.

      Until recently, most scientific computation was done with values only; units were left out of the program altogether, often with catastrophic results.

      Astropy provides tools for including units explicitly in computations, which makes it possible to detect errors before they cause disasters.

      To use Astropy units, we import them like this:

      @@ -1487,8 +1378,8 @@ Created TAP+ (v1.2.1) - Connection:

      To create a quantity, we multiply a value by a unit.

      -
      quantity = 30 * u.degree
      -type(quantity)
      +
      angle = 10 * u.degree
      +type(angle)
       
      @@ -1498,24 +1389,210 @@ Created TAP+ (v1.2.1) - Connection:
      -

      The result is a Quantity object.

      -

      Jupyter knows how to display Quantities like this:

      +

      The result is a Quantity object. +Jupyter knows how to display Quantities like this:

      -
      quantity
      +
      angle
       
      -\[30 \; \mathrm{{}^{\circ}}\]
      +\[10 \; \mathrm{{}^{\circ}}\]
      +

      Quantities provide a method called to that converts to other units. For example, we can compute the number of arcminutes in angle:

      +
      +
      +
      angle_arcmin = angle.to(u.arcmin)
      +angle_arcmin
      +
      +
      +
      +
      +
      +\[600 \; \mathrm{{}^{\prime}}\]
      +
      +
      +

      If you add quantities, Astropy converts them to compatible units, if possible:

      +
      +
      +
      angle + 30 * u.arcmin
      +
      +
      +
      +
      +
      +\[10.5 \; \mathrm{{}^{\circ}}\]
      +
      +
      +

      If the units are not compatible, you get an error. +For example:

      +
      angle + 5 * u.second
      +
      +
      +

      causes a UnitConversionError.

      +
      +

      Exercise

      +

      Create a quantity that represents 5 arcminutes and assign it to a variable called radius.

      +

      Then convert it to degrees.

      +
      +
      +
      ## Solution
      +
      +radius = 5 * u.arcmin
      +print(radius)
      +
      +radius.to(u.degree)
      +
      +
      +
      +
      +
      5.0 arcmin
      +
      +
      +
      +\[0.083333333 \; \mathrm{{}^{\circ}}\]
      +
      +
      +
      +
      +
      +

      Selecting a Region

      +

      One of the most common ways to restrict a query is to select stars in a particular region of the sky. +For example, here’s a query from the Gaia archive documentation that selects objects in a circular region centered at (88.8, 7.4) with a search radius of 5 arcmin (0.08333 deg).

      +
      +
      +
      query_cone = """SELECT 
      +TOP 10 
      +source_id
      +FROM gaiadr2.gaia_source
      +WHERE 1=CONTAINS(
      +  POINT(ra, dec),
      +  CIRCLE(88.8, 7.4, 0.08333333))
      +"""
      +
      +
      +
      +
      +

      This query uses three keywords that are specific to ADQL (not SQL):

      +
        +
      • POINT: a location in ICRS coordinates, specified in degrees of right ascension and declination.

      • +
      • CIRCLE: a circle where the first two values are the coordinates of the center and the third is the radius in degrees.

      • +
      • CONTAINS: a function that returns 1 if a POINT is contained in a shape and 0 otherwise.

      • +
      +

      Here is the documentation of CONTAINS.

      +

      A query like this is called a cone search because it selects stars in a cone. +Here’s how we run it.

      +
      +
      +
      from astroquery.gaia import Gaia
      +
      +job = Gaia.launch_job(query_cone)
      +job
      +
      +
      +
      +
      +
      Created TAP+ (v1.2.1) - Connection:
      +	Host: gea.esac.esa.int
      +	Use HTTPS: True
      +	Port: 443
      +	SSL Port: 443
      +Created TAP+ (v1.2.1) - Connection:
      +	Host: geadata.esac.esa.int
      +	Use HTTPS: True
      +	Port: 443
      +	SSL Port: 443
      +
      +
      +
      <astroquery.utils.tap.model.job.Job at 0x7f277785fa30>
      +
      +
      +
      +
      +
      +
      +
      results = job.get_results()
      +results
      +
      +
      +
      +
      +
      Table length=10 + + + + + + + + + + + + + +
      source_id
      int64
      3322773965056065536
      3322773758899157120
      3322774068134271104
      3322773930696320512
      3322774377374425728
      3322773724537891456
      3322773724537891328
      3322773930696321792
      3322773724537890944
      3322773930696322176
      +
      +
      +

      Exercise

      +

      When you are debugging queries like this, you can use TOP to limit the size of the results, but then you still don’t know how big the results will be.

      +

      An alternative is to use COUNT, which asks for the number of rows that would be selected, but it does not return them.

      +

      In the previous query, replace TOP 10 source_id with COUNT(source_id) and run the query again. How many stars has Gaia identified in the cone we searched?

      +
      +
      +
      # Solution
      +
      +query = """SELECT 
      +COUNT(source_id)
      +FROM gaiadr2.gaia_source
      +WHERE 1=CONTAINS(
      +  POINT(ra, dec),
      +  CIRCLE(88.8, 7.4, 0.08333333))
      +"""
      +
      +job = Gaia.launch_job(query)
      +results = job.get_results()
      +results
      +
      +
      +
      +
      +
      Table length=1 + + + + +
      count
      int64
      594
      +
      +
      +
      +
      +

      Getting GD-1 Data

      +

      From the Price-Whelan and Bonaca paper, we will try to reproduce Figure 1, which includes this representation of stars likely to belong to GD-1:

      +https://github.com/datacarpentry/astronomy-python/raw/gh-pages/fig/gd1-4.png +

      The axes of this figure are defined so the x-axis is aligned with the stars in GD-1, and the y-axis is perpendicular.

      +
        +
      • Along the x-axis (\(\phi_1\)) the figure extends from -100 to 20 degrees.

      • +
      • Along the y-axis (\(\phi_2\)) the figure extends from about -8 to 4 degrees.

      • +
      +

      Ideally, we would select all stars from this rectangle, but there are more than 10 million of them, so

      +
        +
      • That would be difficult to work with,

      • +
      • As anonymous Gaia users, we are limited to 3 million rows in a single query, and

      • +
      • While we are developing and testing code, it will be faster to work with a smaller dataset.

      • +
      +

      So we’ll start by selecting stars in a smaller rectangle near the center of GD-1, from -55 to -45 degrees \(\phi_1\) and -8 to 4 degrees \(\phi_2\).

      +

      But first we let’s see how to represent these coordinates with Astropy.

      Transforming coordinates

      Astropy provides a SkyCoord object that represents sky coordinates relative to a specified frame.

      The following example creates a SkyCoord object that represents the approximate coordinates of Betelgeuse (alf Ori) in the ICRS frame.

      +

      ICRS is the +“International Celestial Reference System”, adopted in 1997 by the International Astronomical Union.

      from astropy.coordinates import SkyCoord
      @@ -1551,8 +1628,9 @@ For example, we can transform 
       
      +

      Notice that in the Galactic frame, the coordinates are called l and b, not ra and dec.

      To transform to and from GD-1 coordinates, we’ll use a frame defined by Gala, which is an Astropy-affiliated library that provides tools for galactic dynamics.

      -

      Gala provides GD1Koposov10, which is “a Heliocentric spherical coordinate system defined by the orbit of the GD-1 stream

      +

      Gala provides GD1Koposov10, which is “a Heliocentric spherical coordinate system defined by the orbit of the GD-1 stream”.

      from gala.coordinates import GD1Koposov10
      @@ -1583,8 +1661,10 @@ For example, we can transform 
       
      -
      -

      Exercise

      +

      Notice that the coordinates are called phi1 and phi2. +These are the coordinates shown in the figure from the paper, above.

      +
      +

      Exercise

      Let’s find the location of GD-1 in ICRS coordinates.

      1. Create a SkyCoord object at 0°, 0° in the GD-1 frame.

      2. @@ -1595,16 +1675,22 @@ For example, we can transform
        # Solution
         
        -coord_gd1 = SkyCoord(0*u.degree, 0*u.degree, frame=gd1_frame)
        +origin_gd1 = SkyCoord(0*u.degree, 0*u.degree, frame=gd1_frame)
        +
        +# OR
        +
        +origin_gd1 = SkyCoord(phi1=0*u.degree, 
        +                      phi2=0*u.degree, 
        +                      frame=gd1_frame)
         
         # Note: because ICRS is built into Astropy, 
        -# we can identify it by name
        -coord_gd1.transform_to('icrs')
        +# we can identify it by string name
        +origin_gd1.transform_to('icrs')
         
         # More formally, we could instantiate it
         from astropy.coordinates import ICRS
         icrs_frame = ICRS()
        -coord_gd1.transform_to(icrs_frame)
        +origin_gd1.transform_to(icrs_frame)
         
      @@ -1615,6 +1701,7 @@ For example, we can transform
      +

      Notice that the origin of the GD-1 frame maps to ra=200, exactly, in ICRS. That’s by design.

      @@ -1631,7 +1718,7 @@ For example, we can transform
      -

      To represent a rectangle, we’ll use two lists of coordinates and multiply by their units.

      +

      To create a rectangle, we’ll use the following function, which takes the lower and upper bounds as parameters.

      def make_rectangle(x1, x2, y1, y2):
      @@ -1643,6 +1730,7 @@ For example, we can transform 
       
      +

      The return value is a tuple containing a list of coordinates in phi1 followed by a list of coordinates in phi2.

      phi1_rect, phi2_rect = make_rectangle(
      @@ -1651,13 +1739,11 @@ For example, we can transform 
       
      -

      phi1_rect and phi2_rect represent the coordinates of the corners of a rectangle in the GD-1 frame.

      -

      In order to use them in a Gaia query, we have to convert them to ICRS.

      +

      phi1_rect and phi2_rect contains the coordinates of the corners of a rectangle in the GD-1 frame.

      +

      In order to use them in a Gaia query, we have to convert them to ICRS. First we’ll put them into a SkyCoord object.

      -
      import gala.coordinates as gc
      -
      -corners = SkyCoord(phi1=phi1_rect, phi2=phi2_rect, frame=gd1_frame)
      +
      corners = SkyCoord(phi1=phi1_rect, phi2=phi2_rect, frame=gd1_frame)
       corners
       
      @@ -1672,9 +1758,7 @@ For example, we can transform Now we can use transform_to to convert to ICRS coordinates.

      -
      import astropy.coordinates as coord
      -
      -corners_icrs = corners.transform_to('icrs')
      +
      corners_icrs = corners.transform_to('icrs')
       corners_icrs
       
      @@ -1688,10 +1772,10 @@ For example, we can transform
      -

      Notice that a rectangle in one coordinate system is not necessarily a rectangle in another. In this example, the result is a polygon.

      +

      Notice that a rectangle in one coordinate system is not necessarily a rectangle in another. In this example, the result is a (non-rectangular) polygon.

      -
      -

      Selecting a polygon

      +
      +

      Defining a polygon

      In order to use this polygon as part of an ADQL query, we have to convert it to a string with a comma-separated list of coordinates, as in this example:

      """
       POLYGON(143.65, 20.98, 
      @@ -1701,7 +1785,52 @@ For example, we can transform """
       
      -

      The following function does the job:

      +

      SkyCoord provides to_string, which produces a list of strings.

      +
      +
      +
      t = corners_icrs.to_string()
      +t
      +
      +
      +
      +
      +
      ['146.275 19.2619',
      + '135.422 25.8774',
      + '141.603 34.3048',
      + '152.817 27.1361',
      + '146.275 19.2619']
      +
      +
      +
      +
      +

      We can use the Python string function join to join t into a single string (with spaces between the pairs):

      +
      +
      +
      s = ' '.join(t)
      +s
      +
      +
      +
      +
      +
      '146.275 19.2619 135.422 25.8774 141.603 34.3048 152.817 27.1361 146.275 19.2619'
      +
      +
      +
      +
      +

      That’s almost what we need, but we have to replace the spaces with commas.

      +
      +
      +
      s.replace(' ', ', ')
      +
      +
      +
      +
      +
      '146.275, 19.2619, 135.422, 25.8774, 141.603, 34.3048, 152.817, 27.1361, 146.275, 19.2619'
      +
      +
      +
      +
      +

      The following function combines these steps.

      def skycoord_to_string(skycoord):
      @@ -1713,9 +1842,7 @@ For example, we can transform 
       
      -

      SkyCoord provides to_string, which returns a list of strings.

      -

      We use join to make a single string with spaces between the coordinates.

      -

      Then we use replace to add commas between the coordinates.

      +

      Here’s how we use it.

      point_list = skycoord_to_string(corners_icrs)
      @@ -1729,18 +1856,38 @@ For example, we can transform 
       
      -

      Before we can assemble the query, we need columns again (as we saw in the previous notebook).

      +
      +
      +

      Assembling the query

      +

      Now we’re ready to assemble the query. +We need columns again (as we saw in the previous lesson).

      -
      columns = 'source_id, ra, dec, pmra, pmdec, parallax, parallax_error, radial_velocity'
      +
      columns = 'source_id, ra, dec, pmra, pmdec, parallax'
       
      -

      Here’s the base for the query, with format specifiers for columns and point_list.

      +

      And here’s the query base we used in the previous lesson:

      -
      query_base = """SELECT {columns}
      +
      query3_base = """SELECT 
      +TOP 10 
      +{columns}
      +FROM gaiadr2.gaia_source
      +WHERE parallax < 1
      +  AND bp_rp BETWEEN -0.75 AND 2
      +"""
      +
      +
      +
      +
      +

      Now we’ll add a WHERE clause to select stars in the polygon we defined.

      +
      +
      +
      query4_base = """SELECT
      +TOP 10
      +{columns}
       FROM gaiadr2.gaia_source
       WHERE parallax < 1
         AND bp_rp BETWEEN -0.75 AND 2 
      @@ -1751,17 +1898,20 @@ For example, we can transform 
       
      -

      And here’s the result:

      +

      The query base contains format specifiers for columns and point_list.

      +

      We’ll use format to fill in these values.

      -
      query = query_base.format(columns=columns, 
      +
      query4 = query4_base.format(columns=columns, 
                                 point_list=point_list)
      -print(query)
      +print(query4)
       
      -
      SELECT source_id, ra, dec, pmra, pmdec, parallax, parallax_error, radial_velocity
      +
      SELECT
      +TOP 10
      +source_id, ra, dec, pmra, pmdec
       FROM gaiadr2.gaia_source
       WHERE parallax < 1
         AND bp_rp BETWEEN -0.75 AND 2 
      @@ -1772,10 +1922,97 @@ WHERE parallax < 1
       

      As always, we should take a minute to proof-read the query before we launch it.

      -

      The result will be bigger than our previous queries, so it will take a little longer.

      -
      job = Gaia.launch_job_async(query)
      +
      job = Gaia.launch_job_async(query4)
      +print(job)
      +
      +
      +
      +
      +
      INFO: Query finished. [astroquery.utils.tap.core]
      +<Table length=10>
      +   name    dtype    unit                              description                            
      +--------- ------- -------- ------------------------------------------------------------------
      +source_id   int64          Unique source identifier (unique within a particular Data Release)
      +       ra float64      deg                                                    Right ascension
      +      dec float64      deg                                                        Declination
      +     pmra float64 mas / yr                         Proper motion in right ascension direction
      +    pmdec float64 mas / yr                             Proper motion in declination direction
      +Jobid: 1615815873808O
      +Phase: COMPLETED
      +Owner: None
      +Output file: async_20210315094433.vot
      +Results: None
      +
      +
      +
      +
      +

      Here are the results.

      +
      +
      +
      results = job.get_results()
      +results
      +
      +
      +
      +
      +
      Table length=10 + + + + + + + + + + + + + + +
      source_idradecpmrapmdec
      degdegmas / yrmas / yr
      int64float64float64float64float64
      637987125186749568142.4830193599102321.75771616932985-2.51683846838757662.941813096629439
      638285195917112960142.2545294134634422.4761681711413782.6627020143457996-12.165984395577347
      638073505568978688142.6452855746807422.1669322495307818.30674739454163-7.950659620550862
      638086386175786752142.5773943092603422.227919514013650.9877856720147953-2.584105480335548
      638049655615392384142.5891356447861822.1107831666774180.24443878227817095-4.941079187010136
      638267565075964032141.8176222899961422.375696125322275-3.4131745896607961.8838892877285924
      638028902333511168143.1833980131767722.25124658123697.848511762712128-21.391145547787154
      638085767700610432142.934731946458922.46244080823965-3.6585960944321476-12.486419770278376
      638299863230178304142.2676974582326722.640183776884836-1.81683708922182971.0537342990941316
      637973067758974208142.8955129286901221.612824100339875-8.645166256559063-44.41164172204947
      +
      +

      Finally, we can remove TOP 10 run the query again.

      +

      The result is bigger than our previous queries, so it will take a little longer.

      +
      +
      +
      query5_base = """SELECT
      +{columns}
      +FROM gaiadr2.gaia_source
      +WHERE parallax < 1
      +  AND bp_rp BETWEEN -0.75 AND 2 
      +  AND 1 = CONTAINS(POINT(ra, dec), 
      +                   POLYGON({point_list}))
      +"""
      +
      +
      +
      +
      +
      +
      +
      query5 = query5_base.format(columns=columns, 
      +                          point_list=point_list)
      +print(query5)
      +
      +
      +
      +
      +
      SELECT
      +source_id, ra, dec, pmra, pmdec
      +FROM gaiadr2.gaia_source
      +WHERE parallax < 1
      +  AND bp_rp BETWEEN -0.75 AND 2 
      +  AND 1 = CONTAINS(POINT(ra, dec), 
      +                   POLYGON(146.275, 19.2619, 135.422, 25.8774, 141.603, 34.3048, 152.817, 27.1361, 146.275, 19.2619))
      +
      +
      +
      +
      +
      +
      +
      job = Gaia.launch_job_async(query5)
       print(job)
       
      @@ -1783,26 +2020,22 @@ WHERE parallax < 1
      INFO: Query finished. [astroquery.utils.tap.core]
       <Table length=140339>
      -      name       dtype    unit                              description                             n_bad 
      ---------------- ------- -------- ------------------------------------------------------------------ ------
      -      source_id   int64          Unique source identifier (unique within a particular Data Release)      0
      -             ra float64      deg                                                    Right ascension      0
      -            dec float64      deg                                                        Declination      0
      -           pmra float64 mas / yr                         Proper motion in right ascension direction      0
      -          pmdec float64 mas / yr                             Proper motion in declination direction      0
      -       parallax float64      mas                                                           Parallax      0
      - parallax_error float64      mas                                         Standard error of parallax      0
      -radial_velocity float64   km / s                                                    Radial velocity 139373
      -Jobid: 1609277169233O
      +   name    dtype    unit                              description                            
      +--------- ------- -------- ------------------------------------------------------------------
      +source_id   int64          Unique source identifier (unique within a particular Data Release)
      +       ra float64      deg                                                    Right ascension
      +      dec float64      deg                                                        Declination
      +     pmra float64 mas / yr                         Proper motion in right ascension direction
      +    pmdec float64 mas / yr                             Proper motion in declination direction
      +Jobid: 1615815886707O
       Phase: COMPLETED
       Owner: None
      -Output file: async_20201229162609.vot
      +Output file: async_20210315094446.vot
       Results: None
       
      -

      Here are the results.

      results = job.get_results()
      @@ -1833,23 +2066,22 @@ Results: None
       

      Because the filename ends with fits, the table is written in the FITS format, which preserves the metadata associated with the table.

      If the file already exists, the overwrite argument causes it to be overwritten.

      -

      To see how big the file is, we can use ls with the -lh option, which prints information about the file including its size in human-readable form.

      +

      We can use getsize to confirm that the file exists and check the size:

      -
      !ls -lh gd1_results.fits
      +
      from os.path import getsize
      +
      +MB = 1024 * 1024
      +getsize(filename) / MB
       
      -
      -rw-rw-r-- 1 downey downey 8.6M Dec 29 11:47 gd1_results.fits
      +
      5.36407470703125
       
      -

      The file is about 8.6 MB. If you are using Windows, ls might not work; in that case, try:

      -
      !dir gd1_results.fits
      -
      -

      Summary

      @@ -1860,7 +2092,7 @@ Results: None

      Best practices

      • For measurements with units, use Quantity objects that represent units explicitly and check for errors.

      • -
      • Use the format function to compose queries; it is often faster and less error-prone.

      • +
      • Use the format function to compose queries; code written this way is easier to read and less error-prone.

      • Develop queries incrementally: start with something simple, test it, and add a little bit at a time.

      • Once you have a query working, save the data in a local file. If you shut down the notebook and come back to it later, you can reload the file; you don’t have to run the query again.

      @@ -1895,8 +2127,8 @@ Results: None
      diff --git a/03_motion.html b/03_motion.html index 79b4925..e992366 100644 --- a/03_motion.html +++ b/03_motion.html @@ -5,7 +5,7 @@ - Proper Motion — Astronomical Data in Python + 3. Proper Motion — Astronomical Data in Python @@ -57,8 +57,8 @@ - - + + @@ -99,22 +99,22 @@ -

      We can select the coordinates like this:

      +

      We can select the coordinates and plot them like this:

      -
      phi1 = gd1_coord.phi1
      -phi2 = gd1_coord.phi2
      -
      -
      -
      -
      -

      And plot them like this:

      -
      -
      -
      plt.plot(phi1, phi2, 'ko', markersize=0.1, alpha=0.2)
      +
      x = skycoord_gd1.phi1
      +y = skycoord_gd1.phi2
      +plt.plot(x, y, 'ko', markersize=0.1, alpha=0.1)
       
      -plt.xlabel('ra (degree GD1)')
      -plt.ylabel('dec (degree GD1)');
      +plt.xlabel('phi1 (degree GD1)')
      +plt.ylabel('phi2 (degree GD1)');
       
      -_images/03_motion_43_0.png +_images/03_motion_45_0.png
      -

      Remember that we started with a rectangle in GD-1 coordinates. When transformed to ICRS, it’s a non-rectangular polygon. Now that we have transformed back to GD-1 coordinates, it’s a rectangle again.

      +

      Remember that we started with a rectangle in the GD-1 frame. When transformed to the ICRS frame, it’s a non-rectangular region. Now, transformed back to the GD-1 frame, it’s a rectangle again.

      Pandas DataFrame

      @@ -730,10 +730,10 @@ The coordinates in
      -

      And gd1_coord is a SkyCoord object that contains the transformed coordinates and proper motions.

      +

      And skycoord_gd1 is a SkyCoord object that contains the transformed coordinates and proper motions.

      -
      type(gd1_coord)
      +
      type(skycoord_gd1)
       
      @@ -744,9 +744,9 @@ The coordinates in

      On one hand, this division of labor makes sense because each object provides different capabilities. But working with multiple object types can be awkward.

      -

      It will be more convenient to choose one object and get all of the data into it. We’ll use a Pandas DataFrame, for two reasons:

      +

      It will be more convenient to choose one object and get all of the data into it. We’ll choose a Pandas DataFrame, for two reasons:

        -
      1. It provides capabilities that are pretty much a superset of the other data structures, so it’s the all-in-one solution.

      2. +
      3. It provides capabilities that (almost) a superset of the other data structures, so it’s the all-in-one solution.

      4. Pandas is a general-purpose tool that is useful in many domains, especially data science. If you are going to develop expertise in one tool, Pandas is a good choice.

      However, compared to an Astropy Table, Pandas has one big drawback: it does not keep the metadata associated with the table, including the units for the columns.

      @@ -756,18 +756,24 @@ The coordinates in
      import pandas as pd
       
       results_df = results.to_pandas()
      -results_df.shape
      -
      -
      -
      -
      -
      (140340, 8)
       

      DataFrame provides shape, which shows the number of rows and columns.

      -

      It also provides head, which displays the first few rows. It is useful for spot-checking large results as you go along.

      +
      +
      +
      results_df.shape
      +
      +
      +
      +
      +
      (140339, 5)
      +
      +
      +
      +
      +

      It also provides head, which displays the first few rows. head is useful for spot-checking large results as you go along.

      results_df.head()
      @@ -798,9 +804,6 @@ The coordinates in 
             dec
             pmra
             pmdec
      -      parallax
      -      parallax_error
      -      radial_velocity
           
         
         
      @@ -811,9 +814,6 @@ The coordinates in 
             21.757716
             -2.516838
             2.941813
      -      -0.257345
      -      0.823721
      -      1.000000e+20
           
           
             1
      @@ -822,9 +822,6 @@ The coordinates in 
             22.476168
             2.662702
             -12.165984
      -      0.422728
      -      0.297472
      -      1.000000e+20
           
           
             2
      @@ -833,9 +830,6 @@ The coordinates in 
             22.166932
             18.306747
             -7.950660
      -      0.103640
      -      0.544584
      -      1.000000e+20
           
           
             3
      @@ -844,9 +838,6 @@ The coordinates in 
             22.227920
             0.987786
             -2.584105
      -      -0.857327
      -      1.059607
      -      1.000000e+20
           
           
             4
      @@ -855,26 +846,23 @@ The coordinates in 
             22.110783
             0.244439
             -4.941079
      -      0.099625
      -      0.486224
      -      1.000000e+20
           
         
       
       
      -

      Python detail: shape is an attribute, so we display its value without calling it as a function; head is a function, so we need the parentheses.

      -

      Now we can extract the columns we want from gd1_coord and add them as columns in the DataFrame. phi1 and phi2 contain the transformed coordinates.

      +

      Python detail: shape is an attribute, so we display its value without calling it as a function; head is a function, so we need the parentheses.

      +

      Now we can extract the columns we want from skycoord_gd1 and add them as columns in the DataFrame. phi1 and phi2 contain the transformed coordinates.

      -
      results_df['phi1'] = gd1_coord.phi1
      -results_df['phi2'] = gd1_coord.phi2
      +
      results_df['phi1'] = skycoord_gd1.phi1
      +results_df['phi2'] = skycoord_gd1.phi2
       results_df.shape
       
      -
      (140340, 10)
      +
      (140339, 7)
       
      @@ -882,14 +870,14 @@ The coordinates in

      pm_phi1_cosphi2 and pm_phi2 contain the components of proper motion in the transformed frame.

      -
      results_df['pm_phi1'] = gd1_coord.pm_phi1_cosphi2
      -results_df['pm_phi2'] = gd1_coord.pm_phi2
      +
      results_df['pm_phi1'] = skycoord_gd1.pm_phi1_cosphi2
      +results_df['pm_phi2'] = skycoord_gd1.pm_phi2
       results_df.shape
       
      -
      (140340, 12)
      +
      (140339, 9)
       
      @@ -899,8 +887,8 @@ The coordinates in

      Exploring data

      -

      One benefit of using Pandas is that it provides functions for exploring the data and checking for problems.

      -

      One of the most useful of these functions is describe, which computes summary statistics for each column.

      +

      One benefit of using Pandas is that it provides functions for exploring the data and checking for problems. +One of the most useful of these functions is describe, which computes summary statistics for each column.

      results_df.describe()
      @@ -931,9 +919,6 @@ The coordinates in 
             dec
             pmra
             pmdec
      -      parallax
      -      parallax_error
      -      radial_velocity
             phi1
             phi2
             pm_phi1
      @@ -943,48 +928,39 @@ The coordinates in 
         
           
             count
      -      1.403400e+05
      -      140340.000000
      -      140340.000000
      -      140340.000000
      -      140340.000000
      -      140340.000000
      -      140340.000000
      -      1.403400e+05
      -      140340.000000
      -      140340.000000
      -      140340.000000
      -      140340.000000
      +      1.403390e+05
      +      140339.000000
      +      140339.000000
      +      140339.000000
      +      140339.000000
      +      140339.000000
      +      140339.000000
      +      140339.000000
      +      140339.000000
           
           
             mean
      -      6.792378e+17
      -      143.822971
      -      26.780161
      -      -2.484410
      -      -6.100784
      -      0.179474
      -      0.518068
      -      9.931167e+19
      -      -50.091337
      -      -1.803264
      -      -0.868980
      -      1.409215
      +      6.792399e+17
      +      143.823122
      +      26.780285
      +      -2.484404
      +      -6.100777
      +      -50.091158
      +      -1.803301
      +      -0.868963
      +      1.409208
           
           
             std
      -      3.792015e+16
      -      3.697824
      -      3.052639
      -      5.913923
      -      7.202013
      -      0.759622
      -      0.505558
      -      8.267982e+18
      -      2.892321
      -      3.444439
      -      6.657700
      -      6.518573
      +      3.792177e+16
      +      3.697850
      +      3.052592
      +      5.913939
      +      7.202047
      +      2.892344
      +      3.444398
      +      6.657714
      +      6.518615
           
           
             min
      @@ -993,9 +969,6 @@ The coordinates in 
             19.286617
             -106.755260
             -138.065163
      -      -15.287602
      -      0.020824
      -      -1.792684e+02
             -54.999989
             -8.029159
             -115.275637
      @@ -1003,48 +976,39 @@ The coordinates in 
           
           
             25%
      -      6.443515e+17
      -      140.967807
      -      24.592348
      -      -5.038746
      -      -8.341641
      -      -0.035983
      -      0.141108
      -      1.000000e+20
      -      -52.603097
      -      -4.750410
      -      -2.948851
      -      -1.107074
      +      6.443517e+17
      +      140.967966
      +      24.592490
      +      -5.038789
      +      -8.341561
      +      -52.602952
      +      -4.750426
      +      -2.948723
      +      -1.107128
           
           
             50%
      -      6.888056e+17
      -      143.734183
      -      26.746169
      -      -1.834971
      -      -4.689570
      -      0.362705
      -      0.336103
      -      1.000000e+20
      -      -50.147567
      -      -1.671497
      -      0.585038
      -      1.987196
      +      6.888060e+17
      +      143.734409
      +      26.746261
      +      -1.834943
      +      -4.689596
      +      -50.147362
      +      -1.671502
      +      0.585037
      +      1.987149
           
           
             75%
      -      6.976578e+17
      -      146.607180
      -      28.990490
      -      0.452995
      -      -1.937833
      -      0.657636
      -      0.751171
      -      1.000000e+20
      -      -47.593466
      -      1.160632
      -      3.001761
      -      4.628859
      +      6.976579e+17
      +      146.607350
      +      28.990500
      +      0.452893
      +      -1.937809
      +      -47.593279
      +      1.160514
      +      3.001768
      +      4.628965
           
           
             max
      @@ -1053,11 +1017,8 @@ The coordinates in 
             34.285481
             104.319923
             20.981070
      -      0.999957
      -      4.171221
      -      1.000000e+20
      -      -45.000086
      -      4.014794
      +      -44.999985
      +      4.014609
             39.802471
             79.275199
           
      @@ -1069,23 +1030,22 @@ The coordinates in 
       

      Exercise

      Review the summary statistics in this table.

        -
      • Do the values makes senses based on what you know about the context?

      • +
      • Do the values make sense based on what you know about the context?

      • Do you see any values that seem problematic, or evidence of other data issues?

      -
      +
      # Solution
       
      -# A few issues that are likely to come up:
      +# The most noticeable issue is that some of the
      +# parallax values are negative, which is non-physical.
       
      -# 1. Why are some of the parallax values negative?
      -#    Some parallax measurements are inaccurate, especially
      -#    stars that are far away.
      +# The reason is that parallax measurements are less accurate
      +# for stars that are far away.
       
      -# 2. Why are some of the radial velocities 1e20?
      -#    It seems like this value is used to indicate invalid data.
      -#    Notice that the 25th percentile is 1e20, which indicates
      -#    that at least 75% of these values are invalid.
      +# Fortunately, we don't use the parallax measurements in
      +# the analysis (one of the reasons we used constant distance
      +# for reflex correction).
       
      @@ -1094,20 +1054,60 @@ The coordinates in

      Plot proper motion

      -

      Now we are ready to replicate one of the panels in Figure 1 of the Price-Whelan and Bonaca paper, the one that shows the components of proper motion as a scatter plot:

      +

      Now we are ready to replicate one of the panels in Figure 1 of the Price-Whelan and Bonaca paper, the one that shows components of proper motion as a scatter plot:

      https://github.com/datacarpentry/astronomy-python/raw/gh-pages/fig/gd1-1.png -

      In this figure, the shaded area is a high-density region of stars with the proper motion we expect for stars in GD-1.

      +

      In this figure, the shaded area identifies stars that are likely to be in GD-1 because:

        -
      • Due to the nature of tidal streams, we expect the proper motion for most stars to be along the axis of the stream; that is, we expect motion in the direction of phi2 to be near 0.

      • +
      • Due to the nature of tidal streams, we expect the proper motion for stars in GD-1 to be along the axis of the stream; that is, we expect motion in the direction of phi2 to be near 0.

      • In the direction of phi1, we don’t have a prior expectation for proper motion, except that it should form a cluster at a non-zero value.

      -

      To locate this cluster, we will select stars near the centerline_df of GD-1.

      +

      By plotting proper motion in the GD-1 frame, we hope to find this cluster. +Then we will use the bounds of the cluster to select stars that are more likely to be in GD-1.

      +

      The following figure is a scatter plot of proper motion, in the GD-1 frame, for the stars in results_df.

      +
      +
      +
      x = results_df['pm_phi1']
      +y = results_df['pm_phi2']
      +plt.plot(x, y, 'ko', markersize=0.1, alpha=0.1)
      +    
      +plt.xlabel('Proper motion phi1 (mas/yr GD1 frame)')
      +plt.ylabel('Proper motion phi2 (mas/yr GD1 frame)');
      +
      +
      +
      +
      +_images/03_motion_69_0.png +
      +
      +

      Most of the proper motions are near the origin, but there are a few extreme values. +Following the example in the paper, we’ll use xlim and ylim to zoom in on the region near the origin.

      +
      +
      +
      x = results_df['pm_phi1']
      +y = results_df['pm_phi2']
      +plt.plot(x, y, 'ko', markersize=0.1, alpha=0.1)
      +    
      +plt.xlabel('Proper motion phi1 (mas/yr GD1 frame)')
      +plt.ylabel('Proper motion phi2 (mas/yr GD1 frame)')
      +
      +plt.xlim(-12, 8)
      +plt.ylim(-10, 10);
      +
      +
      +
      +
      +_images/03_motion_71_0.png +
      +
      +

      There is a hint of an overdense region near (-7.5, 0), but if you didn’t know where to look, you would miss it.

      +

      To see the cluster more clearly, we need a sample that contains a higher proportion of stars in GD-1. +We’ll do that by selecting stars close to the centerline.

      Selecting the centerline

      As we can see in the following figure, many stars in GD-1 are less than 1 degree from the line phi2=0.

      https://github.com/datacarpentry/astronomy-python/raw/gh-pages/fig/gd1-4.png -

      So stars near this line have the highest probability of being in GD-1.

      +

      Stars near this line have the highest probability of being in GD-1.

      To select them, we will use a “Boolean mask”. We’ll start by selecting the phi2 column from the DataFrame:

      @@ -1126,8 +1126,8 @@ The coordinates in

      We can use a comparison operator, >, to compare the values in a Series to a constant.

      -
      phi2_min = -1.0 * u.deg
      -phi2_max = 1.0 * u.deg
      +
      phi2_min = -1.0 * u.degree
      +phi2_max = 1.0 * u.degree
       
       mask = (phi2 > phi2_min)
       type(mask)
      @@ -1158,7 +1158,8 @@ Name: phi2, dtype: bool
       
      -

      The & operator computes “logical AND”, which means the result is true where elements from both Boolean Series are true.

      +

      To select values that fall between phi2_min and phi2_max, we’ll use the & operator, which computes “logical AND”. +The result is true where elements from both Boolean Series are true.

      mask = (phi2 > phi2_min) & (phi2 < phi2_max)
      @@ -1166,7 +1167,8 @@ Name: phi2, dtype: bool
       
      -

      Python note: We need the parentheses around the conditions; otherwise the order of operations is incorrect.

      +

      Python detail: Python’s logical operators (and, or, and not) don’t work with NumPy or Pandas. Both libraries use the bitwise operators (&, |, and ~) to do elementwise logical operations (explanation here).

      +

      Also, we need the parentheses around the conditions; otherwise the order of operations is incorrect.

      The sum of a Boolean Series is the number of True values, so we can use sum to see how many stars are in the selected region.

      @@ -1194,49 +1196,71 @@ Name: phi2, dtype: bool
      -

      centerline_df is a DataFrame that contains only the rows from results_df that correspond to True values in mask; that is, in contains the stars near the centerline of GD-1.

      +

      centerline_df is a DataFrame that contains only the rows from results_df that correspond to True values in mask. +So it contains the stars near the centerline of GD-1.

      +

      We can use len to see how many rows are in centerline_df:

      +
      +
      +
      len(centerline_df)
      +
      +
      +
      +
      +
      25084
      +
      +
      +
      +
      +

      And what fraction of the rows we’ve selected.

      +
      +
      +
      len(centerline_df) / len(results_df)
      +
      +
      +
      +
      +
      0.1787386257562046
      +
      +
      +
      +
      +

      There are about 25,000 stars in this region, about 18% of the total.

      Plotting proper motion

      -

      Here’s a scatter plot of proper motion for the selected stars.

      +

      Since we’ve plotted proper motion several times, let’s put that code in a function.

      -
      pm1 = centerline_df['pm_phi1']
      -pm2 = centerline_df['pm_phi2']
      +
      def plot_proper_motion(df):
      +    """Plot proper motion.
      +    
      +    df: DataFrame with `pm_phi1` and `pm_phi2`
      +    """
      +    x = df['pm_phi1']
      +    y = df['pm_phi2']
      +    plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3)
       
      -plt.plot(pm1, pm2, 'ko', markersize=0.1, alpha=0.1)
      -    
      -plt.xlabel('Proper motion phi1 (GD1 frame)')
      -plt.ylabel('Proper motion phi2 (GD1 frame)');
      +    plt.xlabel('Proper motion phi1 (mas/yr GD1 frame)')
      +    plt.ylabel('Proper motion phi2 (mas/yr GD1 frame)')
      +
      +    plt.xlim(-12, 8)
      +    plt.ylim(-10, 10)
      +
      +
      +
      +
      +

      And we can call it like this:

      +
      +
      +
      plot_proper_motion(centerline_df)
       
      -_images/03_motion_80_0.png +_images/03_motion_94_0.png
      -

      Looking at these results, we see a large cluster around (0, 0), and a smaller cluster near (0, -10).

      -

      We can use xlim and ylim to set the limits on the axes and zoom in on the region near (0, 0).

      -
      -
      -
      pm1 = centerline_df['pm_phi1']
      -pm2 = centerline_df['pm_phi2']
      -
      -plt.plot(pm1, pm2, 'ko', markersize=0.3, alpha=0.3)
      -    
      -plt.xlabel('Proper motion phi1 (GD1 frame)')
      -plt.ylabel('Proper motion phi2 (GD1 frame)')
      -
      -plt.xlim(-12, 8)
      -plt.ylim(-10, 10);
      -
      -
      -
      -
      -_images/03_motion_82_0.png -
      -
      -

      Now we can see the smaller cluster more clearly.

      +

      Now we can see more clearly that there is a cluster near (-7.5, 0).

      You might notice that our figure is less dense than the one in the paper. That’s because we started with a set of stars from a relatively small region. The figure in the paper is based on a region about 10 times bigger.

      In the next lesson we’ll go back and select stars from a larger region. But first we’ll use the proper motion data to identify stars likely to be in GD-1.

      @@ -1257,7 +1281,7 @@ Name: phi2, dtype: bool
      -

      To draw these bounds, we’ll make two lists containing the coordinates of the corners of the rectangle.

      +

      To draw these bounds, we’ll use make_rectangle to make two lists containing the coordinates of the corners of the rectangle.

      def make_rectangle(x1, x2, y1, y2):
      @@ -1280,39 +1304,27 @@ Name: phi2, dtype: bool
       

      Here’s what the plot looks like with the bounds we chose.

      -
      plt.plot(pm1, pm2, 'ko', markersize=0.3, alpha=0.3)
      -plt.plot(pm1_rect, pm2_rect, '-')
      -    
      -plt.xlabel('Proper motion phi1 (GD1 frame)')
      -plt.ylabel('Proper motion phi2 (GD1 frame)')
      -
      -plt.xlim(-12, 8)
      -plt.ylim(-10, 10);
      +
      plot_proper_motion(centerline_df)
      +plt.plot(pm1_rect, pm2_rect, '-');
       
      -_images/03_motion_90_0.png +_images/03_motion_102_0.png
      -

      To select rows that fall within these bounds, we’ll use the following function, which uses Pandas operators to make a mask that selects rows where series falls between low and high.

      +

      Now that we’ve identified the bounds of the cluster in proper motion, we’ll use it to select rows from results_df.

      +

      We’ll use the following function, which uses Pandas operators to make a mask that selects rows where series falls between low and high.

      def between(series, low, high):
      -    """Make a Boolean Series.
      -    
      -    series: Pandas Series
      -    low: lower bound
      -    high: upper bound
      -    
      -    returns: Boolean Series
      -    """
      +    """Check whether values are between `low` and `high`."""
           return (series > low) & (series < high)
       
      -

      The following mask select stars with proper motion in the region we chose.

      +

      The following mask selects stars with proper motion in the region we chose.

      pm1 = results_df['pm_phi1']
      @@ -1354,18 +1366,17 @@ Name: phi2, dtype: bool
       

      These are the stars we think are likely to be in GD-1. Let’s see what they look like, plotting their coordinates (not their proper motion).

      -
      phi1 = selected_df['phi1']
      -phi2 = selected_df['phi2']
      +
      x = selected_df['phi1']
      +y = selected_df['phi2']
      +plt.plot(x, y, 'ko', markersize=1, alpha=1)
       
      -plt.plot(phi1, phi2, 'ko', markersize=0.5, alpha=0.5)
      -
      -plt.xlabel(r'$\phi_1$ (degree GD1)')
      -plt.ylabel(r'$\phi_2$ (degree GD1)');
      +plt.xlabel('phi1 (degree GD1)')
      +plt.ylabel('phi2 (degree GD1)');
       
      -_images/03_motion_100_0.png +_images/03_motion_112_0.png

      Now that’s starting to look like a tidal stream!

      @@ -1397,68 +1408,62 @@ Name: phi2, dtype: bool

      We can write a Pandas DataFrame to an HDF5 file like this:

      -
      filename = 'gd1_dataframe.hdf5'
      +
      filename = 'gd1_data.hdf'
       
      -results_df.to_hdf(filename, 'results_df')
      +selected_df.to_hdf(filename, 'selected_df', mode='w')
       

      Because an HDF5 file can contain more than one Dataset, we have to provide a name, or “key”, that identifies the Dataset in the file.

      -

      We could use any string as the key, but in this example I use the variable name results_df.

      +

      We could use any string as the key, but it will be convenient to give the Dataset in the file the same name as the DataFrame.

      +

      The argument mode='w' means that if the file already exists, we should overwrite it.

      Exercise

      -

      We’re going to need centerline_df and selected_df later as well. Write a line or two of code to add it as a second Dataset in the HDF5 file.

      +

      We’re going to need centerline_df later as well. Write a line of code to add it as a second Dataset in the HDF5 file.

      +

      Hint: Since the file already exists, you should not use mode='w'.

      # Solution
       
       centerline_df.to_hdf(filename, 'centerline_df')
      -selected_df.to_hdf(filename, 'selected_df')
       
      -

      Detail: Reading and writing HDF5 tables requires a library called PyTables that is not always installed with Pandas. You can install it with pip like this:

      -
      pip install tables
      -
      -
      -

      If you install it using Conda, the name of the package is pytables.

      -
      conda install pytables
      -
      -
      -

      We can use ls to confirm that the file exists and check the size:

      +

      We can use getsize to confirm that the file exists and check the size:

      -
      !ls -lh gd1_dataframe.hdf5
      +
      from os.path import getsize
      +
      +MB = 1024 * 1024
      +getsize(filename) / MB
       
      -
      -rw-rw-r-- 1 downey downey 20M Dec 29 11:48 gd1_dataframe.hdf5
      +
      2.0090408325195312
       
      -

      If you are using Windows, ls might not work; in that case, try:

      -
      !dir gd1_dataframe.hdf5
      -
      -
      -

      We can read the file back like this:

      +

      If you forget what the names of the Datasets in the file are, you can read them back like this:

      -
      read_back_df = pd.read_hdf(filename, 'results_df')
      -read_back_df.shape
      +
      with pd.HDFStore(filename) as hdf:
      +    print(hdf.keys())
       
      -
      (140340, 12)
      +
      ['/centerline_df', '/selected_df']
       
      -

      Pandas can write a variety of other formats, which you can read about here.

      +

      Python note: We use a with statement here to open the file before the print statement and (automatically) close it after. Read more about context managers.

      +

      The keys are the names of the Datasets. Notice that they start with /, which indicates that they are at the top level of the Dataset hierarchy, and not in a named “group”.

      +

      In future lessons we will add a few more Datasets to this file, but not so many that we need to organize them into groups.

      @@ -1479,6 +1484,7 @@ Name: phi2, dtype: bool
    • When you make a scatter plot, adjust the size of the markers and their transparency so the figure is not overplotted; otherwise it can misrepresent the data badly.

    • For simple scatter plots in Matplotlib, plot is faster than scatter.

    • An Astropy Table and a Pandas DataFrame are similar in many ways and they provide many of the same functions. They have pros and cons, but for many projects, either one would be a reasonable choice.

    • +
    • To store data from a Pandas DataFrame, a good option is an HDF file, which can contain multiple Datasets.

    @@ -1511,8 +1517,8 @@ Name: phi2, dtype: bool