02. The Rubin/LSST Data Butler from the Command Line (intermediate)

RSP Aspect: Notebook (terminal command-line)

Contact author: Aaron Meisner

Last verified to run: 10/28/2024

LSST Science Pipelines version: Weekly 2024_16

Targeted learning level: Intermediate

Container size: small

Introduction: This tutorial is an introduction to the “Butler” command line functionality available from the LSST Science Pipelines on the Rubin Science Platform (RSP). It is meant to parallel, in part, the Jupyter Notebook tutorial “Data Discovery and Query with the Butler” (DP02_04b_Intermediate_Butler_Queries.ipynb). The Butler is the LSST Science Pipelines’ infrastructure that handles retrieval of pipeline inputs and saving of pipeline outputs in such a way that the end user does not need to know about any details like file names, locations, or formats. For this reason, Butler is sometimes referred to as the LSST Science Pipelines’ “middleware”. This tutorial emphasizes command line invocations of the Butler, but it is also common to use Butler from within Python; several Rubin Data Preview Jupyter notebook tutorials demonstrate the usage of Butler from within Python. You can also learn more from the Butler documentation.

This tutorial uses the Data Preview 0.2 (DP0.2) data set. This data set uses a subset of the DESC’s Data Challenge 2 (DC2) simulated images, which have been reprocessed by Rubin Observatory using Version 23 of the LSST Science Pipelines. More information about the simulated data can be found in the DESC’s DC2 paper and in the DP0.2 data release documentation.

Step 1. Access the terminal and set up

1.1. Log in to the RSP’s Notebook Aspect. The JupyterLab terminal is a subcomponent of the Notebook Aspect.

1.2. In the launcher window under “Other”, select the terminal.

1.3. Set up the LSST Science Pipelines environment

setup lsst_distrib

Then verify that the recommended weekly version of the LSST Science Pipelines is the one set up in your RSP JupyterLab terminal environment:

eups list lsst_distrib

1.4. First Butler command: look at the Butler help documentation options

Help documentation for Butler command line invocations is available as follows:

butler --help

Running the preceding command is possible because butler is an executable available once the LSST Science Pipelines environment has been set up (you can see the exact location of this executable by running which butler at the terminal, if desired). There are many Butler command line options shown in this help documentation. This command line tutorial only explores a subset of the Butler’s full command line capabilities.

Step 2. Query dataset types

Typically, you will identify a specific Rubin/LSST dataset using the following three pieces of information: the dataset type, the data ID, and the collection. You can think of dataset type as being the kind of data you wish to identify, such as a single-band coadded image, a calibrated visit image, or a coadd-level source catalog. The data ID is a concept that’s been reviewed in multiple Rubin tutorial notebooks. For instance, in the tutorial notebook entitled “Detect and Measure Sources in a Custom Coadded Image”, source detection is only run on one specific i-band coadd, which has its data ID specified by tract = 4431, patch = 17, band = “i”, where the tract and patch together specify the sky region and the band specifies which LSST filter. The collection is a string that specifies a particular set of persisted data for Butler to examine (note that in many contexts it is possible to specify a list of collections for Butler to search rather than a single collection). The concept of a collection is valuable because it allows for tailoring queries to work with either the default, Rubin-provided DP0.2 data sets (this default collection is called 2.2i/runs/DP0.2), custom user-created collections, or both.

Zooming out one level, collections exist within Butler repositories or “repos”. On the RSP, the standard DP0.2 repo of Rubin-provided content is called dp02. dp02 therefore appears as a command line argument in many of the Butler invocations throughout this tutorial.

2.1. View the list of Butler dataset types within DP0.2

The aforementioned butler --help command shows that there is a Butler invocation pattern beginning with butler query-dataset-types with help description “Get the dataset types in a repository.” This is useful for understanding what types of data products are or are not contained within a given Butler repo. The following command will print out a list of all dataset types within the DP0.2 repository on the RSP:

butler query-dataset-types dp02

Specifying a collection is not needed here; only the repository name dp02 is provided as an argument. The full output is shown on a separate page for brevity. There are more than 900 dataset types in the DP0.2 repo on RSP! Some of these are relatively recognizable and commonly used dataset types, including deepCoadd, deepCoadd_obj, raw, calexp, bias, src, and srcMatch.
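With more than 900 dataset types, it can help to narrow the listing at the terminal. The following is a minimal sketch that assumes the listing prints one dataset type name per line; filter_dataset_types is a hypothetical helper defined here, not part of the Butler command line tools.

```shell
# Hypothetical helper (not part of the Butler CLI): keep only the dataset
# type names on standard input that start with the given prefix.
filter_dataset_types() {
    awk -v prefix="$1" 'index($1, prefix) == 1 { print $1 }'
}

# Run the query only where the butler executable is available.
if command -v butler >/dev/null 2>&1; then
    butler query-dataset-types dp02 | filter_dataset_types deepCoadd
fi
```

Matching on a literal prefix rather than a regular expression avoids surprises from characters that have special meaning in patterns.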

Note that you can also get help documentation specific to the butler query-dataset-types invocation pattern from the command line:

butler query-dataset-types --help

Step 3. Query dimension records

Per the LSST Science Pipelines documentation, the Butler has a notion of Dimensions, summarized as “astronomical concepts that are used to label and organize datasets…Examples of dimensions include instruments, detectors, visits, and tracts.” So a Dimension is any type of parameter you might use to help fully specify a dataset of interest.

The Butler command line utility includes an invocation pattern that begins with butler query-dimension-records. These query-dimension-records commands can be very valuable for exploring the content of a Butler repository from the command line. For instance, as shown below, query-dimension-records can be used to learn what different filters, visit numbers, and detectors are present within a given Butler repository, all from the command line. Full help documentation for butler query-dimension-records is available via butler query-dimension-records --help.

3.1. Explore the list of filters from the command line

Let’s ask and answer the following question with butler query-dimension-records: what filters are present within the dp02 Butler repository? In this context, band is the Dimension of interest, and query-dimension-records lists the unique band dimension values represented within the dp02 repository. Execute the following command:

butler query-dimension-records dp02 band

The two arguments for the above command are the Butler repository (dp02) and the Dimension of interest, in this case band because of the desire to obtain the list of filters. It is not necessary to specify a collection name for butler query-dimension-records. The full output is shown on a separate page for brevity. From the results, it appears that there are some “bonus” bands included within the DP0.2 repository, beyond the ugrizy bands that will be present in the Rubin/LSST science survey. It is recommended to only work with the Rubin/LSST science filters ugrizy within DP0.2 on the RSP; the other band values listed generally correspond to early engineering exercises.

3.2. Instruments included in the dp02 Butler repository

For several commands later in this tutorial, it’s useful to restrict to data generated by the LSST “imSim” image simulation tool. Let’s check what unique Instrument values are present in the dp02 repo on RSP (noting that these are simulated data and so the instruments are simulation software packages not actual hardware):

butler query-dimension-records dp02 instrument

The output of the above command is:

     name      visit_max exposure_max detector_max          class_name
-------------- --------- ------------ ------------ ---------------------------
LSSTCam-PhoSim   9999999      9999999         1000 lsst.obs.lsst.LsstCamPhoSim
 LSSTCam-imSim   9999999      9999999         1000  lsst.obs.lsst.LsstCamImSim

Generally, restricting to images generated by imSim (Instrument = LSSTCam-imSim) is helpful for the purposes of this tutorial; “PhoSim” (Instrument = LSSTCam-PhoSim) is a lower-level photon simulator.

3.3. Explore the list of detectors from the command line

Another reasonable question to ask when initially exploring the DP0.2 data products on RSP would be: what is the list of detectors that contribute to the overall data inventory? To do this, use butler query-dimension-records with detector as the dimension rather than band. detector here means a specific CCD. Run the following command:

butler query-dimension-records dp02 detector

Like before, the two arguments for the above command are the Butler repository (dp02) and the Dimension of interest, in this case detector because of the desire to obtain the list of CCDs. It is not necessary to specify a collection name for butler query-dimension-records. From the above command’s results, it’s interesting to note that there are some WAVEFRONT and GUIDER CCDs present in the simulated DP0.2 data set, in addition to the SCIENCE CCDs. It is recommended to only work with the simulated SCIENCE CCDs within DP0.2 on the RSP. The full output of the above command is shown on a separate page for brevity. Note that this output contains both LSSTCam-imSim and LSSTCam-PhoSim results, which will be addressed in the next subsection.

3.4. Refine a Butler dimension record query

butler query-dimension-records, and other Butler command line invocation patterns, offer the very valuable ability to perform SQL-like filtering of returned results via the --where argument. The where argument for Butler command line invocations must be a string enclosed in quotes, with syntax similar to that used for WHERE clauses in SQL or ADQL queries.

Here’s an example of a butler query-dimension-records invocation that also brings in a SQL-like where clause to limit the amount of output, and focus only on detectors with id numbers between 6 and 8 (inclusive):

butler query-dimension-records dp02 detector --where "instrument='LSSTCam-imSim' AND detector.id IN (6..8)"

The output of the above command is:

  instrument   id full_name name_in_raft raft purpose
------------- --- --------- ------------ ---- -------
LSSTCam-imSim   6   R01_S20          S20  R01 SCIENCE
LSSTCam-imSim   7   R01_S21          S21  R01 SCIENCE
LSSTCam-imSim   8   R01_S22          S22  R01 SCIENCE

The instrument='LSSTCam-imSim' portion of the query is required (if absent, an error would result). The LSST Science Pipelines documentation contains further information about Butler query syntax.
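Because the where string mixes shell double quotes with single-quoted string literals, it can be convenient to assemble it from variables. The sketch below reproduces the expression used above; detector_where is a hypothetical helper name introduced only for illustration.

```shell
# Hypothetical helper: print a where expression selecting detector IDs
# from $2 to $3 (inclusive) for the instrument named in $1.
detector_where() {
    printf "instrument='%s' AND detector.id IN (%s..%s)" "$1" "$2" "$3"
}

# The outer double quotes keep the whole expression as one shell argument,
# while the inner single quotes reach Butler unchanged.
if command -v butler >/dev/null 2>&1; then
    butler query-dimension-records dp02 detector --where "$(detector_where LSSTCam-imSim 6 8)"
fi
```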

3.5. Spatially restricted query of DP0.2 exposures

Putting together Butler’s query-dimension-records and where argument filtering, perform a spatial query on exposures in the dp02 Butler repo as follows:

butler query-dimension-records dp02 exposure --where "instrument='LSSTCam-imSim' AND exposure.tracking_ra > 53.0 AND exposure.tracking_ra < 53.0002"

The output of the above command is:

  instrument     id   physical_filter  obs_id exposure_time dark_time observation_type observation_reason day_obs  seq_num group_name group_id target_name science_program    tracking_ra       tracking_dec        sky_angle         zenith_angle                  timespan (TAI)
------------- ------- --------------- ------- ------------- --------- ---------------- ------------------ -------- ------- ---------- -------- ----------- --------------- ----------------- ------------------ ------------------ ------------------ ------------------------------------------
LSSTCam-imSim  202462       g_sim_1.4  202462          30.0      30.0          science              imsim 20221001       0     202462   202462     UNKNOWN          202462 53.00018875481526 -27.39918586728378   300.340730287346 30.851948317324634 [2022-10-02T05:10:33, 2022-10-02T05:11:03)
LSSTCam-imSim  427087       z_sim_1.4  427087          30.0      30.0          science              imsim 20230830       0     427087   427087     UNKNOWN          427087 53.00006696621878  -31.6170143466886  74.39265729448658 14.782812793171615 [2023-08-31T08:29:14, 2023-08-31T08:29:44)
LSSTCam-imSim  470443       i_sim_1.4  470443          30.0      30.0          science              imsim 20231118       0     470443   470443     UNKNOWN          470443 53.00006155463241 -27.50071382700686  41.46698793005794  38.04876864900619 [2023-11-19T01:29:17, 2023-11-19T01:29:47)
LSSTCam-imSim  709719       z_sim_1.4  709719          30.0      30.0          science              imsim 20241112       0     709719   709719     UNKNOWN          709719 53.00003150603273 -27.35089052186188  69.77305663201832   35.7790473845536 [2024-11-13T07:27:01, 2024-11-13T07:27:31)
LSSTCam-imSim  732227       z_sim_1.4  732227          30.0      30.0          science              imsim 20241217       0     732227   732227     UNKNOWN          732227 53.00011899044281 -27.53881903559373  257.8766252269814  21.50198032659604 [2024-12-18T04:03:34, 2024-12-18T04:04:04)
LSSTCam-imSim  950384       u_sim_1.4  950384          30.0      30.0          science              imsim 20251023       0     950384   950384     UNKNOWN          950384  53.0000299236379   -27.406067390404  48.48529963673576  43.61972031905932 [2025-10-24T02:43:36, 2025-10-24T02:44:06)
LSSTCam-imSim  955121       r_sim_1.4  955121          30.0      30.0          science              imsim 20251029       0     955121   955121     UNKNOWN          955121 53.00000406576612 -27.37569446275185 227.18038983167162  36.82874161171403 [2025-10-30T02:51:42, 2025-10-30T02:52:12)
LSSTCam-imSim  976771       z_sim_1.4  976771          30.0      30.0          science              imsim 20251209       0     976771   976771     UNKNOWN          976771 53.00006086158266 -27.35854559192086 185.72856377252242   8.46669843785567 [2025-12-10T03:34:51, 2025-12-10T03:35:21)
LSSTCam-imSim 1194599       y_sim_1.4 1194599          30.0      30.0          science              imsim 20261016       0    1194599  1194599     UNKNOWN         1194599 53.00007698977164 -27.44389935977285  291.1579519122609 21.546698575031996 [2026-10-17T04:54:23, 2026-10-17T04:54:53)

The first two arguments for the above command are, as seen previously, the Butler repository (dp02) and the Dimension of interest (in this case exposure). Without the where argument, a dramatically longer list of simulated DP0.2 exposures would be printed out. The where clause specified above restricts the list of returned exposures to only 9 items. As shown in the DP0.2 Data Products Definition Document, DP0.2 covers a subregion within the eventual LSST footprint, with RA values spanning from roughly 50 degrees to 75 degrees, which is why the above query looks at a narrow range of RA around RA = 53 degrees. The instrument='LSSTCam-imSim' portion of the where argument query is required (if absent, an error would result); this specifies the instrument for Butler to consider when retrieving the exposure list. It is not currently possible to perform a butler command line query that would do a cone search based on the RA and Dec coordinates of exposures.
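Wide tables like the one above are easier to digest when summarized. The following sketch tallies the returned exposures by physical filter, assuming the layout shown above (two header lines, with physical_filter as the third whitespace-separated column); count_by_filter is a hypothetical helper.

```shell
# Hypothetical helper: count rows per value of the third column
# (physical_filter), skipping the two header lines and any blank lines.
count_by_filter() {
    awk 'NR > 2 && NF { n[$3]++ } END { for (f in n) print f, n[f] }' | sort
}

# Run the query only where the butler executable is available.
if command -v butler >/dev/null 2>&1; then
    butler query-dimension-records dp02 exposure \
        --where "instrument='LSSTCam-imSim' AND exposure.tracking_ra > 53.0 AND exposure.tracking_ra < 53.0002" \
        | count_by_filter
fi
```

For the nine rows listed above, this would report four z_sim_1.4 exposures and one each for the g, i, r, u, and y filters.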

3.6. Temporally restricted query

It is also possible to perform a temporally constrained query rather than a spatially constrained query. The following command is an example using the where argument of query-data-ids, which is covered further in Step 4:

butler query-data-ids dp02 exposure --where "instrument='LSSTCam-imSim' AND exposure.timespan OVERLAPS (T'2023-11-19T01:29:17',T'2023-11-19T01:31:25')"

The output of the above command is:

  instrument  exposure band physical_filter
------------- -------- ---- ---------------
LSSTCam-imSim   470444    i       i_sim_1.4
LSSTCam-imSim   470445    i       i_sim_1.4
LSSTCam-imSim   470446    i       i_sim_1.4
LSSTCam-imSim   470447    i       i_sim_1.4

This example command uses the OVERLAPS operator. Note that time literals like T'2023-11-19T01:29:17' in the query begin with T and enclose the timestamp in single quotes. Two timestamp literals separated by a comma and enclosed together within parentheses specify a time interval within which to search. The Butler query documentation provides further information about timestamp literals, time intervals, and the OVERLAPS operator.
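The nesting of T prefixes, single quotes, and parentheses is easy to mistype. As a sketch, the expression above can be generated from a start and end timestamp; timespan_where is a hypothetical helper, and it performs no validation of the timestamps it is given.

```shell
# Hypothetical helper: print a where expression selecting exposures whose
# timespan overlaps the interval from $2 to $3, for the instrument in $1.
timespan_where() {
    printf "instrument='%s' AND exposure.timespan OVERLAPS (T'%s',T'%s')" "$1" "$2" "$3"
}

# Run the query only where the butler executable is available.
if command -v butler >/dev/null 2>&1; then
    butler query-data-ids dp02 exposure \
        --where "$(timespan_where LSSTCam-imSim 2023-11-19T01:29:17 2023-11-19T01:31:25)"
fi
```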

Step 4. Querying dataID values

To identify lists of relevant datasets, which are specified by data IDs, use the butler query-data-ids invocation pattern. butler query-data-ids prints out a list of all datasets of a user-specified type. Full help for butler query-data-ids is available via the following command:

butler query-data-ids --help

Let’s start off with a relatively standard, familiar dataset type within DP0.2: deepCoadd. Recall that an LSST Science Pipelines deepCoadd data product is a stacked image using all good exposures to make a coadd that emphasizes depth (as opposed to, say, goodSeeingCoadd, which emphasizes angular resolution). deepCoadd data products are very generally useful, for instance to study faint galaxies or distant stars within the Milky Way. The following command prints out a full list of deepCoadd tract identifiers within the dp02 Butler repository on RSP:

butler query-data-ids dp02 tract --collections 2.2i/runs/DP0.2 --datasets 'deepCoadd'

Recall that a tract is a relatively large sky region within a given sky map, and then patches are smaller subregions within each tract. The first argument to butler query-data-ids above specifies, as usual, the dp02 Butler repo on RSP. The second argument specifies the Dimension of interest – that Butler should return the list of unique tract values corresponding to deepCoadd products in the DP0.2 repository. The third --collections argument specifies that Butler should consider only the production DP0.2 output collection; this will ignore any results that might be found from bespoke user-created collections on RSP. The full output of the above command is shown on a separate page for brevity.

Let’s restrict the output to only the returned data rows, as follows, upon noting that each data row contains the string “DC2” for the skymap, and then count the number of returned results with the wc unix utility program:

butler query-data-ids dp02 tract --collections 2.2i/runs/DP0.2 --datasets 'deepCoadd' |grep DC2 |wc -l

The resulting printout value of 157 makes sense, as there are 157 tracts’ worth of sky coverage in DP0.2.

Now use Butler from the command line to figure out how many coadd patches there are in the DP0.2 data set:

butler query-data-ids dp02 patch --collections 2.2i/runs/DP0.2 --datasets 'deepCoadd' | grep DC2 | wc -l

Note that this command is almost identical to the one before it, but with patch rather than tract specified as the Dimension of interest. The resulting printout value of 7693 makes sense, because there are 157 DP0.2 coadd tracts, and each of these tracts consists of a grid of 7x7 = 49 patches. So then there should be 157 tracts x 49 patches/tract = 7693 patches in DP0.2.
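An alternative to matching on the string “DC2” is to skip the header lines and count whatever remains. The sketch below assumes the output begins with exactly two header lines (column names and a dashed rule), as in the tables shown in this tutorial; count_rows is a hypothetical helper. It also computes the expected patch count for comparison.

```shell
# Hypothetical helper: count non-blank lines after the two header lines.
count_rows() {
    awk 'NR > 2 && NF { n++ } END { print n + 0 }'
}

# Expected number of patches: 157 tracts, each a 7x7 grid of patches.
expected=$((157 * 7 * 7))
echo "expected patches: $expected"

# Run the query only where the butler executable is available.
if command -v butler >/dev/null 2>&1; then
    butler query-data-ids dp02 patch --collections 2.2i/runs/DP0.2 --datasets 'deepCoadd' | count_rows
fi
```

Counting by position rather than by content keeps working if the skymap name changes, at the cost of depending on the header layout.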

Step 5. The limit and order-by arguments

butler query-data-ids has a limit argument that restricts the output to at most a user-specified number of results. The following command displays just a first set of 4 deepCoadd dataIDs:

butler query-data-ids dp02 patch --collections 2.2i/runs/DP0.2 --datasets 'deepCoadd' --limit 4

The output of the above command is:

skymap tract patch
------ ----- -----
   DC2  3265     7
   DC2  3633    22
   DC2  3831    47
   DC2  4852    18

Note that Butler has a default limit value of 20,000 which would become relevant for a query that might yield a very large number of results. A Butler query that hits the default limit of 20,000 results will issue a warning about this default limit.

The order-by command line argument is also available for multiple Butler command line utilities, including query-dimension-records. To order query-dimension-records results for a list of detectors by detector full name:

butler query-dimension-records dp02 detector --limit 4 --order-by full_name --where "instrument='LSSTCam-imSim'"

The output of the above command is:

  instrument   id full_name name_in_raft raft purpose
------------- --- --------- ------------ ---- -------
LSSTCam-imSim   0   R01_S00          S00  R01 SCIENCE
LSSTCam-imSim   1   R01_S01          S01  R01 SCIENCE
LSSTCam-imSim   2   R01_S02          S02  R01 SCIENCE
LSSTCam-imSim   3   R01_S10          S10  R01 SCIENCE

Note that the above command combines the order-by and limit arguments, only showing the first 4 results sorted by ascending detector full_name.

Step 6. Optional exercises for the learner

  1. butler query-data-ids also accepts a where argument to narrow down queries. Try issuing a butler query-data-ids command that only returns a list of i-band deepCoadd products, rather than all bands.

  2. Use butler query-data-ids to obtain a list of tracts that have goodSeeingCoadd data products within the DP0.2 repository on RSP.

  3. Refine the Section 3.3 query for available detectors so as to remove results that arise from the “PhoSim” simulator using a query restriction based on instrument.