DM-23033: Clarify reference catalog creation docs. #186

28 changes: 24 additions & 4 deletions doc/lsst.meas.algorithms/creating-a-reference-catalog.rst
@@ -21,7 +21,7 @@ This page uses `Gaia DR2`_ as an example.
1. Gathering data
=================

:lsst-task:`~lsst.meas.algorithms.ingestIndexReferenceTask.IngestIndexedReferenceTask` reads text or FITS files from an external catalog (e.g. ``GaiaSource*.csv.gz``).
:lsst-task:`~lsst.meas.algorithms.ingestIndexReferenceTask.IngestIndexedReferenceTask` reads reference catalog data from one or more text or FITS files representing an external catalog (e.g. :file:`GaiaSource*.csv.gz`).
In order to ingest these files, you must have a copy of them on a local disk.
Network storage (such as NFS and GPFS) is not recommended for this work, due to performance issues involving tens of thousands of small files.
Ensure that you have sufficient storage capacity.
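
As a rough pre-flight check, the following sketch totals the size of the downloaded files and compares it with the free space on the target disk.
The helper name ``check_capacity``, the ``headroom`` factor, and any paths passed to it are illustrative assumptions, not part of the LSST stack; the space the ingested catalog actually needs depends on your configuration.

```python
import glob
import os
import shutil


def check_capacity(pattern, output_dir, headroom=2.0):
    """Return (input_bytes, free_bytes, ok) for the files matching ``pattern``.

    ``headroom`` is an assumed safety factor, not an LSST recommendation:
    ``ok`` is True when the free space on the disk holding ``output_dir``
    is at least ``headroom`` times the total size of the input files.
    """
    files = glob.glob(pattern)
    # Total on-disk size of every matching input file.
    input_bytes = sum(os.path.getsize(f) for f in files)
    # Free space on the filesystem that contains output_dir.
    free_bytes = shutil.disk_usage(output_dir).free
    return input_bytes, free_bytes, free_bytes >= headroom * input_bytes
```

For example, ``check_capacity("/path/to/gaia/GaiaSource*.csv.gz", "/path/to/output")`` (both paths hypothetical) returns the two byte counts and a flag that you can inspect before starting a long ingest.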
@@ -89,9 +89,29 @@ This is an example configuration that was used to ingest the Gaia DR2 catalog:
3. Ingest the files
===================

The main difference when running :lsst-task:`~lsst.meas.algorithms.ingestIndexReferenceTask.IngestIndexedReferenceTask` compared with other LSST tasks is that you specify the full list of files to be ingested.
For many input catalogs, this may be tens of thousands of files: more than most shells support.
Instead, you can write a small Python script that finds files with the `glob` package to run the :lsst-task:`~lsst.meas.algorithms.ingestIndexReferenceTask.IngestIndexedReferenceTask` task programatically.
:lsst-task:`~lsst.meas.algorithms.ingestIndexReferenceTask.IngestIndexedReferenceTask` takes three important parameters:

- The name of a Butler repository.

  This repository is only used to initialize the Butler, and doesn't have to contain any useful data.
  You can point to any repository you have available, or you could create a temporary one like this:

  .. prompt:: bash

     mkdir /path/to/my_repo
     echo "lsst.obs.test.TestMapper" > /path/to/my_repo/_mapper
Contributor

I'm not hugely happy about this, since this method of making a pseudo-butler is going away in gen3, but then again the code I wrote isn't gen3 compatible, so I guess it's fine?

Contributor Author

I am a Gen 3 neophyte. Is there a better way to do this? I didn't know of one, but if there is I'll happily use that instead.

Contributor

For gen3 there is not; you call Butler.makeRepo(), but the refcat ingester is only gen2 right now anyway...

I do wonder whether this bit about making a fake repo should just be removed, as anyone running this is going to have a butler repo available.


- The name(s) of the input FITS or text files.
- The path to the configuration file (say, :file:`/path/to/my_config.cfg`).

The task could then be invoked from the command line as:

.. prompt:: bash

   ingestReferenceCatalog.py /path/to/my_repo input_catalog.txt --configfile /path/to/my_config.cfg

However, be aware that external catalogs may be split across tens of thousands of files: attempting to specify the full list on the command line is likely to be impossible due to limits imposed by the underlying operating system and shell.
Instead, you can write a small Python script that finds files with the `glob` module and then runs the :lsst-task:`~lsst.meas.algorithms.ingestIndexReferenceTask.IngestIndexedReferenceTask` task for you.
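
To make the limit concrete, the sketch below gathers the input files with ``glob``, builds the argument list for the command shown above, and estimates how many bytes that list would occupy.
The function names ``build_command``, ``argv_size``, and ``arg_max`` are hypothetical; only the command name and the ``--configfile`` option come from the example above.
The estimate ignores environment variables, which also count toward the operating system's limit.

```python
import glob
import os


def build_command(repo, pattern, config_file):
    """Assemble the argument list for ingestReferenceCatalog.py."""
    # Sort so the file order is reproducible between runs.
    files = sorted(glob.glob(pattern))
    return ["ingestReferenceCatalog.py", repo, *files, "--configfile", config_file]


def argv_size(argv):
    """Bytes needed for argv: each argument plus its terminating NUL."""
    return sum(len(os.fsencode(arg)) + 1 for arg in argv)


def arg_max():
    """Return the OS limit on argument bytes, or None if it is unknown."""
    try:
        limit = os.sysconf("SC_ARG_MAX")
    except (AttributeError, ValueError, OSError):
        # os.sysconf or SC_ARG_MAX is unavailable on this platform.
        return None
    return limit if limit > 0 else None
```

Comparing ``argv_size(build_command(...))`` with ``arg_max()`` shows how close a given file list comes to the limit; when it is anywhere near, run the task from a Python script instead.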

Here is a sample script that was used to generate the Gaia DR2 refcat.
In order to deal with the way that Gaia released their photometric data, we have subclassed :lsst-task:`~lsst.meas.algorithms.ingestIndexReferenceTask.IngestIndexedReferenceTask` as `~lsst.meas.algorithms.ingestIndexReferenceTask.IngestGaiaReferenceTask`, and also subclassed the ingestion manager with `lsst.meas.algorithms.ingestIndexManager.IngestGaiaManager`.