
How training data is created

Introduction

The data generation pipeline is a multi-step process that involves identifying, selecting, downloading, and processing HLS scenes. The final output of the processed HLS scenes is used as the input for the multi-temporal crop classification model.

Process Outline

The steps for generating the training data are as follows:

  1. Specify an area of interest (AOI) representing a region for which training data will be generated
  2. Prepare CDL chips and identify intersecting HLS scenes that correspond to each chip
  3. Determine candidate scenes that meet cloud cover and spatial coverage criteria
  4. Select and download scenes from IPFS
  5. Reproject each tile based on the CDL projection
  6. Merge scene bands and clip to chip boundaries
  7. Discard clipped results that do not meet QA and NA criteria

Crop Classification data_prep module

The prior workflow was self-contained within a Jupyter notebook, but it has been refactored into a more modular and reusable format. The process outlined above is now broken down into a series of scripts that can be run independently or as a whole.

  1. identify_hls_scenes.py - This script identifies the HLS scenes that intersect with each training chip. Its output is a CSV file listing the intersecting HLS scenes for each chip.
  2. retrieve_hls_scenes.py - This script downloads the HLS scenes from IPFS based on the CSV file of selected scenes generated in the previous step.
  3. reproject_hls_scenes.py - This script reprojects the HLS scenes to match the CDL projection.
  4. generate_training_chips.py - This script generates the training chip dataset that is sent through the model pipeline. It merges scene bands, clips them to chip boundaries, and discards clipped results that do not meet QA and NA criteria.

Run the scripts in the order listed above to generate the training data:

python crop_classification/data_prep/<name>.py

Scripts

create_bb.py

Responsible for generating bounding boxes for chips based on the Cropland Data Layer (CDL) data. It processes the CDL data to create a GeoJSON file containing the bounding boxes and their associated properties. These bounding boxes are used in subsequent steps of the pipeline to define the spatial extent of the chips that will be processed.

When to run: Run at the beginning of the pipeline to prepare the spatial extents for the chips.

select_aoi.py

Generates an interactive web map tool for users to select an Area of Interest (AOI). Users can draw polygons on a map to define the AOI, which is then saved as a GeoJSON file. The selected AOI is used to filter the chips and tiles that will be processed in the pipeline, ensuring that only data within the specified area is considered.

When to run: This script only needs to be run if the bounding box chips were updated with create_bb.py.
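
A minimal sketch of such an interactive selector using folium's Draw plugin; the mapping library actually used by select_aoi.py is an assumption:

```python
# Hypothetical AOI selector sketch: folium's Draw plugin lets the user draw
# polygons on a map and export them as GeoJSON. The real script may differ.
import folium
from folium.plugins import Draw

m = folium.Map(location=[39.8, -98.6], zoom_start=4)  # roughly centered on CONUS
Draw(export=True).add_to(m)  # adds drawing tools and an "Export" button (saves GeoJSON)
m.save("select_aoi.html")    # open in a browser, draw the AOI, then export it
```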

create_chip_5070_payload.py

Creates a copy of the GeoJSON file passed into the chip_payload_filename property in the configuration file, reprojected to the EPSG:5070 CRS. The reprojection is necessary so the HLS imagery matches the CDL coordinate system. The output GeoJSON is saved to the same directory as the input file, with the suffix _5070 appended to the filename.

Note: This script is called by reproject_hls_scenes.py if the EPSG:5070 CRS file does not exist.
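
A minimal sketch of this reprojection step using geopandas; the function name and example path are illustrative:

```python
# Sketch of reprojecting a chip payload GeoJSON to EPSG:5070 (CONUS Albers).
# The helper name and example path are assumptions, not the script's exact API.
from pathlib import Path
import geopandas as gpd

def reproject_chip_payload(chip_payload_path: str) -> Path:
    src = Path(chip_payload_path)
    gdf = gpd.read_file(src)                             # read the input GeoJSON
    gdf_5070 = gdf.to_crs(epsg=5070)                     # reproject to the CDL CRS
    out = src.with_name(f"{src.stem}_5070{src.suffix}")  # append the _5070 suffix
    gdf_5070.to_file(out, driver="GeoJSON")
    return out

# Example (hypothetical path): reproject_chip_payload("data/chip_bbox.geojson")
```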

grab_cids_from_selected_tiles.py

Retrieves the Content Identifiers (CIDs) for the selected tiles and generates a JSON file containing the CIDs for each tile.

When to run: Run after selecting the AOI to generate the CIDs for each tile, which are necessary for further processing in the pipeline. The resulting output can be used by ipfs_cli_download.py to download the HLS scenes, or by check_ipfs_content_retrievability.py to verify that the content is retrievable.

ipfs_cli_download.py

Handles downloading the HLS scenes for the selected tiles using the ipfs CLI tool. It retrieves the assets associated with each tile from IPFS and saves them to the local directory for further processing. This script ensures that all necessary HLS data is available locally for subsequent steps in the pipeline.

When to run: After selecting the AOI to download the required HLS data.

Note: This script is optional and can be used as an alternative to the retrieve_hls_scenes.py script. It can also be used as an auxiliary tool to help identify and download HLS scenes from IPFS using the ipfs CLI tool directly.
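
A hedged sketch of driving the ipfs CLI from Python with subprocess; the JSON layout for the tile CIDs is an assumed structure, not the script's exact format:

```python
# Illustrative IPFS download loop using the real `ipfs get` CLI command.
# The CID JSON structure ({tile_id: {band: cid}}) is an assumed layout.
import json
import subprocess
from pathlib import Path

def download_tile_assets(cid_json_path: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    tiles = json.loads(Path(cid_json_path).read_text())
    for tile_id, bands in tiles.items():
        for band, cid in bands.items():
            target = out / f"{tile_id}.{band}.tif"
            if target.exists():
                continue  # skip assets that are already downloaded
            subprocess.run(["ipfs", "get", cid, "-o", str(target)], check=True)
```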

check_ipfs_content_retrievability.py

Checks the retrievability of content from IPFS by attempting to download each Content Identifier (CID). It iterates through a list of CIDs, tries to download the corresponding content, and logs any missing or inaccessible CIDs to a JSON file. This information is used to identify and address any issues with data availability on IPFS.

When to run: After generating the JSON file from grab_cids_from_selected_tiles.py.
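
A sketch of one way to implement the check; it probes availability with `ipfs block stat` rather than performing a full download, which is an assumption about the method:

```python
# Probe each CID with a timeout and log the unreachable ones to JSON.
# `ipfs block stat` is used here as a lightweight stand-in for a full download.
import json
import subprocess
from pathlib import Path

def check_cids(cids: list[str], report_path: str, timeout_s: int = 60) -> list[str]:
    missing = []
    for cid in cids:
        try:
            subprocess.run(["ipfs", "block", "stat", cid],
                           check=True, capture_output=True, timeout=timeout_s)
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            missing.append(cid)  # record CIDs that failed or timed out
    Path(report_path).write_text(json.dumps({"missing": missing}, indent=2))
    return missing
```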

split_training_data.py

Splits the chip IDs into training and validation sets for the training dataset specified in the configuration file. A unique list of chip IDs is determined from the files in the chips_filtered directory. The script reads the list of chip IDs, randomly assigns them to either the training or validation set, and saves the resulting lists to CSV files. These CSV files are used to train and validate the crop classification model, ensuring that the model is evaluated on a separate set of data from what it was trained on.

When to run: After preparing the chip data to create the training and validation datasets.
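
A minimal sketch of the random split, assuming chip IDs can be recovered from *_merged.tif filenames in chips_filtered (the naming pattern and output filenames are assumptions):

```python
# Sketch of an 80/20 train/validation split over unique chip IDs.
# The filename pattern and output names are assumptions for illustration.
import csv
import random
from pathlib import Path

def split_chip_ids(chips_dir: str, out_dir: str,
                   train_frac: float = 0.8, seed: int = 42) -> None:
    chip_ids = sorted({p.name.replace("_merged.tif", "")
                       for p in Path(chips_dir).glob("*_merged.tif")})
    random.Random(seed).shuffle(chip_ids)  # deterministic shuffle for reproducibility
    cut = int(len(chip_ids) * train_frac)
    splits = {"train_data.csv": chip_ids[:cut], "validation_data.csv": chip_ids[cut:]}
    for name, ids in splits.items():
        with open(Path(out_dir) / name, "w", newline="") as f:
            csv.writer(f).writerows([[cid] for cid in ids])
```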

calc_band_mean_sd.py

Calculates statistical metrics (mean, standard deviation, minimum, and maximum values) for each set of bands in the final merged images in the chips_filtered directory. The statistics are saved to text files, and all band values are also stored in a binary file. An additional file named global_stats.txt contains the global statistics for each band. This information can be passed into the global_stats parameter in the model pipeline configuration file.

When to run: After preparing the chip data to create the training and validation datasets. This output can be used to compute the mean and standard deviation for normalization purposes during model training.
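
A sketch of how these statistics could be accumulated with rasterio and numpy, assuming 18-band merged chips named *_merged.tif (the naming and layout are assumptions):

```python
# Accumulate per-band sums to compute global mean/sd without holding all
# chips in memory. The 18-band layout matches the merged chips described above.
from pathlib import Path
import numpy as np
import rasterio

def global_band_stats(chips_dir: str, n_bands: int = 18):
    total = np.zeros(n_bands)
    total_sq = np.zeros(n_bands)
    count = 0
    for tif in Path(chips_dir).glob("*_merged.tif"):
        with rasterio.open(tif) as src:
            data = src.read().astype("float64")  # shape: (bands, rows, cols)
        flat = data.reshape(n_bands, -1)
        total += flat.sum(axis=1)
        total_sq += (flat ** 2).sum(axis=1)
        count += flat.shape[1]
    mean = total / count
    sd = np.sqrt(total_sq / count - mean ** 2)   # Var(X) = E[X^2] - E[X]^2
    return mean, sd
```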

calc_class_weights.py

Calculates class weights for crop classification based on the frequency of each class in the mask image files found in the chips_filtered directory. The data in each image file is flattened to calculate the class weights, which are normalized so their sum equals 1. A weight of 0 is inserted for the background class (no crop), and the calculated class weights are saved to a CSV file.

When to run: After preparing the chip data to create the training and validation datasets. The calculated class weights can then be used for training the model.
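
The exact weighting formula is not spelled out above; the sketch below uses one plausible inverse-frequency scheme that matches the stated constraints (weights normalized to sum to 1, weight 0 for the background class):

```python
# Inverse-frequency class weights from flattened mask pixels, normalized to
# sum to 1, with a weight of 0 inserted for the background class (value 0).
from pathlib import Path
import numpy as np
import rasterio

def class_weights(mask_dir: str, n_classes: int = 14) -> np.ndarray:
    counts = np.zeros(n_classes)
    for tif in Path(mask_dir).glob("*.mask.tif"):
        with rasterio.open(tif) as src:
            vals = src.read(1).ravel().astype(np.int64)  # flatten the mask band
        counts += np.bincount(vals, minlength=n_classes)[:n_classes]
    inv = np.where(counts[1:] > 0, 1.0 / np.maximum(counts[1:], 1), 0.0)
    weights = np.zeros(n_classes)
    weights[1:] = inv / inv.sum()  # classes 1-13 normalized to sum to 1
    return weights                 # background class (0) keeps weight 0
```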

generate_class_distribution.py

Generates and plots the class distribution for the training and validation datasets in the crop classification pipeline. Each TIFF mask file is processed to calculate the frequency of each class, and the class distributions are plotted. The resulting plots are saved as PNG files, providing a visual representation of the class occurrences in the training and validation datasets.

When to run: After preparing the chip data to create the training and validation datasets. The plots aid in understanding the data distribution and ensuring balanced class representation for model training.

Data Prerequisites

The required source files to download are:

Dataset Summary

Each HLS scene requires the following spectral bands in GeoTIFF format:

  1. Blue
  2. Green
  3. Red
  4. Narrow NIR
  5. SWIR 1
  6. SWIR 2

HLS scenes are clipped to match the bounding box of the CDL chips, covering a 224 x 224 pixel area at 30 m spatial resolution. The scenes are then merged into a single GeoTIFF containing 18 bands: the six spectral bands for each of three time steps, representing three observations throughout the growing season. Each GeoTIFF is accompanied by a mask containing one band with the target class for each pixel.

The processed HLS scenes are saved to the directory ./data/training_datasets/<training dataset name>, where <training dataset name> represents the value passed into the train_dataset_name property in the configuration file.

Band Order

In each input GeoTIFF the following bands are repeated three times, once for each of the three observations throughout the growing season:

| Channel | Name   | HLS S30 band number |
|---------|--------|---------------------|
| 1       | Blue   | B02                 |
| 2       | Green  | B03                 |
| 3       | Red    | B04                 |
| 4       | NIR    | B8A                 |
| 5       | SWIR 1 | B11                 |
| 6       | SWIR 2 | B12                 |

Masks are a single band with the following values:

| Value | Class                |
|-------|----------------------|
| 0     | No Data              |
| 1     | Natural Vegetation   |
| 2     | Forest               |
| 3     | Corn                 |
| 4     | Soybeans             |
| 5     | Wetlands             |
| 6     | Developed/Barren     |
| 7     | Open Water           |
| 8     | Winter Wheat         |
| 9     | Alfalfa              |
| 10    | Fallow/Idle Cropland |
| 11    | Cotton               |
| 12    | Sorghum              |
| 13    | Other                |

Data Splits

The 3,854 chips have been randomly split into training (80%) and validation (20%) sets, with the corresponding IDs recorded in the CSV files train_data.txt and validation_data.txt.

Detailed Explanation of the Pipeline Process

Prepare Chip Detail Payloads

This payload contains the chip details that will be used to identify corresponding tiles. The information is derived from the chips_id.json file to generate a dataframe with the following fields:

  • Chip ID
  • Chip X centroid coordinate
  • Chip Y centroid coordinate
  • Tile Name

Prepare HLS Tile Spatial Context

This payload contains the spatial context for the Sentinel scene tiles in sentinel_tile_grid.kml. The derived information is:

  • Tile Name
  • Tile X centroid coordinate
  • Tile Y centroid coordinate
  • Tile bounding box

Find and identify the nearest tile to each chip

Loop through each chip and identify the closest tile by XY centroid distance. The chip payload dataframe is updated with the tile information and saved to file for later reuse.
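
A minimal sketch of the centroid matching; the dataframe column names are illustrative, mirroring the payload fields described above:

```python
# Nearest-tile assignment by squared centroid distance. Column names
# (chip_x, chip_y, tile, tile_x, tile_y) are illustrative assumptions.
import numpy as np
import pandas as pd

def assign_nearest_tile(chips: pd.DataFrame, tiles: pd.DataFrame) -> pd.DataFrame:
    cx = chips[["chip_x", "chip_y"]].to_numpy(dtype=float)
    tx = tiles[["tile_x", "tile_y"]].to_numpy(dtype=float)
    # Pairwise squared distances, shape (n_chips, n_tiles)
    d2 = ((cx[:, None, :] - tx[None, :, :]) ** 2).sum(axis=2)
    out = chips.copy()
    out["tile"] = tiles["tile"].to_numpy()[d2.argmin(axis=1)]
    return out
```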

Query metadata on the overlapping tiles

Here, we need to capture and store metadata details for all scenes of a given tile that fall below a specific cloud cover threshold. This information comes from an XML metadata file found at the STAC instance endpoint.

  1. Create a unique list of tile names captured in the chip payload dataframe.
  2. Visit the STAC endpoint, loop through each tile in the list, and check whether each scene falls below the cloud cover threshold
    1. If true, read in the XML file and capture the following details:
      • Tile ID
      • Cloud Cover
      • Scene Date
      • Spatial Coverage
      • HTTP & S3 links for the bands B02, B03, B04, B8A, B11, B12, and Fmask
  3. Save the scene results as a tile payload dataframe to file for later reuse.

Filter the tile metadata based on the spatial coverage threshold

Here we identify scenes that meet the following criteria:

  1. Cloud cover: less than 5% total cloud coverage in the entire image.
  2. Spatial coverage: a threshold of 50% or above.

The spatial coverage filter first attempts to find 3 candidate images with 100% spatial coverage, then decreases the threshold in steps (90%, 80%, ... down to 50%) until enough candidates are found.
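
A sketch of the descending-threshold search over a dataframe of scene metadata; column names are assumptions:

```python
# Relax the spatial coverage threshold from 100% down to 50% in 10-point
# steps until at least three candidate scenes remain.
import pandas as pd

def filter_candidates(scenes: pd.DataFrame, min_candidates: int = 3) -> pd.DataFrame:
    low_cloud = scenes[scenes["cloud_cover"] < 5]          # cloud cover criterion
    for threshold in range(100, 40, -10):                  # 100, 90, ..., 50
        candidates = low_cloud[low_cloud["spatial_coverage"] >= threshold]
        if len(candidates) >= min_candidates:
            return candidates
    return low_cloud[low_cloud["spatial_coverage"] >= 50]  # fall back to the floor
```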

Select candidate images based on timestamps

  1. Sort the filtered coverage list by date, grouped by tile ID
  2. Select from the list the first, middle and last scene
    • If the list contains an even number of scenes, the lower of the two middle scenes is selected as the middle scene
  3. Save list of selected tiles to disk
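
A sketch of the first/middle/last selection per tile group; column names are illustrative:

```python
# Pick the first, middle, and last scene per tile, sorted by date. For even
# counts, integer division picks the lower of the two middle scenes.
import pandas as pd

def pick_three_scenes(scenes: pd.DataFrame) -> pd.DataFrame:
    picked = []
    for _, group in scenes.sort_values("scene_date").groupby("tile_id"):
        idx = sorted({0, (len(group) - 1) // 2, len(group) - 1})
        picked.append(group.iloc[idx])
    return pd.concat(picked, ignore_index=True)
```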

Download the selected tiles from IPFS

  1. Loop through the selected tiles list and grab the CIDs for the required bands.
  2. With the CIDs, retrieve the image bands from IPFS and save them to disk, in a directory named tile-id.tile-date.v2.0

Reproject each tile based on the CDL projection

Loop through each band and reproject it to match the CDL coordinate system, EPSG:5070. Files are saved to the tiles_reprojected directory.
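
A minimal rasterio sketch of reprojecting a single file to EPSG:5070; the nearest-neighbor resampling choice is an assumption:

```python
# Reproject one GeoTIFF to EPSG:5070 with rasterio. Nearest-neighbor
# resampling is assumed here; a continuous-band workflow might use bilinear.
import rasterio
from rasterio.warp import Resampling, calculate_default_transform, reproject

def reproject_to_5070(src_path: str, dst_path: str) -> None:
    dst_crs = "EPSG:5070"
    with rasterio.open(src_path) as src:
        transform, width, height = calculate_default_transform(
            src.crs, dst_crs, src.width, src.height, *src.bounds)
        profile = src.profile.copy()
        profile.update(crs=dst_crs, transform=transform, width=width, height=height)
        with rasterio.open(dst_path, "w", **profile) as dst:
            for i in range(1, src.count + 1):
                reproject(source=rasterio.band(src, i),
                          destination=rasterio.band(dst, i),
                          src_transform=src.transform, src_crs=src.crs,
                          dst_transform=transform, dst_crs=dst_crs,
                          resampling=Resampling.nearest)
```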

Import saved dataframe files and prepare the tile chipping process

  1. Import the tile chip dataframe files and reinstate them as dataframes
  2. Create a unique list of scenes that need to be clipped
  3. Set up the CDL reclass properties in order to reclassify chips
  4. Run the Process Chip function to clip scenes for all tiles.
    • The bands for each scene are processed and clipped.
    • The resultant clip is saved to the chips folder, grouped by Chip ID.
  5. Chip details are saved to a file for validation

The resulting files for each chip are:

  • chip_xxx_xxx_merged.tif: The three HLS scenes merged together per chip, with the six bands listed in the Band Order section for the first date, then the second date, then the third date.
  • chip_xxx_xxx_Fmask.tif: Used to handle cloud cover and other quality aspects of the images. Specifically, it helps filter out bad-quality pixels, ensuring only valid data is used for generating the training dataset. NOTE: Fmask files are saved to the chips_fmask directory.
  • chip_xxx_xxx.mask.tif: Contains the target classes for each pixel in the merged image. The mask is used to train the model to classify the land cover types in the training dataset.

Filter out resultant chipped scenes

Review the chip details and filter chips based on QA values and NA values, using the following logic (a sketch follows this list):

  • Exclude chips where more than 5% of pixel values fall into a single bad QA class in any of the three HLS scenes.
  • Exclude chips that have 1 or more NA pixels in any HLS image, in any band.
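
A hedged sketch of this filter over in-memory arrays; the bad QA class codes, the NaN convention for NA pixels, and the array shapes are assumptions:

```python
# Chip-level QA/NA filter: reject a chip if any band contains an NA pixel,
# or if any scene's Fmask has >5% of pixels in a single bad QA class.
# The bad_qa_classes codes are placeholders, not the pipeline's actual list.
import numpy as np

def chip_passes(merged: np.ndarray, fmasks: list[np.ndarray],
                bad_qa_classes: tuple[int, ...]) -> bool:
    """merged: (18, 224, 224) stacked bands; fmasks: one (224, 224) array per scene."""
    if np.isnan(merged).any():  # NA pixels are assumed to be NaN; a nodata
        return False            # sentinel value would need a different check
    for fmask in fmasks:        # one Fmask per HLS scene
        codes = fmask.ravel().astype(np.int64)
        frac = np.bincount(codes, minlength=max(bad_qa_classes) + 1) / codes.size
        if any(frac[c] > 0.05 for c in bad_qa_classes):
            return False        # >5% of pixels in a single bad QA class
    return True
```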

The merged.tif and mask.tif images for the resultant selection are then copied to the chips_filtered directory.

Notes

When first determining which HLS tiles to use in the pipeline, please check for erroneous HLS tiles (see step 0a in workflow.ipynb). In our use case, we found that certain chips in southern CONUS were associated with HLS tile 01SBU, which is incorrect.


Diagrams

```mermaid
%%{init: {
    "theme": "dark",
    "themeCSS": ".cardinality text { fill: #ededed }",
    "themeVariables": {
    "primaryTextColor": "#ededed",
        "nodeBorder": "#393939",
        "mainBkg": "#292929",
        "lineColor": "orange"
        },
    "flowchart": {
        "curve": "basis",
        "useMaxWidth": true,
        "htmlLabels": true
    }
}}%%
flowchart TD
    %% Properties for the subgraph notes
    classDef sub opacity:0
    classDef note fill:#ffd, stroke:#ccb

    %% Flowchart steps
    D --> E[Query metadata on the overlapping tiles]
    E --> F[Filter set of scenes for each tile based on the spatial coverage threshold]
    F --> G[Select set of scenes based on timestamps for each tile]
    G --> H[Download the selected tiles from IPFS]
    H --> I[Reproject each tile based on the CDL projection]
    I --> J[Import saved dataframe files and prepare the tile chipping process]
    K[Filter out resultant chipped scenes]

    %% Subgraphs groups that self-contain processes
    subgraph loop1 [Prepare CDL Chip Detail and HLS Tile Payloads]
        B["Extract details from bounding box file"]
        C["Extract details from Sentinel Tile Grid KML"]
        D["Find and Identify the nearest tile to each chip
        Store results down to disk"]
        B --> D
        C --> D
    end
  
    A[Start] --> loop1

    subgraph loop2 [Gather metadata on the overlapping tiles]
        E1[Create a unique list of tile names] --> E2[Visit STAC endpoint]
        E2 --> E3[Check if the scene is above the cloud cover threshold]
        E3 --> |"True"| E4[Read XML file and capture details]
        E4 --> E2
        E3 --> |"False"| E2
    end

    E --> loop2

    subgraph loop3 [Loop through the tiles]
        H1[Download the set of bands for each scene]
        H2[Capture and store process details for tracking] 
        H1 --> H2
        H2 --> H1
    end

    H --> loop3

    subgraph loop4 [Loop through the tiles]
        I1[Loop through scene bands and reproject]
        I2[Save reprojection to disk]
        I1 --> I2
        I2 --> I1

    end

    I --> loop4

    subgraph loop5 [Run the Process Chip function]
        J1[Set-up the CDL reclass properties] --> J2[Run the Process Chip function]
        J2 --> J3[Save resultant clip to the chips folder]
        J3 --> J1
    end

    J --> loop5
    loop5 --> K
    %% Notes for the subgraphs
    subgraph noteLoop1 [" "]
        loop1
        loop1-note("
        Generation of the chip detail payload is dependent on the 
        bounding box file generated using gen_chip_bbox.ipynb.
        ")
    end

    subgraph noteG [" "]
        G
        g-note("
        Tiles are sorted by group id and date. The first, middle 
        and last scene is selected for each group.
        ")
    end

    subgraph noteK [" "]
        K
        k-note("
        Any tile sets with scenes containing
        'NA' pixels are removed from the
        final training dataset.
        ")
    end

    %% Setting the note properties for the subgraphs
    class noteLoop1,noteG,noteK sub
    class loop1-note,g-note,k-note note
```