diff --git a/NEWS.md b/NEWS.md index 0c65b437..ffd233e4 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,6 +1,7 @@ # soilDB 2.9.2 (development) - EDIT base URL (for `get_EDIT_ecoclass_by_geoUnit()` and `make_EDIT_service_URL()`) updated to new USDA-managed server: - SoilWeb-based Web Coverage Services (`soilColor.wcs()`, `ISSR800.wcs()`, `mukey.wcs()`) have been updated with FY26 maps, now including most OCONUS soil surveys (AK, HI, PR, PW, GU, AS, MP) + - `downloadSSURGO()` gains arguments `include_spatial` and `include_tabular` that are analogous to arguments of the same name from `createSSURGO()`. Thanks to feature request from @dylanbeaudette (#470). # soilDB 2.9.1 (2026-04-01) - `ROSETTA()` updates thanks to Todd Skaggs (USDA-ARS): diff --git a/R/createSSURGO.R b/R/createSSURGO.R index ed127e79..75714f98 100644 --- a/R/createSSURGO.R +++ b/R/createSSURGO.R @@ -8,32 +8,41 @@ #' of `sacatalog` table such as `areasymbol = 'CA067'`, `"areasymbol IN ('CA628', 'CA067')"` or #' `areasymbol LIKE 'CT%'`. #' -#' @param WHERE _character_. A SQL `WHERE` clause expression used to filter records in `sacatalog` table. -#' Alternately `WHERE` can be any spatial object supported by `SDA_spatialQuery()` for defining -#' the target extent. -#' @param areasymbols _character_. Character vector of soil survey area symbols e.g. `c("CA067", "CA077")`. Used -#' in lieu of `WHERE` argument. +#' @param WHERE _character_. A SQL `WHERE` clause expression used to filter records in `sacatalog` +#' table. Alternately `WHERE` can be any spatial object supported by `SDA_spatialQuery()` for +#' defining the target extent. +#' @param areasymbols _character_. Character vector of soil survey area symbols e.g. `c("CA067", +#' "CA077")`. Used in lieu of `WHERE` argument. #' @param destdir _character_. Directory to download ZIP files into. Default `tempdir()`. -#' @param exdir _character_. Directory to extract ZIP archives into. May be a directory that does not yet exist. -#' Each ZIP file will extract to a folder labeled with `areasymbol` in this directory. Default: -#' `destdir` -#' @param include_template _logical_. Include the (possibly state-specific) MS Access template database? -#' Default: `FALSE` -#' @param db _character_. Either `"SSURGO"` (default; detailed soil map) or `"STATSGO"` (general soil map). +#' @param exdir _character_. Directory to extract ZIP archives into. May be a directory that does +#' not yet exist. Each ZIP file will extract to a folder labeled with `areasymbol` in this +#' directory. Default: `destdir` +#' @param include_template _logical_. Include the (possibly state-specific) MS Access template +#' database? Default: `FALSE` +#' @param include_spatial _logical_ or _character_. Extract spatial data layers from ZIP file? +#' Default: `TRUE` inserts all spatial tables. If `include_spatial` is a _character_ vector +#' containing table names, only that set is extracted from the downloaded ZIP files. e.g. +#' `include_spatial=c("mupolygon", "featpoint")` extracts only the shapefiles (with side car +#' files) for mapunit polygons and special feature points. +#' @param include_tabular _logical_ or _character_. Include tabular data layers in database? +#' Default: `TRUE` inserts all tabular tables. If `include_tabular` is a _character_ vector +#' containing table names, only that set is extracted from the downloaded ZIP files. e.g. +#' `include_tabular=c("mapunit", "muaggatt")` writes only the `mapunit` and `muaggatt` tables. +#' Note that special feature descriptions are stored in table `"featdesc"` and metadata for each +#' soil survey area are stored in `"soil_metadata"` tables. +#' @param db _character_. Either `"SSURGO"` (default; detailed soil map) or `"STATSGO"` (general +#' soil map). #' @param extract _logical_. Extract ZIP files to `exdir`? Default: `TRUE` #' @param remove_zip _logical_. Remove ZIP files after extracting? Default: `FALSE` #' @param overwrite _logical_. Overwrite by re-extracting if directory already exists? Default: #' `FALSE` #' @param quiet _logical_. Passed to `curl::curl_download()`. -#' -#' @details When `db="STATSGO"` the `WHERE` argument is not supported. Allowed `areasymbols` include -#' `"US"` and two-letter state codes e.g. `"WY"` for the Wyoming general soils map. #' #' @export #' #' @details Pipe-delimited TXT files are found in _/tabular/_ folder extracted from a SSURGO ZIP. -#' The files are named for tables in the SSURGO schema. There is no header / the files do not have -#' column names. See the _Soil Data Access Tables and Columns Report_: +#' The files are named for tables in the SSURGO schema. There is no header and the files do not +#' have column names. See the _Soil Data Access Tables and Columns Report_: #' \url{https://sdmdataaccess.nrcs.usda.gov/documents/TablesAndColumnsReport.pdf} for details on #' tables, column names and metadata including the default sequence of columns used in TXT files. #' The function returns a `try-error` if the `WHERE`/`areasymbols` arguments result in @@ -42,57 +51,70 @@ #' have prefix `soilmu_` (mapunit), `soilsa_` (survey area), `soilsf_` (special features). There #' will also be a TXT file with prefix `soilsf_` describing any special features. Shapefile names #' then have an `a_` (polygon), `l_` (line), `p_` (point) followed by the soil survey area symbol. +#' When `db="STATSGO"` the `WHERE` argument is not supported. Allowed `areasymbols` include +#' `"US"` and two-letter state codes e.g. `"WY"` for the Wyoming general soils map. +#' +#' As in `createSSURGO()`, the `include_spatial` and `include_tabular` arguments either take a +#' logical value (default `TRUE`) or a character vector of the specific table names to include. Note +#' that when used in `downloadSSURGO()` the required metadata files are _always_ extracted to +#' facilitate mapping to user-facing table names. These arguments allow for customizing the files +#' that get extracted from ZIP files, not just filtering on file names (as is implemented with +#' pre-existing `pattern` argument). This can dramatically improve efficiency of extraction and the +#' overall size of the data in `exdir`. These arguments can be used in conjunction with the +#' `pattern` argument to fine-tune the files included in the generated snapshot database. #' #' @return _character_. Paths to downloaded ZIP files (invisibly). May not exist if `remove_zip = #' TRUE`. #' @seealso [createSSURGO()] -downloadSSURGO <- function(WHERE = NULL, +downloadSSURGO <- function(WHERE = NULL, areasymbols = NULL, - destdir = tempdir(), - exdir = destdir, + destdir = tempdir(), + exdir = destdir, include_template = FALSE, + include_spatial = TRUE, + include_tabular = TRUE, db = c('SSURGO', 'STATSGO'), - extract = TRUE, + extract = TRUE, remove_zip = FALSE, overwrite = FALSE, quiet = FALSE) { - + db <- match.arg(toupper(db), c('SSURGO', 'STATSGO')) - + if (!is.null(WHERE) && db == "STATSGO") { stop('custom WHERE clause not supported with db="STATSGO"', call. = FALSE) } - + if (!is.null(areasymbols) && db == "STATSGO") { WHERE <- areasymbols } - + if (is.null(WHERE) && is.null(areasymbols)) { stop('must specify either `WHERE` or `areasymbols` argument', call. = FALSE) } - + if (is.null(WHERE) && !is.null(areasymbols)) { WHERE <- sprintf("areasymbol IN %s", format_SQL_in_statement(areasymbols)) } - + if (!is.character(WHERE)) { # attempt passing WHERE to SDA_spatialQuery res <- suppressMessages(SDA_spatialQuery(WHERE, what = 'areasymbol')) WHERE <- paste("areasymbol IN", format_SQL_in_statement(res$areasymbol)) } - + # make WSS download URLs from areasymbol, template, date urls <- .make_WSS_download_url(WHERE, include_template = include_template, db = db) - + if (inherits(urls, 'try-error')) { message(urls[1]) return(invisible(urls)) } - + if (!dir.exists(destdir)) { dir.create(destdir, recursive = TRUE) } - + # download files for (i in seq_along(urls)) { destfile <- file.path(destdir, basename(urls[i])) @@ -100,39 +122,55 @@ downloadSSURGO <- function(WHERE = NULL, try(curl::curl_download(urls[i], destfile = destfile, quiet = quiet, mode = "wb", handle = .soilDB_curl_handle()), silent = quiet) } } - + paths <- list.files(destdir, pattern = "\\.zip$", full.names = TRUE) paths2 <- paths[grep(".*wss_(SSA|gsmsoil)_(.*)_.*", paths)] - + if (extract) { if (!quiet) { message("Extracting downloaded ZIP files...") } - + if (length(paths2) == 0) { stop("Could not find SSURGO ZIP files in `destdir`: ", destdir, call. = FALSE) } - + if (!dir.exists(exdir)) { dir.create(exdir, recursive = TRUE) } - - for (i in seq_along(paths2)) { + + res <- lapply(seq_along(paths2), function(i) { ssa <- gsub(".*wss_SSA_(.*)_.*", "\\1", paths2[i]) - if ((!dir.exists(file.path(exdir, ssa)) || overwrite) && - length(utils::unzip(paths2[i], exdir = exdir)) == 0) { + if (isTRUE(include_spatial) && isTRUE(include_tabular)) { + lz <- NULL + } else { + lz <- utils::unzip(paths2[i], list = TRUE)$Name + # need to pre-extract mstab data to map to real column names + utils::unzip(paths2[i], files = lz[grepl( + "^(mstab|mdstattabs|MetadataTable|mstabcol|mdstattabcol|MetadataColumnLookup|msidxdet|mdstatidxdet|MetadataIndexDetail)$", + tools::file_path_sans_ext(basename(lz)) + )], exdir = exdir) + inv <- .inventory_ssurgo_files(lz, exdir = exdir, include_spatial = include_spatial, include_tabular = include_tabular) + lz <- unlist(c(inv$f.shp.sc, inv$f.txt.grp)) + } + uz <- utils::unzip(paths2[i], files = lz, exdir = exdir) + if ((!dir.exists(file.path(exdir, ssa)) || overwrite) && length(uz) == 0) { message(paste('Invalid zipfile:', paths2[i])) + } else { + if (!quiet) { + message("Extracted: ", paths2[i]) + } } - } - + }) + if (remove_zip) { file.remove(paths2) } } - + invisible(paths2) } - + #' Create a database from SSURGO Exports #' #' The following database types are tested and fully supported: @@ -187,7 +225,7 @@ downloadSSURGO <- function(WHERE = NULL, #' } createSSURGO <- function(filename = NULL, exdir, - conn = NULL, + conn = NULL, pattern = NULL, include_spatial = TRUE, include_tabular = TRUE, @@ -198,16 +236,16 @@ createSSURGO <- function(filename = NULL, header = FALSE, quiet = TRUE, ...) { - + if ((missing(filename) || length(filename) == 0) && missing(conn)) { stop("`filename` should be a path to a .gpkg or .sqlite file to create or append to, or a DBIConnection should be provided via `conn`.") } - + if (missing(conn) || is.null(conn)) { # delete existing file if overwrite=TRUE; does _not_ apply to DBIConnection if (file.exists(filename)) { if (isTRUE(overwrite) && isFALSE(append)) { - file.remove(filename) + file.remove(filename) } else if (isTRUE(overwrite) && isTRUE(append)) { stop("Both overwrite=TRUE and append=TRUE; set only one argument to TRUE", call. = FALSE) } else if (isFALSE(overwrite) && isFALSE(append)) { @@ -215,10 +253,10 @@ createSSURGO <- function(filename = NULL, } } } - + # DuckDB has special spatial format, so it gets custom handling for IS_DUCKDB <- inherits(conn, "duckdb_connection") - + if (inherits(conn, 'SQLiteConnection')) { IS_GPKG <- grepl("\\.gpkg$", conn@dbname, ignore.case = TRUE)[1] filename <- conn@dbname @@ -228,49 +266,52 @@ createSSURGO <- function(filename = NULL, IS_GPKG <- FALSE } } - + if (!IS_DUCKDB && !requireNamespace("sf")) { stop("package `sf` is required to write spatial datasets to DBI data sources", call. = FALSE) - } - - f <- list.files(exdir, recursive = TRUE, pattern = pattern, full.names = TRUE) - - # create and add combined vector datasets: - # "soilmu_a", "soilmu_l", "soilmu_p", "soilsa_a", "soilsf_l", "soilsf_p" - f.shp <- f[grepl(".*\\.shp$", f)] - shp.grp <- do.call('rbind', strsplit(gsub(".*soil([musfa]{2})_([apl])_([a-z]{2}\\d{3}|[a-z]{2})\\.shp", "\\1;\\2;\\3", f.shp), ";", fixed = TRUE)) - - layer_names <- c(`mu_a` = "mupolygon", `mu_l` = "muline", `mu_p` = "mupoint", - `sa_a` = "sapolygon", `sf_l` = "featline", `sf_p` = "featpoint") - - if (is.character(include_spatial)) { - idx <- paste0(shp.grp[, 1], "_", shp.grp[, 2]) %in% names(layer_names[layer_names %in% include_spatial]) - shp.grp <- shp.grp[idx, ] - f.shp <- f.shp[idx] - include_spatial <- TRUE } - + + layer_names <- .get_spatial_layer_names() + f <- list.files(exdir, recursive = TRUE, full.names = TRUE) + fdx <- rep(TRUE, length(f)) + + if (!is.null(pattern)) { + fdx <- grepl(pattern, f) + } + + inv <- .inventory_ssurgo_files( + files = f[fdx], + layer_names = layer_names, + include_spatial = include_spatial, + include_tabular = include_tabular, + header = header + ) + + # inventory method converts partial sets (character vectors) to logical for include_* args + include_spatial <- inv$include_spatial + include_tabular <- inv$include_tabular + if ((missing(conn) || is.null(conn)) && !IS_GPKG) { - + if (!requireNamespace("RSQLite")) { stop("package 'RSQLite' is required (when `conn` is not specified)", call. = FALSE) } - + conn <- DBI::dbConnect(DBI::dbDriver("SQLite"), - filename, + filename, loadable.extensions = TRUE) - + # if user did not specify their own connection, close on exit on.exit(DBI::dbDisconnect(conn)) - } - - if (nrow(shp.grp) >= 1 && ncol(shp.grp) == 3 && include_spatial) { - f.shp.grp <- split(f.shp, list(feature = shp.grp[, 1], geom = shp.grp[, 2]), drop = TRUE) - + } + + if (nrow(inv$shp.grp) >= 1 && ncol(inv$shp.grp) == 3 && include_spatial) { + f.shp.grp <- split(inv$f.shp, list(feature = inv$shp.grp[, 1], geom = inv$shp.grp[, 2]), drop = TRUE) + if (IS_DUCKDB) { DBI::dbExecute(conn, "INSTALL spatial; LOAD spatial;") } - + for (i in seq_along(f.shp.grp)) { for (j in seq_along(f.shp.grp[[i]])) { lnm <- layer_names[match(gsub(".*soil([musfa]{2}_[apl])_.*", "\\1", f.shp.grp[[i]][j]), @@ -286,10 +327,10 @@ createSSURGO <- function(filename = NULL, } } else { shp <- sf::read_sf(f.shp.grp[[i]][j]) - + colnames(shp) <- tolower(colnames(shp)) sf::st_geometry(shp) <- "geometry" - + .st_write_sf_conn <- function(x, dsn, layer, j, overwrite) { if ((i == 1 && j == 1) && isFALSE(append)) { sf::write_sf( @@ -309,7 +350,7 @@ createSSURGO <- function(filename = NULL, ) } } - + # dissolve on dissolve_field # TODO: add nationalmusym to spatial layer? if (!is.null(dissolve_field) && dissolve_field %in% colnames(shp)) { @@ -325,20 +366,20 @@ createSSURGO <- function(filename = NULL, } y[1] }, do_union = TRUE) - + shp[[paste0(dissolve_field, ".1")]] <- NULL - + destgeom <- "MULTIPOLYGON" if (any(sf::st_geometry_type(shp) %in% c("POINT", "MULTIPOINT"))) { destgeom <- "MULTIPOINT" } else if (any(sf::st_geometry_type(shp) %in% c("LINESTRING", "MULTILINESTRING"))) { destgeom <- "MULTILINESTRING" } - + shp <- sf::st_cast(shp, destgeom) } } - + if (IS_GPKG && missing(conn)) { # writing to SQLiteConnection fails to create proper gpkg_contents entries # so use the path for GPKG only @@ -351,68 +392,37 @@ createSSURGO <- function(filename = NULL, } } } - + if (IS_GPKG) { - + if (!requireNamespace("RSQLite")) { stop("package 'RSQLite' is required (when `conn` is not specified)", call. = FALSE) } - + conn <- DBI::dbConnect(DBI::dbDriver("SQLite"), - filename, + filename, loadable.extensions = TRUE) - + # if user did not specify their own connection, close on exit on.exit(DBI::dbDisconnect(conn)) - } - - # create and add combined tabular datasets - f.txt <- f[grepl(".*\\.txt$", f)] - txt.grp <- gsub("\\.txt", "", basename(f.txt)) - - # explicit handling special feature descriptions -> "featdesc" table - txt.grp[grepl("soilsf_t_", txt.grp)] <- "featdesc" - txt.grp[grepl("soil_metadata_", txt.grp)] <- "soil_metadata" - txt.first <- unique(txt.grp[grep("^sdv|^md*s|^Metadata", txt.grp)]) - - f.txt.grp <- split(f.txt, txt.grp) - f.txt.grp[txt.first] <- lapply(f.txt.grp[txt.first], .subset, 1) - - # get table, column, index lookup tables - mstabn <- f.txt.grp[[which(names(f.txt.grp) %in% c("mstab", "mdstattabs", "MetadataTable"))[1]]][[1]] - mstabcn <- f.txt.grp[[which(names(f.txt.grp) %in% c("mstabcol", "mdstattabcols", "MetadataColumnLookup"))[1]]][[1]] - msidxdn <- f.txt.grp[[which(names(f.txt.grp) %in% c("msidxdet", "mdstatidxdet", "MetadataIndexDetail"))[1]]][[1]] - - if (length(mstabn) >= 1) { - mstab <- read.delim(mstabn[1], sep = "|", stringsAsFactors = FALSE, header = header) - mstab_lut <- c(mstab[[1]], "soil_metadata") - names(mstab_lut) <- c(mstab[[5]], "soil_metadata") - } else { - mstab_lut <- names(f.txt.grp) - names(mstab_lut) <- names(f.txt.grp) - } - - if (is.character(include_tabular)) { - f.txt.grp <- f.txt.grp[names(mstab_lut[mstab_lut %in% include_tabular])] - include_tabular <- TRUE } - + if (include_tabular) { - if (length(mstabcn) >= 1) { - mstabcol <- read.delim(mstabcn[1], sep = "|", stringsAsFactors = FALSE, header = header) + if (length(inv$mstabcn) >= 1) { + mstabcol <- read.delim(inv$mstabcn[1], sep = "|", stringsAsFactors = FALSE, header = header) } - - if (length(msidxdn) >= 1) { - msidxdet <- read.delim(msidxdn[1], sep = "|", stringsAsFactors = FALSE, header = header) + + if (length(inv$msidxdn) >= 1) { + msidxdet <- read.delim(inv$msidxdn[1], sep = "|", stringsAsFactors = FALSE, header = header) } # build type mapping from SSURGO logicaldatatype metadata # mstabcol columns: 1=tabphyname, 2=colsequence, 3=colphyname, 4=collogname, # 5=uomabbrev, 6=logicaldatatype, 7=notnull, 8=fieldsize - if (exists("mstabcol") && length(mstabcol) >= 6) { + if (length(mstabcol) >= 6) { .ssurgo_type_map <- c( - String = "character", Choice = "character", Vtext = "character", + String = "character", Choice = "character", Vtext = "character", `Date/Time` = "character", Integer = "integer", Float = "numeric", Boolean = "logical" ) @@ -420,119 +430,101 @@ createSSURGO <- function(filename = NULL, .ssurgo_type_map <- NULL } - # helper: coerce columns to schema types from mstabcol metadata - .coerce_ssurgo_types <- function(y, tablename, mstabcol, type_map) { - if (is.null(type_map) || length(mstabcol) < 6) return(y) - col_meta <- mstabcol[mstabcol[[1]] == tablename, c(3L, 6L), drop = FALSE] - for (j in seq_len(nrow(col_meta))) { - col <- col_meta[[1L]][j] - rtyp <- type_map[col_meta[[2L]][j]] - if (is.na(rtyp) || !col %in% names(y)) next - y[[col]] <- switch(rtyp, - character = as.character(y[[col]]), - integer = suppressWarnings(as.integer(as.character(y[[col]]))), - numeric = suppressWarnings(as.numeric(as.character(y[[col]]))), - logical = as.logical(y[[col]]) - ) - } - y - } + lapply(names(inv$f.txt.grp), function(x) { - lapply(names(f.txt.grp), function(x) { - if (!is.null(mstabcol)) { - newnames <- mstabcol[[3]][mstabcol[[1]] == mstab_lut[x]] + newnames <- mstabcol[[3]][mstabcol[[1]] == inv$mstab_lut[x]] } - + if (!is.null(msidxdet)) { - indexPK <- na.omit(msidxdet[[4]][msidxdet[[1]] == mstab_lut[x] & grepl("PK_", msidxdet[[2]])]) - indexDI <- na.omit(msidxdet[[4]][msidxdet[[1]] == mstab_lut[x] & grepl("DI_", msidxdet[[2]])]) + indexPK <- na.omit(msidxdet[[4]][msidxdet[[1]] == inv$mstab_lut[x] & grepl("PK_", msidxdet[[2]])]) + indexDI <- na.omit(msidxdet[[4]][msidxdet[[1]] == inv$mstab_lut[x] & grepl("DI_", msidxdet[[2]])]) } - - d <- try(lapply(seq_along(f.txt.grp[[x]]), function(i) { + + d <- try(lapply(seq_along(inv$f.txt.grp[[x]]), function(i) { # message(f.txt.grp[[x]][i]) - y <- try(read.delim(f.txt.grp[[x]][i], sep = "|", stringsAsFactors = FALSE, header = header, + y <- try(read.delim(inv$f.txt.grp[[x]][i], sep = "|", stringsAsFactors = FALSE, header = header, na.strings = c("", "NA")), silent = TRUE) - + if (inherits(y, 'try-error')) { if (!quiet) { - message("File ", f.txt.grp[[x]][i], " contains no data") + message("File ", inv$f.txt.grp[[x]][i], " contains no data") } return(NULL) } else if (length(y) == 1) { - if (grepl("soil_metadata", f.txt.grp[[x]][i])) { + if (grepl("soil_metadata", inv$f.txt.grp[[x]][i])) { y <- data.frame( - areasymbol = toupper(gsub(".*soil_metadata_(.*)\\.txt", "\\1", f.txt.grp[[x]][i])), + areasymbol = toupper(gsub(".*soil_metadata_(.*)\\.txt", "\\1", inv$f.txt.grp[[x]][i])), content = paste(y[[1]], collapse = "\n") ) } else { y <- data.frame(content = y) } } else { - if (!is.null(mstab) && !header) { # preserve headers if present + if (!is.null(inv$mstab) && !header) { # preserve headers if present colnames(y) <- newnames } # enforce schema types from metadata - y <- .coerce_ssurgo_types(y, mstab_lut[x], mstabcol, .ssurgo_type_map) + y <- .coerce_ssurgo_types(y, inv$mstab_lut[x], mstabcol, .ssurgo_type_map) } - if (is.na(mstab_lut[x])) { + if (is.na(inv$mstab_lut[x])) { # readme, version return(NULL) } - + # remove deeper rules from cointerp for smaller DB size # most people only use depth==0 (default) - if (mstab_lut[x] == "cointerp" && !is.null(maxruledepth)) { + if (inv$mstab_lut[x] == "cointerp" && !is.null(maxruledepth)) { y <- y[y$ruledepth <= maxruledepth, ] } - + if ("musym" %in% colnames(y)) { y$musym <- as.character(y$musym) } - + try({ if (i == 1 && isFALSE(append)) { - DBI::dbWriteTable(conn, mstab_lut[x], y, overwrite = overwrite) + DBI::dbWriteTable(conn, inv$mstab_lut[x], y, overwrite = overwrite) } else { - if (DBI::dbExistsTable(conn, mstab_lut[x]) && x %in% txt.first) { - # skip writing sdv/mds* metadata tables to avoid uniqueness issues + if (DBI::dbExistsTable(conn, inv$mstab_lut[x]) && x %in% inv$txt.first) { + # skip writing sdv/mds* metadata tables to avoid uniqueness issues return(FALSE) } - DBI::dbWriteTable(conn, mstab_lut[x], y, append = TRUE) + DBI::dbWriteTable(conn, inv$mstab_lut[x], y, append = TRUE) } }, silent = quiet) }), silent = quiet) - - if (length(mstab_lut[x]) && is.na(mstab_lut[x])) { - mstab_lut[x] <- x + + if (length(inv$mstab_lut[x]) && is.na(inv$mstab_lut[x])) { + inv$mstab_lut[x] <- x } - - if (length(mstab_lut[x]) && !is.na(mstab_lut[x])) { - + + if (length(inv$mstab_lut[x]) && !is.na(inv$mstab_lut[x])) { + # create pkey indices if (!is.null(indexPK) && length(indexPK) > 0) { try({ - q <- sprintf("CREATE UNIQUE INDEX IF NOT EXISTS %s ON %s (%s)", - paste0('PK_', mstab_lut[x]), mstab_lut[x], + q <- sprintf("CREATE UNIQUE INDEX IF NOT EXISTS %s ON %s (%s)", + paste0('PK_', inv$mstab_lut[x]), inv$mstab_lut[x], paste(indexPK, collapse = ",")) - if (DBI::dbExistsTable(conn, mstab_lut[x])) + if (DBI::dbExistsTable(conn, inv$mstab_lut[x])) DBI::dbExecute(conn, q) }, silent = quiet) } - + # create key indices if (!is.null(indexDI) && length(indexDI) > 0) { for (i in seq_along(indexDI)) { try({ - q <- sprintf("CREATE INDEX IF NOT EXISTS %s ON %s (%s)", - paste0('DI_', mstab_lut[x]), mstab_lut[x], indexDI[i]) - if (DBI::dbExistsTable(conn, mstab_lut[x])) + q <- sprintf("CREATE INDEX IF NOT EXISTS %s ON %s (%s)", + paste0('DI_', inv$mstab_lut[x]), inv$mstab_lut[x], indexDI[i]) + if (DBI::dbExistsTable(conn, inv$mstab_lut[x])) DBI::dbExecute(conn, q) }, silent = quiet) } } - + # for GPKG output, add gpkg_contents (metadata for features and attributes) if (IS_GPKG) { if (!.gpkg_has_contents(conn)) { @@ -540,23 +532,144 @@ createSSURGO <- function(filename = NULL, try(.gpkg_create_contents(conn)) } # update gpkg_contents table entry - if (DBI::dbExistsTable(conn, mstab_lut[x])) { - try(.gpkg_delete_contents(conn, mstab_lut[x])) - try(.gpkg_add_contents(conn, mstab_lut[x])) + if (DBI::dbExistsTable(conn, inv$mstab_lut[x])) { + try(.gpkg_delete_contents(conn, inv$mstab_lut[x])) + try(.gpkg_add_contents(conn, inv$mstab_lut[x])) } } - + # TODO: other foreign keys/relationships? ALTER TABLE/ADD CONSTRAINT not available in SQLite # the only way to add a foreign key is via CREATE TABLE which means refactoring above two # steps into a single SQL statement (create table with primary and foreign keys) } }) } - + res <- DBI::dbListTables(conn) invisible(res) } +.inventory_ssurgo_files <- function(files, + exdir = NULL, + pattern = NULL, + layer_names = .get_spatial_layer_names(), + include_spatial = TRUE, + include_tabular = TRUE, + header = FALSE) { + + # create and add combined vector datasets: + # "soilmu_a", "soilmu_l", "soilmu_p", "soilsa_a", "soilsf_l", "soilsf_p" + f.shp <- files[grepl(".*\\.shp$", files)] + shp.grp <- do.call('rbind', strsplit( + gsub( + ".*soil([musfa]{2})_([apl])_([a-z]{2}\\d{3}|[a-z]{2})\\.shp", + "\\1;\\2;\\3", + f.shp + ), + ";", + fixed = TRUE + )) + + f.shp.sc <- character(0) + if (is.character(include_spatial)) { + idx <- paste0(shp.grp[, 1], "_", shp.grp[, 2]) %in% names(layer_names[layer_names %in% include_spatial]) + shp.grp <- shp.grp[idx, , drop = FALSE] + f.shp <- f.shp[idx] + f.shp.sc <- files[grepl(paste0( + paste0("soil", shp.grp[, 1], "_", shp.grp[, 2], "_", shp.grp[, 3]), + collapse = "|" + ), files)] + include_spatial <- TRUE + } + + # create and add combined tabular datasets + f.txt <- files[grepl(".*\\.txt$", files)] + txt.grp <- gsub("\\.txt", "", basename(f.txt)) + + # explicit handling special feature descriptions -> "featdesc" table + txt.grp[grepl("soilsf_t_", txt.grp)] <- "featdesc" + txt.grp[grepl("soil_metadata_", txt.grp)] <- "soil_metadata" + txt.first <- unique(txt.grp[grep("^sdv|^md*s|^Metadata", txt.grp)]) + + f.txt.grp <- split(f.txt, txt.grp) + f.txt.grp[txt.first] <- lapply(f.txt.grp[txt.first], .subset, 1) + + # get table, column, index lookup tables + mstabn <- f.txt.grp[[which(names(f.txt.grp) %in% c("mstab", "mdstattabs", "MetadataTable"))[1]]][[1]] + mstabcn <- f.txt.grp[[which(names(f.txt.grp) %in% c("mstabcol", "mdstattabcols", "MetadataColumnLookup"))[1]]][[1]] + msidxdn <- f.txt.grp[[which(names(f.txt.grp) %in% c("msidxdet", "mdstatidxdet", "MetadataIndexDetail"))[1]]][[1]] + + if (length(mstabn) >= 1) { + mstab <- read.delim( + ifelse( + is.null(exdir), + yes = mstabn[1], + no = file.path(exdir, mstabn[1]) + ), + sep = "|", + stringsAsFactors = FALSE, + header = header + ) + mstab_lut <- c(mstab[[1]], "soil_metadata") + names(mstab_lut) <- c(mstab[[5]], "soil_metadata") + } else { + mstab_lut <- names(f.txt.grp) + names(mstab_lut) <- names(f.txt.grp) + } + + if (is.character(include_tabular)) { + f.txt.grp <- f.txt.grp[names(mstab_lut[mstab_lut %in% include_tabular])] + include_tabular <- TRUE + } + + list( + f.shp = f.shp, + shp.grp = shp.grp, + f.shp.sc = f.shp.sc, + include_spatial = include_spatial, + f.txt.grp = f.txt.grp, + txt.first = txt.first, + include_tabular = include_tabular, + mstabn = mstabn, + mstabcn = mstabcn, + msidxdn = msidxdn, + mstab = mstab, + mstab_lut = mstab_lut + ) +} + +.get_spatial_layer_names <- function() { + c( + `mu_a` = "mupolygon", + `mu_l` = "muline", + `mu_p` = "mupoint", + `sa_a` = "sapolygon", + `sf_l` = "featline", + `sf_p` = "featpoint" + ) +} + +# helper: coerce columns to schema types from mstabcol metadata +.coerce_ssurgo_types <- function(y, tablename, mstabcol, type_map) { + if (is.null(type_map) || length(mstabcol) < 6) + return(y) + col_meta <- mstabcol[mstabcol[[1]] == tablename, c(3L, 6L), drop = FALSE] + for (j in seq_len(nrow(col_meta))) { + col <- col_meta[[1L]][j] + rtyp <- type_map[col_meta[[2L]][j]] + if (is.na(rtyp) || !col %in% names(y)) + next + y[[col]] <- switch( + rtyp, + character = as.character(y[[col]]), + integer = suppressWarnings(as.integer(as.character(y[[col]]))), + numeric = suppressWarnings(as.numeric(as.character(y[[col]]))), + logical = as.logical(y[[col]]) + ) + } + y +} + ## From https://github.com/brownag/gpkg ----- #' Add, Remove, Update and Create `gpkg_contents` table and records @@ -569,9 +682,9 @@ createSSURGO <- function(filename = NULL, #' @noRd #' @keywords internal .gpkg_add_contents <- function(con, table_name, description = "", template = NULL) { - + stopifnot(requireNamespace("RSQLite")) - + if (!missing(template) && !is.null(template) && is.list(template) && @@ -582,13 +695,13 @@ createSSURGO <- function(filename = NULL, ex <- c(-180, -90, 180, 90) cr <- 4326 } - + # append to gpkg_contents RSQLite::dbExecute(con, paste0( - "INSERT INTO gpkg_contents (table_name, data_type, identifier, + "INSERT INTO gpkg_contents (table_name, data_type, identifier, description, last_change, - min_x, min_y, max_x, max_y, srs_id) + min_x, min_y, max_x, max_y, srs_id) VALUES ('", table_name , "', 'attributes', '", @@ -613,7 +726,7 @@ createSSURGO <- function(filename = NULL, RSQLite::dbExecute(con, paste0("DELETE FROM gpkg_contents WHERE table_name = '", table_name, "'")) } -#' @description `.gpkg_has_contents()`: Determine if a database has table named `"gpkg_contents"` +#' @description `.gpkg_has_contents()`: Determine if a database has table named `"gpkg_contents"` #' @noRd #' @keywords internal .gpkg_has_contents <- function(con) { @@ -621,7 +734,7 @@ createSSURGO <- function(filename = NULL, isTRUE("gpkg_contents" %in% RSQLite::dbListTables(con)) } -#' @description `.gpkg_has_contents()`: Determine if a database has table named `"gpkg_contents"` +#' @description `.gpkg_has_contents()`: Determine if a database has table named `"gpkg_contents"` #' @noRd #' @keywords internal .gpkg_create_contents <- function(con) { @@ -639,7 +752,7 @@ createSSURGO <- function(filename = NULL, srs_id INTEGER, CONSTRAINT fk_gc_r_srs_id FOREIGN KEY (srs_id) REFERENCES gpkg_spatial_ref_sys(srs_id) )" - + if (!.gpkg_has_contents(con)) { RSQLite::dbExecute(con, q) } else return(1) diff --git a/man/downloadSSURGO.Rd b/man/downloadSSURGO.Rd index 178335b4..01663e0a 100644 --- a/man/downloadSSURGO.Rd +++ b/man/downloadSSURGO.Rd @@ -10,6 +10,8 @@ downloadSSURGO( destdir = tempdir(), exdir = destdir, include_template = FALSE, + include_spatial = TRUE, + include_tabular = TRUE, db = c("SSURGO", "STATSGO"), extract = TRUE, remove_zip = FALSE, @@ -18,23 +20,36 @@ downloadSSURGO( ) } \arguments{ -\item{WHERE}{\emph{character}. A SQL \code{WHERE} clause expression used to filter records in \code{sacatalog} table. -Alternately \code{WHERE} can be any spatial object supported by \code{SDA_spatialQuery()} for defining -the target extent.} +\item{WHERE}{\emph{character}. A SQL \code{WHERE} clause expression used to filter records in \code{sacatalog} +table. Alternately \code{WHERE} can be any spatial object supported by \code{SDA_spatialQuery()} for +defining the target extent.} -\item{areasymbols}{\emph{character}. Character vector of soil survey area symbols e.g. \code{c("CA067", "CA077")}. Used -in lieu of \code{WHERE} argument.} +\item{areasymbols}{\emph{character}. Character vector of soil survey area symbols e.g. \code{c("CA067", "CA077")}. Used in lieu of \code{WHERE} argument.} \item{destdir}{\emph{character}. Directory to download ZIP files into. Default \code{tempdir()}.} -\item{exdir}{\emph{character}. Directory to extract ZIP archives into. May be a directory that does not yet exist. -Each ZIP file will extract to a folder labeled with \code{areasymbol} in this directory. Default: -\code{destdir}} +\item{exdir}{\emph{character}. Directory to extract ZIP archives into. May be a directory that does +not yet exist. Each ZIP file will extract to a folder labeled with \code{areasymbol} in this +directory. Default: \code{destdir}} -\item{include_template}{\emph{logical}. Include the (possibly state-specific) MS Access template database? -Default: \code{FALSE}} +\item{include_template}{\emph{logical}. Include the (possibly state-specific) MS Access template +database? Default: \code{FALSE}} -\item{db}{\emph{character}. Either \code{"SSURGO"} (default; detailed soil map) or \code{"STATSGO"} (general soil map).} +\item{include_spatial}{\emph{logical} or \emph{character}. Extract spatial data layers from ZIP file? +Default: \code{TRUE} inserts all spatial tables. If \code{include_spatial} is a \emph{character} vector +containing table names, only that set is extracted from the downloaded ZIP files. e.g. +\code{include_spatial=c("mupolygon", "featpoint")} extracts only the shapefiles (with side car +files) for mapunit polygons and special feature points.} + +\item{include_tabular}{\emph{logical} or \emph{character}. Include tabular data layers in database? +Default: \code{TRUE} inserts all tabular tables. If \code{include_tabular} is a \emph{character} vector +containing table names, only that set is extracted from the downloaded ZIP files. e.g. +\code{include_tabular=c("mapunit", "muaggatt")} writes only the \code{mapunit} and \code{muaggatt} tables. +Note that special feature descriptions are stored in table \code{"featdesc"} and metadata for each +soil survey area are stored in \code{"soil_metadata"} tables.} + +\item{db}{\emph{character}. Either \code{"SSURGO"} (default; detailed soil map) or \code{"STATSGO"} (general +soil map).} \item{extract}{\emph{logical}. Extract ZIP files to \code{exdir}? Default: \code{TRUE}} @@ -58,12 +73,9 @@ To specify the Soil Survey Areas you would like to obtain data you use a \code{W of \code{sacatalog} table such as \code{areasymbol = 'CA067'}, \code{"areasymbol IN ('CA628', 'CA067')"} or \verb{areasymbol LIKE 'CT\%'}. -When \code{db="STATSGO"} the \code{WHERE} argument is not supported. Allowed \code{areasymbols} include -\code{"US"} and two-letter state codes e.g. \code{"WY"} for the Wyoming general soils map. - Pipe-delimited TXT files are found in \emph{/tabular/} folder extracted from a SSURGO ZIP. -The files are named for tables in the SSURGO schema. There is no header / the files do not have -column names. See the \emph{Soil Data Access Tables and Columns Report}: +The files are named for tables in the SSURGO schema. There is no header and the files do not +have column names. See the \emph{Soil Data Access Tables and Columns Report}: \url{https://sdmdataaccess.nrcs.usda.gov/documents/TablesAndColumnsReport.pdf} for details on tables, column names and metadata including the default sequence of columns used in TXT files. The function returns a \code{try-error} if the \code{WHERE}/\code{areasymbols} arguments result in @@ -72,6 +84,17 @@ Several ESRI shapefiles are found in the \emph{/spatial/} folder extracted from have prefix \code{soilmu_} (mapunit), \code{soilsa_} (survey area), \code{soilsf_} (special features). There will also be a TXT file with prefix \code{soilsf_} describing any special features. Shapefile names then have an \code{a_} (polygon), \code{l_} (line), \code{p_} (point) followed by the soil survey area symbol. +When \code{db="STATSGO"} the \code{WHERE} argument is not supported. Allowed \code{areasymbols} include +\code{"US"} and two-letter state codes e.g. \code{"WY"} for the Wyoming general soils map. + +As in \code{createSSURGO()}, the \code{include_spatial} and \code{include_tabular} arguments either take a +logical value (default \code{TRUE}) or a character vector of the specific table names to include. Note +that when used in \code{downloadSSURGO()} the required metadata files are \emph{always} extracted to +facilitate mapping to user-facing table names. These arguments allow for customizing the files +that get extracted from ZIP files, not just filtering on file names (as is implemented with +pre-existing \code{pattern} argument). This can dramatically improve efficiency of extraction and the +overall size of the data in \code{exdir}. These arguments can be used in conjunction with the +\code{pattern} argument to fine-tune the files included in the generated snapshot database. } \seealso{ \code{\link[=createSSURGO]{createSSURGO()}}