Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider a readr-style "problems" report on parsing failures #221

Open
akgold opened this issue Apr 15, 2021 · 5 comments
Open

Consider a readr-style "problems" report on parsing failures #221

akgold opened this issue Apr 15, 2021 · 5 comments

Comments

@akgold
Copy link

akgold commented Apr 15, 2021

On read, a column is assumed to be logical after some (maybe 1,000?) rows of missing data. If a character is at the bottom, then this silently fails with no error.

my_str <- "My Str"
empty_sheet_url <- "<url>"
tbl <- tibble::tibble(i = 1:2000, str = c(rep("", 1999), my_str))
googlesheets4::write_sheet(tbl, empty_sheet_url)

df <- googlesheets4::read_sheet(empty_sheet_url, "tbl")
testthat::expect_type(df$str, "character") #fails, is logical
testthat::expect_equal(df$str, c(rep("", 1999), my_str)) #fails, is all NA
@jennybc
Copy link
Member

jennybc commented Jul 11, 2021

I'd say this is all working as documented. This is how googlesheets4, readxl, and readr have always worked: there's a guess_max parameter that controls how many rows we look at when guessing column type.

(It's possible that round tripping empty strings should be easier? But that's a separate matter.)

library(googlesheets4)
library(googledrive)
library(tidyverse)

# hidden auth chunk here
tbl <- tibble(i = 1:2000, str = c(rep("", 1999), "blah"))
ss <- gs4_create(sheets = tbl)
#> ✓ Creating new Sheet: 'compulsory-fanworms'

# saw some weird errors, so let's slow things down
Sys.sleep(2)

dat <- read_sheet(ss, "tbl")
#> ✓ Reading from 'compulsory-fanworms'
#> ✓ Range ''tbl''
dat
#> # A tibble: 2,000 x 2
#>        i str  
#>    <dbl> <lgl>
#>  1     1 NA   
#>  2     2 NA   
#>  3     3 NA   
#>  4     4 NA   
#>  5     5 NA   
#>  6     6 NA   
#>  7     7 NA   
#>  8     8 NA   
#>  9     9 NA   
#> 10    10 NA   
#> # … with 1,990 more rows
tail(dat)
#> # A tibble: 6 x 2
#>       i str  
#>   <dbl> <lgl>
#> 1  1995 NA   
#> 2  1996 NA   
#> 3  1997 NA   
#> 4  1998 NA   
#> 5  1999 NA   
#> 6  2000 NA

The guess_max argument is important:

guess_max Maximum number of data rows to use for guessing column types.

Its default is 1000 (or n_max, if that’s smaller):

read_sheet(ss, ..., guess_max = min(1000, n_max))

This is also how readxl and readr work.

dat <- read_sheet(ss, "tbl", guess_max = Inf)
#> ✓ Reading from 'compulsory-fanworms'
#> ✓ Range ''tbl''
dat
#> # A tibble: 2,000 x 2
#>        i str  
#>    <dbl> <chr>
#>  1     1 <NA> 
#>  2     2 <NA> 
#>  3     3 <NA> 
#>  4     4 <NA> 
#>  5     5 <NA> 
#>  6     6 <NA> 
#>  7     7 <NA> 
#>  8     8 <NA> 
#>  9     9 <NA> 
#> 10    10 <NA> 
#> # … with 1,990 more rows
tail(dat)
#> # A tibble: 6 x 2
#>       i str  
#>   <dbl> <chr>
#> 1  1995 <NA> 
#> 2  1996 <NA> 
#> 3  1997 <NA> 
#> 4  1998 <NA> 
#> 5  1999 <NA> 
#> 6  2000 blah

is.character(dat$str)
#> [1] TRUE

An absence of cell data is brought in as NA, but we can use
tidyr::replace_na() to bring back explicit empty strings, if desired.

dat <- dat %>% 
  replace_na(list(str = ""))

identical(tbl$str, dat$str)
#> [1] TRUE

drive_rm(ss)
#> File deleted:
#> • 'compulsory-fanworms' <id: 1Mu0JrRh0ZASocbL0SosFeCYvaC-4_ECCoRmkkixJfTU>

Created on 2021-07-11 by the reprex package (v2.0.0.9000)

@akgold
Copy link
Author

akgold commented Jul 12, 2021

That totally makes sense. The thing that made this difficult for me to puzzle out was the silence of this potentially destructive behavior. I wasn't aware of guess_max, and no error or warning occurred here to alert me to that being what was going on. My "solution" was to put a junk character in the first row so it would guess correctly.

Maybe there's no easy way to tell when this occurs, but something like, "First value non-blank value in col str found after guess_max, coercing to logical." would've been super helpful.

@jennybc
Copy link
Member

jennybc commented Jul 12, 2021

In readr this would show up in the "problems" report. I'll reopen this.

readr::read_csv("x,y\na,\nc,d", guess_max = 1)
#> Warning: 1 parsing failure.
#> row col           expected actual         file
#>   2   y 1/0/T/F/TRUE/FALSE      d literal data
#> # A tibble: 2 x 2
#>   x     y    
#>   <chr> <lgl>
#> 1 a     NA   
#> 2 c     NA

Created on 2021-07-12 by the reprex package (v2.0.0.9000)

@jennybc jennybc reopened this Jul 12, 2021
@jennybc jennybc changed the title Non-missing entry in mostly empty column fails on read Consider a reader-style "problems" report on parsing failures Jul 12, 2021
@jennybc jennybc changed the title Consider a reader-style "problems" report on parsing failures Consider a readr-style "problems" report on parsing failures Jul 12, 2021
@jennybc
Copy link
Member

jennybc commented Jul 12, 2021

My "solution" was to put a junk character in the first row so it would guess correctly.

Just as an FYI for all of these packages, I think it's more common to do guess_max = Inf either as a policy (if one works with datasets where the size is not prohibitive for this) or when doing the initial "getting to know you" intake of a dataset. Definitely better than editing the data itself. The real gold standard is to provide column types (I would definitely specify that a column is a character, before I added junk data to influence guessing).

@jennybc
Copy link
Member

jennybc commented Jul 12, 2021

This will presumably get bundled up with the Great Col Spec Project that is coming soon for me (googlesheets4 & readxl) #51.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants