Releases: tidyverse/duckplyr

duckplyr 1.2.1

12 Mar 03:17

Bug fixes

  • Filter write-only options before passing to read functions in compute_parquet() and compute_csv() (#886, #887).

duckplyr 1.2.0

27 Feb 03:15

Features

  • Establish compatibility with dplyr 1.2.0, which is now the minimum required version.

  • New read_tbl_duckdb() reads a table from a DuckDB database file by attaching it to the default connection (#414, #828).

    library(duckplyr)  # also attaches dplyr, which provides filter()

    db_path <- tempfile(fileext = ".duckdb")
    con <- DBI::dbConnect(duckdb::duckdb(), db_path)
    DBI::dbWriteTable(con, "my_table", data.frame(x = 1:5, y = letters[1:5]))
    DBI::dbDisconnect(con)
    
    read_tbl_duckdb(db_path, "my_table") |>
      filter(x > 2)
    
    unlink(db_path)

  • first(), last(), nth(), round(), and n() inside mutate(.by = ...) are now translated directly to DuckDB (#626, #854).

    duckdb_tibble(g = c("a", "a", "b", "b", "b"), x = c(10, 20, 30, 40, 50), .prudence = "stingy") |>
      summarise(.by = g, first_x = first(x), last_x = last(x), second_x = nth(x, 2))
    
    duckdb_tibble(g = c("a", "a", "b", "b"), x = 1:4, .prudence = "stingy") |>
      mutate(count = n(), .by = g)

  • compute_parquet() and compute_csv() now accept an options argument that passes format-specific settings to the underlying DuckDB operation; the options are also applied when reading back the data (#729, #821).

    df <- duckdb_tibble(x = 1:3, y = c("a", "b", "c"), .prudence = "stingy")
    path <- tempfile(fileext = ".parquet")
    compute_parquet(df, path, options = list(compression = "zstd"))

  • compute_parquet() and compute_csv() are now generic S3 functions, making it easier to add methods for custom classes (#746, #818).

  • Functions with named arguments are now translated to DuckDB (#822).

    duckdb_tibble(x = c(1.23, 4.56, 7.89), .prudence = "stingy") |>
      mutate(y = round(x, digits = 1L))

  • transmute() can now reference new variables created within the same call (#796, #819).

    duckdb_tibble(x = 1:3, .prudence = "stingy") |>
      transmute(y = x * 2, z = y + 10)

  • Add experimental translation for filter_out() (#869, #870).

    duckdb_tibble(x = 1:3, .prudence = "stingy") |>
      filter_out(x > 2)

Documentation

  • Document row.names incompatibility (#603, #825).

  • Add examples for specifying CSV column types by name (#775, #820).

  • Add superseded lifecycle badge to transmute() documentation (#364, #824).

  • Add blog post to pkgdown config (#612, #827).

  • Review contributing guide (#657).

Chore

  • Align internal tests with dplyr 1.2.0 (#863).

  • Migrate from deprecated qs to qs2 (#846, #847).

  • Format code with air.

duckplyr 1.1.3

06 Nov 02:19

Features

  • read_file_duckdb() only wraps path into a list if the length is not equal to one, to support read_stat().

Continuous integration

  • Avoid example failing in R 4.2 and older.

Documentation

  • Add "Supported by Posit" badge.

duckplyr 1.1.2

20 Sep 01:59

Features

  • Fully support dd::...() syntax (#795).

  • Threshold for prudence = "thrifty" is reduced to 1000 cells when the data comes from a remote data source.

  • Support named arguments for dd::...() functions.

Performance

  • Generate a more balanced expression when translating %in% to avoid performance problems in duckdb v1.4.0.

duckplyr 1.1.1

01 Aug 02:59

Chore

  • Fix CRAN failure with _R_CHECK_THINGS_IN_OTHER_DIRS_=true.

duckplyr 1.1.0

10 May 02:14

This release improves compatibility with dbplyr and DuckDB.
See vignette("duckdb") for details.

Features

  • Pass functions prefixed with dd$ directly to DuckDB, e.g., dd$ROW() will be translated as DuckDB's ROW() function (#658).

  • New as_tbl() to convert to a dbplyr tbl object (#634, #685).

  • Register Ark methods for Positron's "Variables" pane (@DavisVaughan, #661, #678). DuckDB tibbles are no longer displayed as data frames in the "Variables" pane due to a limitation in Positron. Use collect() to convert them to data frames if you rely on the viewer functionality.

  • Translate n_distinct() as macro with support for na.rm = TRUE (@joakimlinde, #572, #655).

  • Translate coalesce().

  • compute() does not have a fallback; failures are reported to the client (#637).

  • Implement slice_head() (#640).
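
To illustrate the dd$ pass-through listed above, here is a minimal sketch that calls DuckDB's damerau_levenshtein() from inside mutate(). The column names and sample strings are made up for this example, and it assumes duckplyr 1.1.0 or later.

```r
library(duckplyr)

# Illustrative data; the column names a and b are arbitrary.
words <- duckdb_tibble(a = "dbplyr", b = "duckplyr", .prudence = "stingy")

# damerau_levenshtein() is a DuckDB function, not an R function.
# The dd$ prefix asks duckplyr to pass the call to DuckDB unchanged.
edit_dist <- words |>
  mutate(dist = dd$damerau_levenshtein(a, b)) |>
  collect()

edit_dist
```

Because the frame is created with .prudence = "stingy", the explicit collect() is what brings the result into R.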

Bug fixes

  • Set functions like union() no longer trigger materialization (#654, #692).

  • Joins no longer materialize the input data when the package is used with methods_overwrite() or library(duckplyr) (#641).

  • Correct formatting for controlled fallbacks with Sys.setenv(DUCKPLYR_FALLBACK_INFO = TRUE).

Chore

  • Bump duckdb and pillar dependencies.

  • Use roxyglobals from CRAN rather than GitHub (@andreranza, #659).

  • Bring tools and patch up to date (@joakimlinde, #647).

  • Internal rel_to_df() needs prudence argument (#644).

  • Fix sync scripts and add reproducible code (#639).

  • Check loadability of extensions in test (#636).

Documentation

  • Document slice_head() as supported.

  • Add Posit's ROR ID (#592).

  • Add vignette("duckdb") (#690).

  • Add experimental badge.

  • Verbose conflict_prefer() (#667, #684).

  • Typos + clarification edits to "large" vignette (@mine-cetinkaya-rundel, #665).

Testing

  • Skip tests using grep() or sub() on CRAN.

duckplyr 1.0.1

01 Mar 02:17

Bug fixes

  • Check if extensions can be loaded before running examples and vignettes (#620).

  • Show source of error if data frame cannot be converted to duck frame (#614).

  • Correct formatting for controlled fallbacks with Sys.setenv(DUCKPLYR_FALLBACK_INFO = TRUE).

Chore

  • Require duckdb >= 1.2.0 (#619).

  • Break this version with duckdb 2.0.0 (#623).

Documentation

  • Separate ?compute_parquet and ?compute_csv (#610, #622).

  • Italicize book title in README (@wibeasley, #607).

  • Fix typo in filter(.by = ...) error message (@maelle, #611).

  • Fix link in documentation (#600, #601).

duckplyr 1.0.0

09 Feb 02:05
097102f

Features

Large data

  • Improved support for handling large data from files and S3: ingestion with read_parquet_duckdb() and others, and materialization with as_duckdb_tibble(), compute.duckplyr_df() and compute_file(). See vignette("large") for details.

  • Control automatic materialization of duckplyr frames with the new prudence argument to as_duckdb_tibble(), duckdb_tibble(), compute.duckplyr_df() and compute_file(). See vignette("prudence") for details.
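
The two points above can be sketched together as follows. A small temporary Parquet file stands in for a large data set, and the column names id and value are invented for the example. Under "stingy" prudence, operations that need the data in R are expected to fail until the result is collected explicitly.

```r
library(duckplyr)

# Write a small Parquet file that stands in for a large data set.
path <- tempfile(fileext = ".parquet")
duckdb_tibble(id = 1:10, value = (1:10) * 1.5) |>
  compute_parquet(path)

# Read it lazily; "stingy" prudence forbids automatic materialization.
big <- read_parquet_duckdb(path, prudence = "stingy")

# nrow() needs the data in R, so it should raise an error here.
res <- try(nrow(big), silent = TRUE)
inherits(res, "try-error")

# The filter runs in DuckDB; collect() materializes only the result.
small <- big |>
  filter(id > 8) |>
  collect()

unlink(path)
```

See vignette("prudence") for how the "lavish" and "thrifty" settings differ from "stingy".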

New functions

  • read_csv_duckdb() and others, deprecating duckplyr_df_from_csv() and df_from_csv() (#210, #396, #459).

  • read_sql_duckdb() (experimental) to run SQL queries against the default DuckDB connection and return the result as a duckplyr frame (duckdb/duckdb-r#32, #397).

  • db_exec() to execute configuration queries against the default duckdb connection (#39, #165, #227, #404, #459).

  • duckdb_tibble() (#382, #457).

  • as_duckdb_tibble(), replaces as_duckplyr_tibble() and as_duckplyr_df() (#383, #457) and supports dbplyr connections to a duckdb database (#86, #211, #226).

  • compute_parquet() and compute_csv(), implement compute.duckplyr_df() (#409, #430).

  • fallback_config() to create a configuration file for the settings that do not affect behavior (#216, #426).

  • is_duckdb_tibble(), deprecates is_duckplyr_df() (#391, #392).

  • last_rel() to retrieve the last relation object used in materialization (#209, #375).

  • Add "prudent_duckplyr_df" class that stops automatic materialization and requires collect() (#381, #390).
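
A brief sketch of how db_exec() and read_sql_duckdb() from the list above might be combined. The query and the thread setting are arbitrary choices for illustration, and read_sql_duckdb() is marked experimental, so its behavior may change.

```r
library(duckplyr)

# Configure the default DuckDB connection (here: limit worker threads).
db_exec("SET threads TO 2")

# Run a SQL query on the same connection; the result is a duckplyr frame.
answer <- read_sql_duckdb("SELECT 42 AS answer, 'duck' AS animal") |>
  collect()

answer
```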

Translations

  • Partial support for across() in mutate() and summarise() (#296, #306, #318, @lionel-, @DavisVaughan).

  • Implement na.rm handling for sum(), min(), max(), any() and all(), with fallback for window functions (#205, #566).

  • Add support for sub() and gsub() (@toppyy, #420).

  • Handle dplyr::desc() (#550).

  • Avoid forwarding is.na() to is.nan() to support non-numeric data, and avoid checking roundtrip for timestamp data (#482).

  • Correctly handle missing values in if_else().

  • Limit number of items that can be handled with %in% (#319).

  • duckdb_tibble() checks if columns can be represented in DuckDB (#537).

  • Fall back to dplyr when the multiple argument is passed to joins (#323).
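
A minimal sketch exercising some of the translations above: na.rm in sum(), desc() and gsub(). The data and column names are invented. The default prudence is used, so a call that is not translated in a given version falls back to dplyr and gives the same result.

```r
library(duckplyr)

# Illustrative data with one missing amount.
sales <- duckdb_tibble(
  region = c("north", "north", "south"),
  amount = c(10, NA, 5)
)

# sum() with na.rm = TRUE, then a descending sort with desc().
totals <- sales |>
  summarise(.by = region, total = sum(amount, na.rm = TRUE)) |>
  arrange(desc(total)) |>
  collect()

# gsub() with a constant pattern.
renamed <- sales |>
  mutate(region = gsub("th", "TH", region)) |>
  collect()
```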

Error messages

  • Improve fallback error message by explicitly materializing (#432, #456).

  • Point to the native CSV reader if encountering data frames read with readr (#127, #469).

  • Improve as_duckdb_tibble() error message for invalid x (@maelle, #339).

Behavior

  • Depend on dplyr instead of reexporting all generics (#405). Nothing changes for users in scripts. When using duckplyr in a package, you now also need to import dplyr.

  • Fallback logging is now on by default and can be disabled with configuration (#422).

  • The default DuckDB connection is now based on a file. The location defaults to a subdirectory of tempdir() and can be controlled with the DUCKPLYR_TEMP_DIR environment variable (#439, #448, #561).

  • collect() returns a tibble (#438, #447).

  • explain() returns the input, invisibly (#331).
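
The last two points can be checked with a short sketch. The frame below is made up for illustration.

```r
library(duckplyr)

df <- duckdb_tibble(x = 1:3)

# collect() materializes the frame as a tibble.
out <- collect(df)
tibble::is_tibble(out)

# explain() prints the query plan and returns its input invisibly,
# so it can sit in the middle of a pipeline.
vis <- withVisible(explain(df))
vis$visible
```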

Bug fixes

  • Compute ptype only for join columns in a safe way without materialization, not for the entire data frame (#289).

  • Internal expr_scrub() (used for telemetry) can handle function-definitions (@toppyy, #268, #271).

  • Harden telemetry code against invalid arguments (#321).

Documentation

  • New articles: vignette("large"), vignette("prudence"), vignette("fallback"), vignette("limits"), vignette("developers"), vignette("telemetry") (#207, #504).

  • New flights_df() used instead of palmerpenguins::penguins (#408).

  • Move to the tidyverse GitHub organization, new repository URL https://github.com/tidyverse/duckplyr/ (#225).

  • Avoid base pipe in examples for compatibility with R 4.0.0 (#463, #466).

Performance

  • Comparison expressions are translated in a way that allows them to be pushed down to Parquet (@toppyy, #270).

  • Printing a duckplyr frame no longer materializes (#255, #378).

  • Prefer vctrs::new_data_frame() over tibble() (#500).

duckplyr 0.4.1

14 Jul 00:49

Features

  • df_from_file() and related functions support multiple files (#194, #195), show a clear error message for non-string path arguments (#182), and create a tibble by default (#177).
  • New as_duckplyr_tibble() to convert a data frame to a duckplyr tibble (#177).
  • Support descending sort for character and other non-numeric data (@toppyy, #92, #175).
  • Avoid setting memory limit (#193).
  • Check compatibility of join columns (#168, #185).
  • Explicitly list supported functions, add contributing guide, add analysis scripts for GitHub activity data (#179).

Documentation

  • Add contributing guide (#179).
  • Show a startup message at package load if telemetry is not configured (#188, #198).
  • ?df_from_file shows how to read multiple files (#181, #186) and how to specify CSV column types (#140, #189), and is shown correctly in reference index (#173, #190).
  • Discuss dbplyr in README (#145, #191).
  • Add analysis scripts for GitHub activity data (#179).

duckplyr 0.4.0

23 May 00:45

Features

  • Use built-in rfuns extension to implement equality and inequality operators, improve translation for as.integer(), NA and %in% (#83, #154, #148, #155, #159, #160).
  • Reexport non-deprecated dplyr functions (#144, #163).
  • library(duckplyr) calls methods_overwrite() (#164).
  • Only allow constant patterns in grepl().
  • Explicitly reject calls with named arguments for now.
  • Reduce default memory limit to 1 GB.

Bug fixes

  • Stricter type checks in the set operations intersect(), setdiff(), symdiff(), union(), and union_all() (#169).
  • Distinguish between constant NA and those used in an expression (#157).
  • head(-1) forwards to the default implementation (#131, #156).
  • Fix cli syntax for internal error message (#151).
  • More careful detection of row names in data frame.
  • Always check roundtrip for timestamp columns.
  • left_join() and other join functions call auto_copy().
  • Only reset expression depth if it has been set before.
  • Require fallback if the result contains duplicate column names when ignoring case.
  • row_number() returns integer.
  • is.na(NaN) is TRUE.
  • summarise(count = n(), count = n()) creates only one column named count.
  • Correct wording in instructions for enabling fallback logging (@TimTaylor, #141).

Chore

  • Remove styler dependency (#137, #138).
  • Avoid error from stats collection.

Testing

  • Reenable tests that now run successfully (#166).
  • Synchronize tests (#153).
  • Test that vec_ptype() does not materialize (#149).
  • Improve telemetry tests.
  • Promote equality checks to expect_identical() to capture differences between doubles and integers.