Releases: tidyverse/duckplyr
duckplyr 1.2.1
Bug fixes
- Filter write-only options before passing to read functions in compute_parquet() and compute_csv() (#886, #887).
Continuous integration
- Fix failing test on macOS (@joakimlinde, #888).
duckplyr 1.2.0
Features
- Establish compatibility with dplyr 1.2.0, which is now the minimum required version.
- New read_tbl_duckdb() reads a table from a DuckDB database file by attaching it to the default connection (#414, #828).

      db_path <- tempfile(fileext = ".duckdb")
      con <- DBI::dbConnect(duckdb::duckdb(), db_path)
      DBI::dbWriteTable(con, "my_table", data.frame(x = 1:5, y = letters[1:5]))
      DBI::dbDisconnect(con)

      read_tbl_duckdb(db_path, "my_table") |>
        filter(x > 2)

      unlink(db_path)
- first(), last(), nth(), round(), and n() inside mutate(.by = ...) are now translated directly to DuckDB (#626, #854).

      duckdb_tibble(g = c("a", "a", "b", "b", "b"), x = c(10, 20, 30, 40, 50), .prudence = "stingy") |>
        summarise(.by = g, first_x = first(x), last_x = last(x), second_x = nth(x, 2))

      duckdb_tibble(g = c("a", "a", "b", "b"), x = 1:4, .prudence = "stingy") |>
        mutate(count = n(), .by = g)
- compute_parquet() and compute_csv() now accept an options argument to pass format-specific settings to the underlying DuckDB operation, and also apply them when reading back the data (#729, #821).

      df <- duckdb_tibble(x = 1:3, y = c("a", "b", "c"), .prudence = "stingy")
      path <- tempfile(fileext = ".parquet")
      compute_parquet(df, path, options = list(compression = "zstd"))
- compute_parquet() and compute_csv() are now generic S3 functions, making it easier to add methods for custom classes (#746, #818).
- Functions with named arguments are now translated to DuckDB (#822).

      duckdb_tibble(x = c(1.23, 4.56, 7.89), .prudence = "stingy") |>
        mutate(y = round(x, digits = 1L))
- transmute() can now reference new variables created within the same call (#796, #819).

      duckdb_tibble(x = 1:3, .prudence = "stingy") |>
        transmute(y = x * 2, z = y + 10)
- Add experimental translation for filter_out() (#869, #870).

      duckdb_tibble(x = 1:3, .prudence = "stingy") |>
        filter_out(x > 2)
Documentation
- Add examples for specifying CSV column types by name (#775, #820).
- Add superseded lifecycle badge to transmute() documentation (#364, #824).
- Review contributing guide (#657).
Chore
duckplyr 1.1.3
Features
- read_file_duckdb() only wraps path into a list if the length is not equal to one, to support read_stat().
Continuous integration
- Avoid example failing in R 4.2 and older.
Documentation
- Add "Supported by Posit" badge.
duckplyr 1.1.2
Features
- Fully support dd::...() syntax (#795).
- Threshold for prudence = "thrifty" is reduced to 1000 cells when the data comes from a remote data source.
- Support named arguments for dd::...() functions.
Performance
- Generate a more balanced expression when translating %in% to avoid performance problems in duckdb v1.4.0.
duckplyr 1.1.1
Chore
- Fix CRAN failure with _R_CHECK_THINGS_IN_OTHER_DIRS_=true.
duckplyr 1.1.0
This release improves compatibility with dbplyr and DuckDB.
See vignette("duckdb") for details.
Features
- Pass functions prefixed with dd$ directly to DuckDB, e.g., dd$ROW() will be translated as DuckDB's ROW() function (#658).
- New as_tbl() to convert to a dbplyr tbl object (#634, #685).
- Register Ark methods for Positron's "Variables" pane (@DavisVaughan, #661, #678). DuckDB tibbles are no longer displayed as data frames in the "Variables" pane due to a limitation in Positron. Use collect() to convert them to data frames if you rely on the viewer functionality.
- Translate n_distinct() as macro with support for na.rm = TRUE (@joakimlinde, #572, #655).
- Translate coalesce().
- compute() does not have a fallback; failures are reported to the client (#637).
- Implement slice_head() (#640).
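
As an illustration of the dd$ prefix described in the first item of this list, the following sketch calls DuckDB's damerau_levenshtein() string-distance function from inside mutate(). The column names a, b, and dist are placeholders chosen for this example, and the snippet assumes duckplyr 1.1.0 or later is installed.

```r
library(duckplyr)

# dd$<name>() is handed to DuckDB under that name instead of being
# translated from an R function; damerau_levenshtein() exists in DuckDB only.
out <- duckdb_tibble(a = "dbplyr", b = "duckplyr") |>
  mutate(dist = dd$damerau_levenshtein(a, b))

out
```

Because the function has no R counterpart, a dplyr fallback cannot compute it; the call only works while the pipeline runs in DuckDB.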
Bug fixes
- Set functions like union() no longer trigger materialization (#654, #692).
- Joins no longer materialize the input data when the package is used with methods_overwrite() or library(duckplyr) (#641).
- Correct formatting for controlled fallbacks with Sys.setenv(DUCKPLYR_FALLBACK_INFO = TRUE).
Chore
- Bump duckdb and pillar dependencies.
- Use roxyglobals from CRAN rather than GitHub (@andreranza, #659).
- Bring tools and patch up to date (@joakimlinde, #647).
- Internal rel_to_df() needs prudence argument (#644).
- Fix sync scripts and add reproducible code (#639).
- Check loadability of extensions in test (#636).
Documentation
- Document slice_head() as supported.
- Add Posit's ROR ID (#592).
- Add vignette("duckdb") (#690).
- Add experimental badge.
- Typos + clarification edits to "large" vignette (@mine-cetinkaya-rundel, #665).
Testing
- Skip tests using grep() or sub() on CRAN.
duckplyr 1.0.1
duckplyr 1.0.0
Features
Large data
- Improved support for handling large data from files and S3: ingestion with read_parquet_duckdb() and others, and materialization with as_duckdb_tibble(), compute.duckplyr_df() and compute_file(). See vignette("large") for details.
- Control automatic materialization of duckplyr frames with the new prudence argument to as_duckdb_tibble(), duckdb_tibble(), compute.duckplyr_df() and compute_file(). See vignette("prudence") for details.
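
A minimal sketch of how the prudence setting plays out in practice, assuming a current duckplyr release in which duckdb_tibble() takes the setting as .prudence. The column names are illustrative only.

```r
library(duckplyr)

# A "stingy" frame never materializes automatically
df <- duckdb_tibble(x = 1:3, .prudence = "stingy")

# Operations that DuckDB can handle stay lazy
out <- df |>
  mutate(y = x * 2)

# Direct access such as nrow(out) or out$y is expected to raise an error
# for a stingy frame; collect() materializes explicitly instead
res <- collect(out)
res
```

See vignette("prudence") for the other settings and the thresholds that apply to them.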
New functions
- read_csv_duckdb() and others, deprecating duckplyr_df_from_csv() and df_from_csv() (#210, #396, #459).
- read_sql_duckdb() (experimental) to run SQL queries against the default DuckDB connection and return the result as a duckplyr frame (duckdb/duckdb-r#32, #397).
- db_exec() to execute configuration queries against the default duckdb connection (#39, #165, #227, #404, #459).
- as_duckdb_tibble(), replaces as_duckplyr_tibble() and as_duckplyr_df() (#383, #457) and supports dbplyr connections to a duckdb database (#86, #211, #226).
- compute_parquet() and compute_csv(), implement compute.duckplyr_df() (#409, #430).
- fallback_config() to create a configuration file for the settings that do not affect behavior (#216, #426).
- is_duckdb_tibble(), deprecates is_duckplyr_df() (#391, #392).
- last_rel() to retrieve the last relation object used in materialization (#209, #375).
- Add "prudent_duckplyr_df" class that stops automatic materialization and requires collect() (#381, #390).
Translations
- Partial support for across() in mutate() and summarise() (#296, #306, #318, @lionel-, @DavisVaughan).
- Implement na.rm handling for sum(), min(), max(), any() and all(), with fallback for window functions (#205, #566).
- Handle dplyr::desc() (#550).
- Avoid forwarding is.na() to is.nan() to support non-numeric data, avoid checking roundtrip for timestamp data (#482).
- Correctly handle missing values in if_else().
- Limit number of items that can be handled with %in% (#319).
- duckdb_tibble() checks if columns can be represented in DuckDB (#537).
- Fall back to dplyr when passing multiple with joins (#323).
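
The na.rm handling listed above can be tried with a small sketch like the following. The column and result names are made up for this example, and the default prudence is used so that anything DuckDB cannot handle falls back to dplyr with the same result.

```r
library(duckplyr)

df <- duckdb_tibble(x = c(1, NA, 3))

# na.rm = TRUE drops the missing value before aggregating
res <- df |>
  summarise(
    total = sum(x, na.rm = TRUE),
    largest = max(x, na.rm = TRUE)
  ) |>
  collect()

res
```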
Error messages
- Improve fallback error message by explicitly materializing (#432, #456).
- Point to the native CSV reader if encountering data frames read with readr (#127, #469).
- Improve as_duckdb_tibble() error message for invalid x (@maelle, #339).
Behavior
- Depend on dplyr instead of reexporting all generics (#405). Nothing changes for users in scripts. When using duckplyr in a package, you now also need to import dplyr.
- Fallback logging is now on by default and can be disabled with configuration (#422).
- The default DuckDB connection is now based on a file; the location defaults to a subdirectory of tempdir() and can be controlled with the DUCKPLYR_TEMP_DIR environment variable (#439, #448, #561).
- explain() returns the input, invisibly (#331).
Bug fixes
- Compute ptype only for join columns in a safe way without materialization, not for the entire data frame (#289).
- Internal expr_scrub() (used for telemetry) can handle function-definitions (@toppyy, #268, #271).
- Harden telemetry code against invalid arguments (#321).
Documentation
- New articles: vignette("large"), vignette("prudence"), vignette("fallback"), vignette("limits"), vignette("developers"), vignette("telemetry") (#207, #504).
- New flights_df() used instead of palmerpenguins::penguins (#408).
- Move to the tidyverse GitHub organization, new repository URL https://github.com/tidyverse/duckplyr/ (#225).
- Avoid base pipe in examples for compatibility with R 4.0.0 (#463, #466).
Performance
duckplyr 0.4.1
Features
- df_from_file() and related functions support multiple files (#194, #195), show a clear error message for non-string path arguments (#182), and create a tibble by default (#177).
- New as_duckplyr_tibble() to convert a data frame to a duckplyr tibble (#177).
- Support descending sort for character and other non-numeric data (@toppyy, #92, #175).
- Avoid setting memory limit (#193).
- Check compatibility of join columns (#168, #185).
- Explicitly list supported functions, add contributing guide, add analysis scripts for GitHub activity data (#179).
Documentation
- Add contributing guide (#179).
- Show a startup message at package load if telemetry is not configured (#188, #198).
- ?df_from_file shows how to read multiple files (#181, #186) and how to specify CSV column types (#140, #189), and is shown correctly in reference index (#173, #190).
- Discuss dbplyr in README (#145, #191).
- Add analysis scripts for GitHub activity data (#179).
duckplyr 0.4.0
Features
- Use built-in rfuns extension to implement equality and inequality operators, improve translation for as.integer(), NA and %in% (#83, #154, #148, #155, #159, #160).
- Reexport non-deprecated dplyr functions (#144, #163).
- library(duckplyr) calls methods_overwrite() (#164).
- Only allow constant patterns in grepl().
- Explicitly reject calls with named arguments for now.
- Reduce default memory limit to 1 GB.
Bug fixes
- Stricter type checks in the set operations intersect(), setdiff(), symdiff(), union(), and union_all() (#169).
- Distinguish between constant NA and those used in an expression (#157).
- head(-1) forwards to the default implementation (#131, #156).
- Fix cli syntax for internal error message (#151).
- More careful detection of row names in data frame.
- Always check roundtrip for timestamp columns.
- left_join() and other join functions call auto_copy().
- Only reset expression depth if it has been set before.
- Require fallback if the result contains duplicate column names when ignoring case.
- row_number() returns integer.
- is.na(NaN) is TRUE.
- summarise(count = n(), count = n()) creates only one column named count.
- Correct wording in instructions for enabling fallback logging (@TimTaylor, #141).
Chore
Documentation
- Mention wildcards to read multiple files in ?df_from_file (@andreranza, #133, #134).