diff --git a/content/blog/duckplyr-1-1-0/index.Rmd b/content/blog/duckplyr-1-1-0/index.Rmd
new file mode 100644
index 000000000..d1aee1125
--- /dev/null
+++ b/content/blog/duckplyr-1-1-0/index.Rmd
@@ -0,0 +1,233 @@
+---
+output: hugodown::hugo_document
+
+slug: duckplyr-1-1-0
+title: duckplyr fully joins the tidyverse!
+date: 2025-06-19
+author: Kirill Müller and Maëlle Salmon
+description: >
+  duckplyr 1.1.0 is on CRAN!
+  A drop-in replacement for dplyr, powered by DuckDB for speed.
+  It is the most dplyr-like of dplyr backends.
+
+photo:
+  url: https://www.pexels.com/photo/a-mallard-duck-on-water-6918877/
+  author: Kiril Gruev
+
+# one of: "deep-dive", "learn", "package", "programming", "roundup", or "other"
+categories: [package]
+tags:
+  - duckplyr
+  - dplyr
+  - tidyverse
+---
+
+```{r include = FALSE}
+options(
+  pillar.min_title_chars = 20,
+  pillar.max_footer_lines = 7,
+  pillar.bold = TRUE
+)
+options(conflicts.policy = list(warn = FALSE))
+library(conflicted)
+conflict_prefer("filter", "dplyr", quiet = TRUE)
+```
+
+
+We're well chuffed to announce the release of [duckplyr](https://duckplyr.tidyverse.org) 1.1.0.
+This is a dplyr backend powered by [DuckDB](https://duckdb.org/), a fast in-memory analytical database system[^duckdb].
+duckplyr uses the power of DuckDB for impressive performance where it can, and seamlessly falls back to R where it can't.
+You can install it from CRAN with:
+
+[^duckdb]: If you haven't heard of it yet, watch [Hannes Mühleisen's keynote at posit::conf(2024)](https://www.youtube.com/watch?v=GELhdezYmP0&feature=youtu.be).
+
+```{r, eval = FALSE}
+install.packages("duckplyr")
+```
+
+This article shows how duckplyr can be used instead of dplyr, explains how you can help improve the package, and shares a selection of further resources.
+
+## A drop-in replacement for dplyr
+
+Imagine you have to wrangle a huge dataset, like this one from the [TPC-H benchmark](https://duckdb.org/2024/04/02/duckplyr.html#benchmark-tpc-h-q1), a famous database benchmarking dataset.
+
+```{r}
+lineitem_tbl <- duckdb:::sql(
+  "INSTALL tpch; LOAD tpch; CALL dbgen(sf=1); FROM lineitem;"
+)
+lineitem_tbl <- tibble::as_tibble(lineitem_tbl)
+dplyr::glimpse(lineitem_tbl)
+```
+
+To work with this in duckplyr instead of dplyr, all you need to do is load duckplyr:
+
+```{r}
+library(duckplyr)
+```
+
+Now we can express the well-known (at least in the database community!) "TPC-H benchmark query 1" in dplyr syntax and execute it in DuckDB via duckplyr.
+
+```{r}
+tpch_dplyr <- function(lineitem) {
+  lineitem |>
+    filter(l_shipdate <= !!as.Date("1998-09-02")) |>
+    summarise(
+      sum_qty = sum(l_quantity),
+      sum_base_price = sum(l_extendedprice),
+      sum_disc_price = sum(l_extendedprice * (1 - l_discount)),
+      sum_charge = sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)),
+      avg_qty = mean(l_quantity),
+      avg_price = mean(l_extendedprice),
+      avg_disc = mean(l_discount),
+      count_order = n(),
+      .by = c(l_returnflag, l_linestatus)
+    ) |>
+    arrange(l_returnflag, l_linestatus)
+}
+
+tpch_dplyr(lineitem_tbl)
+```
+
+Like other dplyr backends such as dtplyr and dbplyr, duckplyr gives you higher performance without learning a different syntax.
+Unlike other dplyr backends, duckplyr does not require you to change existing code or learn specific idiosyncrasies.
+Not only is the syntax the same, the semantics are too!
+If an operation cannot be carried out with DuckDB, it is automatically outsourced to dplyr.
+Over time, we expect fewer and fewer fallbacks to dplyr to be needed.
+
+## How to use duckplyr
+
+There are two ways to use duckplyr:
+
+- As above, you can call `library(duckplyr)` to replace all existing dplyr methods. This is safe because duckplyr is guaranteed to give exactly the same results as dplyr, unlike other backends.
+
+- Create individual "duck frames" using _conversion functions_ like `duckplyr::duckdb_tibble()` or `duckplyr::as_duckdb_tibble()`, or _ingestion functions_ like `duckplyr::read_csv_duckdb()`.
+
+Here's an example of the second form:
+
+```{r}
+out <- lineitem_tbl |>
+  duckplyr::as_duckdb_tibble() |>
+  tpch_dplyr()
+
+out
+```
+
+Note that the resulting object is indistinguishable from a regular tibble, except for the additional class.
+
+```{r}
+typeof(out)
+class(out)
+out$count_order
+```
+
+Operations not yet supported by duckplyr are automatically outsourced to dplyr.
+For instance, filtering on grouped data is not supported, but it still works thanks to the fallback mechanism.
+By default, the fallback is silent, but you can make it visible by setting an environment variable.
+This is useful if you want to better understand what's making your code slow.
+
+```{r}
+Sys.setenv(DUCKPLYR_FALLBACK_INFO = TRUE)
+
+lineitem_tbl |>
+  duckplyr::as_duckdb_tibble() |>
+  filter(l_quantity == max(l_quantity), .by = c(l_returnflag, l_linestatus))
+```
+
+You can also directly use DuckDB functions with the `dd$` qualifier.
+Functions with this prefix are not translated at all and are passed through directly to DuckDB.
+For example, the following code uses DuckDB's internal implementation of [Levenshtein distance](https://duckdb.org/docs/stable/sql/functions/text.html#editdist3s1-s2):
+
+```{r}
+tibble(a = "dbplyr", b = "duckplyr") %>%
+  mutate(c = dd$levenshtein(a, b))
+```
+
+See `vignette("duckdb")` for more information on these features.
+
+If you're working with dbplyr too, you can use `as_tbl()` to convert a duckplyr tibble to a dbplyr lazy table.
+This allows you to seamlessly interact with existing code that might use inline SQL or other dbplyr functionality.
+With `as_duckdb_tibble()`, you can convert a dbplyr lazy table to a duckplyr tibble.
+Both operations work without intermediate materialization.
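
The round trip between duckplyr and dbplyr described above can be sketched as follows. This is a minimal illustration, assuming dbplyr is installed alongside duckplyr; the toy data and the names `duck`, `lazy`, and `back` are hypothetical and not part of the original post.

```r
library(duckplyr)

# Hypothetical toy data; any duck frame would do
duck <- duckdb_tibble(x = 1:3)

# duckplyr -> dbplyr: obtain a lazy table (assumes dbplyr is installed)
lazy <- as_tbl(duck)

# Apply a dbplyr step, then convert back: dbplyr -> duckplyr
back <- lazy |>
  mutate(y = x + 1) |>
  as_duckdb_tibble()

back
```

The `mutate()` step in the middle is translated by dbplyr rather than duckplyr, which is where existing dbplyr code would slot in.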
+
+## Benchmark
+
+duckplyr is often much faster than dplyr.
+The comparison below is done in a fresh R session where dplyr is attached but duckplyr is not.
+
+```{r include = FALSE}
+# Undo the effect of library(duckplyr)
+methods_restore()
+```
+
+We use `tpch_dplyr()` as defined above to run the query with dplyr.
+The function that runs it with duckplyr only wraps the input data in a duck frame and forwards it to the dplyr function.
+The `collect()` at the end is required only for this benchmark to ensure fairness.[^collect]
+
+[^collect]: If omitted, the results would be unchanged but the measurements would be wrong. The computation would then be triggered by the check. See `vignette("prudence")` for details.
+
+```{r}
+tpch_duckplyr <- function(lineitem) {
+  lineitem |>
+    duckplyr::as_duckdb_tibble() |>
+    tpch_dplyr() |>
+    collect()
+}
+```
+
+And now we compare the two:
+
+```{r}
+bench::mark(
+  tpch_dplyr(lineitem_tbl),
+  tpch_duckplyr(lineitem_tbl),
+  check = ~ all.equal(.x, .y, tolerance = 1e-10)
+)
+```
+
+In this example, duckplyr is a lot faster than dplyr.
+It also appears to use much less memory, but this is misleading: DuckDB manages the memory, not R, so the memory usage is not visible to `bench::mark()`.
+
+## Out-of-memory data
+
+As well as improved speed with in-memory datasets, duckplyr makes it easy to work with datasets that are too big to fit in memory.
+In this case, you want:
+
+1. To work with data stored in modern formats designed for large data (e.g. Parquet).
+1. To be able to store large intermediate results on disk, keeping them out of memory.
+1. Fast computation!
+
+duckplyr provides each of these features:
+
+1. You can read data from disk with functions like `read_parquet_duckdb()`.
+1. You can save intermediate results to disk with `compute_parquet()` and `compute_csv()`.
+1. duckplyr takes advantage of DuckDB's query planner, which considers your entire pipeline holistically to figure out the most efficient way to get the data you need.
+
+See `vignette("large")` for a walkthrough and more details.
+
+## Help us improve duckplyr!
+
+Our goals for future development of duckplyr include:
+
+- Enabling users to provide [custom translations](https://github.com/tidyverse/duckplyr/issues/158) of dplyr functionality;
+- Making it easier to contribute code to duckplyr;
+- Supporting more dplyr and tidyr functionality natively in DuckDB.
+
+You can help!
+
+- Please report any issues, especially regarding unknown incompatibilities. See `vignette("limits")`.
+- Contribute to the codebase after reading duckplyr's [contributing guide](https://duckplyr.tidyverse.org/CONTRIBUTING.html).
+- Turn on telemetry to help us hear about the most frequent fallbacks so we can prioritize working on the corresponding missing dplyr translation. See `vignette("telemetry")` and `duckplyr::fallback_sitrep()`.
+
+## Additional resources
+
+Eager to learn more about duckplyr -- besides trying it out yourself?
+The duckplyr website features several [articles](https://duckplyr.tidyverse.org/articles/).
+Furthermore, the blog post ["duckplyr: dplyr Powered by DuckDB"](https://duckdb.org/2024/04/02/duckplyr.html) by Hannes Mühleisen provides some context on duckplyr including its inner workings, as does a [section](https://blog.r-hub.io/2025/02/13/lazy-meanings/#duckplyr-lazy-evaluation-and-prudence) of the R-hub blog post ["Lazy introduction to laziness in R"](https://blog.r-hub.io/2025/02/13/lazy-meanings/) by Maëlle Salmon, Athanasia Mo Mowinckel and Hannah Frick.
+
+## Acknowledgements
+
+A big thanks to all folks who filed issues, created PRs and generally helped to improve duckplyr and its workhorse [duckdb](https://r.duckdb.org/)!
+ +[@adamschwing](https://github.com/adamschwing), [@alejandrohagan](https://github.com/alejandrohagan), [@andreranza](https://github.com/andreranza), [@apalacio9502](https://github.com/apalacio9502), [@apsteinmetz](https://github.com/apsteinmetz), [@barracuda156](https://github.com/barracuda156), [@beniaminogreen](https://github.com/beniaminogreen), [@bob-rietveld](https://github.com/bob-rietveld), [@brichards920](https://github.com/brichards920), [@cboettig](https://github.com/cboettig), [@davidjayjackson](https://github.com/davidjayjackson), [@DavisVaughan](https://github.com/DavisVaughan), [@Ed2uiz](https://github.com/Ed2uiz), [@eitsupi](https://github.com/eitsupi), [@era127](https://github.com/era127), [@etiennebacher](https://github.com/etiennebacher), [@eutwt](https://github.com/eutwt), [@fmichonneau](https://github.com/fmichonneau), [@hadley](https://github.com/hadley), [@hannes](https://github.com/hannes), [@hawkfish](https://github.com/hawkfish), [@IndrajeetPatil](https://github.com/IndrajeetPatil), [@JanSulavik](https://github.com/JanSulavik), [@JavOrraca](https://github.com/JavOrraca), [@jeroen](https://github.com/jeroen), [@jhk0530](https://github.com/jhk0530), [@joakimlinde](https://github.com/joakimlinde), [@JosiahParry](https://github.com/JosiahParry), [@kevbaer](https://github.com/kevbaer), [@larry77](https://github.com/larry77), [@lnkuiper](https://github.com/lnkuiper), [@lorenzwalthert](https://github.com/lorenzwalthert), [@lschneiderbauer](https://github.com/lschneiderbauer), [@luisDVA](https://github.com/luisDVA), [@math-mcshane](https://github.com/math-mcshane), [@meersel](https://github.com/meersel), [@multimeric](https://github.com/multimeric), [@mytarmail](https://github.com/mytarmail), [@nicki-dese](https://github.com/nicki-dese), [@PMassicotte](https://github.com/PMassicotte), [@prasundutta87](https://github.com/prasundutta87), [@rafapereirabr](https://github.com/rafapereirabr), [@Robinlovelace](https://github.com/Robinlovelace), 
[@romainfrancois](https://github.com/romainfrancois), [@sparrow925](https://github.com/sparrow925), [@stefanlinner](https://github.com/stefanlinner), [@szarnyasg](https://github.com/szarnyasg), [@thomasp85](https://github.com/thomasp85), [@TimTaylor](https://github.com/TimTaylor), [@Tmonster](https://github.com/Tmonster), [@toppyy](https://github.com/toppyy), [@wibeasley](https://github.com/wibeasley), [@yjunechoe](https://github.com/yjunechoe), [@ywhcuhk](https://github.com/ywhcuhk), [@zhjx19](https://github.com/zhjx19), [@ablack3](https://github.com/ablack3), [@actuarial-lonewolf](https://github.com/actuarial-lonewolf), [@ajdamico](https://github.com/ajdamico), [@amirmazmi](https://github.com/amirmazmi), [@anderson461123](https://github.com/anderson461123), [@andrewGhazi](https://github.com/andrewGhazi), [@Antonov548](https://github.com/Antonov548), [@appiehappie999](https://github.com/appiehappie999), [@ArthurAndrews](https://github.com/ArthurAndrews), [@arthurgailes](https://github.com/arthurgailes), [@babaknaimi](https://github.com/babaknaimi), [@bcaradima](https://github.com/bcaradima), [@bdforbes](https://github.com/bdforbes), [@bergest](https://github.com/bergest), [@bill-ash](https://github.com/bill-ash), [@BorgeJorge](https://github.com/BorgeJorge), [@brianmsm](https://github.com/brianmsm), [@chainsawriot](https://github.com/chainsawriot), [@ckarnes](https://github.com/ckarnes), [@clementlefevre](https://github.com/clementlefevre), [@cregouby](https://github.com/cregouby), [@cy-james-lee](https://github.com/cy-james-lee), [@daranzolin](https://github.com/daranzolin), [@david-cortes](https://github.com/david-cortes), [@DavZim](https://github.com/DavZim), [@denis-or](https://github.com/denis-or), [@developertest1234](https://github.com/developertest1234), [@dicorynia](https://github.com/dicorynia), [@dsolito](https://github.com/dsolito), [@e-kotov](https://github.com/e-kotov), [@EAVWing](https://github.com/EAVWing), 
[@eddelbuettel](https://github.com/eddelbuettel), [@edward-burn](https://github.com/edward-burn), [@elefeint](https://github.com/elefeint), [@eli-daniels](https://github.com/eli-daniels), [@elysabethpc](https://github.com/elysabethpc), [@erikvona](https://github.com/erikvona), [@florisvdh](https://github.com/florisvdh), [@gaborcsardi](https://github.com/gaborcsardi), [@ggrothendieck](https://github.com/ggrothendieck), [@hdmm3](https://github.com/hdmm3), [@hope-data-science](https://github.com/hope-data-science), [@IoannaNika](https://github.com/IoannaNika), [@jabrown-aepenergy](https://github.com/jabrown-aepenergy), [@JamesLMacAulay](https://github.com/JamesLMacAulay), [@jangorecki](https://github.com/jangorecki), [@javierlenzi](https://github.com/javierlenzi), [@Joe-Heffer-Shef](https://github.com/Joe-Heffer-Shef), [@kalibera](https://github.com/kalibera), [@lboller-pwbm](https://github.com/lboller-pwbm), [@lgaborini](https://github.com/lgaborini), [@m-muecke](https://github.com/m-muecke), [@meztez](https://github.com/meztez), [@mgirlich](https://github.com/mgirlich), [@mtmorgan](https://github.com/mtmorgan), [@nassuphis](https://github.com/nassuphis), [@nbc](https://github.com/nbc), [@olivroy](https://github.com/olivroy), [@pdet](https://github.com/pdet), [@phdjsep](https://github.com/phdjsep), [@pierre-lamarche](https://github.com/pierre-lamarche), [@r2evans](https://github.com/r2evans), [@ran-codes](https://github.com/ran-codes), [@rplsmn](https://github.com/rplsmn), [@Saarialho](https://github.com/Saarialho), [@SimonCoulombe](https://github.com/SimonCoulombe), [@tau31](https://github.com/tau31), [@thohan88](https://github.com/thohan88), [@ThomasSoeiro](https://github.com/ThomasSoeiro), [@timothygmitchell](https://github.com/timothygmitchell), [@vincentarelbundock](https://github.com/vincentarelbundock), [@VincentGuyader](https://github.com/VincentGuyader), [@wlangera](https://github.com/wlangera), [@xbasics](https://github.com/xbasics), 
[@xiaodaigh](https://github.com/xiaodaigh), [@xtimbeau](https://github.com/xtimbeau), [@yng-me](https://github.com/yng-me), [@Yousuf28](https://github.com/Yousuf28), [@yutannihilation](https://github.com/yutannihilation), and [@zcatav](https://github.com/zcatav)
+
+Special thanks to Joe Thorley ([@joethorley](https://github.com/joethorley)) for help with choosing the right words.
diff --git a/content/blog/duckplyr-1-1-0/index.md b/content/blog/duckplyr-1-1-0/index.md
new file mode 100644
index 000000000..f375ecc80
--- /dev/null
+++ b/content/blog/duckplyr-1-1-0/index.md
@@ -0,0 +1,303 @@
+---
+output: hugodown::hugo_document
+
+slug: duckplyr-1-1-0
+title: duckplyr fully joins the tidyverse!
+date: 2025-06-19
+author: Kirill Müller and Maëlle Salmon
+description: >
+  duckplyr 1.1.0 is on CRAN!
+  A drop-in replacement for dplyr, powered by DuckDB for speed.
+  It is the most dplyr-like of dplyr backends.
+
+photo:
+  url: https://www.pexels.com/photo/a-mallard-duck-on-water-6918877/
+  author: Kiril Gruev
+
+# one of: "deep-dive", "learn", "package", "programming", "roundup", or "other"
+categories: [package]
+tags:
+  - duckplyr
+  - dplyr
+  - tidyverse
+rmd_hash: e61d2b86a57469dc
+
+---
+
+We're well chuffed to announce the release of [duckplyr](https://duckplyr.tidyverse.org) 1.1.0. This is a dplyr backend powered by [DuckDB](https://duckdb.org/), a fast in-memory analytical database system[^1]. duckplyr uses the power of DuckDB for impressive performance where it can, and seamlessly falls back to R where it can't. You can install it from CRAN with:
+
+
+ +
install.packages("duckplyr")
+ +
+
+This article shows how duckplyr can be used instead of dplyr, explains how you can help improve the package, and shares a selection of further resources.
+
+## A drop-in replacement for dplyr
+
+Imagine you have to wrangle a huge dataset, like this one from the [TPC-H benchmark](https://duckdb.org/2024/04/02/duckplyr.html#benchmark-tpc-h-q1), a famous database benchmarking dataset.
+
+
+ +
lineitem_tbl <- duckdb:::sql("INSTALL tpch; LOAD tpch; CALL dbgen(sf=1); FROM lineitem;")
+lineitem_tbl <- tibble::as_tibble(lineitem_tbl)
+dplyr::glimpse(lineitem_tbl)
+#> Rows: 6,001,215
+#> Columns: 16
+#> $ l_orderkey      <dbl> 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 3, 4, 5, 5, 5, 6, 
+#> $ l_partkey       <dbl> 155190, 67310, 63700, 2132, 24027, 15635, 106170, 4297…
+#> $ l_suppkey       <dbl> 7706, 7311, 3701, 4633, 1534, 638, 1191, 1798, 6540, 3…
+#> $ l_linenumber    <dbl> 1, 2, 3, 4, 5, 6, 1, 1, 2, 3, 4, 5, 6, 1, 1, 2, 3, 1, 
+#> $ l_quantity      <dbl> 17, 36, 8, 28, 24, 32, 38, 45, 49, 27, 2, 28, 26, 30, 
+#> $ l_extendedprice <dbl> 21168.23, 45983.16, 13309.60, 28955.64, 22824.48, 4962…
+#> $ l_discount      <dbl> 0.04, 0.09, 0.10, 0.09, 0.10, 0.07, 0.00, 0.06, 0.10, 
+#> $ l_tax           <dbl> 0.02, 0.06, 0.02, 0.06, 0.04, 0.02, 0.05, 0.00, 0.00, 
+#> $ l_returnflag    <chr> "N", "N", "N", "N", "N", "N", "N", "R", "R", "A", "A",
+#> $ l_linestatus    <chr> "O", "O", "O", "O", "O", "O", "O", "F", "F", "F", "F",
+#> $ l_shipdate      <date> 1996-03-13, 1996-04-12, 1996-01-29, 1996-04-21, 1996-…
+#> $ l_commitdate    <date> 1996-02-12, 1996-02-28, 1996-03-05, 1996-03-30, 1996-…
+#> $ l_receiptdate   <date> 1996-03-22, 1996-04-20, 1996-01-31, 1996-05-16, 1996-…
+#> $ l_shipinstruct  <chr> "DELIVER IN PERSON", "TAKE BACK RETURN", "TAKE BACK RE…
+#> $ l_shipmode      <chr> "TRUCK", "MAIL", "REG AIR", "AIR", "FOB", "MAIL", "RAI…
+#> $ l_comment       <chr> "to beans x-ray carefull", " according to the final fo…
+
+ +
+ +To work with this in duckplyr instead of dplyr, all you need to do is load duckplyr: + +
+ +
library(duckplyr)
+#> Loading required package: dplyr
+#> The duckplyr package is configured to fall back to dplyr when it encounters an incompatibility.
+#> Fallback events can be collected and uploaded for analysis to guide future development. By
+#> default, data will be collected but no data will be uploaded.
+#>  Automatic fallback uploading is not controlled and therefore disabled, see
+#>   `?duckplyr::fallback()`.
+#>  Number of reports ready for upload: 4.
+#> → Review with `duckplyr::fallback_review()`, upload with `duckplyr::fallback_upload()`.
+#>  Configure automatic uploading with `duckplyr::fallback_config()`.
+#>  Overwriting dplyr methods with duckplyr methods.
+#>  Turn off with `duckplyr::methods_restore()`.
+
+ +
+ +Now we can express the well-known (at least in the database community!) "TPC-H benchmark query 1" in dplyr syntax and execute it in DuckDB via duckplyr. + +
+ +
tpch_dplyr <- function(lineitem) {
+  lineitem |>
+    filter(l_shipdate <= !!as.Date("1998-09-02")) |>
+    summarise(
+      sum_qty = sum(l_quantity),
+      sum_base_price = sum(l_extendedprice),
+      sum_disc_price = sum(l_extendedprice * (1 - l_discount)),
+      sum_charge = sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)),
+      avg_qty = mean(l_quantity),
+      avg_price = mean(l_extendedprice),
+      avg_disc = mean(l_discount),
+      count_order = n(),
+      .by = c(l_returnflag, l_linestatus)
+    ) |>
+    arrange(l_returnflag, l_linestatus)
+}
+
+tpch_dplyr(lineitem_tbl)
+#> # A tibble: 4 × 10
+#>   l_returnflag l_linestatus  sum_qty sum_base_price sum_disc_price    sum_charge
+#>   <chr>        <chr>           <dbl>          <dbl>          <dbl>         <dbl>
+#> 1 A            F            37734107   56586554401.   53758257135.  55909065223.
+#> 2 N            F              991417    1487504710.    1413082168.   1469649223.
+#> 3 N            O            74476040  111701729698.  106118230308. 110367043872.
+#> 4 R            F            37719753   56568041381.   53741292685.  55889619120.
+#> # ℹ 4 more variables: avg_qty <dbl>, avg_price <dbl>, avg_disc <dbl>,
+#> #   count_order <int>
+
+ +
+
+Like other dplyr backends such as dtplyr and dbplyr, duckplyr gives you higher performance without learning a different syntax. Unlike other dplyr backends, duckplyr does not require you to change existing code or learn specific idiosyncrasies. Not only is the syntax the same, the semantics are too! If an operation cannot be carried out with DuckDB, it is automatically outsourced to dplyr. Over time, we expect fewer and fewer fallbacks to dplyr to be needed.
+
+## How to use duckplyr
+
+There are two ways to use duckplyr:
+
+- As above, you can call [`library(duckplyr)`](https://duckplyr.tidyverse.org) to replace all existing dplyr methods. This is safe because duckplyr is guaranteed to give exactly the same results as dplyr, unlike other backends.
+
+- Create individual "duck frames" using *conversion functions* like `duckplyr::duckdb_tibble()` or `duckplyr::as_duckdb_tibble()`, or *ingestion functions* like `duckplyr::read_csv_duckdb()`.
+
+Here's an example of the second form:
+
+
+ +
out <- lineitem_tbl |>
+  duckplyr::as_duckdb_tibble() |>
+  tpch_dplyr()
+
+out
+#> # A duckplyr data frame: 10 variables
+#>   l_returnflag l_linestatus  sum_qty sum_base_price sum_disc_price    sum_charge
+#>   <chr>        <chr>           <dbl>          <dbl>          <dbl>         <dbl>
+#> 1 A            F            37734107   56586554401.   53758257135.  55909065223.
+#> 2 N            F              991417    1487504710.    1413082168.   1469649223.
+#> 3 N            O            74476040  111701729698.  106118230308. 110367043872.
+#> 4 R            F            37719753   56568041381.   53741292685.  55889619120.
+#> # ℹ 4 more variables: avg_qty <dbl>, avg_price <dbl>, avg_disc <dbl>,
+#> #   count_order <int>
+
+ +
+ +Note that the resulting object is indistinguishable from a regular tibble, except for the additional class. + +
+ +
typeof(out)
+#> [1] "list"
+class(out)
+#> [1] "duckplyr_df" "tbl_df"      "tbl"         "data.frame"
+out$count_order
+#> [1] 1478493   38854 2920374 1478870
+
+ +
+
+Operations not yet supported by duckplyr are automatically outsourced to dplyr. For instance, filtering on grouped data is not supported, but it still works thanks to the fallback mechanism. By default, the fallback is silent, but you can make it visible by setting an environment variable. This is useful if you want to better understand what's making your code slow.
+
+
+ +
Sys.setenv(DUCKPLYR_FALLBACK_INFO = TRUE)
+
+lineitem_tbl |>
+  duckplyr::as_duckdb_tibble() |>
+  filter(l_quantity == max(l_quantity), .by = c(l_returnflag, l_linestatus))
+#> Cannot process duckplyr query with DuckDB, falling back to dplyr.
+#>  `filter(.by = ...)` not implemented, try `mutate(.by = ...)` followed by a simple `filter()`.
+#> # A duckplyr data frame: 16 variables
+#>    l_orderkey l_partkey l_suppkey l_linenumber l_quantity l_extendedprice
+#>         <dbl>     <dbl>     <dbl>        <dbl>      <dbl>           <dbl>
+#>  1          5     37531        35            3         50          73426.
+#>  2        131     44255      9264            2         50          59962.
+#>  3        199    132072      9612            1         50          55204.
+#>  4        231    198124       644            3         50          61106 
+#>  5        260    155887      5888            1         50          97144 
+#>  6        263    142891       434            3         50          96694.
+#>  7        323    163628      1177            1         50          84581 
+#>  8        354     58125      8126            3         50          54156 
+#>  9        484    183351      5870            3         50          71718.
+#> 10        485    149523      9524            1         50          78626 
+#> # ℹ more rows
+#> # ℹ 10 more variables: l_discount <dbl>, l_tax <dbl>, l_returnflag <chr>,
+#> #   l_linestatus <chr>, l_shipdate <date>, l_commitdate <date>,
+#> #   l_receiptdate <date>, l_shipinstruct <chr>, l_shipmode <chr>,
+#> #   l_comment <chr>
+
+ +
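
The fallback message above suggests replacing `filter(.by = ...)` with `mutate(.by = ...)` followed by a plain `filter()`. A minimal sketch of that rewrite, using hypothetical toy data (`toy`, `flag`, `qty`) in place of `lineitem_tbl`, might look like this; whether every step then runs natively in DuckDB is an assumption that depends on the duckplyr version, and any unsupported step would still fall back to dplyr.

```r
library(duckplyr)

# Hypothetical toy data standing in for `lineitem_tbl`
toy <- duckdb_tibble(
  flag = c("A", "A", "N", "N"),
  qty = c(1, 5, 3, 2)
)

# Instead of filter(qty == max(qty), .by = flag):
# compute the per-group maximum first, then filter without grouping
toy_max <- toy |>
  mutate(max_qty = max(qty), .by = flag) |>
  filter(qty == max_qty) |>
  select(-max_qty)

toy_max
```

With `DUCKPLYR_FALLBACK_INFO` set as above, the absence of a fallback message for this pipeline is one way to check that the rewrite had the intended effect.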
+
+You can also directly use DuckDB functions with the `dd$` qualifier. Functions with this prefix are not translated at all and are passed through directly to DuckDB. For example, the following code uses DuckDB's internal implementation of [Levenshtein distance](https://duckdb.org/docs/stable/sql/functions/text.html#editdist3s1-s2):
+
+
+ +
tibble(a = "dbplyr", b = "duckplyr") %>%
+  mutate(c = dd$levenshtein(a, b))
+#> # A tibble: 1 × 3
+#>   a      b            c
+#>   <chr>  <chr>    <dbl>
+#> 1 dbplyr duckplyr     3
+
+ +
+
+See [`vignette("duckdb")`](https://duckplyr.tidyverse.org/articles/duckdb.html) for more information on these features.
+
+If you're working with dbplyr too, you can use [`as_tbl()`](https://duckplyr.tidyverse.org/reference/as_tbl.html) to convert a duckplyr tibble to a dbplyr lazy table. This allows you to seamlessly interact with existing code that might use inline SQL or other dbplyr functionality. With [`as_duckdb_tibble()`](https://duckplyr.tidyverse.org/reference/duckdb_tibble.html), you can convert a dbplyr lazy table to a duckplyr tibble. Both operations work without intermediate materialization.
+
+## Benchmark
+
+duckplyr is often much faster than dplyr. The comparison below is done in a fresh R session where dplyr is attached but duckplyr is not.
+
+We use `tpch_dplyr()` as defined above to run the query with dplyr. The function that runs it with duckplyr only wraps the input data in a duck frame and forwards it to the dplyr function. The [`collect()`](https://dplyr.tidyverse.org/reference/compute.html) at the end is required only for this benchmark to ensure fairness.[^2]
+
+
+ +
tpch_duckplyr <- function(lineitem) {
+  lineitem |>
+    duckplyr::as_duckdb_tibble() |>
+    tpch_dplyr() |>
+    collect()
+}
+ +
+ +And now we compare the two: + +
+ +
bench::mark(
+  tpch_dplyr(lineitem_tbl),
+  tpch_duckplyr(lineitem_tbl),
+  check = ~ all.equal(.x, .y, tolerance = 1e-10)
+)
+#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
+#> # A tibble: 2 × 6
+#>   expression                       min   median `itr/sec` mem_alloc `gc/sec`
+#>   <bch:expr>                  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
+#> 1 tpch_dplyr(lineitem_tbl)     611.6ms  611.6ms      1.64    1.25GB     1.64
+#> 2 tpch_duckplyr(lineitem_tbl)   71.4ms   72.3ms     13.8   314.38KB     0
+
+ +
+
+In this example, duckplyr is a lot faster than dplyr. It also appears to use much less memory, but this is misleading: DuckDB manages the memory, not R, so the memory usage is not visible to [`bench::mark()`](https://bench.r-lib.org/reference/mark.html).
+
+## Out-of-memory data
+
+As well as improved speed with in-memory datasets, duckplyr makes it easy to work with datasets that are too big to fit in memory. In this case, you want:
+
+1. To work with data stored in modern formats designed for large data (e.g. Parquet).
+2. To be able to store large intermediate results on disk, keeping them out of memory.
+3. Fast computation!
+
+duckplyr provides each of these features:
+
+1. You can read data from disk with functions like [`read_parquet_duckdb()`](https://duckplyr.tidyverse.org/reference/read_parquet_duckdb.html).
+2. You can save intermediate results to disk with [`compute_parquet()`](https://duckplyr.tidyverse.org/reference/compute_parquet.html) and [`compute_csv()`](https://duckplyr.tidyverse.org/reference/compute_csv.html).
+3. duckplyr takes advantage of DuckDB's query planner, which considers your entire pipeline holistically to figure out the most efficient way to get the data you need.
+
+See [`vignette("large")`](https://duckplyr.tidyverse.org/articles/large.html) for a walkthrough and more details.
+
+## Help us improve duckplyr!
+
+Our goals for future development of duckplyr include:
+
+- Enabling users to provide [custom translations](https://github.com/tidyverse/duckplyr/issues/158) of dplyr functionality;
+- Making it easier to contribute code to duckplyr;
+- Supporting more dplyr and tidyr functionality natively in DuckDB.
+
+You can help!
+
+- Please report any issues, especially regarding unknown incompatibilities. See [`vignette("limits")`](https://duckplyr.tidyverse.org/articles/limits.html).
+- Contribute to the codebase after reading duckplyr's [contributing guide](https://duckplyr.tidyverse.org/CONTRIBUTING.html).
+- Turn on telemetry to help us hear about the most frequent fallbacks so we can prioritize working on the corresponding missing dplyr translation. See [`vignette("telemetry")`](https://duckplyr.tidyverse.org/articles/telemetry.html) and [`duckplyr::fallback_sitrep()`](https://duckplyr.tidyverse.org/reference/fallback.html).
+
+## Additional resources
+
+Eager to learn more about duckplyr -- besides trying it out yourself? The duckplyr website features several [articles](https://duckplyr.tidyverse.org/articles/). Furthermore, the blog post ["duckplyr: dplyr Powered by DuckDB"](https://duckdb.org/2024/04/02/duckplyr.html) by Hannes Mühleisen provides some context on duckplyr including its inner workings, as does a [section](https://blog.r-hub.io/2025/02/13/lazy-meanings/#duckplyr-lazy-evaluation-and-prudence) of the R-hub blog post ["Lazy introduction to laziness in R"](https://blog.r-hub.io/2025/02/13/lazy-meanings/) by Maëlle Salmon, Athanasia Mo Mowinckel and Hannah Frick.
+
+## Acknowledgements
+
+A big thanks to all folks who filed issues, created PRs and generally helped to improve duckplyr and its workhorse [duckdb](https://r.duckdb.org/)!
+ +[@adamschwing](https://github.com/adamschwing), [@alejandrohagan](https://github.com/alejandrohagan), [@andreranza](https://github.com/andreranza), [@apalacio9502](https://github.com/apalacio9502), [@apsteinmetz](https://github.com/apsteinmetz), [@barracuda156](https://github.com/barracuda156), [@beniaminogreen](https://github.com/beniaminogreen), [@bob-rietveld](https://github.com/bob-rietveld), [@brichards920](https://github.com/brichards920), [@cboettig](https://github.com/cboettig), [@davidjayjackson](https://github.com/davidjayjackson), [@DavisVaughan](https://github.com/DavisVaughan), [@Ed2uiz](https://github.com/Ed2uiz), [@eitsupi](https://github.com/eitsupi), [@era127](https://github.com/era127), [@etiennebacher](https://github.com/etiennebacher), [@eutwt](https://github.com/eutwt), [@fmichonneau](https://github.com/fmichonneau), [@hadley](https://github.com/hadley), [@hannes](https://github.com/hannes), [@hawkfish](https://github.com/hawkfish), [@IndrajeetPatil](https://github.com/IndrajeetPatil), [@JanSulavik](https://github.com/JanSulavik), [@JavOrraca](https://github.com/JavOrraca), [@jeroen](https://github.com/jeroen), [@jhk0530](https://github.com/jhk0530), [@joakimlinde](https://github.com/joakimlinde), [@JosiahParry](https://github.com/JosiahParry), [@kevbaer](https://github.com/kevbaer), [@larry77](https://github.com/larry77), [@lnkuiper](https://github.com/lnkuiper), [@lorenzwalthert](https://github.com/lorenzwalthert), [@lschneiderbauer](https://github.com/lschneiderbauer), [@luisDVA](https://github.com/luisDVA), [@math-mcshane](https://github.com/math-mcshane), [@meersel](https://github.com/meersel), [@multimeric](https://github.com/multimeric), [@mytarmail](https://github.com/mytarmail), [@nicki-dese](https://github.com/nicki-dese), [@PMassicotte](https://github.com/PMassicotte), [@prasundutta87](https://github.com/prasundutta87), [@rafapereirabr](https://github.com/rafapereirabr), [@Robinlovelace](https://github.com/Robinlovelace), 
[@romainfrancois](https://github.com/romainfrancois), [@sparrow925](https://github.com/sparrow925), [@stefanlinner](https://github.com/stefanlinner), [@szarnyasg](https://github.com/szarnyasg), [@thomasp85](https://github.com/thomasp85), [@TimTaylor](https://github.com/TimTaylor), [@Tmonster](https://github.com/Tmonster), [@toppyy](https://github.com/toppyy), [@wibeasley](https://github.com/wibeasley), [@yjunechoe](https://github.com/yjunechoe), [@ywhcuhk](https://github.com/ywhcuhk), [@zhjx19](https://github.com/zhjx19), [@ablack3](https://github.com/ablack3), [@actuarial-lonewolf](https://github.com/actuarial-lonewolf), [@ajdamico](https://github.com/ajdamico), [@amirmazmi](https://github.com/amirmazmi), [@anderson461123](https://github.com/anderson461123), [@andrewGhazi](https://github.com/andrewGhazi), [@Antonov548](https://github.com/Antonov548), [@appiehappie999](https://github.com/appiehappie999), [@ArthurAndrews](https://github.com/ArthurAndrews), [@arthurgailes](https://github.com/arthurgailes), [@babaknaimi](https://github.com/babaknaimi), [@bcaradima](https://github.com/bcaradima), [@bdforbes](https://github.com/bdforbes), [@bergest](https://github.com/bergest), [@bill-ash](https://github.com/bill-ash), [@BorgeJorge](https://github.com/BorgeJorge), [@brianmsm](https://github.com/brianmsm), [@chainsawriot](https://github.com/chainsawriot), [@ckarnes](https://github.com/ckarnes), [@clementlefevre](https://github.com/clementlefevre), [@cregouby](https://github.com/cregouby), [@cy-james-lee](https://github.com/cy-james-lee), [@daranzolin](https://github.com/daranzolin), [@david-cortes](https://github.com/david-cortes), [@DavZim](https://github.com/DavZim), [@denis-or](https://github.com/denis-or), [@developertest1234](https://github.com/developertest1234), [@dicorynia](https://github.com/dicorynia), [@dsolito](https://github.com/dsolito), [@e-kotov](https://github.com/e-kotov), [@EAVWing](https://github.com/EAVWing), 
[@eddelbuettel](https://github.com/eddelbuettel), [@edward-burn](https://github.com/edward-burn), [@elefeint](https://github.com/elefeint), [@eli-daniels](https://github.com/eli-daniels), [@elysabethpc](https://github.com/elysabethpc), [@erikvona](https://github.com/erikvona), [@florisvdh](https://github.com/florisvdh), [@gaborcsardi](https://github.com/gaborcsardi), [@ggrothendieck](https://github.com/ggrothendieck), [@hdmm3](https://github.com/hdmm3), [@hope-data-science](https://github.com/hope-data-science), [@IoannaNika](https://github.com/IoannaNika), [@jabrown-aepenergy](https://github.com/jabrown-aepenergy), [@JamesLMacAulay](https://github.com/JamesLMacAulay), [@jangorecki](https://github.com/jangorecki), [@javierlenzi](https://github.com/javierlenzi), [@Joe-Heffer-Shef](https://github.com/Joe-Heffer-Shef), [@kalibera](https://github.com/kalibera), [@lboller-pwbm](https://github.com/lboller-pwbm), [@lgaborini](https://github.com/lgaborini), [@m-muecke](https://github.com/m-muecke), [@meztez](https://github.com/meztez), [@mgirlich](https://github.com/mgirlich), [@mtmorgan](https://github.com/mtmorgan), [@nassuphis](https://github.com/nassuphis), [@nbc](https://github.com/nbc), [@olivroy](https://github.com/olivroy), [@pdet](https://github.com/pdet), [@phdjsep](https://github.com/phdjsep), [@pierre-lamarche](https://github.com/pierre-lamarche), [@r2evans](https://github.com/r2evans), [@ran-codes](https://github.com/ran-codes), [@rplsmn](https://github.com/rplsmn), [@Saarialho](https://github.com/Saarialho), [@SimonCoulombe](https://github.com/SimonCoulombe), [@tau31](https://github.com/tau31), [@thohan88](https://github.com/thohan88), [@ThomasSoeiro](https://github.com/ThomasSoeiro), [@timothygmitchell](https://github.com/timothygmitchell), [@vincentarelbundock](https://github.com/vincentarelbundock), [@VincentGuyader](https://github.com/VincentGuyader), [@wlangera](https://github.com/wlangera), [@xbasics](https://github.com/xbasics), 
[@xiaodaigh](https://github.com/xiaodaigh), [@xtimbeau](https://github.com/xtimbeau), [@yng-me](https://github.com/yng-me), [@Yousuf28](https://github.com/Yousuf28), [@yutannihilation](https://github.com/yutannihilation), and [@zcatav](https://github.com/zcatav) + +Special thanks to Joe Thorley ([@joethorley](https://github.com/joethorley)) for help with choosing the right words. + +[^1]: If you haven't heard of it yet, watch [Hannes Mühleisen's keynote at posit::conf(2024)](https://www.youtube.com/watch?v=GELhdezYmP0&feature=youtu.be). + +[^2]: If omitted, the results would be unchanged but the measurements would be wrong. The computation would then be triggered by the check. See [`vignette("prudence")`](https://duckplyr.tidyverse.org/articles/prudence.html) for details. + diff --git a/content/blog/duckplyr-1-1-0/thumbnail-sq.jpg b/content/blog/duckplyr-1-1-0/thumbnail-sq.jpg new file mode 100644 index 000000000..6808c4720 Binary files /dev/null and b/content/blog/duckplyr-1-1-0/thumbnail-sq.jpg differ diff --git a/content/blog/duckplyr-1-1-0/thumbnail-wd.jpg b/content/blog/duckplyr-1-1-0/thumbnail-wd.jpg new file mode 100644 index 000000000..cb24733a4 Binary files /dev/null and b/content/blog/duckplyr-1-1-0/thumbnail-wd.jpg differ