|
| 1 | +--- |
| 2 | +output: hugodown::hugo_document |
| 3 | + |
| 4 | +slug: duckplyr-1-1-0 |
| 5 | +title: duckplyr fully joins the tidyverse! |
| 6 | +date: 2025-06-19 |
| 7 | +author: Kirill Müller and Maëlle Salmon |
| 8 | +description: > |
| 9 | + duckplyr 1.1.0 is on CRAN! |
| 10 | + A drop-in replacement for dplyr, powered by DuckDB for speed. |
| 11 | + It is the most dplyr-like of dplyr backends. |
| 12 | + |
| 13 | +photo: |
| 14 | + url: https://www.pexels.com/photo/a-mallard-duck-on-water-6918877/ |
| 15 | + author: Kiril Gruev |
| 16 | + |
| 17 | +# one of: "deep-dive", "learn", "package", "programming", "roundup", or "other" |
| 18 | +categories: [package] |
| 19 | +tags: |
| 20 | + - duckplyr |
| 21 | + - dplyr |
| 22 | + - tidyverse |
| 23 | +--- |
| 24 | + |
| 25 | +```{r include = FALSE} |
| 26 | +options( |
| 27 | + pillar.min_title_chars = 20, |
| 28 | + pillar.max_footer_lines = 7, |
| 29 | + pillar.bold = TRUE |
| 30 | +) |
| 31 | +options(conflicts.policy = list(warn = FALSE)) |
| 32 | +library(conflicted) |
| 33 | +conflict_prefer("filter", "dplyr", quiet = TRUE) |
| 34 | +``` |
| 35 | + |
| 36 | + |
| 37 | +We're well chuffed to announce the release of [duckplyr](https://duckplyr.tidyverse.org) 1.1.0. |
| 38 | +This is a dplyr backend powered by [DuckDB](https://duckdb.org/), a fast in-memory analytical database system[^duckdb]. |
| 39 | +duckplyr uses the power of DuckDB for impressive performance where it can, and seamlessly falls back to R where it can't. |
| 40 | +You can install it from CRAN with: |
| 41 | + |
| 42 | +[^duckdb]: If you haven't heard of it yet, watch [Hannes Mühleisen's keynote at posit::conf(2024)](https://www.youtube.com/watch?v=GELhdezYmP0&feature=youtu.be). |
| 43 | + |
| 44 | +```{r, eval = FALSE} |
| 45 | +install.packages("duckplyr") |
| 46 | +``` |
| 47 | + |
| 48 | +This article shows how duckplyr can be used instead of dplyr, explains how you can help improve the package, and shares a selection of further resources. |
| 49 | + |
| 50 | +## A drop-in replacement for dplyr |
| 51 | + |
| 52 | +Imagine you have to wrangle a huge dataset, like this one from the [TPC-H benchmark](https://duckdb.org/2024/04/02/duckplyr.html#benchmark-tpc-h-q1), a famous database benchmarking dataset. |
| 53 | + |
| 54 | +```{r} |
| 55 | +lineitem_tbl <- duckdb:::sql( |
| 56 | + "INSTALL tpch; LOAD tpch; CALL dbgen(sf=1); FROM lineitem;" |
| 57 | +) |
| 58 | +lineitem_tbl <- tibble::as_tibble(lineitem_tbl) |
| 59 | +dplyr::glimpse(lineitem_tbl) |
| 60 | +``` |
| 61 | + |
| 62 | +To work with this in duckplyr instead of dplyr, all you need to do is load duckplyr: |
| 63 | + |
| 64 | +```{r} |
| 65 | +library(duckplyr) |
| 66 | +``` |
| 67 | + |
| 68 | +Now we can express the well-known (at least in the database community!) "TPC-H benchmark query 1" in dplyr syntax and execute it in DuckDB via duckplyr. |
| 69 | + |
| 70 | +```{r} |
| 71 | +tpch_dplyr <- function(lineitem) { |
| 72 | + lineitem |> |
| 73 | + filter(l_shipdate <= !!as.Date("1998-09-02")) |> |
| 74 | + summarise( |
| 75 | + sum_qty = sum(l_quantity), |
| 76 | + sum_base_price = sum(l_extendedprice), |
| 77 | + sum_disc_price = sum(l_extendedprice * (1 - l_discount)), |
| 78 | + sum_charge = sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)), |
| 79 | + avg_qty = mean(l_quantity), |
| 80 | + avg_price = mean(l_extendedprice), |
| 81 | + avg_disc = mean(l_discount), |
| 82 | + count_order = n(), |
| 83 | + .by = c(l_returnflag, l_linestatus) |
| 84 | + ) |> |
| 85 | + arrange(l_returnflag, l_linestatus) |
| 86 | +} |
| 87 | + |
| 88 | +tpch_dplyr(lineitem_tbl) |
| 89 | +``` |
| 90 | + |
| 91 | +Like other dplyr backends such as dtplyr and dbplyr, duckplyr gives you higher performance without learning a different syntax. |
| 92 | +Unlike other dplyr backends, duckplyr does not require you to change existing code or learn specific idiosyncrasies. |
| 93 | +Not only is the syntax the same, the semantics are too! |
| 94 | +If an operation cannot be carried out with DuckDB, it is automatically outsourced to dplyr. |
| 95 | +Over time, we expect fewer and fewer fallbacks to dplyr to be needed. |
| 96 | + |
| 97 | +## How to use duckplyr |
| 98 | + |
| 99 | +There are two ways to use duckplyr: |
| 100 | + |
| 101 | +- As above, you can `library(duckplyr)`, and replace all existing dplyr methods. This is safe because duckplyr is guaranteed to give exactly the same results as dplyr, unlike other backends. |
| 102 | + |
| 103 | +- Create individual "duck frames" using _conversion functions_ like `duckplyr::duckdb_tibble()` or `duckplyr::as_duckdb_tibble()`, or _ingestion functions_ like `duckplyr::read_csv_duckdb()`. |
| 104 | + |
| 105 | +Here's an example of the second form: |
| 106 | + |
| 107 | +```{r} |
| 108 | +out <- lineitem_tbl |> |
| 109 | + duckplyr::as_duckdb_tibble() |> |
| 110 | + tpch_dplyr() |
| 111 | + |
| 112 | +out |
| 113 | +``` |
| 114 | + |
| 115 | +Note that the resulting object is indistinguishable from a regular tibble, except for the additional class. |
| 116 | + |
| 117 | +```{r} |
| 118 | +typeof(out) |
| 119 | +class(out) |
| 120 | +out$count_order |
| 121 | +``` |
| 122 | + |
| 123 | +Operations not yet supported by duckplyr are automatically outsourced to dplyr. |
| 124 | +For instance, filtering on grouped data is not supported, but it still works thanks to the fallback mechanism. |
| 125 | +By default, the fallback is silent, but you can make it visible by setting an environment variable. |
| 126 | +This is useful if you want to better understand what's making your code slow. |
| 127 | + |
| 128 | +```{r} |
| 129 | +Sys.setenv(DUCKPLYR_FALLBACK_INFO = TRUE) |
| 130 | + |
| 131 | +lineitem_tbl |> |
| 132 | + duckplyr::as_duckdb_tibble() |> |
| 133 | + filter(l_quantity == max(l_quantity), .by = c(l_returnflag, l_linestatus)) |
| 134 | +``` |
| 135 | + |
| 136 | +You can also directly use DuckDB functions with the `dd$` qualifier. |
| 137 | +Functions with this prefix will not be translated at all and passed through directly to DuckDB. |
| 138 | +For example, the following code uses DuckDB's internal implementation of [Levenshtein distance](https://duckdb.org/docs/stable/sql/functions/text.html#editdist3s1-s2): |
| 139 | + |
| 140 | +```{r} |
| 141 | +tibble(a = "dbplyr", b = "duckplyr") %>% |
| 142 | + mutate(c = dd$levenshtein(a, b)) |
| 143 | +``` |
| 144 | + |
| 145 | +See `vignette("duckdb")` for more information on these features. |
| 146 | + |
| 147 | +If you're working with dbplyr too, you can use `as_tbl()` to convert a duckplyr tibble to a dbplyr lazy table. |
| 148 | +This allows you to seamlessly interact with existing code that might use inline SQL or other dbplyr functionality. |
| 149 | +With `as_duckdb_tibble()`, you can convert a dbplyr lazy table to a duckplyr tibble. |
| 150 | +Both operations work without intermediate materialization. |
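
The round trip described above can be sketched as follows. This is a minimal illustration, assuming dbplyr is installed; the toy data and the variable names (`duck`, `lazy`, `back`) are made up for this example.

```r
library(duckplyr)

# A tiny duck frame (hypothetical data, for illustration only)
duck <- duckdb_tibble(a = 1L)

# duckplyr -> dbplyr: the result is a dbplyr lazy table
lazy <- as_tbl(duck)

# Use dbplyr functionality such as inline SQL, then convert back to a duck frame
back <- lazy |>
  mutate(b = dplyr::sql("a + 1")) |>
  as_duckdb_tibble()

collect(back)
```

The inline SQL in `dplyr::sql()` is passed to DuckDB as written, so this route is handy for expressions that have no dplyr equivalent.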
| 151 | + |
| 152 | +## Benchmark |
| 153 | + |
| 154 | +duckplyr is often much faster than dplyr. |
| 155 | +The comparison below is done in a fresh R session where dplyr is attached but duckplyr is not. |
| 156 | + |
| 157 | +```{r include = FALSE} |
| 158 | +# Undo the effect of library(duckplyr) |
| 159 | +methods_restore() |
| 160 | +``` |
| 161 | + |
| 162 | +We use `tpch_dplyr()` as defined above to run the query with dplyr. |
| 163 | +The function that runs it with duckplyr only wraps the input data in a duck frame and forwards it to the dplyr function. |
| 164 | +The `collect()` at the end is required only for this benchmark to ensure fairness.[^collect] |
| 165 | + |
| 166 | +[^collect]: If omitted, the results would be unchanged but the measurements would be wrong. The computation would then be triggered by the check. See `vignette("prudence")` for details. |
| 167 | + |
| 168 | +```{r} |
| 169 | +tpch_duckplyr <- function(lineitem) { |
| 170 | + lineitem |> |
| 171 | + duckplyr::as_duckdb_tibble() |> |
| 172 | + tpch_dplyr() |> |
| 173 | + collect() |
| 174 | +} |
| 175 | +``` |
| 176 | + |
| 177 | +And now we compare the two: |
| 178 | + |
| 179 | +```{r} |
| 180 | +bench::mark( |
| 181 | + tpch_dplyr(lineitem_tbl), |
| 182 | + tpch_duckplyr(lineitem_tbl), |
| 183 | + check = ~ all.equal(.x, .y, tolerance = 1e-10) |
| 184 | +) |
| 185 | +``` |
| 186 | + |
| 187 | +In this example, duckplyr is a lot faster than dplyr. |
| 188 | +It also appears to use much less memory, but this is misleading: DuckDB manages the memory, not R, so the memory usage is not visible to `bench::mark()`. |
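
To see why the explicit `collect()` matters for the timing, here is a small sketch of delayed materialization. It assumes the `.prudence` argument of `duckdb_tibble()` described in `vignette("prudence")`; the data and variable names are hypothetical.

```r
library(duckplyr)

# A "stingy" duck frame is assumed to refuse automatic materialization,
# which makes the point at which computation happens explicit.
lazy <- duckdb_tibble(x = 1:3, .prudence = "stingy") |>
  mutate(y = x * 2)

# Asking for the number of rows would require running the pipeline;
# with a stingy frame this is expected to fail rather than compute silently.
res <- try(nrow(lazy), silent = TRUE)
inherits(res, "try-error")

# collect() runs the query and returns a plain tibble
out <- collect(lazy)
out$y
```

In the benchmark, calling `collect()` inside `tpch_duckplyr()` plays the same role: it forces the query to run inside the timed expression.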
| 189 | + |
| 190 | +## Out-of-memory data |
| 191 | + |
| 192 | +As well as improved speed with in-memory datasets, duckplyr makes it easy to work with datasets that are too big to fit in memory. |
| 193 | +In this case, you want: |
| 194 | + |
| 195 | +1. To work with data stored in modern formats designed for large data (e.g. Parquet). |
| 196 | +1. To be able to store large intermediate results on disk, keeping them out of memory. |
| 197 | +1. Fast computation! |
| 198 | + |
| 199 | +duckplyr provides each of these features: |
| 200 | + |
| 201 | +1. You can read data from disk with functions like `read_parquet_duckdb()`. |
| 202 | +1. You can save intermediate results to disk with `compute_parquet()` and `compute_csv()`. |
| 203 | +1. duckplyr takes advantage of DuckDB's query planner, which considers your entire pipeline holistically to figure out the most efficient way to get the data you need. |
| 204 | + |
| 205 | +See `vignette("large")` for a walkthrough and more details. |
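
As a rough sketch of how these pieces fit together, the following example writes a toy dataset to Parquet, reads it back with `read_parquet_duckdb()`, and stores an intermediate result with `compute_parquet()`. The data, file paths, and variable names are made up for illustration; a real use case would point at files too large for memory.

```r
library(duckplyr)

# Write a small Parquet file to work with (stand-in for a large dataset)
path <- tempfile(fileext = ".parquet")
duckdb_tibble(x = 1:5, g = c("a", "b", "a", "b", "a")) |>
  compute_parquet(path)

# Read from disk: the data stays in DuckDB until it is actually needed
big <- read_parquet_duckdb(path)

# Store an intermediate result on disk instead of in memory
out_path <- tempfile(fileext = ".parquet")
res <- big |>
  filter(x > 1) |>
  summarise(total = sum(x), .by = g) |>
  compute_parquet(out_path)

# Only the small final result is brought into R
collected <- collect(res)
collected
```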
| 206 | + |
| 207 | +## Help us improve duckplyr! |
| 208 | + |
| 209 | +Our goals for future development of duckplyr include: |
| 210 | + |
| 211 | +- Enabling users to provide [custom translations](https://github.com/tidyverse/duckplyr/issues/158) of dplyr functionality; |
| 212 | +- Making it easier to contribute code to duckplyr; |
| 213 | +- Supporting more dplyr and tidyr functionality natively in DuckDB. |
| 214 | + |
| 215 | +You can help! |
| 216 | + |
| 217 | +- Please report any issues, especially regarding unknown incompatibilities. See `vignette("limits")`. |
| 218 | +- Contribute to the codebase after reading duckplyr's [contributing guide](https://duckplyr.tidyverse.org/CONTRIBUTING.html). |
| 219 | +- Turn on telemetry to help us hear about the most frequent fallbacks so we can prioritize working on the corresponding missing dplyr translation. See `vignette("telemetry")` and `duckplyr::fallback_sitrep()`. |
| 220 | + |
| 221 | +## Additional resources |
| 222 | + |
| 223 | +Eager to learn more about duckplyr -- besides trying it out yourself? |
| 224 | +The duckplyr website features several [articles](https://duckplyr.tidyverse.org/articles/). |
| 225 | +Furthermore, the blog post ["duckplyr: dplyr Powered by DuckDB"](https://duckdb.org/2024/04/02/duckplyr.html) by Hannes Mühleisen provides some context on duckplyr, including its inner workings. A [section](https://blog.r-hub.io/2025/02/13/lazy-meanings/#duckplyr-lazy-evaluation-and-prudence) of the R-hub blog post ["Lazy introduction to laziness in R"](https://blog.r-hub.io/2025/02/13/lazy-meanings/) by Maëlle Salmon, Athanasia Mo Mowinckel and Hannah Frick covers duckplyr's lazy evaluation and prudence. |
| 226 | + |
| 227 | +## Acknowledgements |
| 228 | + |
| 229 | +A big thanks to all folks who filed issues, created PRs and generally helped to improve duckplyr and its workhorse [duckdb](https://r.duckdb.org/)! |
| 230 | + |
| 231 | +[@adamschwing](https://github.com/adamschwing), [@alejandrohagan](https://github.com/alejandrohagan), [@andreranza](https://github.com/andreranza), [@apalacio9502](https://github.com/apalacio9502), [@apsteinmetz](https://github.com/apsteinmetz), [@barracuda156](https://github.com/barracuda156), [@beniaminogreen](https://github.com/beniaminogreen), [@bob-rietveld](https://github.com/bob-rietveld), [@brichards920](https://github.com/brichards920), [@cboettig](https://github.com/cboettig), [@davidjayjackson](https://github.com/davidjayjackson), [@DavisVaughan](https://github.com/DavisVaughan), [@Ed2uiz](https://github.com/Ed2uiz), [@eitsupi](https://github.com/eitsupi), [@era127](https://github.com/era127), [@etiennebacher](https://github.com/etiennebacher), [@eutwt](https://github.com/eutwt), [@fmichonneau](https://github.com/fmichonneau), [@hadley](https://github.com/hadley), [@hannes](https://github.com/hannes), [@hawkfish](https://github.com/hawkfish), [@IndrajeetPatil](https://github.com/IndrajeetPatil), [@JanSulavik](https://github.com/JanSulavik), [@JavOrraca](https://github.com/JavOrraca), [@jeroen](https://github.com/jeroen), [@jhk0530](https://github.com/jhk0530), [@joakimlinde](https://github.com/joakimlinde), [@JosiahParry](https://github.com/JosiahParry), [@kevbaer](https://github.com/kevbaer), [@larry77](https://github.com/larry77), [@lnkuiper](https://github.com/lnkuiper), [@lorenzwalthert](https://github.com/lorenzwalthert), [@lschneiderbauer](https://github.com/lschneiderbauer), [@luisDVA](https://github.com/luisDVA), [@math-mcshane](https://github.com/math-mcshane), [@meersel](https://github.com/meersel), [@multimeric](https://github.com/multimeric), [@mytarmail](https://github.com/mytarmail), [@nicki-dese](https://github.com/nicki-dese), [@PMassicotte](https://github.com/PMassicotte), [@prasundutta87](https://github.com/prasundutta87), [@rafapereirabr](https://github.com/rafapereirabr), [@Robinlovelace](https://github.com/Robinlovelace), 
[@romainfrancois](https://github.com/romainfrancois), [@sparrow925](https://github.com/sparrow925), [@stefanlinner](https://github.com/stefanlinner), [@szarnyasg](https://github.com/szarnyasg), [@thomasp85](https://github.com/thomasp85), [@TimTaylor](https://github.com/TimTaylor), [@Tmonster](https://github.com/Tmonster), [@toppyy](https://github.com/toppyy), [@wibeasley](https://github.com/wibeasley), [@yjunechoe](https://github.com/yjunechoe), [@ywhcuhk](https://github.com/ywhcuhk), [@zhjx19](https://github.com/zhjx19), [@ablack3](https://github.com/ablack3), [@actuarial-lonewolf](https://github.com/actuarial-lonewolf), [@ajdamico](https://github.com/ajdamico), [@amirmazmi](https://github.com/amirmazmi), [@anderson461123](https://github.com/anderson461123), [@andrewGhazi](https://github.com/andrewGhazi), [@Antonov548](https://github.com/Antonov548), [@appiehappie999](https://github.com/appiehappie999), [@ArthurAndrews](https://github.com/ArthurAndrews), [@arthurgailes](https://github.com/arthurgailes), [@babaknaimi](https://github.com/babaknaimi), [@bcaradima](https://github.com/bcaradima), [@bdforbes](https://github.com/bdforbes), [@bergest](https://github.com/bergest), [@bill-ash](https://github.com/bill-ash), [@BorgeJorge](https://github.com/BorgeJorge), [@brianmsm](https://github.com/brianmsm), [@chainsawriot](https://github.com/chainsawriot), [@ckarnes](https://github.com/ckarnes), [@clementlefevre](https://github.com/clementlefevre), [@cregouby](https://github.com/cregouby), [@cy-james-lee](https://github.com/cy-james-lee), [@daranzolin](https://github.com/daranzolin), [@david-cortes](https://github.com/david-cortes), [@DavZim](https://github.com/DavZim), [@denis-or](https://github.com/denis-or), [@developertest1234](https://github.com/developertest1234), [@dicorynia](https://github.com/dicorynia), [@dsolito](https://github.com/dsolito), [@e-kotov](https://github.com/e-kotov), [@EAVWing](https://github.com/EAVWing), 
[@eddelbuettel](https://github.com/eddelbuettel), [@edward-burn](https://github.com/edward-burn), [@elefeint](https://github.com/elefeint), [@eli-daniels](https://github.com/eli-daniels), [@elysabethpc](https://github.com/elysabethpc), [@erikvona](https://github.com/erikvona), [@florisvdh](https://github.com/florisvdh), [@gaborcsardi](https://github.com/gaborcsardi), [@ggrothendieck](https://github.com/ggrothendieck), [@hdmm3](https://github.com/hdmm3), [@hope-data-science](https://github.com/hope-data-science), [@IoannaNika](https://github.com/IoannaNika), [@jabrown-aepenergy](https://github.com/jabrown-aepenergy), [@JamesLMacAulay](https://github.com/JamesLMacAulay), [@jangorecki](https://github.com/jangorecki), [@javierlenzi](https://github.com/javierlenzi), [@Joe-Heffer-Shef](https://github.com/Joe-Heffer-Shef), [@kalibera](https://github.com/kalibera), [@lboller-pwbm](https://github.com/lboller-pwbm), [@lgaborini](https://github.com/lgaborini), [@m-muecke](https://github.com/m-muecke), [@meztez](https://github.com/meztez), [@mgirlich](https://github.com/mgirlich), [@mtmorgan](https://github.com/mtmorgan), [@nassuphis](https://github.com/nassuphis), [@nbc](https://github.com/nbc), [@olivroy](https://github.com/olivroy), [@pdet](https://github.com/pdet), [@phdjsep](https://github.com/phdjsep), [@pierre-lamarche](https://github.com/pierre-lamarche), [@r2evans](https://github.com/r2evans), [@ran-codes](https://github.com/ran-codes), [@rplsmn](https://github.com/rplsmn), [@Saarialho](https://github.com/Saarialho), [@SimonCoulombe](https://github.com/SimonCoulombe), [@tau31](https://github.com/tau31), [@thohan88](https://github.com/thohan88), [@ThomasSoeiro](https://github.com/ThomasSoeiro), [@timothygmitchell](https://github.com/timothygmitchell), [@vincentarelbundock](https://github.com/vincentarelbundock), [@VincentGuyader](https://github.com/VincentGuyader), [@wlangera](https://github.com/wlangera), [@xbasics](https://github.com/xbasics), 
[@xiaodaigh](https://github.com/xiaodaigh), [@xtimbeau](https://github.com/xtimbeau), [@yng-me](https://github.com/yng-me), [@Yousuf28](https://github.com/Yousuf28), [@yutannihilation](https://github.com/yutannihilation), and [@zcatav](https://github.com/zcatav) |
| 232 | + |
| 233 | +Special thanks to Joe Thorley ([@joethorley](https://github.com/joethorley)) for help with choosing the right words. |