|
57 | 57 | "\n", |
58 | 58 | "* The `0` and `9` tell us the \"divisions\" of the partitions. When the dataset is sorted by the index, these divisions are ranges to show which index values reside in each partition.\n", |
59 | 59 | "\n", |
60 | | - "We can signal to Dask that we'd like to actually obtain the data as `nested_pandas.NestedFrame` by using `compute`." |
| 60 | + "We can peek at the first `n` rows using `ndf.head(n)` (or the last few with `ndf.tail(n)`)." |
61 | 61 | ] |
62 | 62 | }, |
63 | 63 | { |
|
66 | 66 | "metadata": {}, |
67 | 67 | "outputs": [], |
68 | 68 | "source": [ |
69 | | - "ndf.compute() # or could use ndf.head(n) to peak at the first n rows" |
| 69 | + "ndf.head(3)" |
| 70 | + ] |
| 71 | + }, |
| 72 | + { |
| 73 | + "cell_type": "markdown", |
| 74 | + "metadata": {}, |
| 75 | + "source": [ |
| 76 | + "We can signal to Dask that we'd like to actually obtain *all* of the data as `nested_pandas.NestedFrame` by using `compute`." |
| 77 | + ] |
| 78 | + }, |
| 79 | + { |
| 80 | + "cell_type": "code", |
| 81 | + "execution_count": null, |
| 82 | + "metadata": {}, |
| 83 | + "outputs": [], |
| 84 | + "source": [ |
| 85 | + "ndf.compute()" |
70 | 86 | ] |
71 | 87 | }, |
72 | 88 | { |
|
134 | 150 | "metadata": {}, |
135 | 151 | "outputs": [], |
136 | 152 | "source": [ |
137 | | - "result.head(5).nested[0] # no t value lower than 17.0" |
| 153 | + "result.head(5).nested[0] # no `t` value is lower than 17.0" |
138 | 154 | ] |
139 | 155 | }, |
140 | 156 | { |
141 | 157 | "cell_type": "markdown", |
142 | 158 | "metadata": {}, |
143 | 159 | "source": [ |
144 | | - "Nested-Dask `reduce` functions near-identically to Nested-Pandas `reduce`, providing a way to call custom functions on `NestedFrame` data. The one addition is that we'll need to provide the Dask `meta` value for the result. This is a dataframe-like or series-like object that has the same structure as the expected output. Let's compute the mean flux for each dataframe in the \"nested\" column. " |
| 160 | + "Nested-Dask `reduce` functions near-identically to Nested-Pandas `reduce`, providing a way to call custom functions on `NestedFrame` data. The one additional concern is that Dask requires, in almost every case, a `meta=` argument describing the shape and type of the output data. Dask provides a `make_meta` function, to which you can pass a dummy output value." |
145 | 161 | ] |
146 | 162 | }, |
147 | 163 | { |
|
152 | 168 | "source": [ |
153 | 169 | "import numpy as np\n", |
154 | 170 | "import pandas as pd\n", |
| 171 | + "from dask.dataframe.utils import make_meta\n", |
155 | 172 | "\n", |
156 | | - "# The result will be a series with float values\n", |
157 | | - "meta = pd.DataFrame(columns=[0], dtype=float)\n", |
| 173 | + "# Use hierarchical column names to access the flux column\n", |
| 174 | + "# passed as an array to np.mean.\n", |
| 175 | + "#\n", |
| 176 | + "# Take a single sample row, computed (that's what .head(1) will do),\n", |
| 177 | + "# and generate the meta for it.\n", |
| 178 | + "meta = make_meta(ndf.head(1).reduce(np.mean, \"nested.flux\"))\n", |
158 | 179 | "\n", |
159 | | - "# use hierarchical column names to access the flux column\n", |
160 | | - "# passed as an array to np.mean\n", |
161 | 180 | "means = ndf.reduce(np.mean, \"nested.flux\", meta=meta)\n", |
162 | 181 | "means.compute()" |
163 | 182 | ] |
164 | 183 | }, |
| 184 | + { |
| 185 | + "cell_type": "markdown", |
| 186 | + "metadata": {}, |
| 187 | + "source": [ |
| 188 | + "The `reduce` function can also be used to apply any row-based calculation, even if the dimension stays the same. Observe that we can use a similar pattern to produce, say, the square of the flux. It is still a \"reduction\" in that the result is no longer within the original `NestedFrame` structure, but the cardinality of each output row is now the same as the cardinality of each input row." |
| 189 | + ] |
| 190 | + }, |
| 191 | + { |
| 192 | + "cell_type": "code", |
| 193 | + "execution_count": null, |
| 194 | + "metadata": {}, |
| 195 | + "outputs": [], |
| 196 | + "source": [ |
| 197 | + "meta = make_meta(ndf.head(1).reduce(np.square, \"nested.flux\"))\n", |
| 198 | + "\n", |
| 199 | + "flux_sq = ndf.reduce(np.square, \"nested.flux\", meta=meta)\n", |
| 200 | + "flux_sq.compute()" |
| 201 | + ] |
| 202 | + }, |
165 | 203 | { |
166 | 204 | "cell_type": "code", |
167 | 205 | "execution_count": null, |
|
172 | 210 | ], |
173 | 211 | "metadata": { |
174 | 212 | "kernelspec": { |
175 | | - "display_name": "Python 3", |
| 213 | + "display_name": "Python 3 (ipykernel)", |
176 | 214 | "language": "python", |
177 | 215 | "name": "python3" |
178 | 216 | }, |
|
186 | 224 | "name": "python", |
187 | 225 | "nbconvert_exporter": "python", |
188 | 226 | "pygments_lexer": "ipython3", |
189 | | - "version": "3.12.3" |
| 227 | + "version": "3.13.2" |
190 | 228 | } |
191 | 229 | }, |
192 | 230 | "nbformat": 4, |
193 | | - "nbformat_minor": 2 |
| 231 | + "nbformat_minor": 4 |
194 | 232 | } |