Commit 464bffa

Documentation notebook for groupby (#396)
* add documentation notebook for groupby
* clearing output
* try except for failing case & typo fix
* update tutorials.rst
* fix .iloc[] code format
1 parent a32df9c commit 464bffa

2 files changed

+289
-0
lines changed

docs/tutorials.rst

Lines changed: 1 addition & 0 deletions
@@ -7,3 +7,4 @@ Tutorials
7
Fine Data Manipulation with Nested-Pandas <tutorials/data_manipulation>
8
Lower-level interfaces <tutorials/low_level.ipynb>
9
Using Nested-Pandas with Astronomical Spectra <pre_executed/nested_spectra.ipynb>
10+
Using GroupBy with Nested-Pandas <tutorials/groupby_doc.ipynb>

docs/tutorials/groupby_doc.ipynb

Lines changed: 288 additions & 0 deletions
@@ -0,0 +1,288 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"id": "8e123e97",
6+
"metadata": {},
7+
"source": [
8+
"# GroupBy for NestedPandas\n",
9+
"\n",
10+
"This notebook explores how pandas' built-in `groupby` interacts with Nested-Pandas structures.\n",
11+
"\n",
12+
"Because Nested-Pandas extends pandas, the native ``pandas.DataFrame.groupby`` works on a ``NestedFrame`` out of the box, with the limitations described below."
13+
]
14+
},
15+
{
16+
"cell_type": "code",
17+
"execution_count": null,
18+
"id": "ccb69ebe",
19+
"metadata": {},
20+
"outputs": [],
21+
"source": [
22+
"# Example NestedFrame (nf) used throughout this notebook\n",
23+
"from nested_pandas.datasets import generate_data\n",
24+
"\n",
25+
"nf = generate_data(5, 10, seed=1)\n",
26+
"nf[\"c\"] = [0, 0, 1, 1, 1]\n",
27+
"nf"
28+
]
29+
},
30+
{
31+
"cell_type": "markdown",
32+
"id": "20933f5a",
33+
"metadata": {},
34+
"source": [
35+
"`groupby` works on *non-nested* columns and returns a pandas `GroupBy` object.\n",
36+
"Grouping by nested columns does **not** work: nested values are mutable objects and therefore unhashable.\n",
37+
"\n",
38+
"Use base columns as group keys, or first extract scalar identifiers from the nested data.\n",
39+
"\n"
40+
]
41+
},
42+
{
43+
"cell_type": "code",
44+
"execution_count": null,
45+
"id": "3a45b96e",
46+
"metadata": {},
47+
"outputs": [],
48+
"source": [
49+
"nf.groupby(\"c\") # returns a Pandas GroupBy object"
50+
]
51+
},
52+
{
53+
"cell_type": "markdown",
54+
"id": "a7b19d3d",
55+
"metadata": {},
56+
"source": [
57+
"## Basic Aggregations\n",
58+
"\n",
59+
"- Some built-in methods, such as `count`, run but may not do what you expect: they treat each nested entry as a single object.\n",
60+
"- Others (`min`, `max`, `mean`) fail on nested columns.\n",
61+
"- `describe` works as expected: the nested column is flattened automatically."
62+
]
63+
},
64+
{
65+
"cell_type": "code",
66+
"execution_count": null,
67+
"id": "0487e613",
68+
"metadata": {},
69+
"outputs": [],
70+
"source": [
71+
"# count treats each nested entry as a single object\n",
72+
"nf.groupby(\"c\").count()"
73+
]
74+
},
75+
{
76+
"cell_type": "code",
77+
"execution_count": null,
78+
"id": "fec633b6",
79+
"metadata": {},
80+
"outputs": [],
81+
"source": [
82+
"# min/max/mean fail on nested columns\n",
83+
"try:\n",
84+
"    grouped_min = nf.groupby(\"c\").min()\n",
85+
"    print(grouped_min)\n",
86+
"except TypeError as e:\n",
87+
"    print(f\"Cannot compute min on nested columns: {e}\")"
88+
]
89+
},
90+
{
91+
"cell_type": "code",
92+
"execution_count": null,
93+
"id": "201774f2",
94+
"metadata": {},
95+
"outputs": [],
96+
"source": [
97+
"# describe works as expected; the nested column is flattened automatically\n",
98+
"nf.groupby(\"c\").describe()"
99+
]
100+
},
101+
{
102+
"cell_type": "markdown",
103+
"id": "555bdaa3",
104+
"metadata": {},
105+
"source": [
106+
"## Type Preservation\n",
107+
"Within each group, the data remains a ``NestedFrame`` and its nested columns remain ``NestedSeries``.\n",
108+
"\n",
109+
"We can check this by applying a custom function on our 2-group `groupby` object:"
110+
]
111+
},
112+
{
113+
"cell_type": "code",
114+
"execution_count": null,
115+
"id": "6a02ee47",
116+
"metadata": {},
117+
"outputs": [],
118+
"source": [
119+
"# check the type\n",
120+
"def type_check(df):\n",
121+
"    print(\"Group DataFrame Type:\", type(df))\n",
122+
"    print(\"Nested Column Type:\", type(df[\"nested\"]))\n",
123+
"    print()\n",
124+
"    # return df\n",
125+
"\n",
126+
"\n",
127+
"nf.groupby(\"c\").apply(type_check, include_groups=False)"
128+
]
129+
},
130+
{
131+
"cell_type": "markdown",
132+
"id": "117ad069",
133+
"metadata": {},
134+
"source": [
135+
"An important note: when accessing a row of each group with `.iloc[]`, **numeric row-wise indexing** and **slice-based indexing** return different types."
136+
]
137+
},
138+
{
139+
"cell_type": "markdown",
140+
"id": "197f26e9",
141+
"metadata": {},
142+
"source": [
143+
"For a `NestedFrame`, row-wise indexing of the first row (`.iloc[0]`) collapses the result into a 1-D `pandas.Series`, with the nested column stored as a `DataFrame`. Slice-based indexing (`.iloc[0:1]`), by contrast, preserves the nested structure and returns the row as a `NestedFrame` whose nested column is still a `NestedSeries`."
144+
]
145+
},
146+
{
147+
"cell_type": "code",
148+
"execution_count": null,
149+
"id": "6639d290",
150+
"metadata": {},
151+
"outputs": [],
152+
"source": [
153+
"# check the full row type\n",
154+
"def row_type_check(df):\n",
155+
"    print(\"df.iloc[0]: \", type(df.iloc[0]))\n",
156+
"    print(\"df.iloc[0:1]:\", type(df.iloc[0:1]))\n",
157+
"    print(\"\\nAccessing the nested column both ways:\")\n",
158+
"    print(\"df.iloc[0] nested column:\", type(df.iloc[0][\"nested\"]))\n",
159+
"    print(\"df.iloc[0:1] nested column:\", type(df.iloc[0:1][\"nested\"]))\n",
160+
"    print()\n",
161+
"    # return df\n",
162+
"\n",
163+
"\n",
164+
"nf.groupby(\"c\").apply(row_type_check, include_groups=False)"
165+
]
166+
},
167+
{
168+
"cell_type": "markdown",
169+
"id": "ff198f4f",
170+
"metadata": {},
171+
"source": [
172+
"For a nested column of type `NestedSeries`, accessing a single row of `df[\"nested\"]` returns either a `pandas.DataFrame` (`.iloc[0]`) or a `pandas.Series` (`.iloc[0:1]`).\n",
173+
"\n",
174+
"Note that outside of `groupby`, `df[\"nested\"].iloc[0]` is likewise a `pandas.DataFrame`, which is expected.\n",
175+
"\n",
176+
"<!-- (NestedPandas stores the nested frames as serialized DataFrames?) -->"
177+
]
178+
},
179+
{
180+
"cell_type": "code",
181+
"execution_count": null,
182+
"id": "443447a1",
183+
"metadata": {},
184+
"outputs": [],
185+
"source": [
186+
"# check the nested row type\n",
187+
"def nested_row_type_check(df):\n",
188+
"    print('df[\"nested\"].iloc[0]:', type(df[\"nested\"].iloc[0]))\n",
189+
"    print('df[\"nested\"].iloc[0:1]:', type(df[\"nested\"].iloc[0:1]))\n",
190+
"    print()\n",
191+
"    # return df\n",
192+
"\n",
193+
"\n",
194+
"nf.groupby(\"c\").apply(nested_row_type_check, include_groups=False)"
195+
]
196+
},
197+
{
198+
"cell_type": "markdown",
199+
"id": "684b697c",
200+
"metadata": {},
201+
"source": [
202+
"Other operations generally preserve the nested structure. If you need to work directly with the contents of a nested column, you may need to flatten it first with `.nest.to_flat()`."
203+
]
204+
},
205+
{
206+
"cell_type": "markdown",
207+
"id": "a4ff3a6b",
208+
"metadata": {},
209+
"source": [
210+
"## Custom Functions with `apply`\n",
211+
"\n",
212+
"`.apply()` works natively with nested data. It generally succeeds if the function flattens the nested column or uses slice-based indexing so that the types match what each operation expects.\n",
213+
"\n",
214+
"Some examples:"
215+
]
216+
},
217+
{
218+
"cell_type": "code",
219+
"execution_count": null,
220+
"id": "023b771c",
221+
"metadata": {},
222+
"outputs": [],
223+
"source": [
224+
"# custom function to flatten nested column\n",
225+
"def flatten_nested(df):\n",
226+
"    return df[\"nested\"].nest.to_flat()\n",
227+
"\n",
228+
"\n",
229+
"nf.groupby(\"c\").apply(flatten_nested, include_groups=False)"
230+
]
231+
},
232+
{
233+
"cell_type": "code",
234+
"execution_count": null,
235+
"id": "4b0d063f",
236+
"metadata": {},
237+
"outputs": [],
238+
"source": [
239+
"import pandas as pd\n",
240+
"\n",
241+
"\n",
242+
"# custom function to perform aggregations on flattened nested column\n",
243+
"def mean_flux(df):\n",
244+
"    flat = df[\"nested\"].nest.to_flat()\n",
245+
"    return pd.Series({\"mean_flux\": flat[\"flux\"].mean(), \"mean_t\": flat[\"t\"].mean()})\n",
246+
"\n",
247+
"\n",
248+
"nf.groupby(\"c\").apply(mean_flux, include_groups=False)"
249+
]
250+
},
251+
{
252+
"cell_type": "markdown",
253+
"id": "63da5da8",
254+
"metadata": {},
255+
"source": [
256+
"## Summary\n",
257+
"- Always group by **base columns**, not nested columns. \n",
258+
"- Use **slice-based indexing** (`.iloc[0:1]`) to preserve nested types.\n",
259+
"- Use **`.nest.to_flat()`** to flatten a nested column when needed for numerical or aggregation operations.\n",
260+
"\n",
261+
"- Nested structures are designed to reduce the need for expensive groupby operations by allowing data to stay organized hierarchically. However, when grouping is necessary, pandas’ groupby still works with nested-pandas and maintains type consistency.\n",
262+
"\n",
263+
"- Some use cases may behave unexpectedly because of the nested structures. Please open an issue if you run into unexpected behavior or edge cases.\n"
264+
]
265+
}
266+
],
267+
"metadata": {
268+
"kernelspec": {
269+
"display_name": ".venv",
270+
"language": "python",
271+
"name": "python3"
272+
},
273+
"language_info": {
274+
"codemirror_mode": {
275+
"name": "ipython",
276+
"version": 3
277+
},
278+
"file_extension": ".py",
279+
"mimetype": "text/x-python",
280+
"name": "python",
281+
"nbconvert_exporter": "python",
282+
"pygments_lexer": "ipython3",
283+
"version": "3.13.8"
284+
}
285+
},
286+
"nbformat": 4,
287+
"nbformat_minor": 5
288+
}
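
The `.iloc[0]` versus `.iloc[0:1]` distinction described in the notebook mirrors how plain pandas behaves, so it can be sketched without nested-pandas installed. The snippet below is only an illustration under that assumption: it uses an ordinary `pandas.DataFrame` as a stand-in for a `NestedFrame`, and the column names `a` and `c` and the helper `summarize` are made up for this example and are not part of the commit.

```python
import pandas as pd

# Plain-pandas stand-in for the notebook's NestedFrame (assumption: no
# nested columns here; only the indexing and groupby behavior is shown).
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "c": [0, 0, 1]})

row = df.iloc[0]      # integer position: the row collapses to a 1-D Series
block = df.iloc[0:1]  # slice: the result stays a one-row DataFrame

print(type(row).__name__)
print(type(block).__name__)


def summarize(group):
    # Hypothetical helper: reduce each group to a Series of scalars,
    # in the same spirit as the notebook's mean_flux function.
    return pd.Series({"mean_a": group["a"].mean()})


# Selecting the value columns explicitly keeps the group key out of the
# frames passed to the function, whatever the pandas version.
result = df.groupby("c")[["a"]].apply(summarize)
print(result)
```

Returning a `pandas.Series` of scalars from the applied function gives one row per group, which is the same shape the notebook's `mean_flux` example produces after flattening with `.nest.to_flat()`.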
