Documentation notebook for groupby (#396)

Graciaaa3 · web-flow · commit 464bffab01c7 · 2025-11-13T17:29:52.000-08:00
* add documentation notebook for groupby

* clearing output

* try except for failing case & typo fix

* update tutorials.rst

* fix .iloc[] code format
diff --git a/docs/tutorials.rst b/docs/tutorials.rst
@@ -7,3 +7,4 @@ Tutorials
     Fine Data Manipulation with Nested-Pandas <tutorials/data_manipulation>
     Lower-level interfaces <tutorials/low_level.ipynb>
     Using Nested-Pandas with Astronomical Spectra <pre_executed/nested_spectra.ipynb>
+    Using GroupBy with Nested-Pandas <tutorials/groupby_doc.ipynb>
diff --git a/docs/tutorials/groupby_doc.ipynb b/docs/tutorials/groupby_doc.ipynb
@@ -0,0 +1,288 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "8e123e97",
+   "metadata": {},
+   "source": [
+    "# GroupBy for NestedPandas\n",
+    "\n",
+    "This notebook explores how Pandas' built-in `groupby` interacts with Nested-Pandas structures.\n",
+    "\n",
+    "Because Nested-Pandas extends the Pandas library, the native `pandas.DataFrame.groupby` works on a `NestedFrame` out of the box, although some operations behave differently on nested columns."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ccb69ebe",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Example NestedFrame (nf) used throughout this notebook\n",
+    "from nested_pandas.datasets import generate_data\n",
+    "\n",
+    "nf = generate_data(5, 10, seed=1)\n",
+    "nf[\"c\"] = [0, 0, 1, 1, 1]\n",
+    "nf"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "20933f5a",
+   "metadata": {},
+   "source": [
+    "`groupby` works on *non-nested* columns and returns a pandas `DataFrameGroupBy` object.  \n",
+    "Grouping by nested columns does **not** work: nested values are mutable objects and therefore unhashable.\n",
+    "\n",
+    "Use base columns as group keys, or extract scalar identifiers from the nested data first.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3a45b96e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "nf.groupby(\"c\")  # returns a Pandas GroupBy object"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a7b19d3d",
+   "metadata": {},
+   "source": [
+    "## Basic Aggregations\n",
+    "\n",
+    "- Some built-in methods such as `count` run, but not as you might expect: each nested value is counted as a single object.\n",
+    "- Others (`min`, `max`, `mean`) fail on nested columns.\n",
+    "- `describe` works as expected because the nested column is flattened automatically."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0487e613",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# count treats each nested value as a single object\n",
+    "nf.groupby(\"c\").count()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fec633b6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# min/max/mean fail on nested columns\n",
+    "try:\n",
+    "    grouped_min = nf.groupby(\"c\").min()\n",
+    "    print(grouped_min)\n",
+    "except TypeError as e:\n",
+    "    print(f\"Cannot compute min on nested columns: {e}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "201774f2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# describe works as expected; the nested column is flattened automatically\n",
+    "nf.groupby(\"c\").describe()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "555bdaa3",
+   "metadata": {},
+   "source": [
+    "## Type Preservation\n",
+    "Within each group, the data remains a `NestedFrame` and its nested columns remain `NestedSeries`.\n",
+    "\n",
+    "We can check this by applying a custom function on our 2-group `groupby` object:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6a02ee47",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# check the type\n",
+    "def type_check(df):\n",
+    "    print(\"Group DataFrame Type:\", type(df))\n",
+    "    print(\"Nested Column Type:\", type(df[\"nested\"]))\n",
+    "    print()\n",
+    "    # return df\n",
+    "\n",
+    "\n",
+    "nf.groupby(\"c\").apply(type_check, include_groups=False)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "117ad069",
+   "metadata": {},
+   "source": [
+    "Note that when accessing a row of a group with `.iloc[]`, **numeric row-wise indexing** and **slice-based indexing** return different types."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "197f26e9",
+   "metadata": {},
+   "source": [
+    "For a `NestedFrame`, row-wise indexing of the first row (`.iloc[0]`) collapses the result into a 1-D `pandas.Series`, with the nested column stored as a `DataFrame`. Slice-based indexing (`.iloc[0:1]`), by contrast, preserves the nested structure: the row is returned as a `NestedFrame` and its nested column is still a `NestedSeries`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6639d290",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# check the full row type\n",
+    "def row_type_check(df):\n",
+    "    print(\"df.iloc[0]: \", type(df.iloc[0]))\n",
+    "    print(\"df.iloc[0:1]:\", type(df.iloc[0:1]))\n",
+    "    print(\"\\nAccessing the nested column both ways:\")\n",
+    "    print(\"df.iloc[0] nested column:\", type(df.iloc[0][\"nested\"]))\n",
+    "    print(\"df.iloc[0:1] nested column:\", type(df.iloc[0:1][\"nested\"]))\n",
+    "    print()\n",
+    "    # return df\n",
+    "\n",
+    "\n",
+    "nf.groupby(\"c\").apply(row_type_check, include_groups=False)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ff198f4f",
+   "metadata": {},
+   "source": [
+    "For a nested column of type `NestedSeries`, accessing a single row of `df[\"nested\"]` returns either a `pandas.DataFrame` (`.iloc[0]`) or a `pandas.Series` (`.iloc[0:1]`).\n",
+    "\n",
+    "Note that outside of `groupby`, `df[\"nested\"].iloc[0]` is also a `pandas.DataFrame`, as expected.\n",
+    "\n",
+    "<!-- (NestedPandas stores the nested frames as serialized DataFrames?) -->"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "443447a1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# check the nested row type\n",
+    "def nested_row_type_check(df):\n",
+    "    print('df[\"nested\"].iloc[0]:', type(df[\"nested\"].iloc[0]))\n",
+    "    print('df[\"nested\"].iloc[0:1]:', type(df[\"nested\"].iloc[0:1]))\n",
+    "    print()\n",
+    "    # return df\n",
+    "\n",
+    "\n",
+    "nf.groupby(\"c\").apply(nested_row_type_check, include_groups=False)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "684b697c",
+   "metadata": {},
+   "source": [
+    "Most other operations preserve the nested structure. To work with the contents of a nested column directly, you may need to flatten it first with `.nest.to_flat()`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a4ff3a6b",
+   "metadata": {},
+   "source": [
+    "## Custom Functions with `apply`\n",
+    "\n",
+    "`.apply()` works natively with nested data. It generally succeeds when the function flattens the nested column or uses slice-based indexing so that the types match what the operation expects.\n",
+    "\n",
+    "Some examples:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "023b771c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# custom function to flatten nested column\n",
+    "def flatten_nested(df):\n",
+    "    return df[\"nested\"].nest.to_flat()\n",
+    "\n",
+    "\n",
+    "nf.groupby(\"c\").apply(flatten_nested, include_groups=False)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4b0d063f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "\n",
+    "# custom function to perform aggregations on flattened nested column\n",
+    "def mean_flux(df):\n",
+    "    flat = df[\"nested\"].nest.to_flat()\n",
+    "    return pd.Series({\"mean_flux\": flat[\"flux\"].mean(), \"mean_t\": flat[\"t\"].mean()})\n",
+    "\n",
+    "\n",
+    "nf.groupby(\"c\").apply(mean_flux, include_groups=False)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "63da5da8",
+   "metadata": {},
+   "source": [
+    "## Summary\n",
+    "- Always group by **base columns**, not nested columns.  \n",
+    "- Use **slice-based indexing** (`.iloc[0:1]`) to preserve nested types.\n",
+    "- Use **`.nest.to_flat()`** to flatten a nested column when needed for numerical or aggregating operations.\n",
+    "\n",
+    "- Nested structures are designed to reduce the need for expensive groupby operations by keeping data organized hierarchically. When grouping is necessary, however, pandas’ `groupby` still works with Nested-Pandas and maintains type consistency.\n",
+    "\n",
+    "- Some use cases may behave unexpectedly because of the nested structures. Please open an issue if you run into unexpected behavior or edge cases.\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": ".venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.13.8"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}