Commit e21f357

Merge pull request #339 from lincc-frameworks/nestedseries_docs
`NestedSeries` documentation
2 parents cf6d152 + 5d1f35d commit e21f357

6 files changed

+180
-16
lines changed

docs/gettingstarted/quickstart.ipynb

Lines changed: 88 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -33,9 +33,9 @@
3333
"source": [
3434
"## Overview\n",
3535
"\n",
36-
"Nested-Pandas is tailored towards efficient analysis of nested data sets. This includes data that would normally be represented in a Pandas DataFrames with multiple rows needed to represent a single \"thing\" and therefor columns whose values will be identical for that item.\n",
36+
"Nested-pandas is tailored towards efficient analysis of nested data sets. This includes data that would normally be represented in a pandas DataFrame with multiple rows needed to represent a single \"thing\", and therefore with columns whose values are identical across those rows.\n",
3737
"\n",
38-
"As a concrete example, consider an astronomical data set storing information about observations of physical objects, such as stars and galaxies. One way to represent this in Pandas is to create one row per observation with an ID column indicating to which physical object the observation corresponds. However this approach ends up repeating a lot of data over each observation of the same object such as its location on the sky (RA, dec), its classification, etc. Further, any operations processing the data as time series requires the user to first perform a (potentially expensive) group-by operation to aggregate all of the data for each object.\n",
38+
"As a concrete example, consider an astronomical data set storing information about observations of physical objects, such as stars and galaxies. One way to represent this in pandas is to create one row per observation with an ID column indicating to which physical object the observation corresponds. However, this approach ends up repeating a lot of data over each observation of the same object, such as its location on the sky (RA, dec), its classification, etc. Further, any operation processing the data as time series requires the user to first perform a (potentially expensive) group-by operation to aggregate all of the data for each object.\n",
3939
"\n",
4040
"Let's create a flat pandas dataframe with three objects: object 0 has three observations, object 1 has three observations, and object 2 has 4 observations."
4141
]
@@ -56,6 +56,7 @@
5656
" \"dec\": [0.0, 0.0, 0.0, -1.0, -1.0, -1.0, 0.5, 0.5, 0.5, 0.5],\n",
5757
" \"time\": [60676.0, 60677.0, 60678.0, 60675.0, 60676.5, 60677.0, 60676.6, 60676.7, 60676.8, 60676.9],\n",
5858
" \"brightness\": [100.0, 101.0, 99.8, 5.0, 5.01, 4.98, 20.1, 20.5, 20.3, 20.2],\n",
59+
" \"band\": [\"g\", \"r\", \"g\", \"r\", \"g\", \"r\", \"g\", \"g\", \"r\", \"r\"],\n",
5960
" }\n",
6061
")\n",
6162
"my_data_frame"
@@ -86,7 +87,7 @@
8687
"nf = NestedFrame.from_flat(\n",
8788
" my_data_frame,\n",
8889
" base_columns=[\"ra\", \"dec\"], # the columns not to nest\n",
89-
" nested_columns=[\"time\", \"brightness\"], # the columns to nest\n",
90+
" nested_columns=[\"time\", \"brightness\", \"band\"], # the columns to nest\n",
9091
" on=\"id\", # column used to associate rows\n",
9192
" name=\"lightcurve\", # name of the nested column\n",
9293
")\n",
@@ -239,7 +240,7 @@
239240
"cell_type": "markdown",
240241
"metadata": {},
241242
"source": [
242-
"The above query is native Pandas, however with nested-pandas we can use hierarchical column names to extend `query` to nested layers."
243+
"The above query is native pandas, however with nested-pandas we can use hierarchical column names to extend `query` to nested layers."
243244
]
244245
},
245246
{
@@ -283,7 +284,7 @@
283284
"source": [
284285
"## Reduce Function\n",
285286
"\n",
286-
"Finally, we'll end with the flexible `reduce` function. `reduce` functions similarly to Pandas' `apply` but flattens (reduces) the inputs from nested layers into array inputs to the given apply function. For example, let's find the mean flux for each dataframe in \"nested\":"
287+
"Finally, we'll end with the flexible `reduce` function. `reduce` functions similarly to pandas' `apply` but flattens (reduces) the inputs from nested layers into array inputs to the given apply function. For example, let's find the mean flux for each dataframe in \"nested\":"
287288
]
288289
},
289290
{
@@ -341,11 +342,91 @@
341342
"source": [
342343
"nf_inputs.loc[0]"
343344
]
345+
},
346+
{
347+
"cell_type": "markdown",
348+
"metadata": {},
349+
"source": [
350+
"## Extended Series Operations with `NestedSeries`"
351+
]
352+
},
353+
{
354+
"cell_type": "markdown",
355+
"metadata": {},
356+
"source": [
357+
"In addition to the extended API offered by `NestedFrame` for DataFrame operations, nested-pandas provides `NestedSeries`, which extends Series operations to nested data."
358+
]
359+
},
360+
{
361+
"cell_type": "code",
362+
"execution_count": null,
363+
"metadata": {},
364+
"outputs": [],
365+
"source": [
366+
"# Single columns containing Nested Data are represented as NestedSeries\n",
367+
"type(nf[\"lightcurve\"])"
368+
]
369+
},
370+
{
371+
"cell_type": "code",
372+
"execution_count": null,
373+
"metadata": {},
374+
"outputs": [],
375+
"source": [
376+
"# It behaves just like a pandas Series\n",
377+
"nf[\"lightcurve\"]"
378+
]
379+
},
380+
{
381+
"cell_type": "markdown",
382+
"metadata": {},
383+
"source": [
384+
"`NestedSeries` offers some unique access patterns for getting data:"
385+
]
386+
},
387+
{
388+
"cell_type": "code",
389+
"execution_count": null,
390+
"metadata": {},
391+
"outputs": [],
392+
"source": [
393+
"# Accessing sub-columns\n",
394+
"nf[\"lightcurve\"][\"time\"] # Alternative to nf[\"lightcurve.time\"]"
395+
]
396+
},
397+
{
398+
"cell_type": "code",
399+
"execution_count": null,
400+
"metadata": {},
401+
"outputs": [],
402+
"source": [
403+
"# Multi-selecting sub-columns\n",
404+
"nf[\"lightcurve\"][[\"time\", \"brightness\"]]"
405+
]
406+
},
407+
{
408+
"cell_type": "markdown",
409+
"metadata": {},
410+
"source": [
411+
"### `NestedSeries` Masking"
412+
]
413+
},
414+
{
415+
"cell_type": "code",
416+
"execution_count": null,
417+
"metadata": {},
418+
"outputs": [],
419+
"source": [
420+
"# Using masks to filter nested data\n",
421+
"g_mask = nf[\"lightcurve\"][\"band\"] == \"g\"\n",
422+
"nf[\"lightcurve\"] = nf[\"lightcurve\"][g_mask]\n",
423+
"nf"
424+
]
344425
}
345426
],
346427
"metadata": {
347428
"kernelspec": {
348-
"display_name": "Python 3 (ipykernel)",
429+
"display_name": "lsdb",
349430
"language": "python",
350431
"name": "python3"
351432
},
@@ -359,7 +440,7 @@
359440
"name": "python",
360441
"nbconvert_exporter": "python",
361442
"pygments_lexer": "ipython3",
362-
"version": "3.13.3"
443+
"version": "3.12.8"
363444
}
364445
},
365446
"nbformat": 4,

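As a rough illustration of the flat layout that the quickstart overview describes, the sketch below uses plain pandas only. The `dec`, `time`, `brightness`, and `band` values are copied from the hunk above; the `id` column is an assumption inferred from the "three, three, and 4 observations" description, since it is not visible in the diff. It shows the repeated per-object values and the group-by step that the nested representation is meant to avoid.

```python
import pandas as pd

# Flat layout: one row per observation.
# NOTE: the "id" values are assumed (3, 3, and 4 observations per object);
# the other columns are taken from the quickstart diff.
flat = pd.DataFrame(
    {
        "id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
        "dec": [0.0, 0.0, 0.0, -1.0, -1.0, -1.0, 0.5, 0.5, 0.5, 0.5],
        "time": [60676.0, 60677.0, 60678.0, 60675.0, 60676.5, 60677.0, 60676.6, 60676.7, 60676.8, 60676.9],
        "brightness": [100.0, 101.0, 99.8, 5.0, 5.01, 4.98, 20.1, 20.5, 20.3, 20.2],
        "band": ["g", "r", "g", "r", "g", "r", "g", "g", "r", "r"],
    }
)

# Per-object values such as "dec" are repeated on every observation row,
# so each object has exactly one distinct value.
distinct_dec_per_object = flat.groupby("id")["dec"].nunique()

# Any per-object statistic needs a group-by over the flat table first.
mean_brightness = flat.groupby("id")["brightness"].mean()
print(mean_brightness)
```

With `NestedFrame.from_flat`, as used in the notebook, the per-object grouping is done once when the frame is built rather than on every query.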
docs/reference.rst

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,7 @@ API Reference
55
:maxdepth: 2
66

77
NestedFrame <reference/nestedframe>
8+
NestedSeries <reference/nestedseries>
89
.nest Accessor <reference/accessor>
910
Utility Functions <reference/utils>
1011
NestedDtype <reference/nesteddtype>

docs/reference/nestedseries.rst

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,19 @@
1+
============
2+
NestedSeries
3+
============
4+
.. currentmodule:: nested_pandas
5+
6+
Constructor
7+
~~~~~~~~~~~
8+
.. autosummary::
9+
:toctree: api/
10+
11+
NestedSeries
12+
13+
Functions
14+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15+
.. autosummary::
16+
:toctree: api/
17+
18+
NestedSeries.to_lists
19+
NestedSeries.to_flat

docs/tutorials/data_manipulation.ipynb

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -243,7 +243,7 @@
243243
"outputs": [],
244244
"source": [
245245
"# Create a flat dataframe from our existing nested dataframe\n",
246-
"flat_df = ndf[\"nested\"].nest.to_flat()\n",
246+
"flat_df = ndf[\"nested\"].to_flat()\n",
247247
"\n",
248248
"# Nest our flat dataframe back into our original dataframe\n",
249249
"ndf[\"example\"] = flat_df\n",

docs/tutorials/low_level.ipynb

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -67,13 +67,15 @@
6767
"id": "767e8105fcafca0d",
6868
"metadata": {},
6969
"source": [
70-
"## Get access to different data views using `.nest` accessor\n",
70+
"## Get access to different data views using the `.nest` accessor\n",
7171
"\n",
7272
"`pandas` provides an interface to access series with custom \"accessors\" - special attributes acting like a different view on the data.\n",
7373
"You may already know [`.str` accessor](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#api-series-str) for strings or [`.dt` for datetime-like](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#timedelta-methods) data.\n",
7474
"Since v2.0, pandas also supports few accessors for `ArrowDtype`d Series, `.list` for list-arrays and `.struct` for struct-arrays.\n",
7575
"\n",
76-
"`nested-pandas` extends this concept and provides `.nest` accessor for `NestedDtype`d Series, which gives user an object to work with nested data more efficiently and flexibly."
76+
"`nested-pandas` extends this concept and provides the `.nest` accessor for `NestedDtype`d Series, which gives the user an object to work with nested data more efficiently and flexibly.\n",
77+
"\n",
78+
"> **Note**: The `.nest` accessor shares much of its API with `NestedSeries`, as `NestedSeries` uses the `.nest` accessor under the hood. As a result, many `.nest` operations can be used directly on a `NestedSeries` without invoking \"`.nest`\", but some lower-level functionality remains unique to the `.nest` accessor."
7779
]
7880
},
7981
{

src/nested_pandas/series/nestedseries.py

Lines changed: 67 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -1,11 +1,16 @@
1+
from functools import wraps
2+
13
import pandas as pd
24

35
from nested_pandas.series.dtype import NestedDtype
46

7+
__all__ = ["NestedSeries"]
8+
59

610
def nested_only(func):
711
"""Decorator to designate certain functions can only be used with NestedDtype."""
812

13+
@wraps(func) # This ensures the original function's metadata is preserved
914
def wrapper(*args, **kwargs):
1015
if not isinstance(args[0].dtype, NestedDtype):
1116
raise TypeError(f"'{func.__name__}' can only be used with a NestedDtype, not '{args[0].dtype}'.")
@@ -79,11 +84,67 @@ def __setitem__(self, key, value):
7984
return super().__setitem__(key, value)
8085

8186
@nested_only
82-
def to_flat(self):
83-
"""Convert to a flat dataframe representation of the nested series."""
84-
return self.nest.to_flat()
87+
def to_flat(self, fields: list[str] | None = None) -> pd.DataFrame:
88+
"""Convert nested series into dataframe of flat arrays.
89+
90+
Parameters
91+
----------
92+
fields : list[str] or None, optional
93+
Names of the fields to include. Default is None, which means all fields.
94+
95+
Returns
96+
-------
97+
pd.DataFrame
98+
Dataframe of flat arrays.
99+
100+
Examples
101+
--------
102+
103+
>>> from nested_pandas.datasets.generation import generate_data
104+
>>> nf = generate_data(5, 2, seed=1)
105+
106+
>>> nf["nested"].to_flat()
107+
t flux band
108+
0 8.38389 80.074457 r
109+
0 13.40935 89.460666 g
110+
1 13.70439 96.826158 g
111+
1 8.346096 8.504421 g
112+
2 4.089045 31.342418 g
113+
2 11.173797 3.905478 g
114+
3 17.562349 69.232262 r
115+
3 2.807739 16.983042 r
116+
4 0.547752 87.638915 g
117+
4 3.96203 87.81425 r
118+
119+
"""
120+
return self.nest.to_flat(fields=fields)
85121

86122
@nested_only
87-
def to_lists(self):
88-
"""Convert to a list representation of the nested series."""
89-
return self.nest.to_lists()
123+
def to_lists(self, fields: list[str] | None = None) -> pd.DataFrame:
124+
"""Convert nested series into dataframe of list-array columns.
125+
126+
Parameters
127+
----------
128+
fields : list[str] or None, optional
129+
Names of the fields to include. Default is None, which means all fields.
130+
131+
Returns
132+
-------
133+
pd.DataFrame
134+
Dataframe of list-arrays.
135+
136+
Examples
137+
--------
138+
139+
>>> from nested_pandas.datasets.generation import generate_data
140+
>>> nf = generate_data(5, 2, seed=1)
141+
142+
>>> nf["nested"].to_lists()
143+
t flux band
144+
0 [ 8.38389029 13.4093502 ] [80.07445687 89.46066635] ['r' 'g']
145+
1 [13.70439001 8.34609605] [96.82615757 8.50442114] ['g' 'g']
146+
2 [ 4.08904499 11.17379657] [31.34241782 3.90547832] ['g' 'g']
147+
3 [17.56234873 2.80773877] [69.23226157 16.98304196] ['r' 'r']
148+
4 [0.54775186 3.96202978] [87.63891523 87.81425034] ['g' 'r']
149+
"""
150+
return self.nest.to_lists(fields=fields)
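
The `@wraps(func)` line added to `nested_only` is likely what allows the new docstrings to surface in the generated API pages, since documentation tools typically read `__name__` and `__doc__` from the decorated object. The following is a minimal standard-library sketch of that effect; `guard_with_wraps`, `guard_without_wraps`, and the two stub functions are hypothetical stand-ins, not part of nested-pandas.

```python
from functools import wraps


def guard_with_wraps(func):
    """Hypothetical decorator that copies the wrapped function's metadata."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)

    return wrapper


def guard_without_wraps(func):
    """Same hypothetical decorator, but without functools.wraps."""

    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)

    return wrapper


@guard_with_wraps
def to_flat_stub():
    """Convert nested series into dataframe of flat arrays."""


@guard_without_wraps
def to_lists_stub():
    """Convert nested series into dataframe of list-array columns."""


# With wraps, the name and docstring of the original function survive.
print(to_flat_stub.__name__, "|", to_flat_stub.__doc__)
# Without wraps, both are replaced by those of the inner wrapper.
print(to_lists_stub.__name__, "|", to_lists_stub.__doc__)
```

`wraps` also sets `__wrapped__` on the returned function, which lets introspection tools recover the original signature.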
