|
33 | 33 | "source": [ |
34 | 34 | "## Overview\n", |
35 | 35 | "\n", |
36 | | - "Nested-Pandas is tailored towards efficient analysis of nested data sets. This includes data that would normally be represented in a Pandas DataFrames with multiple rows needed to represent a single \"thing\" and therefor columns whose values will be identical for that item.\n", |
| 36 | + "Nested-pandas is tailored towards efficient analysis of nested data sets. This includes data that would normally be represented in a pandas DataFrame with multiple rows needed to represent a single \"thing\", and therefore with columns whose values are identical across those rows.\n", |
37 | 37 | "\n", |
38 | | - "As a concrete example, consider an astronomical data set storing information about observations of physical objects, such as stars and galaxies. One way to represent this in Pandas is to create one row per observation with an ID column indicating to which physical object the observation corresponds. However this approach ends up repeating a lot of data over each observation of the same object such as its location on the sky (RA, dec), its classification, etc. Further, any operations processing the data as time series requires the user to first perform a (potentially expensive) group-by operation to aggregate all of the data for each object.\n", |
| 38 | + "As a concrete example, consider an astronomical data set storing information about observations of physical objects, such as stars and galaxies. One way to represent this in pandas is to create one row per observation with an ID column indicating to which physical object the observation corresponds. However, this approach ends up repeating a lot of data over each observation of the same object, such as its location on the sky (RA, dec), its classification, etc. Further, any operation that processes the data as time series requires the user to first perform a (potentially expensive) group-by operation to aggregate all of the data for each object.\n", |
39 | 39 | "\n", |
40 | 40 | "Let's create a flat pandas dataframe with three objects: object 0 has three observations, object 1 has three observations, and object 2 has four observations." |
41 | 41 | ] |
|
56 | 56 | " \"dec\": [0.0, 0.0, 0.0, -1.0, -1.0, -1.0, 0.5, 0.5, 0.5, 0.5],\n", |
57 | 57 | " \"time\": [60676.0, 60677.0, 60678.0, 60675.0, 60676.5, 60677.0, 60676.6, 60676.7, 60676.8, 60676.9],\n", |
58 | 58 | " \"brightness\": [100.0, 101.0, 99.8, 5.0, 5.01, 4.98, 20.1, 20.5, 20.3, 20.2],\n", |
| 59 | + " \"band\": [\"g\", \"r\", \"g\", \"r\", \"g\", \"r\", \"g\", \"g\", \"r\", \"r\"],\n", |
59 | 60 | " }\n", |
60 | 61 | ")\n", |
61 | 62 | "my_data_frame" |
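
The redundancy and the group-by cost described in the Overview can be seen with plain pandas alone. The sketch below is only an illustration: the `id` and `ra` values are assumptions (those columns fall outside the lines shown in this hunk), and the names `flat`, `unique_objects`, and `obs_per_object` are hypothetical.

```python
import pandas as pd

# Flat layout: one row per observation. The "id" and "ra" values are assumed
# for illustration; "dec" and "time" mirror the values shown in the hunk.
flat = pd.DataFrame(
    {
        "id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
        "ra": [10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 11.2, 11.2, 11.2, 11.2],
        "dec": [0.0, 0.0, 0.0, -1.0, -1.0, -1.0, 0.5, 0.5, 0.5, 0.5],
        "time": [60676.0, 60677.0, 60678.0, 60675.0, 60676.5, 60677.0,
                 60676.6, 60676.7, 60676.8, 60676.9],
    }
)

# Per-object metadata is repeated on every observation row.
unique_objects = flat[["id", "ra", "dec"]].drop_duplicates()
repeated_rows = len(flat) - len(unique_objects)

# Any per-object analysis first needs a group-by over the whole table.
obs_per_object = flat.groupby("id").size()
```

Here seven of the ten rows carry metadata that already appears elsewhere in the table, which is the duplication a nested layout is meant to avoid.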
|
86 | 87 | "nf = NestedFrame.from_flat(\n", |
87 | 88 | " my_data_frame,\n", |
88 | 89 | " base_columns=[\"ra\", \"dec\"], # the columns not to nest\n", |
89 | | - " nested_columns=[\"time\", \"brightness\"], # the columns to nest\n", |
| 90 | + " nested_columns=[\"time\", \"brightness\", \"band\"], # the columns to nest\n", |
90 | 91 | " on=\"id\", # column used to associate rows\n", |
91 | 92 | " name=\"lightcurve\", # name of the nested column\n", |
92 | 93 | ")\n", |
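
As a rough mental model of what the `from_flat` call does with `base_columns`, `nested_columns`, and `on`, the same split can be imitated in plain pandas by collecting the nested columns into per-object lists. This is a conceptual sketch only, not how nested-pandas stores data internally; the small frame and the names `base`, `nested_lists`, and `packed` are hypothetical.

```python
import pandas as pd

# A small flat table; values are illustrative, not the notebook's data.
flat = pd.DataFrame(
    {
        "id": [0, 0, 1],
        "ra": [10.0, 10.0, 11.2],
        "dec": [0.0, 0.0, 0.5],
        "time": [1.0, 2.0, 1.5],
        "brightness": [100.0, 101.0, 20.1],
        "band": ["g", "r", "g"],
    }
)

# Base columns: keep one value per object.
base = flat.groupby("id")[["ra", "dec"]].first()

# Nested columns: gather every observation of an object into a list.
nested_lists = flat.groupby("id")[["time", "brightness", "band"]].agg(list)

# One row per object, with the observations packed alongside the metadata.
packed = base.join(nested_lists)
```

The result has one row per `id`, which is the shape the `NestedFrame` presents, although the real library uses a dedicated nested dtype rather than Python lists.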
|
239 | 240 | "cell_type": "markdown", |
240 | 241 | "metadata": {}, |
241 | 242 | "source": [ |
242 | | - "The above query is native Pandas, however with nested-pandas we can use hierarchical column names to extend `query` to nested layers." |
| 243 | + "The above query is native pandas; however, with nested-pandas, we can use hierarchical column names to extend `query` to nested layers." |
243 | 244 | ] |
244 | 245 | }, |
245 | 246 | { |
|
283 | 284 | "source": [ |
284 | 285 | "## Reduce Function\n", |
285 | 286 | "\n", |
286 | | - "Finally, we'll end with the flexible `reduce` function. `reduce` functions similarly to Pandas' `apply` but flattens (reduces) the inputs from nested layers into array inputs to the given apply function. For example, let's find the mean flux for each dataframe in \"nested\":" |
| 287 | + "Finally, we'll end with the flexible `reduce` function. `reduce` functions similarly to pandas' `apply` but flattens (reduces) the inputs from nested layers into array inputs to the given apply function. For example, let's find the mean flux for each dataframe in \"nested\":" |
287 | 288 | ] |
288 | 289 | }, |
289 | 290 | { |
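
To make the "array inputs" idea in the `reduce` description concrete, here is a plain-pandas approximation: a user function receives one NumPy array per object and returns a single value. It assumes a flat table and uses `groupby(...).apply`, so it is a stand-in for illustration rather than the `reduce` API itself; `flat`, `mean_flux`, and `per_object_mean` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Illustrative flat table with two objects, three observations each.
flat = pd.DataFrame(
    {
        "id": [0, 0, 0, 1, 1, 1],
        "brightness": [100.0, 101.0, 99.8, 5.0, 5.01, 4.98],
    }
)


def mean_flux(flux):
    """Take one object's measurements as an array and return their mean."""
    return float(np.mean(flux))


# Hand each object's values to the function as a NumPy array.
per_object_mean = flat.groupby("id")["brightness"].apply(
    lambda s: mean_flux(s.to_numpy())
)
```

The user function never sees the grouping machinery, only flat arrays, which is the property the notebook text attributes to `reduce`.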
|
341 | 342 | "source": [ |
342 | 343 | "nf_inputs.loc[0]" |
343 | 344 | ] |
| 345 | + }, |
| 346 | + { |
| 347 | + "cell_type": "markdown", |
| 348 | + "metadata": {}, |
| 349 | + "source": [ |
| 350 | + "## Extended Series Operations with `NestedSeries`" |
| 351 | + ] |
| 352 | + }, |
| 353 | + { |
| 354 | + "cell_type": "markdown", |
| 355 | + "metadata": {}, |
| 356 | + "source": [ |
| 357 | + "In addition to the extended API offered by the `NestedFrame` for DataFrame operations, nested-pandas provides the `NestedSeries`, which extends Series operations to nested data." |
| 358 | + ] |
| 359 | + }, |
| 360 | + { |
| 361 | + "cell_type": "code", |
| 362 | + "execution_count": null, |
| 363 | + "metadata": {}, |
| 364 | + "outputs": [], |
| 365 | + "source": [ |
| 366 | + "# Single columns containing nested data are represented as NestedSeries\n", |
| 367 | + "type(nf[\"lightcurve\"])" |
| 368 | + ] |
| 369 | + }, |
| 370 | + { |
| 371 | + "cell_type": "code", |
| 372 | + "execution_count": null, |
| 373 | + "metadata": {}, |
| 374 | + "outputs": [], |
| 375 | + "source": [ |
| 376 | + "# It behaves just like a pandas Series\n", |
| 377 | + "nf[\"lightcurve\"]" |
| 378 | + ] |
| 379 | + }, |
| 380 | + { |
| 381 | + "cell_type": "markdown", |
| 382 | + "metadata": {}, |
| 383 | + "source": [ |
| 384 | + "`NestedSeries` offers some unique access patterns for getting data:" |
| 385 | + ] |
| 386 | + }, |
| 387 | + { |
| 388 | + "cell_type": "code", |
| 389 | + "execution_count": null, |
| 390 | + "metadata": {}, |
| 391 | + "outputs": [], |
| 392 | + "source": [ |
| 393 | + "# Accessing sub-columns\n", |
| 394 | + "nf[\"lightcurve\"][\"time\"] # Alternative to nf[\"lightcurve.time\"]" |
| 395 | + ] |
| 396 | + }, |
| 397 | + { |
| 398 | + "cell_type": "code", |
| 399 | + "execution_count": null, |
| 400 | + "metadata": {}, |
| 401 | + "outputs": [], |
| 402 | + "source": [ |
| 403 | + "# Multi-selecting sub-columns\n", |
| 404 | + "nf[\"lightcurve\"][[\"time\", \"brightness\"]]" |
| 405 | + ] |
| 406 | + }, |
| 407 | + { |
| 408 | + "cell_type": "markdown", |
| 409 | + "metadata": {}, |
| 410 | + "source": [ |
| 411 | + "### `NestedSeries` Masking" |
| 412 | + ] |
| 413 | + }, |
| 414 | + { |
| 415 | + "cell_type": "code", |
| 416 | + "execution_count": null, |
| 417 | + "metadata": {}, |
| 418 | + "outputs": [], |
| 419 | + "source": [ |
| 420 | + "# Using masks to filter nested data\n", |
| 421 | + "g_mask = nf[\"lightcurve\"][\"band\"] == \"g\"\n", |
| 422 | + "nf[\"lightcurve\"] = nf[\"lightcurve\"][g_mask]\n", |
| 423 | + "nf" |
| 424 | + ] |
344 | 425 | } |
345 | 426 | ], |
346 | 427 | "metadata": { |
347 | 428 | "kernelspec": { |
348 | | - "display_name": "Python 3 (ipykernel)", |
| 429 | + "display_name": "lsdb", |
349 | 430 | "language": "python", |
350 | 431 | "name": "python3" |
351 | 432 | }, |
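
For comparison with the `NestedSeries` masking cell added in this hunk, the flat-table equivalent in plain pandas is an ordinary boolean mask over observation rows. The sketch below reuses the `band` values from the diff and assumes the `id` assignment implied by the Overview text (three, three, and four observations); `flat`, `g_only`, and `g_counts` are hypothetical names.

```python
import pandas as pd

# "band" values are taken from the diff; the "id" column is assumed.
flat = pd.DataFrame(
    {
        "id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
        "band": ["g", "r", "g", "r", "g", "r", "g", "g", "r", "r"],
    }
)

# Boolean mask over observation rows, analogous to g_mask in the notebook.
g_mask = flat["band"] == "g"
g_only = flat[g_mask]

# Number of g-band observations that survive for each object.
g_counts = g_only.groupby("id").size()
```

Under that assumption, objects 0, 1, and 2 keep two, one, and two g-band observations respectively, which gives a quick sanity check for the nested result.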
|
359 | 440 | "name": "python", |
360 | 441 | "nbconvert_exporter": "python", |
361 | 442 | "pygments_lexer": "ipython3", |
362 | | - "version": "3.13.3" |
| 443 | + "version": "3.12.8" |
363 | 444 | } |
364 | 445 | }, |
365 | 446 | "nbformat": 4, |
|