Investigate why access performance isn't improved uniformly by repacking metadata

On the files we tested over Antarctica, repacking the metadata with `h5repack` didn't dramatically improve access times, especially for xarray and h5py. These granules contained a lot of data: each was around 6GB, with ~7MB of metadata. They were selected and processed using this [notebook](https://github.com/ICESAT-2HackWeek/h5cloud/blob/main/notebooks/01_data-selection.ipynb).

e.g. [`ATL03_20181120182818_08110112_006_02.h5`](https://urs.earthdata.nasa.gov/oauth/authorize?client_id=ntD0YGC_SM3Bjs-Tnxd7bg&response_type=code&redirect_uri=https://data.nsidc.earthdatacloud.nasa.gov/login&state=%2Fnsidc-cumulus-prod-protected%2FATLAS%2FATL03%2F006%2F2018%2F11%2F20%2FATL03_20181120182818_08110112_006_02.h5) ~7GB in size and 7MB of metadata

> Note: The S3 bucket with the original data is gone but can be easily recreated.

![https://raw.githubusercontent.com/ICESAT-2HackWeek/h5cloud/1f3441190951e5a2da74611f1196a657db7035bd/notebooks/arr_mean_bar_plot.png](https://raw.githubusercontent.com/ICESAT-2HackWeek/h5cloud/1f3441190951e5a2da74611f1196a657db7035bd/notebooks/arr_mean_bar_plot.png)


However, for other granules with less data, repacking gave a 10X improvement for xarray.

e.g. [`ATL03_20220201060852_06261401_005_01.h5`](https://data.nsidc.earthdatacloud.nasa.gov/nsidc-cumulus-prod-protected/ATLAS/ATL03/005/2022/02/01/ATL03_20220201060852_06261401_005_01.h5) ~500MB in size and 3MB of metadata


After applying `h5repack` to both files, the access time for the first one does not improve with xarray, but for the second granule it drops from 1 minute to 5 seconds. Why?

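```python
import s3fs
import xarray as xr

# Assumed setup: `s3` is an authenticated s3fs filesystem and `file` is the
# S3 path of the granule being tested.
s3 = s3fs.S3FileSystem()
group = '/gt2l/heights'
variable = 'h_ph'

with s3.open(file, 'rb') as file_stream:
    ds = xr.open_dataset(file_stream, group=group, engine='h5netcdf')
    # Reduce inside the `with` block, while the stream is still open.
    variable_mean = ds[variable].mean()
```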
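To dig into the difference between the two granules, it may help to record how many reads are issued and how long they take, not just the wall-clock total. The following is a minimal sketch using only the standard library. `CountingStream`, `timed`, and `scattered_reads` are hypothetical names, and an in-memory `io.BytesIO` stands in for the S3 stream so it runs without credentials. It has not been tested with `h5netcdf`, which may call methods on the stream that this wrapper simply delegates without counting.

```python
import io
import time


class CountingStream:
    """Wrap a binary file-like object and count the reads issued against it."""

    def __init__(self, raw):
        self._raw = raw
        self.reads = 0       # number of read()/readinto() calls
        self.bytes_read = 0  # total bytes returned by those calls

    def read(self, size=-1):
        data = self._raw.read(size)
        self.reads += 1
        self.bytes_read += len(data)
        return data

    def readinto(self, buffer):
        n = self._raw.readinto(buffer)
        self.reads += 1
        self.bytes_read += n or 0
        return n

    def __getattr__(self, name):
        # Delegate seek(), tell(), close(), etc. to the wrapped object.
        return getattr(self._raw, name)


def timed(fn, *args, **kwargs):
    """Return (result, elapsed_seconds) for a single call to fn."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def scattered_reads(f):
    # Mimic a reader that fetches small pieces from different offsets.
    for offset in (0, 1024, 2048):
        f.seek(offset)
        f.read(512)


# Stand-in for an S3 stream (assumption: 4 KiB of zero bytes).
stream = CountingStream(io.BytesIO(b"\x00" * 4096))
_, elapsed = timed(scattered_reads, stream)
print(f"{stream.reads} reads, {stream.bytes_read} bytes, {elapsed:.6f} s")
```

Running the same measurement against the original and repacked versions of each granule would show whether the slow case is dominated by many small reads or by the volume of data transferred.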


I'm going to repack the original files and put them in a more durable bucket, along with more examples from other NASA datasets.

Maybe @ajelenak has some clues on why this may be happening.

 


```[tasklist]
### Tasks
- [ ] https://github.com/ICESAT-2HackWeek/h5cloud/issues/28
- [ ] https://github.com/ICESAT-2HackWeek/h5cloud/issues/29
- [ ] https://github.com/ICESAT-2HackWeek/h5cloud/issues/27
- [ ] https://github.com/ICESAT-2HackWeek/h5cloud/issues/25
- [ ] https://github.com/ICESAT-2HackWeek/h5cloud/issues/24
- [ ] https://github.com/ICESAT-2HackWeek/h5cloud/issues/23
- [ ] https://github.com/ICESAT-2HackWeek/h5cloud/issues/26
```


