Commit

Built site for gh-pages
Quarto GHA Workflow Runner committed Dec 19, 2024
1 parent d6e15a5 commit 177246c
Showing 5 changed files with 94 additions and 93 deletions.
2 changes: 1 addition & 1 deletion .nojekyll
@@ -1 +1 @@
86173dc3
130 changes: 65 additions & 65 deletions _tex/paper.tex
@@ -403,14 +403,14 @@ \section{Methodology}\label{methodology}
As we can see in Figure~\ref{fig-2}, when we cloud optimize a file using
paged aggregation there are some considerations and behaviors that we
had to take into account. The first thing to observe is that page
aggregation will -- as we mentioned -- consolidate the file-level
metadata at the front of the file and will add information in the
so-called superblock\footnote{The HDF5 superblock is a crucial component
of the HDF5 file format, acting as the starting point for accessing
all data within the file. It stores important metadata such as the
version of the file format, pointers to the root group, and addresses
for locating different file components.}. The next thing to notice is
that a single page size is used across the board for metadata and data;
as of October 2024 and version 1.14 of the HDF5 library, the page size
cannot dynamically adjust to the total metadata size.
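
To make this concrete, the sketch below shows one way to create a
paged-aggregated file from Python with \texttt{h5py}, which exposes the
corresponding HDF5 file-space creation properties; the file name,
dataset, and sizes are illustrative rather than taken from the ATL03
production code.

\begin{verbatim}
import h5py
import numpy as np

# Create a file with the PAGE file-space strategy so that file-level
# metadata is consolidated into fixed-size pages at the front of the file.
with h5py.File(
    "paged_example.h5",        # illustrative file name
    "w",
    fs_strategy="page",        # maps to H5F_FSPACE_STRATEGY_PAGE
    fs_page_size=8_000_000,    # page size in bytes, fixed at creation time
) as f:
    f.create_dataset(
        "photon_heights",      # illustrative dataset
        data=np.random.rand(1_000_000),
        chunks=(100_000,),
        compression="gzip",
    )
\end{verbatim}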

@@ -485,14 +485,14 @@ \section{Methodology}\label{methodology}
\end{longtable}

This table represents the different configurations we used for our tests
in two file sizes. It is worth noticing that we encountered a few
outlier cases where compression and chunk sizes led page aggregation to
increase the file size by approximately 10\%, which was above the
desired value for NSIDC (5\% max). We tested these files using the most
common libraries that handle HDF5 and two different I/O drivers that
support remote access to AWS S3: fsspec and the native S3 driver. The
results of our testing are explained in the next section, and the code
to reproduce the results is in the attached notebooks.
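
As a reference for how the two access paths differ in code, the sketch
below opens a granule both through an \texttt{fsspec}/\texttt{s3fs}
file-like object and through HDF5's native read-only S3 (ros3) driver;
the bucket, object key, and dataset name are hypothetical, and the ros3
path requires an HDF5 build with that driver enabled.

\begin{verbatim}
import h5py
import s3fs

S3_URL = "s3://example-bucket/ATL03_example.h5"   # hypothetical object
HTTPS_URL = ("https://example-bucket.s3.us-west-2"
             ".amazonaws.com/ATL03_example.h5")

# Option 1: fsspec/s3fs provides a file-like object that h5py can read.
fs = s3fs.S3FileSystem(anon=True)
with fs.open(S3_URL, "rb") as remote_file:
    with h5py.File(remote_file, "r") as f:
        sample = f["photon_heights"][:1000]

# Option 2: HDF5's native read-only S3 driver (ros3).
with h5py.File(HTTPS_URL, "r", driver="ros3") as f:
    sample = f["photon_heights"][:1000]
\end{verbatim}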

\section{Results}\label{results}

@@ -504,19 +504,18 @@ \section{Results}\label{results}

}

\caption{\label{fig-4}Using paged aggregation alone is not a complete
solution. This behavior is caused by over-reads of data now distributed
in pages and by the internals of HDF5 not knowing how to optimize the
requests. This means that if we only cloud optimize and keep using the
same code, in some cases we will make access to these files even
slower. A very important thing to notice here is that rechunking the
file, in this case using 10X bigger chunks, results in a predictable
10X improvement in access times without any cloud optimization
involved. Having fewer chunks generates less metadata and bigger
requests; in general it is recommended that chunk sizes range between
1MB and 10MB{[}Add citation, S3 and HDF5{]}, and if we have enough
memory and bandwidth even bigger (Pangeo recommends up to 100MB
chunks){[}Add citation.{]}}

\end{figure*}%
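
As a back-of-the-envelope check for the 1MB--10MB guidance, a chunk's
uncompressed footprint is simply the number of elements per chunk times
the element size; the short sketch below (the dtype and target size are
illustrative) picks a 1-D chunk length in that range.

\begin{verbatim}
import numpy as np

TARGET_CHUNK_BYTES = 8 * 1024 * 1024    # aim for roughly 8 MiB per chunk
dtype = np.dtype("float64")             # illustrative element type

# For a 1-D dataset, chunk length = target bytes / bytes per element.
chunk_len = TARGET_CHUNK_BYTES // dtype.itemsize
print(chunk_len)                         # 1048576 elements per chunk
print(chunk_len * dtype.itemsize / 1e6)  # ~8.4 MB uncompressed per chunk
\end{verbatim}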

@@ -528,18 +527,18 @@ \section{Results}\label{results}

}

\caption{\label{fig-5}Once the I/O configuration is aligned with the
chunking in the file, access times are on par with cloud-optimized
access patterns like Kerchunk/Zarr. These numbers are from in-region
execution; out-of-region access is considerably slower for the
non-cloud-optimized case.}

\end{figure*}%
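
``Aligning the I/O configuration with the chunking'' here amounts to
making the remote read size and the HDF5 chunk cache at least as large
as one chunk, so that each chunk is fetched in a single request and
reused from memory. The sketch below is one way to express that with
\texttt{s3fs} and \texttt{h5py}; the URL, dataset name, and sizes are
hypothetical.

\begin{verbatim}
import h5py
import s3fs

CHUNK_BYTES = 8 * 1024 * 1024   # assumed chunk size of the target dataset
URL = "s3://example-bucket/ATL03_example.h5"

fs = s3fs.S3FileSystem(anon=True)
# block_size controls how many bytes each S3 range request fetches.
with fs.open(URL, "rb", block_size=CHUNK_BYTES,
             cache_type="blockcache") as remote_file:
    # rdcc_nbytes sizes HDF5's chunk cache so whole chunks stay resident.
    with h5py.File(remote_file, "r", rdcc_nbytes=4 * CHUNK_BYTES) as f:
        sample = f["photon_heights"][:1_000_000]
\end{verbatim}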

\section{Recommendations}\label{recommendations}

Based on the benchmarks we got from our tests, we have split our
recommendations for the ATL03 product into 3 main categories: creating
the files, accessing the files, and future tool development. These
recommendations aim to streamline HDF5 workflows in cloud environments,
enhancing performance and reducing costs.
@@ -548,29 +547,30 @@ \subsection{Recommended cloud
optimizations}\label{recommended-cloud-optimizations}

Based on our testing we recommend the following cloud optimizations for
creating HDF5 files for the ATL03 product:

\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
Create HDF5 files using paged aggregation by setting HDF5 library
parameters:
\begin{enumerate}
\def\labelenumii{\alph{enumii}.}
\tightlist
\item
File page strategy: H5F\_FSPACE\_STRATEGY\_PAGE
\item
File page size: 8000000
\end{enumerate}
\item
If repacking an existing file, h5repack can apply these settings to
produce a cloud-optimized copy:
\end{enumerate}

\begin{Shaded}
\begin{Highlighting}[]
\ExtensionTok{h5repack} \AttributeTok{{-}S}\NormalTok{ PAGE }\AttributeTok{{-}G}\NormalTok{ 8000000 input.h5 output.h5}
\end{Highlighting}
\end{Shaded}

\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\setcounter{enumi}{2}
\tightlist
\item
Avoid using unlimited dimensions when creating variables because the
HDF5 API cannot support it inside buffered pages and representation of
@@ -580,14 +580,14 @@ \subsection{Recommended cloud
\subsubsection{Reasoning}\label{reasoning}

Given the variable size of ATL03 files, it becomes really difficult to
allocate a fixed metadata page. Big files contain north of 30MB of
metadata, but the median metadata size per file is below 8MB. If we had
adopted a user block we would have caused an increase in file size and
storage cost of approximately 30\% (reference to our tests). Another
consequence of using a dedicated fixed page for file-level metadata is
that metadata overflow generates the same impact in access times: the
library will fetch the metadata in one go, but the rest will be fetched
using the predefined block size of 4 KB.

Paged aggregation is thus the simplest way of cloud optimizing an HDF5
file as the metadata will keep filling dedicated pages until all the
@@ -618,10 +618,10 @@ \subsection{Recommended Access
HTTP overhead and slower access speeds.
\item
\textbf{Parallel Access}: Use parallel computing frameworks like
\href{https://www.dask.org/}{\texttt{Dask}} or multiprocessing to
divide read operations across multiple processes or nodes. This
alleviates the sequential access bottleneck caused by the HDF5 global
lock, particularly in workflows accessing multiple datasets.
\item
\textbf{Cache Management}: Implement caching for metadata to avoid
repetitive fetches. Tools like \texttt{fsspec} or \texttt{h5coro}
@@ -688,14 +688,14 @@ \subsection{Mission implementation}\label{mission-implementation}
\item
Set the ``file space page size'' to 8MiB.
\item
Change all ``COMPACT'' dataset storage types to ``CONTIGUOUS''.
\item
Increase the ``chunk size'' of the photon-rate datasets (from 10,000
to 100,000 elements), while making sure no ``chunk sizes'' exceed the
8MiB ``file space page size''.
\item
Introduce a new production step that executes the ``h5repack'' utility
(with no options) to create a ``defragmented'' final product (a rough
sketch of these steps follows this list).
\end{enumerate}
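
A rough sketch of what these production changes could look like from
Python is shown below; the file and dataset names, dtype, and sizes are
hypothetical, and the final step simply shells out to ``h5repack'' with
no options, as described above.

\begin{verbatim}
import subprocess
import h5py
import numpy as np

PAGE_SIZE = 8 * 1024 * 1024   # 8 MiB "file space page size"

with h5py.File("ATL03_granule.h5", "w",
               fs_strategy="page", fs_page_size=PAGE_SIZE) as f:
    # Photon-rate dataset with the larger 100,000-element chunks,
    # kept well below the 8 MiB page size.
    chunks = (100_000,)
    assert np.dtype("float32").itemsize * chunks[0] <= PAGE_SIZE
    f.create_dataset("heights", shape=(5_000_000,), dtype="float32",
                     chunks=chunks, compression="gzip")
    # Small ancillary datasets default to CONTIGUOUS layout here;
    # COMPACT storage would have to be requested explicitly.
    f.create_dataset("orbit_number", data=np.int32(1234))

# Final defragmentation pass with h5repack (no extra options).
subprocess.run(["h5repack", "ATL03_granule.h5", "ATL03_repacked.h5"],
               check=True)
\end{verbatim}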

\subsection{Discussion and Further
