Commit

Built site for gh-pages
Quarto GHA Workflow Runner committed Dec 19, 2024
1 parent d6e15a5 commit 177246c
Showing 5 changed files with 94 additions and 93 deletions.
2 changes: 1 addition & 1 deletion .nojekyll
@@ -1 +1 @@
86173dc3
130 changes: 65 additions & 65 deletions _tex/paper.tex
@@ -403,14 +403,14 @@ \section{Methodology}\label{methodology}
As we can see in Figure~\ref{fig-2}, when we cloud optimize a file using
paged aggregation there are some considerations and behaviors that we
had to take into account. The first thing to observe is that page
aggregation will -- as we mentioned -- consolidate the file-level
metadata at the front of the file and will add information in the
so-called superblock\footnote{The HDF5 superblock is a crucial component
of the HDF5 file format, acting as the starting point for accessing
all data within the file. It stores important metadata such as the
version of the file format, pointers to the root group, and addresses
for locating different file components.}. The next thing to notice is
that a single page size is used across the board for metadata and data;
as of October 2024 and version 1.14 of the HDF5 library, the page size
cannot dynamically adjust to the total metadata size.
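
To make this concrete, the sketch below shows one way to create a
paged-aggregated file from Python with \texttt{h5py}, which exposes the
corresponding HDF5 file-space creation properties; the file name,
dataset, and sizes are illustrative rather than taken from the ATL03
production code.

\begin{verbatim}
import h5py
import numpy as np

# Create a file with the PAGE file-space strategy so that file-level
# metadata is consolidated into fixed-size pages at the front of the file.
with h5py.File(
    "paged_example.h5",        # illustrative file name
    "w",
    fs_strategy="page",        # maps to H5F_FSPACE_STRATEGY_PAGE
    fs_page_size=8_000_000,    # page size in bytes, fixed at creation time
) as f:
    f.create_dataset(
        "photon_heights",      # illustrative dataset
        data=np.random.rand(1_000_000),
        chunks=(100_000,),
        compression="gzip",
    )
\end{verbatim}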

@@ -485,14 +485,14 @@ \section{Methodology}\label{methodology}
\end{longtable}

This table represents the different configurations we used for our tests
in two file sizes. It is worth noticing that we encountered a few
outlier cases where compression and chunk sizes led page aggregation to
increase the file size by approximately 10\%, which was above the
desired value for NSIDC (5\% max). We tested these files using the most
common libraries that handle HDF5 and two different I/O drivers that
support remote access to AWS S3: fsspec and the native S3 driver. The
results of our testing are explained in the next section, and the code
to reproduce the results is in the attached notebooks.
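
As a reference for how the two access paths differ in code, the sketch
below opens a granule both through an \texttt{fsspec}/\texttt{s3fs}
file-like object and through HDF5's native read-only S3 (ros3) driver;
the bucket, object key, and dataset name are hypothetical, and the ros3
path requires an HDF5 build with that driver enabled.

\begin{verbatim}
import h5py
import s3fs

S3_URL = "s3://example-bucket/ATL03_example.h5"   # hypothetical object
HTTPS_URL = ("https://example-bucket.s3.us-west-2"
             ".amazonaws.com/ATL03_example.h5")

# Option 1: fsspec/s3fs provides a file-like object that h5py can read.
fs = s3fs.S3FileSystem(anon=True)
with fs.open(S3_URL, "rb") as remote_file:
    with h5py.File(remote_file, "r") as f:
        sample = f["photon_heights"][:1000]

# Option 2: HDF5's native read-only S3 driver (ros3).
with h5py.File(HTTPS_URL, "r", driver="ros3") as f:
    sample = f["photon_heights"][:1000]
\end{verbatim}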

\section{Results}\label{results}

@@ -504,19 +504,18 @@ \section{Results}\label{results}

}

\caption{\label{fig-4}Using paged aggregation alone is not a complete
solution. This behavior is caused by over-reads of data now distributed
in pages and by the internals of HDF5 not knowing how to optimize the
requests. This means that if we only cloud optimize and keep using the
same code, in some cases we will make access to these files even
slower. A very important thing to notice here is that rechunking the
file, in this case using 10X bigger chunks, results in a predictable
10X improvement in access times without any cloud optimization
involved. Having fewer chunks generates less metadata and bigger
requests; in general it is recommended that chunk sizes range between
1MB and 10MB{[}Add citation, S3 and HDF5{]}, and if we have enough
memory and bandwidth even bigger (Pangeo recommends up to 100MB
chunks){[}Add citation.{]}}

\end{figure*}%
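
As a back-of-the-envelope check for the 1MB--10MB guidance, a chunk's
uncompressed footprint is simply the number of elements per chunk times
the element size; the short sketch below (the dtype and target size are
illustrative) picks a 1-D chunk length in that range.

\begin{verbatim}
import numpy as np

TARGET_CHUNK_BYTES = 8 * 1024 * 1024    # aim for roughly 8 MiB per chunk
dtype = np.dtype("float64")             # illustrative element type

# For a 1-D dataset, chunk length = target bytes / bytes per element.
chunk_len = TARGET_CHUNK_BYTES // dtype.itemsize
print(chunk_len)                         # 1048576 elements per chunk
print(chunk_len * dtype.itemsize / 1e6)  # ~8.4 MB uncompressed per chunk
\end{verbatim}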

@@ -528,18 +527,18 @@ \section{Results}\label{results}

}

\caption{\label{fig-5}Once the I/O configuration is aligned with the
chunking in the file, access times are on par with cloud-optimized
access patterns like Kerchunk/Zarr. These numbers are from in-region
execution; out-of-region access is considerably slower for the
non-cloud-optimized case.}

\end{figure*}%
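
``Aligning the I/O configuration with the chunking'' here amounts to
making the remote read size and the HDF5 chunk cache at least as large
as one chunk, so that each chunk is fetched in a single request and
reused from memory. The sketch below is one way to express that with
\texttt{s3fs} and \texttt{h5py}; the URL, dataset name, and sizes are
hypothetical.

\begin{verbatim}
import h5py
import s3fs

CHUNK_BYTES = 8 * 1024 * 1024   # assumed chunk size of the target dataset
URL = "s3://example-bucket/ATL03_example.h5"

fs = s3fs.S3FileSystem(anon=True)
# block_size controls how many bytes each S3 range request fetches.
with fs.open(URL, "rb", block_size=CHUNK_BYTES,
             cache_type="blockcache") as remote_file:
    # rdcc_nbytes sizes HDF5's chunk cache so whole chunks stay resident.
    with h5py.File(remote_file, "r", rdcc_nbytes=4 * CHUNK_BYTES) as f:
        sample = f["photon_heights"][:1_000_000]
\end{verbatim}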

\section{Recommendations}\label{recommendations}

Based on the benchmarks we got from our tests, we have split our
recommendations for the ATL03 product into 3 main categories: creating
the files, accessing the files, and future tool development. These
recommendations aim to streamline HDF5 workflows in cloud environments,
enhancing performance and reducing costs.
@@ -548,29 +547,30 @@ \subsection{Recommended cloud
optimizations}\label{recommended-cloud-optimizations}

Based on our testing we recommend the following cloud optimizations for
creating HDF5 files for the ATL03 product:

\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\tightlist
\item
Create HDF5 files using paged aggregation by setting HDF5 library
parameters:
\begin{enumerate}
\def\labelenumii{\alph{enumii}.}
\tightlist
\item
File page strategy: H5F\_FSPACE\_STRATEGY\_PAGE
\item
File page size: 8000000
\end{enumerate}
\item
If repacking an existing file, h5repack can apply these settings to
produce a cloud-optimized copy:
\end{enumerate}

\begin{Shaded}
\begin{Highlighting}[]
\ExtensionTok{h5repack} \AttributeTok{{-}S}\NormalTok{ PAGE }\AttributeTok{{-}G}\NormalTok{ 8000000 input.h5 output.h5}
\end{Highlighting}
\end{Shaded}

\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\setcounter{enumi}{2}
\tightlist
\item
Avoid using unlimited dimensions when creating variables because the
HDF5 API cannot support it inside buffered pages and representation of
@@ -580,14 +580,14 @@ \subsection{Recommended cloud
\subsubsection{Reasoning}\label{reasoning}

Given the variable size of ATL03 files, it becomes really difficult to
allocate a fixed metadata page. Big files contain north of 30MB of
metadata, but the median metadata size per file is below 8MB. If we had
adopted a user block we would have caused an increase in file size and
storage cost of approximately 30\% (reference to our tests). Another
consequence of using a dedicated fixed page for file-level metadata is
that metadata overflow generates the same impact in access times: the
library will fetch the metadata in one go, but the rest will be fetched
using the predefined block size of 4 KB.

Paged aggregation is thus the simplest way of cloud optimizing an HDF5
file as the metadata will keep filling dedicated pages until all the
@@ -618,10 +618,10 @@ \subsection{Recommended Access
HTTP overhead and slower access speeds.
\item
\textbf{Parallel Access}: Use parallel computing frameworks like
\href{https://www.dask.org/}{\texttt{Dask}} or multiprocessing to
divide read operations across multiple processes or nodes. This
alleviates the sequential access bottleneck caused by the HDF5 global
lock, particularly in workflows accessing multiple datasets.
\item
\textbf{Cache Management}: Implement caching for metadata to avoid
repetitive fetches. Tools like \texttt{fsspec} or \texttt{h5coro}
@@ -688,14 +688,14 @@ \subsection{Mission implementation}\label{mission-implementation}
\item
Set the ``file space page size'' to 8MiB.
\item
Change all ``COMPACT'' dataset storage types to ``CONTIGUOUS''.
\item
Increase the ``chunk size'' of the photon-rate datasets (from 10,000
to 100,000 elements), while making sure no ``chunk sizes'' exceed the
8MiB ``file space page size''.
\item
Introduce a new production step that executes the ``h5repack'' utility
(with no options) to create a ``defragmented'' final product (a rough
sketch of these steps follows this list).
\end{enumerate}
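
A rough sketch of what these production changes could look like from
Python is shown below; the file and dataset names, dtype, and sizes are
hypothetical, and the final step simply shells out to ``h5repack'' with
no options, as described above.

\begin{verbatim}
import subprocess
import h5py
import numpy as np

PAGE_SIZE = 8 * 1024 * 1024   # 8 MiB "file space page size"

with h5py.File("ATL03_granule.h5", "w",
               fs_strategy="page", fs_page_size=PAGE_SIZE) as f:
    # Photon-rate dataset with the larger 100,000-element chunks,
    # kept well below the 8 MiB page size.
    chunks = (100_000,)
    assert np.dtype("float32").itemsize * chunks[0] <= PAGE_SIZE
    f.create_dataset("heights", shape=(5_000_000,), dtype="float32",
                     chunks=chunks, compression="gzip")
    # Small ancillary datasets default to CONTIGUOUS layout here;
    # COMPACT storage would have to be requested explicitly.
    f.create_dataset("orbit_number", data=np.int32(1234))

# Final defragmentation pass with h5repack (no extra options).
subprocess.run(["h5repack", "ATL03_granule.h5", "ATL03_repacked.h5"],
               check=True)
\end{verbatim}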

\subsection{Discussion and Further
