diff --git a/papers/Alireza_Vaezi/banner.jpg b/papers/Alireza_Vaezi/banner.jpg deleted file mode 100644 index d210b89277..0000000000 Binary files a/papers/Alireza_Vaezi/banner.jpg and /dev/null differ diff --git a/papers/Alireza_Vaezi/banner.png b/papers/Alireza_Vaezi/banner.png index e6a793bd6c..add6516597 100644 Binary files a/papers/Alireza_Vaezi/banner.png and b/papers/Alireza_Vaezi/banner.png differ diff --git a/papers/Alireza_Vaezi/figure1.png b/papers/Alireza_Vaezi/figure1.png deleted file mode 100644 index cd768ee933..0000000000 Binary files a/papers/Alireza_Vaezi/figure1.png and /dev/null differ diff --git a/papers/Alireza_Vaezi/figure2.png b/papers/Alireza_Vaezi/figure2.png deleted file mode 100644 index 2ff94d5a6b..0000000000 Binary files a/papers/Alireza_Vaezi/figure2.png and /dev/null differ diff --git a/papers/Alireza_Vaezi/main.md b/papers/Alireza_Vaezi/main.md index 301e69a2bd..4d42eb295a 100644 --- a/papers/Alireza_Vaezi/main.md +++ b/papers/Alireza_Vaezi/main.md @@ -1,15 +1,12 @@ --- # Ensure that this title is the same as the one in `myst.yml` title: Training a Supervised Cilia Segmentation Model from Self-Supervision -exports: - - format: pdf - template: arxiv_two_column - output: exports/my-document.pdf - abstract: | - Cilia are organelles found on the surface of some cells in the human body that sweep rhythmically to transport substances. Dysfunctional cilia are indicative of diseases that can disrupt organs such as the lungs and kidneys. Understanding cilia behavior is essential in diagnosing and treating such diseases. But, the tasks of automatically analysing cilia are often a labor and time-intensive since there is a lack of automated segmentation. In this work we overcome this bottleneck by developing a robust, self-supervised framework exploiting the visual similarity of normal and dysfunctional cilia. This framework generates pseudolabels from optical flow motion vectors, which serve as training data for a semi-supervised neural network. Our approach eliminates the need for manual annotations, enabling accurate and efficient segmentation of both motile and immotile cilia. + Cilia are organelles found on the surface of some cells in the human body that sweep rhythmically to transport substances. Dysfunctional cilia are indicative of diseases that can disrupt organs such as the lungs and kidneys. Understanding cilia behavior is essential in diagnosing and treating such diseases. But the task of automatically analyzing cilia is often labor- and time-intensive since there is a lack of automated segmentation. In this work we overcome this bottleneck by developing a robust, self-supervised framework exploiting the visual similarity of normal and dysfunctional cilia. This framework generates pseudolabels from optical flow motion vectors, which serve as training data for a semi-supervised neural network. Our approach eliminates the need for manual annotations, enabling accurate and efficient segmentation of both motile and immotile cilia. --- +(sec:introduction)= + ## Introduction Cilia are hair-like membranes that extend out from the surface of the cells and are present on a variety of cell types such as lungs and brain ventricles and can be found in the majority of vertebrate cells. Categorized into motile and primary, motile cilia can help the cell to propel, move the flow of fluid, or fulfill sensory functions, while primary cilia act as signal receivers, translating extracellular signals into cellular responses [@doi:10.1007/978-94-007-5808-7_1].
Ciliopathies is the term commonly used to describe diseases caused by ciliary dysfunction. These disorders can result in serious issues such as blindness, neurodevelopmental defects, or obesity [@Hansen2021-fd]. Motile cilia beat in a coordinated manner with a specific frequency and pattern [@doi:10.1016/j.compfluid.2011.05.016]. Stationary, dyskinetic, or slow ciliary beating indicates ciliary defects. Ciliary beating is a fundamental biological process that is essential for the proper functioning of various organs, which makes understanding the ciliary phenotypes a crucial step towards understanding ciliopathies and the conditions stemming from it [@zain2022low]. @@ -20,7 +17,9 @@ Video segmentation techniques tend to be more robust to such noise, but still st To address this challenge, we propose a two-stage image segmentation model designed to obviate the need for expert-drawn masks. We first build a corpus of segmentation masks based on optical flow (OF) thresholding over a subset of healthy training data with guaranteed motility. We then train a semi-supervised neural segmentation model to identify both motile and immotile data as a single segmentation category, using the flow-generated masks as “pseudolabels”. These pseudolabels operate as “ground truth” for the model while acknowledging the intrinsic uncertainty of the labels. The fact that motile and immotile cilia tend to be visually similar in snapshot allows us to generalize the domain of the model from motile cilia to all cilia. Combining these stages results in a semi-supervised framework that does not rely on any expert-drawn ground-truth segmentation masks, paving the way for full automation of a general cilia analysis pipeline. -The rest of this article is structured as follows: The Background section enumerates the studies relevant to our methodology, followed by a detailed description of our approach in the Methodology section. Finally, the next section delineates our experiment and provides a discussion of the results obtained. +The rest of this article is structured as follows: The [Background section](#sec:background) enumerates the studies relevant to our methodology, followed by a detailed description of our approach in the [Methodology section](#sec:methodology). Finally, the [next section](#sec:results) delineates our experiment and provides a discussion of the results obtained. + +(sec:background)= ## Background @@ -28,10 +27,12 @@ Dysfunction in ciliary motion indicates diseases known as ciliopathies, which ca Accurate analysis of ciliary motion is essential but challenging due to the limitations of manual analysis, which is labor-intensive, subjective, and prone to error. [@zain2020towards] proposed a modular generative pipeline that automates ciliary motion analysis by segmenting, representing, and modeling the dynamic behavior of cilia, thereby reducing the need for expert intervention and improving diagnostic consistency. [@quinn2015automated] developed a computational pipeline using dynamic texture analysis and machine learning to objectively and quantitatively assess ciliary motion, achieving over 90% classification accuracy in identifying abnormal ciliary motion associated with diseases like primary ciliary dyskinesia (PCD). Additionally, [@zain2022low] explored advanced feature extraction techniques like Zero-phase PCA Sphering (ZCA) and Sparse Autoencoders (SAE) to enhance cilia segmentation accuracy. 
These methods address challenges posed by noisy, partially occluded, and out-of-phase imagery, ultimately improving the overall performance of ciliary motion analysis pipelines. Collectively, these approaches aim to enhance diagnostic accuracy and efficiency, making ciliary motion analysis more accessible and reliable, thereby improving patient outcomes through early and accurate detection of ciliopathies. However, these studies rely on manually labeled data. The segmentation masks and ground-truth annotations, which are essential for training the models and validating their performance, are generated by expert reviewers. This dependence on manually labeled data is a significant limitation making automated cilia segmentation the bottleneck to automating cilia analysis. -In the biomedical field, where labeled data is often scarce and costly to obtain, several solutions have been proposed to augment and utilize available data effectively. These include semi-supervised learning [@YAKIMOVICH2021100383,@van2020survey], which utilizes both labeled and unlabeled data to enhance learning accuracy by leveraging the data's underlying distribution. Active learning [@settles2009active] focuses on selectively querying the most informative data points for expert labeling, optimizing the training process by using the most valuable examples. Data augmentation techniques [@10.3389/fcvm.2020.00105], [@Krois2021], [@10.1148/ryai.2020190195], [@Sandfort2019], [@YAKIMOVICH2021100383], [@van2001art], [@krizhevsky2012imagenet], [@ronneberger2015u], such as image transformations and synthetic data generation through Generative Adversarial Networks [@goodfellow2014generative], [@yi2019generative], increase the diversity and volume of training data, enhancing model robustness and reducing overfitting. Transfer learning [@YAKIMOVICH2021100383], [@Sanford2020-yg], [@NEURIPS2019_eb1e7832], [@hutchinson2017overcoming] transfers knowledge from one task to another, minimizing the need for extensive labeled data in new tasks. Self-supervised learning [@kim2019self], [@kolesnikov2019revisiting], [@mahendran2019cross] creates its labels by defining a pretext task, like predicting the position of a randomly cropped image patch, aiding in the learning of useful data representations. Additionally, few-shot, one-shot, and zero-shot learning techniques [@li2006one], [@miller2000learning] are designed to operate with minimal or no labeled examples, relying on generalization capabilities or metadata for making predictions about unseen classes. +In the biomedical field, where labeled data is often scarce and costly to obtain, several solutions have been proposed to augment and utilize available data effectively. These include semi-supervised learning [@YAKIMOVICH2021100383,@van2020survey], which utilizes both labeled and unlabeled data to enhance learning accuracy by leveraging the data's underlying distribution. Active learning [@settles2009active] focuses on selectively querying the most informative data points for expert labeling, optimizing the training process by using the most valuable examples. Data augmentation techniques [@10.3389/fcvm.2020.00105;@Krois2021;@10.1148/ryai.2020190195;@Sandfort2019;@YAKIMOVICH2021100383;@van2001art;@krizhevsky2012imagenet;@ronneberger2015u], such as image transformations and synthetic data generation through Generative Adversarial Networks [@goodfellow2014generative;@yi2019generative], increase the diversity and volume of training data, enhancing model robustness and reducing overfitting. 
Transfer learning [@YAKIMOVICH2021100383;@Sanford2020-yg;@NEURIPS2019_eb1e7832;@hutchinson2017overcoming] transfers knowledge from one task to another, minimizing the need for extensive labeled data in new tasks. Self-supervised learning [@kim2019self;@kolesnikov2019revisiting;@mahendran2019cross] creates its labels by defining a pretext task, like predicting the position of a randomly cropped image patch, aiding in the learning of useful data representations. Additionally, few-shot, one-shot, and zero-shot learning techniques [@li2006one;@miller2000learning] are designed to operate with minimal or no labeled examples, relying on generalization capabilities or metadata for making predictions about unseen classes. A promising approach to overcome the dependency on manually labeled data is the use of unsupervised methods to generate ground truth masks. Unsupervised methods do not require prior knowledge of the data [@khatibi2021proposing]. Using domain-specific cues unsupervised learning techniques can automatically discover patterns and structures in the data without the need for labeled examples, potentially simplifying the process of generating accurate segmentation masks for cilia. Inspired by advances in unsupervised methods for image segmentation, in this work, we firstly compute the motion vectors using optical flow of the ciliary regions and then apply autoregressive modelling to capture their temporal dynamics. Autoregressive modelling is advantageous since the labels are features themselves. By analyzing the OF vectors, we can identify the characteristic motion of cilia, which allows us to generate pseudolabels as ground truth segmentation masks. These pseudolabels are then used to train a robust semi-supervised neural network, enabling accurate and automated segmentation of both motile and immotile cilia. +(sec:methodology)= + ## Methodology Dynamic textures, such as sea waves, smoke, and foliage, are sequences of images of moving scenes that exhibit certain stationarity properties in time [@doretto2003dynamic]. Similarly, ciliary motion can be considered as dynamic textures for their orderly rhythmic beating. Taking advantage of this temporal regularity in ciliary motion, OF can be used to compute the flow vectors of each pixel of high-speed videos of cilia. In conjunction with OF, autoregressive (AR) parameterization of the OF property of the video yields a manifold that quantifies the characteristic motion in the cilia. The low dimension of this manifold contains the majority of variations within the data, which can then be used to segment the motile ciliary regions. @@ -51,10 +52,12 @@ Where $I(x,y,t)$ is the pixel intensity at position $(x,y)$ a time $t$. Here, $( :label: fig:sample_vids_with_gt_mask A sample of three videos in our cilia dataset with their manually annotated ground truth masks. 
::: + + :::{figure} sample_OF.png :label: fig:sample_OF Representation of rotation (curl) component of OF at a random time @@ -66,7 +69,7 @@ Representation of rotation (curl) component of OF at a random time ```{math} :label: AR -y_t =C\vec{x_t} + \vec{u} +y_t =C\vec{x_t} + \vec{u} ``` ```{math} @@ -103,22 +106,25 @@ The next section discusses the results of the experiment and the performance of :::{table} Summary of model architecture, training setup, and dataset distribution :label: tbl:model_specs -| **Aspect** | **Details** | -|---------------------------------|------------------------------------------------------------------------------------------------------------------------------------------| -| **Architecture** | FPN with ResNet-34 encoder | -| **Input** | Grayscale images with a single input channel | -| **Batch Size** | 2 | -| **Training Samples** | 28,869 | -| **Validation Samples** | 5,095 | -| **Test Samples** | 108 | -| **Loss Function** | Binary Cross-Entropy Loss | -| **Optimizer** | Adam optimizer with a learning rate of $10^{-3}$ | -| **Evaluation Metric** | Dice score during training, validation, and testing | -| **Data Augmentation Techniques**| Resizing, random cropping, and rotation | -| **Implementation** | Using a Python library with Neural Networks for Image Segmentation based on PyTorch [@Iakubovskii:2019] | + +| **Aspect** | **Details** | +| -------------------------------- | ------------------------------------------------------------------------------------------------------- | +| **Architecture** | FPN with ResNet-34 encoder | +| **Input** | Grayscale images with a single input channel | +| **Batch Size** | 2 | +| **Training Samples** | 28,869 | +| **Validation Samples** | 5,095 | +| **Test Samples** | 108 | +| **Loss Function** | Binary Cross-Entropy Loss | +| **Optimizer** | Adam optimizer with a learning rate of $10^{-3}$ | +| **Evaluation Metric** | Dice score during training, validation, and testing | +| **Data Augmentation Techniques** | Resizing, random cropping, and rotation | +| **Implementation** | Using a Python library with Neural Networks for Image Segmentation based on PyTorch [@Iakubovskii:2019] | ::: +(sec:results)= + ## Results and Discussion The model's performance metrics, including IoU, Dice score, sensitivity, and specificity, are summarized in @tbl:metrics. The validation phase achieved an IoU of 0.398 and a Dice score of 0.569, which indicates a moderate overlap between the predicted and ground truth masks. The high sensitivity (0.997) observed during validation suggests that the model is proficient in identifying ciliary regions, albeit with a specificity of 0.882, indicating some degree of false positives. In the testing phase, the IoU and Dice scores decreased to 0.132 and 0.233, respectively, reflecting the challenges posed by the dyskinetic cilia data, which were not included in the training or validation sets. Despite this, the model maintained a sensitivity of 0.479 and specificity of 0.806. @@ -128,15 +134,16 @@ The model's performance metrics, including IoU, Dice score, sensitivity, and spe The model predictions on 5 dyskinetic cilia samples. The first column shows a frame of the video, the second column shows the manually labeled ground truth, the third column is the model's prediction, and the last column is a thresholded version of the prediction. ::: -@fig:out_sample provides visual examples of the model's predictions on dyskinetic cilia samples, alongside the manually labeled ground truth and thresholded predictions. 
The dyskinetic samples were not used in the training or validation phases. These predictions were generated after only 15 epochs of training with a small training data. The visual comparison reveals that, while the model captures the general structure of ciliary regions, there are instances of under-segmentation and over-segmentation, which are more pronounced in the dyskinetic samples. This observation is consistent with the quantitative metrics, suggesting that further refinement of the pseudolabel generation process or model architecture could enhance segmentation accuracy. +@fig:out_sample provides visual examples of the model's predictions on dyskinetic cilia samples, alongside the manually labeled ground truth and thresholded predictions. The dyskinetic samples were not used in the training or validation phases. These predictions were generated after only 15 epochs of training with a small training dataset. The visual comparison reveals that, while the model captures the general structure of ciliary regions, there are instances of under-segmentation and over-segmentation, which are more pronounced in the dyskinetic samples. This observation is consistent with the quantitative metrics, suggesting that further refinement of the pseudolabel generation process or model architecture could enhance segmentation accuracy. :::{table} The performance of the model in validation and testing phases after 15 epochs of training. :label: tbl:metrics -| Phases | Metrics | | | | -|------------|---------------|-------------|------------|------------| -| | IoU over dataset | Dice Score | Sensitivity| Specificity| -| Validation | 0.398 | 0.569 | 0.997 | 0.882 | -| Testing | 0.132 | 0.233 | 0.479 | 0.806 | + +| Phases | Metrics | | | | +| ---------- | ---------------- | ---------- | ----------- | ----------- | +| | IoU over dataset | Dice Score | Sensitivity | Specificity | +| Validation | 0.398 | 0.569 | 0.997 | 0.882 | +| Testing | 0.132 | 0.233 | 0.479 | 0.806 | ::: @@ -146,11 +153,12 @@ Since dyskinetic videos contain cilia that show some degree of movement we gener :::{table} The performance of the model after retraining with an addition of 283 videos of dyskinetic cilia to the training dataset. :label: tbl:exp2_metrics -| Phases | Metrics | | | | -|------------|---------------|-------------|------------|------------| -| | IoU over dataset | Dice Score | Sensitivity| Specificity| -| Validation | 0.202 | 0.337 | 0.999 | 0.765 | -| Testing | 0.139 | 0.245 | 0.732 | 0.696 | + +| Phases | Metrics | | | | +| ---------- | ---------------- | ---------- | ----------- | ----------- | +| | IoU over dataset | Dice Score | Sensitivity | Specificity | +| Validation | 0.202 | 0.337 | 0.999 | 0.765 | +| Testing | 0.139 | 0.245 | 0.732 | 0.696 | ::: diff --git a/papers/Alireza_Vaezi/myst.yml b/papers/Alireza_Vaezi/myst.yml index 99a858d56a..17e40926b1 100644 --- a/papers/Alireza_Vaezi/myst.yml +++ b/papers/Alireza_Vaezi/myst.yml @@ -1,50 +1,51 @@ version: 1 +extends: ../proceedings.yml project: + doi: 10.25080/HXCJ6205 # Update this to match `scipy-2024-` the folder should be `` id: scipy-2024-alireza-vaezi # Ensure your title is the same as in your `main.md` title: Training a Supervised Cilia Segmentation Model from Self-Supervision - subtitle: University Of Georgia + description: Understanding cilia behavior is essential in diagnosing and treating diseases caused by dysfunctional cilia, but the task of automatically analyzing cilia is often labor- and time-intensive.
In this work we overcome this bottleneck by developing a robust, self-supervised framework exploiting the visual similarity of normal and dysfunctional cilia. # Authors should have affiliations, emails and ORCIDs if available authors: - - name: Seyed Alireza Vaezi - email: sv22900@uga.edu - orcid: 0009-0000-2089-8362 - affiliations: - - University of Georgia - corresponding: true - - name: Shannon Quinn - email: spq@uga.edu - affiliations: - - University of Georgia + - name: Seyed Alireza Vaezi + email: sv22900@uga.edu + orcid: 0009-0000-2089-8362 + affiliations: + - name: University of Georgia + ror: https://ror.org/00te3t702 + corresponding: true + - name: Shannon Quinn + email: spq@uga.edu + affiliations: + - name: University of Georgia + ror: https://ror.org/00te3t702 keywords: - - Cilia - - Unsupervised biomedical Image Segmentation - - Optical Flow - - Autoregressive - - Deep Learning + - Cilia + - Unsupervised biomedical Image Segmentation + - Optical Flow + - Autoregressive + - Deep Learning # Add the abbreviations that you use in your paper here abbreviations: - MyST: Markedly Structured Text + OF: optical flow + PCD: primary ciliary dyskinesia + ZCA: Zero-phase PCA Sphering + SAE: Sparse Autoencoders + AR: autoregressive + FPN: Feature Pyramid Network # It is possible to explicitly ignore the `doi-exists` check for certain citation keys error_rules: - - rule: doi-exists - severity: ignore - keys: - - Atr03 - - terradesert - - jupyter - - sklearn1 - - sklearn2 - - Iakubovskii:2019 - - settles2009active - # A banner will be generated for you on publication, this is a placeholder - banner: banner.jpg - # The rest of the information shouldn't be modified - subject: Research Article - open_access: true - license: CC-BY-4.0 - venue: Scipy 2024 - date: 2024-07-10 + - rule: doi-exists + severity: ignore + keys: + - Atr03 + - terradesert + - jupyter + - sklearn1 + - sklearn2 + - Iakubovskii:2019 + - settles2009active site: template: article-theme diff --git a/papers/Arushi_Nath/banner.png b/papers/Arushi_Nath/banner.png index 23676bb677..0e5a60bd47 100644 Binary files a/papers/Arushi_Nath/banner.png and b/papers/Arushi_Nath/banner.png differ diff --git a/papers/Arushi_Nath/main.md b/papers/Arushi_Nath/main.md index 4c680a9b90..f1b4995ea2 100644 --- a/papers/Arushi_Nath/main.md +++ b/papers/Arushi_Nath/main.md @@ -2,43 +2,50 @@ # Ensure that this title is the same as the one in `myst.yml` title: Algorithms to Determine Asteroid’s Physical Properties using Sparse and Dense Photometry, Robotic Telescopes and Open Data abstract: | - The rapid pace of discovering asteroids due to advancements in detection techniques outpaces current abilities to analyze them comprehensively. Understanding an asteroid's physical properties is crucial for effective deflection strategies and improves our understanding of the solar system's formation and evolution. Dense photometry provides continuous time-series measurements valuable for determining an asteroid's rotation period, yet is limited to a singular phase angle. Conversely, sparse photometry offers non-continuous measurements across multiple phase angles, essential for determining an asteroid's absolute magnitude, albedo (reflectivity), and size. This paper presents open-source algorithms that integrate dense photometry from citizen scientists with sparse photometry from space and ground-based all-sky surveys to determine asteroids' albedo, size, rotation, strength, and composition. 
- Applying the algorithms to the Didymos binary asteroid, combined with data from GAIA, the Zwicky Transient Facility, and ATLAS photometric sky surveys, revealed Didymos to be 840 meters wide, with a 0.14 albedo, an 18.14 absolute magnitude, a 2.26-hour rotation period, rubble-pile strength, and an S-type composition. Didymos was the target of the 2022 NASA Double Asteroid Redirection Test (DART) mission. The algorithm successfully measured a 35-minute decrease in the mutual orbital period following the DART mission, equating to a 40-meter reduction in the mutual orbital radius, proving a successful deflection. Analysis of the broader asteroid population highlighted significant compositional diversity, with a predominance of carbonaceous (C-type) asteroids in the outer regions of the asteroid belt and siliceous (S-type) and metallic (M-type) asteroids more common in the inner regions. These findings provide insights into the diversity and distribution of asteroid compositions, reflecting the conditions and processes of the early solar system. + The rapid pace of discovering asteroids due to advancements in detection techniques outpaces current abilities to analyze them comprehensively. Understanding an asteroid's physical properties is crucial for effective deflection strategies and improves our understanding of the solar system's formation and evolution. Dense photometry provides continuous time-series measurements valuable for determining an asteroid's rotation period, yet is limited to a singular phase angle. Conversely, sparse photometry offers non-continuous measurements across multiple phase angles, essential for determining an asteroid's absolute magnitude, albedo (reflectivity), and size. This paper presents open-source algorithms that integrate dense photometry from citizen scientists with sparse photometry from space and ground-based all-sky surveys to determine asteroids' albedo, size, rotation, strength, and composition. + Applying the algorithms to the Didymos binary asteroid, combined with data from GAIA, the Zwicky Transient Facility, and ATLAS photometric sky surveys, revealed Didymos to be 840 meters wide, with a 0.14 albedo, an 18.14 absolute magnitude, a 2.26-hour rotation period, rubble-pile strength, and an S-type composition. Didymos was the target of the 2022 NASA Double Asteroid Redirection Test (DART) mission. The algorithm successfully measured a 35-minute decrease in the mutual orbital period following the DART mission, equating to a 40-meter reduction in the mutual orbital radius, proving a successful deflection. Analysis of the broader asteroid population highlighted significant compositional diversity, with a predominance of carbonaceous (C-type) asteroids in the outer regions of the asteroid belt and siliceous (S-type) and metallic (M-type) asteroids more common in the inner regions. These findings provide insights into the diversity and distribution of asteroid compositions, reflecting the conditions and processes of the early solar system. This work empowers citizen scientists to become planetary defenders, contributing significantly to planetary defense and enhancing our understanding of solar system composition and evolution. - --- ## Introduction ### Background + There are over 1.3 million known asteroids, and advanced detection techniques lead to the discovery of hundreds of new near-Earth and main-belt asteroids every month. Studying these asteroids provides valuable insights into the early solar system's formation and evolution. 
Phase curves, which illustrate the change in an asteroid's brightness as its phase angle (the angle between the observer, asteroid, and Sun) changes, are essential for asteroid characterization. Understanding near-Earth asteroids is crucial because it allows for the development of effective deflection strategies, which are vital for preventing potential collisions with Earth and safeguarding our planet from catastrophic impacts. ### Research Problem + Despite advancements in detection techniques, the need for observations spanning multiple years, limited telescope availability, and narrow observation windows hinder detailed characterization of asteroids. To date, phase curves have been generated for only a few thousand asteroids. This slow pace of analysis hinders our planetary defense capabilities for deflecting potentially hazardous asteroids and limits our understanding of the solar system's evolution. ### Related Work -Recent efforts in the field have focused on various approaches to combine dense and sparse photometric datasets. For instance, the Pan-STARRS survey has used sparse photometry to estimate the absolute magnitudes and rotation periods of asteroids, while the Asteroid Terrestrial-impact Last Alert System (ATLAS) provides sparse photometry data for many asteroids observed across different phase angles. Studies by Shevchenko et al. (2019) have explored methods to derive phase integrals and geometric albedos from sparse data. On the dense photometry side, projects like the Zwicky Transient Facility (ZTF) and Gaia Data Release 3 (DR3) have contributed extensive datasets valuable for continuous observations of asteroid brightness variations. However, these efforts often face challenges in data integration due to differing observational cadences, filters, and coverage. + +Recent efforts in the field have focused on various approaches to combine dense and sparse photometric datasets. For instance, the Pan-STARRS survey has used sparse photometry to estimate the absolute magnitudes and rotation periods of asteroids, while the Asteroid Terrestrial-impact Last Alert System (ATLAS) provides sparse photometry data for many asteroids observed across different phase angles. Studies by @Shevchenko2019 have explored methods to derive phase integrals and geometric albedos from sparse data. On the dense photometry side, projects like the Zwicky Transient Facility (ZTF) and Gaia Data Release 3 (DR3) have contributed extensive datasets valuable for continuous observations of asteroid brightness variations. However, these efforts often face challenges in data integration due to differing observational cadences, filters, and coverage. ### Research Objectives + This paper presents an innovative methodology, PhAst, developed using Python algorithms to combine dense photometry from citizen scientists with sparse photometry from space and ground-based all-sky surveys to determine key physical characteristics of asteroids. The specific objectives of this research are to: + 1. Develop Python algorithms to integrate serendipitous asteroid observations with citizen-contributed and open datasets. 2. Apply these algorithms to planetary defense tests, such as NASA's DART mission. 3. Characterize large populations of asteroids to infer compositional diversity in our solar system. ### Significance of the Study + This study offers significant contributions to the field of asteroid characterization and planetary defense. 
By integrating dense and sparse photometry, the PhAst algorithm provides a comprehensive method for determining the physical properties of asteroids. The open-source nature of the algorithm encourages collaboration and improvements from a global community of researchers and citizen scientists, enhancing its robustness and accelerating advancements in the field. Furthermore, the study empowers citizen scientists to actively participate in planetary defense, contributing valuable data and insights that enhance our understanding and preparedness for potential asteroid impacts. ### Application in Planetary Defense: NASA’s DART Mission + The NASA Double Asteroid Redirection Test (DART) mission was designed to test and validate methods to protect Earth from hazardous asteroid impacts by demonstrating the kinetic impactor technique. It involved sending a spacecraft to collide with an asteroid to change its trajectory. PhAst provides a detailed pre- and post-impact analysis of the target asteroid, Didymos, and its moonlet, Dimorphos. ## Methodology + ### Overview The PhAst algorithm integrates dense and sparse photometry data to determine the physical properties of asteroids. Dense photometry provides continuous time-series measurements, crucial for determining rotation periods, while sparse photometry offers non-continuous measurements across multiple phase angles, essential for absolute magnitude and size determination. Integrating both methods can overcome their individual limitations. ### Development of Novel Open-Source PhAst -PhAst integrates several years of sparse photometry from serendipitous asteroid observations with dense photometry from professional and citizen scientists. **See Figure 1.** The algorithm effectively combines continuous light data (dense photometry) and infrequent light data (sparse photometry) by creating phase curves whose linear components yield the asteroid’s geometric albedo and composition, while the non-linear brightness surge at small angles determines the absolute magnitude. This methodology allows for the creation of folded light curves to measure the asteroid’s rotation period and, for binary asteroids, their mutual orbital period. Being open-source, the PhAst algorithm allows for collaboration and improvements from a global community of researchers and citizen scientists, enhancing its robustness and accelerating advancements in asteroid characterization. +PhAst integrates several years of sparse photometry from serendipitous asteroid observations with dense photometry from professional and citizen scientists [@fig1]. The algorithm effectively combines continuous light data (dense photometry) and infrequent light data (sparse photometry) by creating phase curves whose linear components yield the asteroid’s geometric albedo and composition, while the non-linear brightness surge at small angles determines the absolute magnitude. This methodology allows for the creation of folded light curves to measure the asteroid’s rotation period and, for binary asteroids, their mutual orbital period. Being open-source, the PhAst algorithm allows for collaboration and improvements from a global community of researchers and citizen scientists, enhancing its robustness and accelerating advancements in asteroid characterization. ```{figure} figure1.png :name: fig1 @@ -48,14 +55,16 @@ Flowchart Showing Data Integration Process of PhAst ``` ### Data Sources and Integration + 1. 
**Primary Asteroid Observations Using Robotic Telescopes:** Observation proposals were submitted to Alnitak Observatory, American Association of Variable Star Observers, Burke Gaffney Observatory, and Faulkes Telescope. 2. **Citizen Scientist Observations:** Observations submitted by backyard astronomers from locations such as Chile and the USA. 3. **Serendipitous Asteroid Observations in Sky Surveys:** Data from European Space Agency Gaia Data Release 3 and Zwicky Transient Facility (ZTF) Survey. -4. **Secondary Asteroid Databases:** Data from the Asteroid Lightcurve Database (ALCDEF) and Asteroid Photometric Data Catalog (PDS) 3rd update. +4. **Secondary Asteroid Databases:** Data from the Asteroid Lightcurve Database (ALCDEF) @ALCDEF and Asteroid Photometric Data Catalog (PDS) 3rd update. For searching asteroids in the ZTF dataset, the FINKS portal was utilized, which allowed searching asteroids by their Minor Planet Center (MPC) number. Similarly, asteroids in the GAIA dataset were searched using the Solar System Objects database of Gaia DR3. ### Observational Process + 1. **Identify Known Stars and Asteroids:** Using the GAIA Star Catalog and HORIZONS Asteroid Catalog, known stars and asteroids are identified and centroided in images. This step ensures that the exact positions of celestial objects are accurately determined, which is crucial for subsequent analysis. 2. **Determine Optimal Aperture Size:** Differential photometry is used to calculate the asteroid's instrumental magnitude by determining the optimal aperture size that balances brightness measurement and noise. Too small an aperture may not capture the full brightness of the asteroid, while too large an aperture may include excessive background noise. 3. **Select Suitable Comparison Stars:** Comparison stars with stable brightness are selected to remove the effects of seeing conditions and determine the asteroid's computed magnitude. This step is important to ensure that variations in observed brightness are due to the asteroid itself and not due to atmospheric conditions or instrumental errors. @@ -64,108 +73,122 @@ For searching asteroids in the ZTF dataset, the FINKS portal was utilized, which 6. **Determine Rotation and Orbital Periods:** Composite light curves are used to find the asteroid's rotation period and, for binary asteroids, the mutual orbital period. This analysis reveals the dynamic characteristics of the asteroid, including its spin state and orbital interactions with companion bodies. ### Python Tools and Libraries + The development and implementation of PhAst heavily relied on various Python tools and libraries: + - **Python:** The primary programming language used for developing PhAst. - **NumPy:** Used for numerical computations and handling large datasets efficiently. - **Matplotlib:** Utilized for plotting phase curves and light curves, visualizing the data, and generating graphs for analysis. - **AstroPy:** Employed for astronomical calculations and handling astronomical data, such as coordinate transformations and time conversions. ## Case Study: Didymos Binary Asteroid + ### Initial Observations + The Didymos binary asteroid, targeted by NASA's 2022 Double Asteroid Redirection Test (DART) mission, was selected for a detailed case study. 
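Its brightness measurements follow the differential-photometry procedure outlined in the Observational Process above. The following minimal sketch illustrates that step under stated assumptions: it relies on photutils (which is not among the libraries listed above), and the aperture radius, source positions, and function names are illustrative placeholders rather than PhAst's actual implementation.

```python
import numpy as np
from photutils.aperture import CircularAperture, aperture_photometry

def instrumental_magnitude(image, xy, radius):
    """Sum the counts inside a circular aperture and convert them to an instrumental magnitude."""
    flux = aperture_photometry(image, CircularAperture([xy], r=radius))["aperture_sum"][0]
    return -2.5 * np.log10(flux)

def calibrated_magnitude(image, asteroid_xy, comparison_xy, comparison_catalog_mags, radius=6.0):
    """Differential photometry: shift the asteroid's instrumental magnitude by the mean
    offset between the comparison stars' catalog and instrumental magnitudes."""
    offsets = [
        catalog_mag - instrumental_magnitude(image, xy, radius)
        for xy, catalog_mag in zip(comparison_xy, comparison_catalog_mags)
    ]
    return instrumental_magnitude(image, asteroid_xy, radius) + np.mean(offsets)
```

In practice the aperture radius would be chosen by repeating the measurement over a range of radii and keeping the one that best balances captured flux against background noise, as described in step 2 above.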
Initial observations determined Didymos to be 840 meters wide, with a 0.14 albedo, an 18.14 absolute magnitude (a measure of its intrinsic brightness), a 2.26-hour rotation period, rubble-pile strength (indicating it is a loose collection of rocks held together by gravity), and an S-type composition (indicating it is made of stony or siliceous minerals). These properties were derived by applying the PhAst algorithm to a combination of dense and sparse photometric data. ### Impact Analysis + PhAst successfully measured a 35-minute decrease in the mutual orbital period following the DART mission's impact. External sources validated these findings, demonstrating the algorithm's accuracy and reliability. The change in the mutual orbital period provided critical data on the effectiveness of the DART mission in altering the asteroid's trajectory, a key goal of planetary defense strategies. ## Results -PhAst was used to generate phase curves for over 2100 asteroids in 100 hours on a home computer, including data-retrieval time. The physical properties of various target asteroids of space missions and understudied asteroids were determined, including targets of the NASA LUCY Mission, UAE Mission, binary asteroids, and understudied asteroids. **See figure 2.** The rapid analysis capability highlights PhAst's potential for large-scale asteroid characterization, enabling detailed studies of large populations of asteroids in a relatively short time. + +PhAst was used to generate phase curves for over 2100 asteroids in 100 hours on a home computer, including data-retrieval time. The physical properties of various target asteroids of space missions and understudied asteroids were determined, including targets of the NASA LUCY Mission, UAE Mission, binary asteroids, and understudied asteroids [@fig2]. The rapid analysis capability highlights PhAst's potential for large-scale asteroid characterization, enabling detailed studies of large populations of asteroids in a relatively short time. ```{figure} figure3.png :name: fig2 :align: center -Physical Properties of Target Asteroids of Space Missions and Understudied Asteroids Determined +Physical Properties of Target Asteroids of Space Missions and Understudied Asteroids Determined ``` + ### Determining Physical Properties of Target Asteroids of Space Missions and Understudied Asteroids + PhAst was used to generate phase curves and determine the physical properties of various target asteroids of space missions and understudied asteroids. The results include: #### NASA LUCY Mission Targets -The NASA LUCY mission aims to explore Trojan asteroids, which share Jupiter's orbit around the Sun. Understanding these asteroids can provide insights into the early solar system since Trojans are considered remnants of the primordial material that formed the outer planets. -- **3548 Eurybates:** - - Absolute Magnitude (H) = 9.75 ± 0.05 - - Slope Parameter (G) = 0.11 - - Albedo = 0.05 - * Relevance: Eurybates is the largest and presumably the most ancient member of the Eurybates family, offering a window into the conditions of the early solar system. +The NASA LUCY mission aims to explore Trojan asteroids, which share Jupiter's orbit around the Sun. Understanding these asteroids can provide insights into the early solar system since Trojans are considered remnants of the primordial material that formed the outer planets. 
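The absolute magnitudes (H) and slope parameters (G) reported below come from phase-curve fits of reduced magnitude against phase angle. A minimal sketch of such a fit with the standard IAU (H, G) phase function is given here as an illustration only; it is not PhAst's actual implementation, and the observation arrays are synthetic placeholders.

```python
import numpy as np
from scipy.optimize import curve_fit

def hg_model(alpha_deg, H, G):
    """IAU (H, G) phase function: reduced magnitude versus phase angle (Bowell et al. 1989)."""
    alpha = np.radians(alpha_deg)
    phi1 = np.exp(-3.33 * np.tan(alpha / 2.0) ** 0.63)
    phi2 = np.exp(-1.87 * np.tan(alpha / 2.0) ** 1.22)
    return H - 2.5 * np.log10((1.0 - G) * phi1 + G * phi2)

# Synthetic placeholder observations: phase angles (degrees) and reduced magnitudes.
alpha_obs = np.array([2.1, 5.4, 9.8, 14.3, 19.7, 24.5])
mag_obs = np.array([18.35, 18.55, 18.74, 18.91, 19.08, 19.24])

(H_fit, G_fit), _ = curve_fit(hg_model, alpha_obs, mag_obs, p0=(18.0, 0.15))

# Diameter (km) from absolute magnitude and an assumed geometric albedo p_V,
# using the standard relation D = 1329 / sqrt(p_V) * 10**(-H / 5).
p_V = 0.14
diameter_km = 1329.0 / np.sqrt(p_V) * 10 ** (-H_fit / 5.0)
print(f"H = {H_fit:.2f}, G = {G_fit:.2f}, D = {diameter_km:.2f} km")
```

The final two lines use the standard diameter-albedo-magnitude relation; plugging in the paper's values for Didymos (H = 18.14, albedo 0.14) gives roughly 0.84 km, consistent with the 840-meter width quoted above.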
-- **10253 Westerwald:** - - Absolute Magnitude (H) = 15.33 ± 0.05 - - Slope Parameter (G) = 0.17 - - Albedo = 0.21 - * Relevance: Westerwald's high albedo suggests it might be a fragment from a larger parent body, providing clues about collisional processes in the early solar system. +3548 Eurybates +: Absolute Magnitude (H) = 9.75 ± 0.05 +: Slope Parameter (G) = 0.11 +: Albedo = 0.05 +: Relevance: Eurybates is the largest and presumably the most ancient member of the Eurybates family, offering a window into the conditions of the early solar system. +10253 Westerwald +: Absolute Magnitude (H) = 15.33 ± 0.05 +: Slope Parameter (G) = 0.17 +: Albedo = 0.21 +: Relevance: Westerwald's high albedo suggests it might be a fragment from a larger parent body, providing clues about collisional processes in the early solar system. #### UAE Mission Targets -The UAE space mission to explore asteroids aims to study their composition, structure, and history, contributing to our understanding of asteroid formation and the evolution of the solar system. -- **269 Justitia:** - - Absolute Magnitude (H) = 9.93 ± 0.09 - - Slope Parameter (G) = 0.11 - - Albedo = 0.09 - * Relevance: Justitia's relatively low albedo indicates a carbonaceous composition, which can help researchers understand the distribution of organic materials in the solar system. +The UAE space mission to explore asteroids aims to study their composition, structure, and history, contributing to our understanding of asteroid formation and the evolution of the solar system. -- **15094 Polymele:** - - Absolute Magnitude (H) = 11.69 ± 0.07 - - Slope Parameter (G) = 0.18 - - Albedo = 0.05 - * Relevance: Polymele's properties suggest it is a primitive body, providing valuable information about the early solar system's building blocks. +269 Justitia +: Absolute Magnitude (H) = 9.93 ± 0.09 +: Slope Parameter (G) = 0.11 +: Albedo = 0.09 +: Relevance: Justitia's relatively low albedo indicates a carbonaceous composition, which can help researchers understand the distribution of organic materials in the solar system. +15094 Polymele +: Absolute Magnitude (H) = 11.69 ± 0.07 +: Slope Parameter (G) = 0.18 +: Albedo = 0.05 +: Relevance: Polymele's properties suggest it is a primitive body, providing valuable information about the early solar system's building blocks. #### Binary Asteroids + Understanding binary asteroids, where two asteroids orbit each other, can offer insights into the formation and evolutionary history of these systems. The mutual orbital period and other physical properties provide data on their dynamics and interactions. -- **3378 Susanvictoria:** - - Absolute Magnitude (H) = 13.83 ± 0.05 - - Slope Parameter (G) = 0.27 - - Albedo = 0.19 - * Relevance: Studying binary systems like Susanvictoria helps in understanding the processes that lead to the formation of binary asteroids and their subsequent evolution. +3378 Susanvictoria +: Absolute Magnitude (H) = 13.83 ± 0.05 +: Slope Parameter (G) = 0.27 +: Albedo = 0.19 +: Relevance: Studying binary systems like Susanvictoria helps in understanding the processes that lead to the formation of binary asteroids and their subsequent evolution. -- **2825 Crosby:** - - Absolute Magnitude (H) = 13.33 ± 0.06 - - Slope Parameter (G) = 0.11 - - Albedo = 0.07 - * Relevance: Crosby's characteristics can provide insights into the collisional history and mechanical properties of binary asteroid systems. -The physical properties of the binary asteroids were submitted to the binary asteroid working group. 
+2825 Crosby +: Absolute Magnitude (H) = 13.33 ± 0.06 +: Slope Parameter (G) = 0.11 +: Albedo = 0.07 +: Relevance: Crosby's characteristics can provide insights into the collisional history and mechanical properties of binary asteroid systems. The physical properties of the binary asteroids were submitted to the binary asteroid working group. #### Understudied Asteroids -Characterizing understudied asteroids expands our knowledge of the diversity and distribution of asteroid properties in the solar system. -- **2006 MG13:** - - Absolute Magnitude (H) = 15.94 ± 0.08 - - Slope Parameter (G) = 0.21 - - Albedo = 0.19 - * Relevance: Detailed study of asteroids like 2006 MG13 helps fill gaps in our understanding of the physical and compositional diversity of asteroids. +Characterizing understudied asteroids expands our knowledge of the diversity and distribution of asteroid properties in the solar system. -- **2007 AD11:** - - Absolute Magnitude (H) = 15.76 ± 0.11 - - Slope Parameter (G) = 0.13 - - Albedo = 0.13 - * Relevance: Investigating such understudied bodies contributes to a more complete picture of asteroid population characteristics and their evolutionary paths. +2006 MG13 +: Absolute Magnitude (H) = 15.94 ± 0.08 +: Slope Parameter (G) = 0.21 +: Albedo = 0.19 +: Relevance: Detailed study of asteroids like 2006 MG13 helps fill gaps in our understanding of the physical and compositional diversity of asteroids. +2007 AD11 +: Absolute Magnitude (H) = 15.76 ± 0.11 +: Slope Parameter (G) = 0.13 +: Albedo = 0.13 +: Relevance: Investigating such understudied bodies contributes to a more complete picture of asteroid population characteristics and their evolutionary paths. ## Discussions + ### Determining the Success of Asteroid Deflection + The success of the DART mission was evaluated by analyzing the change in the orbital path of Dimorphos, the moonlet of Didymos, after deflection. Applying Kepler's Third Law, the pre-impact orbital period of 11.91 hours and post-impact orbital period of 11.34 hours were used to calculate an orbital radius change of 0.04 km. This change confirms the effectiveness of the DART mission in altering the asteroid's trajectory, a crucial component of planetary defense. ### Determining Asteroid Strength -Asteroid strength can be inferred from the rotation period. This inference is based on the fact that an asteroid's structural integrity must be sufficient to withstand the centrifugal forces generated by its rotation. If the rotation period is less than 2.2 hours, the asteroid must be a strength-bound single rock; otherwise, it would fly apart due to centrifugal forces exceeding the gravitational binding forces. This criterion is supported by studies such as those by Pravec and Harris (2000), who observed that most asteroids with rotation periods shorter than 2.2 hours are smaller than 150 meters and are likely monolithic. For larger asteroids, the rubble-pile structure is held together by self-gravity rather than cohesive forces, making them prone to disaggregation at faster rotation rates. This information is vital for assessing the structural integrity of asteroids and planning deflection missions. + +Asteroid strength can be inferred from the rotation period. This inference is based on the fact that an asteroid's structural integrity must be sufficient to withstand the centrifugal forces generated by its rotation. 
If the rotation period is less than 2.2 hours, the asteroid must be a strength-bound single rock; otherwise, it would fly apart due to centrifugal forces exceeding the gravitational binding forces. This criterion is supported by studies such as those by @Pravec2000, who observed that most asteroids with rotation periods shorter than 2.2 hours are smaller than 150 meters and are likely monolithic. For larger asteroids, the rubble-pile structure is held together by self-gravity rather than cohesive forces, making them prone to disaggregation at faster rotation rates. This information is vital for assessing the structural integrity of asteroids and planning deflection missions. ### Determining Asteroid Taxonomy + Asteroid taxonomy (chemical composition) can be determined from geometric albedo. C-type asteroids have lower albedo, S-type and M-type asteroids have moderate albedo, and rare E-type asteroids have the highest albedo. (S-type asteroids are made of stony or siliceous minerals, while C-type and M-type refer to carbonaceous and metallic compositions, respectively.) The taxonomic distribution provides insights into the conditions of the early solar system based on the spatial distribution of asteroid types. Understanding these compositions helps in determining the origins and evolutionary history of these asteroids. ### Early Solar System Conditions -The taxonomical distributions of carbonaceous, siliceous, and metallic asteroids in the main belt were compiled. Over 58% of the asteroids characterized by PhAst are carbonaceous, showing they are the most abundant type in our Solar System. Their abundance increases with distance from the Sun, reaching nearly 75% in the outer region of the main belt compared to over 45% in the inner region. **See figure 3.** This finding is consistent with previous research in the field, such as studies by DeMeo and Carry (2014), which indicate that carbonaceous asteroids are prevalent in the outer asteroid belt. + +The taxonomical distributions of carbonaceous, siliceous, and metallic asteroids in the main belt were compiled. Over 58% of the asteroids characterized by PhAst are carbonaceous, showing they are the most abundant type in our Solar System. Their abundance increases with distance from the Sun, reaching nearly 75% in the outer region of the main belt compared to over 45% in the inner region [@fig3]. This finding is consistent with previous research in the field, such as studies by @DeMeo2014, which indicate that carbonaceous asteroids are prevalent in the outer asteroid belt. Characterizing asteroid populations helps us better understand the diversity of compositions in the solar system by providing a detailed inventory of the different types of asteroids and their distribution. This information is crucial for several reasons: + - **Formation Conditions:** Different types of asteroids formed under varying conditions in the early solar system. For example, carbonaceous (C-type) asteroids, which are rich in organic compounds, are more prevalent in the outer regions of the asteroid belt, suggesting formation in cooler, volatile-rich environments. In contrast, siliceous (S-type) and metallic (M-type) asteroids are more common in the inner regions, indicating formation in hotter, more metal-rich conditions. - **Evolutionary Processes:** By studying the physical and chemical properties of asteroids, we can infer the processes that have shaped their evolution. 
This includes understanding how collisions, thermal processes, and space weathering have affected their surfaces and internal structures. @@ -177,46 +200,39 @@ Spatial Distribution of Asteroid Types ``` ### Errors and Limitations + Photometry was performed on images with a Signal-to-Noise Ratio (SNR) > 100, yielding a measurement uncertainty of 0.01. The average error in phase curve fitting was 0.10. Limited processing power restricted the preciseness of the best fit for rotation and mutual orbital periods to two significant digits. These limitations highlight the need for more powerful computational resources and more precise observational data to improve the accuracy of asteroid characterization. ## Conclusions + PhAst represents a significant advancement in asteroid characterization, combining dense and sparse photometry to yield comprehensive insights into asteroid properties. The successful application of PhAst to the Didymos binary asteroid and over 2100 other asteroids demonstrates its potential for large-scale use. By engaging citizen scientists, we can accelerate asteroid analysis and enhance our planetary defense strategies. ## Future Work -PhAst will serve as a powerful tool for accelerating the analysis of data produced by the Legacy Survey of Space and Time (LSST), set to begin in 2025. Over a decade, LSST aims to observe over 5 million asteroids across various filters, generating a nightly data volume of 20TB. The specific benefits and new opportunities that PhAst's applications might bring include: -- **Enhanced Planetary Defense:** By rapidly characterizing large populations of asteroids, including potentially hazardous asteroids (PHAs), PhAst can provide detailed analysis that are crucial for developing effective deflection strategies, thereby enhancing planetary defense capabilities. -- **Comprehensive Asteroid Mapping:** The integration of dense and sparse photometry allows for the creation of more accurate and comprehensive maps of asteroid distributions and compositions in the solar system. This can provide valuable insights into the formation and evolution of the solar system, aiding both scientific research and educational initiatives. -- **Resource Identification and Utilization:** PhAst's ability to determine the physical and compositional properties of asteroids can aid in identifying asteroids rich in valuable minerals or water. This opens up new opportunities for asteroid mining and resource utilization, which could support long-term space exploration and the development of space infrastructure. -- **Support for Future Space Missions:** PhAst can be used to provide detailed pre and post mission characterization of target asteroids for upcoming space missions including NASA’s OSIRIS-APEX which will fly-by near-Earth asteroid Apophis on April 23, 2029, JAXA’s Hayabusa2 SHARP to explore two asteroids, 2001 CC21 and 1998 KY26, and China’s first kinetic impact deflection test mission would target the near-Earth asteroid 2015 XF261 with a launch in 2027. -- **Exoplanetary Atmosphere Characterization:** PhAst can be expanded to exoplanetary atmosphere characterization by adapting its methodology to analyze the light curves from transiting exoplanets in multiple filters. This expansion would allow researchers to study the atmospheres of distant planets, providing insights into their composition, climate, and potential habitability. 
-- **Citizen Science and Public Engagement:** By making PhAst open-source and developing training modules for citizen scientists, the project promotes public engagement in scientific research. This democratization of science enables a wider community to contribute to and benefit from cutting-edge research, fostering a culture of curiosity and collaboration. -## Project Impact -The PhAst algorithm has been made open-source, and training modules have been developed for citizen scientists. These modules, created using Jupyter Notebooks, are designed for use by high school students and citizen scientists to support their engagement in asteroid characterization and planetary defense. Training on using open data for asteroid categorization has been provided to over 1,500 students during "Space Day" and "Asteroid Day" events in collaboration with observatories and community organizations such as Royal Astronomical Society of Canada. See link to Github: https://github.com/Spacegirl123/Asteroid-Characterization-By-PhAst - -## Acknowledgments -The development and application of PhAst have been possible thanks to contributions from numerous observatories, citizen scientists, and research institutions. Special thanks to the teams behind GAIA, Zwicky Transient Facility (ZTF), ATLAS, and other photometric surveys for providing the data that made this research possible. I also acknowledge the support of various citizen science communities and educational organizations for their collaboration and participation. - -## References - -[1] Center for Near-Earth Object Studies. Total number of asteroids discovered monthly. Retrieved from https://cneos.jpl.nasa.gov/stats/totals.html +PhAst will serve as a powerful tool for accelerating the analysis of data produced by the Legacy Survey of Space and Time (LSST), set to begin in 2025. Over a decade, LSST aims to observe over 5 million asteroids across various filters, generating a nightly data volume of 20TB. The specific benefits and new opportunities that PhAst's applications might bring include: -[2] NASA/Johns Hopkins University Applied Physics Laboratory. (2022, March). NASA's first planetary defense technology demonstration to collide with asteroid in 2022. https://www.nasa.gov/feature/nasa-s-first-planetary-defense-technology-demonstration-to-collide-with-asteroid-in-2022 +Enhanced Planetary Defense +: By rapidly characterizing large populations of asteroids, including potentially hazardous asteroids (PHAs), PhAst can provide detailed analysis that are crucial for developing effective deflection strategies, thereby enhancing planetary defense capabilities. -[3] Shevchenko, V. G., et al. (2019). Phase integral of asteroids. Astronomy & Astrophysics, 626(A87). https://doi.org/10.1051/0004-6361/201935588 +Comprehensive Asteroid Mapping +: The integration of dense and sparse photometry allows for the creation of more accurate and comprehensive maps of asteroid distributions and compositions in the solar system. This can provide valuable insights into the formation and evolution of the solar system, aiding both scientific research and educational initiatives. -[4] Talbert, T. (2022, October 11). NASA DART imagery shows changed orbit of target asteroid. NASA. https://www.nasa.gov/solar-system/nasa-dart-imagery-shows-changed-orbit-of-target-asteroid/ +Resource Identification and Utilization +: PhAst's ability to determine the physical and compositional properties of asteroids can aid in identifying asteroids rich in valuable minerals or water. 
This opens up new opportunities for asteroid mining and resource utilization, which could support long-term space exploration and the development of space infrastructure. -[5] Jet Propulsion Laboratory. (n.d.). Small-Body Database Lookup. https://ssd.jpl.nasa.gov/tools/sbdb_lookup.html#/?sstr=65803 +Support for Future Space Missions +: PhAst can be used to provide detailed pre- and post-mission characterization of target asteroids for upcoming space missions, including NASA’s OSIRIS-APEX, which will fly by the near-Earth asteroid Apophis on April 23, 2029; JAXA’s Hayabusa2 SHARP, which will explore two asteroids, 2001 CC21 and 1998 KY26; and China’s first kinetic impact deflection test mission, which is planned to launch in 2027 and target the near-Earth asteroid 2015 XF261. -[6] Fink Broker. (n.d.). ZTF Minor Planet Photometric Data Release. https://fink-portal.org/ +Exoplanetary Atmosphere Characterization +: PhAst can be expanded to exoplanetary atmosphere characterization by adapting its methodology to analyze the light curves from transiting exoplanets in multiple filters. This expansion would allow researchers to study the atmospheres of distant planets, providing insights into their composition, climate, and potential habitability. -[7] European Space Agency. (n.d.). Gaia Data Release 3. https://www.cosmos.esa.int/web/gaia/dr3 +Citizen Science and Public Engagement +: By making PhAst open-source and developing training modules for citizen scientists, the project promotes public engagement in scientific research. This democratization of science enables a wider community to contribute to and benefit from cutting-edge research, fostering a culture of curiosity and collaboration. -[8] ALCDEF. (n.d.). Asteroid Lightcurve Photometry Database. https://alcdef.org/ +## Project Impact -[9] Planetary Science Institute. (n.d.). Asteroid Photometric Catalog (APC) "Third Update." https://sbn.psi.edu/pds/resource/apc.html +The PhAst algorithm has been made open-source, and training modules have been developed for citizen scientists. These modules, created using Jupyter Notebooks, are designed for use by high school students and citizen scientists to support their engagement in asteroid characterization and planetary defense. Training on using open data for asteroid categorization has been provided to over 1,500 students during "Space Day" and "Asteroid Day" events in collaboration with observatories and community organizations such as the Royal Astronomical Society of Canada. See the GitHub repository: https://github.com/Spacegirl123/Asteroid-Characterization-By-PhAst -[10] Pravec, P., & Harris, A. W. (2000). Fast and Slow Rotation of Asteroids. Icarus, 148(1), 12-20. https://doi.org/10.1006/icar.2000.6482 +## Acknowledgments -[11] DeMeo, F. E., & Carry, B. (2014). Solar System evolution from compositional mapping of the asteroid belt. Nature, 505(7485), 629-634. https://doi.org/10.1038/nature12908 +The development and application of PhAst have been possible thanks to contributions from numerous observatories, citizen scientists, and research institutions. Special thanks to the teams behind Gaia, Zwicky Transient Facility (ZTF), ATLAS, and other photometric surveys for providing the data that made this research possible. I also acknowledge the support of various citizen science communities and educational organizations for their collaboration and participation.
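As a quick, illustrative cross-check of the measurement uncertainty quoted in the Errors and Limitations section (this sketch is ours, not part of the PhAst pipeline, and assumes only the standard first-order relation between flux SNR and magnitude error):

```python
import math

# Approximate photometric magnitude error from the signal-to-noise ratio:
# sigma_m ~= 2.5 / ln(10) / SNR (first-order propagation of the flux error)
def magnitude_uncertainty(snr: float) -> float:
    return 2.5 / math.log(10) / snr

# At the SNR > 100 threshold used for the photometry, this gives ~0.011 mag,
# consistent with the stated measurement uncertainty of 0.01.
print(round(magnitude_uncertainty(100), 3))
```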
diff --git a/papers/Arushi_Nath/mybib.bib b/papers/Arushi_Nath/mybib.bib index 9c1fa0275a..dbe5640eb5 100644 --- a/papers/Arushi_Nath/mybib.bib +++ b/papers/Arushi_Nath/mybib.bib @@ -1,130 +1,93 @@ -# Feel free to delete these first few references, which are specific to the template: - -@book{hume48, - author = "David Hume", - year = "1748", - title = "An enquiry concerning human understanding", - address = "Indianapolis, IN", - publisher = "Hackett", - doi = {https://doi.org/10.1017/CBO9780511808432}, +@misc{CNEOS_Totals, + author = {{Center for Near-Earth Object Studies}}, + title = {Total number of asteroids discovered monthly}, + howpublished = {\url{https://cneos.jpl.nasa.gov/stats/totals.html}}, + note = {Accessed: October 1, 2024} } -@article{Atr03, - author = "P Atreides", - year = "2003", - title = "How to catch a sandworm", - journal = "Transactions on Terraforming", - volume = 21, - issue = 3, - pages = {261-300} +@misc{NASA_DART2022, + author = {{NASA/Johns Hopkins University Applied Physics Laboratory}}, + title = {NASA's first planetary defense technology demonstration to collide with asteroid in 2022}, + month = {March}, + year = {2022}, + howpublished = {\url{https://www.nasa.gov/feature/nasa-s-first-planetary-defense-technology-demonstration-to-collide-with-asteroid-in-2022}}, + note = {Accessed: October 1, 2024} } -@misc{terradesert, - author = {{TerraDesert Team}}, - title = {Code for terraforming a desert}, - year = {2000}, - url = {https://terradesert.com/code/}, - note = {Accessed 1 Jan. 2000} +@article{Shevchenko2019, + author = {Shevchenko, V. G. and Tedesco, E. F. and Kovalchuk, L. O. and Fiacconi, A. M. and Zubarev, V. A.}, + title = {Phase integral of asteroids}, + journal = {Astronomy \& Astrophysics}, + volume = {626}, + pages = {A87}, + year = {2019}, + doi = {10.1051/0004-6361/201935588} } -# These references may be helpful: - -@inproceedings{jupyter, - abstract = {It is increasingly necessary for researchers in all fields to write computer code, and in order to reproduce research results, it is important that this code is published. We present Jupyter notebooks, a document format for publishing code, results and explanations in a form that is both readable and executable. 
We discuss various tools and use cases for notebook documents.}, - author = {Kluyver, Thomas and Ragan-Kelley, Benjamin and Pérez, Fernando and Granger, Brian and Bussonnier, Matthias and Frederic, Jonathan and Kelley, Kyle and Hamrick, Jessica and Grout, Jason and Corlay, Sylvain and Ivanov, Paul and Avila, Damián and Abdalla, Safia and Willing, Carol and {Jupyter development team}}, - editor = {Loizides, Fernando and Scmidt, Birgit}, - location = {Netherlands}, - publisher = {IOS Press}, - url = {https://eprints.soton.ac.uk/403913/}, - booktitle = {Positioning and Power in Academic Publishing: Players, Agents and Agendas}, - year = {2016}, - pages = {87--90}, - title = {Jupyter Notebooks - a publishing format for reproducible computational workflows}, +@misc{Talbert2022, + author = {Talbert, Tricia}, + title = {NASA DART imagery shows changed orbit of target asteroid}, + howpublished = {NASA}, + month = {October}, + day = {11}, + year = {2022}, + url = {https://www.nasa.gov/solar-system/nasa-dart-imagery-shows-changed-orbit-of-target-asteroid/}, + note = {Accessed: October 1, 2024} } -@article{matplotlib, - abstract = {Matplotlib is a 2D graphics package used for Python for application development, interactive scripting, and publication-quality image generation across user interfaces and operating systems.}, - author = {Hunter, J. D.}, - publisher = {IEEE COMPUTER SOC}, - year = {2007}, - doi = {https://doi.org/10.1109/MCSE.2007.55}, - journal = {Computing in Science \& Engineering}, - number = {3}, - pages = {90--95}, - title = {Matplotlib: A 2D graphics environment}, - volume = {9}, +@misc{JPL_SBDB, + author = {{Jet Propulsion Laboratory}}, + title = {Small-Body Database Lookup}, + howpublished = {\url{https://ssd.jpl.nasa.gov/tools/sbdb_lookup.html\#/?sstr=65803}}, + note = {Accessed: October 1, 2024} } -@article{numpy, - author = {Harris, Charles R. and Millman, K. Jarrod and van der Walt, Stéfan J. and Gommers, Ralf and Virtanen, Pauli and Cournapeau, David and Wieser, Eric and Taylor, Julian and Berg, Sebastian and Smith, Nathaniel J. and Kern, Robert and Picus, Matti and Hoyer, Stephan and van Kerkwijk, Marten H. 
and Brett, Matthew and Haldane, Allan and del Río, Jaime Fernández and Wiebe, Mark and Peterson, Pearu and Gérard-Marchant, Pierre and Sheppard, Kevin and Reddy, Tyler and Weckesser, Warren and Abbasi, Hameer and Gohlke, Christoph and Oliphant, Travis E.}, - publisher = {Springer Science and Business Media {LLC}}, - doi = {https://doi.org/10.1038/s41586-020-2649-2}, - date = {2020-09}, - year = {2020}, - journal = {Nature}, - number = {7825}, - pages = {357--362}, - title = {Array programming with {NumPy}}, - volume = {585}, +@misc{FinkBroker, + author = {{Fink Broker}}, + title = {ZTF Minor Planet Photometric Data Release}, + howpublished = {\url{https://fink-portal.org/}}, + note = {Accessed: October 1, 2024} } -@misc{pandas1, - author = {{The Pandas Development Team}}, - title = {pandas-dev/pandas: Pandas}, - month = feb, - year = {2020}, - publisher = {Zenodo}, - version = {latest}, - url = {https://doi.org/10.5281/zenodo.3509134}, +@misc{ESA_Gaia_DR3, + author = {{European Space Agency}}, + title = {Gaia Data Release 3}, + howpublished = {\url{https://www.cosmos.esa.int/web/gaia/dr3}}, + note = {Accessed: October 1, 2024} } -@inproceedings{pandas2, - author = {Wes McKinney}, - title = {{D}ata {S}tructures for {S}tatistical {C}omputing in {P}ython}, - booktitle = {{P}roceedings of the 9th {P}ython in {S}cience {C}onference}, - pages = {56 - 61}, - year = {2010}, - editor = {{S}t\'efan van der {W}alt and {J}arrod {M}illman}, - doi = {https://doi.org/10.25080/Majora-92bf1922-00a}, +@misc{ALCDEF, + author = {{ALCDEF}}, + title = {Asteroid Lightcurve Photometry Database}, + howpublished = {\url{https://alcdef.org/}}, + note = {Accessed: October 1, 2024} } -@article{scipy, - author = {Virtanen, Pauli and Gommers, Ralf and Oliphant, Travis E. and - Haberland, Matt and Reddy, Tyler and Cournapeau, David and - Burovski, Evgeni and Peterson, Pearu and Weckesser, Warren and - Bright, Jonathan and {van der Walt}, St{\'e}fan J. and - Brett, Matthew and Wilson, Joshua and Millman, K. Jarrod and - Mayorov, Nikolay and Nelson, Andrew R. J. and Jones, Eric and - Kern, Robert and Larson, Eric and Carey, C J and - Polat, {\.I}lhan and Feng, Yu and Moore, Eric W. and - {VanderPlas}, Jake and Laxalde, Denis and Perktold, Josef and - Cimrman, Robert and Henriksen, Ian and Quintero, E. A. and - Harris, Charles R. and Archibald, Anne M. and - Ribeiro, Ant{\^o}nio H. and Pedregosa, Fabian and - {van Mulbregt}, Paul and {SciPy 1.0 Contributors}}, - title = {{{SciPy} 1.0: Fundamental Algorithms for Scientific - Computing in Python}}, - journal = {Nature Methods}, - year = {2020}, - volume = {17}, - pages = {261--272}, - adsurl = {https://rdcu.be/b08Wh}, - doi = {https://doi.org/10.1038/s41592-019-0686-2}, +@misc{PSI_APC, + author = {{Planetary Science Institute}}, + title = {Asteroid Photometric Catalog (APC) "Third Update"}, + howpublished = {\url{https://sbn.psi.edu/pds/resource/apc.html}}, + note = {Accessed: October 1, 2024} } -@article{sklearn1, - author = {Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.}, - year = {2011}, - journal = {Journal of Machine Learning Research}, - pages = {2825--2830}, - title = {Scikit-learn: Machine Learning in {P}ython}, - volume = {12}, +@article{Pravec2000, + author = {Pravec, P. and Harris, A. 
W.}, + title = {Fast and Slow Rotation of Asteroids}, + journal = {Icarus}, + volume = {148}, + number = {1}, + pages = {12--20}, + year = {2000}, + doi = {10.1006/icar.2000.6482} } -@inproceedings{sklearn2, - author = {Buitinck, Lars and Louppe, Gilles and Blondel, Mathieu and Pedregosa, Fabian and Mueller, Andreas and Grisel, Olivier and Niculae, Vlad and Prettenhofer, Peter and Gramfort, Alexandre and Grobler, Jaques and Layton, Robert and VanderPlas, Jake and Joly, Arnaud and Holt, Brian and Varoquaux, Gaël}, - booktitle = {ECML PKDD Workshop: Languages for Data Mining and Machine Learning}, - year = {2013}, - pages = {108--122}, - title = {{API} design for machine learning software: experiences from the scikit-learn project}, +@article{DeMeo2014, + author = {DeMeo, F. E. and Carry, B.}, + title = {Solar System evolution from compositional mapping of the asteroid belt}, + journal = {Nature}, + volume = {505}, + number = {7485}, + pages = {629--634}, + year = {2014}, + doi = {10.1038/nature12908} } diff --git a/papers/Arushi_Nath/myst.yml b/papers/Arushi_Nath/myst.yml index 4e6e742d97..61ade50344 100644 --- a/papers/Arushi_Nath/myst.yml +++ b/papers/Arushi_Nath/myst.yml @@ -1,5 +1,7 @@ version: 1 +extends: ../proceedings.yml project: + doi: 10.25080/TWCF2755 # Update this to match `scipy-2024-` the folder should be `` id: scipy-2024-Arushi_Nath # Ensure your title is the same as in your `main.md` @@ -17,7 +19,18 @@ project: - Open Data - Citizen Scientists - Asteroid Characterization - + abbreviations: + ATLAS: Asteroid Terrestrial-impact Last Alert System + DART: Double Asteroid Redirection Test + ZTF: Zwicky Transient Facility + LSST: Legacy Survey of Space and Time + ALCDEF: Asteroid Lightcurve Database + C-type: carbonaceous + S-type: siliceous + M-type: metallic + PDS: Photometric Data Catalog + MPC: Minor Planet Center + SNR: Signal-to-Noise Ratio # It is possible to explicitly ignore the `doi-exists` check for certain citation keys error_rules: - rule: doi-exists @@ -28,13 +41,5 @@ project: - jupyter - sklearn1 - sklearn2 - # A banner will be generated for you on publication, this is a placeholder - banner: banner.png - # The rest of the information shouldn't be modified - subject: Research Article - open_access: true - license: CC-BY-4.0 - venue: Scipy 2024 - date: 2024-07-10 site: template: article-theme diff --git a/papers/Arushi_Nath/thumbnail.png b/papers/Arushi_Nath/thumbnail.png new file mode 100644 index 0000000000..3b98895060 Binary files /dev/null and b/papers/Arushi_Nath/thumbnail.png differ diff --git a/papers/Gagnon_Kebe_Tahiri/banner.png b/papers/Gagnon_Kebe_Tahiri/banner.png index c5dd028e26..c02cb04903 100644 Binary files a/papers/Gagnon_Kebe_Tahiri/banner.png and b/papers/Gagnon_Kebe_Tahiri/banner.png differ diff --git a/papers/Gagnon_Kebe_Tahiri/main.tex b/papers/Gagnon_Kebe_Tahiri/main.tex index c50c767999..bb06942f7c 100644 --- a/papers/Gagnon_Kebe_Tahiri/main.tex +++ b/papers/Gagnon_Kebe_Tahiri/main.tex @@ -1,21 +1,21 @@ \begin{abstract} Cumacea (crustaceans: Peracarida) are vital indicators of benthic health in marine ecosystems. This study investigated the influence of environmental (i.e., biological or ecosystemic), climatic (i.e., meteorological or atmospheric), and geographic (i.e., spatial or regional) attributes on their genetic variability in the Northern North Atlantic, focusing on Icelandic waters. We analyzed mitochondrial sequences of the 16S rRNA gene from 62 Cumacea specimens. 
Using the \textit{aPhyloGeo} software, we compared these sequences with relevant parameters such as latitude (decimal degree) at the start of sampling, wind speed (m/s) at the start of sampling, O\textsubscript{2} concentration (mg/L), and depth (m) at the start of sampling. -Our analyses revealed variability in most spatial and biological attributes, reflecting the diversity of ecological requirements and benthic habitats. The most common Cumacea families, Diastylidae and Leuconidae, suggest adaptations to various marine environments. Phylogeographic analysis showed a divergence between specific genetic sequences and two habitat attributes: wind speed (m/s) at the start of sampling and O\tsubscript{2} concentration (mg/L). This indicates potential local adaptation to these fluctuating conditions. +Our analyses revealed variability in most spatial and biological attributes, reflecting the diversity of ecological requirements and benthic habitats. The most common Cumacea families, Diastylidae and Leuconidae, suggest adaptations to various marine environments. Phylogeographic analysis showed a divergence between specific genetic sequences and two habitat attributes: wind speed (m/s) at the start of sampling and O\textsubscript{2} concentration (mg/L). This indicates potential local adaptation to these fluctuating conditions. -These results reinforce the importance of further research into the relationship between Cumacea genetics and global environmental factors. Understanding these relationships is essential for interpreting the evolutionary dynamics and adaptation of deep-sea Cumacea. This study sheds much-needed light on invertebrate acclimatization to climate change, anthropomorphic pressures, and deep-water habitat management. It can contribute to the evolution of more efficient conservation strategies and inform policies that protect vulnerable marine ecosystems. +These results reinforce the importance of further research into the relationship between Cumacea genetics and global environmental factors. Understanding these relationships is essential for interpreting the evolutionary dynamics and adaptation of deep-sea Cumacea. This study sheds much-needed light on invertebrate acclimatization to climate change, anthropogenic pressures, and deep-water habitat management. It can contribute to the development of more efficient conservation strategies and inform policies that protect vulnerable marine ecosystems. The \textit{aPhyloGeo} Python package is freely and publicly available on \href{https://github.com/tahiri-lab/aPhyloGeo}{GitHub} and \href{https://pypi.org/project/aphylogeo/}{PyPI}, providing an invaluable tool for future research. \end{abstract} \section{Introduction}\label{introduction} -The North Atlantic and Subarctic regions, particularly the Icelandic waters, are of ecological interest due to their diverse water masses and unique oceanographic features \citep{schnurr_composition_2014, meisner_benthic_2014, uhlir_adding_2021}. These areas form vital {benthic habitats}\footnote{These are areas on the bottom of the oceans or lakes, including sediments and organisms that live in them.} \citep{levin2009ecological} and enhance our understanding of deep-sea ecosystems and biodiversity patterns \citep{rogers2007corals, danovaro2008exponential, uhlir_adding_2021}.
The IceAGE project and its predecessors, BIOFAR and BIOICE, provide invaluable data for studying the impacts of climate change and seabed mining, especially in the Greenland, Iceland, and Norwegian (GIN) seas \citep{meisner_prefacebiodiversity_2018}. +The North Atlantic and Subarctic regions, particularly the Icelandic waters, are of ecological interest due to their diverse water masses and unique oceanographic features \citep{schnurr_composition_2014, meisner_benthic_2014, uhlir_adding_2021}. These areas form vital {benthic habitats}\footnote{These are areas on the bottom of the oceans or lakes, including sediments and organisms that live in them.} \citep{levin2009ecological} and enhance our understanding of deep-sea ecosystems and biodiversity patterns \citep{rogers2007corals, danovaro2008exponential, uhlir_adding_2021}. The IceAGE project and its predecessors, BIOFAR and BIOICE, provide invaluable data for studying the impacts of climate change and seabed mining, especially in the Greenland, Iceland, and Norwegian (GIN) seas \citep{meisner_prefacebiodiversity_2018}. -Cumacea, a crustacean taxon within Peracarida, provide major indicators of marine ecosystem health due to their sensitivity to environmental fluctuations \citep{stransky_diversity_2010} and their contribution to benthic food webs \citep{rehm2009cumacea}. Despite their ecological importance, deep-sea benthic invertebrates’ evolutionary history remains uncharted, notably in the North Atlantic \citep{jennings_phylogeographic_2014}. Interpreting these deep-sea organisms' genetic distribution and demography is central for predicting their response to climate change and anthropogenic pressures, such as seabed mining \citep{jennings_phylogeographic_2014, meisner_prefacebiodiversity_2018}. +Cumacea, a crustacean taxon within Peracarida, provide major indicators of marine ecosystem health due to their sensitivity to environmental fluctuations \citep{stransky_diversity_2010} and their contribution to benthic food webs \citep{rehm2009cumacea}. Despite their ecological importance, deep-sea benthic invertebrates’ evolutionary history remains uncharted, notably in the North Atlantic \citep{jennings_phylogeographic_2014}. Interpreting these deep-sea organisms' genetic distribution and demography is central for predicting their response to climate change and anthropogenic pressures, such as seabed mining \citep{jennings_phylogeographic_2014, meisner_prefacebiodiversity_2018}. Given the urgency of the above factors, this study aims to analyze the influence of ecological (climatic and environmental) and geographic parameters on the genetic variability of Cumacea in the Northern North Atlantic. Specifically, we will examine whether genetic adaptation exists between the genetic structure of the 16S rRNA mitochondrial gene region of cumacean species sampled and their habitat attributes. If so, we will determine the attribute that diverges most from a specific gene sequence of this cumaceans gene (i.e., a window) and further explore the potential associated protein using bioinformatics tools to interpret its biological relevance. Our approach includes confirming different {phylogeographic models}\footnote{Phylogeographic models are computational tools that analyze relationships between the genetic structures of populations and their geographic distributions. 
In our case, by incorporating regional, biological, and atmospheric characteristics, we can interpret their impact on the genetic distribution of cumacean species,} and updating a Python package (currently in beta), \textit{aPhyloGeo}, to simplify these analyses. -This paper is organized as follows: Section \autoref{related-works} reviews pertinent studies on the biodiversity and biogeography of deep-sea benthic invertebrates; Section \autoref{contribution} summarizes the aims and contributions of this study, highlighting aspects relating to the conservation and adaptation of marine invertebrates to climate change; Section \autoref{materials-methods} describes the data collection, sampling procedures, and genetic analyses; Section \autoref{metrics} describes the metrics used to evaluate the phylogeographic models; Section \autoref{results} presents the results; finally, Section \autoref{conclusion} discusses their implications for future research and conservation efforts. +This paper is organized as follows: \autoref{related-works} reviews pertinent studies on the biodiversity and biogeography of deep-sea benthic invertebrates; \autoref{contribution} summarizes the aims and contributions of this study, highlighting aspects relating to the conservation and adaptation of marine invertebrates to climate change; \autoref{materials-methods} describes the data collection, sampling procedures, and genetic analyses; \autoref{metrics} describes the metrics used to evaluate the phylogeographic models; \autoref{results} presents the results; finally, \autoref{conclusion} discusses their implications for future research and conservation efforts. \section{Related Works}\label{related-works} Assessing and quantifying the biodiversity of deep-sea benthic invertebrates has become increasingly crucial since it was discovered that their species richness may be underestimated \citep{grassle1992deep}. Subsequent research has highlighted the need for large-scale distribution models to interpret the diversity of these organisms across their ecological and evolutionary contexts \citep{rex1997large}. That is why recent efforts have focused on mapping, managing, and studying the seabed \citep{brown2011benthic}. Advanced technologies such as acoustic detection are improving our knowledge of benthic ecosystem complexity \citep{brown2011benthic}. Integrating genetic and habitat attributes gives a deeper understanding of how ecosystemic, meteorological, and spatial attributes influence the genetic differences, distribution, biodiversity, and resilience of deep-sea benthic organisms \citep{vrijenhoek2009cryptic}. @@ -30,19 +30,19 @@ \section{Our Contribution}\label{contribution} Furthermore, our genetic and environmental data highlights critical habitats of high conservation interest, which can be considered for establishing marine protected areas \citep{levin2009ecological}. These results are essential for developing informed conservation strategies in the context of climate change. Finally, our study paves the way for further research on other invertebrate species across different geographic regions. By extending this research to diverse environments and taxonomic groups, scientists will gain a more complete understanding of the adaptation and resilience of marine invertebrates to changing conditions. This work contributes essential insights to the field and supports the development of informed conservation strategies. 
\section{Materials and Methods}\label{materials-methods} -This section describes our data and introduces the main stages of data pre-processing and the \textit{aPhyloGeo} software. A flow chart, constructed with the diagram software \href{https://app.diagrams.net/}{draw.io}, summarizes this section (Figure \ref{fig:fig1}). +This section describes our data and introduces the main stages of data pre-processing and the \textit{aPhyloGeo} software. A flow chart, constructed with the diagram software \href{https://app.diagrams.net/}{draw.io}, summarizes this section (\autoref{fig:fig1}). \begin{figure}[htbp] \centering \includegraphics[width=0.7\textwidth]{diagram.drawio.png} - \caption{Flow chart summarizing the Materials and Methods section workflow. Six different colors highlight the blocks. The first block (blue) represents our database. The second block (red) is data pre-processing, where we remove attributes. The third and fourth blocks (orange) implement the \textit{aPhyloGeo} software and its parameters for our phylogeographic analyses (see in the second step of the section \autoref{aPhyloGeo-software}). The fifth block (grey) calculates phylogenetic tree comparison distances. The sixth block (yellow) compares the distances between the phylogenetic trees produced. The seventh block (purple) identifies regions with high mutation rates based on the results of the tree comparisons. *See YAML files on \href{https://github.com/tahiri-lab/aPhyloGeo}{GitHub} for more details on these parameters. \label{fig:fig1}} + \caption{Flow chart summarizing the Materials and Methods section workflow. Six different colors highlight the blocks. The first block (blue) represents our database. The second block (red) is data pre-processing, where we remove attributes. The third and fourth blocks (orange) implement the \textit{aPhyloGeo} software and its parameters for our phylogeographic analyses (see in the second step of the \autoref{aPhyloGeo-software}). The fifth block (grey) calculates phylogenetic tree comparison distances. The sixth block (yellow) compares the distances between the phylogenetic trees produced. The seventh block (purple) identifies regions with high mutation rates based on the results of the tree comparisons. *See YAML files on \href{https://github.com/tahiri-lab/aPhyloGeo}{GitHub} for more details on these parameters. \label{fig:fig1}} \end{figure} \subsection{Description of the data} The study area was located in a northern region of the North Atlantic, including the Icelandic Sea, the Denmark Strait, and the Norwegian Sea. The specimens examined were collected as part of the IceAGE project (Icelandic marine Animals: Genetic and Ecology; Cruise ship M85/3 in 2011), which focused on the deep continental slopes and abyssal waters around Iceland \citep{meisner_prefacebiodiversity_2018}. The sampling period for the included specimens was from August 30 to September 22, 2011, and they were collected at depths ranging from 316 m to 2568 m. Detailed protocols concerning the sampling plan, sample processing, DNA extraction steps, PCR amplification, sequencing, and aligned DNA sequences are available in \citep{uhlir_adding_2021}. \subsection{Data pre-processing} -We used data from the article \citep{uhlir_adding_2021}, IceAGE project, and related data from the bold system's database, as described in \citep{uhlir_adding_2021}. Given these databases' enormous breadth of features, we applied a selective reduction procedure. 
Attributes that were not directly relevant to the analysis of correlations between Cumacea genetics and habitat properties, displayed little to no variability (non-numerical data), and had a large number of missing data (> 95\%) were omitted from our study. Out of the 495 available in the IceAGE dataset, we considered 62 specimens for which mitochondrial DNA sequences of the 16S rRNA gene were available. +We used data from the article \citep{uhlir_adding_2021}, IceAGE project, and related data from the bold system's database, as described in \citep{uhlir_adding_2021}. Given these databases' enormous breadth of features, we applied a selective reduction procedure. Attributes that were not directly relevant to the analysis of correlations between Cumacea genetics and habitat properties, displayed little to no variability (non-numerical data), and had a large number of missing data (> 95\%) were omitted from our study. Out of the 495 available in the IceAGE dataset, we considered 62 specimens for which mitochondrial DNA sequences of the 16S rRNA gene were available. Next, we calculated the variance using the $var()$ function in RStudio Desktop 4.3.2 for each of the selected numerical attributes. This step aimed to eliminate attributes with low variation, as they are unlikely to provide critical data to the analysis. We set a variance threshold of ≤ 0.1 to exclude uninformative attributes. The latter allows us to retain attributes whose variability is reasonably sufficient for our analyses while rejecting those with little variation. Only water salinity was eliminated based on this criterion ($S^2 = 0.02146629$). The formula (equation \ref{variance}) and code (\autoref{lst:variance}) used to calculate the variance of our final features, available in the data file on \href{https://github.com/tahiri-lab/Cumacea_aPhyloGeo}{GitHub}, are provided below: @@ -107,39 +107,39 @@ \subsection{Data pre-processing} print(correlation_matrix) \end{lstlisting} -This selection of attributes and data resulted in a table containing 62 rows ($n=62$) and 17 columns (number of attributes). +This selection of attributes and data resulted in a table containing 62 rows ($n=62$) and 17 columns (number of attributes). \subsection{Selected attributes in the IceAGE database} -\subsubsection{Geographic data} +\subsubsection{Geographic data} \begin{itemize} \item The latitude (Figure \ref{fig:fig2}a) and longitude (Figure \ref{fig:fig2}b) at the start of sampling, both in decimal degrees (DD), as they are intimately linked to the environmental gradients and historical mechanisms modeling genetic heterogeneity \citep{gaither2013origins}. -\item The sectors across the seas around Iceland: the Denmark Strait ($n=28$), the Iceland Basin ($n=15$), the Irminger Basin ($n=12$), the Norwegian Sea ($n=4$), and the Norwegian Basin ($n=3$). +\item The sectors across the seas around Iceland: the Denmark Strait ($n=28$), the Iceland Basin ($n=15$), the Irminger Basin ($n=12$), the Norwegian Sea ($n=4$), and the Norwegian Basin ($n=3$). \end{itemize} -\subsubsection{Environmental data} +\subsubsection{Environmental data} \begin{itemize} -\item Depth (m) at the start of sampling (Figure \ref{fig:fig2}c), as well as water temperature ($^\circ$C) (Figure \ref{fig:fig2}e), and O\textsubscript{2} concentration (mg/L) (Figure \ref{fig:fig2}f), as these are vital elements of the marine ecosystem that have an impact on the distribution and evolutionary acclimatization of marine species \citep{rex2006global, danovaro2010first}. 
+\item Depth (m) at the start of sampling (Figure \ref{fig:fig2}c), as well as water temperature ($^\circ$C) (Figure \ref{fig:fig2}e), and O\textsubscript{2} concentration (mg/L) (Figure \ref{fig:fig2}f), as these are vital elements of the marine ecosystem that have an impact on the distribution and evolutionary acclimatization of marine species \citep{rex2006global, danovaro2010first}. \item The sampling sites' sedimentary characteristics directly influence the distribution of Cumacea \citep{uhlir_adding_2021}. In this study, they are divided into six ecological niche categories: mud ($n=30$), sandy mud ($n=15$), sand ($n=9$), forams ($n=3$), muddy sand ($n=3$), and gravel ($n=2$). \end{itemize} -\subsubsection{Climatic data} -Wind speed (m/s) (Figure \ref{fig:fig2}d) and wind direction at the start and end of sampling were also included, giving the contribution of wind to benthic ecosystem dynamics and the restructuring of species distribution by wind currents and sediment transport \citep{siedlecki2016experiments, waga_recent_2020,saeedi_environmental_2022}. The wind direction at the start of sampling comprises six orientations: South-West ($n=22$), South ($n=15$), North-East ($n=9$), South-South-East ($n=9$), North-West ($n=5$), and East ($n=2$); while that at the end of sampling is composed of seven orientations: South ($n=15$), South-West ($n=15$), North-East ($n=9$), West-South-West ($n=7$), South-East ($n=6$), North-North-West ($n=5$), South-South-East ($n=3$), and East ($n=2$). +\subsubsection{Climatic data} +Wind speed (m/s) (Figure \ref{fig:fig2}d) and wind direction at the start and end of sampling were also included, giving the contribution of wind to benthic ecosystem dynamics and the restructuring of species distribution by wind currents and sediment transport \citep{siedlecki2016experiments, waga_recent_2020,saeedi_environmental_2022}. The wind direction at the start of sampling comprises six orientations: South-West ($n=22$), South ($n=15$), North-East ($n=9$), South-South-East ($n=9$), North-West ($n=5$), and East ($n=2$); while that at the end of sampling is composed of seven orientations: South ($n=15$), South-West ($n=15$), North-East ($n=9$), West-South-West ($n=7$), South-East ($n=6$), North-North-West ($n=5$), South-South-East ($n=3$), and East ($n=2$). \subsection{Selected attributes in the bold system's database} -\subsubsection{Taxonomic data} +\subsubsection{Taxonomic data} The family, genus, and scientific name of the cumaceans sampled were integrated into our data to study evolutionary relationships and genetic variation to habitat attributes among the specimens in our dataset. These comprise seven families: Diastylidae ($n=21$), Lampropidae ($n=13$), Leuconidae ($n=12$), Astacidae ($n=7$), Bodotriidae ($n=4$), Ceratocumatidae ($n=3$), and Pseudocumatidae ($n=2$). A total of 21 cumacean species were found in our sample (Figure \ref{fig:fig3}). We have also included the sample identity (id) so that each sample remains unique. Some specimens were only identified to genus ($n=1$) or family ($n=5$) in our sample. 
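For readers who prefer Python to the R workflow used in the Data pre-processing subsection above, the variance-based attribute filter described there could be reproduced along the following lines (an illustrative sketch only; the file name and the use of pandas are our assumptions, and the paper's own \autoref{lst:variance} remains the reference implementation):

\begin{lstlisting}[language=Python]
import pandas as pd

# Hypothetical export of the pre-processed attribute table (62 specimens).
df = pd.read_csv("cumacea_attributes.csv")

# Sample variance of each numerical attribute; ddof=1 matches R's var().
variances = df.select_dtypes(include="number").var(ddof=1)

# Keep attributes above the 0.1 variance threshold used in the study;
# water salinity (variance ~0.021) would be dropped at this step.
kept = variances[variances > 0.1].index.tolist()
print(variances.sort_values())
print(kept)
\end{lstlisting}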
-\subsection{Selected attributes from article \cite{uhlir_adding_2021}} -\subsubsection{Other environmental data} +\subsection{Selected attributes from article \cite{uhlir_adding_2021}} +\subsubsection{Other environmental data} The habitat and water mass of the sampling points were the only water attributes taken directly from Table 1 of \citep{uhlir_adding_2021}, as they can give us insight into how they may affect Cumacea genetic diversity and the acclimatization of these species in the GIN seas around Iceland. Thus, the water masses definitions, as described in \citep{uhlir_adding_2021}, were used as a reference: Arctic Polar Water (APW, $n=15$), Iceland Sea Overflow Water (ISOW, $n=15$), North Atlantic Water (NAW, $n=9$), Arctic Polar Water/Norwegian Sea Arctic Intermediate Water (APW/NSAIW, $n=7$), warm Norwegian Sea Deep Water (NSDWw, $n=8$), Labrador Sea Water (LSW, $n=3$), cold Norwegian Sea Deep Water (NSDWc, $n=3$), and Norwegian Sea Arctic Intermediate Water (NSAIW, $n=2$) (Figure \ref{fig:fig4}). In terms of habitat, we considered the three categories used in \citep{uhlir_adding_2021}: Deep Sea ($n=38$), Shelf ($n=15$), and Slope ($n=9$) (Figure \ref{fig:fig5}). -\subsubsection{Genetic data} +\subsubsection{Genetic data} To better interpret benthic species' relationship and evolutionary responses, genetic data are required \citep{wilson_speciation_1987, uhlir_adding_2021}. Thus, the aligned DNA sequence of the 16S rRNA mitochondrial gene region from each of the samples was included in our analyses. This region is standard in phylogeny and phylogeography studies \citep{hugenholtz1998impact} and sufficiently conserved over time to guarantee exact alignments between different species or populations \citep{saccone1999evolutionary}. We examined 62 of the 306 aligned DNA sequences used for phylogeographic analyses by \citep{uhlir_adding_2021}. As some specimens in our sample have their DNA sequence duplicated, or even quadruplicated with a difference of one or two nucleotides, we took into account the longest-aligned DNA sequence of each specimen. \subsection{{\textit{aPhyloGeo} software}\label{aPhyloGeo-software}} -We used the cross-platform Python software \textit{aPhyloGeo} for our phylogeographic analyses, designed to analyze phylogenetic trees using ecological and geographic attributes (\autoref{lst:main}). Developed by My-Linh Luu, Georges Marceau, David Beauchemin, and Nadia Tahiri, \textit{aPhyloGeo} offers tools to study and identify potential divergence between species genetics and habitat characteristics, enabling us to understand the evolution of species under different environmental conditions \citep{koshkarov_phylogeography_2022}. +We used the cross-platform Python software \textit{aPhyloGeo} for our phylogeographic analyses, designed to analyze phylogenetic trees using ecological and geographic attributes (\autoref{lst:main}). Developed by My-Linh Luu, Georges Marceau, David Beauchemin, and Nadia Tahiri, \textit{aPhyloGeo} offers tools to study and identify potential divergence between species genetics and habitat characteristics, enabling us to understand the evolution of species under different environmental conditions \citep{koshkarov_phylogeography_2022}. 
We selected this software for our analysis because, to our knowledge, it is the first phylogeographic tool capable of establishing similarity or dissimilarity between species genetics and environmental, climatic, and geographical attributes - precisely the objective of our study \citep{koshkarov_phylogeography_2022}. The \textit{aPhyloGeo} software offers several key functionalities: @@ -177,15 +177,15 @@ \subsection{{\textit{aPhyloGeo} software}\label{aPhyloGeo-software}} # Generate phylogenetic trees based on aligned sequences # Create phylogenetic trees from the multiple sequence alignments (MSA). genetic_trees = utils.genetic_pipeline(alignments.msa) - + # Create a GeneticTrees object # Represent the generated phylogenetic trees in Newick format. trees = GeneticTrees(trees_dict=genetic_trees, format="newick") - + # Generate attribute trees based on attribute data # Create trees representing the relationships between different attributes. attribute_trees = utils.attribute_pipeline(attribute_data) - + # Filter the results based on the generated trees # Filter the results to ensure they meet certain criteria. utils.filter_results(attribute_trees, genetic_trees, attribute_data) @@ -196,26 +196,26 @@ \subsection{{\textit{aPhyloGeo} software}\label{aPhyloGeo-software}} \begin{enumerate} \item \textbf{The first step} was to collect DNA sequences from Cumacea of sufficient quality for the needs of our results \citep{koshkarov_phylogeography_2022}. In this study, 62 cumaceans samples were selected to represent 62 sequences of the 16S rRNA mitochondrial gene. We then included two climatic attributes, namely wind speed (m/s) at the start and end of the sampling; three environmental characteristics, such as depth (m) at the start of sampling, water temperature ($^\circ$C), and O\textsubscript{2} concentration (mg/L); and two geographic variables, latitude (DD) and longitude (DD) at the start of sampling. -\item \textbf{In the second step}, trees were generated separately from biological, spatial, meteorological, and genetic data. Concerning spatial attributes, we calculated the dissimilarity between each pair of cumaceans from distinct spatial conditions \citep{koshkarov_phylogeography_2022}. This produced a symmetrical square matrix \citep{koshkarov_phylogeography_2022}. The {neighbor-joining algorithm}\footnote{It is a method used to construct phylogenetic trees using distance matrices.} was used to build the spatial tree from this matrix \citep{koshkarov_phylogeography_2022}. Each geographic attribute gives rise to a geographic tree. If there are $m$ windows affected by this attribute, there will be $m$ geographic trees. The same approach was applied to biological, meteorological, and genetic data. +\item \textbf{In the second step}, trees were generated separately from biological, spatial, meteorological, and genetic data. Concerning spatial attributes, we calculated the dissimilarity between each pair of cumaceans from distinct spatial conditions \citep{koshkarov_phylogeography_2022}. This produced a symmetrical square matrix \citep{koshkarov_phylogeography_2022}. The {neighbor-joining algorithm}\footnote{It is a method used to construct phylogenetic trees using distance matrices.} was used to build the spatial tree from this matrix \citep{koshkarov_phylogeography_2022}. Each geographic attribute gives rise to a geographic tree. If there are $m$ windows affected by this attribute, there will be $m$ geographic trees. 
The same approach was applied to biological, meteorological, and genetic data. -For genetic data, phylogenetic reconstruction was reiterated to build genetic trees based on 62 mitochondrial 16S rRNA sequences, considering only data within a window that progresses along the alignment \citep{koshkarov_phylogeography_2022}. This displacement can vary according to the steps and the size of the window defined by the user (their length is determined by the number of base pairs (bp)) \citep{koshkarov_phylogeography_2022}. +For genetic data, phylogenetic reconstruction was reiterated to build genetic trees based on 62 mitochondrial 16S rRNA sequences, considering only data within a window that progresses along the alignment \citep{koshkarov_phylogeography_2022}. This displacement can vary according to the steps and the size of the window defined by the user (their length is determined by the number of base pairs (bp)) \citep{koshkarov_phylogeography_2022}. In our case, we set up the \textit{aPhyloGeo} software as follows: $pairwiseAligner$ for sequence alignment; $\text{Hamming distance}$ to measure simple dissimilarities between sequences of identical length; $\text{Wider Fit by elongating with Gap (starAlignment)}$ algorithm takes alignment gaps into account, which is often mandatory in the case of major deletions or insertions in the sequences; $\text{windows\_size}$: 1 nucleotide (nt); and finally, $\text{step\_size}$: 10 nt. The last two configurations imply that for each 1 nt window, a phylogenetic tree is produced using the nucleotide of each cumacean, then the window is moved by 10 nt, creating a new tree. Each window in the alignment will give a genetic tree. If there are $n$ windows, there will be $n$ phylogenetic trees. Genetic trees will be used in an object called $T_1$, while spatial and ecological trees are used in another object called $T_2$. -\item \textbf{In the third step}, the genetic trees constructed in each sliding window are compared with ecosystemic, atmospheric, and regional trees using Robinson-Foulds distance \citep{robinson_comparison_1981}, normalized Robinson-Foulds distance, Euclidean distance, and Least Squares distance. These contribute to understanding the correspondence between cumaceans genetic sequences and their habitat. The approach also takes bootstrapping into account \citep{koshkarov_phylogeography_2022}. The results of these metrics were obtained using the functions $least\_square(tree1, tree2)$, $robinson\_foulds(tree1, tree2)$, $euclidean\_dist(tree1, tree2)$ from the \textit{aPhyloGeo} software and were organized by the main function (\autoref{lst:main}). Those for the normalized Robinson-Foulds distance were obtained with the function $robinson\_foulds(tree1, tree2)$ (see the last line of code in \autoref{lst:robinsonFoulds}). The metric output tells us which of our attributes have the greatest divergence of phylogenetic relationships in our samples, based on the magnitude of the metric distances (see Figure \ref{fig:fig6} and Figure \ref{fig:fig7}). +\item \textbf{In the third step}, the genetic trees constructed in each sliding window are compared with ecosystemic, atmospheric, and regional trees using Robinson-Foulds distance \citep{robinson_comparison_1981}, normalized Robinson-Foulds distance, Euclidean distance, and Least Squares distance. These contribute to understanding the correspondence between cumaceans genetic sequences and their habitat. The approach also takes bootstrapping into account \citep{koshkarov_phylogeography_2022}. 
The results of these metrics were obtained using the functions $least\_square(tree1, tree2)$, $robinson\_foulds(tree1, tree2)$, $euclidean\_dist(tree1, tree2)$ from the \textit{aPhyloGeo} software and were organized by the main function (\autoref{lst:main}). Those for the normalized Robinson-Foulds distance were obtained with the function $robinson\_foulds(tree1, tree2)$ (see the last line of code in \autoref{lst:robinsonFoulds}). The metric output tells us which of our attributes have the greatest divergence of phylogenetic relationships in our samples, based on the magnitude of the metric distances (see Figure \ref{fig:fig6} and Figure \ref{fig:fig7}). In addition to identifying the specific attribute, a sliding-window approach enables the precise localization of subtle sequences with high rates of genetic mutation \citep{koshkarov_phylogeography_2022}. This method requires shifting a fixed-size window over the alignment of genetic sequences, allowing phylogenetic trees to be reconstructed for each part of the sequence. It therefore allows us to recognize changes in evolutionary relationships along the 16S rRNA mitochondrial gene region of cumacean species. This method is essential for determining whether cumaceans-specific gene sequences in this region of their genome may be affected by certain ecological or spatial attributes of their habitat (see Figure \ref{fig:fig6} and Figure \ref{fig:fig7}). \end{enumerate} \subsection{Metrics}\label{metrics} -Our phylogeographic study used four distance metrics to quantify topological differences between phylogenetic trees. It also assesses dissimilarities between genetic sequences and ecological and regional attributes. This enables a comprehensive analysis of the evolutionary dynamics of cumacean populations in different environmental contexts. +Our phylogeographic study used four distance metrics to quantify topological differences between phylogenetic trees. It also assesses dissimilarities between genetic sequences and ecological and regional attributes. This enables a comprehensive analysis of the evolutionary dynamics of cumacean populations in different environmental contexts. -The following section presents a more concise version of the functions mentioned in the second and third steps of section \autoref{aPhyloGeo-software}: +The following section presents a more concise version of the functions mentioned in the second and third steps of \autoref{aPhyloGeo-software}: \subsubsection{Robinson-Foulds distance}\label{RF} -The Robinson-Foulds (RF) distance calculates the distance between phylogenetic trees built in each sliding window ($T_1$) and the attributes trees ($T_2$) (see the list in the first step of the section \autoref{aPhyloGeo-software}) \citep{tahiri2018new, koshkarov_phylogeography_2022}. This measurement is used to evaluate the topological differences between the two sets of trees (see Equation \eqref{eq:rf} and \autoref{lst:robinsonFoulds}). +The Robinson-Foulds (RF) distance calculates the distance between phylogenetic trees built in each sliding window ($T_1$) and the attributes trees ($T_2$) (see the list in the first step of the \autoref{aPhyloGeo-software}) \citep{tahiri2018new, koshkarov_phylogeography_2022}. This measurement is used to evaluate the topological differences between the two sets of trees (see Equation \eqref{eq:rf} and \autoref{lst:robinsonFoulds}). 
-For example, it evaluates the number of division differences between phylogenetic trees built within certain user-defined sliding windows (see the second step of the section \autoref{aPhyloGeo-software}) and geographic trees built with latitude data (DD) at the start of sampling \citep{robinson_comparison_1981}. A high distance between a specific window and other windows considered in the RF distance analysis implies that the habitat feature has little to no impact on this particular DNA sequence and that this attribute cannot explain the genetic divergences observed in this DNA sequence. +For example, it evaluates the number of division differences between phylogenetic trees built within certain user-defined sliding windows (see the second step of the \autoref{aPhyloGeo-software}) and geographic trees built with latitude data (DD) at the start of sampling \citep{robinson_comparison_1981}. A high distance between a specific window and other windows considered in the RF distance analysis implies that the habitat feature has little to no impact on this particular DNA sequence and that this attribute cannot explain the genetic divergences observed in this DNA sequence. \begin{equation}\label{eq:rf} \text{RF}(T_1, T_2) = | \Sigma(T_1) \Delta \Sigma(T_2) | @@ -262,7 +262,7 @@ \subsubsection{Robinson-Foulds distance}\label{RF} \end{lstlisting} \subsubsection{Normalized Robinson-Foulds distance}\label{RFnorm} -The normalized Robinson-Foulds (nRF) distance scales the RF distance to account for the size variations in the trees (number of clades; i.e., a group of species with a common origin), allowing a more equitable comparison. It scales the distance to a range between 0 and 1. In our context, the distance has been normalized by $2n-6$, where $n$ represents the number of taxa (see Equation \eqref{eq:rf_norm} and the last line of code in \autoref{lst:robinsonFoulds}). +The normalized Robinson-Foulds (nRF) distance scales the RF distance to account for the size variations in the trees (number of clades; i.e., a group of species with a common origin), allowing a more equitable comparison. It scales the distance to a range between 0 and 1. In our context, the distance has been normalized by $2n-6$, where $n$ represents the number of taxa (see Equation \eqref{eq:rf_norm} and the last line of code in \autoref{lst:robinsonFoulds}). Since the size of environmental trees constructed with O\textsubscript{2} concentration data (mg/L) differs from that of other attributes due to missing data, this nRF distance allows us to compare its dissimilarity with the phylogenetic trees in a fairer way \citep{tahiri2018new, koshkarov_phylogeography_2022}. It reveals the relative influence of O\textsubscript{2} concentration (mg/L) on cumacean phylogenetic relationships, independent of tree size \citep{tahiri2018new, koshkarov_phylogeography_2022}. A high value of this metric between a specific window and other windows considered in the nRF distance analysis does not allow us to conclude that there is a correlation between this DNA sequence and the attribute. It may indicate a topological dissimilarity between the habitat attribute tree and the gene trees at that position in the DNA sequence alignments. 
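To make the normalization concrete, consider an illustrative calculation of our own (not a result from this study): assuming all $n = 62$ taxa analyzed here appear in both trees, the normalization factor is $2n - 6 = 118$, so a raw RF distance of, say, 59 between a genetic tree and an attribute tree would give

\begin{equation*}
\text{nRF}(T_1, T_2) = \frac{\text{RF}(T_1, T_2)}{2n - 6} = \frac{59}{118} = 0.5,
\end{equation*}

i.e., half of the maximum possible topological disagreement between the two trees.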
@@ -275,7 +275,7 @@ \subsubsection{Normalized Robinson-Foulds distance}\label{RFnorm} \subsubsection{Euclidean distance}\label{euclidean} In our study, the Euclidean distance calculates the distance between two sets of points in a multidimensional space, which designates the divisions of the two sets of trees ($T_1$ and $T_2$). It is used to compare divisions between two respective sets of trees to assess the degree of divergence or similarity of their topologies (see Equation \eqref{eq:euclidean} and \autoref{lst:euclideanDist}). Branches are weighted according to their length, which makes it possible to obtain quantitative dissimilarities between leaf (i.e., cumacean species) pairs (i.e., genetic distance) in the two sets of trees \citep{choi2009comparison}. Thus, for each pair of leaves, their distance in the genetic trees and the habitat attribute trees are compared \citep{choi2009comparison}. -By comparing the two sets of trees $T_1$ and $T_2$ using this metric, it is possible to measure the extent to which genetic divergences correspond to fluctuations in habitat attributes. This is crucial for interpreting evolutionary relationships with these factors. A high distance of this metric between a specific window and other windows considered in the Euclidean distance analysis reveals evolutionary divergences between members of the cumacean populations at the level of this DNA sequence (see Figure \ref{fig:fig6}d and Figure \ref{fig:fig7}d). +By comparing the two sets of trees $T_1$ and $T_2$ using this metric, it is possible to measure the extent to which genetic divergences correspond to fluctuations in habitat attributes. This is crucial for interpreting evolutionary relationships with these factors. A high distance of this metric between a specific window and other windows considered in the Euclidean distance analysis reveals evolutionary divergences between members of the cumacean populations at the level of this DNA sequence (see Figure \ref{fig:fig6}d and Figure \ref{fig:fig7}d). \begin{equation}\label{eq:euclidean} d_{\text{Euclidean}}(T_1, T_2) = \sqrt{\sum_{i=1}^{n} (T1_i - T2_i)^2} @@ -306,21 +306,21 @@ \subsubsection{Euclidean distance}\label{euclidean} # Load the first tree from Newick format into a dendropy Tree object # Analyzes the string formatted by Newick and prepares the tree for comparison. tree1_tc = dendropy.Tree.get( - data=tree1.format("newick"), - schema="newick", + data=tree1.format("newick"), + schema="newick", taxon_namespace=tns ) - + # Load the second tree from Newick format into a dendropy Tree object # Similar to the first tree, this step prepares the second tree for comparison. tree2_tc = dendropy.Tree.get( - data=tree2.format("newick"), - schema="newick", + data=tree2.format("newick"), + schema="newick", taxon_namespace=tns ) # Encode the bipartitions of both trees - # This step converts the trees into a format where the presence or absence of + # This step converts the trees into a format where the presence or absence of # Each bipartition (split) is coded, which is necessary to calculate distances. tree1_tc.encode_bipartitions() tree2_tc.encode_bipartitions() @@ -347,7 +347,7 @@ \subsubsection{Least Squares distance}\label{LS} def least_square(tree1, tree2): """ - + Parameters: - tree1: Genetic trees. - tree2: Atmospherical, ecosystemic, and spatial trees. @@ -355,16 +355,16 @@ \subsubsection{Least Squares distance}\label{LS} Returns: - ls: The Least-Squares distance between the two sets of trees. 
""" - + # Initialize the Least-Squares distance to zero ls = 0.0 - + # Retrieve the list of terminal leaves (species) from the first tree leaves = tree1.get_terminals() - + # Extract the names of the terminal leaves leaves_name = [leaf.name for leaf in leaves] - + # Iterate over each pair of leaves in the trees for i in leaves_name: # Remove the first leaf from the list to avoid redundant comparisons @@ -376,7 +376,7 @@ \subsubsection{Least Squares distance}\label{LS} d2 = tree2.distance(tree2.find_any(i), tree2.find_any(j)) # Accumulate the absolute difference of distances into the LSD ls += abs(d1 - d2) - + return ls \end{lstlisting} @@ -414,7 +414,7 @@ \section{Results}\label{results} \caption{Frequency distribution of cumacean species in our sample. The bars represent the number of individuals for each species. The percentages (\%) displayed above the bars indicate the relative abundance of each species in the total sample. The mean and median values of the frequency distribution are shown in the top right-hand corner of the histogram. Unlike less common species, those that are abundant (such as \emph{Leptostylis ampullacea} and \emph{Leucon pallidus}) may have adaptive characteristics that enable them to exploit resources more easily, resist interspecific competition or withstand changing biological conditions. \label{fig:fig3}} \end{figure} -The distribution and diversity of the various cumacean species found in our sample are shown in Figure \ref{fig:fig3}. It shows that the most represented species are \emph{Leptostylis ampullacea} (14.1\%) and \emph{Leucon pallidus} (12.5\%). In contrast, species like \emph{Bathycuma brevirostre} and \emph{Styloptocuma gracillimum} are less represented (1.6\%), implying that some species may have restricted ecological niches or face ecological forces that limit their distribution. The dominance of certain species (such as \emph{Leptostylis ampullacea} and \emph{Leucon pallidus}) suggests that they may have adaptive traits that enable them to make the most of the accessible resources, resist interspecific competition, or survive in fluctuating ecosystemic conditions, aligns with our study’s aim of relating genetic adaptation to habitat characteristics. +The distribution and diversity of the various cumacean species found in our sample are shown in Figure \ref{fig:fig3}. It shows that the most represented species are \emph{Leptostylis ampullacea} (14.1\%) and \emph{Leucon pallidus} (12.5\%). In contrast, species like \emph{Bathycuma brevirostre} and \emph{Styloptocuma gracillimum} are less represented (1.6\%), implying that some species may have restricted ecological niches or face ecological forces that limit their distribution. The dominance of certain species (such as \emph{Leptostylis ampullacea} and \emph{Leucon pallidus}) suggests that they may have adaptive traits that enable them to make the most of the accessible resources, resist interspecific competition, or survive in fluctuating ecosystemic conditions, aligns with our study’s aim of relating genetic adaptation to habitat characteristics. \begin{figure}[htbp] \centering @@ -422,7 +422,7 @@ \section{Results}\label{results} \caption{Distribution of cumacean families by water mass. This histogram represents the frequency of occurrence of the different cumacean families in our samples, classified according to the water mass in which they were collected. 
Eight water mass categories are represented: Arctic Polar Water (APW), Arctic Polar Water/North Sub-Arctic Intermediate Water (APW/NSAIW), Iceland Scotland Overflow Water (ISOW), Labrador Sea Water (LSW), North Atlantic Water (NAW), North Sub-Arctic Intermediate Water (NSAIW), cold North Sub-Atlantic Deep Water (NSDWc), and warm North Sub-Atlantic Deep Water (NSDWw). Seven families are represented: Astacidae (red), Bodotriidae (brown), Ceratocumatidae (green), Diastylidae (turquoise), Lampropidae (blue), Leuconidae (purple), and Pseudocumatidae (pink). The presence of the Diastylidae (turquoise) family in the majority of water bodies (APW, APW/NSAIW, ISOW, NSAIW, NSDWc, and NSDWw) accentuates the resilience and ecological acclimatization of this family to various ecological niches and conditions. \label{fig:fig4}} \end{figure} -The following figure supports the objective of our study by showing the distribution of the various cumacean families in the different water bodies (Figure \ref{fig:fig4}). The Diastylidae family, for example, is the most common in all water bodies (turquoise color in Figure \ref{fig:fig4}), testifying to its resilience and ecological adaptability to a wide variety of habitat conditions, reminiscent of the dominance of \emph{Leptostylis ampullacea} (Figure \ref{fig:fig3}, 14.1\%) which belongs to the Diastylidae family. +The following figure supports the objective of our study by showing the distribution of the various cumacean families in the different water bodies (Figure \ref{fig:fig4}). The Diastylidae family, for example, is the most common in all water bodies (turquoise color in Figure \ref{fig:fig4}), testifying to its resilience and ecological adaptability to a wide variety of habitat conditions, reminiscent of the dominance of \emph{Leptostylis ampullacea} (Figure \ref{fig:fig3}, 14.1\%) which belongs to the Diastylidae family. \begin{figure}[] \centering @@ -444,7 +444,7 @@ \section{Results}\label{results} \caption{Analysis of fluctuations in four distance metrics using multiple sequence alignment (MSA): a) Least Squares distance, b) Robinson-Foulds distance, c) normalized Robinson-Foulds distance, and d) Euclidean distance. These variations in distance are studied to establish their dissimilarity with the variation in Otextsubscript{2} concentration (mg/L) at the sampling sites. \label{fig:fig7}} \end{figure} -The divergence between the genetic sequences and two attributes, one climatic (wind speed (m/s) at the start of sampling) and the other environmental (O\textsubscript{2} concentration (mg/L)) is presented in Figure \ref{fig:fig6} and Figure \ref{fig:fig7}. All the attributes given in the first step of the \autoref{aPhyloGeo-software} section were analyzed and their script and figure will be soon available in the $img$ and $script$ python file on \href{https://github.com/tahiri-lab/Cumacea_aPhyloGeo}{GitHub}. However, only these two attributes showed the most interesting mutation rate. Using the four metrics mentioned in the section \autoref{metrics}, we noticed that the Euclidean distance is particularly sensitive to our data, manifesting considerable sequence variation at the position in MSA 520-529 amino acids (aa) (Euclidean distance: 0.8 < x < 0.9; Figure \ref{fig:fig6}d) and 1190-199 aa (Euclidean distance: 1.2 < x < 1.3; Figure \ref{fig:fig7}d). 
Unlike the other windows for this metric in the two figures (see Figure \ref{fig:fig6}d and Figure \ref{fig:fig7}d), the fluctuations in wind speed (m/s) at the start of sampling and in O\textsubscript{2} concentration (mg/L) do not appear to explain the variations in these two specific sequences. This implies that these genetic sites are subject to selection pressures or evolutionary changes, due to biological (O\textsubscript{2} concentration (mg/L)) and meteorological conditions (wind speed (m/s) at the start of sampling). These results align with our study's aim to identify the genetic region of cumaceans with the highest mutation rate linked to a specific habitat attribute. +The divergence between the genetic sequences and two attributes, one climatic (wind speed (m/s) at the start of sampling) and the other environmental (O\textsubscript{2} concentration (mg/L)) is presented in Figure \ref{fig:fig6} and Figure \ref{fig:fig7}. All the attributes given in the first step of the \autoref{aPhyloGeo-software} section were analyzed and their script and figure will be soon available in the $img$ and $script$ python file on \href{https://github.com/tahiri-lab/Cumacea_aPhyloGeo}{GitHub}. However, only these two attributes showed the most interesting mutation rate. Using the four metrics mentioned in the \autoref{metrics}, we noticed that the Euclidean distance is particularly sensitive to our data, manifesting considerable sequence variation at the position in MSA 520-529 amino acids (aa) (Euclidean distance: 0.8 < x < 0.9; Figure \ref{fig:fig6}d) and 1190-199 aa (Euclidean distance: 1.2 < x < 1.3; Figure \ref{fig:fig7}d). Unlike the other windows for this metric in the two figures (see Figure \ref{fig:fig6}d and Figure \ref{fig:fig7}d), the fluctuations in wind speed (m/s) at the start of sampling and in O\textsubscript{2} concentration (mg/L) do not appear to explain the variations in these two specific sequences. This implies that these genetic sites are subject to selection pressures or evolutionary changes, due to biological (O\textsubscript{2} concentration (mg/L)) and meteorological conditions (wind speed (m/s) at the start of sampling). These results align with our study's aim to identify the genetic region of cumaceans with the highest mutation rate linked to a specific habitat attribute. These results provide important insight into the genetic adaptation of cumaceans to their environment. These results need to be analyzed in greater depth to certify their involvement, especially in contrast with \citep{uhlir_adding_2021}, which investigated similar topics of environmental and climatic effects on cumaceans distribution and genetics. The \textit{aPhyloGeo} package is still in the process of being updated. A more in-depth analysis of the results is available on \href{https://github.com/tahiri-lab/Cumacea_aPhyloGeo}{GitHub} in the supplementary file. @@ -453,7 +453,7 @@ \section{Conclusion}\label{conclusion} The novelty in our research lies in the exhaustive divergence between habitat attributes and genetic mutability in cumaceans, particularly in identifying genetic windows associated with habitat fluctuations, which has not been widely investigated in previous studies \citep{manel2003landscape, vrijenhoek2009cryptic}. In this case, our integrated method identifies specific genetic regions sensitive to ecosystemic and atmospheric variations. 
Thus, by seeking to determine which of these two attributes diverges most with the DNA sequences, the eventual identification of proteins linked to one of these variable DNA sequences will make it possible to represent its functional effects in responses to habitat changes. Our future research will focus on verifying the prediction of this protein and assessing its role in the physiological adaptation of cumaceans to fluctuating conditions, adding a link between genetic data and ecological function. -Interpreting how marine invertebrates genetically adapt to variations in their habitat can help us better predict their responses to climate change and advance conservation plans to protect them. Identifying the specific attributes that influence genetic variability of Cumacea can contribute to the designation and supervision of marine protected areas, assuring they include habitats crucial to the survival and acclimatization of these species. Thus, our results can inform the management of fishing and seabed mining companies by revealing ecologically vulnerable areas where these disturbances can seriously affect benthic biodiversity. +Interpreting how marine invertebrates genetically adapt to variations in their habitat can help us better predict their responses to climate change and advance conservation plans to protect them. Identifying the specific attributes that influence genetic variability of Cumacea can contribute to the designation and supervision of marine protected areas, assuring they include habitats crucial to the survival and acclimatization of these species. Thus, our results can inform the management of fishing and seabed mining companies by revealing ecologically vulnerable areas where these disturbances can seriously affect benthic biodiversity. Furthermore, our results provide essential knowledge to guide future studies on the genetic adaptation of Cumacea and other invertebrates to ecological and regional variability. Based on these findings, future research should focus on additional ecosystemic and meteorological attributes, such as nutrient accessibility, water pH, ocean currents, and the degree of human disturbance, to further improve the interpretation of the complex interactions between genetics and the environment. Broadening the scope of application to other marine species, not just marine invertebrates, and diverse geographic regions would allow us to generalize the results more effectively. With this in mind, longitudinal study models on these species could reflect long-term climatic and biological fluctuations and improve our knowledge of the dynamics of genetic acclimatization. diff --git a/papers/Gagnon_Kebe_Tahiri/myst.yml b/papers/Gagnon_Kebe_Tahiri/myst.yml index 36a9b321ea..5945d39b9f 100644 --- a/papers/Gagnon_Kebe_Tahiri/myst.yml +++ b/papers/Gagnon_Kebe_Tahiri/myst.yml @@ -1,9 +1,12 @@ version: 1 +extends: ../proceedings.yml project: + doi: 10.25080/NVYF1037 # Update this to match `scipy-2024-` the folder should be `` id: scipy-2024-gagnon_kebe_tahiri title: Ecological and Geographic Influences on Cumacea Genetics in the Northern North Atlantic subtitle: by aPhyloGeo software + description: Cumacea are vital indicators of benthic health in marine ecosystems. This study investigated the influence of environmental (i.e., biological or ecosystemic), climatic (i.e., meteorological or atmospheric), and geographic (i.e., spatial or regional) attributes on their genetic variability in the Northern North Atlantic, focusing on Icelandic waters. 
# Authors should have affiliations, emails and ORCIDs if available authors: - name: Justin Gagnon @@ -27,19 +30,14 @@ project: - Phylogeography # Add the abbreviations that you use in your paper here abbreviations: - MyST: Markedly Structured Text + DD: decimal degrees + PCR: Polymerase Chain Reaction + rRNA: Ribosomal ribonucleic acid + GIN: Greenland, Iceland, and Norwegian # It is possible to explicitly ignore the `doi-exists` check for certain citation keys error_rules: - rule: doi-exists severity: ignore keys: - # A banner will be generated for you on publication, this is a placeholder - banner: banner.png - # The rest of the information shouldn't be modified - subject: Research Article - open_access: true - license: CC-BY-4.0 - venue: Scipy 2024 - date: 2024-09-23 site: template: article-theme diff --git a/papers/Gagnon_Kebe_Tahiri/thumbnail.png b/papers/Gagnon_Kebe_Tahiri/thumbnail.png new file mode 100644 index 0000000000..2ef59a3721 Binary files /dev/null and b/papers/Gagnon_Kebe_Tahiri/thumbnail.png differ diff --git a/papers/Suvrakamal_Das/banner.jpg b/papers/Suvrakamal_Das/banner.jpg deleted file mode 100644 index e7b6696129..0000000000 Binary files a/papers/Suvrakamal_Das/banner.jpg and /dev/null differ diff --git a/papers/Suvrakamal_Das/banner.png b/papers/Suvrakamal_Das/banner.png new file mode 100644 index 0000000000..e767488250 Binary files /dev/null and b/papers/Suvrakamal_Das/banner.png differ diff --git a/papers/Suvrakamal_Das/main.md b/papers/Suvrakamal_Das/main.md index dd5d4276e7..0d7024bbe0 100644 --- a/papers/Suvrakamal_Das/main.md +++ b/papers/Suvrakamal_Das/main.md @@ -1,6 +1,9 @@ -## Abstract - -The quest for more efficient and faster deep learning models has led to the development of various alternatives to Transformers, one of which is the Mamba model. This paper provides a comprehensive comparison between Mamba models and Transformers, focusing on their architectural differences, performance metrics, and underlying mechanisms. It analyzes and synthesizes findings from extensive research conducted by various authors on these models. The synergy between Mamba models and the SciPy ecosystem enhances their integration into science. By providing an in-depth comparison using Python and its scientific ecosystem, this paper aims to clarify the strengths and weaknesses of Mamba models relative to Transformers. It offers the results obtained along with some thoughts on the possible ramifications for future research and applications in a range of academic and professional fields. +--- +title: Mamba Models a possible replacement for Transformers? +subtitle: A Memory-Efficient Approach for Scientific Computing +abstract: | + The quest for more efficient and faster deep learning models has led to the development of various alternatives to Transformers, one of which is the Mamba model. This paper provides a comprehensive comparison between Mamba models and Transformers, focusing on their architectural differences, performance metrics, and underlying mechanisms. It analyzes and synthesizes findings from extensive research conducted by various authors on these models. The synergy between Mamba models and the SciPy ecosystem enhances their integration into science. By providing an in-depth comparison using Python and its scientific ecosystem, this paper aims to clarify the strengths and weaknesses of Mamba models relative to Transformers. 
It offers the results obtained along with some thoughts on the possible ramifications for future research and applications in a range of academic and professional fields. +--- ### Introduction @@ -20,11 +23,11 @@ Finally, we explore the potential applications and future directions of Mamba mo ### State Space Models -The central goal of machine learning is to develop models capable of efficiently processing sequential data across a range of modalities and tasks [@mamba_github]. This is particularly challenging when dealing with **long sequences**, especially those exhibiting **long-range dependencies (LRDs)** – where information from distant past time steps significantly influences the current state or future predictions. Examples of such sequences abound in real-world applications, including speech, video, medical, time series, and natural language. However, traditional models struggle to effectively handle such long sequences. +The central goal of machine learning is to develop models capable of efficiently processing sequential data across a range of modalities and tasks [@mamba_github]. This is particularly challenging when dealing with **long sequences**, especially those exhibiting **long-range dependencies (LRDs)** – where information from distant past time steps significantly influences the current state or future predictions. Examples of such sequences abound in real-world applications, including speech, video, medical, time series, and natural language. However, traditional models struggle to effectively handle such long sequences. **Recurrent Neural Networks (RNNs)** [@Sherstinsky_2020], often considered the natural choice for sequential data, are inherently stateful and require only constant computation per time step. However, they are slow to train and suffer from the well-known "**vanishing gradient problem**", which limits their ability to capture LRDs. **Convolutional Neural Networks (CNNs)** [@oshea2015introduction], while efficient for parallelizable training, are not inherently sequential and struggle with long context lengths, resulting in more expensive inference. **Transformers** [@vaswani2023attention], despite their recent success in various tasks, typically require specialized architectures and attention mechanisms to handle LRDs, which significantly increase computational complexity and memory usage. -A promising alternative for tackling LRDs in long sequences is **State Space Models (SSMs)** [@gu2022efficiently], a foundational mathematical framework deeply rooted in diverse scientific disciplines like control theory and computational neuroscience. SSMs provide a continuous-time representation of a system's state and evolution, offering a powerful paradigm for capturing LRDs. While SSMs and S4s does not prevent the vanishing gradient problem but it reduces the impact with the help of HiPPO framework and NPLR Parametrization. They represent a system's behavior in terms of its internal **state** and how this state evolves over time. SSMs are widely used in various fields, including control theory, signal processing, and computational neuroscience. +A promising alternative for tackling LRDs in long sequences is **State Space Models (SSMs)** [@gu2022efficiently], a foundational mathematical framework deeply rooted in diverse scientific disciplines like control theory and computational neuroscience. SSMs provide a continuous-time representation of a system's state and evolution, offering a powerful paradigm for capturing LRDs. 
While SSMs and S4s does not prevent the vanishing gradient problem but it reduces the impact with the help of HiPPO framework and NPLR Parametrization. They represent a system's behavior in terms of its internal **state** and how this state evolves over time. SSMs are widely used in various fields, including control theory, signal processing, and computational neuroscience. #### Continuous-time Representation @@ -32,21 +35,21 @@ The continuous-time SSM describes a system's evolution using differential equati The core equations of the continuous-time SSM are: -* **State Evolution:** -  $${x'(t) = Ax(t) + Bu(t)}$$ +- **State Evolution:** +   $${x'(t) = Ax(t) + Bu(t)}$$ -* **Output Generation:** -  $${y(t) = Cx(t) + Du(t)}$$ +- **Output Generation:** +   $${y(t) = Cx(t) + Du(t)}$$ where: -* $x(t)$ is the state vector at time $t$, belonging to a $N$-dimensional space. -* $u(t)$ is the input signal at time $t$. -* $y(t)$ is the output signal at time $t$. -* $A$ is the state matrix, controlling the evolution of the state vector $x(t)$. -* $B$ is the control matrix, mapping the input signal $u(t)$ to the state space. -* $C$ is the output matrix, projecting the state vector $x(t)$ onto the output space. -* $D$ is the command matrix, directly mapping the input signal $u(t)$ to the output. (For simplicity, we often assume $D$ = 0, as $Du(t)$ can be viewed as a skip connection.) +- $x(t)$ is the state vector at time $t$, belonging to a $N$-dimensional space. +- $u(t)$ is the input signal at time $t$. +- $y(t)$ is the output signal at time $t$. +- $A$ is the state matrix, controlling the evolution of the state vector $x(t)$. +- $B$ is the control matrix, mapping the input signal $u(t)$ to the state space. +- $C$ is the output matrix, projecting the state vector $x(t)$ onto the output space. +- $D$ is the command matrix, directly mapping the input signal $u(t)$ to the output. (For simplicity, we often assume $D$ = 0, as $Du(t)$ can be viewed as a skip connection.) This system of equations defines a continuous-time mapping from input $u(t)$ to output $y(t)$ through a latent state $x(t)$. The state matrix $A$ plays a crucial role in determining the dynamics of the system and its ability to capture long-range dependencies. @@ -56,9 +59,9 @@ Despite their theoretical elegance, naive applications of SSMs often struggle wi HiPPO focuses on finding specific state matrices $A$ that allow the state vector $x(t)$ to effectively memorize the history of the input signal $u(t)$. It achieves this by leveraging the properties of orthogonal polynomials. The HiPPO framework derives several structured state matrices, including: -* **HiPPO-LegT (Translated Legendre):** Based on Legendre polynomials, this matrix enables the state to capture the history of the input within sliding windows of a fixed size. -* **HiPPO-LagT (Translated Laguerre):** Based on Laguerre polynomials, this matrix allows the state to capture a weighted history of the input, where older information decays exponentially. -* **HiPPO-LegS (Scaled Legendre):** Based on Legendre polynomials, this matrix captures the history of the input with respect to a linearly decaying weight. +- **HiPPO-LegT (Translated Legendre):** Based on Legendre polynomials, this matrix enables the state to capture the history of the input within sliding windows of a fixed size. +- **HiPPO-LagT (Translated Laguerre):** Based on Laguerre polynomials, this matrix allows the state to capture a weighted history of the input, where older information decays exponentially. 
+- **HiPPO-LegS (Scaled Legendre):** Based on Legendre polynomials, this matrix captures the history of the input with respect to a linearly decaying weight. #### Discrete-time SSM: Recurrent Representation @@ -74,13 +77,13 @@ $$ where: -* $\Delta$ acts as a gating factor, selectively weighting the contribution of matrices $A$ and $B$ at each step. This allows the model to dynamically adjust the influence of past hidden states and current inputs. +- $\Delta$ acts as a gating factor, selectively weighting the contribution of matrices $A$ and $B$ at each step. This allows the model to dynamically adjust the influence of past hidden states and current inputs. -* $A$ represents the state transition matrix. When modulated by $\Delta$, it governs the propagation of information from the previous hidden state to the current hidden state. +- $A$ represents the state transition matrix. When modulated by $\Delta$, it governs the propagation of information from the previous hidden state to the current hidden state. -* $B$ denotes the input matrix. After modulation by $\Delta$, it determines how the current input is integrated into the hidden state. +- $B$ denotes the input matrix. After modulation by $\Delta$, it determines how the current input is integrated into the hidden state. -* $C$ serves as the output matrix. It maps the hidden state to the model's output, effectively transforming the internal representations into a desired output space. +- $C$ serves as the output matrix. It maps the hidden state to the model's output, effectively transforming the internal representations into a desired output space. :::{figure} ssm.svg :label: fig:ssm @@ -109,17 +112,20 @@ The state-space models (SSMs) compute the output using a linear recurrent neural $$ h_t = \overline{A} h_{t-1} + \overline{B} x_t $$ + where -* $h_t$ is hidden state matrix at time step t -* $x_t$ is input vector at time t +- $h_t$ is hidden state matrix at time step t +- $x_t$ is input vector at time t The initial hidden state $h_0$ is computed as: + $$ h_0 = \overline{A} h_{-1} + \overline{B} x_0 = \overline{B} x_0 $$ Subsequently, the hidden state at the next time step, $h_1$, is obtained through the recursion: + $$ h_1 = \overline{A} h_0 + \overline{B} x_1 = \overline{A} \overline{B} $$ @@ -130,10 +136,11 @@ $$ y_t = C h_t $$ -* C is the output control matrix -* $y_t$ is output vector at time t -* $h_t$ is the Internal hidden state at time t +- C is the output control matrix +- $y_t$ is output vector at time t +- $h_t$ is the Internal hidden state at time t +```{math} \begin{align*} y_0 &= C h_0 = C \overline{B} x_0 \\ y_1 &= C h_1 = C \overline{A} \overline{B} x_0 + C \overline{B} x_1 \\ @@ -141,6 +148,7 @@ y_2 &= C \overline{A}^2 \overline{B} x_0 + C \overline{A} \overline{B} x_1 + C \ &\vdots\\ y_t &= C \overline{A}^t \overline{B} x_0 + C \overline{A}^{t-1} \overline{B} x_1 + \ldots + C \overline{A} \overline{B} x_{t-1} + C \overline{B} x_t \end{align*} +``` $$ Y = K \cdot X @@ -148,8 +156,8 @@ $$ where : -* $X$ is the input matrix *i.e.* $[x_0, x_1, \ldots, x_L]$ -* $ +- $X$ is the input matrix _i.e._ $[x_0, x_1, \ldots, x_L]$ +- $ K = \left( C \overline{B}, \, C \overline{A} \overline{B}, \, \ldots, \, C \overline{A}^{L-1} \overline{B} \right) $ @@ -166,31 +174,31 @@ The core computational bottleneck in SSMs stems from repeated matrix multiplicat Diagonalization involves finding a change of basis that transforms $A$ into a diagonal form. However, this approach faces significant challenges when $A$ is **non-normal**. 
Non-normal matrices have complex eigenstructures, which can lead to several problems: -* **Numerically unstable diagonalization:** Diagonalizing non-normal matrices can be numerically unstable, especially for large matrices. This is because the eigenvectors may be highly sensitive to small errors in the matrix, leading to large errors in the computed eigenvalues and eigenvectors. -* **Exponentially large entries:** The diagonalization of some non-normal matrices, including the HiPPO matrices, can involve matrices with entries that grow exponentially with the dimension $N$. This can lead to overflow issues during computation and render the diagonalization infeasible in practice. +- **Numerically unstable diagonalization:** Diagonalizing non-normal matrices can be numerically unstable, especially for large matrices. This is because the eigenvectors may be highly sensitive to small errors in the matrix, leading to large errors in the computed eigenvalues and eigenvectors. +- **Exponentially large entries:** The diagonalization of some non-normal matrices, including the HiPPO matrices, can involve matrices with entries that grow exponentially with the dimension $N$. This can lead to overflow issues during computation and render the diagonalization infeasible in practice. Therefore, naive diagonalization of non-normal matrices in SSMs is not a viable solution for efficient computation. ### The S4 Parameterization: Normal Plus Low-Rank (NPLR) -S4 overcomes the challenges of directly diagonalizing non-normal matrices by introducing a novel parameterization [@gu2022parameterization]. It decomposes the state matrix *A* into a sum of a **normal matrix** and a **low-rank term**. This decomposition allows for efficient computation while preserving the structure necessary to handle long-range dependencies. The S4 parameterization is expressed as follows: +S4 overcomes the challenges of directly diagonalizing non-normal matrices by introducing a novel parameterization [@gu2022parameterization]. It decomposes the state matrix _A_ into a sum of a **normal matrix** and a **low-rank term**. This decomposition allows for efficient computation while preserving the structure necessary to handle long-range dependencies. The S4 parameterization is expressed as follows: -* SSM convolution kernel +- SSM convolution kernel $$ ~~~~~~~~ \overline K = \kappa _L(\overline A, \overline B, \overline C) \text{~~~for~~~} A = V \Lambda V^* − P Q^T$$ where: -* *V* is a unitary matrix that diagonalizes the normal matrix. -* *Λ* is a diagonal matrix containing the eigenvalues of the normal matrix. -* *P* and *Q* are low-rank matrices that capture the non-normal component. -* These matrices HiPPO - $LegS, LegT, LagT$ all satisfy $r$ = 1 or $r$ = 2. +- _V_ is a unitary matrix that diagonalizes the normal matrix. +- _Λ_ is a diagonal matrix containing the eigenvalues of the normal matrix. +- _P_ and _Q_ are low-rank matrices that capture the non-normal component. +- These matrices HiPPO - $LegS, LegT, LagT$ all satisfy $r$ = 1 or $r$ = 2. This decomposition allows for efficient computation because: -* **Normal matrices are efficiently diagonalizable:** Normal matrices can be diagonalized stably and efficiently using unitary transformations. -* **Low-rank corrections are tractable:** The low-rank term can be corrected using the Woodbury identity, a powerful tool for inverting matrices perturbed by low-rank terms. 
+- **Normal matrices are efficiently diagonalizable:** Normal matrices can be diagonalized stably and efficiently using unitary transformations. +- **Low-rank corrections are tractable:** The low-rank term can be corrected using the Woodbury identity, a powerful tool for inverting matrices perturbed by low-rank terms. ### S4 Algorithms and Complexity @@ -212,10 +220,12 @@ $$ Selective SSM +```{math} \begin{align*} y_t &= C_0 \overline{A}^t \overline{B}_0 x_0 + C_1 \overline{A}^{t-1} \overline{B}_1 x_1 + \ldots \\ &\quad \text{input-dependent } B \text{ and } C \text{ matrix} \end{align*} +``` By leveraging the parallel associative scan technique [@lim2024parallelizing], the selective SSM formulation can be efficiently implemented on parallel architectures, such as GPUs. This approach enables the exploitation of the inherent parallelism in the computation, leading to significant performance gains, particularly for large-scale applications and time-series data processing tasks. @@ -239,7 +249,7 @@ In summary, S4 offers a structured and efficient approach to SSMs, overcoming th ### Mamba Model Architecture -One Mamba Layer [@gu2023mamba] @fig:mamba is composed of a selective state-space module and several auxiliary layers. Initially, a linear layer doubles the dimensionality of the input token embedding, increasing the dimensionality from 64 to 128. This higher dimensionality provides the network with an expanded representational space, potentially enabling the separation of previously inseparable classes. +One Mamba Layer [@gu2023mamba] @fig:mamba is composed of a selective state-space module and several auxiliary layers. Initially, a linear layer doubles the dimensionality of the input token embedding, increasing the dimensionality from 64 to 128. This higher dimensionality provides the network with an expanded representational space, potentially enabling the separation of previously inseparable classes. Subsequently, a canonical 1D convolution layer processes the output of the previous layer, manipulating the dimensions within the linearly upscaled 128-dimensional vector. This convolution layer employs the **SiLU (Sigmoid-weighted Linear Unit)** activation function [@elfwing2017sigmoidweighted]. The output of the convolution is then processed by the selective state-space module, which operates akin to a linear recurrent neural network (RNN). :::{figure} mamba.svg @@ -259,12 +269,14 @@ Self attention, feed forward Neural Networks, normalization, residual layers and #### Architecture Overview ##### Transformer Architecture -Transformers @fig:transformer rely heavily on attention mechanisms to model dependencies between input and output sequences. A better understanding of the code will be of great help[@transformer_py]. + +Transformers @fig:transformer rely heavily on attention mechanisms to model dependencies between input and output sequences. A better understanding of the code will be of great help [@transformer_py]. The core components include: -* **Multi-Head Self-Attention**: Allows the model to focus on different parts of the input sequence. -* **Position-wise Feed-Forward Networks**: Applied to each position separately. -* **Positional Encoding**: Adds information about the position of each token in the sequence, as Transformers lack inherent sequential information due to the parallel nature of their processing. + +- **Multi-Head Self-Attention**: Allows the model to focus on different parts of the input sequence. 
+- **Position-wise Feed-Forward Networks**: Applied to each position separately. +- **Positional Encoding**: Adds information about the position of each token in the sequence, as Transformers lack inherent sequential information due to the parallel nature of their processing. :::{figure} transformer.webp :label: fig:transformer @@ -272,10 +284,12 @@ This diagram illustrates the transformer model architecture, featuring encoder a ::: ##### Mamba Architecture + Mamba models @fig:mamba are based on Selective State Space Models (SSMs), combining aspects of RNNs, CNNs, and classical state space models. Key features include: -* **Selective State Space Models**: Allow input-dependent parameterization to selectively propagate or forget information. -* **Recurrent Mode**: Efficient recurrent computations with linear scaling. -* **Hardware-aware Algorithm**: Optimized for modern hardware to avoid inefficiencies from the Flash Attention 2 Paper. [@] + +- **Selective State Space Models**: Allow input-dependent parameterization to selectively propagate or forget information. +- **Recurrent Mode**: Efficient recurrent computations with linear scaling. +- **Hardware-aware Algorithm**: Optimized for modern hardware to avoid inefficiencies from the Flash Attention 2 Paper. #### Key Differences @@ -292,10 +306,10 @@ Here, $ A $, $ B $, and $ C $ are state space parameters that vary with the inpu ##### 2. Computational Complexity -| Feature | Architecture | Complexity | Inference Speed | Training Speed | -|------------|:----------------|:-------------|:------------------|:-----------------| -| Transformer | Attention-based | High | O(n) | O(n²) | -| Mamba | SSM-based | Lower | O(1) | O(n) | +| Feature | Architecture | Complexity | Inference Speed | Training Speed | +| ----------- | :-------------- | :--------- | :-------------- | :------------- | +| Transformer | Attention-based | High | O(n) | O(n²) | +| Mamba | SSM-based | Lower | O(1) | O(n) | ##### 3. Sequence Handling and Memory Efficiency @@ -308,14 +322,16 @@ Mamba integrates selective state spaces directly into the neural network archite There are other competing architectures that aim to replace or complement Transformers, such as Retentive Network [@sun2023retentive], Griffin [@de2024griffin], Hyena [@poli2023hyena], and RWKV [@peng2023rwkv]. These architectures propose alternative approaches to modeling sequential data, leveraging techniques like gated linear recurrences, local attention, and reinventing recurrent neural networks (RNNs) for the Transformer era. ### Mamba's Synergy with Scipy + Scipy [@scipy] provides a robust ecosystem for scientific computing in Python, offering a wide range of tools and libraries for numerical analysis, signal processing, optimization, and more. This ecosystem serves as a fertile ground for the development and integration of Mamba, facilitating its training, evaluation, and deployment in scientific applications. Leveraging Scipy's powerful data manipulation and visualization capabilities, Mamba models can be seamlessly integrated into scientific workflows, enabling in-depth analysis, rigorous statistical testing, and clear visualization of results. The combination of Mamba's language understanding capabilities and Scipy's scientific computing tools opens up new avenues for exploring large-scale scientific datasets commonly encountered in scientific research domains such as astronomy, medicine, and beyond, extracting insights, and advancing scientific discoveries. 
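As a small, self-contained illustration of the recurrent and convolutional views of a discrete SSM described earlier — and of how naturally these computations sit alongside NumPy and SciPy — the sketch below checks that both views produce identical outputs. The matrices are random stand-ins for the discretized $\overline{A}$, $\overline{B}$, and $C$, not trained Mamba parameters.

```python
# Minimal sketch (toy parameters, not a Mamba implementation):
# verify that the recurrence h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t
# matches the convolution y = K * x with kernel K_k = C A_bar^k B_bar.
import numpy as np
from scipy.signal import convolve

rng = np.random.default_rng(0)
N, L = 4, 32                          # state size, sequence length
A_bar = 0.9 * np.eye(N) + 0.05 * rng.standard_normal((N, N))
B_bar = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
x = rng.standard_normal(L)            # scalar input sequence

# Recurrent view: one state update per time step
h = np.zeros((N, 1))
y_rec = np.empty(L)
for t in range(L):
    h = A_bar @ h + B_bar * x[t]
    y_rec[t] = (C @ h).item()

# Convolutional view: precompute the kernel K, then convolve with the input
K = np.array([(C @ np.linalg.matrix_power(A_bar, k) @ B_bar).item()
              for k in range(L)])
y_conv = convolve(x, K)[:L]

assert np.allclose(y_rec, y_conv)     # both views agree
```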
#### Potential Applications and Future Directions: -* **Efficient Processing of Large Scientific Datasets:** Mamba's ability to handle long-range dependencies makes it well-suited for analyzing and summarizing vast amounts of scientific data, such as astronomical observations, medical records, or experimental results, thereby reducing the complexity and enabling more efficient analysis. -* **Enhancing Model Efficiency and Scalability:** Integrating Mamba with Scipy's optimization and parallelization techniques can potentially improve the efficiency and scalability of language models, enabling them to handle increasingly larger datasets and more complex scientific problems. -* **Advancing Scientific Computing through Interdisciplinary Collaboration:** The synergy between Mamba and Scipy fosters interdisciplinary collaboration between natural language processing researchers, scientific computing experts, and domain-specific scientists, paving the way for novel applications and pushing the boundaries of scientific computing. + +- **Efficient Processing of Large Scientific Datasets:** Mamba's ability to handle long-range dependencies makes it well-suited for analyzing and summarizing vast amounts of scientific data, such as astronomical observations, medical records, or experimental results, thereby reducing the complexity and enabling more efficient analysis. +- **Enhancing Model Efficiency and Scalability:** Integrating Mamba with Scipy's optimization and parallelization techniques can potentially improve the efficiency and scalability of language models, enabling them to handle increasingly larger datasets and more complex scientific problems. +- **Advancing Scientific Computing through Interdisciplinary Collaboration:** The synergy between Mamba and Scipy fosters interdisciplinary collaboration between natural language processing researchers, scientific computing experts, and domain-specific scientists, paving the way for novel applications and pushing the boundaries of scientific computing. The diverse range of models as U-Mamba [@ma2024umamba], Vision Mamba[@zhu2024vision], VMamba [@liu2024vmamba], MambaByte [@wang2024mambabyte]and Jamba [@lieber2024jamba], highlights the versatility and adaptability of the Mamba architecture. These variants have been designed to enhance efficiency, improve long-range dependency modeling, incorporate visual representations, explore token-free approaches, integrate Fourier learning, and hybridize with Transformer components. diff --git a/papers/Suvrakamal_Das/myst.yml b/papers/Suvrakamal_Das/myst.yml index 3f9309f2be..662ee76d6a 100644 --- a/papers/Suvrakamal_Das/myst.yml +++ b/papers/Suvrakamal_Das/myst.yml @@ -1,10 +1,13 @@ version: 1 +extends: ../proceedings.yml project: + doi: 10.25080/XHDR4700 # Update this to match `scipy-2024-` the folder should be `` id: scipy-2024-Suvrakamal_Das # Ensure your title is the same as in your `main.md` title: Mamba Models a possible replacement for Transformers? - subtitle: A Memory-Efficient Approach for Scientific Computing (SciPy'24 Paper) + subtitle: A Memory-Efficient Approach for Scientific Computing + description: The quest for more efficient and faster deep learning models has led to the development of various alternatives to Transformers, one of which is the Mamba model. This paper provides a comprehensive comparison between Mamba models and Transformers, focusing on their architectural differences, performance metrics, and underlying mechanisms. 
# Authors should have affiliations, emails and ORCIDs if available authors: - name: Suvrakamal Das @@ -12,13 +15,13 @@ project: orcid: 0009-0002-4791-9244 affiliations: - Maulana Abul Kalam Azad University Institute of Technology, West Bengal - + - name: Rounak Sen email: rony000013@gmail.com orcid: 0009-0003-9327-4712 affiliations: - Maulana Abul Kalam Azad University Institute of Technology, West Bengal - + - name: Saikrishna Devendiran email: dsaikrishna200r@gmail.com orcid: 0009-0003-6153-3177 @@ -55,13 +58,5 @@ project: - mamba_github - mamba_s4 - transformer_py - # A banner will be generated for you on publication, this is a placeholder - banner: banner.jpg - # The rest of the information shouldn't be modified - subject: Research Article - open_access: true - license: CC-BY-4.0 - venue: Scipy 2024 - date: 2024-07-10 site: template: article-theme diff --git a/papers/Suvrakamal_Das/thumbnail.png b/papers/Suvrakamal_Das/thumbnail.png new file mode 100644 index 0000000000..ca2072a51d Binary files /dev/null and b/papers/Suvrakamal_Das/thumbnail.png differ diff --git a/papers/Valeria_Martin/banner.png b/papers/Valeria_Martin/banner.png index c5dd028e26..7c3ec2ae04 100644 Binary files a/papers/Valeria_Martin/banner.png and b/papers/Valeria_Martin/banner.png differ diff --git a/papers/Valeria_Martin/main.tex b/papers/Valeria_Martin/main.tex index 844dec52f9..0c376666ec 100644 --- a/papers/Valeria_Martin/main.tex +++ b/papers/Valeria_Martin/main.tex @@ -1,107 +1,107 @@ -\documentclass{article} -\usepackage{graphicx} % Required for inserting images -\usepackage{listings} -\usepackage{float} -\usepackage{color} -\usepackage{booktabs} -\definecolor{codegreen}{rgb}{0,0.6,0} -\definecolor{codegray}{rgb}{0.5,0.5,0.5} -\definecolor{codepurple}{rgb}{0.58,0,0.82} -\definecolor{backcolour}{rgb}{0.95,0.95,0.92} -\usepackage[numbers]{natbib} -\usepackage{hyperref} - -\lstdefinestyle{mystyle}{ - backgroundcolor=\color{backcolour}, - commentstyle=\color{codegreen}, - keywordstyle=\color{magenta}, - numberstyle=\tiny\color{codegray}, - stringstyle=\color{codepurple}, - basicstyle=\footnotesize, - breakatwhitespace=false, - breaklines=true, - captionpos=b, - keepspaces=true, - numbers=left, - numbersep=5pt, - showspaces=false, - showstringspaces=false, - showtabs=false, - tabsize=2 -} - -\lstset{style=mystyle} - - -\begin{abstract} - -In recent years, leveraging satellite imagery with deep learning (DL) architectures has become an effective approach for environmental monitoring tasks, including forest wildfire detection. Nevertheless, this integration requires substantial high-quality labeled data to train the DL models accurately. Leveraging the capabilities of multiple Python libraries, such as rasterio and GeoPandas, and Google Earth Engine’s Python API, this study introduces a streamlined methodology to efficiently gather, label, augment, process, and evaluate a large-scale bi-temporal high-resolution satellite imagery dataset for DL-driven forest wildfire detection. Known as the California Wildfire GeoImaging Dataset (CWGID), this dataset comprises over 100,000 labeled 'before' and 'after' wildfire image pairs, created from pre-existing satellite imagery. An analysis of the dataset using pre-trained and adapted Convolutional Neural Network (CNN) architectures, such as VGG16 and EfficientNet, achieved accuracies of respectively 76\% and 93\%. 
The pipeline outlined in this paper demonstrates how Python can be used to gather and process high-resolution satellite imagery datasets, leading to accurate wildfire detection and providing a tool for broader environmental monitoring. - -\end{abstract} - -\section{Introduction}\label{introduction} - - +\documentclass{article} +\usepackage{graphicx} % Required for inserting images +\usepackage{listings} +\usepackage{float} +\usepackage{color} +\usepackage{booktabs} +\definecolor{codegreen}{rgb}{0,0.6,0} +\definecolor{codegray}{rgb}{0.5,0.5,0.5} +\definecolor{codepurple}{rgb}{0.58,0,0.82} +\definecolor{backcolour}{rgb}{0.95,0.95,0.92} +\usepackage[numbers]{natbib} +\usepackage{hyperref} + +\lstdefinestyle{mystyle}{ + backgroundcolor=\color{backcolour}, + commentstyle=\color{codegreen}, + keywordstyle=\color{magenta}, + numberstyle=\tiny\color{codegray}, + stringstyle=\color{codepurple}, + basicstyle=\footnotesize, + breakatwhitespace=false, + breaklines=true, + captionpos=b, + keepspaces=true, + numbers=left, + numbersep=5pt, + showspaces=false, + showstringspaces=false, + showtabs=false, + tabsize=2 +} + +\lstset{style=mystyle} + + +\begin{abstract} + +In recent years, leveraging satellite imagery with deep learning (DL) architectures has become an effective approach for environmental monitoring tasks, including forest wildfire detection. Nevertheless, this integration requires substantial high-quality labeled data to train the DL models accurately. Leveraging the capabilities of multiple Python libraries, such as rasterio and GeoPandas, and Google Earth Engine’s Python API, this study introduces a streamlined methodology to efficiently gather, label, augment, process, and evaluate a large-scale bi-temporal high-resolution satellite imagery dataset for DL-driven forest wildfire detection. Known as the California Wildfire GeoImaging Dataset (CWGID), this dataset comprises over 100,000 labeled 'before' and 'after' wildfire image pairs, created from pre-existing satellite imagery. An analysis of the dataset using pre-trained and adapted Convolutional Neural Network (CNN) architectures, such as VGG16 and EfficientNet, achieved accuracies of respectively 76\% and 93\%. The pipeline outlined in this paper demonstrates how Python can be used to gather and process high-resolution satellite imagery datasets, leading to accurate wildfire detection and providing a tool for broader environmental monitoring. + +\end{abstract} + +\section{Introduction}\label{introduction} + + This paper presents a Python-based methodology for gathering and using a labeled high-resolution satellite imagery dataset for forest wildfire detection. Forests are important ecosystems found globally. They are made up of trees, plants, and other various types of vegetation. Forests host many species and are crucial for maintaining environmental health, as they support biodiversity, climate regulation, and oxygen production. Moreover, they bring economic and social benefits, including energy production, job opportunities, and spaces for leisure and tourism. Protecting forests and tackling forest loss is a current global priority \citep{IUCN2021}. -With the development of Earth Observation (EO) systems, remote sensing became a time-efficient and cost-effective method for monitoring and detecting forest change \citep{Massey2023}. Moreover, recent advancements in satellite technology have significantly enhanced forest monitoring capabilities by providing high-resolution imagery and increasing the frequency of observations. 
- -Satellite imagery-based change detection and forest monitoring have traditionally relied on manually identifying specific features and using predefined algorithms and models, such as differential analysis, thresholding techniques, and clustering and classification algorithms. This approach requires considerable domain expertise and such algorithms and models may not capture the full complexity of the studied data \citep{rs14071552}. - +With the development of Earth Observation (EO) systems, remote sensing became a time-efficient and cost-effective method for monitoring and detecting forest change \citep{Massey2023}. Moreover, recent advancements in satellite technology have significantly enhanced forest monitoring capabilities by providing high-resolution imagery and increasing the frequency of observations. + +Satellite imagery-based change detection and forest monitoring have traditionally relied on manually identifying specific features and using predefined algorithms and models, such as differential analysis, thresholding techniques, and clustering and classification algorithms. This approach requires considerable domain expertise and such algorithms and models may not capture the full complexity of the studied data \citep{rs14071552}. + With the emergence of deep learning (DL) algorithms, specifically computer vision methods, such as Convolutional Neural Networks (CNNs) \citep{lecun} and Fully Convolutional Neural Networks (e.g., U-Nets \citep{DBLP:RonnebergerFB15}), there is a significant opportunity to enhance and facilitate forest change detection efforts. These advanced computational methods can rapidly identify complex patterns within vast datasets. Furthermore, when integrated with EO systems they can facilitate near real-time monitoring and detection of multiple forest loss causes, assess their extent, or even predict and evaluate their spread \citep{eleo,al-dabbagh2023uni}. Thus, integrating DL methods with satellite imagery offers a more dynamic and precise approach, capable of handling the patterns and variability associated with imagery data. For instance, DL models can automatically learn complex patterns related to wildfire spread from labeled examples, whereas traditional methods might miss subtle but important indicators. However, DL algorithms require a substantial amount of labeled data to effectively learn and identify change \citep{Alzubaidi2021ReviewOD}. Therefore, the development of labeled high-resolution satellite imagery datasets is important and relevant for addressing environmental problems. Currently, the availability of labeled high-quality satellite imagery datasets is an obstacle to developing DL models for environmental change detection \citep{Adegun2023}. Generally, building a satellite imagery dataset is a time-intensive process. However, Google Earth Engine (GEE) \citep{gorelick2017google} has recently revolutionized this process by providing an extensive, cloud-based platform for the efficient collection, processing, and analysis of satellite imagery. GEE’s Python API allows its users to programmatically query their platform and download cloud-free large-scale satellite imagery datasets from multiple satellite collections, such as Sentinel-2. -For example, while traditional methods might require manual search and download of images, GEE can automate this process, significantly reducing the time needed to find suitable satellite images. 
This technology makes data collection and processing faster and easier, facilitating environmental monitoring by providing reliable and easily accessible high-quality satellite imagery. - -Furthermore, Python facilitates the use of DL in environmental monitoring by providing a rich ecosystem of libraries and tools, such as TensorFlow \citep{tensorflow2015-whitepaper}, which contains multiple existing DL architectures that can be adapted and used with satellite imagery. Nevertheless, integrating DL and remotely sensed images requires multiple processing steps, such as having smaller imagery tiles and adapting the models to use GeoTIFF data, among others. These steps include: -\begin{enumerate} - \item Data Acquisition: Collect satellite imagery from sources such as Google Earth Engine or other satellite image providers. - \item Image Tiling: Divide large satellite images into smaller tiles to fit the input image size used in DL models. - \item Data Annotation: Label the images to create a ground truth dataset for training DL models. - \item Data Augmentation: Applying transformations such as rotations and flips to increase the diversity of the training dataset. +For example, while traditional methods might require manual search and download of images, GEE can automate this process, significantly reducing the time needed to find suitable satellite images. This technology makes data collection and processing faster and easier, facilitating environmental monitoring by providing reliable and easily accessible high-quality satellite imagery. + +Furthermore, Python facilitates the use of DL in environmental monitoring by providing a rich ecosystem of libraries and tools, such as TensorFlow \citep{tensorflow2015-whitepaper}, which contains multiple existing DL architectures that can be adapted and used with satellite imagery. Nevertheless, integrating DL and remotely sensed images requires multiple processing steps, such as having smaller imagery tiles and adapting the models to use GeoTIFF data, among others. These steps include: +\begin{enumerate} + \item Data Acquisition: Collect satellite imagery from sources such as Google Earth Engine or other satellite image providers. + \item Image Tiling: Divide large satellite images into smaller tiles to fit the input image size used in DL models. + \item Data Annotation: Label the images to create a ground truth dataset for training DL models. + \item Data Augmentation: Applying transformations such as rotations and flips to increase the diversity of the training dataset. \end{enumerate} -Python’s tools help manage these steps. For example, libraries like rasterio, Tifffile and GeoPandas, can be used to process and transform satellite imagery data into formats suitable for DL models. - - -This paper presents a methodology, implemented in Python, to streamline the creation and evaluation, via DL, of satellite imagery datasets. The methodology covers the entire workflow: from data acquisition, labelling, and preprocessing to model adaptation, training, and evaluation. Specifically, this approach is applied to gather and validate a high-resolution dataset for forest wildfire detection, the California Wildfire GeoImaging Dataset (CWGID). Additionally, this methodology can be adapted for various environmental monitoring tasks, showing its versatility in studying and responding to different environmental changes. - - -\section{Building a Sentinel-2 Satellite Imagery Dataset} -To construct the CWGID, a multi-step process is needed. 
-\subsection{Gathering and Refining Historic Wildfire Polygon Data from California} -The initial step is to gather georeferenced forest wildfire polygon data from California, sourced from the Fire and Resource Assessment Program (FRAP) maintained by the California Department of Forestry and Fire Protection \citep{california_department_of_forestry_and_fire_protection_2024}. This FRAP data includes perimeters of past wildfires and serves as the geographic reference needed to select satellite imagery with GEE. Figure \ref{fig2} illustrates the polygons from the FRAP. The polygons delineated in purple represent areas affected by wildfires in forested regions. These delineated polygons are used to create the CWGID. - - -\begin{figure}[h] - \centering - \includegraphics{polygons.png} - \caption{Representation of the Polygon Data from the FRAP. Polygons delineated in purple represent wildfires that occurred in forested areas, used for the California Wildfire GeoImaging Dataset (CWGID). \label{fig2}} -\end{figure} - -Then, in Python, the Pandas library \citep{pandas1} is used to organize the forest wildfire attribute data into a Pandas DataFrame, which is then filtered to align with the launch date and operational phase of the Sentinel-2 satellites, selected for their open-source, high-resolution imagery capabilities \citep{DRUSCH201225}. Additionally, the dates are adjusted to fall within the green-up period, avoiding the winter and fall seasons where snow cover could interfere with identifying burnt areas. - -Next, the data is formatted to meet GEE’s querying specifications: -\begin{itemize} - \item A 15-day range for pre- and post-wildfire dates is generated and added to the DataFrame. - \item Using Pandas, the date ranges are formatted to meet GEE's requirements. - \item Using the pyproj library \citep{pyproj2023}, the recorded point coordinates are converted from NAD83 to WGS84 to facilitate the querying process. - \item With the geopy \citep{geopy} library, the coordinates of the squared region of interest are calculated, featuring a side length of 15 miles. -\end{itemize} - - -\subsection{Downloading the Imagery Data Using GEE's Python API} -GEE is a cloud-based platform for global environmental data analysis. It combines an extensive archive of satellite imagery and geospatial datasets with powerful computational resources to enable researchers to detect and quantify changes on the Earth’s surface. GEE’s Python API offers an accessible interface for automating the process of satellite imagery downloads, making it a popular tool for environmental monitoring and research projects. - -Multiple steps are needed to set up the GEE's Python API. First, a project is created in Google Cloud Console and the Earth Engine API is enabled. Authentication and Google Drive editing rights are configured to effectively manage and store the downloaded imagery. Following the setup, the Earth Engine Python API is installed on a local machine, and the necessary authentications are performed to initialize the API. - -Then, a Python script is developed to automate the download of images depicting the pre- or post-wildfire data using GEE's Python API (see Code \ref{download}). To download three-channel RGB GeoTIFF imagery, the bands B4 (red), B3(green), and B2(blue) need to be specified (different band compositions can be selected in this step). 
In satellite imagery, bands refer to specific wavelength ranges captured by the satellite sensors, and they are used to create composite images that highlight different features of the Earth's surface. These bands correspond to the visible spectrum, which is useful for visual interpretation and analysis. -The script to download the satellite imagery needed using GEE (see Code \ref{download}) is configured with a for loop to iterate through each entry in the DataFrame, extracting necessary parameters such as date ranges, region of interest (ROI) coordinates, and center coordinates of each wildfire polygon. Also, the script is designed to specify parameters such as the desired image collection and a threshold for cloud coverage. Tiles exhibiting more than 10\% cloud coverage are automatically excluded to maintain data quality. Finally, the images are downloaded and exported to Google Drive in a GeoTIFF format. - +Python’s tools help manage these steps. For example, libraries like rasterio, Tifffile and GeoPandas, can be used to process and transform satellite imagery data into formats suitable for DL models. + + +This paper presents a methodology, implemented in Python, to streamline the creation and evaluation, via DL, of satellite imagery datasets. The methodology covers the entire workflow: from data acquisition, labelling, and preprocessing to model adaptation, training, and evaluation. Specifically, this approach is applied to gather and validate a high-resolution dataset for forest wildfire detection, the California Wildfire GeoImaging Dataset (CWGID). Additionally, this methodology can be adapted for various environmental monitoring tasks, showing its versatility in studying and responding to different environmental changes. + + +\section{Building a Sentinel-2 Satellite Imagery Dataset} +To construct the CWGID, a multi-step process is needed. +\subsection{Gathering and Refining Historic Wildfire Polygon Data from California} +The initial step is to gather georeferenced forest wildfire polygon data from California, sourced from the Fire and Resource Assessment Program (FRAP) maintained by the California Department of Forestry and Fire Protection \citep{california_department_of_forestry_and_fire_protection_2024}. This FRAP data includes perimeters of past wildfires and serves as the geographic reference needed to select satellite imagery with GEE. Figure \ref{fig2} illustrates the polygons from the FRAP. The polygons delineated in purple represent areas affected by wildfires in forested regions. These delineated polygons are used to create the CWGID. + + +\begin{figure}[h] + \centering + \includegraphics{polygons.png} + \caption{Representation of the Polygon Data from the FRAP. Polygons delineated in purple represent wildfires that occurred in forested areas, used for the California Wildfire GeoImaging Dataset (CWGID). \label{fig2}} +\end{figure} + +Then, in Python, the Pandas library \citep{pandas1} is used to organize the forest wildfire attribute data into a Pandas DataFrame, which is then filtered to align with the launch date and operational phase of the Sentinel-2 satellites, selected for their open-source, high-resolution imagery capabilities \citep{DRUSCH201225}. Additionally, the dates are adjusted to fall within the green-up period, avoiding the winter and fall seasons where snow cover could interfere with identifying burnt areas. 
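As an illustration of this filtering step, the short sketch below shows one possible way to restrict the FRAP records to the Sentinel-2 era and to an assumed May--September green-up window; the \texttt{ALARM\_DATE} column name and the month range are assumptions rather than fixed choices of the pipeline.

\begin{lstlisting}[language=Python, caption=Illustrative sketch (assumptions noted in comments) of filtering the FRAP attribute table to the Sentinel-2 era and an assumed green-up window.]
import geopandas as gpd
import pandas as pd

# Read the FRAP fire perimeters from a placeholder path
gdf = gpd.read_file("YourShapefileDirectory/FirePolygons.shp")

# ALARM_DATE is assumed to hold the fire start date in the attribute table
fires = pd.DataFrame(gdf.drop(columns="geometry"))
fires["ALARM_DATE"] = pd.to_datetime(fires["ALARM_DATE"], errors="coerce")

# Keep fires recorded after the Sentinel-2A launch (23 June 2015)
fires = fires[fires["ALARM_DATE"] >= "2015-06-23"]

# Keep fires whose start dates fall in an assumed green-up window (May to September)
fires = fires[fires["ALARM_DATE"].dt.month.between(5, 9)]
\end{lstlisting}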
+ +Next, the data is formatted to meet GEE’s querying specifications: +\begin{itemize} + \item A 15-day range for pre- and post-wildfire dates is generated and added to the DataFrame. + \item Using Pandas, the date ranges are formatted to meet GEE's requirements. + \item Using the pyproj library \citep{pyproj2023}, the recorded point coordinates are converted from NAD83 to WGS84 to facilitate the querying process. + \item With the geopy \citep{geopy} library, the coordinates of the squared region of interest are calculated, featuring a side length of 15 miles. +\end{itemize} + + +\subsection{Downloading the Imagery Data Using GEE's Python API} +GEE is a cloud-based platform for global environmental data analysis. It combines an extensive archive of satellite imagery and geospatial datasets with powerful computational resources to enable researchers to detect and quantify changes on the Earth’s surface. GEE’s Python API offers an accessible interface for automating the process of satellite imagery downloads, making it a popular tool for environmental monitoring and research projects. + +Multiple steps are needed to set up the GEE's Python API. First, a project is created in Google Cloud Console and the Earth Engine API is enabled. Authentication and Google Drive editing rights are configured to effectively manage and store the downloaded imagery. Following the setup, the Earth Engine Python API is installed on a local machine, and the necessary authentications are performed to initialize the API. + +Then, a Python script is developed to automate the download of images depicting the pre- or post-wildfire data using GEE's Python API (see Code \ref{download}). To download three-channel RGB GeoTIFF imagery, the bands B4 (red), B3(green), and B2(blue) need to be specified (different band compositions can be selected in this step). In satellite imagery, bands refer to specific wavelength ranges captured by the satellite sensors, and they are used to create composite images that highlight different features of the Earth's surface. These bands correspond to the visible spectrum, which is useful for visual interpretation and analysis. +The script to download the satellite imagery needed using GEE (see Code \ref{download}) is configured with a for loop to iterate through each entry in the DataFrame, extracting necessary parameters such as date ranges, region of interest (ROI) coordinates, and center coordinates of each wildfire polygon. Also, the script is designed to specify parameters such as the desired image collection and a threshold for cloud coverage. Tiles exhibiting more than 10\% cloud coverage are automatically excluded to maintain data quality. Finally, the images are downloaded and exported to Google Drive in a GeoTIFF format. + \begin{lstlisting}[language=Python, label=download, caption=Script to automate the download of pre- or post-wildfire images using GEE’s Python API. It iterates through a DataFrame, extracting relevant parameters and downloading images with less than 10\% cloud coverage. The images are exported as GeoTIFF files to Google Drive.] # Authenticate into EE ee.Authenticate() @@ -174,161 +174,161 @@ \subsection{Downloading the Imagery Data Using GEE's Python API} # Skip images with more than 10% cloud coverage print(f"Skipping image {i} due to cloudy percentage ({cloudy_percentage} %) > 10 %") \end{lstlisting} - - -Figure \ref{fig3} presents an example of a pre-and post-wildfire imagery pair downloaded from GEE to Google Drive using code \ref{download} . 
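As a complementary illustration of the coordinate preparation steps listed above (the NAD83 to WGS84 conversion with pyproj and the 15-mile square region of interest built with geopy), the following sketch shows one way these operations can be written; the centroid coordinates are invented for illustration and the exact CWGID implementation may differ.

\begin{lstlisting}[language=Python, caption={Illustrative sketch: reprojecting a fire centroid to WGS84 and deriving a square region of interest roughly 15 miles on a side.}]
from pyproj import Transformer
from geopy.distance import distance

# Reproject an illustrative centroid from NAD83 (EPSG:4269) to WGS84 (EPSG:4326)
to_wgs84 = Transformer.from_crs("EPSG:4269", "EPSG:4326", always_xy=True)
lon, lat = to_wgs84.transform(-121.5, 39.8)  # always_xy=True: arguments and results are (lon, lat)

# Move 7.5 miles from the center in each cardinal direction to obtain the
# bounding coordinates of a square roughly 15 miles on a side
half_side = distance(miles=7.5)
north = half_side.destination((lat, lon), bearing=0).latitude
south = half_side.destination((lat, lon), bearing=180).latitude
east = half_side.destination((lat, lon), bearing=90).longitude
west = half_side.destination((lat, lon), bearing=270).longitude

roi = [west, south, east, north]  # [min_lon, min_lat, max_lon, max_lat] for the GEE query
print(roi)
\end{lstlisting}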
- -\begin{figure}[H] - \centering - \includegraphics{prepost1.png} - \caption{Example of a pre and post-wildfire RGB image pair of a forested area downloaded using GEE's Python API. \label{fig3}} -\end{figure} -\subsection{Creating the Ground Truth Wildfire Labels} -Ground truth masks are essential in forest wildfire detection and general land cover classifications \citep{8113128}. In this project, these type of masks are generated to label the data. - -First, Python is used to rasterize the combined geometry of the forest wildfire polygon data and the downloaded post-wildfire RGB satellite imagery. Specifically, the forest wildfire polygons are accessed in Python using GeoPandas \citep{geopandas} and reprojected to match the coordinate system of the satellite imagery (EPSG:4326). Then, each post-wildfire RGB image is locally and temporarily downloaded from Google Drive, with essential properties such as width, height, transform, and bounds extracted using the rasterio library \citep{rasterio}. Next, the geometry column from the forest wildfire polygon data is extracted and intersected with each image bound using Python's shapely \citep{shapely} library. Finally, binary masks are created by rasterizing the combined geometries. These masks match the dimensions of the satellite images, ensuring that each pixel labeled as wildfire damage corresponds directly to the polygon data (see Code \ref{lst:fire_polygons_processing}). The binary masks are saved temporarily in GeoTIFF format and are uploaded to a dedicated Google Drive folder. All the temporary local files were deleted to clear space and maintain system efficiency. -\begin{lstlisting}[language=Python, label= lst:fire_polygons_processing, caption = Building Ground Truth Masks] -import geopandas as gpd -import numpy as np -import rasterio -from shapely.geometry import box, shape -import os - -# Define the path to your Shapefile - replace with your specific path -shapefile_path = "YourShapefileDirectory/FirePolygons.shp" - -# Read the Shapefile using geopandas -gdf = gpd.read_file(shapefile_path) - -# Reproject the shapefile to EPSG:4326 to match the satellite imagery coordinate system -gdf = gdf.to_crs(epsg=4326) - -# Create a directory to store the output raster masks if it doesn't already exist -output_dir = "raster_masks" -if not os.path.exists(output_dir): - os.makedirs(output_dir) - -# Iterate through the rows in the attribute table -for index, row in gdf.iterrows(): - object_id = row["OBJECTID"] - image_path = f"YourImageDirectory/RGB_AfterFire{object_id}.tif" - # Check if the image exists - if os.path.exists(image_path): - # Open the image using rasterio - with rasterio.open(image_path) as src: - image_width = src.width - image_height = src.height - image_transform = src.transform - image_bounds = box( - src.bounds[0], src.bounds[1], - src.bounds[2], src.bounds[3] - ) - - # Extract the geometry column - geom = shape(row["geometry"]) - clipped_geom = geom.intersection(image_bounds.envelope) - - if not clipped_geom.is_empty: - # Create a two-dimensional label by rasterizing the - # clipped geometry - clipped_mask = rasterio.features.geometry_mask( - [clipped_geom], - out_shape=(image_height, image_width), - transform=image_transform, - invert=True, - ) - - # Save the image with the two-dimensional label overlay - output_image_path - = f"{output_dir}/Masked_{object_id}.tif" - with rasterio.open( - output_image_path, - "w", - driver="GTiff", - width=image_width, - height=image_height, - count=1, - dtype=np.uint8, - crs=src.crs, - 
transform=image_transform, - ) as dst: - dst.write(clipped_mask.astype(np.uint8), 1) -\end{lstlisting} - -In the resulting ground truth masks, the pixel values are set to zero if they are outside of a wildfire polygon, indicating unaffected areas, and set to one if they are within the polygon boundaries, indicating areas affected by a forest wildfire. Figure \ref{fig4} displays an example of a resulting ground truth mask. - - -\begin{figure}[H] - \centering - \includegraphics{label1.png} - \caption{Example of a resulting ground truth mask in a forested area affected by wildfires. The mask highlights wildfire-affected areas in yellow and unaffected areas in purple. This binary mask is used to train and validate deep learning models for accurate wildfire detection \label{fig4}} -\end{figure} - - -\subsection{Image Segmentation and Data Preparation for Deep Learning Architectures} -Next, the satellite images and their corresponding ground-truth masks are cropped into smaller tiles that maintain the imagery's spatial resolution (10m). Often, satellite imagery needs to be resized and downscaled to accommodate deep learning (DL) architectures, which can result in the loss of essential details such as subtle indicators of early-stage wildfires. Moreover, using smaller images enhances the efficiency of DL models by lowering computational demands and speeding up training times \citep{hu2015transferring,marmanis2016deep}. - -To do this, a tile size of 256x256 pixels is specified and each RGB image is downloaded individually from Google Drive to a temporary local folder. Using Python’s rasterio library, the original RGB images are opened to obtain their dimensions. Then, the number of rows and columns for the tiles is calculated based on the chosen tile size. Next, a rasterio Window object is used to extract the corresponding portion from the original image and read the RGB data, ensuring the order of the bands (B4, B3, and B2). - -The segmented RGB tiles are then saved as GeoTIFF files using the tifffile Python library \citep{tifffile}. This is a critical step to maintain the integrity of the three-channel RGB data, as the rasterio library alone can alter the color of the images during saving. Additionally, the metadata of the saved tiles is updated to include georeferencing information and to modify parameters such as width, height, and transform (see Code \ref{lst:save_rgb_tiles}). - -A similar approach is used to segment the binary masks, specifying that the images contain only one band. 
-\begin{lstlisting}[language=Python, caption={Function to Crop RGB Image Tiles}, label= lst:save_rgb_tiles]
-from rasterio import Window
-from tifffile import imwrite
-
-# Function to save image tiles without changing the data type
-def save_rgb_tiles(image_path, output_folder, tile_size, parent_name):
-    # Open the source image file
-    with rasterio.open(image_path) as src:
-        height = src.height  # Get the height of the source image
-        width = src.width  # Get the width of the source image
-
-        # Calculate the number of tiles in both dimensions
-        num_rows = height // tile_size
-        num_cols = width // tile_size
-
-        tile_counter = 1  # Initialize the tile counter
-
-        # Iterate over the number of rows of tiles
-        for i in range(num_rows):
-            # Iterate over the number of columns of tiles
-            for j in range(num_cols):
-                # Define the window for the current tile
-                window = Window(j * tile_size, i * tile_size, tile_size, tile_size)
-
-                # Read the original data without modifications
-                # Assuming band order B4, B3, B2
-                tile = src.read((1, 2, 3), window=window)
-                # Create a unique name for the tile
-                tile_name = f"{parent_name}_tile_{tile_counter}.tif"
-                # Define the path to save the tile
-                tile_path = os.path.join(output_folder, tile_name)
-
-                # Save the tile using tifffile without changing
-                # data type
-                imwrite(tile_path, tile)
-
-                # Copy the metadata from the source image
-                meta = src.meta.copy()
-                # Get the transformation matrix for the current window
-                transform = src.window_transform(window)
-                # Update the metadata with the new dimensions and transformation
-                meta.update({
-                    'width': tile_size,
-                    'height': tile_size,
-                    'transform': transform
-                })
-                # Save the tile with updated metadata using rasterio
-                with rasterio.open(tile_path, 'w', **meta) as dst:
-                    dst.write(tile)
-                tile_counter += 1  # Increment the tile counter
-\end{lstlisting}
+
+
+Figure \ref{fig3} presents an example of a pre- and post-wildfire imagery pair downloaded from GEE to Google Drive using Code \ref{download}.
+
+\begin{figure}[H]
+    \centering
+    \includegraphics{prepost1.png}
+    \caption{Example of a pre- and post-wildfire RGB image pair of a forested area downloaded using GEE's Python API. \label{fig3}}
+\end{figure}
+\subsection{Creating the Ground Truth Wildfire Labels}
+Ground truth masks are essential in forest wildfire detection and general land cover classifications \citep{8113128}. In this project, these types of masks are generated to label the data.
+
+First, Python is used to rasterize the combined geometry of the forest wildfire polygon data and the downloaded post-wildfire RGB satellite imagery. Specifically, the forest wildfire polygons are accessed in Python using GeoPandas \citep{geopandas} and reprojected to match the coordinate system of the satellite imagery (EPSG:4326). Then, each post-wildfire RGB image is locally and temporarily downloaded from Google Drive, with essential properties such as width, height, transform, and bounds extracted using the rasterio library \citep{rasterio}. Next, the geometry column from the forest wildfire polygon data is extracted and intersected with each image's bounds using Python's shapely \citep{shapely} library. Finally, binary masks are created by rasterizing the combined geometries. These masks match the dimensions of the satellite images, ensuring that each pixel labeled as wildfire damage corresponds directly to the polygon data (see Code \ref{lst:fire_polygons_processing}).
The binary masks are saved temporarily in GeoTIFF format and are uploaded to a dedicated Google Drive folder. All the temporary local files are deleted to clear space and maintain system efficiency.
+\begin{lstlisting}[language=Python, label= lst:fire_polygons_processing, caption = Building Ground Truth Masks]
+import geopandas as gpd
+import numpy as np
+import rasterio
+import rasterio.features  # geometry_mask lives in the features submodule
+from shapely.geometry import box, shape
+import os
+
+# Define the path to your Shapefile - replace with your specific path
+shapefile_path = "YourShapefileDirectory/FirePolygons.shp"
+
+# Read the Shapefile using geopandas
+gdf = gpd.read_file(shapefile_path)
+
+# Reproject the shapefile to EPSG:4326 to match the satellite imagery coordinate system
+gdf = gdf.to_crs(epsg=4326)
+
+# Create a directory to store the output raster masks if it doesn't already exist
+output_dir = "raster_masks"
+if not os.path.exists(output_dir):
+    os.makedirs(output_dir)
+
+# Iterate through the rows in the attribute table
+for index, row in gdf.iterrows():
+    object_id = row["OBJECTID"]
+    image_path = f"YourImageDirectory/RGB_AfterFire{object_id}.tif"
+    # Check if the image exists
+    if os.path.exists(image_path):
+        # Open the image using rasterio
+        with rasterio.open(image_path) as src:
+            image_width = src.width
+            image_height = src.height
+            image_transform = src.transform
+            image_bounds = box(
+                src.bounds[0], src.bounds[1],
+                src.bounds[2], src.bounds[3]
+            )
+
+            # Extract the geometry column
+            geom = shape(row["geometry"])
+            clipped_geom = geom.intersection(image_bounds.envelope)
+
+            if not clipped_geom.is_empty:
+                # Create a two-dimensional label by rasterizing the
+                # clipped geometry
+                clipped_mask = rasterio.features.geometry_mask(
+                    [clipped_geom],
+                    out_shape=(image_height, image_width),
+                    transform=image_transform,
+                    invert=True,
+                )
+
+                # Save the image with the two-dimensional label overlay
+                output_image_path = f"{output_dir}/Masked_{object_id}.tif"
+                with rasterio.open(
+                    output_image_path,
+                    "w",
+                    driver="GTiff",
+                    width=image_width,
+                    height=image_height,
+                    count=1,
+                    dtype=np.uint8,
+                    crs=src.crs,
+                    transform=image_transform,
+                ) as dst:
+                    dst.write(clipped_mask.astype(np.uint8), 1)
+\end{lstlisting}
+
+In the resulting ground truth masks, the pixel values are set to zero if they are outside of a wildfire polygon, indicating unaffected areas, and set to one if they are within the polygon boundaries, indicating areas affected by a forest wildfire. Figure \ref{fig4} displays an example of a resulting ground truth mask.
+
+
+\begin{figure}[H]
+    \centering
+    \includegraphics{label1.png}
+    \caption{Example of a resulting ground truth mask in a forested area affected by wildfires. The mask highlights wildfire-affected areas in yellow and unaffected areas in purple. This binary mask is used to train and validate deep learning models for accurate wildfire detection. \label{fig4}}
+\end{figure}
+
+
+\subsection{Image Segmentation and Data Preparation for Deep Learning Architectures}
+Next, the satellite images and their corresponding ground-truth masks are cropped into smaller tiles that maintain the imagery's spatial resolution (10 m). Often, satellite imagery needs to be resized and downscaled to accommodate deep learning (DL) architectures, which can result in the loss of essential details such as subtle indicators of early-stage wildfires. Moreover, using smaller images enhances the efficiency of DL models by lowering computational demands and speeding up training times \citep{hu2015transferring,marmanis2016deep}.
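As a rough, back-of-the-envelope check on tile footprints (assuming the 10 m pixel size stated above and the 15-mile regions of interest described earlier): each 256$\times$256-pixel tile used in the next step spans $256 \times 10 = 2560$ m, about 2.56 km per side, so a region of interest roughly 24 km across yields on the order of a $9 \times 9$ grid of full tiles.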
+
+To do this, a tile size of 256x256 pixels is specified and each RGB image is downloaded individually from Google Drive to a temporary local folder. Using Python’s rasterio library, the original RGB images are opened to obtain their dimensions. Then, the number of rows and columns for the tiles is calculated based on the chosen tile size. Next, a rasterio Window object is used to extract the corresponding portion from the original image and read the RGB data, ensuring the order of the bands (B4, B3, and B2).
+
+The segmented RGB tiles are then saved as GeoTIFF files using the tifffile Python library \citep{tifffile}. This is a critical step to maintain the integrity of the three-channel RGB data, as the rasterio library alone can alter the color of the images during saving. Additionally, the metadata of the saved tiles is updated to include georeferencing information and to modify parameters such as width, height, and transform (see Code \ref{lst:save_rgb_tiles}).
+
+A similar approach is used to segment the binary masks, specifying that the images contain only one band.
+\begin{lstlisting}[language=Python, caption={Function to Crop RGB Image Tiles}, label= lst:save_rgb_tiles]
+import os
+import rasterio
+from rasterio.windows import Window  # Window is provided by the rasterio.windows submodule
+from tifffile import imwrite
+
+# Function to save image tiles without changing the data type
+def save_rgb_tiles(image_path, output_folder, tile_size, parent_name):
+    # Open the source image file
+    with rasterio.open(image_path) as src:
+        height = src.height  # Get the height of the source image
+        width = src.width  # Get the width of the source image
+
+        # Calculate the number of tiles in both dimensions
+        num_rows = height // tile_size
+        num_cols = width // tile_size
+
+        tile_counter = 1  # Initialize the tile counter
+
+        # Iterate over the number of rows of tiles
+        for i in range(num_rows):
+            # Iterate over the number of columns of tiles
+            for j in range(num_cols):
+                # Define the window for the current tile
+                window = Window(j * tile_size, i * tile_size, tile_size, tile_size)
+
+                # Read the original data without modifications
+                # Assuming band order B4, B3, B2
+                tile = src.read((1, 2, 3), window=window)
+                # Create a unique name for the tile
+                tile_name = f"{parent_name}_tile_{tile_counter}.tif"
+                # Define the path to save the tile
+                tile_path = os.path.join(output_folder, tile_name)
+
+                # Save the tile using tifffile without changing
+                # data type
+                imwrite(tile_path, tile)
+
+                # Copy the metadata from the source image
+                meta = src.meta.copy()
+                # Get the transformation matrix for the current window
+                transform = src.window_transform(window)
+                # Update the metadata with the new dimensions and transformation
+                meta.update({
+                    'width': tile_size,
+                    'height': tile_size,
+                    'transform': transform
+                })
+                # Save the tile with updated metadata using rasterio
+                with rasterio.open(tile_path, 'w', **meta) as dst:
+                    dst.write(tile)
+                tile_counter += 1  # Increment the tile counter
+\end{lstlisting}
 \begin{figure}[H]
     \centering
     \includegraphics[width=0.9\textwidth]{crop.png}
-    \caption{Example of cropped pre- and post-wildfire images and their corresponding label}. \label{figcrop}
+    \caption{Example of cropped pre- and post-wildfire images and their corresponding label.} \label{figcrop}
 \end{figure}

 By combining the capabilities of rasterio for efficient geospatial data handling and the tifffile library for preserving the RGB data during saving, the original images are cropped into smaller RGB tiles.
This approach preserves the resolution and the georeferencing information of the images, preparing them to train DL applications. @@ -371,7 +371,7 @@ \subsection{VGG16 Implementation} \end{itemize} -\begin{lstlisting}[language=Python, caption={Custom Function to Feed GeoTIFF Files to the VGG16 Model: The function reads batches of 32 GeoTIFF files, shuffles them, and processes them into a three-dimensional array compatible with VGG16.}, label= lst:custom_generator] +\begin{lstlisting}[language=Python, caption={Custom Function to Feed GeoTIFF Files to the VGG16 Model: The function reads batches of 32 GeoTIFF files - shuffles them - and processes them into a three-dimensional array compatible with VGG16.}, label= lst:custom_generator] from sklearn.utils import shuffle # Define the base paths for training and testing @@ -471,7 +471,7 @@ \subsection{EfficientNet Implementation} \begin{figure}[H] \centering \includegraphics[width=0.9\textwidth]{6channel.png} - \caption{Representation of a 6 Channel RGB GeoTIFF Input. A: Representation of a 3-channel RGB GeoTIFF forested area \textit{before} a wildfire B: Visual example of a 3-channel RGB GeoTIFF forested area \textit{after} a wildfire.}. \label{fig1} + \caption{Representation of a 6 Channel RGB GeoTIFF Input. A: Representation of a 3-channel RGB GeoTIFF forested area \textit{before} a wildfire B: Visual example of a 3-channel RGB GeoTIFF forested area \textit{after} a wildfire.} \label{fig1} \end{figure} EfficientNet \citep{tan2019} is a CNN architecture that uniformly scales network width, depth, and resolution with a fixed set of scaling coefficients. EfficientNet’s architecture begins with a base model, EfficientNet-B0, designed to find the optimal baseline network configuration. The following versions of the network are further scaled versions of B0, offering multiple models for different computational budgets. @@ -667,6 +667,6 @@ \section{Conclusion} \bibliography{mybib} \bibliographystyle{unsrtnat} - - + + diff --git a/papers/Valeria_Martin/myst.yml b/papers/Valeria_Martin/myst.yml index 62cb8da3ef..1d8ca321a3 100644 --- a/papers/Valeria_Martin/myst.yml +++ b/papers/Valeria_Martin/myst.yml @@ -1,9 +1,11 @@ version: 1 +extends: ../proceedings.yml project: + doi: 10.25080/YADT7194 # Update this to match `scipy-2024-` the folder should be `` id: scipy-2024-Valeria_Martin title: Python-Based GeoImagery Dataset Development for Deep Learning-Driven Forest Wildfire Detection - subtitle: + description: In recent years, leveraging satellite imagery with deep learning architectures has become an effective approach for environmental monitoring tasks, including forest wildfire detection. This paper presents a Python-based methodology for gathering and using a labeled high-resolution satellite imagery dataset for forest wildfire detection. # Authors should have affiliations, emails and ORCIDs if available authors: - name: Valeria Martin @@ -15,7 +17,7 @@ project: email: jmorgan3@uwf.edu orcid: 0000-0003-2321-3765 affiliations: - - University of West Florida + - University of West Florida - name: K. 
Brent Venable email: bvenable@uwf.edu orcid: 0000-0002-1092-9759 @@ -84,16 +86,8 @@ project: - al-dabbagh2023uni - SEYDI2022108999 - Hunan - - 8113128 + - '8113128' - DBLP:RonnebergerFB15 - lecun - # A banner will be generated for you on publication, this is a placeholder - banner: banner.png - # The rest of the information shouldn't be modified - subject: Research Article - open_access: true - license: CC-BY-4.0 - venue: Scipy 2024 - date: 2024-07-10 site: - template: article-theme \ No newline at end of file + template: article-theme diff --git a/papers/Valeria_Martin/polygons.png b/papers/Valeria_Martin/polygons.png index e9f8036999..4f328ec4c9 100644 Binary files a/papers/Valeria_Martin/polygons.png and b/papers/Valeria_Martin/polygons.png differ diff --git a/papers/Valeria_Martin/thumbnail.png b/papers/Valeria_Martin/thumbnail.png new file mode 100644 index 0000000000..71a87c3515 Binary files /dev/null and b/papers/Valeria_Martin/thumbnail.png differ diff --git a/papers/alan_lujan/banner.png b/papers/alan_lujan/banner.png index e6a793bd6c..391831e724 100644 Binary files a/papers/alan_lujan/banner.png and b/papers/alan_lujan/banner.png differ diff --git a/papers/alan_lujan/main.md b/papers/alan_lujan/main.md index c6a88643fd..ffacee30d3 100644 --- a/papers/alan_lujan/main.md +++ b/papers/alan_lujan/main.md @@ -46,17 +46,17 @@ The arrangement of known data points, called the grid, significantly influences :label: tbl:grids :header-rows: 1 * - Grid - - Structure - - Geometry + - Structure + - Geometry * - Rectilinear - - Regular - - Rectangular mesh + - Regular + - Rectangular mesh * - Curvilinear - - Regular - - Quadrilateral mesh + - Regular + - Quadrilateral mesh * - Unstructured - - Irregular - - Random + - Irregular + - Random ``` ### Existing Interpolation Methods diff --git a/papers/alan_lujan/myst.yml b/papers/alan_lujan/myst.yml index 3b825b099e..64cbf6ffe5 100644 --- a/papers/alan_lujan/myst.yml +++ b/papers/alan_lujan/myst.yml @@ -1,12 +1,15 @@ version: 1 +extends: ../proceedings.yml site: template: article-theme project: + doi: 10.25080/FGCJ9164 # Update this to match `scipy-2024-` the folder should be `` id: scipy-2024-alan_lujan # Ensure your title is the same as in your `main.md` title: multinterp subtitle: A Unified Interface for Multivariate Interpolation in the Scientific Python Ecosystem + description: Multivariate interpolation is a fundamental tool in scientific computing used to approximate the values of a function between known data points in multiple dimensions. Despite its importance, the Python ecosystem offers a fragmented landscape of specialized tools for this task; the multinterp package was developed to address this challenge. 
# Authors should have affiliations, emails and ORCIDs if available authors: - name: Alan Lujan @@ -15,6 +18,7 @@ project: affiliations: - institution: Johns Hopkins University department: Department of Economics + ror: https://ror.org/00za53h95 - institution: Econ-ARK url: https://econ-ark.org/ corresponding: true @@ -29,6 +33,10 @@ project: # Add the abbreviations that you use in your paper here abbreviations: MyST: Markedly Structured Text + CPU: Central Processing Unit + GPU: Graphics Processing Unit + API: Application Programming Interface + RBF: Radial Basis Function # It is possible to explicitly ignore the `doi-exists` check for certain citation keys error_rules: - rule: doi-exists @@ -43,11 +51,3 @@ project: - Bradbury2018 - Pedregosa2011 - Paszke2019 - # A banner will be generated for you on publication, this is a placeholder - banner: banner.png - # The rest of the information shouldn't be modified - subject: Research Article - open_access: true - license: CC-BY-4.0 - venue: Scipy 2024 - date: 2024-07-10 diff --git a/papers/alan_lujan/thumbnail.png b/papers/alan_lujan/thumbnail.png new file mode 100644 index 0000000000..ac3eaf4a14 Binary files /dev/null and b/papers/alan_lujan/thumbnail.png differ diff --git a/papers/aleksandar_makelov/arxiv_template.bib b/papers/aleksandar_makelov/arxiv_template.bib deleted file mode 100644 index 95744c20fc..0000000000 --- a/papers/aleksandar_makelov/arxiv_template.bib +++ /dev/null @@ -1,11 +0,0 @@ -@inproceedings{Vaswani+2017, - author = {Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, \L ukasz and Polosukhin, Illia}, - booktitle = {Advances in Neural Information Processing Systems}, - pages = {}, - publisher = {Curran Associates, Inc.}, - title = {Attention is All you Need}, - url = {https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf}, - volume = {30}, - year = {2017} -} - diff --git a/papers/aleksandar_makelov/arxiv_template.bst b/papers/aleksandar_makelov/arxiv_template.bst deleted file mode 100644 index a85a0087d1..0000000000 --- a/papers/aleksandar_makelov/arxiv_template.bst +++ /dev/null @@ -1,1440 +0,0 @@ -%% File: `iclr2024.bst' -%% A copy of iclm2010.bst, which is a modification of `plainnl.bst' for use with natbib package -%% -%% Copyright 2010 Hal Daum\'e III -%% Modified by J. Fürnkranz -%% - Changed labels from (X and Y, 2000) to (X & Y, 2000) -%% -%% Copyright 1993-2007 Patrick W Daly -%% Max-Planck-Institut f\"ur Sonnensystemforschung -%% Max-Planck-Str. 2 -%% D-37191 Katlenburg-Lindau -%% Germany -%% E-mail: daly@mps.mpg.de -%% -%% This program can be redistributed and/or modified under the terms -%% of the LaTeX Project Public License Distributed from CTAN -%% archives in directory macros/latex/base/lppl.txt; either -%% version 1 of the License, or any later version. -%% - % Version and source file information: - % \ProvidesFile{icml2010.mbs}[2007/11/26 1.93 (PWD)] - % - % BibTeX `plainnat' family - % version 0.99b for BibTeX versions 0.99a or later, - % for LaTeX versions 2.09 and 2e. - % - % For use with the `natbib.sty' package; emulates the corresponding - % member of the `plain' family, but with author-year citations. - % - % With version 6.0 of `natbib.sty', it may also be used for numerical - % citations, while retaining the commands \citeauthor, \citefullauthor, - % and \citeyear to print the corresponding information. 
- % - % For version 7.0 of `natbib.sty', the KEY field replaces missing - % authors/editors, and the date is left blank in \bibitem. - % - % Includes field EID for the sequence/citation number of electronic journals - % which is used instead of page numbers. - % - % Includes fields ISBN and ISSN. - % - % Includes field URL for Internet addresses. - % - % Includes field DOI for Digital Object Idenfifiers. - % - % Works best with the url.sty package of Donald Arseneau. - % - % Works with identical authors and year are further sorted by - % citation key, to preserve any natural sequence. - % -ENTRY - { address - author - booktitle - chapter - doi - eid - edition - editor - howpublished - institution - isbn - issn - journal - key - month - note - number - organization - pages - publisher - school - series - title - type - url - volume - year - } - {} - { label extra.label sort.label short.list } - -INTEGERS { output.state before.all mid.sentence after.sentence after.block } - -FUNCTION {init.state.consts} -{ #0 'before.all := - #1 'mid.sentence := - #2 'after.sentence := - #3 'after.block := -} - -STRINGS { s t } - -FUNCTION {output.nonnull} -{ 's := - output.state mid.sentence = - { ", " * write$ } - { output.state after.block = - { add.period$ write$ - newline$ - "\newblock " write$ - } - { output.state before.all = - 'write$ - { add.period$ " " * write$ } - if$ - } - if$ - mid.sentence 'output.state := - } - if$ - s -} - -FUNCTION {output} -{ duplicate$ empty$ - 'pop$ - 'output.nonnull - if$ -} - -FUNCTION {output.check} -{ 't := - duplicate$ empty$ - { pop$ "empty " t * " in " * cite$ * warning$ } - 'output.nonnull - if$ -} - -FUNCTION {fin.entry} -{ add.period$ - write$ - newline$ -} - -FUNCTION {new.block} -{ output.state before.all = - 'skip$ - { after.block 'output.state := } - if$ -} - -FUNCTION {new.sentence} -{ output.state after.block = - 'skip$ - { output.state before.all = - 'skip$ - { after.sentence 'output.state := } - if$ - } - if$ -} - -FUNCTION {not} -{ { #0 } - { #1 } - if$ -} - -FUNCTION {and} -{ 'skip$ - { pop$ #0 } - if$ -} - -FUNCTION {or} -{ { pop$ #1 } - 'skip$ - if$ -} - -FUNCTION {new.block.checka} -{ empty$ - 'skip$ - 'new.block - if$ -} - -FUNCTION {new.block.checkb} -{ empty$ - swap$ empty$ - and - 'skip$ - 'new.block - if$ -} - -FUNCTION {new.sentence.checka} -{ empty$ - 'skip$ - 'new.sentence - if$ -} - -FUNCTION {new.sentence.checkb} -{ empty$ - swap$ empty$ - and - 'skip$ - 'new.sentence - if$ -} - -FUNCTION {field.or.null} -{ duplicate$ empty$ - { pop$ "" } - 'skip$ - if$ -} - -FUNCTION {emphasize} -{ duplicate$ empty$ - { pop$ "" } - { "\emph{" swap$ * "}" * } - if$ -} - -INTEGERS { nameptr namesleft numnames } - -FUNCTION {format.names} -{ 's := - #1 'nameptr := - s num.names$ 'numnames := - numnames 'namesleft := - { namesleft #0 > } - { s nameptr "{ff~}{vv~}{ll}{, jj}" format.name$ 't := - nameptr #1 > - { namesleft #1 > - { ", " * t * } - { numnames #2 > - { "," * } - 'skip$ - if$ - t "others" = - { " et~al." 
* } - { " and " * t * } - if$ - } - if$ - } - 't - if$ - nameptr #1 + 'nameptr := - namesleft #1 - 'namesleft := - } - while$ -} - -FUNCTION {format.key} -{ empty$ - { key field.or.null } - { "" } - if$ -} - -FUNCTION {format.authors} -{ author empty$ - { "" } - { author format.names } - if$ -} - -FUNCTION {format.editors} -{ editor empty$ - { "" } - { editor format.names - editor num.names$ #1 > - { " (eds.)" * } - { " (ed.)" * } - if$ - } - if$ -} - -FUNCTION {format.isbn} -{ isbn empty$ - { "" } - { new.block "ISBN " isbn * } - if$ -} - -FUNCTION {format.issn} -{ issn empty$ - { "" } - { new.block "ISSN " issn * } - if$ -} - -FUNCTION {format.url} -{ url empty$ - { "" } - { new.block "URL \url{" url * "}" * } - if$ -} - -FUNCTION {format.doi} -{ doi empty$ - { "" } - { new.block "\doi{" doi * "}" * } - if$ -} - -FUNCTION {format.title} -{ title empty$ - { "" } - { title "t" change.case$ } - if$ -} - -FUNCTION {format.full.names} -{'s := - #1 'nameptr := - s num.names$ 'numnames := - numnames 'namesleft := - { namesleft #0 > } - { s nameptr - "{vv~}{ll}" format.name$ 't := - nameptr #1 > - { - namesleft #1 > - { ", " * t * } - { - numnames #2 > - { "," * } - 'skip$ - if$ - t "others" = - { " et~al." * } - { " and " * t * } - if$ - } - if$ - } - 't - if$ - nameptr #1 + 'nameptr := - namesleft #1 - 'namesleft := - } - while$ -} - -FUNCTION {author.editor.full} -{ author empty$ - { editor empty$ - { "" } - { editor format.full.names } - if$ - } - { author format.full.names } - if$ -} - -FUNCTION {author.full} -{ author empty$ - { "" } - { author format.full.names } - if$ -} - -FUNCTION {editor.full} -{ editor empty$ - { "" } - { editor format.full.names } - if$ -} - -FUNCTION {make.full.names} -{ type$ "book" = - type$ "inbook" = - or - 'author.editor.full - { type$ "proceedings" = - 'editor.full - 'author.full - if$ - } - if$ -} - -FUNCTION {output.bibitem} -{ newline$ - "\bibitem[" write$ - label write$ - ")" make.full.names duplicate$ short.list = - { pop$ } - { * } - if$ - "]{" * write$ - cite$ write$ - "}" write$ - newline$ - "" - before.all 'output.state := -} - -FUNCTION {n.dashify} -{ 't := - "" - { t empty$ not } - { t #1 #1 substring$ "-" = - { t #1 #2 substring$ "--" = not - { "--" * - t #2 global.max$ substring$ 't := - } - { { t #1 #1 substring$ "-" = } - { "-" * - t #2 global.max$ substring$ 't := - } - while$ - } - if$ - } - { t #1 #1 substring$ * - t #2 global.max$ substring$ 't := - } - if$ - } - while$ -} - -FUNCTION {format.date} -{ year duplicate$ empty$ - { "empty year in " cite$ * warning$ - pop$ "" } - 'skip$ - if$ - month empty$ - 'skip$ - { month - " " * swap$ * - } - if$ - extra.label * -} - -FUNCTION {format.btitle} -{ title emphasize -} - -FUNCTION {tie.or.space.connect} -{ duplicate$ text.length$ #3 < - { "~" } - { " " } - if$ - swap$ * * -} - -FUNCTION {either.or.check} -{ empty$ - 'pop$ - { "can't use both " swap$ * " fields in " * cite$ * warning$ } - if$ -} - -FUNCTION {format.bvolume} -{ volume empty$ - { "" } - { "volume" volume tie.or.space.connect - series empty$ - 'skip$ - { " of " * series emphasize * } - if$ - "volume and number" number either.or.check - } - if$ -} - -FUNCTION {format.number.series} -{ volume empty$ - { number empty$ - { series field.or.null } - { output.state mid.sentence = - { "number" } - { "Number" } - if$ - number tie.or.space.connect - series empty$ - { "there's a number but no series in " cite$ * warning$ } - { " in " * series * } - if$ - } - if$ - } - { "" } - if$ -} - -FUNCTION {format.edition} -{ edition empty$ - { "" } - { 
output.state mid.sentence = - { edition "l" change.case$ " edition" * } - { edition "t" change.case$ " edition" * } - if$ - } - if$ -} - -INTEGERS { multiresult } - -FUNCTION {multi.page.check} -{ 't := - #0 'multiresult := - { multiresult not - t empty$ not - and - } - { t #1 #1 substring$ - duplicate$ "-" = - swap$ duplicate$ "," = - swap$ "+" = - or or - { #1 'multiresult := } - { t #2 global.max$ substring$ 't := } - if$ - } - while$ - multiresult -} - -FUNCTION {format.pages} -{ pages empty$ - { "" } - { pages multi.page.check - { "pp.\ " pages n.dashify tie.or.space.connect } - { "pp.\ " pages tie.or.space.connect } - if$ - } - if$ -} - -FUNCTION {format.eid} -{ eid empty$ - { "" } - { "art." eid tie.or.space.connect } - if$ -} - -FUNCTION {format.vol.num.pages} -{ volume field.or.null - number empty$ - 'skip$ - { "\penalty0 (" number * ")" * * - volume empty$ - { "there's a number but no volume in " cite$ * warning$ } - 'skip$ - if$ - } - if$ - pages empty$ - 'skip$ - { duplicate$ empty$ - { pop$ format.pages } - { ":\penalty0 " * pages n.dashify * } - if$ - } - if$ -} - -FUNCTION {format.vol.num.eid} -{ volume field.or.null - number empty$ - 'skip$ - { "\penalty0 (" number * ")" * * - volume empty$ - { "there's a number but no volume in " cite$ * warning$ } - 'skip$ - if$ - } - if$ - eid empty$ - 'skip$ - { duplicate$ empty$ - { pop$ format.eid } - { ":\penalty0 " * eid * } - if$ - } - if$ -} - -FUNCTION {format.chapter.pages} -{ chapter empty$ - 'format.pages - { type empty$ - { "chapter" } - { type "l" change.case$ } - if$ - chapter tie.or.space.connect - pages empty$ - 'skip$ - { ", " * format.pages * } - if$ - } - if$ -} - -FUNCTION {format.in.ed.booktitle} -{ booktitle empty$ - { "" } - { editor empty$ - { "In " booktitle emphasize * } - { "In " format.editors * ", " * booktitle emphasize * } - if$ - } - if$ -} - -FUNCTION {empty.misc.check} -{ author empty$ title empty$ howpublished empty$ - month empty$ year empty$ note empty$ - and and and and and - key empty$ not and - { "all relevant fields are empty in " cite$ * warning$ } - 'skip$ - if$ -} - -FUNCTION {format.thesis.type} -{ type empty$ - 'skip$ - { pop$ - type "t" change.case$ - } - if$ -} - -FUNCTION {format.tr.number} -{ type empty$ - { "Technical Report" } - 'type - if$ - number empty$ - { "t" change.case$ } - { number tie.or.space.connect } - if$ -} - -FUNCTION {format.article.crossref} -{ key empty$ - { journal empty$ - { "need key or journal for " cite$ * " to crossref " * crossref * - warning$ - "" - } - { "In \emph{" journal * "}" * } - if$ - } - { "In " } - if$ - " \citet{" * crossref * "}" * -} - -FUNCTION {format.book.crossref} -{ volume empty$ - { "empty volume in " cite$ * "'s crossref of " * crossref * warning$ - "In " - } - { "Volume" volume tie.or.space.connect - " of " * - } - if$ - editor empty$ - editor field.or.null author field.or.null = - or - { key empty$ - { series empty$ - { "need editor, key, or series for " cite$ * " to crossref " * - crossref * warning$ - "" * - } - { "\emph{" * series * "}" * } - if$ - } - 'skip$ - if$ - } - 'skip$ - if$ - " \citet{" * crossref * "}" * -} - -FUNCTION {format.incoll.inproc.crossref} -{ editor empty$ - editor field.or.null author field.or.null = - or - { key empty$ - { booktitle empty$ - { "need editor, key, or booktitle for " cite$ * " to crossref " * - crossref * warning$ - "" - } - { "In \emph{" booktitle * "}" * } - if$ - } - { "In " } - if$ - } - { "In " } - if$ - " \citet{" * crossref * "}" * -} - -FUNCTION {article} -{ output.bibitem - format.authors 
"author" output.check - author format.key output - new.block - format.title "title" output.check - new.block - crossref missing$ - { journal emphasize "journal" output.check - eid empty$ - { format.vol.num.pages output } - { format.vol.num.eid output } - if$ - format.date "year" output.check - } - { format.article.crossref output.nonnull - eid empty$ - { format.pages output } - { format.eid output } - if$ - } - if$ - format.issn output - format.doi output - format.url output - new.block - note output - fin.entry -} - -FUNCTION {book} -{ output.bibitem - author empty$ - { format.editors "author and editor" output.check - editor format.key output - } - { format.authors output.nonnull - crossref missing$ - { "author and editor" editor either.or.check } - 'skip$ - if$ - } - if$ - new.block - format.btitle "title" output.check - crossref missing$ - { format.bvolume output - new.block - format.number.series output - new.sentence - publisher "publisher" output.check - address output - } - { new.block - format.book.crossref output.nonnull - } - if$ - format.edition output - format.date "year" output.check - format.isbn output - format.doi output - format.url output - new.block - note output - fin.entry -} - -FUNCTION {booklet} -{ output.bibitem - format.authors output - author format.key output - new.block - format.title "title" output.check - howpublished address new.block.checkb - howpublished output - address output - format.date output - format.isbn output - format.doi output - format.url output - new.block - note output - fin.entry -} - -FUNCTION {inbook} -{ output.bibitem - author empty$ - { format.editors "author and editor" output.check - editor format.key output - } - { format.authors output.nonnull - crossref missing$ - { "author and editor" editor either.or.check } - 'skip$ - if$ - } - if$ - new.block - format.btitle "title" output.check - crossref missing$ - { format.bvolume output - format.chapter.pages "chapter and pages" output.check - new.block - format.number.series output - new.sentence - publisher "publisher" output.check - address output - } - { format.chapter.pages "chapter and pages" output.check - new.block - format.book.crossref output.nonnull - } - if$ - format.edition output - format.date "year" output.check - format.isbn output - format.doi output - format.url output - new.block - note output - fin.entry -} - -FUNCTION {incollection} -{ output.bibitem - format.authors "author" output.check - author format.key output - new.block - format.title "title" output.check - new.block - crossref missing$ - { format.in.ed.booktitle "booktitle" output.check - format.bvolume output - format.number.series output - format.chapter.pages output - new.sentence - publisher "publisher" output.check - address output - format.edition output - format.date "year" output.check - } - { format.incoll.inproc.crossref output.nonnull - format.chapter.pages output - } - if$ - format.isbn output - format.doi output - format.url output - new.block - note output - fin.entry -} - -FUNCTION {inproceedings} -{ output.bibitem - format.authors "author" output.check - author format.key output - new.block - format.title "title" output.check - new.block - crossref missing$ - { format.in.ed.booktitle "booktitle" output.check - format.bvolume output - format.number.series output - format.pages output - address empty$ - { organization publisher new.sentence.checkb - organization output - publisher output - format.date "year" output.check - } - { address output.nonnull - format.date "year" output.check - 
new.sentence - organization output - publisher output - } - if$ - } - { format.incoll.inproc.crossref output.nonnull - format.pages output - } - if$ - format.isbn output - format.doi output - format.url output - new.block - note output - fin.entry -} - -FUNCTION {conference} { inproceedings } - -FUNCTION {manual} -{ output.bibitem - format.authors output - author format.key output - new.block - format.btitle "title" output.check - organization address new.block.checkb - organization output - address output - format.edition output - format.date output - format.url output - new.block - note output - fin.entry -} - -FUNCTION {mastersthesis} -{ output.bibitem - format.authors "author" output.check - author format.key output - new.block - format.title "title" output.check - new.block - "Master's thesis" format.thesis.type output.nonnull - school "school" output.check - address output - format.date "year" output.check - format.url output - new.block - note output - fin.entry -} - -FUNCTION {misc} -{ output.bibitem - format.authors output - author format.key output - title howpublished new.block.checkb - format.title output - howpublished new.block.checka - howpublished output - format.date output - format.issn output - format.url output - new.block - note output - fin.entry - empty.misc.check -} - -FUNCTION {phdthesis} -{ output.bibitem - format.authors "author" output.check - author format.key output - new.block - format.btitle "title" output.check - new.block - "PhD thesis" format.thesis.type output.nonnull - school "school" output.check - address output - format.date "year" output.check - format.url output - new.block - note output - fin.entry -} - -FUNCTION {proceedings} -{ output.bibitem - format.editors output - editor format.key output - new.block - format.btitle "title" output.check - format.bvolume output - format.number.series output - address output - format.date "year" output.check - new.sentence - organization output - publisher output - format.isbn output - format.doi output - format.url output - new.block - note output - fin.entry -} - -FUNCTION {techreport} -{ output.bibitem - format.authors "author" output.check - author format.key output - new.block - format.title "title" output.check - new.block - format.tr.number output.nonnull - institution "institution" output.check - address output - format.date "year" output.check - format.url output - new.block - note output - fin.entry -} - -FUNCTION {unpublished} -{ output.bibitem - format.authors "author" output.check - author format.key output - new.block - format.title "title" output.check - new.block - note "note" output.check - format.date output - format.url output - fin.entry -} - -FUNCTION {default.type} { misc } - - -MACRO {jan} {"January"} - -MACRO {feb} {"February"} - -MACRO {mar} {"March"} - -MACRO {apr} {"April"} - -MACRO {may} {"May"} - -MACRO {jun} {"June"} - -MACRO {jul} {"July"} - -MACRO {aug} {"August"} - -MACRO {sep} {"September"} - -MACRO {oct} {"October"} - -MACRO {nov} {"November"} - -MACRO {dec} {"December"} - - - -MACRO {acmcs} {"ACM Computing Surveys"} - -MACRO {acta} {"Acta Informatica"} - -MACRO {cacm} {"Communications of the ACM"} - -MACRO {ibmjrd} {"IBM Journal of Research and Development"} - -MACRO {ibmsj} {"IBM Systems Journal"} - -MACRO {ieeese} {"IEEE Transactions on Software Engineering"} - -MACRO {ieeetc} {"IEEE Transactions on Computers"} - -MACRO {ieeetcad} - {"IEEE Transactions on Computer-Aided Design of Integrated Circuits"} - -MACRO {ipl} {"Information Processing Letters"} - -MACRO {jacm} 
{"Journal of the ACM"} - -MACRO {jcss} {"Journal of Computer and System Sciences"} - -MACRO {scp} {"Science of Computer Programming"} - -MACRO {sicomp} {"SIAM Journal on Computing"} - -MACRO {tocs} {"ACM Transactions on Computer Systems"} - -MACRO {tods} {"ACM Transactions on Database Systems"} - -MACRO {tog} {"ACM Transactions on Graphics"} - -MACRO {toms} {"ACM Transactions on Mathematical Software"} - -MACRO {toois} {"ACM Transactions on Office Information Systems"} - -MACRO {toplas} {"ACM Transactions on Programming Languages and Systems"} - -MACRO {tcs} {"Theoretical Computer Science"} - - -READ - -FUNCTION {sortify} -{ purify$ - "l" change.case$ -} - -INTEGERS { len } - -FUNCTION {chop.word} -{ 's := - 'len := - s #1 len substring$ = - { s len #1 + global.max$ substring$ } - 's - if$ -} - -FUNCTION {format.lab.names} -{ 's := - s #1 "{vv~}{ll}" format.name$ - s num.names$ duplicate$ - #2 > - { pop$ " et~al." * } - { #2 < - 'skip$ - { s #2 "{ff }{vv }{ll}{ jj}" format.name$ "others" = - { " et~al." * } - { " \& " * s #2 "{vv~}{ll}" format.name$ * } - if$ - } - if$ - } - if$ -} - -FUNCTION {author.key.label} -{ author empty$ - { key empty$ - { cite$ #1 #3 substring$ } - 'key - if$ - } - { author format.lab.names } - if$ -} - -FUNCTION {author.editor.key.label} -{ author empty$ - { editor empty$ - { key empty$ - { cite$ #1 #3 substring$ } - 'key - if$ - } - { editor format.lab.names } - if$ - } - { author format.lab.names } - if$ -} - -FUNCTION {author.key.organization.label} -{ author empty$ - { key empty$ - { organization empty$ - { cite$ #1 #3 substring$ } - { "The " #4 organization chop.word #3 text.prefix$ } - if$ - } - 'key - if$ - } - { author format.lab.names } - if$ -} - -FUNCTION {editor.key.organization.label} -{ editor empty$ - { key empty$ - { organization empty$ - { cite$ #1 #3 substring$ } - { "The " #4 organization chop.word #3 text.prefix$ } - if$ - } - 'key - if$ - } - { editor format.lab.names } - if$ -} - -FUNCTION {calc.short.authors} -{ type$ "book" = - type$ "inbook" = - or - 'author.editor.key.label - { type$ "proceedings" = - 'editor.key.organization.label - { type$ "manual" = - 'author.key.organization.label - 'author.key.label - if$ - } - if$ - } - if$ - 'short.list := -} - -FUNCTION {calc.label} -{ calc.short.authors - short.list - "(" - * - year duplicate$ empty$ - short.list key field.or.null = or - { pop$ "" } - 'skip$ - if$ - * - 'label := -} - -FUNCTION {sort.format.names} -{ 's := - #1 'nameptr := - "" - s num.names$ 'numnames := - numnames 'namesleft := - { namesleft #0 > } - { - s nameptr "{vv{ } }{ll{ }}{ ff{ }}{ jj{ }}" format.name$ 't := - nameptr #1 > - { - " " * - namesleft #1 = t "others" = and - { "zzzzz" * } - { numnames #2 > nameptr #2 = and - { "zz" * year field.or.null * " " * } - 'skip$ - if$ - t sortify * - } - if$ - } - { t sortify * } - if$ - nameptr #1 + 'nameptr := - namesleft #1 - 'namesleft := - } - while$ -} - -FUNCTION {sort.format.title} -{ 't := - "A " #2 - "An " #3 - "The " #4 t chop.word - chop.word - chop.word - sortify - #1 global.max$ substring$ -} - -FUNCTION {author.sort} -{ author empty$ - { key empty$ - { "to sort, need author or key in " cite$ * warning$ - "" - } - { key sortify } - if$ - } - { author sort.format.names } - if$ -} - -FUNCTION {author.editor.sort} -{ author empty$ - { editor empty$ - { key empty$ - { "to sort, need author, editor, or key in " cite$ * warning$ - "" - } - { key sortify } - if$ - } - { editor sort.format.names } - if$ - } - { author sort.format.names } - if$ -} - -FUNCTION 
{author.organization.sort} -{ author empty$ - { organization empty$ - { key empty$ - { "to sort, need author, organization, or key in " cite$ * warning$ - "" - } - { key sortify } - if$ - } - { "The " #4 organization chop.word sortify } - if$ - } - { author sort.format.names } - if$ -} - -FUNCTION {editor.organization.sort} -{ editor empty$ - { organization empty$ - { key empty$ - { "to sort, need editor, organization, or key in " cite$ * warning$ - "" - } - { key sortify } - if$ - } - { "The " #4 organization chop.word sortify } - if$ - } - { editor sort.format.names } - if$ -} - - -FUNCTION {presort} -{ calc.label - label sortify - " " - * - type$ "book" = - type$ "inbook" = - or - 'author.editor.sort - { type$ "proceedings" = - 'editor.organization.sort - { type$ "manual" = - 'author.organization.sort - 'author.sort - if$ - } - if$ - } - if$ - " " - * - year field.or.null sortify - * - " " - * - cite$ - * - #1 entry.max$ substring$ - 'sort.label := - sort.label * - #1 entry.max$ substring$ - 'sort.key$ := -} - -ITERATE {presort} - -SORT - -STRINGS { longest.label last.label next.extra } - -INTEGERS { longest.label.width last.extra.num number.label } - -FUNCTION {initialize.longest.label} -{ "" 'longest.label := - #0 int.to.chr$ 'last.label := - "" 'next.extra := - #0 'longest.label.width := - #0 'last.extra.num := - #0 'number.label := -} - -FUNCTION {forward.pass} -{ last.label label = - { last.extra.num #1 + 'last.extra.num := - last.extra.num int.to.chr$ 'extra.label := - } - { "a" chr.to.int$ 'last.extra.num := - "" 'extra.label := - label 'last.label := - } - if$ - number.label #1 + 'number.label := -} - -FUNCTION {reverse.pass} -{ next.extra "b" = - { "a" 'extra.label := } - 'skip$ - if$ - extra.label 'next.extra := - extra.label - duplicate$ empty$ - 'skip$ - { "{\natexlab{" swap$ * "}}" * } - if$ - 'extra.label := - label extra.label * 'label := -} - -EXECUTE {initialize.longest.label} - -ITERATE {forward.pass} - -REVERSE {reverse.pass} - -FUNCTION {bib.sort.order} -{ sort.label 'sort.key$ := -} - -ITERATE {bib.sort.order} - -SORT - -FUNCTION {begin.bib} -{ preamble$ empty$ - 'skip$ - { preamble$ write$ newline$ } - if$ - "\begin{thebibliography}{" number.label int.to.str$ * "}" * - write$ newline$ - "\providecommand{\natexlab}[1]{#1}" - write$ newline$ - "\providecommand{\url}[1]{\texttt{#1}}" - write$ newline$ - "\expandafter\ifx\csname urlstyle\endcsname\relax" - write$ newline$ - " \providecommand{\doi}[1]{doi: #1}\else" - write$ newline$ - " \providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi" - write$ newline$ -} - -EXECUTE {begin.bib} - -EXECUTE {init.state.consts} - -ITERATE {call.type$} - -FUNCTION {end.bib} -{ newline$ - "\end{thebibliography}" write$ newline$ -} - -EXECUTE {end.bib} diff --git a/papers/aleksandar_makelov/arxiv_template.sty b/papers/aleksandar_makelov/arxiv_template.sty deleted file mode 100644 index 84c7b6b33e..0000000000 --- a/papers/aleksandar_makelov/arxiv_template.sty +++ /dev/null @@ -1,252 +0,0 @@ -%%%% COLM Macros (LaTex) -%%%% Adapted by Hugo Larochelle from the NIPS stylefile Macros -%%%% Style File -%%%% Dec 12, 1990 Rev Aug 14, 1991; Sept, 1995; April, 1997; April, 1999; October 2014 - -% This file can be used with Latex2e whether running in main mode, or -% 2.09 compatibility mode. 
-% -% If using main mode, you need to include the commands -% \documentclass{article} -% \usepackage{colm14submit_e} -% - -% Palatino font -\RequirePackage{tgpagella} % text only -\RequirePackage{mathpazo} % math & text - - -% Change the overall width of the page. If these parameters are -% changed, they will require corresponding changes in the -% maketitle section. -% -\usepackage{eso-pic} % used by \AddToShipoutPicture -\RequirePackage{fancyhdr} -\RequirePackage{natbib} - -% modification to natbib citations -\setcitestyle{authoryear,round,citesep={;},aysep={,},yysep={;}} - -\renewcommand{\topfraction}{0.95} % let figure take up nearly whole page -\renewcommand{\textfraction}{0.05} % let figure take up nearly whole page - -% Define colmfinal, set to true if colmfinalcopy is defined -\newif\ifcolmfinal -\colmfinalfalse -\def\colmfinalcopy{\colmfinaltrue} -\font\colmtenhv = phvb at 8pt - -% Specify the dimensions of each page - -\setlength{\paperheight}{11in} -\setlength{\paperwidth}{8.5in} - - -\oddsidemargin .5in % Note \oddsidemargin = \evensidemargin -\evensidemargin .5in -\marginparwidth 0.07 true in -%\marginparwidth 0.75 true in -%\topmargin 0 true pt % Nominal distance from top of page to top of -%\topmargin 0.125in -\topmargin -0.625in -\addtolength{\headsep}{0.25in} -\textheight 9.0 true in % Height of text (including footnotes & figures) -\textwidth 5.5 true in % Width of text line. -\widowpenalty=10000 -\clubpenalty=10000 - -% \thispagestyle{empty} \pagestyle{empty} -\flushbottom \sloppy - -% We're never going to need a table of contents, so just flush it to -% save space --- suggested by drstrip@sandia-2 -\def\addcontentsline#1#2#3{} - -% Title stuff, taken from deproc. -\def\maketitle{\par -\begingroup - \def\thefootnote{\fnsymbol{footnote}} - \def\@makefnmark{\hbox to 0pt{$^{\@thefnmark}$\hss}} % for perfect author - % name centering -% The footnote-mark was overlapping the footnote-text, -% added the following to fix this problem (MK) - \long\def\@makefntext##1{\parindent 1em\noindent - \hbox to1.8em{\hss $\m@th ^{\@thefnmark}$}##1} - \@maketitle \@thanks -\endgroup -\setcounter{footnote}{0} -\let\maketitle\relax \let\@maketitle\relax -\gdef\@thanks{}\gdef\@author{}\gdef\@title{}\let\thanks\relax} - -% The toptitlebar has been raised to top-justify the first page - -\usepackage{fancyhdr} -\pagestyle{fancy} -% \renewcommand{\headrulewidth}{1.5pt} -\renewcommand{\headrulewidth}{0pt} -\fancyhead{} - -% Title (includes both anonimized and non-anonimized versions) -\def\@maketitle{\vbox{\hsize\textwidth -%\linewidth\hsize \vskip 0.1in \toptitlebar \centering -{\Large\bf \@title\par} -%\bottomtitlebar % \vskip 0.1in % minus -\ifcolmfinal - % \lhead{Preprint. 
Under review.} - \def\And{\end{tabular}\hfil\linebreak[0]\hfil - \begin{tabular}[t]{l}\bf\rule{\z@}{24pt}\ignorespaces}% - \def\AND{\end{tabular}\hfil\linebreak[4]\hfil - \begin{tabular}[t]{l}\bf\rule{\z@}{24pt}\ignorespaces}% - \begin{tabular}[t]{l}\bf\rule{\z@}{24pt}\@author\end{tabular}% -\else - \lhead{Under review as a conference paper at COLM 2024} - \def\And{\end{tabular}\hfil\linebreak[0]\hfil - \begin{tabular}[t]{l}\bf\rule{\z@}{24pt}\ignorespaces}% - \def\AND{\end{tabular}\hfil\linebreak[4]\hfil - \begin{tabular}[t]{l}\bf\rule{\z@}{24pt}\ignorespaces}% - \begin{tabular}[t]{l}\bf\rule{\z@}{24pt}Anonymous authors\\Paper under double-blind review\end{tabular}% -\fi -\vskip 0.3in minus 0.1in}} - -\renewenvironment{abstract}{\vskip.075in\centerline{\large\bf -Abstract}\vspace{0.5ex}\begin{quote}}{\par\end{quote}\vskip 1ex} - -% sections with less space -\def\section{\@startsection {section}{1}{\z@}{-2.0ex plus - -0.5ex minus -.2ex}{1.5ex plus 0.3ex -minus0.2ex}{\large\bf\raggedright}} - -\def\subsection{\@startsection{subsection}{2}{\z@}{-1.8ex plus --0.5ex minus -.2ex}{0.8ex plus .2ex}{\normalsize\raggedright}} -\def\subsubsection{\@startsection{subsubsection}{3}{\z@}{-1.5ex -plus -0.5ex minus -.2ex}{0.5ex plus -.2ex}{\normalsize\raggedright}} -\def\paragraph{\@startsection{paragraph}{4}{\z@}{1.5ex plus -0.5ex minus .2ex}{-1em}{\normalsize\bf}} -\def\subparagraph{\@startsection{subparagraph}{5}{\z@}{1.5ex plus - 0.5ex minus .2ex}{-1em}{\normalsize}} -\def\subsubsubsection{\vskip -5pt{\noindent\normalsize\rm\raggedright}} - - -% Footnotes -\footnotesep 6.65pt % -\skip\footins 9pt plus 4pt minus 2pt -\def\footnoterule{\kern-3pt \hrule width 12pc \kern 2.6pt } -\setcounter{footnote}{0} - -% Lists and paragraphs -\parindent 0pt -\topsep 4pt plus 1pt minus 2pt -\partopsep 1pt plus 0.5pt minus 0.5pt -\itemsep 2pt plus 1pt minus 0.5pt -\parsep 2pt plus 1pt minus 0.5pt -\parskip .5pc - - -%\leftmargin2em -\leftmargin3pc -\leftmargini\leftmargin \leftmarginii 2em -\leftmarginiii 1.5em \leftmarginiv 1.0em \leftmarginv .5em - -%\labelsep \labelsep 5pt - -\def\@listi{\leftmargin\leftmargini} -\def\@listii{\leftmargin\leftmarginii - \labelwidth\leftmarginii\advance\labelwidth-\labelsep - \topsep 2pt plus 1pt minus 0.5pt - \parsep 1pt plus 0.5pt minus 0.5pt - \itemsep \parsep} -\def\@listiii{\leftmargin\leftmarginiii - \labelwidth\leftmarginiii\advance\labelwidth-\labelsep - \topsep 1pt plus 0.5pt minus 0.5pt - \parsep \z@ \partopsep 0.5pt plus 0pt minus 0.5pt - \itemsep \topsep} -\def\@listiv{\leftmargin\leftmarginiv - \labelwidth\leftmarginiv\advance\labelwidth-\labelsep} -\def\@listv{\leftmargin\leftmarginv - \labelwidth\leftmarginv\advance\labelwidth-\labelsep} -\def\@listvi{\leftmargin\leftmarginvi - \labelwidth\leftmarginvi\advance\labelwidth-\labelsep} - -\abovedisplayskip 7pt plus2pt minus5pt% -\belowdisplayskip \abovedisplayskip -\abovedisplayshortskip 0pt plus3pt% -\belowdisplayshortskip 4pt plus3pt minus3pt% - -% Less leading in most fonts (due to the narrow columns) -% The choices were between 1-pt and 1.5-pt leading -%\def\@normalsize{\@setsize\normalsize{11pt}\xpt\@xpt} % got rid of @ (MK) -\def\normalsize{\@setsize\normalsize{11pt}\xpt\@xpt} -\def\small{\@setsize\small{10pt}\ixpt\@ixpt} -\def\footnotesize{\@setsize\footnotesize{10pt}\ixpt\@ixpt} -\def\scriptsize{\@setsize\scriptsize{8pt}\viipt\@viipt} -\def\tiny{\@setsize\tiny{7pt}\vipt\@vipt} -\def\large{\@setsize\large{14pt}\xiipt\@xiipt} -\def\Large{\@setsize\Large{16pt}\xivpt\@xivpt} 
-\def\LARGE{\@setsize\LARGE{20pt}\xviipt\@xviipt} -\def\huge{\@setsize\huge{23pt}\xxpt\@xxpt} -\def\Huge{\@setsize\Huge{28pt}\xxvpt\@xxvpt} - -\def\toptitlebar{\hrule height4pt\vskip .25in\vskip-\parskip} - -\def\bottomtitlebar{\vskip .29in\vskip-\parskip\hrule height1pt\vskip -.09in} % -%Reduced second vskip to compensate for adding the strut in \@author - - -%% % Vertical Ruler -%% % This code is, largely, from the CVPR 2010 conference style file -%% % ----- define vruler -%% \makeatletter -%% \newbox\colmrulerbox -%% \newcount\colmrulercount -%% \newdimen\colmruleroffset -%% \newdimen\cv@lineheight -%% \newdimen\cv@boxheight -%% \newbox\cv@tmpbox -%% \newcount\cv@refno -%% \newcount\cv@tot -%% % NUMBER with left flushed zeros \fillzeros[] -%% \newcount\cv@tmpc@ \newcount\cv@tmpc -%% \def\fillzeros[#1]#2{\cv@tmpc@=#2\relax\ifnum\cv@tmpc@<0\cv@tmpc@=-\cv@tmpc@\fi -%% \cv@tmpc=1 % -%% \loop\ifnum\cv@tmpc@<10 \else \divide\cv@tmpc@ by 10 \advance\cv@tmpc by 1 \fi -%% \ifnum\cv@tmpc@=10\relax\cv@tmpc@=11\relax\fi \ifnum\cv@tmpc@>10 \repeat -%% \ifnum#2<0\advance\cv@tmpc1\relax-\fi -%% \loop\ifnum\cv@tmpc<#1\relax0\advance\cv@tmpc1\relax\fi \ifnum\cv@tmpc<#1 \repeat -%% \cv@tmpc@=#2\relax\ifnum\cv@tmpc@<0\cv@tmpc@=-\cv@tmpc@\fi \relax\the\cv@tmpc@}% -%% % \makevruler[][][][][] -%% \def\makevruler[#1][#2][#3][#4][#5]{\begingroup\offinterlineskip -%% \textheight=#5\vbadness=10000\vfuzz=120ex\overfullrule=0pt% -%% \global\setbox\colmrulerbox=\vbox to \textheight{% -%% {\parskip=0pt\hfuzz=150em\cv@boxheight=\textheight -%% \cv@lineheight=#1\global\colmrulercount=#2% -%% \cv@tot\cv@boxheight\divide\cv@tot\cv@lineheight\advance\cv@tot2% -%% \cv@refno1\vskip-\cv@lineheight\vskip1ex% -%% \loop\setbox\cv@tmpbox=\hbox to0cm{{\colmtenhv\hfil\fillzeros[#4]\colmrulercount}}% -%% \ht\cv@tmpbox\cv@lineheight\dp\cv@tmpbox0pt\box\cv@tmpbox\break -%% \advance\cv@refno1\global\advance\colmrulercount#3\relax -%% \ifnum\cv@refno<\cv@tot\repeat}}\endgroup}% -%% \makeatother -%% % ----- end of vruler - -%% % \makevruler[][][][][] -%% \def\colmruler#1{\makevruler[12pt][#1][1][3][0.993\textheight]\usebox{\colmrulerbox}} -%% \AddToShipoutPicture{% -%% \ifcolmfinal\else -%% \colmruleroffset=\textheight -%% \advance\colmruleroffset by -3.7pt -%% \color[rgb]{.7,.7,.7} -%% \AtTextUpperLeft{% -%% \put(\LenToUnit{-35pt},\LenToUnit{-\colmruleroffset}){%left ruler -%% \colmruler{\colmrulercount}} -%% } -%% \fi -%% } -%%% To add a vertical bar on the side -%\AddToShipoutPicture{ -%\AtTextLowerLeft{ -%\hspace*{-1.8cm} -%\colorbox[rgb]{0.7,0.7,0.7}{\small \parbox[b][\textheight]{0.1cm}{}}} -%} diff --git a/papers/aleksandar_makelov/banner.png b/papers/aleksandar_makelov/banner.png index c5dd028e26..dc7a07a9ef 100644 Binary files a/papers/aleksandar_makelov/banner.png and b/papers/aleksandar_makelov/banner.png differ diff --git a/papers/aleksandar_makelov/fancyhdr.sty b/papers/aleksandar_makelov/fancyhdr.sty deleted file mode 100644 index 77ed4e3012..0000000000 --- a/papers/aleksandar_makelov/fancyhdr.sty +++ /dev/null @@ -1,485 +0,0 @@ -% fancyhdr.sty version 3.2 -% Fancy headers and footers for LaTeX. -% Piet van Oostrum, -% Dept of Computer and Information Sciences, University of Utrecht, -% Padualaan 14, P.O. Box 80.089, 3508 TB Utrecht, The Netherlands -% Telephone: +31 30 2532180. 
Email: piet@cs.uu.nl -% ======================================================================== -% LICENCE: -% This file may be distributed under the terms of the LaTeX Project Public -% License, as described in lppl.txt in the base LaTeX distribution. -% Either version 1 or, at your option, any later version. -% ======================================================================== -% MODIFICATION HISTORY: -% Sep 16, 1994 -% version 1.4: Correction for use with \reversemargin -% Sep 29, 1994: -% version 1.5: Added the \iftopfloat, \ifbotfloat and \iffloatpage commands -% Oct 4, 1994: -% version 1.6: Reset single spacing in headers/footers for use with -% setspace.sty or doublespace.sty -% Oct 4, 1994: -% version 1.7: changed \let\@mkboth\markboth to -% \def\@mkboth{\protect\markboth} to make it more robust -% Dec 5, 1994: -% version 1.8: corrections for amsbook/amsart: define \@chapapp and (more -% importantly) use the \chapter/sectionmark definitions from ps@headings if -% they exist (which should be true for all standard classes). -% May 31, 1995: -% version 1.9: The proposed \renewcommand{\headrulewidth}{\iffloatpage... -% construction in the doc did not work properly with the fancyplain style. -% June 1, 1995: -% version 1.91: The definition of \@mkboth wasn't restored on subsequent -% \pagestyle{fancy}'s. -% June 1, 1995: -% version 1.92: The sequence \pagestyle{fancyplain} \pagestyle{plain} -% \pagestyle{fancy} would erroneously select the plain version. -% June 1, 1995: -% version 1.93: \fancypagestyle command added. -% Dec 11, 1995: -% version 1.94: suggested by Conrad Hughes -% CJCH, Dec 11, 1995: added \footruleskip to allow control over footrule -% position (old hardcoded value of .3\normalbaselineskip is far too high -% when used with very small footer fonts). -% Jan 31, 1996: -% version 1.95: call \@normalsize in the reset code if that is defined, -% otherwise \normalsize. -% this is to solve a problem with ucthesis.cls, as this doesn't -% define \@currsize. Unfortunately for latex209 calling \normalsize doesn't -% work as this is optimized to do very little, so there \@normalsize should -% be called. Hopefully this code works for all versions of LaTeX known to -% mankind. -% April 25, 1996: -% version 1.96: initialize \headwidth to a magic (negative) value to catch -% most common cases that people change it before calling \pagestyle{fancy}. -% Note it can't be initialized when reading in this file, because -% \textwidth could be changed afterwards. This is quite probable. -% We also switch to \MakeUppercase rather than \uppercase and introduce a -% \nouppercase command for use in headers. and footers. -% May 3, 1996: -% version 1.97: Two changes: -% 1. Undo the change in version 1.8 (using the pagestyle{headings} defaults -% for the chapter and section marks. The current version of amsbook and -% amsart classes don't seem to need them anymore. Moreover the standard -% latex classes don't use \markboth if twoside isn't selected, and this is -% confusing as \leftmark doesn't work as expected. -% 2. include a call to \ps@empty in ps@@fancy. This is to solve a problem -% in the amsbook and amsart classes, that make global changes to \topskip, -% which are reset in \ps@empty. Hopefully this doesn't break other things. -% May 7, 1996: -% version 1.98: -% Added % after the line \def\nouppercase -% May 7, 1996: -% version 1.99: This is the alpha version of fancyhdr 2.0 -% Introduced the new commands \fancyhead, \fancyfoot, and \fancyhf. 
-% Changed \headrulewidth, \footrulewidth, \footruleskip to -% macros rather than length parameters, In this way they can be -% conditionalized and they don't consume length registers. There is no need -% to have them as length registers unless you want to do calculations with -% them, which is unlikely. Note that this may make some uses of them -% incompatible (i.e. if you have a file that uses \setlength or \xxxx=) -% May 10, 1996: -% version 1.99a: -% Added a few more % signs -% May 10, 1996: -% version 1.99b: -% Changed the syntax of \f@nfor to be resistent to catcode changes of := -% Removed the [1] from the defs of \lhead etc. because the parameter is -% consumed by the \@[xy]lhead etc. macros. -% June 24, 1997: -% version 1.99c: -% corrected \nouppercase to also include the protected form of \MakeUppercase -% \global added to manipulation of \headwidth. -% \iffootnote command added. -% Some comments added about \@fancyhead and \@fancyfoot. -% Aug 24, 1998 -% version 1.99d -% Changed the default \ps@empty to \ps@@empty in order to allow -% \fancypagestyle{empty} redefinition. -% Oct 11, 2000 -% version 2.0 -% Added LPPL license clause. -% -% A check for \headheight is added. An errormessage is given (once) if the -% header is too large. Empty headers don't generate the error even if -% \headheight is very small or even 0pt. -% Warning added for the use of 'E' option when twoside option is not used. -% In this case the 'E' fields will never be used. -% -% Mar 10, 2002 -% version 2.1beta -% New command: \fancyhfoffset[place]{length} -% defines offsets to be applied to the header/footer to let it stick into -% the margins (if length > 0). -% place is like in fancyhead, except that only E,O,L,R can be used. -% This replaces the old calculation based on \headwidth and the marginpar -% area. -% \headwidth will be dynamically calculated in the headers/footers when -% this is used. -% -% Mar 26, 2002 -% version 2.1beta2 -% \fancyhfoffset now also takes h,f as possible letters in the argument to -% allow the header and footer widths to be different. -% New commands \fancyheadoffset and \fancyfootoffset added comparable to -% \fancyhead and \fancyfoot. -% Errormessages and warnings have been made more informative. -% -% Dec 9, 2002 -% version 2.1 -% The defaults for \footrulewidth, \plainheadrulewidth and -% \plainfootrulewidth are changed from \z@skip to 0pt. In this way when -% someone inadvertantly uses \setlength to change any of these, the value -% of \z@skip will not be changed, rather an errormessage will be given. - -% March 3, 2004 -% Release of version 3.0 - -% Oct 7, 2004 -% version 3.1 -% Added '\endlinechar=13' to \fancy@reset to prevent problems with -% includegraphics in header when verbatiminput is active. - -% March 22, 2005 -% version 3.2 -% reset \everypar (the real one) in \fancy@reset because spanish.ldf does -% strange things with \everypar between << and >>. - -\def\ifancy@mpty#1{\def\temp@a{#1}\ifx\temp@a\@empty} - -\def\fancy@def#1#2{\ifancy@mpty{#2}\fancy@gbl\def#1{\leavevmode}\else - \fancy@gbl\def#1{#2\strut}\fi} - -\let\fancy@gbl\global - -\def\@fancyerrmsg#1{% - \ifx\PackageError\undefined - \errmessage{#1}\else - \PackageError{Fancyhdr}{#1}{}\fi} -\def\@fancywarning#1{% - \ifx\PackageWarning\undefined - \errmessage{#1}\else - \PackageWarning{Fancyhdr}{#1}{}\fi} - -% Usage: \@forc \var{charstring}{command to be executed for each char} -% This is similar to LaTeX's \@tfor, but expands the charstring. 
- -\def\@forc#1#2#3{\expandafter\f@rc\expandafter#1\expandafter{#2}{#3}} -\def\f@rc#1#2#3{\def\temp@ty{#2}\ifx\@empty\temp@ty\else - \f@@rc#1#2\f@@rc{#3}\fi} -\def\f@@rc#1#2#3\f@@rc#4{\def#1{#2}#4\f@rc#1{#3}{#4}} - -% Usage: \f@nfor\name:=list\do{body} -% Like LaTeX's \@for but an empty list is treated as a list with an empty -% element - -\newcommand{\f@nfor}[3]{\edef\@fortmp{#2}% - \expandafter\@forloop#2,\@nil,\@nil\@@#1{#3}} - -% Usage: \def@ult \cs{defaults}{argument} -% sets \cs to the characters from defaults appearing in argument -% or defaults if it would be empty. All characters are lowercased. - -\newcommand\def@ult[3]{% - \edef\temp@a{\lowercase{\edef\noexpand\temp@a{#3}}}\temp@a - \def#1{}% - \@forc\tmpf@ra{#2}% - {\expandafter\if@in\tmpf@ra\temp@a{\edef#1{#1\tmpf@ra}}{}}% - \ifx\@empty#1\def#1{#2}\fi} -% -% \if@in -% -\newcommand{\if@in}[4]{% - \edef\temp@a{#2}\def\temp@b##1#1##2\temp@b{\def\temp@b{##1}}% - \expandafter\temp@b#2#1\temp@b\ifx\temp@a\temp@b #4\else #3\fi} - -\newcommand{\fancyhead}{\@ifnextchar[{\f@ncyhf\fancyhead h}% - {\f@ncyhf\fancyhead h[]}} -\newcommand{\fancyfoot}{\@ifnextchar[{\f@ncyhf\fancyfoot f}% - {\f@ncyhf\fancyfoot f[]}} -\newcommand{\fancyhf}{\@ifnextchar[{\f@ncyhf\fancyhf{}}% - {\f@ncyhf\fancyhf{}[]}} - -% New commands for offsets added - -\newcommand{\fancyheadoffset}{\@ifnextchar[{\f@ncyhfoffs\fancyheadoffset h}% - {\f@ncyhfoffs\fancyheadoffset h[]}} -\newcommand{\fancyfootoffset}{\@ifnextchar[{\f@ncyhfoffs\fancyfootoffset f}% - {\f@ncyhfoffs\fancyfootoffset f[]}} -\newcommand{\fancyhfoffset}{\@ifnextchar[{\f@ncyhfoffs\fancyhfoffset{}}% - {\f@ncyhfoffs\fancyhfoffset{}[]}} - -% The header and footer fields are stored in command sequences with -% names of the form: \f@ncy with for [eo], from [lcr] -% and from [hf]. - -\def\f@ncyhf#1#2[#3]#4{% - \def\temp@c{}% - \@forc\tmpf@ra{#3}% - {\expandafter\if@in\tmpf@ra{eolcrhf,EOLCRHF}% - {}{\edef\temp@c{\temp@c\tmpf@ra}}}% - \ifx\@empty\temp@c\else - \@fancyerrmsg{Illegal char `\temp@c' in \string#1 argument: - [#3]}% - \fi - \f@nfor\temp@c{#3}% - {\def@ult\f@@@eo{eo}\temp@c - \if@twoside\else - \if\f@@@eo e\@fancywarning - {\string#1's `E' option without twoside option is useless}\fi\fi - \def@ult\f@@@lcr{lcr}\temp@c - \def@ult\f@@@hf{hf}{#2\temp@c}% - \@forc\f@@eo\f@@@eo - {\@forc\f@@lcr\f@@@lcr - {\@forc\f@@hf\f@@@hf - {\expandafter\fancy@def\csname - f@ncy\f@@eo\f@@lcr\f@@hf\endcsname - {#4}}}}}} - -\def\f@ncyhfoffs#1#2[#3]#4{% - \def\temp@c{}% - \@forc\tmpf@ra{#3}% - {\expandafter\if@in\tmpf@ra{eolrhf,EOLRHF}% - {}{\edef\temp@c{\temp@c\tmpf@ra}}}% - \ifx\@empty\temp@c\else - \@fancyerrmsg{Illegal char `\temp@c' in \string#1 argument: - [#3]}% - \fi - \f@nfor\temp@c{#3}% - {\def@ult\f@@@eo{eo}\temp@c - \if@twoside\else - \if\f@@@eo e\@fancywarning - {\string#1's `E' option without twoside option is useless}\fi\fi - \def@ult\f@@@lcr{lr}\temp@c - \def@ult\f@@@hf{hf}{#2\temp@c}% - \@forc\f@@eo\f@@@eo - {\@forc\f@@lcr\f@@@lcr - {\@forc\f@@hf\f@@@hf - {\expandafter\setlength\csname - f@ncyO@\f@@eo\f@@lcr\f@@hf\endcsname - {#4}}}}}% - \fancy@setoffs} - -% Fancyheadings version 1 commands. These are more or less deprecated, -% but they continue to work. 
- -\newcommand{\lhead}{\@ifnextchar[{\@xlhead}{\@ylhead}} -\def\@xlhead[#1]#2{\fancy@def\f@ncyelh{#1}\fancy@def\f@ncyolh{#2}} -\def\@ylhead#1{\fancy@def\f@ncyelh{#1}\fancy@def\f@ncyolh{#1}} - -\newcommand{\chead}{\@ifnextchar[{\@xchead}{\@ychead}} -\def\@xchead[#1]#2{\fancy@def\f@ncyech{#1}\fancy@def\f@ncyoch{#2}} -\def\@ychead#1{\fancy@def\f@ncyech{#1}\fancy@def\f@ncyoch{#1}} - -\newcommand{\rhead}{\@ifnextchar[{\@xrhead}{\@yrhead}} -\def\@xrhead[#1]#2{\fancy@def\f@ncyerh{#1}\fancy@def\f@ncyorh{#2}} -\def\@yrhead#1{\fancy@def\f@ncyerh{#1}\fancy@def\f@ncyorh{#1}} - -\newcommand{\lfoot}{\@ifnextchar[{\@xlfoot}{\@ylfoot}} -\def\@xlfoot[#1]#2{\fancy@def\f@ncyelf{#1}\fancy@def\f@ncyolf{#2}} -\def\@ylfoot#1{\fancy@def\f@ncyelf{#1}\fancy@def\f@ncyolf{#1}} - -\newcommand{\cfoot}{\@ifnextchar[{\@xcfoot}{\@ycfoot}} -\def\@xcfoot[#1]#2{\fancy@def\f@ncyecf{#1}\fancy@def\f@ncyocf{#2}} -\def\@ycfoot#1{\fancy@def\f@ncyecf{#1}\fancy@def\f@ncyocf{#1}} - -\newcommand{\rfoot}{\@ifnextchar[{\@xrfoot}{\@yrfoot}} -\def\@xrfoot[#1]#2{\fancy@def\f@ncyerf{#1}\fancy@def\f@ncyorf{#2}} -\def\@yrfoot#1{\fancy@def\f@ncyerf{#1}\fancy@def\f@ncyorf{#1}} - -\newlength{\fancy@headwidth} -\let\headwidth\fancy@headwidth -\newlength{\f@ncyO@elh} -\newlength{\f@ncyO@erh} -\newlength{\f@ncyO@olh} -\newlength{\f@ncyO@orh} -\newlength{\f@ncyO@elf} -\newlength{\f@ncyO@erf} -\newlength{\f@ncyO@olf} -\newlength{\f@ncyO@orf} -\newcommand{\headrulewidth}{0.4pt} -\newcommand{\footrulewidth}{0pt} -\newcommand{\footruleskip}{.3\normalbaselineskip} - -% Fancyplain stuff shouldn't be used anymore (rather -% \fancypagestyle{plain} should be used), but it must be present for -% compatibility reasons. - -\newcommand{\plainheadrulewidth}{0pt} -\newcommand{\plainfootrulewidth}{0pt} -\newif\if@fancyplain \@fancyplainfalse -\def\fancyplain#1#2{\if@fancyplain#1\else#2\fi} - -\headwidth=-123456789sp %magic constant - -% Command to reset various things in the headers: -% a.o. single spacing (taken from setspace.sty) -% and the catcode of ^^M (so that epsf files in the header work if a -% verbatim crosses a page boundary) -% It also defines a \nouppercase command that disables \uppercase and -% \Makeuppercase. It can only be used in the headers and footers. -\let\fnch@everypar\everypar% save real \everypar because of spanish.ldf -\def\fancy@reset{\fnch@everypar{}\restorecr\endlinechar=13 - \def\baselinestretch{1}% - \def\nouppercase##1{{\let\uppercase\relax\let\MakeUppercase\relax - \expandafter\let\csname MakeUppercase \endcsname\relax##1}}% - \ifx\undefined\@newbaseline% NFSS not present; 2.09 or 2e - \ifx\@normalsize\undefined \normalsize % for ucthesis.cls - \else \@normalsize \fi - \else% NFSS (2.09) present - \@newbaseline% - \fi} - -% Initialization of the head and foot text. - -% The default values still contain \fancyplain for compatibility. -\fancyhf{} % clear all -% lefthead empty on ``plain'' pages, \rightmark on even, \leftmark on odd pages -% evenhead empty on ``plain'' pages, \leftmark on even, \rightmark on odd pages -\if@twoside - \fancyhead[el,or]{\fancyplain{}{\sl\rightmark}} - \fancyhead[er,ol]{\fancyplain{}{\sl\leftmark}} -\else - \fancyhead[l]{\fancyplain{}{\sl\rightmark}} - \fancyhead[r]{\fancyplain{}{\sl\leftmark}} -\fi -\fancyfoot[c]{\rm\thepage} % page number - -% Use box 0 as a temp box and dimen 0 as temp dimen. -% This can be done, because this code will always -% be used inside another box, and therefore the changes are local. 
- -\def\@fancyvbox#1#2{\setbox0\vbox{#2}\ifdim\ht0>#1\@fancywarning - {\string#1 is too small (\the#1): ^^J Make it at least \the\ht0.^^J - We now make it that large for the rest of the document.^^J - This may cause the page layout to be inconsistent, however\@gobble}% - \dimen0=#1\global\setlength{#1}{\ht0}\ht0=\dimen0\fi - \box0} - -% Put together a header or footer given the left, center and -% right text, fillers at left and right and a rule. -% The \lap commands put the text into an hbox of zero size, -% so overlapping text does not generate an errormessage. -% These macros have 5 parameters: -% 1. LEFTSIDE BEARING % This determines at which side the header will stick -% out. When \fancyhfoffset is used this calculates \headwidth, otherwise -% it is \hss or \relax (after expansion). -% 2. \f@ncyolh, \f@ncyelh, \f@ncyolf or \f@ncyelf. This is the left component. -% 3. \f@ncyoch, \f@ncyech, \f@ncyocf or \f@ncyecf. This is the middle comp. -% 4. \f@ncyorh, \f@ncyerh, \f@ncyorf or \f@ncyerf. This is the right component. -% 5. RIGHTSIDE BEARING. This is always \relax or \hss (after expansion). - -\def\@fancyhead#1#2#3#4#5{#1\hbox to\headwidth{\fancy@reset - \@fancyvbox\headheight{\hbox - {\rlap{\parbox[b]{\headwidth}{\raggedright#2}}\hfill - \parbox[b]{\headwidth}{\centering#3}\hfill - \llap{\parbox[b]{\headwidth}{\raggedleft#4}}}\headrule}}#5} - -\def\@fancyfoot#1#2#3#4#5{#1\hbox to\headwidth{\fancy@reset - \@fancyvbox\footskip{\footrule - \hbox{\rlap{\parbox[t]{\headwidth}{\raggedright#2}}\hfill - \parbox[t]{\headwidth}{\centering#3}\hfill - \llap{\parbox[t]{\headwidth}{\raggedleft#4}}}}}#5} - -\def\headrule{{\if@fancyplain\let\headrulewidth\plainheadrulewidth\fi - \hrule\@height\headrulewidth\@width\headwidth \vskip-\headrulewidth}} - -\def\footrule{{\if@fancyplain\let\footrulewidth\plainfootrulewidth\fi - \vskip-\footruleskip\vskip-\footrulewidth - \hrule\@width\headwidth\@height\footrulewidth\vskip\footruleskip}} - -\def\ps@fancy{% -\@ifundefined{@chapapp}{\let\@chapapp\chaptername}{}%for amsbook -% -% Define \MakeUppercase for old LaTeXen. -% Note: we used \def rather than \let, so that \let\uppercase\relax (from -% the version 1 documentation) will still work. -% -\@ifundefined{MakeUppercase}{\def\MakeUppercase{\uppercase}}{}% -\@ifundefined{chapter}{\def\sectionmark##1{\markboth -{\MakeUppercase{\ifnum \c@secnumdepth>\z@ - \thesection\hskip 1em\relax \fi ##1}}{}}% -\def\subsectionmark##1{\markright {\ifnum \c@secnumdepth >\@ne - \thesubsection\hskip 1em\relax \fi ##1}}}% -{\def\chaptermark##1{\markboth {\MakeUppercase{\ifnum \c@secnumdepth>\m@ne - \@chapapp\ \thechapter. \ \fi ##1}}{}}% -\def\sectionmark##1{\markright{\MakeUppercase{\ifnum \c@secnumdepth >\z@ - \thesection. \ \fi ##1}}}}% -%\csname ps@headings\endcsname % use \ps@headings defaults if they exist -\ps@@fancy -\gdef\ps@fancy{\@fancyplainfalse\ps@@fancy}% -% Initialize \headwidth if the user didn't -% -\ifdim\headwidth<0sp -% -% This catches the case that \headwidth hasn't been initialized and the -% case that the user added something to \headwidth in the expectation that -% it was initialized to \textwidth. We compensate this now. This loses if -% the user intended to multiply it by a factor. But that case is more -% likely done by saying something like \headwidth=1.2\textwidth. -% The doc says you have to change \headwidth after the first call to -% \pagestyle{fancy}. This code is just to catch the most common cases were -% that requirement is violated. 
-% - \global\advance\headwidth123456789sp\global\advance\headwidth\textwidth -\fi} -\def\ps@fancyplain{\ps@fancy \let\ps@plain\ps@plain@fancy} -\def\ps@plain@fancy{\@fancyplaintrue\ps@@fancy} -\let\ps@@empty\ps@empty -\def\ps@@fancy{% -\ps@@empty % This is for amsbook/amsart, which do strange things with \topskip -\def\@mkboth{\protect\markboth}% -\def\@oddhead{\@fancyhead\fancy@Oolh\f@ncyolh\f@ncyoch\f@ncyorh\fancy@Oorh}% -\def\@oddfoot{\@fancyfoot\fancy@Oolf\f@ncyolf\f@ncyocf\f@ncyorf\fancy@Oorf}% -\def\@evenhead{\@fancyhead\fancy@Oelh\f@ncyelh\f@ncyech\f@ncyerh\fancy@Oerh}% -\def\@evenfoot{\@fancyfoot\fancy@Oelf\f@ncyelf\f@ncyecf\f@ncyerf\fancy@Oerf}% -} -% Default definitions for compatibility mode: -% These cause the header/footer to take the defined \headwidth as width -% And to shift in the direction of the marginpar area - -\def\fancy@Oolh{\if@reversemargin\hss\else\relax\fi} -\def\fancy@Oorh{\if@reversemargin\relax\else\hss\fi} -\let\fancy@Oelh\fancy@Oorh -\let\fancy@Oerh\fancy@Oolh - -\let\fancy@Oolf\fancy@Oolh -\let\fancy@Oorf\fancy@Oorh -\let\fancy@Oelf\fancy@Oelh -\let\fancy@Oerf\fancy@Oerh - -% New definitions for the use of \fancyhfoffset -% These calculate the \headwidth from \textwidth and the specified offsets. - -\def\fancy@offsolh{\headwidth=\textwidth\advance\headwidth\f@ncyO@olh - \advance\headwidth\f@ncyO@orh\hskip-\f@ncyO@olh} -\def\fancy@offselh{\headwidth=\textwidth\advance\headwidth\f@ncyO@elh - \advance\headwidth\f@ncyO@erh\hskip-\f@ncyO@elh} - -\def\fancy@offsolf{\headwidth=\textwidth\advance\headwidth\f@ncyO@olf - \advance\headwidth\f@ncyO@orf\hskip-\f@ncyO@olf} -\def\fancy@offself{\headwidth=\textwidth\advance\headwidth\f@ncyO@elf - \advance\headwidth\f@ncyO@erf\hskip-\f@ncyO@elf} - -\def\fancy@setoffs{% -% Just in case \let\headwidth\textwidth was used - \fancy@gbl\let\headwidth\fancy@headwidth - \fancy@gbl\let\fancy@Oolh\fancy@offsolh - \fancy@gbl\let\fancy@Oelh\fancy@offselh - \fancy@gbl\let\fancy@Oorh\hss - \fancy@gbl\let\fancy@Oerh\hss - \fancy@gbl\let\fancy@Oolf\fancy@offsolf - \fancy@gbl\let\fancy@Oelf\fancy@offself - \fancy@gbl\let\fancy@Oorf\hss - \fancy@gbl\let\fancy@Oerf\hss} - -\newif\iffootnote -\let\latex@makecol\@makecol -\def\@makecol{\ifvoid\footins\footnotetrue\else\footnotefalse\fi -\let\topfloat\@toplist\let\botfloat\@botlist\latex@makecol} -\def\iftopfloat#1#2{\ifx\topfloat\empty #2\else #1\fi} -\def\ifbotfloat#1#2{\ifx\botfloat\empty #2\else #1\fi} -\def\iffloatpage#1#2{\if@fcolmade #1\else #2\fi} - -\newcommand{\fancypagestyle}[2]{% - \@namedef{ps@#1}{\let\fancy@gbl\relax#2\relax\ps@fancy}} diff --git a/papers/aleksandar_makelov/main.tex b/papers/aleksandar_makelov/main.tex index 73e687bfe2..9872dbc4b3 100644 --- a/papers/aleksandar_makelov/main.tex +++ b/papers/aleksandar_makelov/main.tex @@ -1,74 +1,13 @@ -\documentclass{article} % For LaTeX2e -\usepackage{arxiv_template} - -% Optional math commands from https://github.com/goodfeli/dlbook_notation. 
-% \input{math_commands.tex} - -\usepackage{microtype} -\usepackage{hyperref} -\usepackage{url} -\usepackage{graphicx} - -\usepackage{pgfplots} -\usepackage{pgfplotstable} -\pgfplotsset{compat=1.3} -\usepackage{tikz} -\usetikzlibrary{arrows.meta} -\usetikzlibrary{pgfplots.groupplots} -\usepackage{ragged2e} -\definecolor{mydarkblue}{rgb}{0,0.08,0.85} -\definecolor{mylightblue}{rgb}{0.06,0.56,1.0} -\definecolor{mylightorange}{rgb}{1.0,0.62,0.12} -\definecolor{mylightred}{rgb}{0.99,0.00,0.04} -\definecolor{mygreen}{HTML}{2F9E44} -\definecolor{myred}{HTML}{E03131} -\definecolor{myblue}{HTML}{1971C2} - -\usepackage{subcaption} -\usepackage{booktabs} -\usepackage{wrapfig} -\usepackage{changes} -\definecolor{myred}{HTML}{E03131} -\makeatletter -\@namedef{Changes@AuthorColor}{myred} -\colorlet{Changes@Color}{myred} -\makeatother -% \usepackage{floatrow} -\colmfinalcopy - -% \def\l{\left} -% \def\r{\right} - -\title{\texttt{mandala}: Compositional Memoization for Simple \& -Powerful Scientific Data Management} - -\newcommand{\fix}{\marginpar{FIX}} -\newcommand{\new}{\marginpar{NEW}} -% \newfloatcommand{capbtabbox}{table}[][\FBwidth] - -\usepackage{soul} -\usepackage{amsthm} -\usepackage{mathrsfs} -% \usepackage[outputdir=/home/amakelov/vscode_output/latex-aux]{minted} -\newtheorem{theorem}{Theorem}[section] -\newtheorem{lemma}[theorem]{Lemma} - - - -\begin{document} -\maketitle - - \begin{abstract} We present - \texttt{mandala}\footnote{\url{https://github.com/amakelov/mandala}}, a Python + \href{https://github.com/amakelov/mandala}{\texttt{mandala}}, a Python library that largely eliminates the accidental complexity of scientific data management and incremental computing. While most traditional and/or popular data management solutions are based on \emph{logging}, \texttt{mandala} takes a fundamentally different approach, using \emph{memoization} of function calls as the fundamental unit of saving, - loading, querying and deleting computational artifacts. - + loading, querying and deleting computational artifacts. + It does so by implementing a \emph{compositional} form of memoization, which keeps track of how memoized functions compose with one another. In this way: (1) complex computations are effectively memoized end-to-end, and become @@ -105,7 +44,7 @@ \section{Introduction} \citep{sandve2013ten,wilkinson2016fair}, but still require manual effort, attention to extraneous details, and discipline to follow. Researchers often operate under time pressure and/or the need to quickly iterate on code, which -makes these best `practices' a rather \emph{impractical} time investment. +makes these best `practices' a rather \emph{impractical} time investment. Thus, ideally we would like a system that (1) does not get in the way by imposing a complex new language/semantics/syntax, (2) provides powerful @@ -158,7 +97,7 @@ \section{Introduction} \emph{accidental complexity} (the data management tools necessary to implement the solution) \citep{Brooks1987NoSB}. The rest of this paper presents the design and main functionalities of -\texttt{mandala}, and is organized as follows: +\texttt{mandala}, and is organized as follows: \begin{itemize} \item In Section \ref{section:core-concepts}, we describe how memoization is designed, how this allows memoized calls to be composed and memoized results to @@ -206,12 +145,12 @@ \subsection{Memoization and the Computational Graph} \texttt{Ref}s and \texttt{Call}s are the two atomic data structures in \texttt{mandala}'s model of computations. 
When a call to an \texttt{@op}-decorated function \texttt{f} is executed inside a storage context, -this results in the creation of +this results in the creation of \begin{itemize} \item A \texttt{Ref} object for each input to the call. These wrap the `raw' values passed as inputs together with content IDs (hashes of the Python objects) and history IDs (hashes of the memoized calls that produced these values, if -any). +any). \begin{itemize} \item If an input to the call is already a \texttt{Ref} object, it is passed through as is; @@ -234,7 +173,7 @@ \subsection{Memoization and the Computational Graph} % \centering % \includegraphics[width=\linewidth]{img/comp-graph.pdf} % \caption{A part of the computaitonal graph built up by the calls in Figure -% \ref{fig:basic-usage}. +% \ref{fig:basic-usage}. % % The nodes are \texttt{Call} and \texttt{Ref} objects, % % and the edges are the inputs/output names connecting them. % } @@ -253,7 +192,7 @@ \subsection{Memoization and the Computational Graph} user composes memoized calls. \subsection{Motivation for the Design of Memoization} -\label{subsection:} +\label{subsection:design} \paragraph{Why content and history IDs?} The simultaneous use of content and history IDs has a few subtle advantages. @@ -318,7 +257,7 @@ \subsection{Retracing as a Versatile Imperative Interface to the Stored Computat which means stepping through memoized code with the purpose of resuming from a failure, loading intermediate values, or continuing from a particular point with new computations. A small example of retracing is shown in Figure -\ref{fig:basic-usage} (c). +\ref{fig:basic-usage} (c). This pattern is simple yet powerful, as it allows the user to interact with the stored computation graph in a way that is adapted to their use case, and to @@ -344,9 +283,9 @@ \section{Computation Frames} computations found.} \label{fig:figure1} \end{subfigure} - + \vspace{1em} - + \begin{subfigure}[b]{\textwidth} \centering \includegraphics[width=\textwidth]{img/fig5.pdf} @@ -424,16 +363,16 @@ \subsection{Formal Definition} An example is shown in Figure \ref{fig:cf} (c); \item \textbf{Groups of \texttt{Ref}s and \texttt{Call}s}: for each variable $v\in V$, a set of (history IDs of) \texttt{Ref}s $R_v$, and for each function -$f\in F$ with underlying \texttt{@op} $o_f$, a set of (history IDs of) \texttt{Call}s $C_f$; +$f\in F$ with underlying \texttt{@op} $o_f$, a set of (history IDs of) \texttt{Call}s $C_f$; \end{itemize} subject to the constraint that: for every call $c\in C_f$, if there's an input/output edge labeled $l$ connecting $f$ to some variable $v$, then if $c$ has a \texttt{Ref} $r_l$ corresponding to input/output name $l$, we have $r_l\in -R_v$. +R_v$. In other words, when we look at all calls in $f\in F$, their inputs/outputs must be present in the variables connected to $f$ under the respective input/output -name. +name. \subsection{Basic Usage} \label{subsection:cf-basic-usage} @@ -478,7 +417,7 @@ \subsection{Data Structures} \texttt{MList[int]} inheriting from \texttt{List[int]}, \ldots. By applying this type annotation, individual elements as well as the collection itself are memoized as \texttt{Ref}s (with the collection merely pointing to the -\texttt{Ref}s of its elements to avoid duplication). +\texttt{Ref}s of its elements to avoid duplication). 
\begin{wrapfigure}[18]{l}{0.45\textwidth} \centering @@ -537,7 +476,7 @@ \section{Related Work} \textbf{Memoization.} There are several memoization solutions for Python that lack the compositional nature of \texttt{mandala}, as well as the versioning and querying tools: the builtin \texttt{functools} module provides decorators such as \texttt{lru\_cache} for memoization; the \texttt{incpy} project \citep{guo2011using} enables automatic -persistent memoization of Python functions directly on the interpreter level; +persistent memoization of Python functions directly on the interpreter level; the \texttt{funsies} project \citep{lavigne2021funsies} is a memoization-based distributed workflow executor that uses a similar hashing approach to keep track of which computations have already been done; \texttt{koji} \citep{maymounkov2018koji} is a design for an incremental computation data processing framework that unifies over different resource types (files or services), and uses an analogous notion of hashing to keep track of computations. @@ -566,17 +505,17 @@ \section{Related Work} organized in a bare-bones \texttt{git} repository \citep{git}: it is a content-addressed tree, where each edge tracks a diff from the content at one endpoint to that at the other. Additional metadata indicates equivalence classes -of semantically equivalent contents. -% Semantic versioning \citep{semver} is another popular code versioning system. +of semantically equivalent contents. +% Semantic versioning \citep{semver} is another popular code versioning system. % \texttt{mandala} is similar to semver in % that it allows you to make backward-compatible changes to the interface and % logic of dependencies. It is different in that versions are still labeled by -% content, instead of `non-canonical' numbers. +% content, instead of `non-canonical' numbers. \section{Limitations} \label{sec:limitations} -\textbf{Computing deterministic content IDs of any Python object is difficult.} +\textbf{Computing deterministic content IDs of any Python object is difficult.} \texttt{mandala} uses the \texttt{joblib} library to serialize Python objects into byte strings, and then hashes these strings to get the content ID. This approach is not perfect, as it is not always possible to serialize Python @@ -586,7 +525,7 @@ \section{Limitations} sensitive to small changes in the input, such as numerical precision in floating point numbers. Finally, complex Python objects may contain state that is not intrinsically part of the object's identity, such as resource utilization data -(e.g., memory addresses). This can lead to different content IDs before and +(e.g., memory addresses). This can lead to different content IDs before and after a round trip through the storage backend. These issues don't come up often as long as all initial \texttt{Ref}s are created from simple Python objects: complex objects are hashed and saved once when returned from an \texttt{@op}, @@ -617,12 +556,12 @@ \section{Conclusion} \section*{Acknowledgements} First and foremost, I would like to thank my friend Stefan Krastanov for many -valuable conversations throughout the evolution and development of +valuable conversations throughout the evolution and development of \texttt{mandala}. Nobody could ask for a more enthusiastic collaborator and champion of their work. 
Second, I would also like to thank Nicholas Schiefer for some helpful feedback on an earlier version of the library, as well as suggestions and implementations for features to make it work in a distributed -setting, and for advertising \texttt{mandala} at his workplace. +setting, and for advertising \texttt{mandala} at his workplace. There have been far too many people over the years who have listened patiently to me talk about this project in its earlier stages; in particular, I'm grateful @@ -633,9 +572,3 @@ \section*{Acknowledgements} to me talk about this project, and who have given me valuable feedback and encouragement. Finally, I would like to thank my reviewers at the SciPy conference, especially Andrei Paleyes, for their helpful feedback on this paper. - -% bibliography -\bibliography{scipy} -\bibliographystyle{arxiv_template} - -\end{document} \ No newline at end of file diff --git a/papers/aleksandar_makelov/myst.yml b/papers/aleksandar_makelov/myst.yml index 08e5a6fe7e..d8133a0d80 100644 --- a/papers/aleksandar_makelov/myst.yml +++ b/papers/aleksandar_makelov/myst.yml @@ -1,9 +1,11 @@ version: 1 +extends: ../proceedings.yml project: + doi: 10.25080/JHPV7385 # Update this to match `scipy-2024-` the folder should be `` id: scipy-2024-aleksandar_makelov - title: "Mandala: Compositional Memoization for Simple & Powerful Scientific Data Management" - subtitle: LaTeX edition + title: 'Mandala: Compositional Memoization for Simple & Powerful Scientific Data Management' + description: We present mandala, a Python library that largely eliminates the accidental complexity of scientific data management and incremental computing. While most traditional and/or popular data management solutions are based on logging, mandala takes a fundamentally different approach, using memoization of function calls as the fundamental unit of saving, loading, querying and deleting computational artifacts. # Authors should have affiliations, emails and ORCIDs if available authors: - name: Aleksandar Makelov @@ -32,15 +34,5 @@ project: - maymounkov2018koji - semver - lozano2017unison - # A banner will be generated for you on publication, this is a placeholder - banner: banner.png - # The rest of the information shouldn't be modified - subject: Research Article - open_access: true - license: CC-BY-4.0 - venue: Scipy 2024 - date: 2024-07-10 - numbering: - headings: true site: template: article-theme diff --git a/papers/aleksandar_makelov/natbib.sty b/papers/aleksandar_makelov/natbib.sty deleted file mode 100644 index ff0d0b91b6..0000000000 --- a/papers/aleksandar_makelov/natbib.sty +++ /dev/null @@ -1,1246 +0,0 @@ -%% -%% This is file `natbib.sty', -%% generated with the docstrip utility. -%% -%% The original source files were: -%% -%% natbib.dtx (with options: `package,all') -%% ============================================= -%% IMPORTANT NOTICE: -%% -%% This program can be redistributed and/or modified under the terms -%% of the LaTeX Project Public License Distributed from CTAN -%% archives in directory macros/latex/base/lppl.txt; either -%% version 1 of the License, or any later version. -%% -%% This is a generated file. -%% It may not be distributed without the original source file natbib.dtx. -%% -%% Full documentation can be obtained by LaTeXing that original file. -%% Only a few abbreviated comments remain here to describe the usage. -%% ============================================= -%% Copyright 1993-2009 Patrick W Daly -%% Max-Planck-Institut f\"ur Sonnensystemforschung -%% Max-Planck-Str. 
2 -%% D-37191 Katlenburg-Lindau -%% Germany -%% E-mail: daly@mps.mpg.de -\NeedsTeXFormat{LaTeX2e}[1995/06/01] -\ProvidesPackage{natbib} - [2009/07/16 8.31 (PWD, AO)] - - % This package reimplements the LaTeX \cite command to be used for various - % citation styles, both author-year and numerical. It accepts BibTeX - % output intended for many other packages, and therefore acts as a - % general, all-purpose citation-style interface. - % - % With standard numerical .bst files, only numerical citations are - % possible. With an author-year .bst file, both numerical and - % author-year citations are possible. - % - % If author-year citations are selected, \bibitem must have one of the - % following forms: - % \bibitem[Jones et al.(1990)]{key}... - % \bibitem[Jones et al.(1990)Jones, Baker, and Williams]{key}... - % \bibitem[Jones et al., 1990]{key}... - % \bibitem[\protect\citeauthoryear{Jones, Baker, and Williams}{Jones - % et al.}{1990}]{key}... - % \bibitem[\protect\citeauthoryear{Jones et al.}{1990}]{key}... - % \bibitem[\protect\astroncite{Jones et al.}{1990}]{key}... - % \bibitem[\protect\citename{Jones et al., }1990]{key}... - % \harvarditem[Jones et al.]{Jones, Baker, and Williams}{1990}{key}... - % - % This is either to be made up manually, or to be generated by an - % appropriate .bst file with BibTeX. - % Author-year mode || Numerical mode - % Then, \citet{key} ==>> Jones et al. (1990) || Jones et al. [21] - % \citep{key} ==>> (Jones et al., 1990) || [21] - % Multiple citations as normal: - % \citep{key1,key2} ==>> (Jones et al., 1990; Smith, 1989) || [21,24] - % or (Jones et al., 1990, 1991) || [21,24] - % or (Jones et al., 1990a,b) || [21,24] - % \cite{key} is the equivalent of \citet{key} in author-year mode - % and of \citep{key} in numerical mode - % Full author lists may be forced with \citet* or \citep*, e.g. - % \citep*{key} ==>> (Jones, Baker, and Williams, 1990) - % Optional notes as: - % \citep[chap. 2]{key} ==>> (Jones et al., 1990, chap. 2) - % \citep[e.g.,][]{key} ==>> (e.g., Jones et al., 1990) - % \citep[see][pg. 34]{key}==>> (see Jones et al., 1990, pg. 34) - % (Note: in standard LaTeX, only one note is allowed, after the ref. - % Here, one note is like the standard, two make pre- and post-notes.) - % \citealt{key} ==>> Jones et al. 1990 - % \citealt*{key} ==>> Jones, Baker, and Williams 1990 - % \citealp{key} ==>> Jones et al., 1990 - % \citealp*{key} ==>> Jones, Baker, and Williams, 1990 - % Additional citation possibilities (both author-year and numerical modes) - % \citeauthor{key} ==>> Jones et al. - % \citeauthor*{key} ==>> Jones, Baker, and Williams - % \citeyear{key} ==>> 1990 - % \citeyearpar{key} ==>> (1990) - % \citetext{priv. comm.} ==>> (priv. comm.) - % \citenum{key} ==>> 11 [non-superscripted] - % Note: full author lists depends on whether the bib style supports them; - % if not, the abbreviated list is printed even when full requested. - % - % For names like della Robbia at the start of a sentence, use - % \Citet{dRob98} ==>> Della Robbia (1998) - % \Citep{dRob98} ==>> (Della Robbia, 1998) - % \Citeauthor{dRob98} ==>> Della Robbia - % - % - % Citation aliasing is achieved with - % \defcitealias{key}{text} - % \citetalias{key} ==>> text - % \citepalias{key} ==>> (text) - % - % Defining the citation mode and punctual (citation style) - % \setcitestyle{} - % Example: \setcitestyle{square,semicolon} - % Alternatively: - % Use \bibpunct with 6 mandatory arguments: - % 1. opening bracket for citation - % 2. closing bracket - % 3. 
citation separator (for multiple citations in one \cite) - % 4. the letter n for numerical styles, s for superscripts - % else anything for author-year - % 5. punctuation between authors and date - % 6. punctuation between years (or numbers) when common authors missing - % One optional argument is the character coming before post-notes. It - % appears in square braces before all other arguments. May be left off. - % Example (and default) \bibpunct[, ]{(}{)}{;}{a}{,}{,} - % - % To make this automatic for a given bib style, named newbib, say, make - % a local configuration file, natbib.cfg, with the definition - % \newcommand{\bibstyle@newbib}{\bibpunct...} - % Then the \bibliographystyle{newbib} will cause \bibstyle@newbib to - % be called on THE NEXT LATEX RUN (via the aux file). - % - % Such preprogrammed definitions may be invoked anywhere in the text - % by calling \citestyle{newbib}. This is only useful if the style specified - % differs from that in \bibliographystyle. - % - % With \citeindextrue and \citeindexfalse, one can control whether the - % \cite commands make an automatic entry of the citation in the .idx - % indexing file. For this, \makeindex must also be given in the preamble. - % - % Package Options: (for selecting punctuation) - % round - round parentheses are used (default) - % square - square brackets are used [option] - % curly - curly braces are used {option} - % angle - angle brackets are used