diff --git a/.editorconfig b/.editorconfig
new file mode 100644
index 00000000..5bf4860b
--- /dev/null
+++ b/.editorconfig
@@ -0,0 +1,26 @@
+root = true
+
+[*]
+charset = utf-8
+insert_final_newline = true
+trim_trailing_whitespace = true
+
+[*.md]
+indent_size = 2
+indent_style = space
+max_line_length = 100 # Please keep this in sync with bin/lesson_check.py!
+trim_trailing_whitespace = false # keep trailing spaces in markdown - 2+ spaces are translated to a hard break (<br/>)
+
+[*.r]
+max_line_length = 80
+
+[*.py]
+indent_size = 4
+indent_style = space
+max_line_length = 79
+
+[*.sh]
+end_of_line = lf
+
+[Makefile]
+indent_style = tab
diff --git a/.github/workflows/README.md b/.github/workflows/README.md
new file mode 100755
index 00000000..101967e4
--- /dev/null
+++ b/.github/workflows/README.md
@@ -0,0 +1,198 @@
+# Carpentries Workflows
+
+This directory contains workflows to be used for Lessons using the {sandpaper}
+lesson infrastructure. Two of these workflows require R (`sandpaper-main.yaml`
+and `pr-receive.yaml`) and the rest are bots to handle pull request management.
+
+These workflows will likely change as {sandpaper} evolves, so it is important to
+keep them up-to-date. To do this in your lesson you can do the following in your
+R console:
+
+```r
+# Install/Update sandpaper
+options(repos = c(carpentries = "https://carpentries.r-universe.dev/",
+ CRAN = "https://cloud.r-project.org"))
+install.packages("sandpaper")
+
+# update the workflows in your lesson
+library("sandpaper")
+update_github_workflows()
+```
+
+Inside this folder, you will find a file called `sandpaper-version.txt`, which
+will contain a version number for sandpaper. This will be used in the future to
+alert you if a workflow update is needed.
+
+What follows are the descriptions of the workflow files:
+
+## Deployment
+
+### 01 Build and Deploy (sandpaper-main.yaml)
+
+This is the main driver that will only act on the main branch of the repository.
+This workflow does the following:
+
+ 1. checks out the lesson
+ 2. provisions the following resources
+ - R
+ - pandoc
+ - lesson infrastructure (stored in a cache)
+ - lesson dependencies if needed (stored in a cache)
+ 3. builds the lesson via `sandpaper:::ci_deploy()`
+
+#### Caching
+
+This workflow has two caches; one cache is for the lesson infrastructure and
+the other is for the lesson dependencies if the lesson contains rendered
+content. These caches are invalidated by new versions of the infrastructure and
+the `renv.lock` file, respectively. If there is a problem with a cache, it must
+be invalidated manually. You will need maintainer access to the repository, and
+you can either go to the actions tab and [click on the caches button to find
+and invalidate the failing cache](https://github.blog/changelog/2022-10-20-manage-caches-in-your-actions-workflows-from-web-interface/)
+or set the `CACHE_VERSION` secret to the current date (which will
+invalidate all of the caches).
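As a sketch of the second route (assuming you have the GitHub CLI, `gh`, installed and authenticated against the repository), the secret can be set to today's date from the command line:

```shell
# Set the CACHE_VERSION repository secret to today's date, which
# invalidates all of the workflow caches.
# Assumes an authenticated GitHub CLI (`gh`).
gh secret set CACHE_VERSION --body "$(date +%F)"
```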
+
+## Updates
+
+### Setup Information
+
+These workflows run on a schedule and at the maintainer's request. Because they
+create pull requests that update workflows (and require the downstream actions
+to run), they need a special repository/organization secret token called
+`SANDPAPER_WORKFLOW`, and it must have the `public_repo` and `workflow` scopes.
+
+This can be an individual user token, OR it can be a trusted bot account. If you
+have a repository in one of the official Carpentries accounts, then you do not
+need to worry about this token being present because the Carpentries Core Team
+will take care of supplying this token.
+
+If you want to use your personal account, you can go to
+<https://github.com/settings/tokens/new?scopes=repo,workflow>
+to create a token. Once you have created your token, you should copy it to your
+clipboard and then go to your repository's settings > secrets > actions and
+create or edit the `SANDPAPER_WORKFLOW` secret, pasting in the generated token.
+
+If you do not specify your token correctly, the runs will not fail; instead,
+they will give you instructions for providing the token for your repository.
+
+### 02 Maintain: Update Workflow Files (update-workflow.yaml)
+
+The {sandpaper} repository was designed to do as much as possible to separate
+the tools from the content. For local builds, this is absolutely true, but
+there is a minor issue when it comes to workflow files: they must live inside
+the repository.
+
+This workflow ensures that the workflow files are up-to-date. It works by
+downloading the update-workflows.sh script from GitHub and running it. The script
+will do the following:
+
+1. check the recorded version of sandpaper against the current version on github
+2. update the files if there is a difference in versions
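The version comparison in step 1 can be sketched as follows (a minimal illustration, not the actual script; both version strings here are hypothetical stand-ins for the recorded and latest versions):

```shell
# Compare the recorded sandpaper version against the latest release using
# version-aware sorting; if the latest version sorts after the recorded
# one, the workflow files need updating.
recorded="0.11.15"   # would be read from .github/workflows/sandpaper-version.txt
latest="0.16.2"      # would be fetched from GitHub in the real script
newest="$(printf '%s\n%s\n' "$recorded" "$latest" | sort -V | tail -n 1)"
if [ "$newest" != "$recorded" ]; then
  echo "update needed"
fi
```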
+
+After the files are updated, if there are any changes, they are pushed to a
+branch called `update/workflows` and a pull request is created. Maintainers are
+encouraged to review the changes and accept the pull request if the outputs
+are okay.
+
+This update is run ~~weekly or~~ on demand.
+
+### 03 Maintain: Update Package Cache (update-cache.yaml)
+
+For lessons that have generated content, we use {renv} to ensure that the output
+is stable. This is controlled by a single lockfile which documents the packages
+needed for the lesson and the version numbers. This workflow is skipped in
+lessons that do not have generated content.
+
+Because the lessons need to remain current with the package ecosystem, it's a
+good idea to make sure these packages can be updated periodically. The
+update cache workflow will do this by checking for updates, applying them in a
+branch called `updates/packages` and creating a pull request with _only the
+lockfile changed_.
+
+From here, the markdown documents will be rebuilt and you can inspect what has
+changed based on how the packages have updated.
+
+## Pull Request and Review Management
+
+Because our lessons execute code, pull requests are a security risk for any
+lesson and thus have security measures associated with them. **Do not merge any
+pull requests that do not pass checks or that the bots have not commented on.**
+
+This series of workflows all go together and are described in the following
+diagram and the below sections:
+
+
+
+### Pre Flight Pull Request Validation (pr-preflight.yaml)
+
+This workflow runs every time a pull request is created and its purpose is to
+validate that the pull request is okay to run. This means the following things:
+
+1. The pull request does not contain modified workflow files
+2. If the pull request contains modified workflow files, it does not contain
+ modified content files (such as a situation where @carpentries-bot will
+ make an automated pull request)
+3. The pull request does not contain an invalid commit hash (e.g. from a fork
+ that was made before a lesson was transitioned from styles to use the
+ workbench).
+
+Once the checks are finished, a comment is issued to the pull request, which
+will allow maintainers to determine if it is safe to run the
+"Receive Pull Request" workflow from new contributors.
+
+### Receive Pull Request (pr-receive.yaml)
+
+**Note of caution:** This workflow runs arbitrary code by anyone who creates a
+pull request. GitHub has safeguarded the token used in this workflow to have no
+privileges in the repository, but we have taken precautions to protect against
+spoofing.
+
+This workflow is triggered with every push to a pull request. If this workflow
+is already running and a new push is sent to the pull request, the workflow
+running from the previous push will be cancelled and a new workflow run will be
+started.
+
+The first step of this workflow is to check if it is valid (e.g. that no
+workflow files have been modified). If there are workflow files that have been
+modified, a comment is made indicating that the workflow will not run. If
+both a workflow file and lesson content are modified, an error will occur.
+
+The second step (if valid) is to build the generated content from the pull
+request. This builds the content and uploads three artifacts:
+
+1. The pull request number (pr)
+2. A summary of changes after the rendering process (diff)
+3. The rendered files (build)
+
+Because this workflow builds generated content, it follows the same general
+process as the `sandpaper-main` workflow with the same caching mechanisms.
+
+The artifacts produced are used by the next workflow.
+
+### Comment on Pull Request (pr-comment.yaml)
+
+This workflow is triggered if the `pr-receive.yaml` workflow is successful.
+The steps in this workflow are:
+
+1. Test if the workflow run is valid and comment on the pull request with the
+   result of that validation.
+2. If it is valid: create an orphan branch with two commits: the current state
+ of the repository and the proposed changes.
+3. If it is valid: update the pull request comment with the summary of changes.
+
+Importantly: if the pull request is invalid, the branch is not created so any
+malicious code is not published.
+
+From here, the maintainer can request changes from the author and eventually
+either merge or reject the PR. When this happens, if the PR was valid, the
+preview branch needs to be deleted.
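If the automated cleanup ever fails, the preview branch can also be removed by hand. A sketch, where `42` is a hypothetical pull request number (the branch naming pattern `md-outputs-PR-<number>` comes from the `pr-comment.yaml` workflow):

```shell
# Delete the leftover preview branch for (hypothetical) pull request 42.
git push origin --delete md-outputs-PR-42
```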
+
+### Send Close PR Signal (pr-close-signal.yaml)
+
+Triggered any time a pull request is closed. This emits an artifact containing
+the pull request number for the next workflow.
+
+### Remove Pull Request Branch (pr-post-remove-branch.yaml)
+
+Triggered by `pr-close-signal.yaml`. This removes the temporary branch associated with
+the pull request (if it was created).
diff --git a/.github/workflows/pr-close-signal.yaml b/.github/workflows/pr-close-signal.yaml
new file mode 100755
index 00000000..9b129d5d
--- /dev/null
+++ b/.github/workflows/pr-close-signal.yaml
@@ -0,0 +1,23 @@
+name: "Bot: Send Close Pull Request Signal"
+
+on:
+ pull_request:
+ types:
+ [closed]
+
+jobs:
+ send-close-signal:
+ name: "Send closing signal"
+ runs-on: ubuntu-latest
+ if: ${{ github.event.action == 'closed' }}
+ steps:
+ - name: "Create PRtifact"
+ run: |
+ mkdir -p ./pr
+ printf ${{ github.event.number }} > ./pr/NUM
+ - name: Upload Diff
+ uses: actions/upload-artifact@v3
+ with:
+ name: pr
+ path: ./pr
+
diff --git a/.github/workflows/pr-comment.yaml b/.github/workflows/pr-comment.yaml
new file mode 100755
index 00000000..bb2eb03c
--- /dev/null
+++ b/.github/workflows/pr-comment.yaml
@@ -0,0 +1,185 @@
+name: "Bot: Comment on the Pull Request"
+
+# read-write repo token
+# access to secrets
+on:
+ workflow_run:
+ workflows: ["Receive Pull Request"]
+ types:
+ - completed
+
+concurrency:
+ group: pr-${{ github.event.workflow_run.pull_requests[0].number }}
+ cancel-in-progress: true
+
+
+jobs:
+ # Pull requests are valid if:
+ # - they match the sha of the workflow run head commit
+ # - they are open
+ # - no .github files were committed
+ test-pr:
+ name: "Test if pull request is valid"
+ runs-on: ubuntu-latest
+ if: >
+ github.event.workflow_run.event == 'pull_request' &&
+ github.event.workflow_run.conclusion == 'success'
+ outputs:
+ is_valid: ${{ steps.check-pr.outputs.VALID }}
+ payload: ${{ steps.check-pr.outputs.payload }}
+ number: ${{ steps.get-pr.outputs.NUM }}
+ msg: ${{ steps.check-pr.outputs.MSG }}
+ steps:
+ - name: 'Download PR artifact'
+ id: dl
+ uses: carpentries/actions/download-workflow-artifact@main
+ with:
+ run: ${{ github.event.workflow_run.id }}
+ name: 'pr'
+
+ - name: "Get PR Number"
+ if: ${{ steps.dl.outputs.success == 'true' }}
+ id: get-pr
+ run: |
+ unzip pr.zip
+ echo "NUM=$(<./NR)" >> $GITHUB_OUTPUT
+
+ - name: "Fail if PR number was not present"
+ id: bad-pr
+ if: ${{ steps.dl.outputs.success != 'true' }}
+ run: |
+ echo '::error::A pull request number was not recorded. The pull request that triggered this workflow is likely malicious.'
+ exit 1
+ - name: "Get Invalid Hashes File"
+ id: hash
+ run: |
+        echo "json<<EOF" >> $GITHUB_OUTPUT
+        echo "$(curl -sL 'https://files.carpentries.org/invalid-hashes.json')" >> $GITHUB_OUTPUT
+        echo "EOF" >> $GITHUB_OUTPUT
+ - name: "Check PR"
+ id: check-pr
+ if: ${{ steps.dl.outputs.success == 'true' }}
+ uses: carpentries/actions/check-valid-pr@main
+ with:
+ pr: ${{ steps.get-pr.outputs.NUM }}
+ sha: ${{ github.event.workflow_run.head_sha }}
+ headroom: 3 # if it's within the last three commits, we can keep going, because it's likely rapid-fire
+ invalid: ${{ fromJSON(steps.hash.outputs.json)[github.repository] }}
+ fail_on_error: true
+
+ # Create an orphan branch on this repository with two commits
+ # - the current HEAD of the md-outputs branch
+ # - the output from running the current HEAD of the pull request through
+ # the md generator
+ create-branch:
+ name: "Create Git Branch"
+ needs: test-pr
+ runs-on: ubuntu-latest
+ if: ${{ needs.test-pr.outputs.is_valid == 'true' }}
+ env:
+ NR: ${{ needs.test-pr.outputs.number }}
+ permissions:
+ contents: write
+ steps:
+ - name: 'Checkout md outputs'
+ uses: actions/checkout@v3
+ with:
+ ref: md-outputs
+ path: built
+ fetch-depth: 1
+
+ - name: 'Download built markdown'
+ id: dl
+ uses: carpentries/actions/download-workflow-artifact@main
+ with:
+ run: ${{ github.event.workflow_run.id }}
+ name: 'built'
+
+ - if: ${{ steps.dl.outputs.success == 'true' }}
+ run: unzip built.zip
+
+ - name: "Create orphan and push"
+ if: ${{ steps.dl.outputs.success == 'true' }}
+ run: |
+ cd built/
+ git config --local user.email "actions@github.com"
+ git config --local user.name "GitHub Actions"
+ CURR_HEAD=$(git rev-parse HEAD)
+ git checkout --orphan md-outputs-PR-${NR}
+ git add -A
+ git commit -m "source commit: ${CURR_HEAD}"
+ ls -A | grep -v '^.git$' | xargs -I _ rm -r '_'
+ cd ..
+ unzip -o -d built built.zip
+ cd built
+ git add -A
+ git commit --allow-empty -m "differences for PR #${NR}"
+ git push -u --force --set-upstream origin md-outputs-PR-${NR}
+
+ # Comment on the Pull Request with a link to the branch and the diff
+ comment-pr:
+ name: "Comment on Pull Request"
+ needs: [test-pr, create-branch]
+ runs-on: ubuntu-latest
+ if: ${{ needs.test-pr.outputs.is_valid == 'true' }}
+ env:
+ NR: ${{ needs.test-pr.outputs.number }}
+ permissions:
+ pull-requests: write
+ steps:
+ - name: 'Download comment artifact'
+ id: dl
+ uses: carpentries/actions/download-workflow-artifact@main
+ with:
+ run: ${{ github.event.workflow_run.id }}
+ name: 'diff'
+
+ - if: ${{ steps.dl.outputs.success == 'true' }}
+ run: unzip ${{ github.workspace }}/diff.zip
+
+ - name: "Comment on PR"
+ id: comment-diff
+ if: ${{ steps.dl.outputs.success == 'true' }}
+ uses: carpentries/actions/comment-diff@main
+ with:
+ pr: ${{ env.NR }}
+ path: ${{ github.workspace }}/diff.md
+
+ # Comment if the PR is open and matches the SHA, but the workflow files have
+ # changed
+ comment-changed-workflow:
+ name: "Comment if workflow files have changed"
+ needs: test-pr
+ runs-on: ubuntu-latest
+ if: ${{ always() && needs.test-pr.outputs.is_valid == 'false' }}
+ env:
+ NR: ${{ github.event.workflow_run.pull_requests[0].number }}
+ body: ${{ needs.test-pr.outputs.msg }}
+ permissions:
+ pull-requests: write
+ steps:
+ - name: 'Check for spoofing'
+ id: dl
+ uses: carpentries/actions/download-workflow-artifact@main
+ with:
+ run: ${{ github.event.workflow_run.id }}
+ name: 'built'
+
+ - name: 'Alert if spoofed'
+ id: spoof
+ if: ${{ steps.dl.outputs.success == 'true' }}
+ run: |
+        echo 'body<<EOF' >> $GITHUB_ENV
+ echo '' >> $GITHUB_ENV
+ echo '## :x: DANGER :x:' >> $GITHUB_ENV
+ echo 'This pull request has modified workflows that created output. Close this now.' >> $GITHUB_ENV
+ echo '' >> $GITHUB_ENV
+ echo 'EOF' >> $GITHUB_ENV
+
+ - name: "Comment on PR"
+ id: comment-diff
+ uses: carpentries/actions/comment-diff@main
+ with:
+ pr: ${{ env.NR }}
+ body: ${{ env.body }}
+
diff --git a/.github/workflows/pr-post-remove-branch.yaml b/.github/workflows/pr-post-remove-branch.yaml
new file mode 100755
index 00000000..62c2e98d
--- /dev/null
+++ b/.github/workflows/pr-post-remove-branch.yaml
@@ -0,0 +1,32 @@
+name: "Bot: Remove Temporary PR Branch"
+
+on:
+ workflow_run:
+ workflows: ["Bot: Send Close Pull Request Signal"]
+ types:
+ - completed
+
+jobs:
+ delete:
+ name: "Delete branch from Pull Request"
+ runs-on: ubuntu-latest
+ if: >
+ github.event.workflow_run.event == 'pull_request' &&
+ github.event.workflow_run.conclusion == 'success'
+ permissions:
+ contents: write
+ steps:
+ - name: 'Download artifact'
+ uses: carpentries/actions/download-workflow-artifact@main
+ with:
+ run: ${{ github.event.workflow_run.id }}
+ name: pr
+ - name: "Get PR Number"
+ id: get-pr
+ run: |
+ unzip pr.zip
+ echo "NUM=$(<./NUM)" >> $GITHUB_OUTPUT
+ - name: 'Remove branch'
+ uses: carpentries/actions/remove-branch@main
+ with:
+ pr: ${{ steps.get-pr.outputs.NUM }}
diff --git a/.github/workflows/pr-preflight.yaml b/.github/workflows/pr-preflight.yaml
new file mode 100755
index 00000000..d0d7420d
--- /dev/null
+++ b/.github/workflows/pr-preflight.yaml
@@ -0,0 +1,39 @@
+name: "Pull Request Preflight Check"
+
+on:
+ pull_request_target:
+ branches:
+ ["main"]
+ types:
+ ["opened", "synchronize", "reopened"]
+
+jobs:
+ test-pr:
+ name: "Test if pull request is valid"
+ if: ${{ github.event.action != 'closed' }}
+ runs-on: ubuntu-latest
+ outputs:
+ is_valid: ${{ steps.check-pr.outputs.VALID }}
+ permissions:
+ pull-requests: write
+ steps:
+ - name: "Get Invalid Hashes File"
+ id: hash
+ run: |
+        echo "json<<EOF" >> $GITHUB_OUTPUT
+        echo "$(curl -sL 'https://files.carpentries.org/invalid-hashes.json')" >> $GITHUB_OUTPUT
+        echo "EOF" >> $GITHUB_OUTPUT
+ - name: "Check PR"
+ id: check-pr
+ uses: carpentries/actions/check-valid-pr@main
+ with:
+ pr: ${{ github.event.number }}
+ invalid: ${{ fromJSON(steps.hash.outputs.json)[github.repository] }}
+ fail_on_error: true
+ - name: "Comment result of validation"
+ id: comment-diff
+ if: ${{ always() }}
+ uses: carpentries/actions/comment-diff@main
+ with:
+ pr: ${{ github.event.number }}
+ body: ${{ steps.check-pr.outputs.MSG }}
diff --git a/.github/workflows/pr-receive.yaml b/.github/workflows/pr-receive.yaml
new file mode 100755
index 00000000..371ef542
--- /dev/null
+++ b/.github/workflows/pr-receive.yaml
@@ -0,0 +1,131 @@
+name: "Receive Pull Request"
+
+on:
+ pull_request:
+ types:
+ [opened, synchronize, reopened]
+
+concurrency:
+ group: ${{ github.ref }}
+ cancel-in-progress: true
+
+jobs:
+ test-pr:
+ name: "Record PR number"
+ if: ${{ github.event.action != 'closed' }}
+ runs-on: ubuntu-latest
+ outputs:
+ is_valid: ${{ steps.check-pr.outputs.VALID }}
+ steps:
+ - name: "Record PR number"
+ id: record
+ if: ${{ always() }}
+ run: |
+ echo ${{ github.event.number }} > ${{ github.workspace }}/NR # 2022-03-02: artifact name fixed to be NR
+ - name: "Upload PR number"
+ id: upload
+ if: ${{ always() }}
+ uses: actions/upload-artifact@v3
+ with:
+ name: pr
+ path: ${{ github.workspace }}/NR
+ - name: "Get Invalid Hashes File"
+ id: hash
+ run: |
+        echo "json<<EOF" >> $GITHUB_OUTPUT
+        echo "$(curl -sL 'https://files.carpentries.org/invalid-hashes.json')" >> $GITHUB_OUTPUT
+        echo "EOF" >> $GITHUB_OUTPUT
+ - name: "echo output"
+ run: |
+ echo "${{ steps.hash.outputs.json }}"
+ - name: "Check PR"
+ id: check-pr
+ uses: carpentries/actions/check-valid-pr@main
+ with:
+ pr: ${{ github.event.number }}
+ invalid: ${{ fromJSON(steps.hash.outputs.json)[github.repository] }}
+
+ build-md-source:
+ name: "Build markdown source files if valid"
+ needs: test-pr
+ runs-on: ubuntu-latest
+ if: ${{ needs.test-pr.outputs.is_valid == 'true' }}
+ env:
+ GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
+ RENV_PATHS_ROOT: ~/.local/share/renv/
+ CHIVE: ${{ github.workspace }}/site/chive
+ PR: ${{ github.workspace }}/site/pr
+ MD: ${{ github.workspace }}/site/built
+ steps:
+ - name: "Check Out Main Branch"
+ uses: actions/checkout@v3
+
+ - name: "Check Out Staging Branch"
+ uses: actions/checkout@v3
+ with:
+ ref: md-outputs
+ path: ${{ env.MD }}
+
+ - name: "Set up R"
+ uses: r-lib/actions/setup-r@v2
+ with:
+ use-public-rspm: true
+ install-r: false
+
+ - name: "Set up Pandoc"
+ uses: r-lib/actions/setup-pandoc@v2
+
+ - name: "Setup Lesson Engine"
+ uses: carpentries/actions/setup-sandpaper@main
+ with:
+ cache-version: ${{ secrets.CACHE_VERSION }}
+
+ - name: "Setup Package Cache"
+ uses: carpentries/actions/setup-lesson-deps@main
+ with:
+ cache-version: ${{ secrets.CACHE_VERSION }}
+
+ - name: "Validate and Build Markdown"
+ id: build-site
+ run: |
+ sandpaper::package_cache_trigger(TRUE)
+ sandpaper::validate_lesson(path = '${{ github.workspace }}')
+ sandpaper:::build_markdown(path = '${{ github.workspace }}', quiet = FALSE)
+ shell: Rscript {0}
+
+ - name: "Generate Artifacts"
+ id: generate-artifacts
+ run: |
+ sandpaper:::ci_bundle_pr_artifacts(
+ repo = '${{ github.repository }}',
+ pr_number = '${{ github.event.number }}',
+ path_md = '${{ env.MD }}',
+ path_pr = '${{ env.PR }}',
+ path_archive = '${{ env.CHIVE }}',
+ branch = 'md-outputs'
+ )
+ shell: Rscript {0}
+
+ - name: "Upload PR"
+ uses: actions/upload-artifact@v3
+ with:
+ name: pr
+ path: ${{ env.PR }}
+
+ - name: "Upload Diff"
+ uses: actions/upload-artifact@v3
+ with:
+ name: diff
+ path: ${{ env.CHIVE }}
+ retention-days: 1
+
+ - name: "Upload Build"
+ uses: actions/upload-artifact@v3
+ with:
+ name: built
+ path: ${{ env.MD }}
+ retention-days: 1
+
+ - name: "Teardown"
+ run: sandpaper::reset_site()
+ shell: Rscript {0}
diff --git a/.github/workflows/sandpaper-main.yaml b/.github/workflows/sandpaper-main.yaml
new file mode 100755
index 00000000..e17707ac
--- /dev/null
+++ b/.github/workflows/sandpaper-main.yaml
@@ -0,0 +1,61 @@
+name: "01 Build and Deploy Site"
+
+on:
+ push:
+ branches:
+ - main
+ - master
+ schedule:
+ - cron: '0 0 * * 2'
+ workflow_dispatch:
+ inputs:
+ name:
+ description: 'Who triggered this build?'
+ required: true
+ default: 'Maintainer (via GitHub)'
+ reset:
+ description: 'Reset cached markdown files'
+ required: false
+ default: false
+ type: boolean
+jobs:
+ full-build:
+ name: "Build Full Site"
+ runs-on: ubuntu-latest
+ permissions:
+ checks: write
+ contents: write
+ pages: write
+ env:
+ GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
+ RENV_PATHS_ROOT: ~/.local/share/renv/
+ steps:
+
+ - name: "Checkout Lesson"
+ uses: actions/checkout@v3
+
+ - name: "Set up R"
+ uses: r-lib/actions/setup-r@v2
+ with:
+ use-public-rspm: true
+ install-r: false
+
+ - name: "Set up Pandoc"
+ uses: r-lib/actions/setup-pandoc@v2
+
+ - name: "Setup Lesson Engine"
+ uses: carpentries/actions/setup-sandpaper@main
+ with:
+ cache-version: ${{ secrets.CACHE_VERSION }}
+
+ - name: "Setup Package Cache"
+ uses: carpentries/actions/setup-lesson-deps@main
+ with:
+ cache-version: ${{ secrets.CACHE_VERSION }}
+
+ - name: "Deploy Site"
+ run: |
+ reset <- "${{ github.event.inputs.reset }}" == "true"
+ sandpaper::package_cache_trigger(TRUE)
+ sandpaper:::ci_deploy(reset = reset)
+ shell: Rscript {0}
diff --git a/.github/workflows/sandpaper-version.txt b/.github/workflows/sandpaper-version.txt
new file mode 100644
index 00000000..4aa09069
--- /dev/null
+++ b/.github/workflows/sandpaper-version.txt
@@ -0,0 +1 @@
+0.11.15
diff --git a/.github/workflows/update-cache.yaml b/.github/workflows/update-cache.yaml
new file mode 100755
index 00000000..676d7424
--- /dev/null
+++ b/.github/workflows/update-cache.yaml
@@ -0,0 +1,125 @@
+name: "03 Maintain: Update Package Cache"
+
+on:
+ workflow_dispatch:
+ inputs:
+ name:
+ description: 'Who triggered this build (enter github username to tag yourself)?'
+ required: true
+ default: 'monthly run'
+ schedule:
+ # Run every tuesday
+ - cron: '0 0 * * 2'
+
+jobs:
+ preflight:
+ name: "Preflight Check"
+ runs-on: ubuntu-latest
+ outputs:
+ ok: ${{ steps.check.outputs.ok }}
+ steps:
+ - id: check
+ run: |
+ if [[ ${{ github.event_name }} == 'workflow_dispatch' ]]; then
+ echo "ok=true" >> $GITHUB_OUTPUT
+ echo "Running on request"
+ # using single brackets here to avoid 08 being interpreted as octal
+ # https://github.com/carpentries/sandpaper/issues/250
+ elif [ `date +%d` -le 7 ]; then
+ # If the Tuesday lands in the first week of the month, run it
+ echo "ok=true" >> $GITHUB_OUTPUT
+ echo "Running on schedule"
+ else
+ echo "ok=false" >> $GITHUB_OUTPUT
+ echo "Not Running Today"
+ fi
+
+ check_renv:
+ name: "Check if We Need {renv}"
+ runs-on: ubuntu-latest
+ needs: preflight
+ if: ${{ needs.preflight.outputs.ok == 'true'}}
+ outputs:
+ needed: ${{ steps.renv.outputs.exists }}
+ steps:
+ - name: "Checkout Lesson"
+ uses: actions/checkout@v3
+ - id: renv
+ run: |
+ if [[ -d renv ]]; then
+ echo "exists=true" >> $GITHUB_OUTPUT
+ fi
+
+ check_token:
+ name: "Check SANDPAPER_WORKFLOW token"
+ runs-on: ubuntu-latest
+ needs: check_renv
+ if: ${{ needs.check_renv.outputs.needed == 'true' }}
+ outputs:
+ workflow: ${{ steps.validate.outputs.wf }}
+ repo: ${{ steps.validate.outputs.repo }}
+ steps:
+ - name: "validate token"
+ id: validate
+ uses: carpentries/actions/check-valid-credentials@main
+ with:
+ token: ${{ secrets.SANDPAPER_WORKFLOW }}
+
+ update_cache:
+ name: "Update Package Cache"
+ needs: check_token
+    if: ${{ needs.check_token.outputs.repo == 'true' }}
+ runs-on: ubuntu-latest
+ env:
+ GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
+ RENV_PATHS_ROOT: ~/.local/share/renv/
+ steps:
+
+ - name: "Checkout Lesson"
+ uses: actions/checkout@v3
+
+ - name: "Set up R"
+ uses: r-lib/actions/setup-r@v2
+ with:
+ use-public-rspm: true
+ install-r: false
+
+ - name: "Update {renv} deps and determine if a PR is needed"
+ id: update
+ uses: carpentries/actions/update-lockfile@main
+ with:
+ cache-version: ${{ secrets.CACHE_VERSION }}
+
+ - name: Create Pull Request
+ id: cpr
+ if: ${{ steps.update.outputs.n > 0 }}
+ uses: carpentries/create-pull-request@main
+ with:
+ token: ${{ secrets.SANDPAPER_WORKFLOW }}
+ delete-branch: true
+ branch: "update/packages"
+ commit-message: "[actions] update ${{ steps.update.outputs.n }} packages"
+ title: "Update ${{ steps.update.outputs.n }} packages"
+ body: |
+ :robot: This is an automated build
+
+ This will update ${{ steps.update.outputs.n }} packages in your lesson with the following versions:
+
+ ```
+ ${{ steps.update.outputs.report }}
+ ```
+
+ :stopwatch: In a few minutes, a comment will appear that will show you how the output has changed based on these updates.
+
+ If you want to inspect these changes locally, you can use the following code to check out a new branch:
+
+ ```bash
+ git fetch origin update/packages
+ git checkout update/packages
+ ```
+
+ - Auto-generated by [create-pull-request][1] on ${{ steps.update.outputs.date }}
+
+ [1]: https://github.com/carpentries/create-pull-request/tree/main
+ labels: "type: package cache"
+ draft: false
diff --git a/.github/workflows/update-workflows.yaml b/.github/workflows/update-workflows.yaml
new file mode 100755
index 00000000..288bcd13
--- /dev/null
+++ b/.github/workflows/update-workflows.yaml
@@ -0,0 +1,66 @@
+name: "02 Maintain: Update Workflow Files"
+
+on:
+ workflow_dispatch:
+ inputs:
+ name:
+ description: 'Who triggered this build (enter github username to tag yourself)?'
+ required: true
+ default: 'weekly run'
+ clean:
+ description: 'Workflow files/file extensions to clean (no wildcards, enter "" for none)'
+ required: false
+ default: '.yaml'
+ schedule:
+ # Run every Tuesday
+ - cron: '0 0 * * 2'
+
+jobs:
+ check_token:
+ name: "Check SANDPAPER_WORKFLOW token"
+ runs-on: ubuntu-latest
+ outputs:
+ workflow: ${{ steps.validate.outputs.wf }}
+ repo: ${{ steps.validate.outputs.repo }}
+ steps:
+ - name: "validate token"
+ id: validate
+ uses: carpentries/actions/check-valid-credentials@main
+ with:
+ token: ${{ secrets.SANDPAPER_WORKFLOW }}
+
+ update_workflow:
+ name: "Update Workflow"
+ runs-on: ubuntu-latest
+ needs: check_token
+ if: ${{ needs.check_token.outputs.workflow == 'true' }}
+ steps:
+ - name: "Checkout Repository"
+ uses: actions/checkout@v3
+
+ - name: Update Workflows
+ id: update
+ uses: carpentries/actions/update-workflows@main
+ with:
+ clean: ${{ github.event.inputs.clean }}
+
+ - name: Create Pull Request
+ id: cpr
+ if: "${{ steps.update.outputs.new }}"
+ uses: carpentries/create-pull-request@main
+ with:
+ token: ${{ secrets.SANDPAPER_WORKFLOW }}
+ delete-branch: true
+ branch: "update/workflows"
+ commit-message: "[actions] update sandpaper workflow to version ${{ steps.update.outputs.new }}"
+ title: "Update Workflows to Version ${{ steps.update.outputs.new }}"
+ body: |
+ :robot: This is an automated build
+
+ Update Workflows from sandpaper version ${{ steps.update.outputs.old }} -> ${{ steps.update.outputs.new }}
+
+ - Auto-generated by [create-pull-request][1] on ${{ steps.update.outputs.date }}
+
+ [1]: https://github.com/carpentries/create-pull-request/tree/main
+ labels: "type: template and tools"
+ draft: false
diff --git a/.github/workflows/workbench-beta-phase.yml b/.github/workflows/workbench-beta-phase.yml
new file mode 100644
index 00000000..2faa25d9
--- /dev/null
+++ b/.github/workflows/workbench-beta-phase.yml
@@ -0,0 +1,60 @@
+name: "Deploy to AWS"
+
+on:
+ workflow_run:
+ workflows: ["01 Build and Deploy Site"]
+ types:
+ - completed
+ workflow_dispatch:
+
+jobs:
+ preflight:
+ name: "Preflight Check"
+ runs-on: ubuntu-latest
+ outputs:
+ ok: ${{ steps.check.outputs.ok }}
+ folder: ${{ steps.check.outputs.folder }}
+ steps:
+ - id: check
+ run: |
+ if [[ -z "${{ secrets.DISTRIBUTION }}" || -z "${{ secrets.AWS_ACCESS_KEY_ID }}" || -z "${{ secrets.AWS_SECRET_ACCESS_KEY }}" ]]; then
+ echo ":information_source: No site configured" >> $GITHUB_STEP_SUMMARY
+ echo "" >> $GITHUB_STEP_SUMMARY
+ echo 'To deploy the preview on AWS, you need the `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` and `DISTRIBUTION` secrets set up' >> $GITHUB_STEP_SUMMARY
+ else
+ echo "::set-output name=folder::"$(sed -E 's^.+/(.+)^\1^' <<< ${{ github.repository }})
+ echo "::set-output name=ok::true"
+ fi
+
+ full-build:
+ name: "Deploy to AWS"
+ needs: [preflight]
+ if: ${{ needs.preflight.outputs.ok }}
+ runs-on: ubuntu-latest
+ steps:
+
+ - name: "Checkout site folder"
+ uses: actions/checkout@v3
+ with:
+ ref: 'gh-pages'
+ path: 'source'
+
+ - name: "Deploy to Bucket"
+ uses: jakejarvis/s3-sync-action@v0.5.1
+ with:
+ args: --acl public-read --follow-symlinks --delete --exclude '.git/*'
+ env:
+ AWS_S3_BUCKET: preview.carpentries.org
+ AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
+ AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+ SOURCE_DIR: 'source'
+ DEST_DIR: ${{ needs.preflight.outputs.folder }}
+
+ - name: "Invalidate CloudFront"
+ uses: chetan/invalidate-cloudfront-action@master
+ env:
+ PATHS: /*
+ AWS_REGION: 'us-east-1'
+ DISTRIBUTION: ${{ secrets.DISTRIBUTION }}
+ AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
+ AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 00000000..b8ab7062
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,55 @@
+# sandpaper files
+episodes/*html
+site/*
+!site/README.md
+
+# History files
+.Rhistory
+.Rapp.history
+# Session Data files
+.RData
+# User-specific files
+.Ruserdata
+# Example code in package build process
+*-Ex.R
+# Output files from R CMD build
+/*.tar.gz
+# Output files from R CMD check
+/*.Rcheck/
+# RStudio files
+.Rproj.user/
+# produced vignettes
+vignettes/*.html
+vignettes/*.pdf
+# OAuth2 token, see https://github.com/hadley/httr/releases/tag/v0.3
+.httr-oauth
+# knitr and R markdown default cache directories
+*_cache/
+/cache/
+# Temporary files created by R markdown
+*.utf8.md
+*.knit.md
+# R Environment Variables
+.Renviron
+# pkgdown site
+docs/
+# translation temp files
+po/*~
+# renv detritus
+renv/sandbox/
+*.pyc
+*~
+.DS_Store
+.ipynb_checkpoints
+.sass-cache
+.jekyll-cache/
+.jekyll-metadata
+__pycache__
+_site
+.Rproj.user
+.bundle/
+.vendor/
+vendor/
+.docker-vendor/
+Gemfile.lock
+.*history
diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
new file mode 100644
index 00000000..f19b8049
--- /dev/null
+++ b/CODE_OF_CONDUCT.md
@@ -0,0 +1,13 @@
+---
+title: "Contributor Code of Conduct"
+---
+
+As contributors and maintainers of this project,
+we pledge to follow the [The Carpentries Code of Conduct][coc].
+
+Instances of abusive, harassing, or otherwise unacceptable behavior
+may be reported by following our [reporting guidelines][coc-reporting].
+
+
+[coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html
+[coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
new file mode 100644
index 00000000..ec44704c
--- /dev/null
+++ b/CONTRIBUTING.md
@@ -0,0 +1,121 @@
+## Contributing
+
+[The Carpentries][cp-site] ([Software Carpentry][swc-site], [Data
+Carpentry][dc-site], and [Library Carpentry][lc-site]) are open source
+projects, and we welcome contributions of all kinds: new lessons, fixes to
+existing material, bug reports, and reviews of proposed changes are all
+welcome.
+
+### Contributor Agreement
+
+By contributing, you agree that we may redistribute your work under [our
+license](LICENSE.md). In exchange, we will address your issues and/or assess
+your change proposal as promptly as we can, and help you become a member of our
+community. Everyone involved in [The Carpentries][cp-site] agrees to abide by
+our [code of conduct](CODE_OF_CONDUCT.md).
+
+### How to Contribute
+
+The easiest way to get started is to file an issue to tell us about a spelling
+mistake, some awkward wording, or a factual error. This is a good way to
+introduce yourself and to meet some of our community members.
+
+1. If you do not have a [GitHub][github] account, you can [send us comments by
+ email][contact]. However, we will be able to respond more quickly if you use
+ one of the other methods described below.
+
+2. If you have a [GitHub][github] account, or are willing to [create
+ one][github-join], but do not know how to use Git, you can report problems
+ or suggest improvements by [creating an issue][issues]. This allows us to
+ assign the item to someone and to respond to it in a threaded discussion.
+
+3. If you are comfortable with Git, and would like to add or change material,
+ you can submit a pull request (PR). Instructions for doing this are
+ [included below](#using-github).
+
+Note: if you want to build the website locally, please refer to [The Workbench
+documentation][template-doc].
+
+### Where to Contribute
+
+1. If you wish to change this lesson, add issues and pull requests here.
+2. If you wish to change the template used for workshop websites, please refer
+ to [The Workbench documentation][template-doc].
+
+
+### What to Contribute
+
+There are many ways to contribute, from writing new exercises and improving
+existing ones to updating or filling in the documentation and submitting [bug
+reports][issues] about things that do not work, are not clear, or are missing.
+If you are looking for ideas, please see [the list of issues for this
+repository][repo], or the issues for [Data Carpentry][dc-issues], [Library
+Carpentry][lc-issues], and [Software Carpentry][swc-issues] projects.
+
+Comments on issues and reviews of pull requests are just as welcome: we are
+smarter together than we are on our own. **Reviews from novices and newcomers
+are particularly valuable**: it's easy for people who have been using these
+lessons for a while to forget how impenetrable some of this material can be, so
+fresh eyes are always welcome.
+
+### What *Not* to Contribute
+
+Our lessons already contain more material than we can cover in a typical
+workshop, so we are usually *not* looking for more concepts or tools to add to
+them. As a rule, if you want to introduce a new idea, you must (a) estimate how
+long it will take to teach and (b) explain what you would take out to make room
+for it. The first encourages contributors to be honest about requirements; the
+second, to think hard about priorities.
+
+We are also not looking for exercises or other material that only run on one
+platform. Our workshops typically contain a mixture of Windows, macOS, and
+Linux users; in order to be usable, our lessons must run equally well on all
+three.
+
+### Using GitHub
+
+If you choose to contribute via GitHub, you may want to look at [How to
+Contribute to an Open Source Project on GitHub][how-contribute]. In brief, we
+use [GitHub flow][github-flow] to manage changes:
+
+1. Create a new branch in your desktop copy of this repository for each
+ significant change.
+2. Commit the change in that branch.
+3. Push that branch to your fork of this repository on GitHub.
+4. Submit a pull request from that branch to the [upstream repository][repo].
+5. If you receive feedback, make changes on your desktop and push to your
+ branch on GitHub: the pull request will update automatically.
+
+NB: The published copy of the lesson is usually in the `main` branch.
+
+Each lesson has a team of maintainers who review issues and pull requests or
+encourage others to do so. The maintainers are community volunteers, and have
+final say over what gets merged into the lesson.
+
+### Other Resources
+
+The Carpentries is a global organisation with volunteers and learners all over
+the world. We share values of inclusivity and a passion for sharing knowledge,
+teaching and learning. There are several ways to connect with The Carpentries
+community listed at <https://carpentries.org/connect/> including via social
+media, slack, newsletters, and email lists. You can also [reach us by
+email][contact].
+
+[repo]: https://example.com/FIXME
+[contact]: mailto:team@carpentries.org
+[cp-site]: https://carpentries.org/
+[dc-issues]: https://github.com/issues?q=user%3Adatacarpentry
+[dc-lessons]: https://datacarpentry.org/lessons/
+[dc-site]: https://datacarpentry.org/
+[discuss-list]: https://lists.software-carpentry.org/listinfo/discuss
+[github]: https://github.com
+[github-flow]: https://guides.github.com/introduction/flow/
+[github-join]: https://github.com/join
+[how-contribute]: https://egghead.io/series/how-to-contribute-to-an-open-source-project-on-github
+[issues]: https://carpentries.org/help-wanted-issues/
+[lc-issues]: https://github.com/issues?q=user%3ALibraryCarpentry
+[swc-issues]: https://github.com/issues?q=user%3Aswcarpentry
+[swc-lessons]: https://software-carpentry.org/lessons/
+[swc-site]: https://software-carpentry.org/
+[lc-site]: https://librarycarpentry.org/
+[template-doc]: https://carpentries.github.io/workbench/
diff --git a/LICENSE.md b/LICENSE.md
new file mode 100644
index 00000000..7632871f
--- /dev/null
+++ b/LICENSE.md
@@ -0,0 +1,79 @@
+---
+title: "Licenses"
+---
+
+## Instructional Material
+
+All Carpentries (Software Carpentry, Data Carpentry, and Library Carpentry)
+instructional material is made available under the [Creative Commons
+Attribution license][cc-by-human]. The following is a human-readable summary of
+(and not a substitute for) the [full legal text of the CC BY 4.0
+license][cc-by-legal].
+
+You are free:
+
+- to **Share**---copy and redistribute the material in any medium or format
+- to **Adapt**---remix, transform, and build upon the material
+
+for any purpose, even commercially.
+
+The licensor cannot revoke these freedoms as long as you follow the license
+terms.
+
+Under the following terms:
+
+- **Attribution**---You must give appropriate credit (mentioning that your work
+ is derived from work that is Copyright (c) The Carpentries and, where
+ practical, linking to <https://carpentries.org/>), provide a [link to the
+ license][cc-by-human], and indicate if changes were made. You may do so in
+ any reasonable manner, but not in any way that suggests the licensor endorses
+ you or your use.
+
+- **No additional restrictions**---You may not apply legal terms or
+ technological measures that legally restrict others from doing anything the
+ license permits. With the understanding that:
+
+Notices:
+
+* You do not have to comply with the license for elements of the material in
+ the public domain or where your use is permitted by an applicable exception
+ or limitation.
+* No warranties are given. The license may not give you all of the permissions
+ necessary for your intended use. For example, other rights such as publicity,
+ privacy, or moral rights may limit how you use the material.
+
+## Software
+
+Except where otherwise noted, the example programs and other software provided
+by The Carpentries are made available under the [OSI][osi]-approved [MIT
+license][mit-license].
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of
+this software and associated documentation files (the "Software"), to deal in
+the Software without restriction, including without limitation the rights to
+use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
+of the Software, and to permit persons to whom the Software is furnished to do
+so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+
+## Trademark
+
+"The Carpentries", "Software Carpentry", "Data Carpentry", and "Library
+Carpentry" and their respective logos are registered trademarks of [Community
+Initiatives][ci].
+
+[cc-by-human]: https://creativecommons.org/licenses/by/4.0/
+[cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode
+[mit-license]: https://opensource.org/licenses/mit-license.html
+[ci]: https://communityin.org/
+[osi]: https://opensource.org
diff --git a/README.md b/README.md
index ce632032..56461546 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,8 @@
-[](https://swc-slack-invite.herokuapp.com/)
-[](https://swcarpentry.slack.com/messages/C9N1K7DCY)
+[](https://swc-slack-invite.herokuapp.com/)
+[](https://swcarpentry.slack.com/messages/C9N1K7DCY)
# organization-genomics
+
Lesson on data organization and project setup for genomics.
+
+
diff --git a/_extras/old-ncbi.md b/_extras/old-ncbi.md
deleted file mode 100644
index f8e09bdc..00000000
--- a/_extras/old-ncbi.md
+++ /dev/null
@@ -1,26 +0,0 @@
----
-layout: page
-title: Old NCBI
----
-
-## Original (older) NCBI instructions
-
-These will be phased out of our lesson when NCBI stops supporting
-the old page versions.
-
-1. Access the Tenaillon dataset from the provided link: [https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP064605](https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP064605). Click on "Revert to the old Run Selector" at the top of the page.
-
-2. You will be presented with the old page for the overall SRA accession SRP064605 - this is a collection of all the experimental data.
-
-
-3. In this window, you will click on the Run Number of the first entry in the “Runs Found” table (see red box above). This will take you to a page that is a run browser. Take a few minutes to examine some of the descriptions on the page.
-
-
-4. Use your browser’s “Back” button or arrow to go back to the ['previous page'](https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP064605). Above where it lists the "312 Runs found" is a line starting with **Total** and you will see there are 312 runs, 109.43 Gb data, and 168.81 Gbases of data. Click the 'RunInfo Table' button and save the file to your Desktop.
-
-We are not downloading any actual sequence data here! This is only a text file that fully describes the entire
-dataset.
-
-You should now have a **tab-delimited** file called `SraRunTable.txt`.
-
-**Return to lesson [Examining Data on the NCBI SRA Database](../03-ncbi-sra/index.html#you-should-now-have-a-file-called-sraruntabletxt) and continue.**
diff --git a/config.yaml b/config.yaml
new file mode 100644
index 00000000..d93d0c6f
--- /dev/null
+++ b/config.yaml
@@ -0,0 +1,84 @@
+#------------------------------------------------------------
+# Values for this lesson.
+#------------------------------------------------------------
+
+# Which carpentry is this (swc, dc, lc, or cp)?
+# swc: Software Carpentry
+# dc: Data Carpentry
+# lc: Library Carpentry
+# cp: Carpentries (to use for instructor training for instance)
+# incubator: The Carpentries Incubator
+carpentry: 'dc'
+
+# Overall title for pages.
+title: 'Project Organization and Management for Genomics'
+
+# Date the lesson was created (YYYY-MM-DD, this is empty by default)
+created:
+
+# Comma-separated list of keywords for the lesson
+keywords: 'software, data, lesson, The Carpentries'
+
+# Life cycle stage of the lesson
+# possible values: pre-alpha, alpha, beta, stable
+life_cycle: 'stable'
+
+# License of the lesson materials (recommended CC-BY 4.0)
+license: 'CC-BY 4.0'
+
+# Link to the source repository for this lesson
+source: 'https://github.com/fishtree-attempt/organization-genomics/'
+
+# Default branch of your lesson
+branch: 'main'
+
+# Who to contact if there are any issues
+contact: 'team@carpentries.org'
+
+# Navigation ------------------------------------------------
+#
+# Use the following menu items to specify the order of
+# individual pages in each dropdown section. Leave blank to
+# include all pages in the folder.
+#
+# Example -------------
+#
+# episodes:
+# - introduction.md
+# - first-steps.md
+#
+# learners:
+# - setup.md
+#
+# instructors:
+# - instructor-notes.md
+#
+# profiles:
+# - one-learner.md
+# - another-learner.md
+
+# Order of episodes in your lesson
+episodes:
+- 01-tidiness.md
+- 02-project-planning.md
+- 03-ncbi-sra.md
+
+# Information for Learners
+learners:
+
+# Information for Instructors
+instructors:
+
+# Learner Profiles
+profiles:
+
+# Customisation ---------------------------------------------
+#
+# This space below is where custom yaml items (e.g. pinning
+# sandpaper and varnish versions) should live
+
+
+url: https://preview.carpentries.org/organization-genomics
+analytics: carpentries
+lang: en
+workbench-beta: 'true'
diff --git a/episodes/01-tidiness.md b/episodes/01-tidiness.md
index 6eaa0d6b..1dc73b7c 100644
--- a/episodes/01-tidiness.md
+++ b/episodes/01-tidiness.md
@@ -1,47 +1,61 @@
---
-title: "Data Tidiness"
+title: Data Tidiness
teaching: 20
exercises: 10
-questions:
-- "What metadata should I collect?"
-- "How should I structure my sequencing data and metadata?"
-objectives:
-- "Think about and understand the types of metadata a sequencing experiment will generate."
-- "Understand the importance of metadata and potential metadata standards."
-- "Explore common formatting challenges in spreadsheet data."
-keypoints:
-- "Metadata is key for you and others to be able to work with your data."
-- "Tabular data needs to be structured to be able to work with it effectively."
---
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Think about and understand the types of metadata a sequencing experiment will generate.
+- Understand the importance of metadata and potential metadata standards.
+- Explore common formatting challenges in spreadsheet data.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- What metadata should I collect?
+- How should I structure my sequencing data and metadata?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
## Introduction
+When we think about the data for a sequencing project, we often start by thinking about the sequencing data that we get back from the sequencing center. However, equally or more important is the data you've generated *about* the sequences before it ever goes to the sequencing center. This is the data about the data, often called the metadata. Without the information about what you sequenced, the sequence data itself is useless.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Discussion
+
+With the person next to you, discuss:
+
+What kinds of data and information have you generated before you sent your DNA/RNA off for sequencing?
-When we think about the data for a sequencing project, we often start by thinking about the sequencing data that we get back from the sequencing center. However, equally or more important is the data you've generated _about_ the sequences before it ever goes to the sequencing center. This is the data about the data, often called the metadata. Without the information about what you sequenced, the sequence data itself is useless.
-
-> ## Discussion
-> With the person next to you, discuss:
->
-> What kinds of data and information have you generated before you sent your DNA/RNA off for sequencing?
->
-> > ## Solution
-> > Types of files and information you have generated:
-> > - Spreadsheet or tabular data with the data from your experiment and whatever you were measuring for your study.
-> > - Lab notebook notes about how you conducted those experiments.
-> > - Spreadsheet or tabular data about the samples you sent off for sequencing. Sequencing centers often have a particular format they need with the name of the sample, DNA concentration and other information.
-> > - Lab notebook notes about how you prepared the DNA/RNA for sequencing and what type of sequencing you're doing, e.g. paired end Illumina HiSeq.
-> > There likely will be other ideas here too.
-> > Was this more information and data than you were expecting?
-> {: .solution}
-{: .challenge}
+::::::::::::::: solution
+
+## Solution
+
+Types of files and information you have generated:
+
+- Spreadsheet or tabular data with the data from your experiment and whatever you were measuring for your study.
+- Lab notebook notes about how you conducted those experiments.
+- Spreadsheet or tabular data about the samples you sent off for sequencing. Sequencing centers often have a particular format they need with the name of the sample, DNA concentration and other information.
+- Lab notebook notes about how you prepared the DNA/RNA for sequencing and what type of sequencing you're doing, e.g. paired-end Illumina HiSeq.
+
+There will likely be other ideas here too.
+Was this more information and data than you were expecting?
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
All of the data and information just discussed can be considered metadata, i.e. data about the data. We want to follow a few guidelines for metadata.
## Notes
Notes about your experiment, including how you prepared your samples for sequencing, should be in your lab notebook, whether that's a physical lab notebook or electronic lab notebook. For guidelines on good lab notebooks, see the Howard Hughes Medical Institute "Making the Right Moves: A Practical Guide to Scientifıc Management for Postdocs and New Faculty" section on
-[Data Management and Laboratory Notebooks](http://www.hhmi.org/sites/default/files/Educational%20Materials/Lab%20Management/Making%20the%20Right%20Moves/moves2_ch8.pdf).
-
+[Data Management and Laboratory Notebooks](https://www.hhmi.org/sites/default/files/Educational%20Materials/Lab%20Management/Making%20the%20Right%20Moves/moves2_ch8.pdf).
Ensure to include dates on your lab notebook pages, the samples themselves, and in
any records about those samples. This will help you correctly associate samples
@@ -49,11 +63,16 @@ other later. Using dates also helps create unique identifiers, because even
if you process the same sample twice, you do not usually do it on the same
day, or if you do, you're aware of it and give them names like A and B.
-> ## Unique identifiers
-> Unique identifiers are a unique name for a sample or set of sequencing data.
-> They are names for that data that only exist for that data. Having these
-> unique names makes them much easier to track later.
-{: .callout}
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Unique identifiers
+
+Unique identifiers are a unique name for a sample or set of sequencing data.
+They are names for that data that only exist for that data. Having these
+unique names makes them much easier to track later.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
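The date-based naming scheme described above can be sketched in a few lines of Python (the sample names, dates, and replicate letters here are hypothetical, for illustration only):

```python
from datetime import date

def make_sample_id(sample_name, processed_on, replicate="A"):
    """Combine the processing date, sample name, and replicate letter
    into a single unique identifier."""
    return f"{processed_on.isoformat()}_{sample_name}_{replicate}"

# The same sample processed twice on the same day still gets distinct names.
id_a = make_sample_id("wt", date(2015, 3, 2), replicate="A")
id_b = make_sample_id("wt", date(2015, 3, 2), replicate="B")
print(id_a)  # 2015-03-02_wt_A
print(id_b)  # 2015-03-02_wt_B
```

Because the date sorts lexicographically, file listings of identifiers built this way are automatically in chronological order.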
## Data about the experiment
@@ -61,15 +80,18 @@ Data about the experiment is usually collected in spreadsheets, like Excel.
What type of data to collect depends on your experiment and there are often guidelines from metadata standards.
-> ## Metadata standards
-> Many fields have particular ways that they structure their metadata so it's
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Metadata standards
+
+Many fields have particular ways that they structure their metadata so it's
consistent and can be used across the field.
->
-> The Digital Curation Center maintains [a list of metadata standards](http://www.dcc.ac.uk/resources/metadata-standards/list) and some that are particularly relevant for genomics data are available from the [Genomics Standards Consortium](http://www.gensc.org/pages/projects.html).
->
-> If there are not metadata standards already, you can think about what the minimum amount of information is that someone would need to know about your data to be able to work with it, without talking to you.
->
-{: .callout}
+
+The Digital Curation Center maintains [a list of metadata standards](https://www.dcc.ac.uk/resources/metadata-standards/list) and some that are particularly relevant for genomics data are available from the [Genomics Standards Consortium](https://www.gensc.org/pages/projects.html).
+
+If there are not metadata standards already, you can think about what the minimum amount of information is that someone would need to know about your data to be able to work with it, without talking to you.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
## Structuring data in spreadsheets
@@ -79,27 +101,36 @@ The cardinal rules of using spreadsheet programs for data:
- Leave the raw data raw - do not change it!
- Put each observation or sample in its own row.
-- Put all your variables in columns - the thing that vary between samples, like ‘strain’ or ‘DNA-concentration’.
-- Have column names be explanatory, but without spaces. Use '-', '_' or [camel case](https://en.wikipedia.org/wiki/Camel_case) instead of a space. For instance 'library-prep-method' or 'LibraryPrep'is better than 'library preparation method' or 'prep', because computers interpret spaces in particular ways.
-- Do not combine multiple pieces of information in one cell. Sometimes it just seems like one thing, but think if that’s the only way
-you’ll want to be able to use or sort that data. For example, instead of having a column with species and strain name (e.g. *E. coli*
-K12) you would have one column with the species name (*E. coli*) and another with the strain name (K12). Depending on the type of
-analysis you want to do, you may even separate the genus and species names into distinct columns.
+- Put all your variables in columns - the things that vary between samples, like 'strain' or 'DNA-concentration'.
+- Have column names be explanatory, but without spaces. Use '-', '\_' or [camel case](https://en.wikipedia.org/wiki/Camel_case) instead of a space. For instance 'library-prep-method' or 'LibraryPrep' is better than 'library preparation method' or 'prep', because computers interpret spaces in particular ways.
+- Do not combine multiple pieces of information in one cell. Sometimes it just seems like one thing, but think about whether that's the only way
+  you'll want to be able to use or sort that data. For example, instead of having a column with species and strain name (e.g. *E. coli*
+  K12) you would have one column with the species name (*E. coli*) and another with the strain name (K12). Depending on the type of
+  analysis you want to do, you may even separate the genus and species names into distinct columns.
- Export the cleaned data to a text-based format like CSV (comma-separated values) format. This ensures that anyone can use the data, and is required by most data repositories.
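As a small illustration of the last two rules, the sketch below splits a combined species/strain cell into two columns and exports the result as CSV using only the Python standard library. The rows here are hypothetical example data, not part of the lesson dataset:

```python
import csv
import io

# Hypothetical messy rows: species and strain combined in one cell.
messy = [
    {"sample": "S1", "organism": "E. coli K12", "DNA-concentration": "25.0"},
    {"sample": "S2", "organism": "E. coli B", "DNA-concentration": "30.5"},
]

# Split the combined cell into separate species and strain columns.
tidy = []
for row in messy:
    species, _, strain = row["organism"].rpartition(" ")
    tidy.append({"sample": row["sample"], "species": species,
                 "strain": strain, "DNA-concentration": row["DNA-concentration"]})

# Export to CSV so any program (or data repository) can read it.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["sample", "species", "strain", "DNA-concentration"])
writer.writeheader()
writer.writerows(tidy)
print(buffer.getvalue())
```

Once the strain lives in its own column, sorting or filtering by strain becomes a one-step operation in any spreadsheet or scripting tool.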
-[](https://github.com/datacarpentry/organization-genomics/raw/gh-pages/files/Ecoli_metadata_composite_messy.xlsx)
+[{alt='Messy spreadsheet'}](files/Ecoli_metadata_composite_messy.xlsx)
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Discussion
+
+This is some potential spreadsheet data generated about a sequencing experiment. With the person next to you, for about 2 minutes, discuss some of the problems with the spreadsheet data shown above. You can look at the image, or download the file to your computer via this [link](files/Ecoli_metadata_composite_messy.xlsx) and open it in a spreadsheet reader like Excel.
+
+::::::::::::::: solution
+
+## Solution
-> ## Discussion
-> This is some potential spreadsheet data generated about a sequencing experiment. With the person next to you, for about 2 minutes, discuss some of the problems with the spreadsheet data shown above. You can look at the image, or download the file to your computer via this [link](https://github.com/datacarpentry/organization-genomics/raw/gh-pages/files/Ecoli_metadata_composite_messy.xlsx) and open it in a spreadsheet reader like Excel.
->
->
-> > ## Solution
-> > A full set of types of issues with spreadsheet data is at the [Data Carpentry Ecology spreadsheet lesson](http://www.datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes/). Not all are present in this example. Discuss with the group what they found. Some problems include not all data sets having the same columns, datasets split into their own tables, color to encode information, different column names, spaces in some columns names. Here is a "clean" version of the same spreadsheet:
-> >
-> >[Cleaned spreadsheet](https://raw.githubusercontent.com/datacarpentry/wrangling-genomics/gh-pages/files/Ecoli_metadata_composite.tsv)
-> >Download the file using right-click (PC)/command-click (Mac).
-> {: .solution}
-{: .challenge}
+A full list of the types of issues found in spreadsheet data is at the [Data Carpentry Ecology spreadsheet lesson](https://www.datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes/). Not all of them are present in this example. Discuss with the group what they found. Some problems include: not all datasets have the same columns, datasets are split into separate tables, color is used to encode information, column names differ, and some column names contain spaces. Here is a "clean" version of the same spreadsheet:
+
+[Cleaned spreadsheet](https://raw.githubusercontent.com/datacarpentry/wrangling-genomics/gh-pages/files/Ecoli_metadata_composite.tsv)
+Download the file using right-click (PC)/command-click (Mac).
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
### Further notes on data tidiness
@@ -107,6 +138,15 @@ Organizing your data properly at this point of your experiment will help your an
Fear not! If you have already started your project and it's not set up this way, there are still opportunities to make updates. One of the biggest challenges is tabular data that is not formatted so computers can use it, or has inconsistencies that make it hard to analyze.
-More practice on how to structure data is outlined in our [Data Carpentry Ecology spreadsheet lesson](http://www.datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes/)
+More practice on how to structure data is outlined in our [Data Carpentry Ecology spreadsheet lesson](https://www.datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes/)
+
+Tools like [OpenRefine](https://www.datacarpentry.org/OpenRefine-ecology-lesson/) can help you clean your data.
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Metadata is key for you and others to be able to work with your data.
+- Tabular data needs to be structured to be able to work with it effectively.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
-Tools like [OpenRefine](http://www.datacarpentry.org/OpenRefine-ecology-lesson/) can help you clean your data.
diff --git a/episodes/02-project-planning.md b/episodes/02-project-planning.md
index 0dd5ec09..5dc1e969 100644
--- a/episodes/02-project-planning.md
+++ b/episodes/02-project-planning.md
@@ -1,18 +1,23 @@
---
-title: "Planning for NGS Projects"
+title: Planning for NGS Projects
teaching: 20
exercises: 10
-questions:
-- "How do I plan and organize a genome sequencing project?"
-- "What information does a sequencing facility need?"
-- "What are the guidelines for data storage?"
-objectives:
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
- Understand the data we send to and get back from a sequencing center.
- Make decisions about how (if) data will be stored, archived, shared, etc.
-keypoints:
-- "Data being sent to a sequencing center also needs to be structured so you can use it."
-- "Raw sequencing data should be kept raw somewhere, so you can always go back to the original files."
----
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How do I plan and organize a genome sequencing project?
+- What information does a sequencing facility need?
+- What are the guidelines for data storage?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
There are a variety of ways to work with a large sequencing dataset. You may be a novice who has not used
bioinformatics tools beyond doing BLAST searches. You may have bioinformatics experience with other types of data
@@ -20,59 +25,70 @@ and are working with high-throughput (NGS) sequence data for the first time. In
methods and approaches we need in bioinformatics are the same ones we need at the bench or in the field -
*planning, documenting, and organizing* are the key to good reproducible science.
-> ## Discussion
->
-> Before we go any further, here are some important questions to consider. If you are learning at a workshop,
-> please discuss these questions with your neighbor.
->
->
-> **Working with sequence data**
->
-> What challenges do you think you'll face (or have already faced) in working with a large sequence dataset?
-> What is your strategy for saving and sharing your sequence files?
-> How can you be sure that your raw data have not been unintentionally corrupted?
-> Where/how will you (did you) analyze your data - what software, what computer(s)?
-{: .challenge}
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Discussion
+
+Before we go any further, here are some important questions to consider. If you are learning at a workshop,
+please discuss these questions with your neighbor.
+
+**Working with sequence data**
+
+- What challenges do you think you'll face (or have already faced) in working with a large sequence dataset?
+- What is your strategy for saving and sharing your sequence files?
+- How can you be sure that your raw data have not been unintentionally corrupted?
+- Where/how will you (did you) analyze your data - what software, what computer(s)?
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
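One common answer to the corruption question above is to record a checksum for each raw file when it arrives and re-compute it later: if the two values match, the bytes are unchanged. A minimal sketch with Python's standard `hashlib` (the file name below is hypothetical; in practice you would hash your real FASTQ files):

```python
import hashlib
from pathlib import Path

def sha256sum(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large FASTQ files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Record the checksum right after download, then re-run later to verify.
demo = Path("demo_reads.fastq")  # hypothetical file name
demo.write_bytes(b"@read1\nACGT\n+\n!!!!\n")
recorded = sha256sum(demo)
assert sha256sum(demo) == recorded  # an unchanged file gives the same checksum
demo.unlink()
```

Sequencing facilities often supply an `md5` or `sha256` manifest alongside the data; comparing your computed values against that manifest confirms the transfer was complete and intact.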
## Sending samples to the facility
The first step in sending your sample for sequencing will be to complete a form documenting the metadata for the
facility. Take a look at the following example submission spreadsheet.
-[Sample submission sheet](../files/sample_submission.txt)
+[Sample submission sheet](files/sample_submission.txt)
Download the file using right-click (PC)/command-click (Mac). This is a tab-delimited text file. Try opening it
with Excel or another spreadsheet program.
-> ## Exercise
->
-> 1. What are some errors you can spot in the data? Typos, missing data, inconsistencies?
-> 2. What improvements could be made to the choices in naming?
-> 3. What are some errors in the spreadsheet that would be difficult to spot? Is there any way you can test this?
->
-> > ## Solution
-> > Errors:
-> > - Sequential order of well_position changes
-> > - Format of client_sample_id changes and cannot have spaces, slashes, non-standard ASCII characters
-> > - Capitalization of the replicate column changes
-> > - Volume and concentration column headers have unusual (not allowed) characters
-> > - Volume, concentration, and RIN column decimal accuracy changes
-> > - The prep_date and ship_date formats are different, and prep_date has multiple formats
-> > - Are there others not mentioned?
-> >
-> > Improvements in naming
-> > - Shorten client_sample_id names, and maybe just call them "names"
-> > - For example: "wt" for "wild-type". Also, they are all "1hr", so that is superfluous information
-> > - The prep_date and ship_date might not be needed
-> > - Use "microliters" for "Volume (µL)" etc.
-> >
-> > Errors hard to spot:
-> > - No space between "wild" and "type", repeated barcode numbers, missing data, duplicate names
-> > - Find by sorting, or counting
-> >
-> {: .solution}
-{: .challenge}
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise
+
+1. What are some errors you can spot in the data? Typos, missing data, inconsistencies?
+2. What improvements could be made to the choices in naming?
+3. What are some errors in the spreadsheet that would be difficult to spot? Is there any way you can test this?
+
+::::::::::::::: solution
+
+## Solution
+
+Errors:
+
+- Sequential order of well\_position changes
+- Format of client\_sample\_id changes and cannot have spaces, slashes, non-standard ASCII characters
+- Capitalization of the replicate column changes
+- Volume and concentration column headers have unusual (not allowed) characters
+- Volume, concentration, and RIN column decimal accuracy changes
+- The prep\_date and ship\_date formats are different, and prep\_date has multiple formats
+- Are there others not mentioned?
+
+Improvements in naming:
+
+- Shorten client\_sample\_id names, and maybe just call them "names"
+ - For example: "wt" for "wild-type". Also, they are all "1hr", so that is superfluous information
+- The prep\_date and ship\_date might not be needed
+- Use "microliters" for "Volume (µL)" etc.
+
+Errors hard to spot:
+
+- No space between "wild" and "type", repeated barcode numbers, missing data, duplicate names
+- Find by sorting, or counting
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
## Retrieving sample sequencing data from the facility
@@ -80,29 +96,34 @@ When the data come back from the sequencing facility, you will receive some docu
the sequence files themselves. Download and examine the following example file - here provided as a text file and
Excel file:
-- [Sequencing results - text](../files/sequencing_results_metadata.txt)
-- [Sequencing results - Excel](../files/sequencing_results_metadata.xls)
-
-> ## Exercise
->
-> 1. How are these samples organized?
-> 2. If you wanted to associate the sequence file names with their corresponding sample names from the submission sheet, could you do so? How?
-> 3. What do the \_R1/\_R2 extensions mean in the file names?
-> 4. What does the '.gz' extension on the filenames indicate?
-> 5. What is the total file size - what challenges in downloading and sharing these data might exist?
->
-> > ## Solution
-> >
-> > 1. Samples are organized by sample\_id
-> > 2. To relate filenames use the sample\_id, and do a VLOOKUP on submission sheet
-> > 3. The \_R1/\_R2 extensions mean "read 1" and "read 2" of each sample. These
-> > typically refer to forward and reverse reads of the same DNA fragment from
-> > the sequencer, i.e. during paired-end sequencing.
-> > 4. The '.gz' extension means it is a compressed "gzip" type format to save disk space
-> > 5. The size of all the files combined is 1113.60 Gb (over a terabyte!). To transfer files this large you should validate the file size following transfer. Absolute file integrity checks following transfers and methods for faster file transfers are possible but beyond the scope of this lesson.
-> >
-> {: .solution}
-{: .challenge}
+- [Sequencing results - text](files/sequencing_results_metadata.txt)
+- [Sequencing results - Excel](files/sequencing_results_metadata.xls)
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise
+
+1. How are these samples organized?
+2. If you wanted to associate the sequence file names with their corresponding sample names from the submission sheet, could you do so? How?
+3. What do the \_R1/\_R2 extensions mean in the file names?
+4. What does the '.gz' extension on the filenames indicate?
+5. What is the total file size - what challenges in downloading and sharing these data might exist?
+
+::::::::::::::: solution
+
+## Solution
+
+1. Samples are organized by sample\_id
+2. To relate filenames, use the sample\_id and do a VLOOKUP against the submission sheet
+3. The \_R1/\_R2 extensions mean "read 1" and "read 2" of each sample. These
+ typically refer to forward and reverse reads of the same DNA fragment from
+ the sequencer, i.e. during paired-end sequencing.
+4. The '.gz' extension means the file is compressed in "gzip" format to save disk space
+5. The size of all the files combined is 1113.60 GB (over a terabyte!). To transfer files this large, you should validate the file size following transfer. Absolute file integrity checks following transfers and methods for faster file transfers are possible but beyond the scope of this lesson.
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
## Storing data
@@ -120,7 +141,7 @@ If you have a local high performance computing center or data storage facility o
If you do not have access to these resources, you can back up on hard drives. Have two backups, and keep the hard drives in different physical locations.
-You can also use resources like [Amazon S3](https://aws.amazon.com/s3/), [Microsoft Azure](https://azure.microsoft.com/en-us/pricing/details/storage/blobs/), [Google Cloud](https://cloud.google.com/storage/) or others for cloud storage. The [open science framework](https://osf.io) is a free option for storing files up to 5 GB. See more in the lesson ["Introduction to Cloud Computing for Genomics"](http://www.datacarpentry.org/cloud-genomics/04-which-cloud/).
+You can also use resources like [Amazon S3](https://aws.amazon.com/s3/), [Microsoft Azure](https://azure.microsoft.com/en-us/pricing/details/storage/blobs/), [Google Cloud](https://cloud.google.com/storage/) or others for cloud storage. The [open science framework](https://osf.io) is a free option for storing files up to 5 GB. See more in the lesson ["Introduction to Cloud Computing for Genomics"](https://www.datacarpentry.org/cloud-genomics/04-which-cloud/).
## Summary
@@ -132,10 +153,21 @@ you can accomplish routine tasks, under normal conditions, in an acceptable amou
be able to get to a solution on instinct alone - taking your time, using Google or another Internet search engine,
and asking for help are all valid ways of solving your problems. As you complete the lessons you'll be able to use all of those methods more efficiently.
-> ## Where to go from here?
->
-> More reading about core competencies
->
->L. Welch, F. Lewitter, R. Schwartz, C. Brooksbank, P. Radivojac, B. Gaeta and M. Schneider, '[Bioinformatics Curriculum Guidelines: Toward a Definition of Core Competencies](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3945096/)', PLoS Comput Biol, vol. 10, no. 3, p. e1003496, 2014.
->
-{: .callout}
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Where to go from here?
+
+More reading about core competencies:
+
+L. Welch, F. Lewitter, R. Schwartz, C. Brooksbank, P. Radivojac, B. Gaeta and M. Schneider, '[Bioinformatics Curriculum Guidelines: Toward a Definition of Core Competencies](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3945096/)', PLoS Comput Biol, vol. 10, no. 3, p. e1003496, 2014.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Data being sent to a sequencing center also needs to be structured so you can use it.
+- Raw sequencing data should be kept raw somewhere, so you can always go back to the original files.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/episodes/03-ncbi-sra.md b/episodes/03-ncbi-sra.md
index 26e04075..6907d2a8 100644
--- a/episodes/03-ncbi-sra.md
+++ b/episodes/03-ncbi-sra.md
@@ -1,17 +1,22 @@
---
-title: "Examining Data on the NCBI SRA Database"
+title: Examining Data on the NCBI SRA Database
teaching: 20
exercises: 10
-questions:
-- "How do I access public sequencing data?"
-objectives:
-- "Be aware that public genomic data is available."
-- "Understand how to access and download this data."
-keypoints:
-- "Public data repositories are a great source of genomic data."
-- "You are likely to put your own data on a public repository."
---
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Be aware that public genomic data is available.
+- Understand how to access and download this data.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How do I access public sequencing data?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
In our experiments we usually think about generating our own sequencing data. However, almost all analyses use reference data, and you may want to use it to compare your results or annotate your data with publicly available data. You may also want to do a full project or set of analyses using publicly available data. This data is a great, and essential, resource for genomic data analysis.
When you come to publish a paper including your sequencing data, most journals and funders require that you place your data on a public repository. Sharing your data makes it more likely that your work will be re-used and cited. It helps to prepare for this early!
@@ -20,13 +25,13 @@ There are many repositories for public data. Some model organisms or fields have
## Accessing the original archived data
-The [sequencing dataset (from Tenaillon, *et al.* 2016) adapted for this lesson](http://www.datacarpentry.org/organization-genomics/data/) was obtained from the [NCBI Sequence Read Archive](http://www.ncbi.nlm.nih.gov/sra), which is a large (~27 petabasepairs/2.7 x 10^16 basepairs as of April 2019) repository for next-generation sequence data. Like many NCBI databases, it is complex and mastering its use is greater than the scope of this lesson. Very often there will be a direct link (perhaps in the supplemental information) to where the SRA dataset can be found. We are only using a small part of these data, so a direct link cannot be found. If you have time, go through the following detailed description of finding the data we are using today (otherwise skip to the next section).
+The [sequencing dataset (from Tenaillon, *et al.* 2016) adapted for this lesson](https://www.datacarpentry.org/organization-genomics/data/) was obtained from the [NCBI Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra), which is a large (~27 petabasepairs/2.7 x 10^16 basepairs as of April 2019) repository for next-generation sequence data. Like many NCBI databases, it is complex, and mastering its use is beyond the scope of this lesson. Very often there will be a direct link (perhaps in the supplemental information) to where the SRA dataset can be found. We are only using a small part of these data, so a direct link is not available. If you have time, go through the following detailed description of finding the data we are using today (otherwise skip to the next section).
### Locate the Run Selector Table for the Lenski Dataset on the SRA
See the figures below for how information about data access is provided within the original paper.
-
+
The **above image** shows the title of the study, as well as the authors.
@@ -37,78 +42,85 @@ The excerpt from the paper below includes information on how to locate the seque
> analysis pipeline is available at GitHub ([http://github.com/barricklab/breseq](https://github.com/barricklab/breseq/)).
> Other analysis scripts are available at the Dryad Digital Repository ([http://dx.doi.org/10.5061/dryad.6226d](https://doi.org/10.5061/dryad.6226d)). R.E.L. will make strains available to qualified
> recipients, subject to a material transfer agreement. Reprints and permissions
-> information is available at www.nature.com/reprints. The authors declare no
+> information is available at [www.nature.com/reprints](https://www.nature.com/reprints). The authors declare no
> competing financial interests. Readers are welcome to comment on the online
> version of the paper. Correspondence and requests for materials should be
> addressed to R.E.L. (lenski *at* msu.edu)
-**At the beginning of this workshop we gave you [experimental information about these data](http://www.datacarpentry.org/organization-genomics/data/). This lesson uses a *subset* of SRA files, from a small *subproject* of the BioProject database
+**At the beginning of this workshop we gave you [experimental information about these data](https://www.datacarpentry.org/organization-genomics/data/). This lesson uses a *subset* of SRA files, from a small *subproject* of the BioProject database
"PRJNA294072". To find these data you can follow the instructions below:**
1. Notice that the paper references "PRJNA294072" as a "BioProject" at NCBI. If you go to the [NCBI website](https://www.ncbi.nlm.nih.gov/) and search for "PRJNA294072" you will be shown a link to the "Long-Term Evolution Experiment with E. coli" BioProject. Here is the link to that database: [https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA294072](https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA294072).
2. Once on the BioProject page, scroll down to the table under **"This project encompasses the
-following 15 sub-projects:"**.
+ following 15 sub-projects:"**.
3. In this table, select **subproject**
-*"[PRJNA295606](https://www.ncbi.nlm.nih.gov/bioproject/295606) SRA or Trace Escherichia coli B str. REL606 E. coli genome evolution over 50,000 generations (The University of Texas at...)"*.
+ *"[PRJNA295606](https://www.ncbi.nlm.nih.gov/bioproject/295606) SRA or Trace Escherichia coli B str. REL606 E. coli genome evolution over 50,000 generations (The University of Texas at...)"*.
4. This will take you to a page with the subproject description, and a table **"Project Data"**
-that has a link to the 224 SRA files for this subproject.
+ that has a link to the 224 SRA files for this subproject.
5. Click on the number
-["224"](https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=295606) next to "SRA Experiments" and it will take you to the SRA page for this subproject.
-
+ ["224"](https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=295606) next to "SRA Experiments" and it will take you to the SRA page for this subproject.
+ {alt='03\_send\_results.png'}
6. For a more organized table, select "Send results to Run selector". This
-takes you to the Run Selector page for BioProject PRJNA295606 (the BioProject number for the experiment SRP064605) that is used in the next section.
+ takes you to the Run Selector page for BioProject PRJNA295606 (the BioProject number for the experiment SRP064605) that is used in the next section.
### Download the Lenski SRA data from the SRA Run Selector Table
-1. Make sure you access the Tenaillon dataset from the provided link: [https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP064605](https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP064605). This is NCBI’s cloud-based SRA interface. You will be presented with a page for the overall SRA accession SRP064605 - this is a collection of all the experimental data.
+1. Make sure you access the Tenaillon dataset from the provided link: [https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP064605](https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP064605). This is NCBI's cloud-based SRA interface. You will be presented with a page for the overall SRA accession SRP064605 - this is a collection of all the experimental data.
-2. Notice on this page there are three sections. “Common Fields” “Select”, and “Found 312 Items”. Within “Found 312 Items”, click on the first Run Number (Column “Run” Row “1”).
-
+2. Notice on this page there are three sections: "Common Fields", "Select", and "Found 312 Items". Within "Found 312 Items", click on the first Run Number (Column "Run", Row "1").
+ {alt='ncbi-new-tables2.png'}
3. This will take you to a page that is a run browser. Take a few minutes to examine some of the descriptions on the page.
-
+ {alt='ncbi-run-browser.png'}
-4. Use the browser’s back button to go back to the 'previous page'. As shown in the figure below, the second section of the page (“Select”) has the **Total** row showing you the current number of “Runs”, “Bytes”, and “Bases” in the dataset to date. On 2022-12-06 there were 312 runs, 109.58 Gb data, and 177.17 Gbases of data.
-
+4. Use the browser's back button to go back to the 'previous page'. As shown in the figure below, the second section of the page ("Select") has the **Total** row showing you the current number of "Runs", "Bytes", and "Bases" in the dataset to date. On 2022-12-06 there were 312 runs, 109.58 Gb data, and 177.17 Gbases of data.
+ {alt='ncbi-new-metadata.png'}
-5. Click on the “Metadata” button to download the data for this lesson. The filename is “SraRunTable.txt” and save it on your computer Desktop. This text-based file is actually a "comma-delimited" file, so you should rename the file to "SraRunTable.csv" for your spreadsheet software to open it correctly.
+5. Click on the "Metadata" button to download the data for this lesson. The file is named "SraRunTable.txt"; save it to your computer's Desktop. This text-based file is actually in a "comma-delimited" format, so you should rename it to "SraRunTable.csv" for your spreadsheet software to open it correctly.
**You should now have a file called `SraRunTable.csv`** on your desktop.
-> Now you know that comma-separated (and tab-separated) files can be designated as "text" (`.txt`)
-> files but use either commas (or tabs) as **delimiters**, respectively. Sometimes you
-> might need to use a text-editor (*e.g.* Notepad) to determine if a file suffixed with `.txt` is
-> actually comma-delimited or tab-delimited.
+> Now you know that comma-separated (and tab-separated) files can be designated as "text" (`.txt`)
+> files but use either commas (or tabs) as **delimiters**, respectively. Sometimes you
+> might need to use a text-editor (*e.g.* Notepad) to determine if a file suffixed with `.txt` is
+> actually comma-delimited or tab-delimited.
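If a text editor is not handy, counting delimiters on the first line also works. A small sketch with a made-up file name:

```shell
# Stand-in for a ".txt" file whose delimiter we want to identify
printf 'Run,LibraryLayout,Bases\nSRR0001,SINGLE,1000\n' > SraRunTable_demo.txt

# Count commas and tabs on the header line (tr -cd deletes everything else)
commas=$(head -n 1 SraRunTable_demo.txt | tr -cd ',' | wc -c)
tabs=$(head -n 1 SraRunTable_demo.txt | tr -cd '\t' | wc -c)
echo "commas: $commas tabs: $tabs"

if [ "$commas" -gt "$tabs" ]; then
  echo "looks comma-delimited: rename to .csv before opening in a spreadsheet"
fi
```

This heuristic assumes the header line is representative of the whole file, which is true for well-formed delimited tables.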
### Review the SraRunTable metadata in a spreadsheet program
-Using your choice of spreadsheet program, open the `SraRunTable.csv` file.
-
-> ## Discussion
-> Discuss with the person next to you:
->
-> 1. What strain of *E. coli* was used in this experiment?
-> 2. What was the sequencing platform used for this experiment?
-> 3. What samples in the experiment contain
-> [paired end](http://www.illumina.com/technology/next-generation-sequencing/paired-end-sequencing_assay.html)
-> sequencing data?
-> 4. What other kind of data is available?
-> 5. Why are you collecting this kind of information about your sequencing runs?
->
-> > ## Solution
-> > 1. Escherichia coli B str. REL606 shown under the "organism" column. This is a tricky question because the column labeled "strain" actually has sample names
-> > 2. The Illumina sequencing platform was used shown in the column "Platform". But notice they used multiple instrument types listed under "Instrument"
-> > 3. Sort by LibraryLayout and the column "DATASTORE_filetype" shows that "realign,sra,wgmlst_sig" were used for paired-end data, while "fastq,sra" were used for all single-end reads. (Also notice the Illumina Genome Analyzer IIx was never used for paired-end sequencing)
-> > 4. There are several columns including: megabases of sequence per sample, Assay type, BioSample Model, and more.
-> > 5. These are examples of "metadata" that you should collect for sequencing projects that are sent to public databases.
-> >
-> {: .solution}
-{: .challenge}
+Using your choice of spreadsheet program, open the `SraRunTable.csv` file.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Discussion
+
+Discuss with the person next to you:
+
+1. What strain of *E. coli* was used in this experiment?
+2. What was the sequencing platform used for this experiment?
+3. What samples in the experiment contain
+ [paired end](https://www.illumina.com/technology/next-generation-sequencing/paired-end-sequencing_assay.html)
+ sequencing data?
+4. What other kind of data is available?
+5. Why are you collecting this kind of information about your sequencing runs?
+
+::::::::::::::: solution
+
+## Solution
+
+1. *Escherichia coli* B str. REL606, shown under the "organism" column. This is a tricky question because the column labeled "strain" actually contains sample names
+2. The Illumina sequencing platform was used, as shown in the "Platform" column. But notice they used multiple instrument types, listed under "Instrument"
+3. Sort by LibraryLayout and the column "DATASTORE\_filetype" shows that "realign,sra,wgmlst\_sig" were used for paired-end data, while "fastq,sra" were used for all single-end reads. (Also notice the Illumina Genome Analyzer IIx was never used for paired-end sequencing)
+4. There are several columns including: megabases of sequence per sample, Assay type, BioSample Model, and more.
+5. These are examples of "metadata" that you should collect for sequencing projects that are sent to public databases.
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+After answering the questions, avoid saving any changes you might have made to the metadata file - we do not want to modify it. If you do save the file, make sure you keep it in the text-based `.csv` format.
@@ -128,8 +140,8 @@ We do not recommend downloading large numbers of sequencing files this way. For
### About the Sequence Read Archive
-* You can learn more about the SRA by reading the [SRA Documentation](http://www.ncbi.nlm.nih.gov/Traces/sra/)
-* The best way to transfer a large SRA dataset is by using the [SRA Toolkit](http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc)
+- You can learn more about the SRA by reading the [SRA Documentation](https://www.ncbi.nlm.nih.gov/Traces/sra/)
+- The best way to transfer a large SRA dataset is by using the [SRA Toolkit](https://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc)
## References
@@ -138,3 +150,12 @@ Tempo and mode of genome evolution in a 50,000-generation experiment (2016) Natu
[Paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/), [Supplemental materials](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/#)
Data on NCBI SRA: [https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP064605](https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP064605)
Data on EMBL-EBI ENA: [https://www.ebi.ac.uk/ena/data/view/PRJNA295606](https://www.ebi.ac.uk/ena/data/view/PRJNA295606)
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Public data repositories are a great source of genomic data.
+- You are likely to put your own data on a public repository.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/fig/01_tidiness_datasheet_example_clean.png b/episodes/fig/01_tidiness_datasheet_example_clean.png
similarity index 100%
rename from fig/01_tidiness_datasheet_example_clean.png
rename to episodes/fig/01_tidiness_datasheet_example_clean.png
diff --git a/fig/01_tidiness_datasheet_example_messy.png b/episodes/fig/01_tidiness_datasheet_example_messy.png
similarity index 100%
rename from fig/01_tidiness_datasheet_example_messy.png
rename to episodes/fig/01_tidiness_datasheet_example_messy.png
diff --git a/fig/03_acc_info.png b/episodes/fig/03_acc_info.png
similarity index 100%
rename from fig/03_acc_info.png
rename to episodes/fig/03_acc_info.png
diff --git a/fig/03_ncbi_new_metadata.png b/episodes/fig/03_ncbi_new_metadata.png
similarity index 100%
rename from fig/03_ncbi_new_metadata.png
rename to episodes/fig/03_ncbi_new_metadata.png
diff --git a/fig/03_ncbi_new_run_browser.png b/episodes/fig/03_ncbi_new_run_browser.png
similarity index 100%
rename from fig/03_ncbi_new_run_browser.png
rename to episodes/fig/03_ncbi_new_run_browser.png
diff --git a/fig/03_ncbi_new_tables2.png b/episodes/fig/03_ncbi_new_tables2.png
similarity index 100%
rename from fig/03_ncbi_new_tables2.png
rename to episodes/fig/03_ncbi_new_tables2.png
diff --git a/fig/03_ncbi_new_top.png b/episodes/fig/03_ncbi_new_top.png
similarity index 100%
rename from fig/03_ncbi_new_top.png
rename to episodes/fig/03_ncbi_new_top.png
diff --git a/fig/03_ncbi_new_top2.png b/episodes/fig/03_ncbi_new_top2.png
similarity index 100%
rename from fig/03_ncbi_new_top2.png
rename to episodes/fig/03_ncbi_new_top2.png
diff --git a/fig/03_ncbi_old_run_selector.png b/episodes/fig/03_ncbi_old_run_selector.png
similarity index 100%
rename from fig/03_ncbi_old_run_selector.png
rename to episodes/fig/03_ncbi_old_run_selector.png
diff --git a/fig/03_ncbi_old_runtable_button.png b/episodes/fig/03_ncbi_old_runtable_button.png
similarity index 100%
rename from fig/03_ncbi_old_runtable_button.png
rename to episodes/fig/03_ncbi_old_runtable_button.png
diff --git a/fig/03_ncbi_run_browser.png b/episodes/fig/03_ncbi_run_browser.png
similarity index 100%
rename from fig/03_ncbi_run_browser.png
rename to episodes/fig/03_ncbi_run_browser.png
diff --git a/fig/03_ncbi_send_results.png b/episodes/fig/03_ncbi_send_results.png
similarity index 100%
rename from fig/03_ncbi_send_results.png
rename to episodes/fig/03_ncbi_send_results.png
diff --git a/fig/03_paper_header.png b/episodes/fig/03_paper_header.png
similarity index 100%
rename from fig/03_paper_header.png
rename to episodes/fig/03_paper_header.png
diff --git a/fig/2_datasheet_example.jpg b/episodes/fig/2_datasheet_example.jpg
similarity index 100%
rename from fig/2_datasheet_example.jpg
rename to episodes/fig/2_datasheet_example.jpg
diff --git a/files/Ecoli_metadata_composite_messy.pdf b/episodes/files/Ecoli_metadata_composite_messy.pdf
similarity index 100%
rename from files/Ecoli_metadata_composite_messy.pdf
rename to episodes/files/Ecoli_metadata_composite_messy.pdf
diff --git a/files/Ecoli_metadata_composite_messy.xlsx b/episodes/files/Ecoli_metadata_composite_messy.xlsx
similarity index 100%
rename from files/Ecoli_metadata_composite_messy.xlsx
rename to episodes/files/Ecoli_metadata_composite_messy.xlsx
diff --git a/files/SampleSheet_Example_clean.csv b/episodes/files/SampleSheet_Example_clean.csv
similarity index 100%
rename from files/SampleSheet_Example_clean.csv
rename to episodes/files/SampleSheet_Example_clean.csv
diff --git a/files/SampleSheet_Example_messy.csv b/episodes/files/SampleSheet_Example_messy.csv
similarity index 100%
rename from files/SampleSheet_Example_messy.csv
rename to episodes/files/SampleSheet_Example_messy.csv
diff --git a/files/sample_submission.txt b/episodes/files/sample_submission.txt
similarity index 100%
rename from files/sample_submission.txt
rename to episodes/files/sample_submission.txt
diff --git a/files/sequencing_results_metadata.txt b/episodes/files/sequencing_results_metadata.txt
similarity index 100%
rename from files/sequencing_results_metadata.txt
rename to episodes/files/sequencing_results_metadata.txt
diff --git a/files/sequencing_results_metadata.xls b/episodes/files/sequencing_results_metadata.xls
similarity index 100%
rename from files/sequencing_results_metadata.xls
rename to episodes/files/sequencing_results_metadata.xls
diff --git a/index.md b/index.md
index ab27ab33..261a6216 100644
--- a/index.md
+++ b/index.md
@@ -1,13 +1,12 @@
---
-layout: lesson
-root: .
+site: sandpaper::sandpaper_site
---
Good data organization is the foundation of any research project. It not only sets you up well for an analysis, but it also makes it easier to come back to the project later and share with collaborators, including your most important collaborator - future you.
Organizing a project that includes sequencing involves many components. There's the experimental setup and conditions metadata, measurements of experimental parameters, sequencing preparation and sample information, the sequences themselves and the files and workflow of any bioinformatics analysis. So much of the information of a sequencing project is digital, and we need to keep track of our digital records in the same way we have a lab notebook and sample freezer. In this lesson, we'll go through the project organization and documentation that will make an efficient bioinformatics workflow possible. Not only will this make you a more effective bioinformatics researcher, it also prepares your data and project for publication, as grant agencies and publishers increasingly require this information.
-In this lesson, we'll be using data from a study of experimental evolution using *E. coli*. [More information about this dataset is available here](http://www.datacarpentry.org/organization-genomics/data/). In this study there are several types of files:
+In this lesson, we'll be using data from a study of experimental evolution using *E. coli*. [More information about this dataset is available here](https://www.datacarpentry.org/organization-genomics/data/). In this study there are several types of files:
- spreadsheet data from the experiment that tracks the strains and their phenotype over time
- spreadsheet data with information on the samples that were sequenced - the names of the samples, how they were prepared and the sequencing conditions
@@ -22,20 +21,31 @@ In this lesson you will learn:
- How to access and download publicly available data that may need to be used in your bioinformatics analysis
- The concepts of organizing the files and documenting the workflow of your bioinformatics analysis
-> ## Getting Started
->
-> This lesson assumes no prior experience with the tools covered in the workshop.
-> However, learners are expected to have some familiarity with biological concepts,
-> including the
-> concept of genomic variation within a population. Participants should bring their laptops and plan to participate actively.
->
-> This lesson is part of a workshop that uses data hosted on an Amazon Machine Instance (AMI). Workshop participants will be given
-> information on how
-> to log-in to the AMI during the workshop. Learners using these materials for self-directed study will need to set up their own
-> AMI. Information on setting up an AMI and accessing the required data is provided on the [Genomics Workshop setup page](http://www.datacarpentry.org/genomics-workshop/setup.html).
-{: .prereq}
-
-> ## For Instructors
-> If you are teaching this lesson in a workshop, please see the
-> [Instructor notes](guide/).
-{: .prereq}
+:::::::::::::::::::::::::::::::::::::::::: prereq
+
+## Getting Started
+
+This lesson assumes no prior experience with the tools covered in the workshop.
+However, learners are expected to have some familiarity with biological concepts,
+including the
+concept of genomic variation within a population. Participants should bring their laptops and plan to participate actively.
+
+This lesson is part of a workshop that uses data hosted on an Amazon Machine Instance (AMI). Workshop participants will be given
+information on how
+to log in to the AMI during the workshop. Learners using these materials for self-directed study will need to set up their own
+AMI. Information on setting up an AMI and accessing the required data is provided on the [Genomics Workshop setup page](https://www.datacarpentry.org/genomics-workshop/setup.html).
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::::: prereq
+
+## For Instructors
+
+If you are teaching this lesson in a workshop, please see the
+[Instructor notes](instructors/instructor-notes.md).
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/_extras/data.md b/instructors/data.md
similarity index 69%
rename from _extras/data.md
rename to instructors/data.md
index 395ed817..1c6f6826 100644
--- a/_extras/data.md
+++ b/instructors/data.md
@@ -1,39 +1,40 @@
---
-layout: page
title: Data
---
-# Features of the dataset
+# Features of the dataset
This dataset was selected for our exercise on NGS Data Carpentry for several reasons, including:
-* Simple, but iconic NGS-problem: Examine a population where we want to characterize changes in sequence *a priori*
-* Dataset publicly available - in this case through the NCBI Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra)
-* Small file sizes - while several of related files may still be hundreds of MBs, overall we will be able to get through more quickly than if we worked with a larger eukaryotic genome
+- A simple but iconic NGS problem: examining a population where we want to characterize changes in sequence *a priori*
+- Dataset publicly available - in this case through the NCBI Sequence Read Archive ([https://www.ncbi.nlm.nih.gov/sra](https://www.ncbi.nlm.nih.gov/sra))
+- Small file sizes - while several of the related files may still be hundreds of MBs, overall we will be able to get through the material more quickly than if we worked with a larger eukaryotic genome
-# Introduction to the dataset
+# Introduction to the dataset
-Microbes are ideal organisms for exploring 'Long-term Evolution Experiments' (LTEEs) - thousands of generations can be generated and stored in a way that would be virtually impossible for more complex eukaryotic systems. In [Tenaillon et al 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/), 12 populations of *Escherichia coli* were propagated for more than 50,000 generations in a glucose-limited minimal medium. This medium was supplemented with citrate which *E. coli* cannot metabolize in the aerobic conditions of the experiment. Sequencing of the populations at regular time points reveals that spontaneous citrate-using mutants (Cit+) appeared in a population of *E.coli* (designated Ara-3) at around 31,000 generations. It should be noted that spontaneous Cit+ mutants are extraordinarily rare - inability to metabolize citrate is one of the defining characters of the *E. coli* species. Eventually, Cit+ mutants became the dominant population as the experimental growth medium contained a high concentration of citrate relative to glucose. Around the same time that this mutation emerged, another phenotype become prominent in the Ara-3 population. Many *E. coli* began to develop excessive numbers of mutations, meaning they became hypermutable.
+Microbes are ideal organisms for exploring 'Long-term Evolution Experiments' (LTEEs) - thousands of generations can be generated and stored in a way that would be virtually impossible for more complex eukaryotic systems. In [Tenaillon et al 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/), 12 populations of *Escherichia coli* were propagated for more than 50,000 generations in a glucose-limited minimal medium. This medium was supplemented with citrate, which *E. coli* cannot metabolize in the aerobic conditions of the experiment. Sequencing of the populations at regular time points revealed that spontaneous citrate-using mutants (Cit+) appeared in a population of *E. coli* (designated Ara-3) at around 31,000 generations. It should be noted that spontaneous Cit+ mutants are extraordinarily rare - inability to metabolize citrate is one of the defining characters of the *E. coli* species. Eventually, Cit+ mutants became the dominant population as the experimental growth medium contained a high concentration of citrate relative to glucose. Around the same time that this mutation emerged, another phenotype became prominent in the Ara-3 population. Many *E. coli* began to develop excessive numbers of mutations, meaning they became hypermutable.
-Strains from generation 0 to generation 50,000 were sequenced, including ones that were both Cit+ and Cit- and hypermutable in later generations.
+Strains from generation 0 to generation 50,000 were sequenced, including ones that were both Cit+ and Cit- and hypermutable in later generations.
-For the purposes of this workshop we're going to be working with 3 of the sequence reads from this experiment.
+For the purposes of this workshop, we're going to be working with three of the sequencing runs from this experiment.
-| SRA Run Number | Clone | Generation | Cit | Hypermutable | Read Length | Sequencing Depth |
-| -------------- | ----- | ---------- | ---- | ----- |-------|--------|
-| SRR2589044 | REL2181A | 5,000 | Unknown | None | 150 | 60.2 |
-| SRR2584863 | REL7179B | 15,000 | Unknown | None | 150 | 88 |
-| SRR2584866 | REL11365 | 50,000 | Cit+ | plus | 150 | 138.3 |
+| SRA Run Number | Clone | Generation | Cit | Hypermutable | Read Length | Sequencing Depth |
+| -------------- | -------- | ---------- | ------- | ------------ | ----------- | ---------------- |
+| SRR2589044     | REL2181A | 5,000      | Unknown | None         | 150         | 60.2             |
+| SRR2584863     | REL7179B | 15,000     | Unknown | None         | 150         | 88               |
+| SRR2584866     | REL11365 | 50,000     | Cit+    | plus         | 150         | 138.3            |
-We want to be able to look at differences in mutation rates between hypermutable and non-hypermutable strains. We also want to analyze the sequences to figure out what changes occurred in genomes to make the strains Cit+. Ultimately, we will answer the questions:
+We want to be able to look at differences in mutation rates between hypermutable and non-hypermutable strains. We also want to analyze the sequences to figure out what changes occurred in genomes to make the strains Cit+. Ultimately, we will answer the questions:
-- How many base pair changes are there between the Cit+ and Cit- strains?
-- What are the base pair changes between strains?
+- How many base pair changes are there between the Cit+ and Cit- strains?
+- What are the base pair changes between strains?
-## References
+## References
Tenaillon O, Barrick JE, Ribeck N, Deatherage DE, Blanchard JL, Dasgupta A, Wu GC, Wielgoss S, Cruveiller S, Médigue C, Schneider D, Lenski RE.
Tempo and mode of genome evolution in a 50,000-generation experiment (2016) Nature. 536(7615): 165–170.
[Paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/), [Supplemental materials](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/#)
Data on NCBI SRA: [https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP064605](https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP064605)
Data on EMBL-EBI ENA: [https://www.ebi.ac.uk/ena/data/view/PRJNA295606](https://www.ebi.ac.uk/ena/data/view/PRJNA295606)
+
+
diff --git a/_extras/guide.md b/instructors/instructor-notes.md
similarity index 55%
rename from _extras/guide.md
rename to instructors/instructor-notes.md
index be508be7..1ec471b0 100644
--- a/_extras/guide.md
+++ b/instructors/instructor-notes.md
@@ -1,5 +1,4 @@
---
-layout: page
title: Instructor Notes
---
@@ -8,46 +7,46 @@ title: Instructor Notes
## Lesson motivation and learning objectives
The purpose of this lesson is *not* to teach how to do data analysis in spreadsheets,
-but to teach good data organization and how to do some data cleaning and
+but to teach good data organization and how to do some data cleaning and
quality control in a spreadsheet program.
## Lesson design
-#### [Data tidiness](../01-tidiness/)
+#### [Data tidiness](../episodes/01-tidiness.md)
-* Introduce that we're teaching data organization, and that we're using
-spreadsheets, because most people do data entry in spreadsheets or
-have data in spreadsheets.
-* Emphasize that we are teaching good practice in data organization and that
-this is the foundation of their research practice. Without organized and clean
-data, it will be difficult for them to apply the things we're teaching in the
-rest of the workshop to their data.
-* Much of their lives as a researcher will be spent on this 'data wrangling' stage, but
-some of it can be prevented with good strategies for data collection up front.
-* Tell that we're not teaching data analysis or plotting in spreadsheets, because it's
-very manual and also not reproducible. That's why we're teaching bash shell scripting!
-* Now let's talk about spreadsheets, and when we say spreadsheets, we mean any program that
-does spreadsheets like Excel, LibreOffice, OpenOffice. Most learners are probably using Excel.
-* Ask the audience any things they've accidentally done in spreadsheets. Talk about an example of your own, like that you accidentally sorted only a single column and not the rest.
-of the data in the spreadsheet. What are the pain points!?
-* As people answer, highlight some of these issues with spreadsheets.
-* Go through the point about keeping track of your steps and keeping raw data raw.
-* Go through the cardinal rule of spreadsheets about columns, rows and cells.
-* Hand them a messy data file and have them pair up and work together to clean up the data.
+- Introduce that we're teaching data organization, and that we're using
+ spreadsheets, because most people do data entry in spreadsheets or
+ have data in spreadsheets.
+- Emphasize that we are teaching good practice in data organization and that
+ this is the foundation of their research practice. Without organized and clean
+ data, it will be difficult for them to apply the things we're teaching in the
+ rest of the workshop to their data.
+- Much of their lives as a researcher will be spent on this 'data wrangling' stage, but
+ some of it can be prevented with good strategies for data collection up front.
+- Explain that we're not teaching data analysis or plotting in spreadsheets, because it's
+ very manual and also not reproducible. That's why we're teaching bash shell scripting!
+- Now let's talk about spreadsheets, and when we say spreadsheets, we mean any spreadsheet program,
+  such as Excel, LibreOffice, or OpenOffice. Most learners are probably using Excel.
+- Ask the audience about things they've accidentally done in spreadsheets. Talk about an example of your own, like accidentally sorting only a single column and not the rest
+  of the data in the spreadsheet. What are the pain points?
+- As people answer, highlight some of these issues with spreadsheets.
+- Go through the point about keeping track of your steps and keeping raw data raw.
+- Go through the cardinal rule of spreadsheets about columns, rows and cells.
+- Hand them a messy data file and have them pair up and work together to clean up the data.
-#### [Planning for NGS projects](../02-project-planning/)
+#### [Planning for NGS projects](../episodes/02-project-planning.md)
-* This episode depends on learners discussing exercises with one another. Be sure to give plenty of time for this discussion.
+- This episode depends on learners discussing exercises with one another. Be sure to give plenty of time for this discussion.
-#### [Examining Data on the NCBI SRA Database](../03-ncbi-sra/)
+#### [Examining Data on the NCBI SRA Database](../episodes/03-ncbi-sra.md)
-* Learners should *not* actually download the ENA files in the "Downloading a few sequencing files: EMBL-EBI" section.
+- Learners should *not* actually download the ENA files in the "Downloading a few sequencing files: EMBL-EBI" section.
#### Concluding points
-* Now your data is organized so that a computer can read and understand it. This
-lets you use the full power of the computer for your analyses as we'll see in the
-rest of the workshop.
+- Now your data is organized so that a computer can read and understand it. This
+ lets you use the full power of the computer for your analyses as we'll see in the
+ rest of the workshop.
## Working with participants' level of expertise
@@ -55,35 +54,36 @@ Learners may be taking this lesson for many reasons - they may be just thinking
You should feel free to "read the room", and it can be helpful to ask more specifics in a pre-workshop survey.
-#### [Data tidiness](../01-tidiness/)
+#### [Data tidiness](../episodes/01-tidiness.md)
-Discussion 1, "What kinds of data and information have you generated before you sent your DNA/RNA off for sequencing?" can go very differently depending on the participants' background. Many instructors make adjustments to this section, and they should, depending on the learners.
+Discussion 1, "What kinds of data and information have you generated before you sent your DNA/RNA off for sequencing?" can go very differently depending on the participants' background. Many instructors make adjustments to this section, and they should, depending on the learners.
Some instructors have succeeded in adding ice-breaker questions and more on scientific background to discussion 1, such as:
-* What's your name?
-* What kind of sequencing data are you collecting?
-* What question is your experiment answering?
-* What kinds of data and information have you generated before you sent your DNA/RNA off for sequencing?
+- What's your name?
+- What kind of sequencing data are you collecting?
+- What question is your experiment answering?
+- What kinds of data and information have you generated before you sent your DNA/RNA off for sequencing?
This had some positive points:
-* instructors got to see the range of projects being worked on (metagenomics, RNA-seq, DNA-seq, etc).
-* we had a good discussion about linked metadata, e.g. a plant scientist also takes photos of their plants, an ecologist has site sampling data.
-* learners got to share lessons they'd learned.
-* for some learners, it may have been the first time they'd thought about it.
-* it only added 5 minutes.
+- instructors got to see the range of projects being worked on (metagenomics, RNA-seq, DNA-seq, etc).
+- we had a good discussion about linked metadata, e.g. a plant scientist also takes photos of their plants, an ecologist has site sampling data.
+- learners got to share lessons they'd learned.
+- for some learners, it may have been the first time they'd thought about it.
+- it only added 5 minutes.
The drawback:
-* only about 1/2 of learners got to the point of talking about file types of that data.
+
+- only about half of the learners got to the point of talking about the file types of that data.
It could be more efficient to ask these questions in the pre-workshop survey, then present the range of answers during the class. It can also be helpful for instructors and helpers to share what they work on.
## Technical tips and tricks
-Provide information on setting up your environment for learners to view your
-live coding (increasing text size, changing text color, etc), as well as
-general recommendations for working with coding tools to best suit the
+Provide information on setting up your environment for learners to view your
+live coding (increasing text size, changing text color, etc), as well as
+general recommendations for working with coding tools to best suit the
learning environment.
## Common problems
@@ -92,10 +92,10 @@ learning environment.
The main challenge with this lesson is that Excel looks very different and how you
do things is even different between Mac and PC, and between different versions of
-Excel. So, the presenter's environment will only be the same as some of the learners.
+Excel. So, the presenter's environment will only be the same as some of the learners.
We need better notes and screenshots of how things work on both Mac and PC. But we
-likely won't be able to cover all the different versions of Excel.
+likely won't be able to cover all the different versions of Excel.
If you have a helper who has more experience with the other OS than you, it would be good
to prepare them to help with this lesson and tell people how to do things in the other OS.
@@ -105,5 +105,6 @@ to prepare them to help with this lesson and tell people how to do things in the
This lesson depends on people working on the exercise and responding with things
that are fixed. If your audience is reluctant to participate, start out with
some things on your own, or ask a helper for their answers. This generally gets
-even a reluctant audience started.
+even a reluctant audience started.
+
diff --git a/instructors/old-ncbi.md b/instructors/old-ncbi.md
new file mode 100644
index 00000000..06e728f2
--- /dev/null
+++ b/instructors/old-ncbi.md
@@ -0,0 +1,27 @@
+---
+title: Old NCBI
+---
+
+## Original (older) NCBI instructions
+
+These will be phased out of our lesson when NCBI stops supporting
+the old page versions.
+
+1. Access the Tenaillon dataset from the provided link: [https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP064605](https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP064605). Click on "Revert to the old Run Selector" at the top of the page.
+
+2. You will be presented with the old page for the overall SRA accession SRP064605 - this is a collection of all the experimental data.
+ {alt='ncbi-old-run-selector'}
+
+3. In this window, you will click on the Run Number of the first entry in the "Runs Found" table (see red box above). This will take you to a page that is a run browser. Take a few minutes to examine some of the descriptions on the page.
+ {alt='ncbi-run-browser.png'}
+
+4. Use your browser's "Back" button or arrow to go back to the ['previous page'](https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP064605). Above where it lists the "312 Runs found" is a line starting with **Total** and you will see there are 312 runs, 109.43 Gb data, and 168.81 Gbases of data. Click the 'RunInfo Table' button and save the file to your Desktop.
+ {alt='ncbi-old-runtable-button.png'}
+ We are not downloading any actual sequence data here! This is only a text file that fully describes the entire
+ dataset.
+
+You should now have a **tab-delimited** file called `SraRunTable.txt`.
+
+**Return to lesson [Examining Data on the NCBI SRA Database](../episodes/03-ncbi-sra.md#you-should-now-have-a-file-called-sraruntabletxt) and continue.**
+
+
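The `SraRunTable.txt` file produced by the steps above is tab-delimited, so it can also be inspected programmatically rather than only in a spreadsheet. A minimal sketch using Python's standard `csv` module (the column names and values below are an illustrative stand-in; check the real file's header row, which may differ):

```python
import csv
import io

# A tiny stand-in for the tab-delimited SraRunTable.txt described above.
# Column names and values here are illustrative; the real table's headers may differ.
sample = (
    "Run\tLibraryLayout\tbases\n"
    "SRR2589044\tPAIRED\t543928\n"
    "SRR2584863\tPAIRED\t812331\n"
)

# DictReader with a tab delimiter maps each row to {header: value}
reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
rows = list(reader)
run_ids = [row["Run"] for row in rows]
print(run_ids)  # accession IDs from the first column
```

To read the downloaded file itself, replace the `io.StringIO(sample)` stand-in with `open("SraRunTable.txt")`.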
diff --git a/_extras/discuss.md b/learners/discuss.md
similarity index 79%
rename from _extras/discuss.md
rename to learners/discuss.md
index 5784491d..bd4eb222 100644
--- a/_extras/discuss.md
+++ b/learners/discuss.md
@@ -1,6 +1,7 @@
---
-layout: page
title: Discussion
---
No current discussion
+
+
diff --git a/reference.md b/learners/reference.md
similarity index 88%
rename from reference.md
rename to learners/reference.md
index 72396270..0ba51183 100644
--- a/reference.md
+++ b/learners/reference.md
@@ -1,10 +1,10 @@
---
-layout: reference
+title: 'Glossary'
---
## Glossary
-{:auto_ids}
+{:auto_ids}
accession
: a unique identifier assigned to each sequence or set of sequences
@@ -12,13 +12,13 @@ BLAST
: The Basic Local Alignment Search Tool at NCBI that searches for similarities between known and unknown biomolecules like DNA
categorical variable
-: Variables can be classified as categorical (aka, qualitative) or quantitative (aka, numerical). Categorical variables take on a fixed number of values that are names or labels.
+: Variables can be classified as categorical (aka, qualitative) or quantitative (aka, numerical). Categorical variables take on a fixed number of values that are names or labels.
cleaned data
: data that has been manipulated post-collection to remove errors or inaccuracies, introduce desired formatting changes, or otherwise prepare the data for analysis
conditional formatting
-: formatting that is applied to a specific cell or range of cells depending on a set of criteria
+: formatting that is applied to a specific cell or range of cells depending on a set of criteria
CSV (comma separated values) format
: a plain text file format in which values are separated by commas
@@ -36,7 +36,7 @@ headers
: names at tops of columns that are descriptive about the column contents (sometimes optional)
metadata
-: data which describes other data
+: data which describes other data
NGS
: common acronym for "Next Generation Sequencing" currently being replaced by "High Throughput Sequencing"
@@ -51,7 +51,7 @@ plain text
: unformatted text
quality assurance
-: any process which checks data for validity during entry
+: any process which checks data for validity during entry
quality control
: any process which removes problematic data from a dataset
@@ -69,4 +69,6 @@ TSV (tab separated values) format
: a plain text file format in which values are separated by tabs
variable
-: a category of data being collected on the object being recorded (e.g. a mouse's weight)
\ No newline at end of file
+: a category of data being collected on the object being recorded (e.g. a mouse's weight)
+
+
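The CSV and TSV glossary entries above describe the same plain-text idea with different delimiters. A minimal Python sketch of writing one record in both formats (the record values are made up for illustration):

```python
import csv
import io

# One record written in both formats; same data, different delimiter.
record = ["SRR2584863", "REL7179B", "15000"]

# CSV: values separated by commas
csv_buf = io.StringIO()
csv.writer(csv_buf, lineterminator="\n").writerow(record)

# TSV: same values, separated by tabs
tsv_buf = io.StringIO()
csv.writer(tsv_buf, delimiter="\t", lineterminator="\n").writerow(record)

csv_line = csv_buf.getvalue().strip()
tsv_line = tsv_buf.getvalue().strip()
print(csv_line)
print(tsv_line)
```

Because both are plain text, either file opens in any text editor as well as in spreadsheet programs.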
diff --git a/learners/setup.md b/learners/setup.md
new file mode 100644
index 00000000..244dced1
--- /dev/null
+++ b/learners/setup.md
@@ -0,0 +1,10 @@
+---
+title: Setup
+---
+
+This workshop is designed to be run on pre-imaged Amazon Web Services
+(AWS) instances. For information about how to
+use the workshop materials, see the
+[setup instructions](https://www.datacarpentry.org/genomics-workshop/setup.html) on the main workshop page.
+
+
diff --git a/profiles/learner-profiles.md b/profiles/learner-profiles.md
new file mode 100644
index 00000000..434e335a
--- /dev/null
+++ b/profiles/learner-profiles.md
@@ -0,0 +1,5 @@
+---
+title: FIXME
+---
+
+This is a placeholder file. Please add content here.
diff --git a/setup.md b/setup.md
deleted file mode 100644
index 200959b1..00000000
--- a/setup.md
+++ /dev/null
@@ -1,9 +0,0 @@
----
-layout: page
-title: Setup
----
-
-This workshop is designed to be run on pre-imaged Amazon Web Services
-(AWS) instances. For information about how to
-use the workshop materials, see the
-[setup instructions](http://www.datacarpentry.org/genomics-workshop/setup.html) on the main workshop page.
diff --git a/site/README.md b/site/README.md
new file mode 100644
index 00000000..42997e3d
--- /dev/null
+++ b/site/README.md
@@ -0,0 +1,2 @@
+This directory contains rendered lesson materials. Please do not edit files
+here.