diff --git a/.editorconfig b/.editorconfig
new file mode 100644
index 00000000..5bf4860b
--- /dev/null
+++ b/.editorconfig
@@ -0,0 +1,26 @@
+root = true
+
+[*]
+charset = utf-8
+insert_final_newline = true
+trim_trailing_whitespace = true
+
+[*.md]
+indent_size = 2
+indent_style = space
+max_line_length = 100 # Please keep this in sync with bin/lesson_check.py!
+trim_trailing_whitespace = false # keep trailing spaces in markdown - 2+ spaces are translated to a hard break (<br/>)
+
+[*.r]
+max_line_length = 80
+
+[*.py]
+indent_size = 4
+indent_style = space
+max_line_length = 79
+
+[*.sh]
+end_of_line = lf
+
+[Makefile]
+indent_style = tab
diff --git a/.github/workflows/README.md b/.github/workflows/README.md
new file mode 100755
index 00000000..101967e4
--- /dev/null
+++ b/.github/workflows/README.md
@@ -0,0 +1,198 @@
+# Carpentries Workflows
+
+This directory contains workflows to be used for lessons using the {sandpaper}
+lesson infrastructure. Two of these workflows require R (`sandpaper-main.yaml`
+and `pr-receive.yaml`) and the rest are bots that handle pull request management.
+
+These workflows will likely change as {sandpaper} evolves, so it is important to
+keep them up-to-date. To do this in your lesson, you can run the following in
+your R console:
+
+```r
+# Install/Update sandpaper
+options(repos = c(carpentries = "https://carpentries.r-universe.dev/",
+ CRAN = "https://cloud.r-project.org"))
+install.packages("sandpaper")
+
+# update the workflows in your lesson
+library("sandpaper")
+update_github_workflows()
+```
+
+Inside this folder, you will find a file called `sandpaper-version.txt`, which
+will contain a version number for sandpaper. This will be used in the future to
+alert you if a workflow update is needed.
+
+What follows are the descriptions of the workflow files:
+
+## Deployment
+
+### 01 Build and Deploy (sandpaper-main.yaml)
+
+This is the main driver that will only act on the main branch of the repository.
+This workflow does the following:
+
+ 1. checks out the lesson
+ 2. provisions the following resources
+ - R
+ - pandoc
+ - lesson infrastructure (stored in a cache)
+ - lesson dependencies if needed (stored in a cache)
+ 3. builds the lesson via `sandpaper:::ci_deploy()`
+
+#### Caching
+
+This workflow has two caches; one cache is for the lesson infrastructure and
+the other is for the lesson dependencies if the lesson contains rendered
+content. These caches are invalidated by new versions of the infrastructure and
+the `renv.lock` file, respectively. If there is a problem with the cache,
+manual invalidation is necessary. You will need maintainer access to the
+repository, and you can either go to the actions tab and [click on the caches
+button to find and invalidate the failing cache](https://github.blog/changelog/2022-10-20-manage-caches-in-your-actions-workflows-from-web-interface/)
+or set the `CACHE_VERSION` secret to the current date (which will invalidate
+all of the caches).
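+
+If you have the [GitHub CLI](https://cli.github.com/) installed, a minimal
+sketch of the second option might look like this (the repository slug is a
+placeholder):
+
+```bash
+# bump CACHE_VERSION to today's date, invalidating all workflow caches
+gh secret set CACHE_VERSION --repo carpentries/example-lesson --body "$(date +%F)"
+```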
+
+## Updates
+
+### Setup Information
+
+These workflows run on a schedule and at the maintainer's request. Because they
+create pull requests that update workflows/require the downstream actions to run,
+they need a special repository/organization secret token called
+`SANDPAPER_WORKFLOW`, and it must have the `public_repo` and `workflow` scopes.
+
+This can be an individual user token, OR it can be a trusted bot account. If you
+have a repository in one of the official Carpentries accounts, then you do not
+need to worry about this token being present because the Carpentries Core Team
+will take care of supplying this token.
+
+If you want to use your personal account, you can go to
+<https://github.com/settings/tokens/new>
+to create a token. Once you have created your token, you should copy it to your
+clipboard and then go to your repository's settings > secrets > actions and
+create or edit the `SANDPAPER_WORKFLOW` secret, pasting in the generated token.
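+
+As an alternative sketch, the secret can also be set from the command line with
+the [GitHub CLI](https://cli.github.com/) (the repository slug is a placeholder;
+you will be prompted to paste the token):
+
+```bash
+# replace the slug with your own repository; gh will prompt for the token value
+gh secret set SANDPAPER_WORKFLOW --repo carpentries/example-lesson
+```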
+
+If you do not specify your token correctly, the runs will not fail; instead,
+they will give you instructions for providing the token for your repository.
+
+### 02 Maintain: Update Workflow Files (update-workflow.yaml)
+
+The {sandpaper} repository was designed to do as much as possible to separate
+the tools from the content. For local builds, this is absolutely true, but
+there is a minor issue when it comes to workflow files: they must live inside
+the repository.
+
+This workflow ensures that the workflow files are up-to-date. The way it works
+is to download the `update-workflows.sh` script from GitHub and run it. The
+script will do the following:
+
+1. check the recorded version of sandpaper against the current version on GitHub
+2. update the files if there is a difference in versions
+
+After the files are updated, if there are any changes, they are pushed to a
+branch called `update/workflows` and a pull request is created. Maintainers are
+encouraged to review the changes and accept the pull request if the outputs
+are okay.
+
+This update is run ~~weekly or~~ on demand.
+
+### 03 Maintain: Update Package Cache (update-cache.yaml)
+
+For lessons that have generated content, we use {renv} to ensure that the output
+is stable. This is controlled by a single lockfile which documents the packages
+needed for the lesson and the version numbers. This workflow is skipped in
+lessons that do not have generated content.
+
+Because the lessons need to remain current with the package ecosystem, it's a
+good idea to make sure these packages can be updated periodically. The
+update cache workflow will do this by checking for updates, applying them in a
+branch called `update/packages`, and creating a pull request with _only the
+lockfile changed_.
+
+From here, the markdown documents will be rebuilt and you can inspect what has
+changed based on how the packages have updated.
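+
+If you prefer to inspect the lockfile change locally before merging, a minimal
+sketch (assuming `origin` points at your lesson repository) is:
+
+```bash
+git fetch origin update/packages
+git diff main...origin/update/packages -- renv.lock   # lockfile-only change
+```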
+
+## Pull Request and Review Management
+
+Because our lessons execute code, pull requests are a security risk for any
+lesson and thus have security measures associated with them. **Do not merge any
+pull requests that do not pass checks or that do not have bot comments on them.**
+
+These workflows all go together and are described in the following
+diagram and the sections below:
+
+
+
+### Pre Flight Pull Request Validation (pr-preflight.yaml)
+
+This workflow runs every time a pull request is created and its purpose is to
+validate that the pull request is okay to run. This means the following things:
+
+1. The pull request does not contain modified workflow files
+2. If the pull request contains modified workflow files, it does not contain
+ modified content files (such as a situation where @carpentries-bot will
+ make an automated pull request)
+3. The pull request does not contain an invalid commit hash (e.g. from a fork
+ that was made before a lesson was transitioned from styles to use the
+ workbench).
+
+Once the checks are finished, a comment is issued to the pull request, which
+will allow maintainers to determine if it is safe to run the
+"Receive Pull Request" workflow from new contributors.
+
+### Receive Pull Request (pr-receive.yaml)
+
+**Note of caution:** This workflow runs arbitrary code submitted by anyone who
+creates a pull request. GitHub has safeguarded the token used in this workflow
+to have no privileges in the repository, but we have taken precautions to
+protect against spoofing.
+
+This workflow is triggered with every push to a pull request. If this workflow
+is already running and a new push is sent to the pull request, the workflow
+running from the previous push will be cancelled and a new workflow run will be
+started.
+
+The first step of this workflow is to check if it is valid (e.g. that no
+workflow files have been modified). If any workflow files have been modified,
+a comment is made indicating that the workflow will not run. If both a
+workflow file and lesson content are modified, an error will occur.
+
+The second step (if valid) is to build the generated content from the pull
+request. This builds the content and uploads three artifacts:
+
+1. The pull request number (pr)
+2. A summary of changes after the rendering process (diff)
+3. The rendered files (build)
+
+Because this workflow builds generated content, it follows the same general
+process as the `sandpaper-main` workflow with the same caching mechanisms.
+
+The artifacts produced are used by the next workflow.
+
+### Comment on Pull Request (pr-comment.yaml)
+
+This workflow is triggered if the `pr-receive.yaml` workflow is successful.
+The steps in this workflow are:
+
+1. Test if the workflow is valid and comment the validity of the workflow to the
+ pull request.
+2. If it is valid: create an orphan branch with two commits: the current state
+ of the repository and the proposed changes.
+3. If it is valid: update the pull request comment with the summary of changes.
+
+Importantly: if the pull request is invalid, the branch is not created, so any
+malicious code is not published.
+
+From here, the maintainer can request changes from the author and eventually
+either merge or reject the PR. When this happens, if the PR was valid, the
+preview branch needs to be deleted.
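+
+The close/remove workflows described below normally handle this, but if they do
+not run you can delete the preview branch manually; a sketch with a hypothetical
+pull request number is:
+
+```bash
+# "42" is a placeholder for the pull request number
+git push origin --delete md-outputs-PR-42
+```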
+
+### Send Close PR Signal (pr-close-signal.yaml)
+
+Triggered any time a pull request is closed. This emits an artifact containing
+the pull request number for the next workflow.
+
+### Remove Pull Request Branch (pr-post-remove-branch.yaml)
+
+Triggered by `pr-close-signal.yaml`. This removes the temporary branch associated with
+the pull request (if it was created).
diff --git a/.github/workflows/pr-close-signal.yaml b/.github/workflows/pr-close-signal.yaml
new file mode 100755
index 00000000..9b129d5d
--- /dev/null
+++ b/.github/workflows/pr-close-signal.yaml
@@ -0,0 +1,23 @@
+name: "Bot: Send Close Pull Request Signal"
+
+on:
+ pull_request:
+ types:
+ [closed]
+
+jobs:
+ send-close-signal:
+ name: "Send closing signal"
+ runs-on: ubuntu-latest
+ if: ${{ github.event.action == 'closed' }}
+ steps:
+ - name: "Create PRtifact"
+ run: |
+ mkdir -p ./pr
+ printf ${{ github.event.number }} > ./pr/NUM
+ - name: Upload Diff
+ uses: actions/upload-artifact@v3
+ with:
+ name: pr
+ path: ./pr
+
diff --git a/.github/workflows/pr-comment.yaml b/.github/workflows/pr-comment.yaml
new file mode 100755
index 00000000..bb2eb03c
--- /dev/null
+++ b/.github/workflows/pr-comment.yaml
@@ -0,0 +1,185 @@
+name: "Bot: Comment on the Pull Request"
+
+# read-write repo token
+# access to secrets
+on:
+ workflow_run:
+ workflows: ["Receive Pull Request"]
+ types:
+ - completed
+
+concurrency:
+ group: pr-${{ github.event.workflow_run.pull_requests[0].number }}
+ cancel-in-progress: true
+
+
+jobs:
+ # Pull requests are valid if:
+ # - they match the sha of the workflow run head commit
+ # - they are open
+ # - no .github files were committed
+ test-pr:
+ name: "Test if pull request is valid"
+ runs-on: ubuntu-latest
+ if: >
+ github.event.workflow_run.event == 'pull_request' &&
+ github.event.workflow_run.conclusion == 'success'
+ outputs:
+ is_valid: ${{ steps.check-pr.outputs.VALID }}
+ payload: ${{ steps.check-pr.outputs.payload }}
+ number: ${{ steps.get-pr.outputs.NUM }}
+ msg: ${{ steps.check-pr.outputs.MSG }}
+ steps:
+ - name: 'Download PR artifact'
+ id: dl
+ uses: carpentries/actions/download-workflow-artifact@main
+ with:
+ run: ${{ github.event.workflow_run.id }}
+ name: 'pr'
+
+ - name: "Get PR Number"
+ if: ${{ steps.dl.outputs.success == 'true' }}
+ id: get-pr
+ run: |
+ unzip pr.zip
+ echo "NUM=$(<./NR)" >> $GITHUB_OUTPUT
+
+ - name: "Fail if PR number was not present"
+ id: bad-pr
+ if: ${{ steps.dl.outputs.success != 'true' }}
+ run: |
+ echo '::error::A pull request number was not recorded. The pull request that triggered this workflow is likely malicious.'
+ exit 1
+ - name: "Get Invalid Hashes File"
+ id: hash
+ run: |
+ echo "json<<EOF
+ $(curl -sL https://files.carpentries.org/invalid-hashes.json)
+ EOF" >> $GITHUB_OUTPUT
+ - name: "Check PR"
+ id: check-pr
+ if: ${{ steps.dl.outputs.success == 'true' }}
+ uses: carpentries/actions/check-valid-pr@main
+ with:
+ pr: ${{ steps.get-pr.outputs.NUM }}
+ sha: ${{ github.event.workflow_run.head_sha }}
+ headroom: 3 # if it's within the last three commits, we can keep going, because it's likely rapid-fire
+ invalid: ${{ fromJSON(steps.hash.outputs.json)[github.repository] }}
+ fail_on_error: true
+
+ # Create an orphan branch on this repository with two commits
+ # - the current HEAD of the md-outputs branch
+ # - the output from running the current HEAD of the pull request through
+ # the md generator
+ create-branch:
+ name: "Create Git Branch"
+ needs: test-pr
+ runs-on: ubuntu-latest
+ if: ${{ needs.test-pr.outputs.is_valid == 'true' }}
+ env:
+ NR: ${{ needs.test-pr.outputs.number }}
+ permissions:
+ contents: write
+ steps:
+ - name: 'Checkout md outputs'
+ uses: actions/checkout@v3
+ with:
+ ref: md-outputs
+ path: built
+ fetch-depth: 1
+
+ - name: 'Download built markdown'
+ id: dl
+ uses: carpentries/actions/download-workflow-artifact@main
+ with:
+ run: ${{ github.event.workflow_run.id }}
+ name: 'built'
+
+ - if: ${{ steps.dl.outputs.success == 'true' }}
+ run: unzip built.zip
+
+ - name: "Create orphan and push"
+ if: ${{ steps.dl.outputs.success == 'true' }}
+ run: |
+ cd built/
+ git config --local user.email "actions@github.com"
+ git config --local user.name "GitHub Actions"
+ CURR_HEAD=$(git rev-parse HEAD)
+ git checkout --orphan md-outputs-PR-${NR}
+ git add -A
+ git commit -m "source commit: ${CURR_HEAD}"
+ ls -A | grep -v '^.git$' | xargs -I _ rm -r '_'
+ cd ..
+ unzip -o -d built built.zip
+ cd built
+ git add -A
+ git commit --allow-empty -m "differences for PR #${NR}"
+ git push -u --force --set-upstream origin md-outputs-PR-${NR}
+
+ # Comment on the Pull Request with a link to the branch and the diff
+ comment-pr:
+ name: "Comment on Pull Request"
+ needs: [test-pr, create-branch]
+ runs-on: ubuntu-latest
+ if: ${{ needs.test-pr.outputs.is_valid == 'true' }}
+ env:
+ NR: ${{ needs.test-pr.outputs.number }}
+ permissions:
+ pull-requests: write
+ steps:
+ - name: 'Download comment artifact'
+ id: dl
+ uses: carpentries/actions/download-workflow-artifact@main
+ with:
+ run: ${{ github.event.workflow_run.id }}
+ name: 'diff'
+
+ - if: ${{ steps.dl.outputs.success == 'true' }}
+ run: unzip ${{ github.workspace }}/diff.zip
+
+ - name: "Comment on PR"
+ id: comment-diff
+ if: ${{ steps.dl.outputs.success == 'true' }}
+ uses: carpentries/actions/comment-diff@main
+ with:
+ pr: ${{ env.NR }}
+ path: ${{ github.workspace }}/diff.md
+
+ # Comment if the PR is open and matches the SHA, but the workflow files have
+ # changed
+ comment-changed-workflow:
+ name: "Comment if workflow files have changed"
+ needs: test-pr
+ runs-on: ubuntu-latest
+ if: ${{ always() && needs.test-pr.outputs.is_valid == 'false' }}
+ env:
+ NR: ${{ github.event.workflow_run.pull_requests[0].number }}
+ body: ${{ needs.test-pr.outputs.msg }}
+ permissions:
+ pull-requests: write
+ steps:
+ - name: 'Check for spoofing'
+ id: dl
+ uses: carpentries/actions/download-workflow-artifact@main
+ with:
+ run: ${{ github.event.workflow_run.id }}
+ name: 'built'
+
+ - name: 'Alert if spoofed'
+ id: spoof
+ if: ${{ steps.dl.outputs.success == 'true' }}
+ run: |
+ echo 'body<<EOF' >> $GITHUB_ENV
+ echo '' >> $GITHUB_ENV
+ echo '## :x: DANGER :x:' >> $GITHUB_ENV
+ echo 'This pull request has modified workflows that created output. Close this now.' >> $GITHUB_ENV
+ echo '' >> $GITHUB_ENV
+ echo 'EOF' >> $GITHUB_ENV
+
+ - name: "Comment on PR"
+ id: comment-diff
+ uses: carpentries/actions/comment-diff@main
+ with:
+ pr: ${{ env.NR }}
+ body: ${{ env.body }}
+
diff --git a/.github/workflows/pr-post-remove-branch.yaml b/.github/workflows/pr-post-remove-branch.yaml
new file mode 100755
index 00000000..62c2e98d
--- /dev/null
+++ b/.github/workflows/pr-post-remove-branch.yaml
@@ -0,0 +1,32 @@
+name: "Bot: Remove Temporary PR Branch"
+
+on:
+ workflow_run:
+ workflows: ["Bot: Send Close Pull Request Signal"]
+ types:
+ - completed
+
+jobs:
+ delete:
+ name: "Delete branch from Pull Request"
+ runs-on: ubuntu-latest
+ if: >
+ github.event.workflow_run.event == 'pull_request' &&
+ github.event.workflow_run.conclusion == 'success'
+ permissions:
+ contents: write
+ steps:
+ - name: 'Download artifact'
+ uses: carpentries/actions/download-workflow-artifact@main
+ with:
+ run: ${{ github.event.workflow_run.id }}
+ name: pr
+ - name: "Get PR Number"
+ id: get-pr
+ run: |
+ unzip pr.zip
+ echo "NUM=$(<./NUM)" >> $GITHUB_OUTPUT
+ - name: 'Remove branch'
+ uses: carpentries/actions/remove-branch@main
+ with:
+ pr: ${{ steps.get-pr.outputs.NUM }}
diff --git a/.github/workflows/pr-preflight.yaml b/.github/workflows/pr-preflight.yaml
new file mode 100755
index 00000000..d0d7420d
--- /dev/null
+++ b/.github/workflows/pr-preflight.yaml
@@ -0,0 +1,39 @@
+name: "Pull Request Preflight Check"
+
+on:
+ pull_request_target:
+ branches:
+ ["main"]
+ types:
+ ["opened", "synchronize", "reopened"]
+
+jobs:
+ test-pr:
+ name: "Test if pull request is valid"
+ if: ${{ github.event.action != 'closed' }}
+ runs-on: ubuntu-latest
+ outputs:
+ is_valid: ${{ steps.check-pr.outputs.VALID }}
+ permissions:
+ pull-requests: write
+ steps:
+ - name: "Get Invalid Hashes File"
+ id: hash
+ run: |
+ echo "json<<EOF
+ $(curl -sL https://files.carpentries.org/invalid-hashes.json)
+ EOF" >> $GITHUB_OUTPUT
+ - name: "Check PR"
+ id: check-pr
+ uses: carpentries/actions/check-valid-pr@main
+ with:
+ pr: ${{ github.event.number }}
+ invalid: ${{ fromJSON(steps.hash.outputs.json)[github.repository] }}
+ fail_on_error: true
+ - name: "Comment result of validation"
+ id: comment-diff
+ if: ${{ always() }}
+ uses: carpentries/actions/comment-diff@main
+ with:
+ pr: ${{ github.event.number }}
+ body: ${{ steps.check-pr.outputs.MSG }}
diff --git a/.github/workflows/pr-receive.yaml b/.github/workflows/pr-receive.yaml
new file mode 100755
index 00000000..371ef542
--- /dev/null
+++ b/.github/workflows/pr-receive.yaml
@@ -0,0 +1,131 @@
+name: "Receive Pull Request"
+
+on:
+ pull_request:
+ types:
+ [opened, synchronize, reopened]
+
+concurrency:
+ group: ${{ github.ref }}
+ cancel-in-progress: true
+
+jobs:
+ test-pr:
+ name: "Record PR number"
+ if: ${{ github.event.action != 'closed' }}
+ runs-on: ubuntu-latest
+ outputs:
+ is_valid: ${{ steps.check-pr.outputs.VALID }}
+ steps:
+ - name: "Record PR number"
+ id: record
+ if: ${{ always() }}
+ run: |
+ echo ${{ github.event.number }} > ${{ github.workspace }}/NR # 2022-03-02: artifact name fixed to be NR
+ - name: "Upload PR number"
+ id: upload
+ if: ${{ always() }}
+ uses: actions/upload-artifact@v3
+ with:
+ name: pr
+ path: ${{ github.workspace }}/NR
+ - name: "Get Invalid Hashes File"
+ id: hash
+ run: |
+ echo "json<<EOF
+ $(curl -sL https://files.carpentries.org/invalid-hashes.json)
+ EOF" >> $GITHUB_OUTPUT
+ - name: "echo output"
+ run: |
+ echo "${{ steps.hash.outputs.json }}"
+ - name: "Check PR"
+ id: check-pr
+ uses: carpentries/actions/check-valid-pr@main
+ with:
+ pr: ${{ github.event.number }}
+ invalid: ${{ fromJSON(steps.hash.outputs.json)[github.repository] }}
+
+ build-md-source:
+ name: "Build markdown source files if valid"
+ needs: test-pr
+ runs-on: ubuntu-latest
+ if: ${{ needs.test-pr.outputs.is_valid == 'true' }}
+ env:
+ GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
+ RENV_PATHS_ROOT: ~/.local/share/renv/
+ CHIVE: ${{ github.workspace }}/site/chive
+ PR: ${{ github.workspace }}/site/pr
+ MD: ${{ github.workspace }}/site/built
+ steps:
+ - name: "Check Out Main Branch"
+ uses: actions/checkout@v3
+
+ - name: "Check Out Staging Branch"
+ uses: actions/checkout@v3
+ with:
+ ref: md-outputs
+ path: ${{ env.MD }}
+
+ - name: "Set up R"
+ uses: r-lib/actions/setup-r@v2
+ with:
+ use-public-rspm: true
+ install-r: false
+
+ - name: "Set up Pandoc"
+ uses: r-lib/actions/setup-pandoc@v2
+
+ - name: "Setup Lesson Engine"
+ uses: carpentries/actions/setup-sandpaper@main
+ with:
+ cache-version: ${{ secrets.CACHE_VERSION }}
+
+ - name: "Setup Package Cache"
+ uses: carpentries/actions/setup-lesson-deps@main
+ with:
+ cache-version: ${{ secrets.CACHE_VERSION }}
+
+ - name: "Validate and Build Markdown"
+ id: build-site
+ run: |
+ sandpaper::package_cache_trigger(TRUE)
+ sandpaper::validate_lesson(path = '${{ github.workspace }}')
+ sandpaper:::build_markdown(path = '${{ github.workspace }}', quiet = FALSE)
+ shell: Rscript {0}
+
+ - name: "Generate Artifacts"
+ id: generate-artifacts
+ run: |
+ sandpaper:::ci_bundle_pr_artifacts(
+ repo = '${{ github.repository }}',
+ pr_number = '${{ github.event.number }}',
+ path_md = '${{ env.MD }}',
+ path_pr = '${{ env.PR }}',
+ path_archive = '${{ env.CHIVE }}',
+ branch = 'md-outputs'
+ )
+ shell: Rscript {0}
+
+ - name: "Upload PR"
+ uses: actions/upload-artifact@v3
+ with:
+ name: pr
+ path: ${{ env.PR }}
+
+ - name: "Upload Diff"
+ uses: actions/upload-artifact@v3
+ with:
+ name: diff
+ path: ${{ env.CHIVE }}
+ retention-days: 1
+
+ - name: "Upload Build"
+ uses: actions/upload-artifact@v3
+ with:
+ name: built
+ path: ${{ env.MD }}
+ retention-days: 1
+
+ - name: "Teardown"
+ run: sandpaper::reset_site()
+ shell: Rscript {0}
diff --git a/.github/workflows/sandpaper-main.yaml b/.github/workflows/sandpaper-main.yaml
new file mode 100755
index 00000000..e17707ac
--- /dev/null
+++ b/.github/workflows/sandpaper-main.yaml
@@ -0,0 +1,61 @@
+name: "01 Build and Deploy Site"
+
+on:
+ push:
+ branches:
+ - main
+ - master
+ schedule:
+ - cron: '0 0 * * 2'
+ workflow_dispatch:
+ inputs:
+ name:
+ description: 'Who triggered this build?'
+ required: true
+ default: 'Maintainer (via GitHub)'
+ reset:
+ description: 'Reset cached markdown files'
+ required: false
+ default: false
+ type: boolean
+jobs:
+ full-build:
+ name: "Build Full Site"
+ runs-on: ubuntu-latest
+ permissions:
+ checks: write
+ contents: write
+ pages: write
+ env:
+ GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
+ RENV_PATHS_ROOT: ~/.local/share/renv/
+ steps:
+
+ - name: "Checkout Lesson"
+ uses: actions/checkout@v3
+
+ - name: "Set up R"
+ uses: r-lib/actions/setup-r@v2
+ with:
+ use-public-rspm: true
+ install-r: false
+
+ - name: "Set up Pandoc"
+ uses: r-lib/actions/setup-pandoc@v2
+
+ - name: "Setup Lesson Engine"
+ uses: carpentries/actions/setup-sandpaper@main
+ with:
+ cache-version: ${{ secrets.CACHE_VERSION }}
+
+ - name: "Setup Package Cache"
+ uses: carpentries/actions/setup-lesson-deps@main
+ with:
+ cache-version: ${{ secrets.CACHE_VERSION }}
+
+ - name: "Deploy Site"
+ run: |
+ reset <- "${{ github.event.inputs.reset }}" == "true"
+ sandpaper::package_cache_trigger(TRUE)
+ sandpaper:::ci_deploy(reset = reset)
+ shell: Rscript {0}
diff --git a/.github/workflows/sandpaper-version.txt b/.github/workflows/sandpaper-version.txt
new file mode 100644
index 00000000..4aa09069
--- /dev/null
+++ b/.github/workflows/sandpaper-version.txt
@@ -0,0 +1 @@
+0.11.15
diff --git a/.github/workflows/update-cache.yaml b/.github/workflows/update-cache.yaml
new file mode 100755
index 00000000..676d7424
--- /dev/null
+++ b/.github/workflows/update-cache.yaml
@@ -0,0 +1,125 @@
+name: "03 Maintain: Update Package Cache"
+
+on:
+ workflow_dispatch:
+ inputs:
+ name:
+ description: 'Who triggered this build (enter github username to tag yourself)?'
+ required: true
+ default: 'monthly run'
+ schedule:
+ # Run every tuesday
+ - cron: '0 0 * * 2'
+
+jobs:
+ preflight:
+ name: "Preflight Check"
+ runs-on: ubuntu-latest
+ outputs:
+ ok: ${{ steps.check.outputs.ok }}
+ steps:
+ - id: check
+ run: |
+ if [[ ${{ github.event_name }} == 'workflow_dispatch' ]]; then
+ echo "ok=true" >> $GITHUB_OUTPUT
+ echo "Running on request"
+ # using single brackets here to avoid 08 being interpreted as octal
+ # https://github.com/carpentries/sandpaper/issues/250
+ elif [ `date +%d` -le 7 ]; then
+ # If the Tuesday lands in the first week of the month, run it
+ echo "ok=true" >> $GITHUB_OUTPUT
+ echo "Running on schedule"
+ else
+ echo "ok=false" >> $GITHUB_OUTPUT
+ echo "Not Running Today"
+ fi
+
+ check_renv:
+ name: "Check if We Need {renv}"
+ runs-on: ubuntu-latest
+ needs: preflight
+ if: ${{ needs.preflight.outputs.ok == 'true'}}
+ outputs:
+ needed: ${{ steps.renv.outputs.exists }}
+ steps:
+ - name: "Checkout Lesson"
+ uses: actions/checkout@v3
+ - id: renv
+ run: |
+ if [[ -d renv ]]; then
+ echo "exists=true" >> $GITHUB_OUTPUT
+ fi
+
+ check_token:
+ name: "Check SANDPAPER_WORKFLOW token"
+ runs-on: ubuntu-latest
+ needs: check_renv
+ if: ${{ needs.check_renv.outputs.needed == 'true' }}
+ outputs:
+ workflow: ${{ steps.validate.outputs.wf }}
+ repo: ${{ steps.validate.outputs.repo }}
+ steps:
+ - name: "validate token"
+ id: validate
+ uses: carpentries/actions/check-valid-credentials@main
+ with:
+ token: ${{ secrets.SANDPAPER_WORKFLOW }}
+
+ update_cache:
+ name: "Update Package Cache"
+ needs: check_token
+ if: ${{ needs.check_token.outputs.repo == 'true' }}
+ runs-on: ubuntu-latest
+ env:
+ GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
+ RENV_PATHS_ROOT: ~/.local/share/renv/
+ steps:
+
+ - name: "Checkout Lesson"
+ uses: actions/checkout@v3
+
+ - name: "Set up R"
+ uses: r-lib/actions/setup-r@v2
+ with:
+ use-public-rspm: true
+ install-r: false
+
+ - name: "Update {renv} deps and determine if a PR is needed"
+ id: update
+ uses: carpentries/actions/update-lockfile@main
+ with:
+ cache-version: ${{ secrets.CACHE_VERSION }}
+
+ - name: Create Pull Request
+ id: cpr
+ if: ${{ steps.update.outputs.n > 0 }}
+ uses: carpentries/create-pull-request@main
+ with:
+ token: ${{ secrets.SANDPAPER_WORKFLOW }}
+ delete-branch: true
+ branch: "update/packages"
+ commit-message: "[actions] update ${{ steps.update.outputs.n }} packages"
+ title: "Update ${{ steps.update.outputs.n }} packages"
+ body: |
+ :robot: This is an automated build
+
+ This will update ${{ steps.update.outputs.n }} packages in your lesson with the following versions:
+
+ ```
+ ${{ steps.update.outputs.report }}
+ ```
+
+ :stopwatch: In a few minutes, a comment will appear that will show you how the output has changed based on these updates.
+
+ If you want to inspect these changes locally, you can use the following code to check out a new branch:
+
+ ```bash
+ git fetch origin update/packages
+ git checkout update/packages
+ ```
+
+ - Auto-generated by [create-pull-request][1] on ${{ steps.update.outputs.date }}
+
+ [1]: https://github.com/carpentries/create-pull-request/tree/main
+ labels: "type: package cache"
+ draft: false
diff --git a/.github/workflows/update-workflows.yaml b/.github/workflows/update-workflows.yaml
new file mode 100755
index 00000000..288bcd13
--- /dev/null
+++ b/.github/workflows/update-workflows.yaml
@@ -0,0 +1,66 @@
+name: "02 Maintain: Update Workflow Files"
+
+on:
+ workflow_dispatch:
+ inputs:
+ name:
+ description: 'Who triggered this build (enter github username to tag yourself)?'
+ required: true
+ default: 'weekly run'
+ clean:
+ description: 'Workflow files/file extensions to clean (no wildcards, enter "" for none)'
+ required: false
+ default: '.yaml'
+ schedule:
+ # Run every Tuesday
+ - cron: '0 0 * * 2'
+
+jobs:
+ check_token:
+ name: "Check SANDPAPER_WORKFLOW token"
+ runs-on: ubuntu-latest
+ outputs:
+ workflow: ${{ steps.validate.outputs.wf }}
+ repo: ${{ steps.validate.outputs.repo }}
+ steps:
+ - name: "validate token"
+ id: validate
+ uses: carpentries/actions/check-valid-credentials@main
+ with:
+ token: ${{ secrets.SANDPAPER_WORKFLOW }}
+
+ update_workflow:
+ name: "Update Workflow"
+ runs-on: ubuntu-latest
+ needs: check_token
+ if: ${{ needs.check_token.outputs.workflow == 'true' }}
+ steps:
+ - name: "Checkout Repository"
+ uses: actions/checkout@v3
+
+ - name: Update Workflows
+ id: update
+ uses: carpentries/actions/update-workflows@main
+ with:
+ clean: ${{ github.event.inputs.clean }}
+
+ - name: Create Pull Request
+ id: cpr
+ if: "${{ steps.update.outputs.new }}"
+ uses: carpentries/create-pull-request@main
+ with:
+ token: ${{ secrets.SANDPAPER_WORKFLOW }}
+ delete-branch: true
+ branch: "update/workflows"
+ commit-message: "[actions] update sandpaper workflow to version ${{ steps.update.outputs.new }}"
+ title: "Update Workflows to Version ${{ steps.update.outputs.new }}"
+ body: |
+ :robot: This is an automated build
+
+ Update Workflows from sandpaper version ${{ steps.update.outputs.old }} -> ${{ steps.update.outputs.new }}
+
+ - Auto-generated by [create-pull-request][1] on ${{ steps.update.outputs.date }}
+
+ [1]: https://github.com/carpentries/create-pull-request/tree/main
+ labels: "type: template and tools"
+ draft: false
diff --git a/.github/workflows/workbench-beta-phase.yml b/.github/workflows/workbench-beta-phase.yml
new file mode 100644
index 00000000..2faa25d9
--- /dev/null
+++ b/.github/workflows/workbench-beta-phase.yml
@@ -0,0 +1,60 @@
+name: "Deploy to AWS"
+
+on:
+ workflow_run:
+ workflows: ["01 Build and Deploy Site"]
+ types:
+ - completed
+ workflow_dispatch:
+
+jobs:
+ preflight:
+ name: "Preflight Check"
+ runs-on: ubuntu-latest
+ outputs:
+ ok: ${{ steps.check.outputs.ok }}
+ folder: ${{ steps.check.outputs.folder }}
+ steps:
+ - id: check
+ run: |
+ if [[ -z "${{ secrets.DISTRIBUTION }}" || -z "${{ secrets.AWS_ACCESS_KEY_ID }}" || -z "${{ secrets.AWS_SECRET_ACCESS_KEY }}" ]]; then
+ echo ":information_source: No site configured" >> $GITHUB_STEP_SUMMARY
+ echo "" >> $GITHUB_STEP_SUMMARY
+ echo 'To deploy the preview on AWS, you need the `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` and `DISTRIBUTION` secrets set up' >> $GITHUB_STEP_SUMMARY
+ else
+ echo "::set-output name=folder::"$(sed -E 's^.+/(.+)^\1^' <<< ${{ github.repository }})
+ echo "::set-output name=ok::true"
+ fi
+
+ full-build:
+ name: "Deploy to AWS"
+ needs: [preflight]
+ if: ${{ needs.preflight.outputs.ok }}
+ runs-on: ubuntu-latest
+ steps:
+
+ - name: "Checkout site folder"
+ uses: actions/checkout@v3
+ with:
+ ref: 'gh-pages'
+ path: 'source'
+
+ - name: "Deploy to Bucket"
+ uses: jakejarvis/s3-sync-action@v0.5.1
+ with:
+ args: --acl public-read --follow-symlinks --delete --exclude '.git/*'
+ env:
+ AWS_S3_BUCKET: preview.carpentries.org
+ AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
+ AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+ SOURCE_DIR: 'source'
+ DEST_DIR: ${{ needs.preflight.outputs.folder }}
+
+ - name: "Invalidate CloudFront"
+ uses: chetan/invalidate-cloudfront-action@master
+ env:
+ PATHS: /*
+ AWS_REGION: 'us-east-1'
+ DISTRIBUTION: ${{ secrets.DISTRIBUTION }}
+ AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
+ AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 00000000..1146fea3
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,56 @@
+# sandpaper files
+episodes/*html
+site/*
+!site/README.md
+
+# History files
+.Rhistory
+.Rapp.history
+# Session Data files
+.RData
+# User-specific files
+.Ruserdata
+# Example code in package build process
+*-Ex.R
+# Output files from R CMD build
+/*.tar.gz
+# Output files from R CMD check
+/*.Rcheck/
+# RStudio files
+.Rproj.user/
+# produced vignettes
+vignettes/*.html
+vignettes/*.pdf
+# OAuth2 token, see https://github.com/hadley/httr/releases/tag/v0.3
+.httr-oauth
+# knitr and R markdown default cache directories
+*_cache/
+/cache/
+# Temporary files created by R markdown
+*.utf8.md
+*.knit.md
+# R Environment Variables
+.Renviron
+# pkgdown site
+docs/
+# translation temp files
+po/*~
+# renv detritus
+renv/sandbox/
+GC_Pipe.txt
+*.pyc
+*~
+.DS_Store
+.ipynb_checkpoints
+.sass-cache
+.jekyll-cache/
+.jekyll-metadata
+__pycache__
+_site
+.Rproj.user
+.bundle/
+.vendor/
+vendor/
+.docker-vendor/
+Gemfile.lock
+.*history
diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
new file mode 100644
index 00000000..f19b8049
--- /dev/null
+++ b/CODE_OF_CONDUCT.md
@@ -0,0 +1,13 @@
+---
+title: "Contributor Code of Conduct"
+---
+
+As contributors and maintainers of this project,
+we pledge to follow [The Carpentries Code of Conduct][coc].
+
+Instances of abusive, harassing, or otherwise unacceptable behavior
+may be reported by following our [reporting guidelines][coc-reporting].
+
+
+[coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html
+[coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
new file mode 100644
index 00000000..ec44704c
--- /dev/null
+++ b/CONTRIBUTING.md
@@ -0,0 +1,121 @@
+## Contributing
+
+[The Carpentries][cp-site] ([Software Carpentry][swc-site], [Data
+Carpentry][dc-site], and [Library Carpentry][lc-site]) are open source
+projects, and we welcome contributions of all kinds: new lessons, fixes to
+existing material, bug reports, and reviews of proposed changes are all
+welcome.
+
+### Contributor Agreement
+
+By contributing, you agree that we may redistribute your work under [our
+license](LICENSE.md). In exchange, we will address your issues and/or assess
+your change proposal as promptly as we can, and help you become a member of our
+community. Everyone involved in [The Carpentries][cp-site] agrees to abide by
+our [code of conduct](CODE_OF_CONDUCT.md).
+
+### How to Contribute
+
+The easiest way to get started is to file an issue to tell us about a spelling
+mistake, some awkward wording, or a factual error. This is a good way to
+introduce yourself and to meet some of our community members.
+
+1. If you do not have a [GitHub][github] account, you can [send us comments by
+ email][contact]. However, we will be able to respond more quickly if you use
+ one of the other methods described below.
+
+2. If you have a [GitHub][github] account, or are willing to [create
+ one][github-join], but do not know how to use Git, you can report problems
+ or suggest improvements by [creating an issue][issues]. This allows us to
+ assign the item to someone and to respond to it in a threaded discussion.
+
+3. If you are comfortable with Git, and would like to add or change material,
+ you can submit a pull request (PR). Instructions for doing this are
+ [included below](#using-github).
+
+Note: if you want to build the website locally, please refer to [The Workbench
+documentation][template-doc].
+
+### Where to Contribute
+
+1. If you wish to change this lesson, add issues and pull requests here.
+2. If you wish to change the template used for workshop websites, please refer
+ to [The Workbench documentation][template-doc].
+
+
+### What to Contribute
+
+There are many ways to contribute, from writing new exercises and improving
+existing ones to updating or filling in the documentation and submitting [bug
+reports][issues] about things that do not work, are not clear, or are missing.
+If you are looking for ideas, please see [the list of issues for this
+repository][repo], or the issues for [Data Carpentry][dc-issues], [Library
+Carpentry][lc-issues], and [Software Carpentry][swc-issues] projects.
+
+Comments on issues and reviews of pull requests are just as welcome: we are
+smarter together than we are on our own. **Reviews from novices and newcomers
+are particularly valuable**: it's easy for people who have been using these
+lessons for a while to forget how impenetrable some of this material can be, so
+fresh eyes are always welcome.
+
+### What *Not* to Contribute
+
+Our lessons already contain more material than we can cover in a typical
+workshop, so we are usually *not* looking for more concepts or tools to add to
+them. As a rule, if you want to introduce a new idea, you must (a) estimate how
+long it will take to teach and (b) explain what you would take out to make room
+for it. The first encourages contributors to be honest about requirements; the
+second, to think hard about priorities.
+
+We are also not looking for exercises or other material that only run on one
+platform. Our workshops typically contain a mixture of Windows, macOS, and
+Linux users; in order to be usable, our lessons must run equally well on all
+three.
+
+### Using GitHub
+
+If you choose to contribute via GitHub, you may want to look at [How to
+Contribute to an Open Source Project on GitHub][how-contribute]. In brief, we
+use [GitHub flow][github-flow] to manage changes, as sketched below:
+
+1. Create a new branch in your desktop copy of this repository for each
+ significant change.
+2. Commit the change in that branch.
+3. Push that branch to your fork of this repository on GitHub.
+4. Submit a pull request from that branch to the [upstream repository][repo].
+5. If you receive feedback, make changes on your desktop and push to your
+ branch on GitHub: the pull request will update automatically.
+
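+As a rough sketch, steps 1-5 might look like this on the command line (the
+branch name and commit message are placeholders):
+
+```bash
+git checkout -b fix/episode-2-typo        # 1. create a branch for the change
+git commit -am "Fix typo in episode 2"    # 2. commit the change
+git push origin fix/episode-2-typo        # 3. push the branch to your fork
+# 4-5. open a pull request on GitHub; further pushes update it automatically
+```
+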
+NB: The published copy of the lesson is usually in the `main` branch.
+
+Each lesson has a team of maintainers who review issues and pull requests or
+encourage others to do so. The maintainers are community volunteers, and have
+final say over what gets merged into the lesson.
+
+### Other Resources
+
+The Carpentries is a global organisation with volunteers and learners all over
+the world. We share values of inclusivity and a passion for sharing knowledge,
+teaching and learning. There are several ways to connect with The Carpentries
+community listed at including via social
+media, slack, newsletters, and email lists. You can also [reach us by
+email][contact].
+
+[repo]: https://example.com/FIXME
+[contact]: mailto:team@carpentries.org
+[cp-site]: https://carpentries.org/
+[dc-issues]: https://github.com/issues?q=user%3Adatacarpentry
+[dc-lessons]: https://datacarpentry.org/lessons/
+[dc-site]: https://datacarpentry.org/
+[discuss-list]: https://lists.software-carpentry.org/listinfo/discuss
+[github]: https://github.com
+[github-flow]: https://guides.github.com/introduction/flow/
+[github-join]: https://github.com/join
+[how-contribute]: https://egghead.io/series/how-to-contribute-to-an-open-source-project-on-github
+[issues]: https://carpentries.org/help-wanted-issues/
+[lc-issues]: https://github.com/issues?q=user%3ALibraryCarpentry
+[swc-issues]: https://github.com/issues?q=user%3Aswcarpentry
+[swc-lessons]: https://software-carpentry.org/lessons/
+[swc-site]: https://software-carpentry.org/
+[lc-site]: https://librarycarpentry.org/
+[template-doc]: https://carpentries.github.io/workbench/
diff --git a/LICENSE.md b/LICENSE.md
new file mode 100644
index 00000000..7632871f
--- /dev/null
+++ b/LICENSE.md
@@ -0,0 +1,79 @@
+---
+title: "Licenses"
+---
+
+## Instructional Material
+
+All Carpentries (Software Carpentry, Data Carpentry, and Library Carpentry)
+instructional material is made available under the [Creative Commons
+Attribution license][cc-by-human]. The following is a human-readable summary of
+(and not a substitute for) the [full legal text of the CC BY 4.0
+license][cc-by-legal].
+
+You are free:
+
+- to **Share**---copy and redistribute the material in any medium or format
+- to **Adapt**---remix, transform, and build upon the material
+
+for any purpose, even commercially.
+
+The licensor cannot revoke these freedoms as long as you follow the license
+terms.
+
+Under the following terms:
+
+- **Attribution**---You must give appropriate credit (mentioning that your work
+ is derived from work that is Copyright (c) The Carpentries and, where
+ practical, linking to <https://carpentries.org/>), provide a [link to the
+ license][cc-by-human], and indicate if changes were made. You may do so in
+ any reasonable manner, but not in any way that suggests the licensor endorses
+ you or your use.
+
+- **No additional restrictions**---You may not apply legal terms or
+ technological measures that legally restrict others from doing anything the
+ license permits. With the understanding that:
+
+Notices:
+
+* You do not have to comply with the license for elements of the material in
+ the public domain or where your use is permitted by an applicable exception
+ or limitation.
+* No warranties are given. The license may not give you all of the permissions
+ necessary for your intended use. For example, other rights such as publicity,
+ privacy, or moral rights may limit how you use the material.
+
+## Software
+
+Except where otherwise noted, the example programs and other software provided
+by The Carpentries are made available under the [OSI][osi]-approved [MIT
+license][mit-license].
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of
+this software and associated documentation files (the "Software"), to deal in
+the Software without restriction, including without limitation the rights to
+use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
+of the Software, and to permit persons to whom the Software is furnished to do
+so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+
+## Trademark
+
+"The Carpentries", "Software Carpentry", "Data Carpentry", and "Library
+Carpentry" and their respective logos are registered trademarks of [Community
+Initiatives][ci].
+
+[cc-by-human]: https://creativecommons.org/licenses/by/4.0/
+[cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode
+[mit-license]: https://opensource.org/licenses/mit-license.html
+[ci]: https://communityin.org/
+[osi]: https://opensource.org
diff --git a/README.md b/README.md
index 9f3aa44a..3c88a319 100644
--- a/README.md
+++ b/README.md
@@ -1,18 +1,18 @@
[](https://doi.org/10.5281/zenodo.3260609)
-[](https://swc-slack-invite.herokuapp.com/)
-[](https://swcarpentry.slack.com/messages/C9N1K7DCY)
+[](https://swc-slack-invite.herokuapp.com/)
+[](https://swcarpentry.slack.com/messages/C9N1K7DCY)
# Wrangling Genomics
Lesson for quality control and wrangling genomics data. This repository is maintained by [Josh Herr](https://github.com/jrherr), [Ming Tang](https://github.com/crazyhottommy), and [Fotis Psomopoulos](https://github.com/fpsom).
-Amazon public AMI for this tutorial is "dataCgen-qc".
+Amazon public AMI for this tutorial is "dataCgen-qc".
## Background
Wrangling genomics trains novice learners on a variant calling workflow. Participants will learn how to evaluate sequence quality and what to do if it is not good. We will then cover aligning reads to a genome, and calling variants, as well as discussing different file formats. Results will be visualized. Finally, we will cover how to automate the process by building a shell script.
-This lesson is part of the [Data Carpentry](http://www.datacarpentry.org/) [Genomics Workshop](http://www.datacarpentry.org/genomics-workshop/).
+This lesson is part of the [Data Carpentry](https://www.datacarpentry.org/) [Genomics Workshop](https://www.datacarpentry.org/genomics-workshop/).
## Contribution
@@ -20,7 +20,7 @@ This lesson is part of the [Data Carpentry](http://www.datacarpentry.org/) [Geno
## Code of Conduct
-All participants should agree to abide by the [Data Carpentry Code of Conduct](http://www.datacarpentry.org/code-of-conduct/).
+All participants should agree to abide by the [Data Carpentry Code of Conduct](https://www.datacarpentry.org/code-of-conduct/).
## Authors
@@ -30,4 +30,6 @@ Wrangling genomics is authored and maintained by the [community](https://github.
Please cite as:
-Erin Alison Becker, Taylor Reiter, Fotis Psomopoulos, Sheldon John McKay, Jessica Elizabeth Mizzi, Jason Williams, … Winni Kretzschmar. (2019, June). datacarpentry/wrangling-genomics: Data Carpentry: Genomics data wrangling and processing, June 2019 (Version v2019.06.1). Zenodo. http://doi.org/10.5281/zenodo.3260609
+Erin Alison Becker, Taylor Reiter, Fotis Psomopoulos, Sheldon John McKay, Jessica Elizabeth Mizzi, Jason Williams, … Winni Kretzschmar. (2019, June). datacarpentry/wrangling-genomics: Data Carpentry: Genomics data wrangling and processing, June 2019 (Version v2019.06.1). Zenodo. [http://doi.org/10.5281/zenodo.3260609](https://doi.org/10.5281/zenodo.3260609)
+
+
diff --git a/config.yaml b/config.yaml
new file mode 100644
index 00000000..ebb05ac9
--- /dev/null
+++ b/config.yaml
@@ -0,0 +1,86 @@
+#------------------------------------------------------------
+# Values for this lesson.
+#------------------------------------------------------------
+
+# Which carpentry is this (swc, dc, lc, or cp)?
+# swc: Software Carpentry
+# dc: Data Carpentry
+# lc: Library Carpentry
+# cp: Carpentries (to use for instructor training for instance)
+# incubator: The Carpentries Incubator
+carpentry: 'dc'
+
+# Overall title for pages.
+title: 'Data Wrangling and Processing for Genomics'
+
+# Date the lesson was created (YYYY-MM-DD, this is empty by default)
+created:
+
+# Comma-separated list of keywords for the lesson
+keywords: 'software, data, lesson, The Carpentries'
+
+# Life cycle stage of the lesson
+# possible values: pre-alpha, alpha, beta, stable
+life_cycle: 'stable'
+
+# License of the lesson materials (recommended CC-BY 4.0)
+license: 'CC-BY 4.0'
+
+# Link to the source repository for this lesson
+source: 'https://github.com/fishtree-attempt/wrangling-genomics/'
+
+# Default branch of your lesson
+branch: 'main'
+
+# Who to contact if there are any issues
+contact: 'team@carpentries.org'
+
+# Navigation ------------------------------------------------
+#
+# Use the following menu items to specify the order of
+# individual pages in each dropdown section. Leave blank to
+# include all pages in the folder.
+#
+# Example -------------
+#
+# episodes:
+# - introduction.md
+# - first-steps.md
+#
+# learners:
+# - setup.md
+#
+# instructors:
+# - instructor-notes.md
+#
+# profiles:
+# - one-learner.md
+# - another-learner.md
+
+# Order of episodes in your lesson
+episodes:
+- 01-background.md
+- 02-quality-control.md
+- 03-trimming.md
+- 04-variant_calling.md
+- 05-automation.md
+
+# Information for Learners
+learners:
+
+# Information for Instructors
+instructors:
+
+# Learner Profiles
+profiles:
+
+# Customisation ---------------------------------------------
+#
+# This space below is where custom yaml items (e.g. pinning
+# sandpaper and varnish versions) should live
+
+
+url: https://preview.carpentries.org/wrangling-genomics
+analytics: carpentries
+lang: en
+workbench-beta: 'true'
diff --git a/episodes/01-background.md b/episodes/01-background.md
index 5576c6c4..bc74905f 100644
--- a/episodes/01-background.md
+++ b/episodes/01-background.md
@@ -1,82 +1,99 @@
---
-title: "Background and Metadata"
+title: Background and Metadata
teaching: 10
exercises: 5
-questions:
-- "What data are we using?"
-- "Why is this experiment important?"
-objectives:
-- "Why study *E. coli*?"
-- "Understand the data set."
-- "What is hypermutability?"
-keypoints:
-- "It is important to record and understand your experiment's metadata."
---
-# Background
+::::::::::::::::::::::::::::::::::::::: objectives
-We are going to use a long-term sequencing dataset from a population of *Escherichia coli*.
+- Why study *E. coli*?
+- Understand the data set.
+- What is hypermutability?
- - **What is *E. coli*?**
- - *E. coli* are rod-shaped bacteria that can survive under a wide variety of conditions including variable temperatures, nutrient availability, and oxygen levels. Most strains are harmless, but some are associated with food-poisoning.
-
- ](../img/172px-EscherichiaColi_NIAID.jpg)
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- What data are we using?
+- Why is this experiment important?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Background
+
+We are going to use a long-term sequencing dataset from a population of *Escherichia coli*.
+
+- **What is *E. coli*?**
+ - *E. coli* are rod-shaped bacteria that can survive under a wide variety of conditions including variable temperatures, nutrient availability, and oxygen levels. Most strains are harmless, but some are associated with food-poisoning.
+
+{alt='Wikimedia'}
- - **Why is *E. coli* important?**
- - *E. coli* are one of the most well-studied model organisms in science. As a single-celled organism, *E. coli* reproduces rapidly, typically doubling its population every 20 minutes, which means it can be manipulated easily in experiments. In addition, most naturally occurring strains of *E. coli* are harmless. Most importantly, the genetics of *E. coli* are fairly well understood and can be manipulated to study adaptation and evolution.
-
-# The data
+- **Why is *E. coli* important?**
+ - *E. coli* are one of the most well-studied model organisms in science. As a single-celled organism, *E. coli* reproduces rapidly, typically doubling its population every 20 minutes, which means it can be manipulated easily in experiments. In addition, most naturally occurring strains of *E. coli* are harmless. Most importantly, the genetics of *E. coli* are fairly well understood and can be manipulated to study adaptation and evolution.
- - The data we are going to use is part of a long-term evolution experiment led by [Richard Lenski](https://en.wikipedia.org/wiki/E._coli_long-term_evolution_experiment).
-
- - The experiment was designed to assess adaptation in *E. coli*. A population was propagated for more than 40,000 generations in a glucose-limited minimal medium (in most conditions glucose is the best carbon source for *E. coli*, providing faster growth than other sugars). This medium was supplemented with citrate, which *E. coli* cannot metabolize in the aerobic conditions of the experiment. Sequencing of the populations at regular time points revealed that spontaneous citrate-using variant (**Cit+**) appeared between 31,000 and 31,500 generations, causing an increase in population size and diversity. In addition, this experiment showed hypermutability in certain regions. Hypermutability is important and can help accelerate adaptation to novel environments, but also can be selected against in well-adapted populations.
-
- - To see a timeline of the experiment to date, check out this [figure](https://en.wikipedia.org/wiki/E._coli_long-term_evolution_experiment#/media/File:LTEE_Timeline_as_of_May_28,_2016.png), and this paper [Blount et al. 2008: Historical contingency and the evolution of a key innovation in an experimental population of *Escherichia coli*](http://www.pnas.org/content/105/23/7899).
-
-
-## View the metadata
+## The data
-We will be working with three sample events from the **Ara-3** strain of this experiment, one from 5,000 generations, one from 15,000 generations, and one from 50,000 generations. The population changed substantially during the course of the experiment, and we will be exploring how (the evolution of a **Cit+** mutant and **hypermutability**) with our variant calling workflow. The metadata file associated with this lesson can be [downloaded directly here](https://raw.githubusercontent.com/datacarpentry/wrangling-genomics/gh-pages/files/Ecoli_metadata_composite.csv) or [viewed in Github](https://github.com/datacarpentry/wrangling-genomics/blob/gh-pages/files/Ecoli_metadata_composite.csv). If you would like to know details of how the file was created, you can look at [some notes and sources here](https://github.com/datacarpentry/wrangling-genomics/blob/gh-pages/files/Ecoli_metadata_composite_README.md).
+- The data we are going to use is part of a long-term evolution experiment led by [Richard Lenski](https://en.wikipedia.org/wiki/E._coli_long-term_evolution_experiment).
+
+- The experiment was designed to assess adaptation in *E. coli*. A population was propagated for more than 40,000 generations in a glucose-limited minimal medium (in most conditions glucose is the best carbon source for *E. coli*, providing faster growth than other sugars). This medium was supplemented with citrate, which *E. coli* cannot metabolize in the aerobic conditions of the experiment. Sequencing of the populations at regular time points revealed that a spontaneous citrate-using variant (**Cit+**) appeared between 31,000 and 31,500 generations, causing an increase in population size and diversity. In addition, this experiment showed hypermutability in certain regions. Hypermutability is important and can help accelerate adaptation to novel environments, but can also be selected against in well-adapted populations.
+- To see a timeline of the experiment to date, check out this [figure](https://en.wikipedia.org/wiki/E._coli_long-term_evolution_experiment#/media/File:LTEE_Timeline_as_of_May_28,_2016.png), and this paper [Blount et al. 2008: Historical contingency and the evolution of a key innovation in an experimental population of *Escherichia coli*](https://www.pnas.org/content/105/23/7899).
+### View the metadata
+
+We will be working with three sample events from the **Ara-3** strain of this experiment, one from 5,000 generations, one from 15,000 generations, and one from 50,000 generations. The population changed substantially during the course of the experiment, and we will be exploring how it changed (the evolution of a **Cit+** mutant and **hypermutability**) with our variant calling workflow. The metadata file associated with this lesson can be [downloaded directly here](https://raw.githubusercontent.com/datacarpentry/wrangling-genomics/gh-pages/files/Ecoli_metadata_composite.csv) or [viewed in GitHub](https://github.com/datacarpentry/wrangling-genomics/blob/gh-pages/files/Ecoli_metadata_composite.csv). If you would like to know details of how the file was created, you can look at [some notes and sources here](https://github.com/datacarpentry/wrangling-genomics/blob/gh-pages/files/Ecoli_metadata_composite_README.md).
This metadata describes information on the *Ara-3* clones and the columns represent:
-| Column | Description |
-|------------------|--------------------------------------------|
-| strain | strain name |
-| generation | generation when sample frozen |
-| clade | based on parsimony-based tree |
-| reference | study the samples were originally sequenced for |
-| population | ancestral population group |
-| mutator | hypermutability mutant status |
-| facility | facility samples were sequenced at |
-| run | Sequence read archive sample ID |
-| read_type | library type of reads |
-| read_length | length of reads in sample |
-| sequencing_depth | depth of sequencing |
-| cit | citrate-using mutant status |
-
-
-> ## Challenge
->
-> Based on the metadata, can you answer the following questions?
->
-> 1. How many different generations exist in the data?
-> 2. How many rows and how many columns are in this data?
-> 3. How many citrate+ mutants have been recorded in **Ara-3**?
-> 4. How many hypermutable mutants have been recorded in **Ara-3**?
->
-> > ## Solution
->>
-> > 1. 25 different generations
-> > 2. 62 rows, 12 columns
-> > 3. 10 citrate+ mutants
-> > 4. 6 hypermutable mutants
-> {: .solution}
-{: .challenge}
+| Column | Description |
+| ---------------- | ----------------------------------------------- |
+| strain | strain name |
+| generation | generation when sample frozen |
+| clade | based on parsimony-based tree |
+| reference | study the samples were originally sequenced for |
+| population | ancestral population group |
+| mutator | hypermutability mutant status |
+| facility | facility samples were sequenced at |
+| run | Sequence read archive sample ID |
+| read\_type | library type of reads |
+| read\_length | length of reads in sample |
+| sequencing\_depth | depth of sequencing |
+| cit | citrate-using mutant status |
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+### Challenge
+
+Based on the metadata, can you answer the following questions?
+
+1. How many different generations exist in the data?
+2. How many rows and how many columns are in this data?
+3. How many citrate+ mutants have been recorded in **Ara-3**?
+4. How many hypermutable mutants have been recorded in **Ara-3**?
+
+::::::::::::::: solution
+
+### Solution
+
+1. 25 different generations
+2. 62 rows, 12 columns
+3. 10 citrate+ mutants
+4. 6 hypermutable mutants
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- It is important to record and understand your experiment's metadata.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/episodes/02-quality-control.md b/episodes/02-quality-control.md
index 974bb8c0..bde152c0 100644
--- a/episodes/02-quality-control.md
+++ b/episodes/02-quality-control.md
@@ -1,19 +1,24 @@
---
-title: "Assessing Read Quality"
+title: Assessing Read Quality
teaching: 30
exercises: 20
-questions:
-- "How can I describe the quality of my data?"
-objectives:
-- "Explain how a FASTQ file encodes per-base quality scores."
-- "Interpret a FastQC plot summarizing per-base quality across all reads."
-- "Use `for` loops to automate operations on multiple files."
-keypoints:
-- "Quality encodings vary across sequencing platforms."
-- "`for` loops let you perform the same set of operations on multiple files with a single command."
---
-# Bioinformatic workflows
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Explain how a FASTQ file encodes per-base quality scores.
+- Interpret a FastQC plot summarizing per-base quality across all reads.
+- Use `for` loops to automate operations on multiple files.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How can I describe the quality of my data?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Bioinformatic workflows
When working with high-throughput sequencing data, the raw reads you get off of the sequencer will need to pass
through a number of different tools in order to generate your final desired output. The execution of this set of
@@ -22,8 +27,7 @@ tools in a specified order is commonly referred to as a *workflow* or a *pipelin
An example of the workflow we will be using for our variant calling analysis is provided below with a brief
description of each step.
-
-
+{alt='workflow'}
1. Quality control - Assessing quality using FastQC
2. Quality control - Trimming and/or filtering reads (if necessary)
@@ -37,7 +41,7 @@ makes this feasible. Standards ensure that data is stored in a way that is gener
within the community. The tools that are used to analyze data at different stages of the workflow are therefore
built under the assumption that the data will be provided in a specific format.
-# Starting with data
+## Starting with data
Often, the first step in a bioinformatic workflow is getting the data you want to work with onto a computer where you can analyze it. If you have outsourced sequencing of your data, the sequencing center will usually provide you with a link that you can use to download your data. Today we will be working with publicly available sequencing data.
@@ -47,12 +51,11 @@ The data are paired-end, so we will download two files for each sample. We will
To download the data, run the commands below.
-
-Here we are using the `-p` option for `mkdir`. This option allows `mkdir` to create the new directory, even if one of the parent directories does not already exist. It also supresses errors if the directory already exists, without overwriting that directory.
-
+Here we are using the `-p` option for `mkdir`. This option allows `mkdir` to create the new directory, even if one of the parent directories does not already exist. It also suppresses errors if the directory already exists, without overwriting that directory.
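+
+As a quick illustration of this behaviour (using a throwaway directory name), you might try:
+
+```bash
+$ mkdir -p demo/nested/dirs   # creates demo/ and demo/nested/ as needed
+$ mkdir -p demo/nested/dirs   # no error the second time, and nothing is overwritten
+```
+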
It will take about 15 minutes to download the files.
-~~~
+
+```bash
mkdir -p ~/dc_workshop/data/untrimmed_fastq/
cd ~/dc_workshop/data/untrimmed_fastq
@@ -62,64 +65,63 @@ curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/003/SRR2584863/SRR2584863_1.fa
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/003/SRR2584863/SRR2584863_2.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR258/006/SRR2584866/SRR2584866_2.fastq.gz
-~~~
-{: .bash}
-
-> ## Faster option
->
-> If your workshop is short on time or the venue's internet connection is weak or unstable, learners can
-> avoid needing to download the data and instead use the data files provided in the `.backup/` directory.
->
-> ~~~
-> $ cp ~/.backup/untrimmed_fastq/*fastq.gz .
-> ~~~
-> {: .bash}
->
-> This command creates a copy of each of the files in the `.backup/untrimmed_fastq/` directory that end in `fastq.gz` and
-> places the copies in the current working directory (signified by `.`).
-{: .callout}
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+### Faster option
+
+If your workshop is short on time or the venue's internet connection is weak or unstable, learners can
+avoid needing to download the data and instead use the data files provided in the `.backup/` directory.
+
+```bash
+$ cp ~/.backup/untrimmed_fastq/*fastq.gz .
+```
+
+This command creates a copy of each of the files in the `.backup/untrimmed_fastq/` directory that end in `fastq.gz` and
+places the copies in the current working directory (signified by `.`).
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
The data comes in a compressed format, which is why there is a `.gz` at the end of the file names. This makes the files faster to transfer and lets them take up less space on our computer. Let's unzip one of the files so that we can look at the FASTQ format.
-~~~
+```bash
$ gunzip SRR2584863_1.fastq.gz
-~~~
-{: .bash}
+```
-# Quality control
+## Quality control
We will now assess the quality of the sequence reads contained in our fastq files.
-
-## Details on the FASTQ format
+{alt='workflow\_qc'}
+
+### Details on the FASTQ format
Although it looks complicated (and it is), we can understand the
[fastq](https://en.wikipedia.org/wiki/FASTQ_format) format with a little decoding. Some rules about the format
include...
-|Line|Description|
-|----|-----------|
-|1|Always begins with '@' and then information about the read|
-|2|The actual DNA sequence|
-|3|Always begins with a '+' and sometimes the same info in line 1|
-|4|Has a string of characters which represent the quality scores; must have same number of characters as line 2|
+| Line | Description |
+| ---- | ------------------------------------------------------------------------------------------------------------ |
+| 1 | Always begins with '@' and then information about the read |
+| 2 | The actual DNA sequence |
+| 3 | Always begins with a '+' and sometimes the same info in line 1 |
+| 4 | Has a string of characters which represent the quality scores; must have same number of characters as line 2 |
We can view the first complete read in one of the files in our dataset by using `head` to look at
the first four lines.
-~~~
+```bash
$ head -n 4 SRR2584863_1.fastq
-~~~
-{: .bash}
+```
-~~~
+```output
@SRR2584863.1 HWI-ST957:244:H73TDADXX:1:1101:4712:2181/1
TTCACATCCTGACCATTCAGTTGAGCAAAATAGTTCTTCAGTGCCTGTTTAACCGAGTCACGCAGGGGTTTTTGGGTTACCTGATCCTGAGAGTTAACGGTAGAAACGGTCAGTACGTCAGAATTTACGCGTTGTTCGAACATAGTTCTG
+
CCCFFFFFGHHHHJIJJJJIJJJIIJJJJIIIJJGFIIIJEDDFEGGJIFHHJIJJDECCGGEGIIJFHFFFACD:BBBDDACCCCAA@@CA@C>C3>@5(8&>C:9?8+89<4(:83825C(:A#########################
-~~~
-{: .output}
+```
Line 4 shows the quality for each nucleotide in the read. Quality is interpreted as the
probability of an incorrect base call (e.g. 1 in 10) or, equivalently, the base call
@@ -128,10 +130,9 @@ score, the numerical score is converted into a code where each individual charac
represents the numerical quality score for an individual nucleotide. For example, in the line
above, the quality score line is:
-~~~
+```output
CCCFFFFFGHHHHJIJJJJIJJJIIJJJJIIIJJGFIIIJEDDFEGGJIFHHJIJJDECCGGEGIIJFHFFFACD:BBBDDACCCCAA@@CA@C>C3>@5(8&>C:9?8+89<4(:83825C(:A#########################
-~~~
-{: .output}
+```
The numerical value assigned to each of these characters depends on the
sequencing platform that generated the reads. The sequencing machine used to generate our data
@@ -139,12 +140,11 @@ uses the standard Sanger quality PHRED score encoding, using Illumina version 1.
Each character is assigned a quality score between 0 and 41 as shown in
the chart below.
-~~~
+```output
Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
| | | | |
Quality score: 01........11........21........31........41
-~~~
-{: .output}
+```
Each quality score represents the probability that the corresponding nucleotide call is
incorrect. This quality score is logarithmically based, so a quality score of 10 reflects a
@@ -154,47 +154,50 @@ much signal was captured for the base incorporation.
Looking back at our read:
-~~~
+```output
@SRR2584863.1 HWI-ST957:244:H73TDADXX:1:1101:4712:2181/1
TTCACATCCTGACCATTCAGTTGAGCAAAATAGTTCTTCAGTGCCTGTTTAACCGAGTCACGCAGGGGTTTTTGGGTTACCTGATCCTGAGAGTTAACGGTAGAAACGGTCAGTACGTCAGAATTTACGCGTTGTTCGAACATAGTTCTG
+
CCCFFFFFGHHHHJIJJJJIJJJIIJJJJIIIJJGFIIIJEDDFEGGJIFHHJIJJDECCGGEGIIJFHFFFACD:BBBDDACCCCAA@@CA@C>C3>@5(8&>C:9?8+89<4(:83825C(:A#########################
-~~~
-{: .output}
+```
we can now see that there is a range of quality scores, but that the end of the sequence is
very poor (`#` = a quality score of 2).
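+
+If you want to convince yourself of how a single quality character decodes to a score, you can subtract the Phred+33 offset from its ASCII value. Here is a small sketch using shell arithmetic with the `#` character we just saw:
+
+```bash
+$ printf '%d\n' "'#"                     # ASCII value of '#', which is 35
+$ echo $(( $(printf '%d' "'#") - 33 ))   # subtract the Phred+33 offset, giving a quality score of 2
+```
+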
-> ## Exercise
->
-> What is the last read in the `SRR2584863_1.fastq ` file? How confident
-> are you in this read?
->
->> ## Solution
->> ~~~
->> $ tail -n 4 SRR2584863_1.fastq
->> ~~~
->> {: .bash}
->>
->> ~~~
->> @SRR2584863.1553259 HWI-ST957:245:H73R4ADXX:2:2216:21048:100894/1
->> CTGCAATACCACGCTGATCTTTCACATGATGTAAGAAAAGTGGGATCAGCAAACCGGGTGCTGCTGTGGCTAGTTGCAGCAAACCATGCAGTGAACCCGCCTGTGCTTCGCTATAGCCGTGACTGATGAGGATCGCCGGAAGCCAGCCAA
->> +
->> CCCFFFFFHHHHGJJJJJJJJJHGIJJJIJJJJIJJJJIIIIJJJJJJJJJJJJJIIJJJHHHHHFFFFFEEEEEDDDDDDDDDDDDDDDDDCDEDDBDBDDBDDDDDDDDDBDEEDDDD7@BDDDDDD>AA>?B?<@BDD@BDC?BDA?
->> ~~~
->> {: .output}
->>
->> This read has more consistent quality at its end than the first
->> read that we looked at, but still has a range of quality scores,
->> most of them high. We will look at variations in position-based quality
->> in just a moment.
->>
-> {: .solution}
-{: .challenge}
-
-At this point, lets validate that all the relevant tools are installed. If you are using the AWS AMI then these _should_ be preinstalled.
-
-~~~
+::::::::::::::::::::::::::::::::::::::: challenge
+
+### Exercise
+
+What is the last read in the `SRR2584863_1.fastq ` file? How confident
+are you in this read?
+
+::::::::::::::: solution
+
+### Solution
+
+```bash
+$ tail -n 4 SRR2584863_1.fastq
+```
+
+```output
+@SRR2584863.1553259 HWI-ST957:245:H73R4ADXX:2:2216:21048:100894/1
+CTGCAATACCACGCTGATCTTTCACATGATGTAAGAAAAGTGGGATCAGCAAACCGGGTGCTGCTGTGGCTAGTTGCAGCAAACCATGCAGTGAACCCGCCTGTGCTTCGCTATAGCCGTGACTGATGAGGATCGCCGGAAGCCAGCCAA
++
+CCCFFFFFHHHHGJJJJJJJJJHGIJJJIJJJJIJJJJIIIIJJJJJJJJJJJJJIIJJJHHHHHFFFFFEEEEEDDDDDDDDDDDDDDDDDCDEDDBDBDDBDDDDDDDDDBDEEDDDD7@BDDDDDD>AA>?B?<@BDD@BDC?BDA?
+```
+
+This read has more consistent quality at its end than the first
+read that we looked at, but still has a range of quality scores,
+most of them high. We will look at variations in position-based quality
+in just a moment.
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+At this point, let's validate that all the relevant tools are installed. If you are using the AWS AMI then these *should* be preinstalled.
+
+```bash
$ fastqc -h
FastQC - A high throughput sequence QC analysis tool
@@ -308,26 +311,24 @@ BUGS
Any bugs in fastqc should be reported either to simon.andrews@babraham.ac.uk
or in www.bioinformatics.babraham.ac.uk/bugzilla/
-~~~
-{: .bash}
+```
if fastqc is not installed then you would expect to see an error like
-~~~
+```
$ fastqc -h
The program 'fastqc' is currently not installed. You can install it by typing:
sudo apt-get install fastqc
-~~~
+```
If this happens check with your instructor before trying to install it.
+### Assessing quality using FastQC
-## Assessing quality using FastQC
-In real life, you will not be assessing the quality of your reads by visually inspecting your
-FASTQ files. Rather, you will be using a software program to assess read quality and
-filter out poor quality reads. We will first use a program called [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to visualize the quality of our reads.
-Later in our workflow, we will use another program to filter out poor quality reads.
-
+In real life, you will not be assessing the quality of your reads by visually inspecting your
+FASTQ files. Rather, you will be using a software program to assess read quality and
+filter out poor quality reads. We will first use a program called [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to visualize the quality of our reads.
+Later in our workflow, we will use another program to filter out poor quality reads.
FastQC has a number of features which can give you a quick impression of any problems your
data may have, so you can take these issues into consideration before moving forward with your
@@ -335,7 +336,7 @@ analyses. Rather than looking at quality scores for each individual read, FastQC
quality collectively across all reads within a sample. The image below shows one FastQC-generated plot that indicates
a very high quality sample:
-
+{alt='good\_quality'}
The x-axis displays the base position in the read, and the y-axis shows quality scores. In this
example, the sample contains reads that are 40 bp long. This is much shorter than the reads we
@@ -352,58 +353,59 @@ acceptable (yellow), and bad (red) quality scores.
Now let's take a look at a quality plot on the other end of the spectrum.
-
+{alt='bad\_quality'}
Here, we see positions within the read in which the boxes span a much wider range. Also, quality scores drop quite low into the "bad" range, particularly on the tail end of the reads. The FastQC tool produces several other diagnostic plots to assess sample quality, in addition to the one plotted above.
-## Running FastQC
+### Running FastQC
We will now assess the quality of the reads that we downloaded. First, make sure you are still in the `untrimmed_fastq` directory
-~~~
+```bash
$ cd ~/dc_workshop/data/untrimmed_fastq/
-~~~
-{: .bash}
-
-> ## Exercise
->
-> How big are the files?
-> (Hint: Look at the options for the `ls` command to see how to show
-> file sizes.)
->
->> ## Solution
->>
->> ~~~
->> $ ls -l -h
->> ~~~
->> {: .bash}
->>
->> ~~~
->> -rw-rw-r-- 1 dcuser dcuser 545M Jul 6 20:27 SRR2584863_1.fastq
->> -rw-rw-r-- 1 dcuser dcuser 183M Jul 6 20:29 SRR2584863_2.fastq.gz
->> -rw-rw-r-- 1 dcuser dcuser 309M Jul 6 20:34 SRR2584866_1.fastq.gz
->> -rw-rw-r-- 1 dcuser dcuser 296M Jul 6 20:37 SRR2584866_2.fastq.gz
->> -rw-rw-r-- 1 dcuser dcuser 124M Jul 6 20:22 SRR2589044_1.fastq.gz
->> -rw-rw-r-- 1 dcuser dcuser 128M Jul 6 20:24 SRR2589044_2.fastq.gz
->> ~~~
->> {: .output}
->>
->> There are six FASTQ files ranging from 124M (124MB) to 545M.
->>
-> {: .solution}
-{: .challenge}
-
-FastQC can accept multiple file names as input, and on both zipped and unzipped files, so we can use the \*.fastq* wildcard to run FastQC on all of the FASTQ files in this directory.
-
-~~~
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+### Exercise
+
+How big are the files?
+(Hint: Look at the options for the `ls` command to see how to show
+file sizes.)
+
+::::::::::::::: solution
+
+### Solution
+
+```bash
+$ ls -l -h
+```
+
+```output
+-rw-rw-r-- 1 dcuser dcuser 545M Jul 6 20:27 SRR2584863_1.fastq
+-rw-rw-r-- 1 dcuser dcuser 183M Jul 6 20:29 SRR2584863_2.fastq.gz
+-rw-rw-r-- 1 dcuser dcuser 309M Jul 6 20:34 SRR2584866_1.fastq.gz
+-rw-rw-r-- 1 dcuser dcuser 296M Jul 6 20:37 SRR2584866_2.fastq.gz
+-rw-rw-r-- 1 dcuser dcuser 124M Jul 6 20:22 SRR2589044_1.fastq.gz
+-rw-rw-r-- 1 dcuser dcuser 128M Jul 6 20:24 SRR2589044_2.fastq.gz
+```
+
+There are six FASTQ files ranging from 124M (124MB) to 545M.
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+FastQC can accept multiple file names as input, and can run on both zipped and unzipped files, so we can use the `*.fastq*` wildcard to run FastQC on all of the FASTQ files in this directory.
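+
+If you would like to confirm which files that wildcard matches before handing them to FastQC, you can preview it with `ls` first:
+
+```bash
+$ ls *.fastq*
+```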
+
+```bash
$ fastqc *.fastq*
-~~~
-{: .bash}
+```
You will see an automatically updating output message telling you the
progress of the analysis. It will start like this:
-~~~
+```output
Started analysis of SRR2584863_1.fastq
Approx 5% complete for SRR2584863_1.fastq
Approx 10% complete for SRR2584863_1.fastq
@@ -414,44 +416,40 @@ Approx 30% complete for SRR2584863_1.fastq
Approx 35% complete for SRR2584863_1.fastq
Approx 40% complete for SRR2584863_1.fastq
Approx 45% complete for SRR2584863_1.fastq
-~~~
-{: .output}
+```
In total, it should take about five minutes for FastQC to run on all
six of our FASTQ files. When the analysis completes, your prompt
will return. So your screen will look something like this:
-~~~
+```output
Approx 80% complete for SRR2589044_2.fastq.gz
Approx 85% complete for SRR2589044_2.fastq.gz
Approx 90% complete for SRR2589044_2.fastq.gz
Approx 95% complete for SRR2589044_2.fastq.gz
Analysis complete for SRR2589044_2.fastq.gz
$
-~~~
-{: .output}
+```
The FastQC program has created several new files within our
`data/untrimmed_fastq/` directory.
-~~~
+```bash
$ ls
-~~~
-{: .bash}
+```
-~~~
+```output
SRR2584863_1.fastq SRR2584866_1_fastqc.html SRR2589044_1_fastqc.html
SRR2584863_1_fastqc.html SRR2584866_1_fastqc.zip SRR2589044_1_fastqc.zip
SRR2584863_1_fastqc.zip SRR2584866_1.fastq.gz SRR2589044_1.fastq.gz
SRR2584863_2_fastqc.html SRR2584866_2_fastqc.html SRR2589044_2_fastqc.html
SRR2584863_2_fastqc.zip SRR2584866_2_fastqc.zip SRR2589044_2_fastqc.zip
SRR2584863_2.fastq.gz SRR2584866_2.fastq.gz SRR2589044_2.fastq.gz
-~~~
-{: .output}
+```
For each input FASTQ file, FastQC has created a `.zip` file and a
-`.html` file. The `.zip` file extension indicates that this is
+`.html` file. The `.zip` file extension indicates that this is
actually a compressed set of multiple output files. We will be working
with these output files soon. The `.html` file is a stable webpage
displaying the summary report for each of our samples.
@@ -460,31 +458,28 @@ We want to keep our data files and our results files separate, so we
will move these
output files into a new directory within our `results/` directory.
-~~~
+```bash
$ mkdir -p ~/dc_workshop/results/fastqc_untrimmed_reads
$ mv *.zip ~/dc_workshop/results/fastqc_untrimmed_reads/
$ mv *.html ~/dc_workshop/results/fastqc_untrimmed_reads/
-~~~
-{: .bash}
+```
Now we can navigate into this results directory and do some closer
inspection of our output files.
-~~~
+```bash
$ cd ~/dc_workshop/results/fastqc_untrimmed_reads/
-~~~
-{: .bash}
+```
-## Viewing the FastQC results
+### Viewing the FastQC results
-
-If we were working on our local computers, we would be able to look at
+If we were working on our local computers, we would be able to look at
each of these HTML files by opening them in a web browser.
-However, these files are currently sitting on our remote AWS
+However, these files are currently sitting on our remote AWS
instance, where our local computer cannot see them.
-And, since we are only logging into the AWS instance via the
-command line - it does not have any web browser setup to display
+And, since we are only logging into the AWS instance via the
+command line, it does not have a web browser set up to display
these files either.
So the easiest way to look at these webpage summary reports will be
@@ -499,34 +494,35 @@ we are transferring. Let's put it on our desktop for now. Open a new
tab in your terminal program (you can use the pull down menu at the
top of your screen or the Cmd+t keyboard shortcut) and type:
-~~~
+```bash
$ mkdir -p ~/Desktop/fastqc_html
-~~~
-{: .bash}
+```
Now we can transfer our HTML files to our local computer using `scp`.
-~~~
+```bash
$ scp dcuser@ec2-34-238-162-94.compute-1.amazonaws.com:~/dc_workshop/results/fastqc_untrimmed_reads/*.html ~/Desktop/fastqc_html
-~~~
-{: .bash}
-
-> ## Note on using zsh
-> If you are using zsh instead of bash (macOS for example changed the default recently to zsh), it is
-> likely that a `no matches found` error will be displayed. The reason for this is that the wildcard
-> ("*") is not correctly interpreted. To fix this problem the wildcard needs to be escaped with a "\\":
-> ~~~
-> $ scp dcuser@ec2-34-238-162-94.compute-1.amazonaws.com:~/dc_workshop/results/fastqc_untrimmed_reads/\*.html ~/Desktop/fastqc_html
-> ~~~
-> {: .bash}
->
-> Alternatively, you can put the whole path into quotation marks:
-> ~~~
-> $ scp "dcuser@ec2-34-238-162-94.compute-1.amazonaws.com:~/dc_workshop/results/fastqc_untrimmed_reads/*.html" ~/Desktop/fastqc_html
-> ~~~
-> {: .bash}
->
-{: .callout}
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+### Note on using zsh
+
+If you are using zsh instead of bash (macOS, for example, recently changed its default shell to zsh), it is
+likely that a `no matches found` error will be displayed. The reason for this is that the wildcard
+("\*") is not correctly interpreted. To fix this problem, the wildcard needs to be escaped with a "\\":
+
+```bash
+$ scp dcuser@ec2-34-238-162-94.compute-1.amazonaws.com:~/dc_workshop/results/fastqc_untrimmed_reads/\*.html ~/Desktop/fastqc_html
+```
+
+Alternatively, you can put the whole path into quotation marks:
+
+```bash
+$ scp "dcuser@ec2-34-238-162-94.compute-1.amazonaws.com:~/dc_workshop/results/fastqc_untrimmed_reads/*.html" ~/Desktop/fastqc_html
+```
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
As a reminder, the first part
of the command `dcuser@ec2-34-238-162-94.compute-1.amazonaws.com` is
@@ -544,15 +540,14 @@ directory we just created `~/Desktop/fastqc_html`.
You should see a status output like this:
-~~~
+```output
SRR2584863_1_fastqc.html 100% 249KB 152.3KB/s 00:01
SRR2584863_2_fastqc.html 100% 254KB 219.8KB/s 00:01
SRR2584866_1_fastqc.html 100% 254KB 271.8KB/s 00:00
SRR2584866_2_fastqc.html 100% 251KB 252.8KB/s 00:00
SRR2589044_1_fastqc.html 100% 249KB 370.1KB/s 00:00
SRR2589044_2_fastqc.html 100% 251KB 592.2KB/s 00:00
-~~~
-{: .output}
+```
Now we can go to our new directory and open the 6 HTML files.
@@ -560,54 +555,61 @@ Depending on your system,
you should be able to select and open them all at once via a right click menu
in your file browser.
-> ## Exercise
->
-> Discuss your results with a neighbor. Which sample(s) looks the best
-> in terms of per base sequence quality? Which sample(s) look the
-> worst?
->
->> ## Solution
->> All of the reads contain usable data, but the quality decreases toward
->> the end of the reads.
-> {: .solution}
-{: .challenge}
-
-## Decoding the other FastQC outputs
-We have now looked at quite a few "Per base sequence quality" FastQC graphs, but there are nine other graphs that we have not talked about! Below we have provided a brief overview of interpretations for each of these plots. For more information, please see the FastQC documentation [here](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/)
-
-
-+ [**Per tile sequence quality**](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/12%20Per%20Tile%20Sequence%20Quality.html): the machines that perform sequencing are divided into tiles. This plot displays patterns in base quality along these tiles. Consistently low scores are often found around the edges, but hot spots can also occur in the middle if an air bubble was introduced at some point during the run.
-+ [**Per sequence quality scores**](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/3%20Per%20Sequence%20Quality%20Scores.html): a density plot of quality for all reads at all positions. This plot shows what quality scores are most common.
-+ [**Per base sequence content**](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/4%20Per%20Base%20Sequence%20Content.html): plots the proportion of each base position over all of the reads. Typically, we expect to see each base roughly 25% of the time at each position, but this often fails at the beginning or end of the read due to quality or adapter content.
-+ [**Per sequence GC content**](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/5%20Per%20Sequence%20GC%20Content.html): a density plot of average GC content in each of the reads.
-+ [**Per base N content**](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/6%20Per%20Base%20N%20Content.html): the percent of times that 'N' occurs at a position in all reads. If there is an increase at a particular position, this might indicate that something went wrong during sequencing.
-+ [**Sequence Length Distribution**](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/7%20Sequence%20Length%20Distribution.html): the distribution of sequence lengths of all reads in the file. If the data is raw, there is often on sharp peak, however if the reads have been trimmed, there may be a distribution of shorter lengths.
-+ [**Sequence Duplication Levels**](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/8%20Duplicate%20Sequences.html): A distribution of duplicated sequences. In sequencing, we expect most reads to only occur once. If some sequences are occurring more than once, it might indicate enrichment bias (e.g. from PCR). If the samples are high coverage (or RNA-seq or amplicon), this might not be true.
-+ [**Overrepresented sequences**](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/9%20Overrepresented%20Sequences.html): A list of sequences that occur more frequently than would be expected by chance.
-+ [**Adapter Content**](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/10%20Adapter%20Content.html): a graph indicating where adapater sequences occur in the reads.
-+ [**K-mer Content**](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/11%20Kmer%20Content.html): a graph showing any sequences which may show a positional bias within the reads.
-
-## Working with the FastQC text output
+::::::::::::::::::::::::::::::::::::::: challenge
+
+### Exercise
+
+Discuss your results with a neighbor. Which sample(s) looks the best
+in terms of per base sequence quality? Which sample(s) look the
+worst?
+
+::::::::::::::: solution
+
+### Solution
+
+All of the reads contain usable data, but the quality decreases toward
+the end of the reads.
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Decoding the other FastQC outputs
+
+We have now looked at quite a few "Per base sequence quality" FastQC graphs, but there are nine other graphs that we have not talked about! Below we have provided a brief overview of interpretations for each of these plots. For more information, please see the [FastQC documentation](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).
+
+- [**Per tile sequence quality**](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/12%20Per%20Tile%20Sequence%20Quality.html): the machines that perform sequencing are divided into tiles. This plot displays patterns in base quality along these tiles. Consistently low scores are often found around the edges, but hot spots can also occur in the middle if an air bubble was introduced at some point during the run.
+- [**Per sequence quality scores**](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/3%20Per%20Sequence%20Quality%20Scores.html): a density plot of quality for all reads at all positions. This plot shows what quality scores are most common.
+- [**Per base sequence content**](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/4%20Per%20Base%20Sequence%20Content.html): plots the proportion of each of the four bases at each position across all of the reads. Typically, we expect to see each base roughly 25% of the time at each position, but this often fails at the beginning or end of the read due to quality or adapter content.
+- [**Per sequence GC content**](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/5%20Per%20Sequence%20GC%20Content.html): a density plot of average GC content in each of the reads.
+- [**Per base N content**](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/6%20Per%20Base%20N%20Content.html): the percent of times that 'N' occurs at a position in all reads. If there is an increase at a particular position, this might indicate that something went wrong during sequencing.
+- [**Sequence Length Distribution**](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/7%20Sequence%20Length%20Distribution.html): the distribution of sequence lengths of all reads in the file. If the data is raw, there is often one sharp peak; however, if the reads have been trimmed, there may be a distribution of shorter lengths.
+- [**Sequence Duplication Levels**](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/8%20Duplicate%20Sequences.html): A distribution of duplicated sequences. In sequencing, we expect most reads to only occur once. If some sequences are occurring more than once, it might indicate enrichment bias (e.g. from PCR). If the samples are high coverage (or RNA-seq or amplicon), this might not be true.
+- [**Overrepresented sequences**](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/9%20Overrepresented%20Sequences.html): A list of sequences that occur more frequently than would be expected by chance.
+- [**Adapter Content**](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/10%20Adapter%20Content.html): a graph indicating where adapter sequences occur in the reads.
+- [**K-mer Content**](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/11%20Kmer%20Content.html): a graph showing any sequences which may show a positional bias within the reads.
+
+### Working with the FastQC text output
Now that we have looked at our HTML reports to get a feel for the data,
let's look more closely at the other output files. Go back to the tab
in your terminal program that is connected to your AWS instance
(the tab label will start with `dcuser@ip`) and make sure you are in
-our results subdirectory.
+our results subdirectory.
-~~~
+```bash
$ cd ~/dc_workshop/results/fastqc_untrimmed_reads/
$ ls
-~~~
-{: .bash}
+```
-~~~
+```output
SRR2584863_1_fastqc.html SRR2584866_1_fastqc.html SRR2589044_1_fastqc.html
SRR2584863_1_fastqc.zip SRR2584866_1_fastqc.zip SRR2589044_1_fastqc.zip
SRR2584863_2_fastqc.html SRR2584866_2_fastqc.html SRR2589044_2_fastqc.html
SRR2584863_2_fastqc.zip SRR2584866_2_fastqc.zip SRR2589044_2_fastqc.zip
-~~~
-{: .output}
+```
Our `.zip` files are compressed files. They each contain multiple
different types of output files for a single input FASTQ file. To
@@ -615,38 +617,35 @@ view the contents of a `.zip` file, we can use the program `unzip`
to decompress these files. Let's try doing them all at once using a
wildcard.
-~~~
+```bash
$ unzip *.zip
-~~~
-{: .bash}
+```
-~~~
+```output
Archive: SRR2584863_1_fastqc.zip
caution: filename not matched: SRR2584863_2_fastqc.zip
caution: filename not matched: SRR2584866_1_fastqc.zip
caution: filename not matched: SRR2584866_2_fastqc.zip
caution: filename not matched: SRR2589044_1_fastqc.zip
caution: filename not matched: SRR2589044_2_fastqc.zip
-~~~
-{: .output}
+```
This did not work. We unzipped the first file and then got a warning
-message for each of the other `.zip` files. This is because `unzip`
-expects to get only one zip file as input. We could go through and
-unzip each file one at a time, but this is very time consuming and
+message for each of the other `.zip` files. This is because `unzip`
+expects to get only one zip file as input. We could go through and
+unzip each file one at a time, but this is very time consuming and
error-prone. Someday you may have 500 files to unzip!
A more efficient way is to use a `for` loop like we learned in the Shell Genomics lesson to iterate through all of
-our `.zip` files. Let's see what that looks like and then we will
+our `.zip` files. Let's see what that looks like and then we will
discuss what we are doing with each line of our loop.
-~~~
+```bash
$ for filename in *.zip
> do
> unzip $filename
> done
-~~~
-{: .bash}
+```
In this example, the input is six filenames (one filename for each of our `.zip` files).
Each time the loop iterates, it will assign a file name to the variable `filename`
@@ -658,10 +657,9 @@ For the second iteration, `$filename` becomes
`SRR2584863_2_fastqc.zip`. This time, the shell runs `unzip` on `SRR2584863_2_fastqc.zip`.
It then repeats this process for the four other `.zip` files in our directory.
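+
+If you would like to preview what the loop will do before running it, one common trick is to put `echo` in front of the command so that each iteration only prints the command it would have run (a sketch):
+
+```bash
+$ for filename in *.zip
+> do
+>   echo "unzip $filename"
+> done
+```
+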
-
When we run our `for` loop, you will see output that starts like this:
-~~~
+```output
Archive: SRR2589044_2_fastqc.zip
creating: SRR2589044_2_fastqc/
creating: SRR2589044_2_fastqc/Icons/
@@ -683,68 +681,61 @@ Archive: SRR2589044_2_fastqc.zip
inflating: SRR2589044_2_fastqc/fastqc_report.html
inflating: SRR2589044_2_fastqc/fastqc_data.txt
inflating: SRR2589044_2_fastqc/fastqc.fo
-~~~
-{: .output}
+```
The `unzip` program is decompressing the `.zip` files and creating
a new directory (with subdirectories) for each of our samples, to
store all of the different output that is produced by FastQC. There
-are a lot of files here. The one we are going to focus on is the
-`summary.txt` file.
-
+are a lot of files here. The one we are going to focus on is the
+`summary.txt` file.
If you list the files in our directory now you will see:
-~~~
+```output
SRR2584863_1_fastqc SRR2584866_1_fastqc SRR2589044_1_fastqc
SRR2584863_1_fastqc.html SRR2584866_1_fastqc.html SRR2589044_1_fastqc.html
SRR2584863_1_fastqc.zip SRR2584866_1_fastqc.zip SRR2589044_1_fastqc.zip
SRR2584863_2_fastqc SRR2584866_2_fastqc SRR2589044_2_fastqc
SRR2584863_2_fastqc.html SRR2584866_2_fastqc.html SRR2589044_2_fastqc.html
SRR2584863_2_fastqc.zip SRR2584866_2_fastqc.zip SRR2589044_2_fastqc.zip
-~~~
+```
The `.html` files and the uncompressed `.zip` files are still present,
-but now we also have a new directory for each of our samples. We can
-see for sure that it is a directory if we use the `-F` flag for `ls`.
+but now we also have a new directory for each of our samples. We can
+see for sure that it is a directory if we use the `-F` flag for `ls`.
-~~~
+```bash
$ ls -F
-~~~
-{: .bash}
+```
-~~~
+```output
SRR2584863_1_fastqc/ SRR2584866_1_fastqc/ SRR2589044_1_fastqc/
SRR2584863_1_fastqc.html SRR2584866_1_fastqc.html SRR2589044_1_fastqc.html
SRR2584863_1_fastqc.zip SRR2584866_1_fastqc.zip SRR2589044_1_fastqc.zip
SRR2584863_2_fastqc/ SRR2584866_2_fastqc/ SRR2589044_2_fastqc/
SRR2584863_2_fastqc.html SRR2584866_2_fastqc.html SRR2589044_2_fastqc.html
SRR2584863_2_fastqc.zip SRR2584866_2_fastqc.zip SRR2589044_2_fastqc.zip
-~~~
-{: .output}
+```
Let's see what files are present within one of these output directories.
-~~~
+```bash
$ ls -F SRR2584863_1_fastqc/
-~~~
-{: .bash}
+```
-~~~
+```output
fastqc_data.txt fastqc.fo fastqc_report.html Icons/ Images/ summary.txt
-~~~
-{: .output}
+```
Use `less` to preview the `summary.txt` file for this sample.
-~~~
+```bash
$ less SRR2584863_1_fastqc/summary.txt
-~~~
-{: .bash}
+```
-~~~
+```output
PASS Basic Statistics SRR2584863_1.fastq
PASS Per base sequence quality SRR2584863_1.fastq
PASS Per tile sequence quality SRR2584863_1.fastq
@@ -756,89 +747,103 @@ PASS Sequence Length Distribution SRR2584863_1.fastq
PASS Sequence Duplication Levels SRR2584863_1.fastq
PASS Overrepresented sequences SRR2584863_1.fastq
WARN Adapter Content SRR2584863_1.fastq
-~~~
-{: .output}
+```
The summary file gives us a list of tests that FastQC ran, and tells
us whether this sample passed, failed, or is borderline (`WARN`). Remember, to quit from `less` you must type `q`.
-## Documenting our work
+### Documenting our work
We can make a record of the results we obtained for all our samples
-by concatenating all of our `summary.txt` files into a single file
+by concatenating all of our `summary.txt` files into a single file
using the `cat` command. We will call this `fastqc_summaries.txt` and move
it to `~/dc_workshop/docs`.
-~~~
+```bash
$ cat */summary.txt > ~/dc_workshop/docs/fastqc_summaries.txt
-~~~
-{: .bash}
-
-> ## Exercise
->
-> Which samples failed at least one of FastQC's quality tests? What
-> test(s) did those samples fail?
->
->> ## Solution
->>
->> We can get the list of all failed tests using `grep`.
->>
->> ~~~
->> $ cd ~/dc_workshop/docs
->> $ grep FAIL fastqc_summaries.txt
->> ~~~
->> {: .bash}
->>
->> ~~~
->> FAIL Per base sequence quality SRR2584863_2.fastq.gz
->> FAIL Per tile sequence quality SRR2584863_2.fastq.gz
->> FAIL Per base sequence content SRR2584863_2.fastq.gz
->> FAIL Per base sequence quality SRR2584866_1.fastq.gz
->> FAIL Per base sequence content SRR2584866_1.fastq.gz
->> FAIL Adapter Content SRR2584866_1.fastq.gz
->> FAIL Adapter Content SRR2584866_2.fastq.gz
->> FAIL Adapter Content SRR2589044_1.fastq.gz
->> FAIL Per base sequence quality SRR2589044_2.fastq.gz
->> FAIL Per tile sequence quality SRR2589044_2.fastq.gz
->> FAIL Per base sequence content SRR2589044_2.fastq.gz
->> FAIL Adapter Content SRR2589044_2.fastq.gz
->> ~~~
->> {: .output}
->>
-> {: .solution}
-{: .challenge}
-
-
-
-# Other notes -- optional
-
-
-> ## Quality encodings vary
->
-> Although we have used a particular quality encoding system to demonstrate interpretation of
-> read quality, different sequencing machines use different encoding systems. This means that,
-> depending on which sequencer you use to generate your data, a `#` may not be an indicator of
-> a poor quality base call.
->
-> This mainly relates to older Solexa/Illumina data,
-> but it is essential that you know which sequencing platform was
-> used to generate your data, so that you can tell your quality control program which encoding
-> to use. If you choose the wrong encoding, you run the risk of throwing away good reads or
-> (even worse) not throwing away bad reads!
-{: .callout}
-
-
-> ## Same symbols, different meanings
->
-> Here we see `>` being used as a shell prompt, whereas `>` is also
-> used to redirect output.
-> Similarly, `$` is used as a shell prompt, but, as we saw earlier,
-> it is also used to ask the shell to get the value of a variable.
->
-> If the *shell* prints `>` or `$` then it expects you to type something,
-> and the symbol is a prompt.
->
-> If *you* type `>` or `$` yourself, it is an instruction from you that
-> the shell should redirect output or get the value of a variable.
-{: .callout}
+```
+
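+As an optional sanity check, you can count how many lines ended up in the combined file; you should see one line per FastQC test per sample:
+
+```bash
+$ wc -l ~/dc_workshop/docs/fastqc_summaries.txt
+```
+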
+::::::::::::::::::::::::::::::::::::::: challenge
+
+### Exercise
+
+Which samples failed at least one of FastQC's quality tests? What
+test(s) did those samples fail?
+
+::::::::::::::: solution
+
+### Solution
+
+We can get the list of all failed tests using `grep`.
+
+```bash
+$ cd ~/dc_workshop/docs
+$ grep FAIL fastqc_summaries.txt
+```
+
+```output
+FAIL Per base sequence quality SRR2584863_2.fastq.gz
+FAIL Per tile sequence quality SRR2584863_2.fastq.gz
+FAIL Per base sequence content SRR2584863_2.fastq.gz
+FAIL Per base sequence quality SRR2584866_1.fastq.gz
+FAIL Per base sequence content SRR2584866_1.fastq.gz
+FAIL Adapter Content SRR2584866_1.fastq.gz
+FAIL Adapter Content SRR2584866_2.fastq.gz
+FAIL Adapter Content SRR2589044_1.fastq.gz
+FAIL Per base sequence quality SRR2589044_2.fastq.gz
+FAIL Per tile sequence quality SRR2589044_2.fastq.gz
+FAIL Per base sequence content SRR2589044_2.fastq.gz
+FAIL Adapter Content SRR2589044_2.fastq.gz
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Other notes -- optional
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+### Quality encodings vary
+
+Although we have used a particular quality encoding system to demonstrate interpretation of
+read quality, different sequencing machines use different encoding systems. This means that,
+depending on which sequencer you use to generate your data, a `#` may not be an indicator of
+a poor quality base call.
+
+This mainly relates to older Solexa/Illumina data,
+but it is essential that you know which sequencing platform was
+used to generate your data, so that you can tell your quality control program which encoding
+to use. If you choose the wrong encoding, you run the risk of throwing away good reads or
+(even worse) not throwing away bad reads!
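+
+For example, Trimmomatic, the trimming tool we will use in the next episode, lets you state the encoding explicitly with `-phred33` or `-phred64`. A single-end sketch with made-up file names might look like this:
+
+```bash
+$ trimmomatic SE -phred33 sample.fastq sample.trim.fastq SLIDINGWINDOW:4:20
+```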
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+### Same symbols, different meanings
+
+Here we see `>` being used as a shell prompt, whereas `>` is also
+used to redirect output.
+Similarly, `$` is used as a shell prompt, but, as we saw earlier,
+it is also used to ask the shell to get the value of a variable.
+
+If the *shell* prints `>` or `$` then it expects you to type something,
+and the symbol is a prompt.
+
+If *you* type `>` or `$` yourself, it is an instruction from you that
+the shell should redirect output or get the value of a variable.
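+
+A small illustration with a made-up variable and file name: in both lines below the leading `$` is the prompt, while the `$` you type in `$sample` asks for the variable's value and the `>` you type redirects the output into a file:
+
+```bash
+$ sample="SRR2584863"
+$ echo $sample > sample_name.txt
+```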
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Quality encodings vary across sequencing platforms.
+- `for` loops let you perform the same set of operations on multiple files with a single command.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/episodes/03-trimming.md b/episodes/03-trimming.md
index 6da47a9e..fca0c98b 100644
--- a/episodes/03-trimming.md
+++ b/episodes/03-trimming.md
@@ -1,19 +1,24 @@
---
-title: "Trimming and Filtering"
+title: Trimming and Filtering
teaching: 30
exercises: 25
-questions:
-- "How can I get rid of sequence data that does not meet my quality standards?"
-objectives:
-- "Clean FASTQ reads using Trimmomatic."
-- "Select and set multiple options for command-line bioinformatic tools."
-- "Write `for` loops with two variables."
-keypoints:
-- "The options you set for the command-line tools you use are important!"
-- "Data cleaning is an essential step in a genomics workflow."
---
-# Cleaning reads
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Clean FASTQ reads using Trimmomatic.
+- Select and set multiple options for command-line bioinformatic tools.
+- Write `for` loops with two variables.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How can I get rid of sequence data that does not meet my quality standards?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Cleaning reads
In the previous episode, we took a high-level look at the quality
of each of our samples using FastQC. We visualized per-base quality
@@ -23,135 +28,130 @@ fail which quality checks. Some of our samples failed quite a few quality metric
though, that our samples should be thrown out! It is very common to have some quality metrics fail, and this may or may not be a problem for your downstream application. For our variant calling workflow, we will be removing some of the low quality sequences to reduce our false positive rate due to sequencing error.
We will use a program called
-[Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic) to
+[Trimmomatic](https://www.usadellab.org/cms/?page=trimmomatic) to
filter poor quality reads and trim poor quality bases from our samples.
## Trimmomatic options
Trimmomatic has a variety of options to trim your reads. If we run the following command, we can see some of our options.
-~~~
+```bash
$ trimmomatic
-~~~
-{: .bash}
+```
Which will give you the following output:
-~~~
+
+```output
Usage:
PE [-version] [-threads ] [-phred33|-phred64] [-trimlog ] [-summary ] [-quiet] [-validatePairs] [-basein | ] [-baseout | ] ...
or:
SE [-version] [-threads ] [-phred33|-phred64] [-trimlog ] [-summary ] [-quiet] ...
or:
-version
-~~~
-{: .output}
+```
This output shows us that we must first specify whether we have paired end (`PE`) or single end (`SE`) reads.
Next, we specify what flag we would like to run. For example, you can specify `threads` to indicate the number of
-processors on your computer that you want Trimmomatic to use. In most cases using multiple threads (processors) can help to run the trimming faster. These flags are not necessary, but they can give you more control over the command. The flags are followed by positional arguments, meaning the order in which you specify them is important.
+processors on your computer that you want Trimmomatic to use. In most cases, using multiple threads (processors) can help the trimming run faster. These flags are not necessary, but they can give you more control over the command. The flags are followed by positional arguments, meaning the order in which you specify them is important.
In paired end mode, Trimmomatic expects the two input files, followed by the names of the output files. These files are described below. In single end mode, Trimmomatic expects one file as input, after which you can enter the optional settings and, lastly, the name of the output file.
-| option | meaning |
-| ------- | ---------- |
-| \ | Input reads to be trimmed. Typically the file name will contain an `_1` or `_R1` in the name.|
-| \ | Input reads to be trimmed. Typically the file name will contain an `_2` or `_R2` in the name.|
-| \ | Output file that contains surviving pairs from the `_1` file. |
-| \ | Output file that contains orphaned reads from the `_1` file. |
-| \ | Output file that contains surviving pairs from the `_2` file.|
-| \ | Output file that contains orphaned reads from the `_2` file.|
+| option | meaning |
+| -------------- | ------------------------------------------------------------------------------------------------------------ |
+| \ | Input reads to be trimmed. Typically the file name will contain an `_1` or `_R1` in the name. |
+| \ | Input reads to be trimmed. Typically the file name will contain an `_2` or `_R2` in the name. |
+| \ | Output file that contains surviving pairs from the `_1` file. |
+| \ | Output file that contains orphaned reads from the `_1` file. |
+| \ | Output file that contains surviving pairs from the `_2` file. |
+| \ | Output file that contains orphaned reads from the `_2` file. |
The last thing trimmomatic expects to see is the trimming parameters:
-| step | meaning |
-| ------- | ---------- |
-| `ILLUMINACLIP` | Perform adapter removal. |
-| `SLIDINGWINDOW` | Perform sliding window trimming, cutting once the average quality within the window falls below a threshold. |
-| `LEADING` | Cut bases off the start of a read, if below a threshold quality. |
-| `TRAILING` | Cut bases off the end of a read, if below a threshold quality. |
-| `CROP` | Cut the read to a specified length. |
-| `HEADCROP` | Cut the specified number of bases from the start of the read. |
-| `MINLEN` | Drop an entire read if it is below a specified length. |
-| `TOPHRED33` | Convert quality scores to Phred-33. |
-| `TOPHRED64` | Convert quality scores to Phred-64. |
+| step | meaning |
+| -------------- | ------------------------------------------------------------------------------------------------------------ |
+| `ILLUMINACLIP` | Perform adapter removal. |
+| `SLIDINGWINDOW` | Perform sliding window trimming, cutting once the average quality within the window falls below a threshold. |
+| `LEADING` | Cut bases off the start of a read, if below a threshold quality. |
+| `TRAILING` | Cut bases off the end of a read, if below a threshold quality. |
+| `CROP` | Cut the read to a specified length. |
+| `HEADCROP` | Cut the specified number of bases from the start of the read. |
+| `MINLEN` | Drop an entire read if it is below a specified length. |
+| `TOPHRED33` | Convert quality scores to Phred-33. |
+| `TOPHRED64` | Convert quality scores to Phred-64. |
We will use only a few of these options and trimming steps in our
analysis. It is important to understand the steps you are using to
clean your data. For more information about the Trimmomatic arguments
-and options, see [the Trimmomatic manual](http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf).
+and options, see [the Trimmomatic manual](https://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf).
However, a complete command for Trimmomatic will look something like the command below. This command is an example and will not work, as we do not have the files it refers to:
-~~~
+```bash
$ trimmomatic PE -threads 4 SRR_1056_1.fastq SRR_1056_2.fastq \
SRR_1056_1.trimmed.fastq SRR_1056_1un.trimmed.fastq \
SRR_1056_2.trimmed.fastq SRR_1056_2un.trimmed.fastq \
ILLUMINACLIP:SRR_adapters.fa SLIDINGWINDOW:4:20
-~~~
-{: .bash}
+```
In this example, we have told Trimmomatic:
-| code | meaning |
-| ------- | ---------- |
-| `PE` | that it will be taking a paired end file as input |
-| `-threads 4` | to use four computing threads to run (this will speed up our run) |
-| `SRR_1056_1.fastq` | the first input file name |
-| `SRR_1056_2.fastq` | the second input file name |
-| `SRR_1056_1.trimmed.fastq` | the output file for surviving pairs from the `_1` file |
-| `SRR_1056_1un.trimmed.fastq` | the output file for orphaned reads from the `_1` file |
-| `SRR_1056_2.trimmed.fastq` | the output file for surviving pairs from the `_2` file |
-| `SRR_1056_2un.trimmed.fastq` | the output file for orphaned reads from the `_2` file |
-| `ILLUMINACLIP:SRR_adapters.fa`| to clip the Illumina adapters from the input file using the adapter sequences listed in `SRR_adapters.fa` |
-|`SLIDINGWINDOW:4:20` | to use a sliding window of size 4 that will remove bases if their phred score is below 20 |
+| code | meaning |
+| -------------- | ------------------------------------------------------------------------------------------------------------ |
+| `PE` | that it will be taking a paired end file as input |
+| `-threads 4` | to use four computing threads to run (this will speed up our run) |
+| `SRR_1056_1.fastq` | the first input file name |
+| `SRR_1056_2.fastq` | the second input file name |
+| `SRR_1056_1.trimmed.fastq` | the output file for surviving pairs from the `_1` file |
+| `SRR_1056_1un.trimmed.fastq` | the output file for orphaned reads from the `_1` file |
+| `SRR_1056_2.trimmed.fastq` | the output file for surviving pairs from the `_2` file |
+| `SRR_1056_2un.trimmed.fastq` | the output file for orphaned reads from the `_2` file |
+| `ILLUMINACLIP:SRR_adapters.fa` | to clip the Illumina adapters from the input file using the adapter sequences listed in `SRR_adapters.fa` |
+| `SLIDINGWINDOW:4:20` | to use a sliding window of size 4 that will remove bases if their phred score is below 20 |
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Multi-line commands
+
-> ## Multi-line commands
-> Some of the commands we ran in this lesson are long! When typing a long
-> command into your terminal, you can use the `\` character
-> to separate code chunks onto separate lines. This can make your code more readable.
-{: .callout}
+Some of the commands we ran in this lesson are long! When typing a long
+command into your terminal, you can use the `\` character
+to separate code chunks onto separate lines. This can make your code more readable.
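+
+A tiny illustration: the two commands below are equivalent, and after the `\` the shell prints `>` to show that it is still waiting for the rest of the command:
+
+```bash
+$ echo "quality control" "matters"
+$ echo "quality control" \
+>      "matters"
+```
+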
+::::::::::::::::::::::::::::::::::::::::::::::::::
## Running Trimmomatic
Now we will run Trimmomatic on our data. To begin, navigate to your `untrimmed_fastq` data directory:
-~~~
+```bash
$ cd ~/dc_workshop/data/untrimmed_fastq
-~~~
-{: .bash}
+```
-We are going to run Trimmomatic on one of our paired-end samples.
-While using FastQC we saw that Nextera adapters were present in our samples.
+We are going to run Trimmomatic on one of our paired-end samples.
+While using FastQC we saw that Nextera adapters were present in our samples.
The adapter sequences came with the installation of trimmomatic, so we will first copy these sequences into our current directory.
-~~~
+```bash
$ cp ~/.miniconda3/pkgs/trimmomatic-0.38-0/share/trimmomatic-0.38-0/adapters/NexteraPE-PE.fa .
-~~~
-{: .bash}
+```
We will also use a sliding window of size 4 that will remove bases if their
phred score is below 20 (like in our example above). We will also
discard any reads that do not have at least 25 bases remaining after
this trimming step. Three additional pieces of code are also added to the end
of the ILLUMINACLIP step. These three additional numbers (2:40:15) tell
-Trimmimatic how to handle sequence matches to the Nextera adapters. A detailed
+Trimmomatic how to handle sequence matches to the Nextera adapters. A detailed
explanation of how they work is beyond the scope of this lesson. For now we
will use these numbers as a default and recognize they are needed for Trimmomatic
to run properly. This command will take a few minutes to run.
-~~~
+```bash
$ trimmomatic PE SRR2589044_1.fastq.gz SRR2589044_2.fastq.gz \
SRR2589044_1.trim.fastq.gz SRR2589044_1un.trim.fastq.gz \
SRR2589044_2.trim.fastq.gz SRR2589044_2un.trim.fastq.gz \
SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
-~~~
-{: .bash}
+```
-
-~~~
+```output
TrimmomaticPE: Started with arguments:
SRR2589044_1.fastq.gz SRR2589044_2.fastq.gz SRR2589044_1.trim.fastq.gz SRR2589044_1un.trim.fastq.gz SRR2589044_2.trim.fastq.gz SRR2589044_2un.trim.fastq.gz SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
Multiple cores found: Using 2 threads
@@ -164,22 +164,30 @@ ILLUMINACLIP: Using 1 prefix pairs, 4 forward/reverse sequences, 0 forward only
Quality encoding detected as phred33
Input Read Pairs: 1107090 Both Surviving: 885220 (79.96%) Forward Only Surviving: 216472 (19.55%) Reverse Only Surviving: 2850 (0.26%) Dropped: 2548 (0.23%)
TrimmomaticPE: Completed successfully
-~~~
-{: .output}
-
-> ## Exercise
->
-> Use the output from your Trimmomatic command to answer the
-> following questions.
->
-> 1) What percent of reads did we discard from our sample?
-> 2) What percent of reads did we keep both pairs?
->
->> ## Solution
->> 1) 0.23%
->> 2) 79.96%
-> {: .solution}
-{: .challenge}
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise
+
+Use the output from your Trimmomatic command to answer the
+following questions.
+
+1) What percent of reads did we discard from our sample?
+2) For what percent of reads did we keep both pairs?
+
+::::::::::::::: solution
+
+## Solution
+
+1) 0\.23%
+2) 79\.96%
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
You may have noticed that Trimmomatic automatically detected the
quality encoding of our sample. It is always a good idea to
@@ -187,50 +195,44 @@ double-check this or to enter the quality encoding manually.
We can confirm that we have our output files:
-~~~
+```bash
$ ls SRR2589044*
-~~~
-{: .bash}
+```
-~~~
+```output
SRR2589044_1.fastq.gz SRR2589044_1un.trim.fastq.gz SRR2589044_2.trim.fastq.gz
SRR2589044_1.trim.fastq.gz SRR2589044_2.fastq.gz SRR2589044_2un.trim.fastq.gz
-~~~
-{: .output}
+```
The output files are also FASTQ files. They should be smaller than our
input files, because we have removed reads. We can confirm this:
-~~~
+```bash
$ ls SRR2589044* -l -h
-~~~
-{: .bash}
+```
-~~~
+```output
-rw-rw-r-- 1 dcuser dcuser 124M Jul 6 20:22 SRR2589044_1.fastq.gz
-rw-rw-r-- 1 dcuser dcuser 94M Jul 6 22:33 SRR2589044_1.trim.fastq.gz
-rw-rw-r-- 1 dcuser dcuser 18M Jul 6 22:33 SRR2589044_1un.trim.fastq.gz
-rw-rw-r-- 1 dcuser dcuser 128M Jul 6 20:24 SRR2589044_2.fastq.gz
-rw-rw-r-- 1 dcuser dcuser 91M Jul 6 22:33 SRR2589044_2.trim.fastq.gz
-rw-rw-r-- 1 dcuser dcuser 271K Jul 6 22:33 SRR2589044_2un.trim.fastq.gz
-~~~
-{: .output}
-
+```
We have just successfully run Trimmomatic on one of our FASTQ files!
However, there is some bad news. Trimmomatic can only operate on
one sample at a time and we have more than one sample. The good news
is that we can use a `for` loop to iterate through our sample files
-quickly!
+quickly!
We unzipped one of our files earlier to work with it; let's compress it again before we run our for loop.
-~~~
+```bash
gzip SRR2584863_1.fastq
-~~~
-{: .bash}
+```
-~~~
+```bash
$ for infile in *_1.fastq.gz
> do
> base=$(basename ${infile} _1.fastq.gz)
@@ -239,20 +241,17 @@ $ for infile in *_1.fastq.gz
> ${base}_2.trim.fastq.gz ${base}_2un.trim.fastq.gz \
> SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
> done
-~~~
-{: .bash}
-
+```
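+
+If the `basename` line is new to you: given a file name and a suffix, `basename` strips any leading directories and removes the suffix, which is how we recover the bare sample name. For example, with one of our real file names:
+
+```bash
+$ basename SRR2589044_1.fastq.gz _1.fastq.gz
+```
+
+```output
+SRR2589044
+```
+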
Go ahead and run the for loop. It should take a few minutes for
Trimmomatic to run for each of our six input files. Once it is done
running, take a look at your directory contents. You will notice that even though we ran Trimmomatic on file `SRR2589044` before running the for loop, there is only one set of files for it. Because we matched the ending `_1.fastq.gz`, we re-ran Trimmomatic on this file, overwriting our first results. That is ok, but it is good to be aware that it happened.
-~~~
+```bash
$ ls
-~~~
-{: .bash}
+```
-~~~
+```output
NexteraPE-PE.fa SRR2584866_1.fastq.gz SRR2589044_1.trim.fastq.gz
SRR2584863_1.fastq.gz SRR2584866_1.trim.fastq.gz SRR2589044_1un.trim.fastq.gz
SRR2584863_1.trim.fastq.gz SRR2584866_1un.trim.fastq.gz SRR2589044_2.fastq.gz
@@ -260,88 +259,104 @@ SRR2584863_1un.trim.fastq.gz SRR2584866_2.fastq.gz SRR2589044_2.trim.fa
SRR2584863_2.fastq.gz SRR2584866_2.trim.fastq.gz SRR2589044_2un.trim.fastq.gz
SRR2584863_2.trim.fastq.gz SRR2584866_2un.trim.fastq.gz
SRR2584863_2un.trim.fastq.gz SRR2589044_1.fastq.gz
-~~~
-{: .output}
-
-> ## Exercise
-> We trimmed our fastq files with Nextera adapters,
-> but there are other adapters that are commonly used.
-> What other adapter files came with Trimmomatic?
->
->
->> ## Solution
->> ~~~
->> $ ls ~/miniconda3/pkgs/trimmomatic-0.38-0/share/trimmomatic-0.38-0/adapters/
->> ~~~
->> {: .bash}
->>
->> ~~~
->> NexteraPE-PE.fa TruSeq2-SE.fa TruSeq3-PE.fa
->> TruSeq2-PE.fa TruSeq3-PE-2.fa TruSeq3-SE.fa
->> ~~~
->> {: .output}
->>
-> {: .solution}
-{: .challenge}
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise
+
+We trimmed our fastq files with Nextera adapters,
+but there are other adapters that are commonly used.
+What other adapter files came with Trimmomatic?
+
+::::::::::::::: solution
+
+## Solution
+
+```bash
+$ ls ~/miniconda3/pkgs/trimmomatic-0.38-0/share/trimmomatic-0.38-0/adapters/
+```
+
+```output
+NexteraPE-PE.fa TruSeq2-SE.fa TruSeq3-PE.fa
+TruSeq2-PE.fa TruSeq3-PE-2.fa TruSeq3-SE.fa
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
We have now completed the trimming and filtering steps of our quality
control process! Before we move on, let's move our trimmed FASTQ files
to a new subdirectory within our `data/` directory.
-~~~
+```bash
$ cd ~/dc_workshop/data/untrimmed_fastq
$ mkdir ../trimmed_fastq
$ mv *.trim* ../trimmed_fastq
$ cd ../trimmed_fastq
$ ls
-~~~
-{: .bash}
+```
-~~~
+```output
SRR2584863_1.trim.fastq.gz SRR2584866_1.trim.fastq.gz SRR2589044_1.trim.fastq.gz
SRR2584863_1un.trim.fastq.gz SRR2584866_1un.trim.fastq.gz SRR2589044_1un.trim.fastq.gz
SRR2584863_2.trim.fastq.gz SRR2584866_2.trim.fastq.gz SRR2589044_2.trim.fastq.gz
SRR2584863_2un.trim.fastq.gz SRR2584866_2un.trim.fastq.gz SRR2589044_2un.trim.fastq.gz
-~~~
-{: .output}
-
-> ## Bonus exercise (advanced)
->
-> Now that our samples have gone through quality control, they should perform
-> better on the quality tests run by FastQC. Go ahead and re-run
-> FastQC on your trimmed FASTQ files and visualize the HTML files
-> to see whether your per base sequence quality is higher after
-> trimming.
->
->> ## Solution
->>
->> In your AWS terminal window do:
->>
->> ~~~
->> $ fastqc ~/dc_workshop/data/trimmed_fastq/*.fastq*
->> ~~~
->> {: .bash}
->>
->> In a new tab in your terminal do:
->>
->> ~~~
->> $ mkdir ~/Desktop/fastqc_html/trimmed
->> $ scp dcuser@ec2-34-203-203-131.compute-1.amazonaws.com:~/dc_workshop/data/trimmed_fastq/*.html ~/Desktop/fastqc_html/trimmed
->> ~~~
->> {: .bash}
->>
->> Then take a look at the html files in your browser.
->>
->> Remember to replace everything between the `@` and `:` in your scp
->> command with your AWS instance number.
->>
->> After trimming and filtering, our overall quality is much higher,
->> we have a distribution of sequence lengths, and more samples pass
->> adapter content. However, quality trimming is not perfect, and some
->> programs are better at removing some sequences than others. Because our
->> sequences still contain 3' adapters, it could be important to explore
->> other trimming tools like [cutadapt](http://cutadapt.readthedocs.io/en/stable/) to remove these, depending on your
->> downstream application. Trimmomatic did pretty well though, and its performance
->> is good enough for our workflow.
-> {: .solution}
-{: .challenge}
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Bonus exercise (advanced)
+
+Now that our samples have gone through quality control, they should perform
+better on the quality tests run by FastQC. Go ahead and re-run
+FastQC on your trimmed FASTQ files and visualize the HTML files
+to see whether your per base sequence quality is higher after
+trimming.
+
+::::::::::::::: solution
+
+## Solution
+
+In your AWS terminal window do:
+
+```bash
+$ fastqc ~/dc_workshop/data/trimmed_fastq/*.fastq*
+```
+
+In a new tab in your terminal do:
+
+```bash
+$ mkdir ~/Desktop/fastqc_html/trimmed
+$ scp dcuser@ec2-34-203-203-131.compute-1.amazonaws.com:~/dc_workshop/data/trimmed_fastq/*.html ~/Desktop/fastqc_html/trimmed
+```
+
+Then take a look at the html files in your browser.
+
+Remember to replace everything between the `@` and `:` in your scp
+command with your AWS instance number.
+
+After trimming and filtering, our overall quality is much higher,
+we have a distribution of sequence lengths, and more samples pass
+adapter content. However, quality trimming is not perfect, and some
+programs are better at removing some sequences than others. Because our
+sequences still contain 3' adapters, it could be important to explore
+other trimming tools like [cutadapt](https://cutadapt.readthedocs.io/en/stable/) to remove these, depending on your
+downstream application. Trimmomatic did pretty well though, and its performance
+is good enough for our workflow.
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- The options you set for the command-line tools you use are important!
+- Data cleaning is an essential step in a genomics workflow.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/episodes/04-variant_calling.md b/episodes/04-variant_calling.md
index 1d072787..3b5b07d7 100644
--- a/episodes/04-variant_calling.md
+++ b/episodes/04-variant_calling.md
@@ -1,96 +1,102 @@
---
-title: "Variant Calling Workflow"
+title: Variant Calling Workflow
teaching: 35
exercises: 25
-questions:
-- "How do I find sequence variants between my sample and a reference genome?"
-objectives:
-- "Understand the steps involved in variant calling."
-- "Describe the types of data formats encountered during variant calling."
-- "Use command line tools to perform variant calling."
-keypoints:
-- "Bioinformatic command line tools are collections of commands that can be used to carry out bioinformatic analyses."
-- "To use most powerful bioinformatic tools, you will need to use the command line."
-- "There are many different file formats for storing genomics data. It is important to understand what type of information is contained in each file, and how it was derived."
---
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Understand the steps involved in variant calling.
+- Describe the types of data formats encountered during variant calling.
+- Use command line tools to perform variant calling.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How do I find sequence variants between my sample and a reference genome?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
We mentioned before that we are working with files from a long-term evolution study of an *E. coli* population (designated Ara-3). Now that we have looked at our data to make sure that it is high quality, and removed low-quality base calls, we can perform variant calling to see how the population changed over time. We care how this population changed relative to the original population, *E. coli* strain REL606. Therefore, we will align each of our samples to the *E. coli* REL606 reference genome, and see what differences exist in our reads versus the genome.
-# Alignment to a reference genome
+## Alignment to a reference genome
-
+{alt='workflow\_align'}
We perform read alignment or mapping to determine where in the genome our reads originated from. There are a number of tools to
choose from and, while there is no gold standard, there are some tools that are better suited for particular NGS analyses. We will be
-using the [Burrows Wheeler Aligner (BWA)](http://bio-bwa.sourceforge.net/), which is a software package for mapping low-divergent
-sequences against a large reference genome.
+using the [Burrows Wheeler Aligner (BWA)](https://bio-bwa.sourceforge.net/), which is a software package for mapping low-divergent
+sequences against a large reference genome.
The alignment process consists of two steps:
1. Indexing the reference genome
2. Aligning the reads to the reference genome
+## Setting up
-# Setting up
-
-First we download the reference genome for *E. coli* REL606. Although we could copy or move the file with `cp` or `mv`, most genomics workflows begin with a download step, so we will practice that here.
+First we download the reference genome for *E. coli* REL606. Although we could copy or move the file with `cp` or `mv`, most genomics workflows begin with a download step, so we will practice that here.
-~~~
+```bash
$ cd ~/dc_workshop
$ mkdir -p data/ref_genome
$ curl -L -o data/ref_genome/ecoli_rel606.fasta.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/017/985/GCA_000017985.1_ASM1798v1/GCA_000017985.1_ASM1798v1_genomic.fna.gz
$ gunzip data/ref_genome/ecoli_rel606.fasta.gz
-~~~
-{: .bash}
-
-> ## Exercise
->
-> We saved this file as `data/ref_genome/ecoli_rel606.fasta.gz` and then decompressed it.
-> What is the real name of the genome?
->
->> ## Solution
->>
->> ~~~
->> $ head data/ref_genome/ecoli_rel606.fasta
->> ~~~
->> {: .bash}
->>
->> The name of the sequence follows the `>` character. The name is `CP000819.1 Escherichia coli B str. REL606, complete genome`.
->> Keep this chromosome name (`CP000819.1`) in mind, as we will use it later in the lesson.
-> {: .solution}
-{: .challenge}
-
-We will also download a set of trimmed FASTQ files to work with. These are small subsets of our real trimmed data,
-and will enable us to run our variant calling workflow quite quickly.
-
-~~~
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+### Exercise
+
+We saved this file as `data/ref_genome/ecoli_rel606.fasta.gz` and then decompressed it.
+What is the real name of the genome?
+
+::::::::::::::: solution
+
+### Solution
+
+```bash
+$ head data/ref_genome/ecoli_rel606.fasta
+```
+
+The name of the sequence follows the `>` character. The name is `CP000819.1 Escherichia coli B str. REL606, complete genome`.
+Keep this chromosome name (`CP000819.1`) in mind, as we will use it later in the lesson.
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+We will also download a set of trimmed FASTQ files to work with. These are small subsets of our real trimmed data,
+and will enable us to run our variant calling workflow quite quickly.
+
+```bash
$ curl -L -o sub.tar.gz https://ndownloader.figshare.com/files/14418248
$ tar xvf sub.tar.gz
$ mv sub/ ~/dc_workshop/data/trimmed_fastq_small
-~~~
-{: .bash}
+```
You will also need to create directories for the results that will be generated as part of this workflow. We can do this in a single
line of code, because `mkdir` can accept multiple new directory
names as input.
-~~~
+```bash
$ mkdir -p results/sam results/bam results/bcf results/vcf
-~~~
-{: .bash}
+```
+#### Index the reference genome
-### Index the reference genome
Our first step is to index the reference genome for use by BWA. Indexing allows the aligner to quickly find potential alignment sites for query sequences in a genome, which saves time during alignment. Indexing the reference only has to be run once. The only reason you would want to create a new index is if you are working with a different reference genome or you are using a different tool for alignment.
-~~~
+```bash
$ bwa index data/ref_genome/ecoli_rel606.fasta
-~~~
-{: .bash}
+```
While the index is created, you will see output that looks something like this:
-~~~
+```output
[bwa_index] Pack FASTA... 0.04 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 1.05 seconds elapse.
@@ -100,38 +106,35 @@ While the index is created, you will see output that looks something like this:
[main] Version: 0.7.17-r1188
[main] CMD: bwa index data/ref_genome/ecoli_rel606.fasta
[main] Real time: 1.765 sec; CPU: 1.715 sec
-~~~
-{: .output}
+```
-### Align reads to reference genome
+#### Align reads to reference genome
-The alignment process consists of choosing an appropriate reference genome to map our reads against and then deciding on an
-aligner. We will use the BWA-MEM algorithm, which is the latest and is generally recommended for high-quality queries as it
+The alignment process consists of choosing an appropriate reference genome to map our reads against and then deciding on an
+aligner. We will use the BWA-MEM algorithm, which is the latest and is generally recommended for high-quality queries as it
is faster and more accurate.
An example of what a `bwa` command looks like is below. This command will not run, as we do not have the files `ref_genome.fa`, `input_file_R1.fastq`, or `input_file_R2.fastq`.
-~~~
+```bash
$ bwa mem ref_genome.fasta input_file_R1.fastq input_file_R2.fastq > output.sam
-~~~
-{: .bash}
+```
-Have a look at the [bwa options page](http://bio-bwa.sourceforge.net/bwa.shtml). While we are running bwa with the default
-parameters here, your use case might require a change of parameters. *NOTE: Always read the manual page for any tool before using
+Have a look at the [bwa options page](https://bio-bwa.sourceforge.net/bwa.shtml). While we are running bwa with the default
+parameters here, your use case might require a change of parameters. *NOTE: Always read the manual page for any tool before using it,
and make sure the options you use are appropriate for your data.*
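+Before changing any parameters, you can remind yourself what is available: running `bwa` with no arguments prints a usage message listing its sub-commands, and `bwa mem` with no arguments prints the options and default values for the algorithm we are about to use. This is just a quick, optional check:
+
+```bash
+$ bwa        # prints the list of sub-commands (index, mem, aln, ...)
+$ bwa mem    # prints the options and defaults for the BWA-MEM algorithm
+```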
-We are going to start by aligning the reads from just one of the
-samples in our dataset (`SRR2584866`). Later, we will be
+We are going to start by aligning the reads from just one of the
+samples in our dataset (`SRR2584866`). Later, we will be
iterating this whole process on all of our sample files.
-~~~
+```bash
$ bwa mem data/ref_genome/ecoli_rel606.fasta data/trimmed_fastq_small/SRR2584866_1.trim.sub.fastq data/trimmed_fastq_small/SRR2584866_2.trim.sub.fastq > results/sam/SRR2584866.aligned.sam
-~~~
-{: .bash}
+```
-You will see output that starts like this:
+You will see output that starts like this:
-~~~
+```output
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 77446 sequences (10000033 bp)...
[M::process] read 77296 sequences (10000182 bp)...
@@ -142,70 +145,62 @@ You will see output that starts like this:
[M::mem_pestat] mean and std.dev: (784.68, 700.87)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 5836)
[M::mem_pestat] analyzing insert size distribution for orientation FR...
-~~~
-{: .output}
+```
+##### SAM/BAM format
-#### SAM/BAM format
The [SAM file](https://genome.sph.umich.edu/wiki/SAM)
-is a tab-delimited text file that contains information for each individual read and its alignment to the genome. While we do not
-have time to go into detail about the features of the SAM format, the paper by
-[Heng Li et al.](http://bioinformatics.oxfordjournals.org/content/25/16/2078.full) provides a lot more detail on the specification.
+is a tab-delimited text file that contains information for each individual read and its alignment to the genome. While we do not
+have time to go into detail about the features of the SAM format, the paper by
+[Heng Li et al.](https://bioinformatics.oxfordjournals.org/content/25/16/2078.full) provides a lot more detail on the specification.
**The compressed binary version of SAM is called a BAM file.** We use this version to reduce size and to allow for *indexing*, which enables efficient random access of the data contained within the file.
The file begins with a **header**, which is optional. The header is used to describe the source of data, reference sequence, method of
alignment, etc.; this will change depending on the aligner being used. Following the header is the **alignment section**. Each line
that follows corresponds to alignment information for a single read. Each alignment line has **11 mandatory fields** for essential
-mapping information and a variable number of other fields for aligner specific information. An example entry from a SAM file is
+mapping information and a variable number of other fields for aligner specific information. An example entry from a SAM file is
displayed below with the different fields highlighted.
-
-
+{alt='sam\_bam1'}
-
+{alt='sam\_bam2'}
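+If you would like to see this in your own data, you can peek at the SAM file we created above with standard shell tools (a sketch; any alignment line will do):
+
+```bash
+$ head -n 3 results/sam/SRR2584866.aligned.sam                  # the first few header lines (they start with @)
+$ grep -v "^@" results/sam/SRR2584866.aligned.sam | head -n 1   # the first alignment line, with its 11+ tab-separated fields
+```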
-We will convert the SAM file to BAM format using the `samtools` program with the `view` command and tell this command that the input is in SAM format (`-S`) and to output BAM format (`-b`):
+We will convert the SAM file to BAM format using the `samtools` program with the `view` command and tell this command that the input is in SAM format (`-S`) and to output BAM format (`-b`):
-~~~
+```bash
$ samtools view -S -b results/sam/SRR2584866.aligned.sam > results/bam/SRR2584866.aligned.bam
-~~~
-{: .bash}
+```
-~~~
+```output
[samopen] SAM header is present: 1 sequences.
-~~~
-{: .output}
-
+```
-### Sort BAM file by coordinates
+#### Sort BAM file by coordinates
Next we sort the BAM file using the `sort` command from `samtools`. `-o` tells the command where to write the output.
-~~~
+```bash
$ samtools sort -o results/bam/SRR2584866.aligned.sorted.bam results/bam/SRR2584866.aligned.bam
-~~~
-{: .bash}
+```
Our files are pretty small, so we will not see this output. If you run the workflow with larger files, you will see something like this:
-~~~
-[bam_sort_core] merging from 2 files...
-~~~
-{: .output}
+```output
+[bam_sort_core] merging from 2 files...
+```
SAM/BAM files can be sorted in multiple ways, e.g. by location of alignment on the chromosome, by read name, etc. It is important to be aware that different alignment tools will output differently sorted SAM/BAM, and different downstream tools require differently sorted alignment files as input.
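+For example, if a downstream tool wanted alignments grouped by read name rather than by coordinate, you could sort the same BAM with `samtools sort -n`. This is only a sketch; the output filename below is an illustration and is not used later in this lesson:
+
+```bash
+# hypothetical example: sorting by read name instead of coordinate
+$ samtools sort -n -o results/bam/SRR2584866.aligned.namesorted.bam results/bam/SRR2584866.aligned.bam
+```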
You can use samtools to learn more about this bam file as well.
-~~~
+```bash
samtools flagstat results/bam/SRR2584866.aligned.sorted.bam
-~~~
-{: .bash}
+```
This will give you the following statistics about your sorted bam file:
-~~~
+```output
351169 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1169 + 0 supplementary
@@ -219,89 +214,86 @@ This will give you the following statistics about your sorted bam file:
58 + 0 singletons (0.02% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
-~~~
-{: .output}
-## Variant calling
+```
+
+### Variant calling
A variant call is a conclusion that there is a nucleotide difference vs. some reference at a given position in an individual genome
-or transcriptome, often referred to as a Single Nucleotide Variant (SNV). The call is usually accompanied by an estimate of
-variant frequency and some measure of confidence. Similar to other steps in this workflow, there are a number of tools available for
-variant calling. In this workshop we will be using `bcftools`, but there are a few things we need to do before actually calling the
+or transcriptome, often referred to as a Single Nucleotide Variant (SNV). The call is usually accompanied by an estimate of
+variant frequency and some measure of confidence. Similar to other steps in this workflow, there are a number of tools available for
+variant calling. In this workshop we will be using `bcftools`, but there are a few things we need to do before actually calling the
variants.
-
+{alt='workflow'}
-### Step 1: Calculate the read coverage of positions in the genome
+#### Step 1: Calculate the read coverage of positions in the genome
-Do the first pass on variant calling by counting read coverage with
-[bcftools](https://samtools.github.io/bcftools/bcftools.html). We will
-use the command `mpileup`. The flag `-O b` tells bcftools to generate a
+Do the first pass on variant calling by counting read coverage with
+[bcftools](https://samtools.github.io/bcftools/bcftools.html). We will
+use the command `mpileup`. The flag `-O b` tells bcftools to generate a
bcf format output file, `-o` specifies where to write the output file, and `-f` flags the path to the reference genome:
-~~~
+```bash
$ bcftools mpileup -O b -o results/bcf/SRR2584866_raw.bcf \
-f data/ref_genome/ecoli_rel606.fasta results/bam/SRR2584866.aligned.sorted.bam
-~~~
-{: .bash}
+```
-~~~
+```output
[mpileup] 1 samples in 1 input files
-~~~
-{: .output}
+```
We have now generated a file with coverage information for every base.
-### Step 2: Detect the single nucleotide variants (SNVs)
+#### Step 2: Detect the single nucleotide variants (SNVs)
Identify SNVs using bcftools `call`. We have to specify ploidy with the flag `--ploidy`, which is one for the haploid *E. coli*. `-m` allows for multiallelic and rare-variant calling, `-v` tells the program to output variant sites only (not every site in the genome), and `-o` specifies where to write the output file:
-~~~
+```bash
$ bcftools call --ploidy 1 -m -v -o results/vcf/SRR2584866_variants.vcf results/bcf/SRR2584866_raw.bcf
-~~~
-{: .bash}
+```
-### Step 3: Filter and report the SNV variants in variant calling format (VCF)
+#### Step 3: Filter and report the SNV variants in variant calling format (VCF)
Filter the SNVs for the final output in VCF format, using `vcfutils.pl`:
-~~~
+```bash
$ vcfutils.pl varFilter results/vcf/SRR2584866_variants.vcf > results/vcf/SRR2584866_final_variants.vcf
-~~~
-{: .bash}
-
-
-> ## Filtering
-> The `vcfutils.pl varFilter` call filters out variants that do not meet minimum quality default criteria, which can be changed through
-> its options. Using `bcftools` we can verify that the quality of the variant call set has improved after this filtering step by
-> calculating the ratio of [transitions(TS)](https://en.wikipedia.org/wiki/Transition_%28genetics%29) to
-> [transversions (TV)](https://en.wikipedia.org/wiki/Transversion) ratio (TS/TV),
-> where transitions should be more likely to occur than transversions:
->
-> ~~~
-> $ bcftools stats results/bcf/SRR2584866_variants.vcf | grep TSTV
-> # TSTV, transitions/transversions:
-> # TSTV [2]id [3]ts [4]tv [5]ts/tv [6]ts (1st ALT) [7]tv (1st ALT) [8]ts/tv (1st ALT)
-> TSTV 0 628 58 10.83 628 58 10.83
-> $ bcftools stats results/vcf/SRR2584866_final_variants.vcf | grep TSTV
-> # TSTV, transitions/transversions:
-> # TSTV [2]id [3]ts [4]tv [5]ts/tv [6]ts (1st ALT) [7]tv (1st ALT) [8]ts/tv (1st ALT)
-> TSTV 0 621 54 11.50 621 54 11.50
-> ~~~
-> {: .bash}
-{: .callout}
-
-## Explore the VCF format:
-
-~~~
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+### Filtering
+
+The `vcfutils.pl varFilter` call filters out variants that do not meet minimum quality default criteria, which can be changed through
+its options. Using `bcftools` we can verify that the quality of the variant call set has improved after this filtering step by
+calculating the ratio of [transitions (TS)](https://en.wikipedia.org/wiki/Transition_%28genetics%29) to
+[transversions (TV)](https://en.wikipedia.org/wiki/Transversion) (TS/TV),
+where transitions should be more likely to occur than transversions:
+
+```bash
+$ bcftools stats results/bcf/SRR2584866_variants.vcf | grep TSTV
+# TSTV, transitions/transversions:
+# TSTV [2]id [3]ts [4]tv [5]ts/tv [6]ts (1st ALT) [7]tv (1st ALT) [8]ts/tv (1st ALT)
+TSTV 0 628 58 10.83 628 58 10.83
+$ bcftools stats results/vcf/SRR2584866_final_variants.vcf | grep TSTV
+# TSTV, transitions/transversions:
+# TSTV [2]id [3]ts [4]tv [5]ts/tv [6]ts (1st ALT) [7]tv (1st ALT) [8]ts/tv (1st ALT)
+TSTV 0 621 54 11.50 621 54 11.50
+```
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Explore the VCF format
+
+```bash
$ less -S results/vcf/SRR2584866_final_variants.vcf
-~~~
-{: .bash}
+```
You will see the header (which describes the format), the time and date the file was
-created, the version of bcftools that was used, the command line parameters used, and
+created, the version of bcftools that was used, the command line parameters used, and
some additional information:
-~~~
+```output
##fileformat=VCFv4.2
##FILTER=
##bcftoolsVersion=1.8+htslib-1.8
@@ -330,12 +322,11 @@ some additional information:
##INFO=
##bcftools_callVersion=1.8+htslib-1.8
##bcftools_callCommand=call --ploidy 1 -m -v -o results/bcf/SRR2584866_variants.vcf results/bcf/SRR2584866_raw.bcf; Date=Tue Oct 9 18:48:10 2018
-~~~
-{: .output}
+```
-Followed by information on each of the variations observed:
+Followed by information on each of the variations observed:
-~~~
+```output
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT results/bam/SRR2584866.aligned.sorted.bam
CP000819.1 1521 . C T 207 . DP=9;VDB=0.993024;SGB=-0.662043;MQSB=0.974597;MQ0F=0;AC=1;AN=1;DP4=0,0,4,5;MQ=60
CP000819.1 1612 . A G 225 . DP=13;VDB=0.52194;SGB=-0.676189;MQSB=0.950952;MQ0F=0;AC=1;AN=1;DP4=0,0,6,5;MQ=60
@@ -349,95 +340,98 @@ CP000819.1 45277 . A G 225 . DP=15;VDB=0.4709
CP000819.1 56613 . C G 183 . DP=12;VDB=0.879703;SGB=-0.676189;MQSB=1;MQ0F=0;AC=1;AN=1;DP4=0,0,8,3;MQ=60 GT:PL
CP000819.1 62118 . A G 225 . DP=19;VDB=0.414981;SGB=-0.691153;MQSB=0.906029;MQ0F=0;AC=1;AN=1;DP4=0,0,8,10;MQ=59
CP000819.1 64042 . G A 225 . DP=18;VDB=0.451328;SGB=-0.689466;MQSB=1;MQ0F=0;AC=1;AN=1;DP4=0,0,7,9;MQ=60 GT:PL
-~~~
-{: .output}
+```
This is a lot of information, so let's take some time to make sure we understand our output.
-The first few columns represent the information we have about a predicted variation.
+The first few columns represent the information we have about a predicted variation.
-| column | info |
-| ------- | ---------- |
-| CHROM | contig location where the variation occurs |
-| POS | position within the contig where the variation occurs |
-| ID | a `.` until we add annotation information |
-| REF | reference genotype (forward strand) |
-| ALT | sample genotype (forward strand) |
-| QUAL | Phred-scaled probability that the observed variant exists at this site (higher is better) |
-| FILTER | a `.` if no quality filters have been applied, PASS if a filter is passed, or the name of the filters this variant failed |
+| column | info |
+| ------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| CHROM | contig location where the variation occurs |
+| POS | position within the contig where the variation occurs |
+| ID | a `.` until we add annotation information |
+| REF | reference genotype (forward strand) |
+| ALT | sample genotype (forward strand) |
+| QUAL | Phred-scaled probability that the observed variant exists at this site (higher is better) |
+| FILTER | a `.` if no quality filters have been applied, PASS if a filter is passed, or the name of the filters this variant failed |
In an ideal world, the information in the `QUAL` column would be all we needed to filter out bad variant calls.
-However, in reality we need to filter on multiple other metrics.
+However, in reality we need to filter on multiple other metrics.
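+For example, a common extra filter combines a minimum quality with a minimum read depth. A sketch of what that could look like with `bcftools filter` is below; the thresholds and the output filename are arbitrary illustrations, not recommendations, and this step is not part of our workflow:
+
+```bash
+# hypothetical example: drop calls with low quality or low depth
+$ bcftools filter -e 'QUAL<30 || DP<10' results/vcf/SRR2584866_final_variants.vcf > results/vcf/SRR2584866_qualdp_filtered.vcf
+```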
-The last two columns contain the genotypes and can be tricky to decode.
+The last two columns contain the genotypes and can be tricky to decode.
-| column | info |
-| ------- | ---------- |
-| FORMAT | lists in order the metrics presented in the final column |
-| results | lists the values associated with those metrics in order |
+| column | info |
+| ------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| FORMAT | lists in order the metrics presented in the final column |
+| results | lists the values associated with those metrics in order |
-For our file, the metrics presented are GT:PL:GQ.
+For our file, the metrics presented are GT:PL:GQ.
-| metric | definition |
-| ------- | ---------- |
-| AD, DP | the depth per allele by sample and coverage |
-| GT | the genotype for the sample at this loci. For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. A 0/0 means homozygous reference, 0/1 is heterozygous, and 1/1 is homozygous for the alternate allele. |
-| PL | the likelihoods of the given genotypes |
-| GQ | the Phred-scaled confidence for the genotype |
+| metric | definition |
+| ------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| AD, DP | the depth per allele by sample and coverage |
+| GT      | the genotype for the sample at this locus. For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. A 0/0 means homozygous reference, 0/1 is heterozygous, and 1/1 is homozygous for the alternate allele. |
+| PL | the likelihoods of the given genotypes |
+| GQ | the Phred-scaled confidence for the genotype |
The Broad Institute's [VCF guide](https://www.broadinstitute.org/gatk/guide/article?id=1268) is an excellent place
to learn more about the VCF file format.
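+If you want to pull out just a few of these fields rather than reading whole lines, `bcftools query` can print a custom table. A sketch (the choice of columns here is only an example):
+
+```bash
+# print position, alleles, quality and the genotype call for each variant
+$ bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%QUAL[\t%GT]\n' results/vcf/SRR2584866_final_variants.vcf | head -n 5
+```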
-> ## Exercise
->
-> Use the `grep` and `wc` commands you have learned to assess how many variants are in the vcf file.
->
->> ## Solution
->>
->> ~~~
->> $ grep -v "#" results/vcf/SRR2584866_final_variants.vcf | wc -l
->> ~~~
->> {: .bash}
->>
->> ~~~
->> 766
->> ~~~
->> {: .output}
->>
->> There are 766 variants in this file.
-> {: .solution}
-{: .challenge}
-
-## Assess the alignment (visualization) - optional step
-
-It is often instructive to look at your data in a genome browser. Visualization will allow you to get a "feel" for
-the data, as well as detecting abnormalities and problems. Also, exploring the data in such a way may give you
-ideas for further analyses. As such, visualization tools are useful for exploratory analysis. In this lesson we
+::::::::::::::::::::::::::::::::::::::: challenge
+
+### Exercise
+
+Use the `grep` and `wc` commands you have learned to assess how many variants are in the vcf file.
+
+::::::::::::::: solution
+
+### Solution
+
+```bash
+$ grep -v "#" results/vcf/SRR2584866_final_variants.vcf | wc -l
+```
+
+```output
+766
+```
+
+There are 766 variants in this file.
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Assess the alignment (visualization) - optional step
+
+It is often instructive to look at your data in a genome browser. Visualization will allow you to get a "feel" for
+the data, as well as detecting abnormalities and problems. Also, exploring the data in such a way may give you
+ideas for further analyses. As such, visualization tools are useful for exploratory analysis. In this lesson we
will describe two different tools for visualization: a light-weight command-line based one and the Broad
Institute's Integrative Genomics Viewer (IGV) which requires
software installation and transfer of files.
In order for us to visualize the alignment files, we will need to index the BAM file using `samtools`:
-~~~
+```bash
$ samtools index results/bam/SRR2584866.aligned.sorted.bam
-~~~
-{: .bash}
+```
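+Once the index (the `.bai` file) exists, `samtools` can jump straight to a region of the genome instead of reading the whole file. A sketch (the region below is an arbitrary example):
+
+```bash
+# show the first few alignments that overlap the first 2,000 bases of the chromosome
+$ samtools view results/bam/SRR2584866.aligned.sorted.bam CP000819.1:1-2000 | head -n 5
+```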
-### Viewing with `tview`
+#### Viewing with `tview`
-[Samtools](http://www.htslib.org/) implements a very simple text alignment viewer based on the GNU
-`ncurses` library, called `tview`. This alignment viewer works with short indels and shows [MAQ](http://maq.sourceforge.net/) consensus.
+[Samtools](https://www.htslib.org/) implements a very simple text alignment viewer based on the GNU
+`ncurses` library, called `tview`. This alignment viewer works with short indels and shows [MAQ](https://maq.sourceforge.net/) consensus.
It uses different colors to display mapping quality or base quality, subject to the user's choice. The samtools viewer is known to work swiftly even with a 130 GB alignment. Due to its text interface, displaying alignments over a network is also very fast.
-In order to visualize our mapped reads, we use `tview`, giving it the sorted bam file and the reference file:
+In order to visualize our mapped reads, we use `tview`, giving it the sorted bam file and the reference file:
-~~~
+```bash
$ samtools tview results/bam/SRR2584866.aligned.sorted.bam data/ref_genome/ecoli_rel606.fasta
-~~~
-{: .bash}
+```
-~~~
+```output
1 11 21 31 41 51 61 71 81 91 101 111 121
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATAC
..................................................................................................................................
@@ -462,113 +456,134 @@ AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTG
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ..................................
.................................... ,,,,,,,,,,,,,,,,,,a,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ............................ ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
-~~~
-{: .output}
+```
The first line of output shows the genome coordinates in our reference genome. The second line shows the reference
genome sequence. The third line shows the consensus sequence determined from the sequence reads. A `.` indicates
a match to the reference sequence, so we can see that the consensus from our sample matches the reference in most
locations. That is good! If that was not the case, we should probably reconsider our choice of reference.
-Below the horizontal line, we can see all of the reads in our sample aligned with the reference genome. Only
+Below the horizontal line, we can see all of the reads in our sample aligned with the reference genome. Only
positions where the called base differs from the reference are shown. You can use the arrow keys on your keyboard
to scroll or type `?` for a help menu. To navigate to a specific position, type `g`. A dialogue box will appear. In
this box, type the name of the "chromosome" followed by a colon and the position of the variant you would like to view
-(e.g. for this sample, type `CP000819.1:50` to view the 50th base. Type `Ctrl^C` or `q` to exit `tview`.
-
-> ## Exercise
->
-> Visualize the alignment of the reads for our `SRR2584866` sample. What variant is present at
-> position 4377265? What is the canonical nucleotide in that position?
->
->> ## Solution
->>
->> ~~~
->> $ samtools tview ~/dc_workshop/results/bam/SRR2584866.aligned.sorted.bam ~/dc_workshop/data/ref_genome/ecoli_rel606.fasta
->> ~~~
->> {: .bash}
->>
->> Then type `g`. In the dialogue box, type `CP000819.1:4377265`.
->> `G` is the variant. `A` is canonical. This variant possibly changes the phenotype of this sample to hypermutable. It occurs
->> in the gene *mutL*, which controls DNA mismatch repair.
-> {: .solution}
-{: .challenge}
-
-### Viewing with IGV
-
-[IGV](http://www.broadinstitute.org/igv/) is a stand-alone browser, which has the advantage of being installed locally and providing fast access. Web-based genome browsers, like [Ensembl](http://www.ensembl.org/index.html) or the [UCSC browser](https://genome.ucsc.edu/), are slower, but provide more functionality. They not only allow for more polished and flexible visualization, but also provide easy access to a wealth of annotations and external data sources. This makes it straightforward to relate your data with information about repeat regions, known genes, epigenetic features or areas of cross-species conservation, to name just a few.
-
-In order to use IGV, we will need to transfer some files to our local machine. We know how to do this with `scp`.
-Open a new tab in your terminal window and create a new folder. We will put this folder on our Desktop for
-demonstration purposes, but in general you should avoide proliferating folders and files on your Desktop and
+(e.g. for this sample, type `CP000819.1:50` to view the 50th base). Type `Ctrl+C` or `q` to exit `tview`.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+### Exercise
+
+Visualize the alignment of the reads for our `SRR2584866` sample. What variant is present at
+position 4377265? What is the canonical nucleotide in that position?
+
+::::::::::::::: solution
+
+### Solution
+
+```bash
+$ samtools tview ~/dc_workshop/results/bam/SRR2584866.aligned.sorted.bam ~/dc_workshop/data/ref_genome/ecoli_rel606.fasta
+```
+
+Then type `g`. In the dialogue box, type `CP000819.1:4377265`.
+`G` is the variant. `A` is canonical. This variant possibly changes the phenotype of this sample to hypermutable. It occurs
+in the gene *mutL*, which controls DNA mismatch repair.
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+#### Viewing with IGV
+
+[IGV](https://www.broadinstitute.org/igv/) is a stand-alone browser, which has the advantage of being installed locally and providing fast access. Web-based genome browsers, like [Ensembl](https://www.ensembl.org/index.html) or the [UCSC browser](https://genome.ucsc.edu/), are slower, but provide more functionality. They not only allow for more polished and flexible visualization, but also provide easy access to a wealth of annotations and external data sources. This makes it straightforward to relate your data with information about repeat regions, known genes, epigenetic features or areas of cross-species conservation, to name just a few.
+
+In order to use IGV, we will need to transfer some files to our local machine. We know how to do this with `scp`.
+Open a new tab in your terminal window and create a new folder. We will put this folder on our Desktop for
+demonstration purposes, but in general you should avoid proliferating folders and files on your Desktop and
instead organize files within a directory structure like we have been using in our `dc_workshop` directory.
-~~~
+```bash
$ mkdir ~/Desktop/files_for_igv
$ cd ~/Desktop/files_for_igv
-~~~
-{: .bash}
+```
-Now we will transfer our files to that new directory. Remember to replace the text between the `@` and the `:`
+Now we will transfer our files to that new directory. Remember to replace the text between the `@` and the `:`
with your AWS instance number. The commands to `scp` always go in the terminal window that is connected to your
local computer (not your AWS instance).
-~~~
+```bash
$ scp dcuser@ec2-34-203-203-131.compute-1.amazonaws.com:~/dc_workshop/results/bam/SRR2584866.aligned.sorted.bam ~/Desktop/files_for_igv
$ scp dcuser@ec2-34-203-203-131.compute-1.amazonaws.com:~/dc_workshop/results/bam/SRR2584866.aligned.sorted.bam.bai ~/Desktop/files_for_igv
$ scp dcuser@ec2-34-203-203-131.compute-1.amazonaws.com:~/dc_workshop/data/ref_genome/ecoli_rel606.fasta ~/Desktop/files_for_igv
$ scp dcuser@ec2-34-203-203-131.compute-1.amazonaws.com:~/dc_workshop/results/vcf/SRR2584866_final_variants.vcf ~/Desktop/files_for_igv
-~~~
-{: .bash}
+```
-You will need to type the password for your AWS instance each time you call `scp`.
+You will need to type the password for your AWS instance each time you call `scp`.
Next, we need to open the IGV software. If you have not done so already, you can download IGV from the [Broad Institute's software page](https://www.broadinstitute.org/software/igv/download), double-click the `.zip` file
-to unzip it, and then drag the program into your Applications folder.
+to unzip it, and then drag the program into your Applications folder.
1. Open IGV.
2. Load our reference genome file (`ecoli_rel606.fasta`) into IGV using the **"Load Genomes from File..."** option under the **"Genomes"** pull-down menu.
-3. Load our BAM file (`SRR2584866.aligned.sorted.bam`) using the **"Load from File..."** option under the **"File"** pull-down menu.
-4. Do the same with our VCF file (`SRR2584866_final_variants.vcf`).
+3. Load our BAM file (`SRR2584866.aligned.sorted.bam`) using the **"Load from File..."** option under the **"File"** pull-down menu.
+4. Do the same with our VCF file (`SRR2584866_final_variants.vcf`).
Your IGV browser should look like the screenshot below:
-
+{alt='IGV'}
-There should be two tracks: one coresponding to our BAM file and the other for our VCF file.
+There should be two tracks: one corresponding to our BAM file and the other for our VCF file.
In the **VCF track**, each bar across the top of the plot shows the allele fraction for a single locus. The second bar shows
-the genotypes for each locus in each *sample*. We only have one sample called here, so we only see a single line. Dark blue =
+the genotypes for each locus in each *sample*. We only have one sample called here, so we only see a single line. Dark blue =
heterozygous, Cyan = homozygous variant, Grey = reference. Filtered entries are transparent.
-Zoom in to inspect variants you see in your filtered VCF file to become more familiar with IGV. See how quality information
+Zoom in to inspect variants you see in your filtered VCF file to become more familiar with IGV. See how quality information
corresponds to alignment information at those loci.
-Use [this website](http://software.broadinstitute.org/software/igv/AlignmentData) and the links therein to understand how IGV colors the alignments.
+Use [this website](https://software.broadinstitute.org/software/igv/AlignmentData) and the links therein to understand how IGV colors the alignments.
Now that we have run through our workflow for a single sample, we want to repeat this workflow for our other five
samples. However, we do not want to type each of these individual steps again five more times. That would be very
time consuming and error-prone, and would become impossible as we gathered more and more samples. Luckily, we
already know the tools we need to use to automate this workflow and run it on as many files as we want using a
-single line of code. Those tools are: wildcards, for loops, and bash scripts. We will use all three in the next
-lesson.
-
-> ## Installing software
->
-> It is worth noting that all of the software we are using for
-> this workshop has been pre-installed on our remote computer.
-> This saves us a lot of time - installing software can be a
-> time-consuming and frustrating task - however, this does mean that
-> you will not be able to walk out the door and start doing these
-> analyses on your own computer. You will need to install
-> the software first. Look at the [setup instructions](http://www.datacarpentry.org/wrangling-genomics/setup.html) for more information
-> on installing these software packages.
-{: .callout}
-
-> ## BWA alignment options
-> BWA consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina sequence
-> reads up to 100bp, while the other two are for sequences ranging from 70bp to 1Mbp. BWA-MEM and BWA-SW share similar features such
-> as long-read support and split alignment, but BWA-MEM, which is the latest, is generally recommended for high-quality queries as it
-> is faster and more accurate.
-{: .callout}
+single line of code. Those tools are: wildcards, for loops, and bash scripts. We will use all three in the next
+lesson.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+### Installing software
+
+It is worth noting that all of the software we are using for
+this workshop has been pre-installed on our remote computer.
+This saves us a lot of time - installing software can be a
+time-consuming and frustrating task - however, this does mean that
+you will not be able to walk out the door and start doing these
+analyses on your own computer. You will need to install
+the software first. Look at the [setup instructions](https://www.datacarpentry.org/wrangling-genomics/setup.html) for more information
+on installing these software packages.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+### BWA alignment options
+
+BWA consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina sequence
+reads up to 100bp, while the other two are for sequences ranging from 70bp to 1Mbp. BWA-MEM and BWA-SW share similar features such
+as long-read support and split alignment, but BWA-MEM, which is the latest, is generally recommended for high-quality queries as it
+is faster and more accurate.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Bioinformatic command line tools are collections of commands that can be used to carry out bioinformatic analyses.
+- To use the most powerful bioinformatic tools, you will need to use the command line.
+- There are many different file formats for storing genomics data. It is important to understand what type of information is contained in each file, and how it was derived.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
diff --git a/episodes/05-automation.md b/episodes/05-automation.md
index 643d6b12..9fc16cb9 100644
--- a/episodes/05-automation.md
+++ b/episodes/05-automation.md
@@ -1,56 +1,60 @@
---
-title: "Automating a Variant Calling Workflow"
+title: Automating a Variant Calling Workflow
teaching: 30
exercises: 15
-questions:
-- "How can I make my workflow more efficient and less error-prone?"
-objectives:
-- "Write a shell script with multiple variables."
-- "Incorporate a `for` loop into a shell script."
-keypoints:
-- "We can combine multiple commands into a shell script to automate a workflow."
-- "Use `echo` statements within your scripts to get an automated progress update."
---
-# What is a shell script?
-You wrote a simple shell script in a [previous lesson](http://www.datacarpentry.org/shell-genomics/05-writing-scripts/) that we used to extract bad reads from our
-FASTQ files and put them into a new file.
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Write a shell script with multiple variables.
+- Incorporate a `for` loop into a shell script.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How can I make my workflow more efficient and less error-prone?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## What is a shell script?
+
+You wrote a simple shell script in a [previous lesson](https://www.datacarpentry.org/shell-genomics/05-writing-scripts/) that we used to extract bad reads from our
+FASTQ files and put them into a new file.
Here is the script you wrote:
-~~~
+```bash
grep -B1 -A2 NNNNNNNNNN *.fastq > scripted_bad_reads.txt
echo "Script finished!"
-~~~
-{: .bash}
+```
That script was only two lines long, but shell scripts can be much more complicated
-than that and can be used to perform a large number of operations on one or many
+than that and can be used to perform a large number of operations on one or many
files. This saves you the effort of having to type each of those commands over for
-each of your data files and makes your work less error-prone and more reproducible.
+each of your data files and makes your work less error-prone and more reproducible.
For example, the variant calling workflow we just carried out had about eight steps
-where we had to type a command into our terminal. Most of these commands were pretty
+where we had to type a command into our terminal. Most of these commands were pretty
long. If we wanted to do this for all six of our data files, that would be forty-eight
steps. If we had 50 samples (a more realistic number), it would be 400 steps! You can
see why we want to automate this.
-We have also used `for` loops in previous lessons to iterate one or two commands over multiple input files.
+We have also used `for` loops in previous lessons to iterate one or two commands over multiple input files.
In these `for` loops, the filename was defined as a variable in the `for` statement, which enabled you to run the loop on multiple files. We will be using variable assignments like this in our new shell scripts.
-Here is the `for` loop you wrote for unzipping `.zip` files:
+Here is the `for` loop you wrote for unzipping `.zip` files:
-~~~
+```bash
$ for filename in *.zip
> do
> unzip $filename
> done
-~~~
-{: .bash}
+```
And here is the one you wrote for running Trimmomatic on all of our `.fastq` sample files:
-~~~
+```bash
$ for infile in *_1.fastq.gz
> do
> base=$(basename ${infile} _1.fastq.gz)
@@ -59,121 +63,117 @@ $ for infile in *_1.fastq.gz
> ${base}_2.trim.fastq.gz ${base}_2un.trim.fastq.gz \
> SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
> done
-~~~
-{: .bash}
+```
Notice that in this `for` loop, we used two variables, `infile`, which was defined in the `for` statement, and `base`, which was created from the filename during each iteration of the loop.
-> ## Creating variables
-> Within the Bash shell you can create variables at any time (as we did
-> above, and during the 'for' loop lesson). Assign any name and the
-> value using the assignment operator: '='. You can check the current
-> definition of your variable by typing into your script: echo $variable_name.
-{: .callout}
+::::::::::::::::::::::::::::::::::::::::: callout
+
+### Creating variables
+
+Within the Bash shell, you can create variables at any time (as we did
+above, and during the 'for' loop lesson). Assign a name and a value
+using the assignment operator '='. You can check the current
+definition of your variable by adding `echo $variable_name` to your script.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
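+For example, here is a minimal sketch you could try at the prompt (the variable names are made up for illustration):
+
+```bash
+$ my_sample=SRR2584863                                   # note: no spaces around '='
+$ echo $my_sample
+$ base=$(basename ${my_sample}_1.fastq.gz _1.fastq.gz)   # strips the _1.fastq.gz suffix
+$ echo $base                                             # prints SRR2584863
+```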
In this lesson, we will use two shell scripts to automate the variant calling analysis: one for FastQC analysis (including creating our summary file), and a second for the remaining variant calling. To write a script to run our FastQC analysis, we will take each of the commands we entered to run FastQC and process the output files and put them into a single file with a `.sh` extension. The `.sh` is not essential, but serves as a reminder to ourselves and to the computer that this is a shell script.
-# Analyzing quality with FastQC
+## Analyzing quality with FastQC
We will use the command `touch` to create a new file where we will write our shell script. We will create this script in a new
directory called `scripts/`. Previously, we used
`nano` to create and open a new file. The command `touch` allows us to create a new file without opening that file.
-~~~
+```bash
$ mkdir -p ~/dc_workshop/scripts
$ cd ~/dc_workshop/scripts
$ touch read_qc.sh
$ ls
-~~~
-{: .bash}
+```
-~~~
+```output
read_qc.sh
-~~~
-{: .output}
+```
We now have an empty file called `read_qc.sh` in our `scripts/` directory. We will now open this file in `nano` and start
building our script.
-~~~
+```bash
$ nano read_qc.sh
-~~~
-{: .bash}
+```
**Enter the following pieces of code into your shell script (not into your terminal prompt).**
Our first line will ensure that our script exits if an error occurs; this is a good idea to include at the beginning of your scripts. The second line will move us into the `untrimmed_fastq/` directory when we run our script.
-~~~
+```output
set -e
cd ~/dc_workshop/data/untrimmed_fastq/
-~~~
-{: .output}
+```
These next two lines will give us a status message to tell us that we are currently running FastQC, then will run FastQC
-on all of the files in our current directory with a `.fastq` extension.
+on all of the files in our current directory with a `.fastq` extension.
-~~~
+```output
echo "Running FastQC ..."
fastqc *.fastq*
-~~~
-{: .output}
+```
Our next line will create a new directory to hold our FastQC output files. Here we are using the `-p` option for `mkdir` again. It is a good idea to use this option in your shell scripts to avoid running into errors if you do not have the directory structure you think you do.
-~~~
+```output
mkdir -p ~/dc_workshop/results/fastqc_untrimmed_reads
-~~~
-{: .output}
+```
Our next three lines first give us a status message to tell us we are saving the results from FastQC, then move all of the files
-with a `.zip` or a `.html` extension to the directory we just created for storing our FastQC results.
+with a `.zip` or a `.html` extension to the directory we just created for storing our FastQC results.
-~~~
+```output
echo "Saving FastQC results..."
mv *.zip ~/dc_workshop/results/fastqc_untrimmed_reads/
mv *.html ~/dc_workshop/results/fastqc_untrimmed_reads/
-~~~
-{: .output}
+```
The next line moves us to the results directory where we have stored our output.
-~~~
+```output
cd ~/dc_workshop/results/fastqc_untrimmed_reads/
-~~~
-{: .output}
+```
The next five lines should look very familiar. First we give ourselves a status message to tell us that we are unzipping our ZIP
files. Then we run our for loop to unzip all of the `.zip` files in this directory.
-~~~
+```output
echo "Unzipping..."
for filename in *.zip
do
unzip $filename
done
-~~~
-{: .output}
+```
-Next we concatenate all of our summary files into a single output file, with a status message to remind ourselves that this is
+Next we concatenate all of our summary files into a single output file, with a status message to remind ourselves that this is
what we are doing.
-~~~
+```output
echo "Saving summary..."
cat */summary.txt > ~/dc_workshop/docs/fastqc_summaries.txt
-~~~
-{: .output}
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+### Using `echo` statements
-> ## Using `echo` statements
->
-> We have used `echo` statements to add progress statements to our script. Our script will print these statements
-> as it is running and therefore we will be able to see how far our script has progressed.
->
-{: .callout}
+We have used `echo` statements to add progress statements to our script. Our script will print these statements
+as it is running and therefore we will be able to see how far our script has progressed.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
Your full shell script should now look like this:
-~~~
+```output
set -e
cd ~/dc_workshop/data/untrimmed_fastq/
@@ -196,17 +196,15 @@ for filename in *.zip
echo "Saving summary..."
cat */summary.txt > ~/dc_workshop/docs/fastqc_summaries.txt
-~~~
-{: .output}
+```
Save your file and exit `nano`. We can now run our script:
-~~~
+```bash
$ bash read_qc.sh
-~~~
-{: .bash}
+```
-~~~
+```output
Running FastQC ...
Started analysis of SRR2584866.fastq
Approx 5% complete for SRR2584866.fastq
@@ -217,29 +215,25 @@ Approx 25% complete for SRR2584866.fastq
.
.
.
-~~~
-{: .output}
+```
-For each of your sample files, FastQC will ask if you want to replace the existing version with a new version. This is
+For each of your sample files, FastQC will ask if you want to replace the existing version with a new version. This is
because we have already run FastQC on these sample files and generated all of the outputs. We are now doing this again using
our scripts. Go ahead and select `A` each time this message appears. It will appear once per sample file (six times total).
-~~~
+```output
replace SRR2584866_fastqc/Icons/fastqc_icon.png? [y]es, [n]o, [A]ll, [N]one, [r]ename:
-~~~
-{: .output}
-
+```
-# Automating the rest of our variant calling workflow
+## Automating the rest of our variant calling workflow
We can extend these principles to the entire variant calling workflow. To do this, we will take all of the individual commands that we wrote before, put them into a single file, add variables so that the script knows to iterate through our input files and write to the appropriate output files. This is very similar to what we did with our `read_qc.sh` script, but will be a bit more complex.
Download the script from [here](https://raw.githubusercontent.com/datacarpentry/wrangling-genomics/gh-pages/files/run_variant_calling.sh). Download to `~/dc_workshop/scripts`.
-~~~
+```bash
curl -O https://raw.githubusercontent.com/datacarpentry/wrangling-genomics/gh-pages/files/run_variant_calling.sh
-~~~
-{: .bash}
+```
Our variant calling workflow has the following steps:
@@ -252,15 +246,14 @@ Our variant calling workflow has the following steps:
Let's go through this script together:
-~~~
+```bash
$ cd ~/dc_workshop/scripts
$ less run_variant_calling.sh
-~~~
-{: .bash}
+```
The script should look like this:
-~~~
+```output
set -e
cd ~/dc_workshop/results
@@ -295,68 +288,61 @@ for fq1 in ~/dc_workshop/data/trimmed_fastq_small/*_1.trim.sub.fastq
vcfutils.pl varFilter $variants > $final_variants
done
-~~~
-{: .output}
+```
Now, we will go through each line in the script before running it.
First, notice that we change our working directory so that we can create new results subdirectories
-in the right location.
+in the right location.
-~~~
+```output
cd ~/dc_workshop/results
-~~~
-{: .output}
+```
-Next we tell our script where to find the reference genome by assigning the `genome` variable to
-the path to our reference genome:
+Next we tell our script where to find the reference genome by assigning the `genome` variable to
+the path to our reference genome:
-~~~
+```output
genome=~/dc_workshop/data/ref_genome/ecoli_rel606.fasta
-~~~
-{: .output}
+```
-Next we index our reference genome for BWA:
+Next we index our reference genome for BWA:
-~~~
+```output
bwa index $genome
-~~~
-{: .output}
+```
-And create the directory structure to store our results in:
+And create the directory structure to store our results in:
-~~~
+```output
mkdir -p sam bam bcf vcf
-~~~
-{: .output}
+```
Then, we use a loop to run the variant calling workflow on each of our FASTQ files. The full list of commands
-within the loop will be executed once for each of the FASTQ files in the
-`data/trimmed_fastq_small/` directory.
+within the loop will be executed once for each of the FASTQ files in the
+`data/trimmed_fastq_small/` directory.
We will include a few `echo` statements to give us status updates on our progress.
The first thing we do is assign the name of the FASTQ file we are currently working with to a variable called `fq1` and
tell the script to `echo` the filename back to us so we can check which file we are on.
-~~~
+```bash
for fq1 in ~/dc_workshop/data/trimmed_fastq_small/*_1.trim.sub.fastq
do
echo "working with file $fq1"
-~~~
-{: .bash}
+```
We then extract the base name of the file (excluding the path and the `_1.trim.sub.fastq` portion of the filename) and assign it
-to a new variable called `base`.
-~~~
+to a new variable called `base`.
+
+```bash
base=$(basename $fq1 _1.trim.sub.fastq)
echo "base name is $base"
-~~~
-{: .bash}
+```
We can use the `base` variable to access both the `base_1.fastq` and `base_2.fastq` input files, and create variables to store the names of our output files. This makes the script easier to read because we do not need to type out the full name of each of the files: instead, we use the `base` variable, but add a different extension (e.g. `.sam`, `.bam`) for each file produced by our workflow.
-
-~~~
+```bash
#input fastq files
fq1=~/dc_workshop/data/trimmed_fastq_small/${base}_1.trim.sub.fastq
fq2=~/dc_workshop/data/trimmed_fastq_small/${base}_2.trim.sub.fastq
@@ -368,112 +354,119 @@ We can use the `base` variable to access both the `base_1.fastq` and `base_2.fas
raw_bcf=~/dc_workshop/results/bcf/${base}_raw.bcf
variants=~/dc_workshop/results/bcf/${base}_variants.vcf
final_variants=~/dc_workshop/results/vcf/${base}_final_variants.vcf
-~~~
-{: .bash}
-
+```
And finally, the actual workflow steps:
1) align the reads to the reference genome and output a `.sam` file:
-~~~
+```output
bwa mem $genome $fq1 $fq2 > $sam
-~~~
-{: .output}
+```
2) convert the SAM file to BAM format:
-~~~
+```output
samtools view -S -b $sam > $bam
-~~~
-{: .output}
+```
3) sort the BAM file:
-~~~
+```output
samtools sort -o $sorted_bam $bam
-~~~
-{: .output}
+```
4) index the BAM file for display purposes:
-~~~
+```output
samtools index $sorted_bam
-~~~
-{: .output}
+```
5) calculate the read coverage of positions in the genome:
-~~~
+```output
bcftools mpileup -O b -o $raw_bcf -f $genome $sorted_bam
-~~~
-{: .output}
+```
6) call SNVs with bcftools:
-~~~
+```output
bcftools call --ploidy 1 -m -v -o $variants $raw_bcf
-~~~
-{: .output}
+```
7) filter and report the SNVs in variant calling format (VCF):
-~~~
+```output
vcfutils.pl varFilter $variants > $final_variants
-~~~
-{: .output}
+```
+::::::::::::::::::::::::::::::::::::::: challenge
+### Exercise
-> ## Exercise
-> It is a good idea to add comments to your code so that you (or a collaborator) can make sense of what you did later.
-> Look through your existing script. Discuss with a neighbor where you should add comments. Add comments (anything following
-> a `#` character will be interpreted as a comment, bash will not try to run these comments as code).
-{: .challenge}
+It is a good idea to add comments to your code so that you (or a collaborator) can make sense of what you did later.
+Look through your existing script. Discuss with a neighbor where you should add comments. Add comments (anything following
+a `#` character will be interpreted as a comment; bash will not try to run these comments as code).
+::::::::::::::::::::::::::::::::::::::::::::::::::
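+
+As a reminder of the comment syntax, here is an illustrative (not prescribed) example of
+a commented line from such a script:
+
+```bash
+# index the reference genome so that the aligner can use it
+bwa index $genome
+```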
+
Now we can run our script:
-~~~
+```bash
$ bash run_variant_calling.sh
-~~~
-{: .bash}
-
-
-> ## Exercise
->
-> The samples we just performed variant calling on are part of the long-term evolution experiment introduced at the
-> beginning of our variant calling workflow. From the metadata table, we know that SRR2589044 was from generation 5000,
-> SRR2584863 was from generation 15000, and SRR2584866 was from generation 50000. How did the number of mutations per sample change
-> over time? Examine the metadata table. What is one reason the number of mutations may have changed the way they did?
->
-> Hint: You can find a copy of the output files for the subsampled trimmed FASTQ file variant calling in the
-> `~/.solutions/wrangling-solutions/variant_calling_auto/` directory.
->
->> ## Solution
->>
->> ~~~
->> $ for infile in ~/dc_workshop/results/vcf/*_final_variants.vcf
->> > do
->> > echo ${infile}
->> > grep -v "#" ${infile} | wc -l
->> > done
->> ~~~
->> {: .bash}
->>
->> For SRR2589044 from generation 5000 there were 10 mutations, for SRR2584863 from generation 15000 there were 25 mutations,
->> and SRR2584866 from generation 766 mutations. In the last generation, a hypermutable phenotype had evolved, causing this
->> strain to have more mutations.
-> {: .solution}
-{: .challenge}
-
-
-> ## Bonus exercise
->
-> If you have time after completing the previous exercise, use `run_variant_calling.sh` to run the variant calling pipeline
-> on the full-sized trimmed FASTQ files. You should have a copy of these already in `~/dc_workshop/data/trimmed_fastq`, but if
-> you do not, there is a copy in `~/.solutions/wrangling-solutions/trimmed_fastq`. Does the number of variants change per sample?
-{: .challenge}
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+### Exercise
+
+The samples we just performed variant calling on are part of the long-term evolution experiment introduced at the
+beginning of our variant calling workflow. From the metadata table, we know that SRR2589044 was from generation 5000,
+SRR2584863 was from generation 15000, and SRR2584866 was from generation 50000. How did the number of mutations per sample change
+over time? Examine the metadata table. What is one reason the number of mutations may have changed the way it did?
+
+Hint: You can find a copy of the output files for the subsampled trimmed FASTQ file variant calling in the
+`~/.solutions/wrangling-solutions/variant_calling_auto/` directory.
+
+::::::::::::::: solution
+
+### Solution
+
+```bash
+$ for infile in ~/dc_workshop/results/vcf/*_final_variants.vcf
+> do
+> echo ${infile}
+> grep -v "#" ${infile} | wc -l
+> done
+```
+
+For SRR2589044 from generation 5000 there were 10 mutations, for SRR2584863 from generation 15000 there were 25 mutations,
+and for SRR2584866 from generation 50000 there were 766 mutations. In the last generation, a hypermutable phenotype had evolved,
+causing this strain to have more mutations.
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+### Bonus exercise
+
+If you have time after completing the previous exercise, use `run_variant_calling.sh` to run the variant calling pipeline
+on the full-sized trimmed FASTQ files. You should have a copy of these already in `~/dc_workshop/data/trimmed_fastq`, but if
+you do not, there is a copy in `~/.solutions/wrangling-solutions/trimmed_fastq`. Does the number of variants change per sample?
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- We can combine multiple commands into a shell script to automate a workflow.
+- Use `echo` statements within your scripts to get an automated progress update.
+::::::::::::::::::::::::::::::::::::::::::::::::::
diff --git a/img/172px-EscherichiaColi_NIAID.jpg b/episodes/fig/172px-EscherichiaColi_NIAID.jpg
similarity index 100%
rename from img/172px-EscherichiaColi_NIAID.jpg
rename to episodes/fig/172px-EscherichiaColi_NIAID.jpg
diff --git a/img/DC1_logo_small.png b/episodes/fig/DC1_logo_small.png
similarity index 100%
rename from img/DC1_logo_small.png
rename to episodes/fig/DC1_logo_small.png
diff --git a/img/bad_quality.png b/episodes/fig/bad_quality.png
similarity index 100%
rename from img/bad_quality.png
rename to episodes/fig/bad_quality.png
diff --git a/img/bad_quality1.8.png b/episodes/fig/bad_quality1.8.png
similarity index 100%
rename from img/bad_quality1.8.png
rename to episodes/fig/bad_quality1.8.png
diff --git a/img/creative-commons-attribution-license.png b/episodes/fig/creative-commons-attribution-license.png
similarity index 100%
rename from img/creative-commons-attribution-license.png
rename to episodes/fig/creative-commons-attribution-license.png
diff --git a/img/good_quality.png b/episodes/fig/good_quality.png
similarity index 100%
rename from img/good_quality.png
rename to episodes/fig/good_quality.png
diff --git a/img/good_quality1.8.png b/episodes/fig/good_quality1.8.png
similarity index 100%
rename from img/good_quality1.8.png
rename to episodes/fig/good_quality1.8.png
diff --git a/img/igv-screenshot.png b/episodes/fig/igv-screenshot.png
similarity index 100%
rename from img/igv-screenshot.png
rename to episodes/fig/igv-screenshot.png
diff --git a/img/lenski_LTEE_timeline_May_28_2016.png b/episodes/fig/lenski_LTEE_timeline_May_28_2016.png
similarity index 100%
rename from img/lenski_LTEE_timeline_May_28_2016.png
rename to episodes/fig/lenski_LTEE_timeline_May_28_2016.png
diff --git a/img/putty_screenshot_1.png b/episodes/fig/putty_screenshot_1.png
similarity index 100%
rename from img/putty_screenshot_1.png
rename to episodes/fig/putty_screenshot_1.png
diff --git a/img/putty_screenshot_2.png b/episodes/fig/putty_screenshot_2.png
similarity index 100%
rename from img/putty_screenshot_2.png
rename to episodes/fig/putty_screenshot_2.png
diff --git a/img/putty_screenshot_3.png b/episodes/fig/putty_screenshot_3.png
similarity index 100%
rename from img/putty_screenshot_3.png
rename to episodes/fig/putty_screenshot_3.png
diff --git a/img/sam_bam.png b/episodes/fig/sam_bam.png
similarity index 100%
rename from img/sam_bam.png
rename to episodes/fig/sam_bam.png
diff --git a/img/sam_bam3.png b/episodes/fig/sam_bam3.png
similarity index 100%
rename from img/sam_bam3.png
rename to episodes/fig/sam_bam3.png
diff --git a/img/terminal.png b/episodes/fig/terminal.png
similarity index 100%
rename from img/terminal.png
rename to episodes/fig/terminal.png
diff --git a/img/var_calling_workflow_qc.png b/episodes/fig/var_calling_workflow_qc.png
similarity index 100%
rename from img/var_calling_workflow_qc.png
rename to episodes/fig/var_calling_workflow_qc.png
diff --git a/img/variant_calling_workflow.png b/episodes/fig/variant_calling_workflow.png
similarity index 100%
rename from img/variant_calling_workflow.png
rename to episodes/fig/variant_calling_workflow.png
diff --git a/img/variant_calling_workflow.svg b/episodes/fig/variant_calling_workflow.svg
similarity index 100%
rename from img/variant_calling_workflow.svg
rename to episodes/fig/variant_calling_workflow.svg
diff --git a/img/variant_calling_workflow_align.png b/episodes/fig/variant_calling_workflow_align.png
similarity index 100%
rename from img/variant_calling_workflow_align.png
rename to episodes/fig/variant_calling_workflow_align.png
diff --git a/img/variant_calling_workflow_cleanup.png b/episodes/fig/variant_calling_workflow_cleanup.png
similarity index 100%
rename from img/variant_calling_workflow_cleanup.png
rename to episodes/fig/variant_calling_workflow_cleanup.png
diff --git a/files/Ecoli_metadata_composite.csv b/episodes/files/Ecoli_metadata_composite.csv
similarity index 100%
rename from files/Ecoli_metadata_composite.csv
rename to episodes/files/Ecoli_metadata_composite.csv
diff --git a/files/Ecoli_metadata_composite.tsv b/episodes/files/Ecoli_metadata_composite.tsv
similarity index 100%
rename from files/Ecoli_metadata_composite.tsv
rename to episodes/files/Ecoli_metadata_composite.tsv
diff --git a/files/Ecoli_metadata_composite_README.md b/episodes/files/Ecoli_metadata_composite_README.md
similarity index 100%
rename from files/Ecoli_metadata_composite_README.md
rename to episodes/files/Ecoli_metadata_composite_README.md
diff --git a/files/NexteraPE-PE.fa b/episodes/files/NexteraPE-PE.fa
similarity index 100%
rename from files/NexteraPE-PE.fa
rename to episodes/files/NexteraPE-PE.fa
diff --git a/files/download-links-for-files.txt b/episodes/files/download-links-for-files.txt
similarity index 100%
rename from files/download-links-for-files.txt
rename to episodes/files/download-links-for-files.txt
diff --git a/files/run_variant_calling.sh b/episodes/files/run_variant_calling.sh
similarity index 100%
rename from files/run_variant_calling.sh
rename to episodes/files/run_variant_calling.sh
diff --git a/files/subsample-trimmed-fastq.txt b/episodes/files/subsample-trimmed-fastq.txt
similarity index 100%
rename from files/subsample-trimmed-fastq.txt
rename to episodes/files/subsample-trimmed-fastq.txt
diff --git a/index.md b/index.md
index cb1cf501..612f67d9 100644
--- a/index.md
+++ b/index.md
@@ -1,35 +1,41 @@
---
-layout: lesson
-root: .
+site: sandpaper::sandpaper_site
---
-A lot of genomics analysis is done using command-line tools for three reasons:
-1) you will often be working with a large number of files, and working through the command-line rather than
-through a graphical user interface (GUI) allows you to automate repetitive tasks,
-2) you will often need more compute power than is available on your personal computer, and
-connecting to and interacting with remote computers requires a command-line interface, and
-3) you will often need to customize your analyses, and command-line tools often enable more
-customization than the corresponding GUI tools (if in fact a GUI tool even exists).
+A lot of genomics analysis is done using command-line tools for three reasons:
-In a [previous lesson](http://www.datacarpentry.org/shell-genomics/), you learned how to use the bash shell to interact with your computer through a command line interface. In this
-lesson, you will be applying this new knowledge to carry out a common genomics workflow - identifying variants among sequencing samples
+1) you will often be working with a large number of files, and working through the command-line rather than
+ through a graphical user interface (GUI) allows you to automate repetitive tasks,
+2) you will often need more compute power than is available on your personal computer, and
+ connecting to and interacting with remote computers requires a command-line interface, and
+3) you will often need to customize your analyses, and command-line tools often enable more
+ customization than the corresponding GUI tools (if in fact a GUI tool even exists).
+
+In a [previous lesson](https://www.datacarpentry.org/shell-genomics/), you learned how to use the bash shell to interact with your computer through a command line interface. In this
+lesson, you will be applying this new knowledge to carry out a common genomics workflow - identifying variants among sequencing samples
taken from multiple individuals within a population. We will be starting with a set of sequenced reads (`.fastq` files), performing
some quality control steps, aligning those reads to a reference genome, and ending by identifying and visualizing variations among these
-samples.
+samples.
-As you progress through this lesson, keep in mind that, even if you aren't going to be doing this same workflow in your research,
-you will be learning some very important lessons about using command-line bioinformatic tools. What you learn here will enable you to
+As you progress through this lesson, keep in mind that, even if you aren't going to be doing this same workflow in your research,
+you will be learning some very important lessons about using command-line bioinformatic tools. What you learn here will enable you to
use a variety of bioinformatic tools with confidence and greatly enhance your research efficiency and productivity.
-> ## Prerequisites
->
-> This lesson assumes a working understanding of the bash shell. If you haven't already completed the [Shell Genomics](http://www.datacarpentry.org/shell-genomics/) lesson, and aren't familiar with the bash shell, please review those materials
-> before starting this lesson.
->
-> This lesson also assumes some familiarity with biological concepts, including the structure of DNA, nucleotide abbreviations, and the
-> concept of genomic variation within a population.
->
-> This lesson uses data hosted on an Amazon Machine Instance (AMI). Workshop participants will be given information on how
-> to log-in to the AMI during the workshop. Learners using these materials for self-directed study will need to set up their own
-> AMI. Information on setting up an AMI and accessing the required data is provided on the [Genomics Workshop setup page](http://www.datacarpentry.org/genomics-workshop/setup.html).
-{: .prereq}
+:::::::::::::::::::::::::::::::::::::::::: prereq
+
+## Prerequisites
+
+This lesson assumes a working understanding of the bash shell. If you haven't already completed the [Shell Genomics](https://www.datacarpentry.org/shell-genomics/) lesson, and aren't familiar with the bash shell, please review those materials
+before starting this lesson.
+
+This lesson also assumes some familiarity with biological concepts, including the structure of DNA, nucleotide abbreviations, and the
+concept of genomic variation within a population.
+
+This lesson uses data hosted on an Amazon Machine Instance (AMI). Workshop participants will be given information on how
+to log in to the AMI during the workshop. Learners using these materials for self-directed study will need to set up their own
+AMI. Information on setting up an AMI and accessing the required data is provided on the [Genomics Workshop setup page](https://www.datacarpentry.org/genomics-workshop/setup.html).
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/_extras/guide.md b/instructors/instructor-notes.md
similarity index 94%
rename from _extras/guide.md
rename to instructors/instructor-notes.md
index 720986c5..4f27185f 100644
--- a/_extras/guide.md
+++ b/instructors/instructor-notes.md
@@ -1,6 +1,5 @@
---
-layout: page
-title: "Instructor Notes"
+title: Instructor Notes
---
# Instructor Notes for Wrangling Genomics
@@ -9,78 +8,90 @@ title: "Instructor Notes"
Users are required to open *multiple* HTML files locally in their own browser, and the exact steps will vary with each user's browser and OS. On most systems, ctrl-click multiple selection within a file browser followed by a right click should work.
-
## SAMtools or IGV?
-Some instructors chose to use SAMtools tview for visualization of variant calling results while other prefer than IGV. SAMtools is the default because installation of IGV can take up additional instruction time, and SAMtools tview is sufficient to visualize results. However, episode 04-variant_calling includes instructions for installation and using IGV.
+
+Some instructors choose to use SAMtools tview for visualization of variant calling results, while others prefer IGV. SAMtools is the default because installation of IGV can take up additional instruction time, and SAMtools tview is sufficient to visualize results. However, episode 04-variant\_calling includes instructions for installing and using IGV.
## Commands with Lengthy Run Times
#### Raw Data Downloads
+
+The fastq files take about 15 minutes to download. This would be a good time to discuss the overall workflow of this lesson as illustrated by the graphic integrated on the page. It is recommended to start this lesson with the commands to create and move into the `data/untrimmed_fastq` directory and begin the download, and while the files download, cover the "Bioinformatics Workflows" and "Starting with Data" texts. Beware that the last fastq file in the list takes the longest to download (~6-8 mins).
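+
+For reference, the commands at that point in the lesson are roughly the following (a
+sketch assuming the standard `~/dc_workshop` layout; check the episode for the exact
+commands):
+
+```bash
+mkdir -p ~/dc_workshop/data/untrimmed_fastq   # create the raw-data directory
+cd ~/dc_workshop/data/untrimmed_fastq         # move into it before starting the downloads
+```
+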
#### Running FastQC
+
The FastQC analysis on all raw reads takes about 10 minutes to run. It is a good idea to have learners start this command and cover the FastQC background material and images while FastQC runs.
#### Trimmomatic
+
The trimmomatic for loop will take about 10 minutes to run. Perhaps this would be a good time for a coffee break or a discussion about trimming.
#### bcftools mpileup
+
The bcftools mpileup command will take about 5 minutes to run. It is:
-~~~
+```
bcftools mpileup -O b -o results/bcf/SRR2584866_raw.bcf \
-f data/ref_genome/ecoli_rel606.fasta results/bam/SRR2584866.aligned.sorted.bam
-~~~
+```
## Commands that must be modified
+
There are several example commands that will not run correctly if copied and pasted directly into the terminal. They will need to be modified to fit each user. There is text around the commands outlining how they need to be changed, but it's helpful to be aware of them ahead of time as an instructor so you can set them up properly.
#### scp Command to Download FastQC to local machines
-In the FastQC section, learners will download FastQC output files in order to open '.html `.html` summary files on their local machines in a web browser. The scp command currently contains a public DNS (for example, `ec2-34-238-162-94.compute-1.amazonaws.com`), but this will need to be replaced with the public DNS of the machine used by each learner. The Public DNS for each learner will be the same one they use to log in. The password will be provided to the Instructor when they receive instance information and will be the same for all learners.
-Command as is:
-~~~
+In the FastQC section, learners will download FastQC output files in order to open `.html` summary files on their local machines in a web browser. The scp command currently contains a public DNS (for example, `ec2-34-238-162-94.compute-1.amazonaws.com`), but this will need to be replaced with the public DNS of the machine used by each learner. The Public DNS for each learner will be the same one they use to log in. The password will be provided to the Instructor when they receive instance information and will be the same for all learners.
+
+Command as is:
+
+```
scp dcuser@ec2-34-238-162-94.compute-1.amazonaws.com:~/dc_workshop/results/fastqc_untrimmed_reads/*.html ~/Desktop/fastqc_html
-~~~
+```
Command for learners to use:
-~~~
+
+```
scp dcuser@:~/dc_workshop/results/fastqc_untrimmed_reads/*.html ~/Desktop/fastqc_html
-~~~
+```
#### The unzip for loop
+
The for loop to unzip FastQC output will not work if directly copied and pasted:
-~~~
+```
$ for filename in *.zip
> do
> unzip $filename
> done
-~~~
+```
This is because the `>` symbol will cause a syntax error when copied. The command will work correctly when typed at the command line! Learners may be surprised that a for loop takes multiple lines on the terminal.
#### unzip in Working with FastQC Output
+
The command `unzip *.zip` in the Working with FastQC Output section will run successfully for the first file, but fail for subsequent files. This error introduces the need for a for loop.
#### Example Trimmomatic Command
+
The first trimmomatic command serves as an explanation of trimmomatic parameters and is not meant to be run. The command is:
-~~~
+```
$ trimmomatic PE -threads 4 SRR_1056_1.fastq SRR_1056_2.fastq \
SRR_1056_1.trimmed.fastq SRR_1056_1un.trimmed.fastq \
SRR_1056_2.trimmed.fastq SRR_1056_2un.trimmed.fastq \
ILLUMINACLIP:SRR_adapters.fa SLIDINGWINDOW:4:20
-~~~
+```
The correct syntax is outlined in the next section, Running Trimmomatic.
#### Actual Trimmomatic Command
+
The actual trimmomatic command is a complicated for loop. It will need to be typed out by learners because the `>` symbols will raise an error if copied and pasted.
For reference, this command is:
-~~~
+```
$ for infile in *_1.fastq.gz
> do
> base=$(basename ${infile} _1.fastq.gz)
@@ -89,23 +100,26 @@ $ for infile in *_1.fastq.gz
> ${base}_2.trim.fastq.gz ${base}_2un.trim.fastq.gz \
> SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
> done
-~~~
+```
#### bwa mem Example Command
+
The first bwa mem command is an example and is not meant to be run. It is:
-~~~
+```
# bwa mem ref_genome.fasta input_file_R1.fastq input_file_R2.fastq > output.sam
-~~~
+```
The correct command follows:
-~~~
+```
$ bwa mem data/ref_genome/ecoli_rel606.fasta data/trimmed_fastq_small/SRR2584866_1.trim.sub.fastq data/trimmed_fastq_small/SRR2584866_2.trim.sub.fastq > results/sam/SRR2584866.aligned.sam
-~~~
+```
#### The Automation Episode
-The code blocks at the beginning of the automation episode (05-automation.md) are examples of for loops and scripts and are not meant to be run by learners. The first code chunks that should be run are under Analyzing Quality with FastQC.
+
+The code blocks at the beginning of the automation episode (05-automation.md) are examples of for loops and scripts and are not meant to be run by learners. The first code chunks that should be run are under Analyzing Quality with FastQC.
Also, after the first code chunk meant to be run, there is a line that reads only `read_qc.sh`, which will yield a message saying that the command wasn't found. Once the script has been created, this line will run it.
+
diff --git a/_extras/discuss.md b/learners/discuss.md
similarity index 58%
rename from _extras/discuss.md
rename to learners/discuss.md
index bfc33c50..515e3baf 100644
--- a/_extras/discuss.md
+++ b/learners/discuss.md
@@ -1,6 +1,9 @@
---
title: Discussion
---
+
FIXME
-{% include links.md %}
+
+
+
diff --git a/reference.md b/learners/reference.md
similarity index 58%
rename from reference.md
rename to learners/reference.md
index 8138fe9f..456f9409 100644
--- a/reference.md
+++ b/learners/reference.md
@@ -1,7 +1,9 @@
---
-layout: reference
+title: 'Glossary'
---
## Glossary
FIXME
+
+
diff --git a/learners/setup.md b/learners/setup.md
new file mode 100644
index 00000000..244dced1
--- /dev/null
+++ b/learners/setup.md
@@ -0,0 +1,10 @@
+---
+title: Setup
+---
+
+This workshop is designed to be run on pre-imaged Amazon Web Services
+(AWS) instances. For information about how to
+use the workshop materials, see the
+[setup instructions](https://www.datacarpentry.org/genomics-workshop/setup.html) on the main workshop page.
+
+
diff --git a/profiles/learner-profiles.md b/profiles/learner-profiles.md
new file mode 100644
index 00000000..434e335a
--- /dev/null
+++ b/profiles/learner-profiles.md
@@ -0,0 +1,5 @@
+---
+title: FIXME
+---
+
+This is a placeholder file. Please add content here.
diff --git a/setup.md b/setup.md
deleted file mode 100644
index 200959b1..00000000
--- a/setup.md
+++ /dev/null
@@ -1,9 +0,0 @@
----
-layout: page
-title: Setup
----
-
-This workshop is designed to be run on pre-imaged Amazon Web Services
-(AWS) instances. For information about how to
-use the workshop materials, see the
-[setup instructions](http://www.datacarpentry.org/genomics-workshop/setup.html) on the main workshop page.
diff --git a/site/README.md b/site/README.md
new file mode 100644
index 00000000..42997e3d
--- /dev/null
+++ b/site/README.md
@@ -0,0 +1,2 @@
+This directory contains rendered lesson materials. Please do not edit files
+here.