Merge pull request #78 from data-lessons/gh-pages
Wrangling Genomics Integration
ErinBecker authored Feb 3, 2019
2 parents d039258 + 6e81cb0 commit 160e45d
Showing 8 changed files with 94 additions and 70 deletions.
45 changes: 15 additions & 30 deletions _config.yml
@@ -22,28 +22,21 @@ kind: "lesson"

# Magic to make URLs resolve both locally and on GitHub.
# See https://help.github.com/articles/repository-metadata-on-github-pages/.
# Please don't change it: <USERNAME>/<PROJECT> is correct.
repository: <USERNAME>/<PROJECT>

# Email address, no mailto:
email: "[email protected]"

# Sites.
amy_site: "https://amy.software-carpentry.org/workshops"
carpentries_github: "https://github.com/carpentries"
carpentries_pages: "https://carpentries.github.io"
carpentries_site: "https://carpentries.org/"
dc_site: "http://datacarpentry.org"
example_repo: "https://github.com/carpentries/lesson-example"
example_site: "https://carpentries.github.io/lesson-example"
lc_site: "https://librarycarpentry.github.io/"
swc_github: "https://github.com/swcarpentry"
swc_pages: "https://swcarpentry.github.io"
swc_site: "https://software-carpentry.org"
template_repo: "https://github.com/carpentries/styles"
training_site: "https://carpentries.github.io/instructor-training"
workshop_repo: "https://github.com/carpentries/workshop-template"
workshop_site: "https://carpentries.github.io/workshop-template"
swc_pages: "https://swcarpentry.github.io"
lc_site: "http://librarycarpentry.github.io/"
template_repo: "https://github.com/swcarpentry/styles"
example_repo: "https://github.com/swcarpentry/lesson-example"
example_site: "https://swcarpentry.github.com/lesson-example"
workshop_repo: "https://github.com/swcarpentry/workshop-template"
workshop_site: "https://swcarpentry.github.io/workshop-template"
training_site: "https://swcarpentry.github.io/instructor-training"

# Surveys.
pre_survey: "https://www.surveymonkey.com/r/swc_pre_workshop_v1?workshop_id="
@@ -57,34 +50,26 @@ start_time: 0
collections:
episodes:
output: true
permalink: /:path/index.html
permalink: /:path/
extras:
output: true
permalink: /:path/index.html

# Set the default layout for things in the episodes collection.
defaults:
- values:
root: .
layout: page
root: ..
- scope:
path: ""
type: episodes
values:
root: ..
layout: episode
- scope:
path: ""
type: extras
values:
root: ..
layout: page

# Files and directories that are not to be copied.
exclude:
- Makefile
- bin/
- .Rproj.user/
- bin

# Turn off built-in syntax highlighting.
highlighter: false

# Turn on built-in syntax highlighting.
highlighter: rouge
theme: jekyll-theme-minimal
11 changes: 4 additions & 7 deletions _episodes/01-tidiness.md
@@ -75,30 +75,27 @@ Independent of the type of data you're collecting, there are standard ways to en

The cardinal rules of using spreadsheet programs for data:

- Leave the raw data raw - don’t change it!
- Put each observation or sample in its own row.
- Put all your variables in columns - the things that vary between samples, like ‘strain’ or ‘DNA-concentration’.
- Have column names be explanatory, but not have spaces. Use '-', '_' or [camel case](https://en.wikipedia.org/wiki/Camel_case) instead of a space. For instance 'library-prep-method' or 'LibraryPrep' is better than 'library preparation method' or 'prep', because computers interpret spaces in particular ways.
- Don’t combine multiple pieces of information in one cell. Sometimes it just seems like one thing, but think if that’s the only way
you’ll want to be able to use or sort that data. For example, instead of having a column with species and strain name (e.g. *E. coli*
K12) you would have one column with the species name (*E. coli*) and another with the strain name (K12). Depending on the type of
analysis you want to do, you may even separate the genus and species names into distinct columns.
- Leave the raw data raw - don’t change it!
- Export the cleaned data to a text-based format like CSV (comma-separated values) format. This ensures that anyone can use the data, and is required by most data repositories.
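The column-naming and one-value-per-cell rules above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lesson: the helper names `clean_column_name` and `split_species_strain` are hypothetical, and the split assumes the strain is always the last space-separated word.

```python
import re
from typing import Tuple


def clean_column_name(name: str) -> str:
    """Return a column name with no spaces or other awkward characters."""
    # Trim the ends, then turn each run of whitespace into one underscore.
    cleaned = re.sub(r"\s+", "_", name.strip())
    # Drop anything that is not a letter, digit, underscore or hyphen.
    return re.sub(r"[^A-Za-z0-9_-]", "", cleaned)


def split_species_strain(value: str) -> Tuple[str, str]:
    """Split a combined cell such as 'E. coli K12' on its last space."""
    # Assumption: the strain is the final word; everything before it is the species.
    species, _, strain = value.strip().rpartition(" ")
    return species, strain


print(clean_column_name("library preparation method"))
print(split_species_strain("E. coli K12"))
```

If a cell contains no space at all, `split_species_strain` returns an empty species and puts the whole value in the strain slot, so it is worth checking the result for such rows by hand.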

[![Messy spreadsheet](../fig/01_tidiness_datasheet_example_messy.png)](https://github.com/datacarpentry/organization-genomics/blob/gh-pages/files/SampleSheet_Example_messy.csv?raw=true)
![Messy spreadsheet](../fig/01_tidiness_datasheet_example_messy.png)

> ## Exercise
> This is some potential spreadsheet data for an experiment being submitted for sequencing. The program [bcl2fastq](https://support.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/bcl2fastq/bcl2fastq_letterbooklet_15038058brpmi.pdf) requires this spreadsheet to use as input to demultiplex the sequencing data into separate files, one per sample. With the person next to you, for about 2 minutes, discuss some of the problems with the spreadsheet data shown above.
> This is some potential spreadsheet data generated about a sequencing experiment. With the person next to you, for about 2 minutes, discuss some of the problems with the spreadsheet data shown above. You can look at the image, or download the file to your computer via the link and open it in a spreadsheet reader like Excel.
>
>
> > ## Solution
> > A full set of types of issues with spreadsheet data is at the [Data Carpentry Ecology spreadsheet lesson](http://www.datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes/). Not all are present in this example. Discuss with the group what they found. The main problem is there are characters in the ids that aren't allowed, e.g. ",", ".", "-", "&" or spaces. Here is a "clean" version of the same spreadsheet:
> > A full set of types of issues with spreadsheet data is at the [Data Carpentry Ecology spreadsheet lesson](http://www.datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes/). Not all are present in this example. Discuss with the group what they found. Some problems include: not all datasets have the same columns, datasets are split into their own tables, color is used to encode information, column names differ between tables, and some column names contain spaces. Here is a "clean" version of the same spreadsheet:
> >
> >[Cleaned spreadsheet](https://github.com/datacarpentry/organization-genomics/blob/gh-pages/files/SampleSheet_Example_clean.csv?raw=true)
> >
> >File and info provided by [Dr. Olga Botvinnik](https://github.com/olgabot) at [CZ Biohub](https://github.com/czbiohub).
> >
> >
> {: .solution}
{: .challenge}

34 changes: 22 additions & 12 deletions _episodes/03-ncbi-sra.md
@@ -17,18 +17,16 @@ There are many repositories for public data. Some model organisms or fields have

# Accessing the original archived data

The [sequencing dataset (from Blount paper) adapted for this lesson](http://www.datacarpentry.org/organization-genomics/data/) was obtained from the [NCBI Sequence Read Archive](http://www.ncbi.nlm.nih.gov/sra) which is a large (>3 quadrillion basepairs as of 2014) repository for next-generation sequence data. Like many NCBI databases, it is complex and mastering its use is greater than the scope of this lesson. Very often, as in the Blount paper, there will be a direct link (perhaps in the supplemental information) to where on the SRA the dataset can be found. E.g. the link from the Blount paper is [http://www.ncbi.nlm.nih.gov/sra?term=SRA026813](http://www.ncbi.nlm.nih.gov/sra?term=SRA026813)
The [sequencing dataset (from the Tenaillon paper) adapted for this lesson](http://www.datacarpentry.org/organization-genomics/data/) was obtained from the [NCBI Sequence Read Archive](http://www.ncbi.nlm.nih.gov/sra), a large (>3 quadrillion basepairs as of 2014) repository for next-generation sequence data. Like many NCBI databases, it is complex, and mastering its use is beyond the scope of this lesson. Very often, as in the Tenaillon paper, there will be a direct link (perhaps in the supplemental information) to where on the SRA the dataset can be found. E.g. the link from the Tenaillon paper is [http://www.ncbi.nlm.nih.gov/sra?term=SRA026813](http://www.ncbi.nlm.nih.gov/sra?term=SRA026813)

## Locate the Run Accessory Table for the Lenski Dataset on the SRA

1. Access the Blount dataset from the provided link: [http://www.ncbi.nlm.nih.gov/sra?term=SRA026813](http://www.ncbi.nlm.nih.gov/sra?term=SRA026813).
You will be presented with a page for the overall SRA accession SRA026813 - this is a collection of all the experimental data.
1. Access the Tenaillon dataset from the provided link: [https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP064605](https://trace.ncbi.nlm.nih.gov/Traces/sra/?acc=SRP064605).
You will be presented with a page for the overall SRA accession SRP064605 - this is a collection of all the experimental data.

2. Click on the first entry ([ZDB30](http://www.ncbi.nlm.nih.gov/sra/SRX040669%5Baccn%5D)). This will take you to a page for an SRX (Sequence Read eXperiment). Take a few minutes to examine some of the descriptions on the page.
2. Click on the first entry ([REL4541B](https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR2591054)). This will take you to a page that is a run browser. Take a few minutes to examine some of the descriptions on the page.

3. Click on the ['All runs'](http://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP004752) link under where it says **Study**. This is a description of all of the NGS datasets related to the experiment.

4. Go to the top of the page and in the **Total** row you will see there are 37 runs, 10.15Gb data, and 16.45 Gbases of data. Click the 'RunInfo Table' button and save the file locally.
3. Go back to the ['previous page'](https://trace.ncbi.nlm.nih.gov/Traces/sra/?acc=SRP064605). At the top of the page, in the **Total** row, you will see there are 312 runs, 109.43 Gb data, and 168.81 Gbases of data. Click the 'RunInfo Table' button and save the file locally.

We are not downloading any actual sequence data here! This is only a text file that fully describes the entire
dataset.
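Because the run table is plain text, it can also be inspected without a spreadsheet program. The sketch below assumes a tab-delimited file with a header row; the column names and values in the stand-in data are made up for illustration, and `summarize_run_table` is a hypothetical helper, not something provided by NCBI.

```python
import csv
import io


def summarize_run_table(handle, delimiter="\t"):
    """Count the data rows and list the column names of a delimited run table."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    rows = list(reader)  # reads every data row; the header is consumed first
    return len(rows), reader.fieldnames


# Tiny stand-in for SraRunTable.txt; these columns and values are invented.
sample = io.StringIO(
    "Run\tLibraryLayout\tstrain\n"
    "RUN0001\tPAIRED\tstrainA\n"
    "RUN0002\tSINGLE\tstrainB\n"
)
n_rows, columns = summarize_run_table(sample)
print(n_rows, columns)
```

For the downloaded file, the same function could be called inside `with open("SraRunTable.txt", newline="") as handle:`, passing `delimiter=","` if the file turns out to be comma-separated. The file is only read, never written, so the raw table stays unchanged.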
@@ -52,8 +50,19 @@ Using your choice of spreadsheet program open the `SraRunTable.txt` file. If pro
> 5. Are you collecting this kind of information about your sequencing runs?
{: .challenge}

After answering the questions, you should avoid saving this file. We don't want to make any changes. If you were to save this file, make sure you save it as a plain `.txt` file.
After answering the questions, you should avoid saving any changes you might have made to this file. We don't want to make any changes. If you were to save this file, make sure you save it as a plain `.txt` file.

## Downloading a few sequencing files: EMBL-EBI

The SRA does not support direct download of fastq files from its webpage. However, the [European Nucleotide Archive](https://www.ebi.ac.uk/ena) does. Let's see how we can get a download link to a file we are interested in.

1. Navigate to the [ENA](https://www.ebi.ac.uk/ena).

2. In the search bar, type in `SRR2589044`. Make sure there are no spaces after the accession number, and press search.

3. You will see a table with information about the sample. In the table, there is a header "FASTQ files (FTP)". If you wanted to download the files to your computer, you could click on the links to download the files. Alternatively, right click and copy the URL to save it for later. We don't need to download these files right now, and because they are large we won't put them on our computers now.

We don't recommend downloading large numbers of sequencing files this way. For that, the NCBI has made a software package called the `sra-toolkit`. However, for a couple files, it's often easier to go through the ENA.
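For a handful of runs, the FTP directory can often be predicted from the accession alone. The sketch below rests on an assumption about how the ENA FTP server lays out its directories (first six characters of the accession, then a zero-padded subdirectory built from any digits beyond the ninth character). The link copied from the "FASTQ files (FTP)" column remains the authoritative source, and `ena_fastq_dir` is a hypothetical helper.

```python
def ena_fastq_dir(run_accession: str) -> str:
    """Guess the ENA FTP directory for a run accession (assumed layout)."""
    # Stray spaces around the accession are a common cause of failed lookups.
    acc = run_accession.strip()
    prefix = acc[:6]
    extra_digits = len(acc) - 9  # digits beyond the first nine characters
    if extra_digits <= 0:
        subdir = ""
    else:
        # Zero-pad the trailing digits to three characters, e.g. "4" -> "004".
        subdir = acc[-extra_digits:].zfill(3) + "/"
    return f"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/{prefix}/{subdir}{acc}/"


print(ena_fastq_dir("SRR2589044"))
```

The function only builds a string and does not contact the server, so the result should be compared against the link shown in the ENA table before it is used in a download command.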

## Where to learn more

@@ -64,7 +73,8 @@ After answering the questions, you should avoid saving this file. We don't want

#### References

Blount, Z.D., Barrick, J.E., Davidson, C.J., Lenski, R.E.
Genomic analysis of a key innovation in an experimental Escherichia coli population (2012) Nature, 489 (7417), pp. 513-518.
[Paper](https://www.ncbi.nlm.nih.gov/pubmed/22992527), [Supplemental materials](https://www.nature.com/nature/journal/v489/n7417/full/nature11514.html#supplementary-information)
Data on NCBI SRA: [http://www.ncbi.nlm.nih.gov/sra?term=SRA026813](http://www.ncbi.nlm.nih.gov/sra?term=SRA026813)
Tenaillon O, Barrick JE, Ribeck N, Deatherage DE, Blanchard JL, Dasgupta A, Wu GC, Wielgoss S, Cruveiller S, Médigue C, Schneider D, Lenski RE.
Tempo and mode of genome evolution in a 50,000-generation experiment (2016) Nature. 536(7615): 165–170.
[Paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/), [Supplemental materials](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/#)
Data on NCBI SRA: [https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP064605](https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP064605)
Data on EMBL-EBI ENA: [https://www.ebi.ac.uk/ena/data/view/PRJNA295606](https://www.ebi.ac.uk/ena/data/view/PRJNA295606)
35 changes: 14 additions & 21 deletions _extras/data.md
@@ -14,34 +14,27 @@ This dataset was selected for our exercise on NGS Data Carpentry for several rea

# Introduction to the dataset

Microbes are ideal organisms for exploring 'Long-term Evolution Experiments' (LTEEs) - thousands of generations can be generated and stored in a way that would be virtually impossible for more complex eukaryotic systems. In [Blount et al 2012](https://www.ncbi.nlm.nih.gov/pubmed/22992527), 12 populations of *Escherichia coli* were propagated for more than 40,000 generations in a glucose-limited minimal medium. This medium was supplemented with citrate which *E. coli* cannot metabolize in the aerobic conditions of the experiment. Sequencing of the populations at regular time points reveals that spontaneous citrate-using mutants (Cit+) appeared in a population of *E.coli* (designated Ara-3) at around 31,000 generations. It should be noted that spontaneous Cit+ mutants are extraordinarily rare - inability to metabolize citrate is one of the defining characters of the *E. coli* species. Eventually, Cit+ mutants became the dominant population as the experimental growth medium contained a high concentration of citrate relative to glucose.
Microbes are ideal organisms for exploring 'Long-term Evolution Experiments' (LTEEs) - thousands of generations can be generated and stored in a way that would be virtually impossible for more complex eukaryotic systems. In [Tenaillon et al 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/), 12 populations of *Escherichia coli* were propagated for more than 50,000 generations in a glucose-limited minimal medium. This medium was supplemented with citrate which *E. coli* cannot metabolize in the aerobic conditions of the experiment. Sequencing of the populations at regular time points reveals that spontaneous citrate-using mutants (Cit+) appeared in a population of *E. coli* (designated Ara-3) at around 31,000 generations. It should be noted that spontaneous Cit+ mutants are extraordinarily rare - inability to metabolize citrate is one of the defining characters of the *E. coli* species. Eventually, Cit+ mutants became the dominant population as the experimental growth medium contained a high concentration of citrate relative to glucose. Around the same time that this mutation emerged, another phenotype became prominent in the Ara-3 population. Many *E. coli* began to develop excessive numbers of mutations, meaning they became hypermutable.

Strains from generation 0 to generation 40,000 were sequenced, including ones that were both Cit+ and Cit- after generation 31,000.
Strains from generation 0 to generation 50,000 were sequenced, including Cit+, Cit- and hypermutable strains from later generations.

For the purposes of this workshop we're going to be working with 6 of the sequence reads from this experiment. We also made up genome sizes for each of the strains, to look at the relationship between Cit status and genome size. **The genome sizes are not real data!!**
For the purposes of this workshop we're going to be working with 3 of the sequence reads from this experiment.

| SRA Run Number | Clone | Generation | Cit | Hypermutable | Read Length | Sequencing Depth |
| -------------- | ----- | ---------- | --- | ------------ | ----------- | ---------------- |
| SRR2589044 | REL2181A | 5,000 | Unknown | None | 150 | 60.2 |
| SRR2584863 | REL7179B | 15,000 | Unknown | None | 150 | 88 |
| SRR2584866 | REL11365 | 50,000 | Cit+ | plus | 150 | 138.3 |

| SRA Run Number | Clone | Generation | Cit | GenomeSize |
| -------------- | ----- | ---------- | ----- | ----- |
| SRR098028 | REL1166A | 2,000 | Unknown | 4.63 |
| SRR098281 | ZDB409 | 5,000 | Unknown | 4.6 |
| SRR098283 | ZDB446 | 15,000 | Cit- | 4.66 |
| SRR097977 | CZB152 | 33,000 | Cit+ | 4.8 |
| SRR098026 | CZB154 | 33,000 | Cit+ | 4.76 |
| SRR098027 | CZB199 | 33,000 | Cit- | 4.59 |
We want to be able to look at differences in mutation rates between hypermutable and non-hypermutable strains. We also want to analyze the sequences to figure out what changes occurred in genomes to make the strains Cit+. Ultimately, we will answer the questions:


We want to be able to look at the genome size to see if there is a difference between genome size and the Cit status of the strain. We also want to analyze the sequences to figure out what changes occurred in genomes to make the strains Cit+. Ultimately, we will answer the questions:

- What is the distribution of genome sizes for all the strains?
- Is there a relationship between genome size and Cit status?
- How many base pair changes are there between the Cit+ and Cit- strains?
- What are the base pair changes between strains?
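To make the two questions above concrete, here is a toy sketch that counts and lists single-base differences between two sequences. It assumes the sequences are already aligned and of equal length, and the example strings are made up. Real comparisons between strains go through read alignment and variant calling, which this does not attempt.

```python
def list_base_changes(seq_a: str, seq_b: str):
    """Return (position, base_a, base_b) for every mismatching position."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = zip(seq_a.upper(), seq_b.upper())
    # Positions are reported 1-based, as is conventional for sequence coordinates.
    return [(i, a, b) for i, (a, b) in enumerate(pairs, start=1) if a != b]


def count_base_changes(seq_a: str, seq_b: str) -> int:
    """Count positions at which two aligned sequences differ."""
    return len(list_base_changes(seq_a, seq_b))


# Made-up sequences for illustration only.
reference = "ACGTACGT"
evolved = "ACGAACGG"
print(count_base_changes(reference, evolved))
print(list_base_changes(reference, evolved))
```

Insertions and deletions shift every downstream position, which is why the function refuses sequences of unequal length rather than guessing at an alignment.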


## References

Blount, Z.D., Barrick, J.E., Davidson, C.J., Lenski, R.E.
Genomic analysis of a key innovation in an experimental Escherichia coli population (2012) Nature, 489 (7417), pp. 513-518.
[Paper](https://www.ncbi.nlm.nih.gov/pubmed/22992527), [Supplemental materials](https://www.nature.com/nature/journal/v489/n7417/full/nature11514.html#supplementary-information)
Data on NCBI SRA: [http://www.ncbi.nlm.nih.gov/sra?term=SRA026813](http://www.ncbi.nlm.nih.gov/sra?term=SRA026813)
Tenaillon O, Barrick JE, Ribeck N, Deatherage DE, Blanchard JL, Dasgupta A, Wu GC, Wielgoss S, Cruveiller S, Médigue C, Schneider D, Lenski RE.
Tempo and mode of genome evolution in a 50,000-generation experiment (2016) Nature. 536(7615): 165–170.
[Paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/), [Supplemental materials](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/#)
Data on NCBI SRA: [https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP064605](https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP064605)
Data on EMBL-EBI ENA: [https://www.ebi.ac.uk/ena/data/view/PRJNA295606](https://www.ebi.ac.uk/ena/data/view/PRJNA295606)
Binary file modified fig/01_tidiness_datasheet_example_messy.png
39 changes: 39 additions & 0 deletions files/Ecoli_metadata_composite_messy.html

Large diffs are not rendered by default.

Binary file added files/Ecoli_metadata_composite_messy.pdf
Binary file not shown.
Binary file added files/Ecoli_metadata_composite_messy.xlsx
Binary file not shown.
