Merge pull request #118 from BinxiePeterson/gh-pages
Fixed typos and small things (@raynamharris suggestion to follow)
Thanks @BinxiePeterson !
hoytpr authored Aug 1, 2019
2 parents 7460558 + 49bcb45 commit a0fa47d
Showing 7 changed files with 37 additions and 37 deletions.
8 changes: 4 additions & 4 deletions CONTRIBUTING.md
@@ -69,7 +69,7 @@ and to meet some of our community members.
There are many ways to contribute,
from writing new exercises and improving existing ones
to updating or filling in the documentation
and and submitting [bug reports][issues]
and submitting [bug reports][issues]
about things that don't work, aren't clear, or are missing.
If you are looking for ideas,
please see [the list of issues for this repository][issues],
@@ -130,11 +130,11 @@ and have final say over what gets merged into the lesson.

## Issues Labels

What issue labels in this repository mean? Think if you can asign one of the below labels to your issue.
What issue labels in this repository mean? Think if you can assign one of the below labels to your issue.
This will help maintainers to decide how to act and should result in quicker response to the issues.
If you don't assign the label to an issue it will be assigned by the maintainers.
If you don't assign the label to an issue, it will be assigned by the maintainers.

- **duplicate** means there is other issue reporting same problem
- **duplicate** means there is another issue reporting the same problem
- **enhancement** improvement to the existing content
- **help wanted** maintainers invite contributors to tackle this issue
- **invalid** is not considered an issue by the maintainers
4 changes: 2 additions & 2 deletions LICENSE.md
@@ -25,8 +25,8 @@ Under the following terms:

* **Attribution**---You must give appropriate credit (mentioning that
your work is derived from work that is Copyright © Software
Carpentry and, where practical, linking to
http://software-carpentry.org/), provide a [link to the
Carpentry or Data Carpentry and, where practical, linking to
http://software-carpentry.org/ or https://datacarpentry.org), provide a [link to the
license][cc-by-human], and indicate if changes were made. You may do
so in any reasonable manner, but not in any way that suggests the
licensor endorses you or your use.
26 changes: 13 additions & 13 deletions _episodes/01-tidiness.md
@@ -4,14 +4,14 @@ teaching: 20
exercises: 10
questions:
- "What metadata should I collect?"
- "How should I structure my sequencing data and metadata"
- "How should I structure my sequencing data and metadata?"
objectives:
- "Think about and understand the types of metadata a sequencing experiment will generate."
- "Understand the importance of metadata and potential metadata standards"
- "Explore common formatting challenges in spreadsheet data"
- "Understand the importance of metadata and potential metadata standards."
- "Explore common formatting challenges in spreadsheet data."
keypoints:
- "Metadata is key for you and others to be able to work with your data"
- "Tabular data needs to be structured to be able to work with it effectively"
- "Metadata is key for you and others to be able to work with your data."
- "Tabular data needs to be structured to be able to work with it effectively."
---

# Introduction
@@ -21,20 +21,20 @@ When we think about the data for a sequencing project, we often start by thinkin
> ## Discussion
> With the person next to you, discuss:
>
> What kinds of data and information have you generated before you send your DNA/RNA off for sequencing?
> What kinds of data and information have you generated before you sent your DNA/RNA off for sequencing?
>
> > ## Solution
> > Types of files and information you have generated:
> > - spreadsheet or tabular data with the data from your experiment and whatever you were measuring for your study
> > - lab notebook notes about how you conducted those experiments
> > - spreadsheet or tabular data about the samples you sent off for sequencing. Sequencing centers often have a particular format they need with the name of the sample, DNA concentration and other information.
> > - lab notebook notes about how you prepared the DNA/RNA for sequencing and what type of sequencing you're doing, e.g. paired end Illumina HiSeq.
> > - Spreadsheet or tabular data with the data from your experiment and whatever you were measuring for your study.
> > - Lab notebook notes about how you conducted those experiments.
> > - Spreadsheet or tabular data about the samples you sent off for sequencing. Sequencing centers often have a particular format they need with the name of the sample, DNA concentration and other information.
> > - Lab notebook notes about how you prepared the DNA/RNA for sequencing and what type of sequencing you're doing, e.g. paired end Illumina HiSeq.
> > There likely will be other ideas here too.
> > Was this more information and data than you were expecting?
> {: .solution}
{: .challenge}

All of the data and information just discussed can be considered metadata, data about the data. We want to follow a few guidelines for metadata.
All of the data and information just discussed can be considered metadata, i.e. data about the data. We want to follow a few guidelines for metadata.

## Notes

@@ -72,7 +72,7 @@ consistent and can be used across the field.

### Structuring data in spreadsheets

Independent of the type of data you're collecting, there are standard ways to enter that data into the spreadsheet, to make it easier to analyze later. We often enter data that makes it easy for us as humans to read and work with it, because we're human! Computers need data structured in a way that they can use it, so to use this data in a computational workflow, we need to think like computers when we use spreadsheets.
Independent of the type of data you're collecting, there are standard ways to enter that data into the spreadsheet, to make it easier to analyze later. We often enter data that makes it easy for us as humans to read and work with it, because we're human! Computers need data structured in a way that they can use it. So to use this data in a computational workflow, we need to think like computers when we use spreadsheets.

The cardinal rules of using spreadsheet programs for data:

@@ -96,7 +96,7 @@ analysis you want to do, you may even separate the genus and species names into
> > A full set of types of issues with spreadsheet data is at the [Data Carpentry Ecology spreadsheet lesson](http://www.datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes/). Not all are present in this example. Discuss with the group what they found. Some problems include not all data sets having the same columns, datasets split into their own tables, color to encode information, different column names, spaces in some columns names. Here is a "clean" version of the same spreadsheet:
> >
> >[Cleaned spreadsheet](https://raw.githubusercontent.com/datacarpentry/wrangling-genomics/gh-pages/files/Ecoli_metadata_composite.tsv)
> >
> >Download the file using right-click (PC)/command-click (Mac).
> {: .solution}
{: .challenge}

12 changes: 6 additions & 6 deletions _episodes/02-project-planning.md
@@ -7,7 +7,7 @@ questions:
- "What information does a sequencing facility need?"
- "What are the guidelines for data storage?"
objectives:
- Understand the data we send to and get back from a sequencing center
- Understand the data we send to and get back from a sequencing center.
- Make decisions about how (if) data will be stored, archived, shared, etc.
keypoints:
- "Data being sent to a sequencing center also needs to be structured so you can use it."
@@ -22,7 +22,7 @@ methods and approaches we need in bioinformatics are the same ones we need at th

> ## Discussion
>
> Before we go any further here are some important questions to consider. If you are learning at a workshop,
> Before we go any further, here are some important questions to consider. If you are learning at a workshop,
> please discuss these questions with your neighbor.
>
>
@@ -58,14 +58,14 @@ with Excel or another spreadsheet program.
> > - Capitalization of the replicate column changes
> > - Volume and concentration column headers have unusual (not allowed) characters
> > - Volume, concentration, and RIN column decimal accuracy changes
> > - The prep_date and ship_date formats are different, prep_date has multiple formats
> > - The prep_date and ship_date formats are different, and prep_date has multiple formats
> > - Are there others not mentioned?
> >
> > Improvements in naming
> > - Shorten client_sample_id names, and maybe just call them "names"
> > - For example: "wt" for "wild-type". Also, they are all "1hr" so that is superfluous information
> > - For example: "wt" for "wild-type". Also, they are all "1hr", so that is superfluous information
> > - The prep_date and ship_date might not be needed
> > - Use "microliters" for "Volume (µL)" etc.
> > - Use "microliters" for "Volume (µL)" etc.
> >
> > Errors hard to spot:
> > - No space between "wild" and "type", repeated barcode numbers, missing data, duplicate names
@@ -108,7 +108,7 @@ The raw data you get back from the sequencing center is the foundation of your s

### Guidelines for storing data

- Store the data in a place that is accessible by you and other members of your lab. At a minimum, you and the head of your lab should have access
- Store the data in a place that is accessible by you and other members of your lab. At a minimum, you and the head of your lab should have access.
- Store the data in a place that is redundantly backed up. It should be backed up in two locations that are in different physical areas.
- Leave the raw data raw. You will be working with this data, but you don't want to modify this stored copy of the original data. If you modify the data, you'll never be able to access those original files. We will cover how to avoid accidentally changing files in a later lesson in this workshop [(see File Permissions)](https://datacarpentry.org/shell-genomics/03-working-with-files/#file-permissions).
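
One way to follow the "leave the raw data raw" guideline is to remove write permission from the stored copy. The sketch below is only an illustration and is not taken from the lesson: the function name `make_read_only` is hypothetical, and it assumes a POSIX-style file system where permission bits are honored.

```python
import os
import stat


def make_read_only(path):
    """Remove write permission for user, group, and others; return the new mode bits."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    # Clear the three write bits and leave the read/execute bits untouched.
    read_only = current & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH)
    os.chmod(path, read_only)
    return read_only
```

Calling `make_read_only(path)` on each raw sequencing file after download should make accidental edits fail for ordinary users, although the owner can still restore write access deliberately.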

8 changes: 4 additions & 4 deletions _episodes/03-ncbi-sra.md
@@ -5,8 +5,8 @@ exercises: 10
questions:
- "How do I access public sequencing data?"
objectives:
- "Be aware that public genomic data is available"
- "Understand how to access and download this data"
- "Be aware that public genomic data is available."
- "Understand how to access and download this data."
keypoints:
- "Public data repositories are a great source of genomic data."
---
@@ -17,7 +17,7 @@ There are many repositories for public data. Some model organisms or fields have

# Accessing the original archived data

The [sequencing dataset (from Tenaillon, *et al.* 2016) adapted for this lesson](http://www.datacarpentry.org/organization-genomics/data/) was obtained from the [NCBI Sequence Read Archive](http://www.ncbi.nlm.nih.gov/sra) which is a large (~27 petabasepairs/2.7 x 10^16 basepairs as of April 2019) repository for next-generation sequence data. Like many NCBI databases, it is complex and mastering its use is greater than the scope of this lesson. Very often there will be a direct link (perhaps in the supplemental information) to where the SRA dataset can be found. We are only using a small part of these data, so a direct link cannot be found. If you have time, go through the following detailed description of finding the data we are using today (otherwise skip to the next section).
The [sequencing dataset (from Tenaillon, *et al.* 2016) adapted for this lesson](http://www.datacarpentry.org/organization-genomics/data/) was obtained from the [NCBI Sequence Read Archive](http://www.ncbi.nlm.nih.gov/sra), which is a large (~27 petabasepairs/2.7 x 10^16 basepairs as of April 2019) repository for next-generation sequence data. Like many NCBI databases, it is complex and mastering its use is greater than the scope of this lesson. Very often there will be a direct link (perhaps in the supplemental information) to where the SRA dataset can be found. We are only using a small part of these data, so a direct link cannot be found. If you have time, go through the following detailed description of finding the data we are using today (otherwise skip to the next section).

## Locate the Run Selector Table for the Lenski Dataset on the SRA

@@ -69,7 +69,7 @@ You should now have a file called `SraRunTable.txt`
## Review the SraRunTable in a spreadsheet program


Using your choice of spreadsheet program open the `SraRunTable.txt` file. If prompted this is a tab-delimited file (`.tsv`).
Using your choice of spreadsheet program, open the `SraRunTable.txt` file. If prompted, this is a tab-delimited file (`.tsv`).
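
Because the file is tab-delimited, it can also be read without a spreadsheet program. The following is a minimal sketch using Python's standard `csv` module; the function name `read_run_table` is hypothetical, and the sketch assumes only that the first row of the file holds the column names.

```python
import csv


def read_run_table(path):
    """Read a tab-delimited table into a list of dicts keyed by the header row."""
    # newline="" lets the csv module handle line endings itself.
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

For example, `rows = read_run_table("SraRunTable.txt")` would give one dictionary per run, and `rows[0].keys()` would list the column names found in the header.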

> ## Discussion
> Discuss with the person next to you:
12 changes: 6 additions & 6 deletions _extras/guide.md
@@ -28,11 +28,11 @@ some of it can be prevented with good strategies for data collection up front.
very manual and also not reproducible. That's why we're teaching bash shell scripting!
* Now let's talk about spreadsheets, and when we say spreadsheets, we mean any program that
does spreadsheets like Excel, LibreOffice, OpenOffice. Most learners are probably using Excel.
* Ask the audience any things they've accidentally done in spreadsheets. Talk about an example of your own, like that you accidentally sorted only a single column and not the rest
* Ask the audience any things they've accidentally done in spreadsheets. Talk about an example of your own, like that you accidentally sorted only a single column and not the rest.
of the data in the spreadsheet. What are the pain points!?
* As people answer highlight some of these issues with spreadsheets
* Go through the point about keeping track of your steps and keeping raw data raw
* Go through the cardinal rule of spreadsheets about columns, rows and cells
* As people answer, highlight some of these issues with spreadsheets.
* Go through the point about keeping track of your steps and keeping raw data raw.
* Go through the cardinal rule of spreadsheets about columns, rows and cells.
* Hand them a messy data file and have them pair up and work together to clean up the data.

#### [Planning for NGS projects](../02-project-planning/)
@@ -67,8 +67,8 @@ Excel. So, the presenter's environment will only be the same as some of the lear
We need better notes and screenshots of how things work on both Mac and PC. But we
likely won't be able to cover all the different versions of Excel.

If you have a helper who has experience with the other OS than you, it would be good
to prep them to help with this lesson and tell people how to do things in the other OS.
If you have a helper who has more experience with the other OS than you, it would be good
to prepare them to help with this lesson and tell people how to do things in the other OS.

#### People are not interactive or responsive on the exercises

4 changes: 2 additions & 2 deletions index.md
@@ -7,13 +7,13 @@ Good data organization is the foundation of any research project. It not only se

Organizing a project that includes sequencing involves many components. There's the experimental setup and conditions metadata, measurements of experimental parameters, sequencing preparation and sample information, the sequences themselves and the files and workflow of any bioinformatics analysis. So much of the information of a sequencing project is digital, and we need to keep track of our digital records in the same way we have a lab notebook and sample freezer. In this lesson, we'll go through the project organization and documentation that will make an efficient bioinformatics workflow possible. Not only will this make you a more effective bioinformatics researcher, it also prepares your data and project for publication, as grant agencies and publishers increasingly require this information.

In this lesson we'll be using data from a study of experimental evolution using *E. coli*. [More about this dataset](http://www.datacarpentry.org/organization-genomics/data/). In this study there are several types of files
In this lesson, we'll be using data from a study of experimental evolution using *E. coli*. More information about this dataset is available [here](http://www.datacarpentry.org/organization-genomics/data/). In this study there are several types of files:

- spreadsheet data from the experiment that tracks the strains and their phenotype over time
- spreadsheet data with information on the samples that were sequenced - the names of the samples, how they were prepared and the sequencing conditions
- the sequence data

Throughout the analysis we'll also generate files from the steps in the bioinformatics pipeline and documentation on the tools and parameters that we used.
Throughout the analysis, we'll also generate files from the steps in the bioinformatics pipeline and documentation on the tools and parameters that we used.

In this lesson you will learn:
