From ef04e94e4ce8dc9376ef47236f015f6173fb4f05 Mon Sep 17 00:00:00 2001 From: Bianca Peterson Date: Mon, 29 Jul 2019 10:30:46 +0200 Subject: [PATCH 1/8] Fixed typos --- CONTRIBUTING.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 4a184014..bb31fa8c 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -69,7 +69,7 @@ and to meet some of our community members. There are many ways to contribute, from writing new exercises and improving existing ones to updating or filling in the documentation -and and submitting [bug reports][issues] +and submitting [bug reports][issues] about things that don't work, aren't clear, or are missing. If you are looking for ideas, please see [the list of issues for this repository][issues], @@ -130,11 +130,11 @@ and have final say over what gets merged into the lesson. ## Issues Labels -What issue labels in this repository mean? Think if you can asign one of the below labels to your issue. +What issue labels in this repository mean? Think if you can assign one of the below labels to your issue. This will help maintainers to decide how to act and should result in quicker response to the issues. -If you don't assign the label to an issue it will be assigned by the maintainers. +If you don't assign the label to an issue, it will be assigned by the maintainers. -- **duplicate** means there is other issue reporting same problem +- **duplicate** means there is another issue reporting the same problem - **enhancement** improvement to the existing content - **help wanted** maintainers invite contributors to tackle this issue - **invalid** is not considered an issue by the maintainers From 53a1b35ee45f24d958536e043731caed8be3219d Mon Sep 17 00:00:00 2001 From: Bianca Peterson Date: Mon, 29 Jul 2019 10:32:28 +0200 Subject: [PATCH 2/8] Added Data Carpentry to Attribution section Not sure if Data Carpentry should be included in the Attribution section - reject PR if I am getting this wrong. --- LICENSE.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/LICENSE.md b/LICENSE.md index 42f526a2..8c24b2a0 100644 --- a/LICENSE.md +++ b/LICENSE.md @@ -25,8 +25,8 @@ Under the following terms: * **Attribution**---You must give appropriate credit (mentioning that your work is derived from work that is Copyright © Software - Carpentry and, where practical, linking to - http://software-carpentry.org/), provide a [link to the + Carpentry or Data Carpentry and, where practical, linking to + http://software-carpentry.org/ or https://datacarpentry.org), provide a [link to the license][cc-by-human], and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. From bfd2619ec1f4aca8241ee00df7ea5eabbadf3a8d Mon Sep 17 00:00:00 2001 From: Bianca Peterson Date: Mon, 29 Jul 2019 11:14:44 +0200 Subject: [PATCH 3/8] Fixed typos and punctuation --- _extras/guide.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/_extras/guide.md b/_extras/guide.md index a98673ec..0a08d50b 100644 --- a/_extras/guide.md +++ b/_extras/guide.md @@ -28,11 +28,11 @@ some of it can be prevented with good strategies for data collection up front. very manual and also not reproducible. That's why we're teaching bash shell scripting! * Now let's talk about spreadsheets, and when we say spreadsheets, we mean any program that does spreadsheets like Excel, LibreOffice, OpenOffice. Most learners are probably using Excel. -* Ask the audience any things they've accidentally done in spreadsheets. Talk about an example of your own, like that you accidentally sorted only a single column and not the rest +* Ask the audience any things they've accidentally done in spreadsheets. Talk about an example of your own, like that you accidentally sorted only a single column and not the rest. of the data in the spreadsheet. What are the pain points!? -* As people answer highlight some of these issues with spreadsheets -* Go through the point about keeping track of your steps and keeping raw data raw -* Go through the cardinal rule of spreadsheets about columns, rows and cells +* As people answer, highlight some of these issues with spreadsheets. +* Go through the point about keeping track of your steps and keeping raw data raw. +* Go through the cardinal rule of spreadsheets about columns, rows and cells. * Hand them a messy data file and have them pair up and work together to clean up the data. #### [Planning for NGS projects](../02-project-planning/) @@ -67,8 +67,8 @@ Excel. So, the presenter's environment will only be the same as some of the lear We need better notes and screenshots of how things work on both Mac and PC. But we likely won't be able to cover all the different versions of Excel. -If you have a helper who has experience with the other OS than you, it would be good -to prep them to help with this lesson and tell people how to do things in the other OS. +If you have a helper who has more experience with the other OS than you, it would be good +to prepare them to help with this lesson and tell people how to do things in the other OS. #### People are not interactive or responsive on the exercises From cb78c11ccabf84dd2d93715317b5351079404f65 Mon Sep 17 00:00:00 2001 From: Bianca Peterson Date: Mon, 29 Jul 2019 11:19:06 +0200 Subject: [PATCH 4/8] Fixed a typo and rewrote a sentence --- index.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/index.md b/index.md index 4a1b0d76..2ab1f11d 100644 --- a/index.md +++ b/index.md @@ -7,13 +7,13 @@ Good data organization is the foundation of any research project. It not only se Organizing a project that includes sequencing involves many components. There's the experimental setup and conditions metadata, measurements of experimental parameters, sequencing preparation and sample information, the sequences themselves and the files and workflow of any bioinformatics analysis. So much of the information of a sequencing project is digital, and we need to keep track of our digital records in the same way we have a lab notebook and sample freezer. In this lesson, we'll go through the project organization and documentation that will make an efficient bioinformatics workflow possible. Not only will this make you a more effective bioinformatics researcher, it also prepares your data and project for publication, as grant agencies and publishers increasingly require this information. -In this lesson we'll be using data from a study of experimental evolution using *E. coli*. [More about this dataset](http://www.datacarpentry.org/organization-genomics/data/). In this study there are several types of files +In this lesson, we'll be using data from a study of experimental evolution using *E. coli*. More information about this dataset is available [here](http://www.datacarpentry.org/organization-genomics/data/). In this study there are several types of files: - spreadsheet data from the experiment that tracks the strains and their phenotype over time - spreadsheet data with information on the samples that were sequenced - the names of the samples, how they were prepared and the sequencing conditions - the sequence data -Throughout the analysis we'll also generate files from the steps in the bioinformatics pipeline and documentation on the tools and parameters that we used. +Throughout the analysis, we'll also generate files from the steps in the bioinformatics pipeline and documentation on the tools and parameters that we used. In this lesson you will learn: From 9ff06bddc3904becb066c1a8771131179f54755d Mon Sep 17 00:00:00 2001 From: Bianca Peterson Date: Mon, 29 Jul 2019 11:44:58 +0200 Subject: [PATCH 5/8] Fixed typos/punctuation --- _episodes/01-tidiness.md | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/_episodes/01-tidiness.md b/_episodes/01-tidiness.md index d49f2b79..8dd0e1b3 100644 --- a/_episodes/01-tidiness.md +++ b/_episodes/01-tidiness.md @@ -4,14 +4,14 @@ teaching: 20 exercises: 10 questions: - "What metadata should I collect?" -- "How should I structure my sequencing data and metadata" +- "How should I structure my sequencing data and metadata?" objectives: - "Think about and understand the types of metadata a sequencing experiment will generate." -- "Understand the importance of metadata and potential metadata standards" -- "Explore common formatting challenges in spreadsheet data" +- "Understand the importance of metadata and potential metadata standards." +- "Explore common formatting challenges in spreadsheet data." keypoints: -- "Metadata is key for you and others to be able to work with your data" -- "Tabular data needs to be structured to be able to work with it effectively" +- "Metadata is key for you and others to be able to work with your data." +- "Tabular data needs to be structured to be able to work with it effectively." --- # Introduction @@ -21,20 +21,20 @@ When we think about the data for a sequencing project, we often start by thinkin > ## Discussion > With the person next to you, discuss: > -> What kinds of data and information have you generated before you send your DNA/RNA off for sequencing? +> What kinds of data and information have you generated before you sent your DNA/RNA off for sequencing? > > > ## Solution > > Types of files and information you have generated: -> > - spreadsheet or tabular data with the data from your experiment and whatever you were measuring for your study -> > - lab notebook notes about how you conducted those experiments -> > - spreadsheet or tabular data about the samples you sent off for sequencing. Sequencing centers often have a particular format they need with the name of the sample, DNA concentration and other information. -> > - lab notebook notes about how you prepared the DNA/RNA for sequencing and what type of sequencing you're doing, e.g. paired end Illumina HiSeq. +> > - Spreadsheet or tabular data with the data from your experiment and whatever you were measuring for your study. +> > - Lab notebook notes about how you conducted those experiments. +> > - Spreadsheet or tabular data about the samples you sent off for sequencing. Sequencing centers often have a particular format they need with the name of the sample, DNA concentration and other information. +> > - Lab notebook notes about how you prepared the DNA/RNA for sequencing and what type of sequencing you're doing, e.g. paired end Illumina HiSeq. > > There likely will be other ideas here too. > > Was this more information and data than you were expecting? > {: .solution} {: .challenge} -All of the data and information just discussed can be considered metadata, data about the data. We want to follow a few guidelines for metadata. +All of the data and information just discussed can be considered metadata, i.e. data about the data. We want to follow a few guidelines for metadata. ## Notes @@ -72,7 +72,7 @@ consistent and can be used across the field. ### Structuring data in spreadsheets -Independent of the type of data you're collecting, there are standard ways to enter that data into the spreadsheet, to make it easier to analyze later. We often enter data that makes it easy for us as humans to read and work with it, because we're human! Computers need data structured in a way that they can use it, so to use this data in a computational workflow, we need to think like computers when we use spreadsheets. +Independent of the type of data you're collecting, there are standard ways to enter that data into the spreadsheet, to make it easier to analyze later. We often enter data that makes it easy for us as humans to read and work with it, because we're human! Computers need data structured in a way that they can use it. So to use this data in a computational workflow, we need to think like computers when we use spreadsheets. The cardinal rules of using spreadsheet programs for data: From 989e4094c5cae46830a28f246d0995bd7db96214 Mon Sep 17 00:00:00 2001 From: Bianca Peterson Date: Mon, 29 Jul 2019 11:48:48 +0200 Subject: [PATCH 6/8] Added instruction to download cleaned spreadsheet --- _episodes/01-tidiness.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_episodes/01-tidiness.md b/_episodes/01-tidiness.md index 8dd0e1b3..7a7a9159 100644 --- a/_episodes/01-tidiness.md +++ b/_episodes/01-tidiness.md @@ -96,7 +96,7 @@ analysis you want to do, you may even separate the genus and species names into > > A full set of types of issues with spreadsheet data is at the [Data Carpentry Ecology spreadsheet lesson](http://www.datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes/). Not all are present in this example. Discuss with the group what they found. Some problems include not all data sets having the same columns, datasets split into their own tables, color to encode information, different column names, spaces in some columns names. Here is a "clean" version of the same spreadsheet: > > > >[Cleaned spreadsheet](https://raw.githubusercontent.com/datacarpentry/wrangling-genomics/gh-pages/files/Ecoli_metadata_composite.tsv) -> > +> >Download the file using right-click (PC)/command-click (Mac). > {: .solution} {: .challenge} From 656eec9be6eadffc3bfec16787a05cc55e936e0a Mon Sep 17 00:00:00 2001 From: Bianca Peterson Date: Mon, 29 Jul 2019 14:26:06 +0200 Subject: [PATCH 7/8] Fixed typos --- _episodes/02-project-planning.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/_episodes/02-project-planning.md b/_episodes/02-project-planning.md index faab7b7f..b40e045a 100644 --- a/_episodes/02-project-planning.md +++ b/_episodes/02-project-planning.md @@ -7,7 +7,7 @@ questions: - "What information does a sequencing facility need?" - "What are the guidelines for data storage?" objectives: -- Understand the data we send to and get back from a sequencing center +- Understand the data we send to and get back from a sequencing center. - Make decisions about how (if) data will be stored, archived, shared, etc. keypoints: - "Data being sent to a sequencing center also needs to be structured so you can use it." @@ -22,7 +22,7 @@ methods and approaches we need in bioinformatics are the same ones we need at th > ## Discussion > -> Before we go any further here are some important questions to consider. If you are learning at a workshop, +> Before we go any further, here are some important questions to consider. If you are learning at a workshop, > please discuss these questions with your neighbor. > > @@ -58,14 +58,14 @@ with Excel or another spreadsheet program. > > - Capitalization of the replicate column changes > > - Volume and concentration column headers have unusual (not allowed) characters > > - Volume, concentration, and RIN column decimal accuracy changes -> > - The prep_date and ship_date formats are different, prep_date has multiple formats +> > - The prep_date and ship_date formats are different, and prep_date has multiple formats > > - Are there others not mentioned? > > > > Improvements in naming > > - Shorten client_sample_id names, and maybe just call them "names" -> > - For example: "wt" for "wild-type". Also, they are all "1hr" so that is superfluous information +> > - For example: "wt" for "wild-type". Also, they are all "1hr", so that is superfluous information > > - The prep_date and ship_date might not be needed -> > - Use "microliters" for "Volume (µL)" etc. +> > - Use "microliters" for "Volume (µL)" etc. > > > > Errors hard to spot: > > - No space between "wild" and "type", repeated barcode numbers, missing data, duplicate names @@ -108,7 +108,7 @@ The raw data you get back from the sequencing center is the foundation of your s ### Guidelines for storing data -- Store the data in a place that is accessible by you and other members of your lab. At a minimum, you and the head of your lab should have access +- Store the data in a place that is accessible by you and other members of your lab. At a minimum, you and the head of your lab should have access. - Store the data in a place that is redundantly backed up. It should be backed up in two locations that are in different physical areas. - Leave the raw data raw. You will be working with this data, but you don't want to modify this stored copy of the original data. If you modify the data, you'll never be able to access those original files. We will cover how to avoid accidentally changing files in a later lesson in this workshop [(see File Permissions)](https://datacarpentry.org/shell-genomics/03-working-with-files/#file-permissions). From 1c136654e16508da7eb44e822e8f6e40724ffa8d Mon Sep 17 00:00:00 2001 From: Bianca Peterson Date: Mon, 29 Jul 2019 14:42:38 +0200 Subject: [PATCH 8/8] Fixed punctuation --- _episodes/03-ncbi-sra.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/_episodes/03-ncbi-sra.md b/_episodes/03-ncbi-sra.md index 9e346f35..a7f86dc5 100644 --- a/_episodes/03-ncbi-sra.md +++ b/_episodes/03-ncbi-sra.md @@ -5,8 +5,8 @@ exercises: 10 questions: - "How do I access public sequencing data?" objectives: -- "Be aware that public genomic data is available" -- "Understand how to access and download this data" +- "Be aware that public genomic data is available." +- "Understand how to access and download this data." keypoints: - "Public data repositories are a great source of genomic data." --- @@ -17,7 +17,7 @@ There are many repositories for public data. Some model organisms or fields have # Accessing the original archived data -The [sequencing dataset (from Tenaillon, *et al.* 2016) adapted for this lesson](http://www.datacarpentry.org/organization-genomics/data/) was obtained from the [NCBI Sequence Read Archive](http://www.ncbi.nlm.nih.gov/sra) which is a large (~27 petabasepairs/2.7 x 10^16 basepairs as of April 2019) repository for next-generation sequence data. Like many NCBI databases, it is complex and mastering its use is greater than the scope of this lesson. Very often there will be a direct link (perhaps in the supplemental information) to where the SRA dataset can be found. We are only using a small part of these data, so a direct link cannot be found. If you have time, go through the following detailed description of finding the data we are using today (otherwise skip to the next section). +The [sequencing dataset (from Tenaillon, *et al.* 2016) adapted for this lesson](http://www.datacarpentry.org/organization-genomics/data/) was obtained from the [NCBI Sequence Read Archive](http://www.ncbi.nlm.nih.gov/sra), which is a large (~27 petabasepairs/2.7 x 10^16 basepairs as of April 2019) repository for next-generation sequence data. Like many NCBI databases, it is complex and mastering its use is greater than the scope of this lesson. Very often there will be a direct link (perhaps in the supplemental information) to where the SRA dataset can be found. We are only using a small part of these data, so a direct link cannot be found. If you have time, go through the following detailed description of finding the data we are using today (otherwise skip to the next section). ## Locate the Run Selector Table for the Lenski Dataset on the SRA @@ -69,7 +69,7 @@ You should now have a file called `SraRunTable.txt` ## Review the SraRunTable in a spreadsheet program -Using your choice of spreadsheet program open the `SraRunTable.txt` file. If prompted this is a tab-delimited file (`.tsv`). +Using your choice of spreadsheet program, open the `SraRunTable.txt` file. If prompted, this is a tab-delimited file (`.tsv`). > ## Discussion > Discuss with the person next to you: