source commit: 9942882

datacarpentry · Jun 5, 2024 · 3649701 · 3649701
commit 3649701

Showing 36 changed files with 1,316 additions and 0 deletions.
diff --git a/01-tidiness.md b/01-tidiness.md
@@ -0,0 +1,152 @@
+---
+title: Data Tidiness
+teaching: 20
+exercises: 10
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Think about and understand the types of metadata a sequencing experiment will generate.
+- Understand the importance of metadata and potential metadata standards.
+- Explore common formatting challenges in spreadsheet data.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- What metadata should I collect?
+- How should I structure my sequencing data and metadata?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Introduction
+
+When we think about the data for a sequencing project, we often start by thinking about the sequencing data that we get back from the sequencing center. However, the data you generated *about* your samples before they ever went to the sequencing center is equally or more important. This is the data about the data, often called the metadata. Without information about what you sequenced, the sequence data itself is useless.
+
+:::::::::::::::::::::::::::::::::::::::  challenge
+
+## Discussion
+
+With the person next to you, discuss:
+
+What kinds of data and information have you generated before you sent your DNA/RNA off for sequencing?
+
+:::::::::::::::  solution
+
+## Solution
+
+Types of files and information you have generated:
+
+- Spreadsheet or tabular data with the data from your experiment and whatever you were measuring for your study.
+- Lab notebook notes about how you conducted those experiments.
+- Spreadsheet or tabular data about the samples you sent off for sequencing. Sequencing centers often have a particular format they need with the name of the sample, DNA concentration and other information.
+- Lab notebook notes about how you prepared the DNA/RNA for sequencing and what type of sequencing you're doing, e.g. paired end Illumina HiSeq.
+
+There will likely be other ideas here too. Was this more information and data than you were expecting?
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+All of the data and information just discussed can be considered metadata, i.e. data about the data. We want to follow a few guidelines for metadata.
+
+## Notes
+
+Notes about your experiment, including how you prepared your samples for sequencing, should be in your lab notebook, whether that's a physical lab notebook or electronic lab notebook. For guidelines on good lab notebooks, see the Howard Hughes Medical Institute "Making the Right Moves: A Practical Guide to Scientific Management for Postdocs and New Faculty" section on
+[Data Management and Laboratory Notebooks](https://www.hhmi.org/sites/default/files/Educational%20Materials/Lab%20Management/Making%20the%20Right%20Moves/moves2_ch8.pdf).
+
+Be sure to include dates on your lab notebook pages, on the samples themselves, and in
+any records about those samples. This will help you correctly associate samples
+with each other later. Using dates also helps create unique identifiers, because even
+if you process the same sample twice, you do not usually do it on the same
+day, or if you do, you're aware of it and give them names like A and B.
+
+:::::::::::::::::::::::::::::::::::::::::  callout
+
+## Unique identifiers
+
+A unique identifier is a name that refers to exactly one sample or set of
+sequencing data and to nothing else. Having these unique names makes samples
+and data much easier to track later.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Data about the experiment
+
+Data about the experiment is usually collected in spreadsheet programs such as Excel.
+
+What type of data to collect depends on your experiment and there are often guidelines from metadata standards.
+
+:::::::::::::::::::::::::::::::::::::::::  callout
+
+## Metadata standards
+
+Many fields have particular ways that they structure their metadata so it's
+consistent and can be used across the field.
+
+The Digital Curation Centre maintains [a list of metadata standards](https://www.dcc.ac.uk/resources/metadata-standards/list) and some that are particularly relevant for genomics data are available from the [Genomic Standards Consortium](https://www.gensc.org/pages/projects.html).
+
+If no metadata standards exist for your field yet, think about the minimum amount of information that someone would need to know about your data to be able to work with it without talking to you.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Structuring data in spreadsheets
+
+Regardless of the type of data you're collecting, there are standard ways to enter that data into the spreadsheet to make it easier to analyze later. We often enter data in a way that makes it easy for us as humans to read and work with it, because we're human! Computers need data structured in a way that they can use. So to use this data in a computational workflow, we need to think like computers when we use spreadsheets.
+
+The cardinal rules of using spreadsheet programs for data:
+
+- Leave the raw data raw - do not change it!
+- Put each observation or sample in its own row.
+- Put all your variables in columns - the things that vary between samples, like 'strain' or 'DNA-concentration'.
+- Have column names be explanatory, but without spaces. Use '-', '\_' or [camel case](https://en.wikipedia.org/wiki/Camel_case) instead of a space. For instance 'library-prep-method' or 'LibraryPrep' is better than 'library preparation method' or 'prep', because computers interpret spaces in particular ways.
+- Do not combine multiple pieces of information in one cell. Sometimes it just seems like one thing, but consider whether that's the only way
+  you'll want to be able to use or sort that data. For example, instead of having a column with species and strain name (e.g. *E. coli*
+  K12) you would have one column with the species name (*E. coli*) and another with the strain name (K12). Depending on the type of
+  analysis you want to do, you may even separate the genus and species names into distinct columns.
+- Export the cleaned data to a text-based format like CSV (comma-separated values). This ensures that anyone can use the data, and is required by most data repositories.
+
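Two of the rules above (no spaces in column names, one piece of information per cell) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the header and row values are invented, and `split_species_strain` assumes the species is always the first two words of the cell, which will not hold for every dataset.

```python
import re


def tidy_column_name(name):
    """Replace each run of whitespace with '_' so the header has no spaces."""
    return re.sub(r"\s+", "_", name.strip())


def split_species_strain(cell):
    """Split a combined 'Genus species strain' cell into (species, strain).

    Assumption: the species is the first two words, the strain is the rest.
    """
    parts = cell.split()
    species = " ".join(parts[:2])
    strain = " ".join(parts[2:])
    return species, strain


# Hypothetical messy header and row, invented for illustration
messy_header = ["sample name", "library preparation method", "organism"]
messy_row = ["sample_01", "paired end", "E. coli K12"]

clean_header = [tidy_column_name(h) for h in messy_header]
species, strain = split_species_strain(messy_row[2])

print(clean_header)
print(species, "|", strain)
```

Doing this kind of cleanup on a copy of the data, rather than on the original file, keeps the raw data raw.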
+[![](fig/01_tidiness_datasheet_example_messy.png){alt='Messy spreadsheet'}](files/Ecoli_metadata_composite_messy.xlsx)
+
+:::::::::::::::::::::::::::::::::::::::  challenge
+
+## Discussion
+
+This is some potential spreadsheet data generated about a sequencing experiment. With the person next to you, for about 2 minutes, discuss some of the problems with the spreadsheet data shown above. You can look at the image, or download the file to your computer via this [link](files/Ecoli_metadata_composite_messy.xlsx) and open it in a spreadsheet reader like Excel.
+
+:::::::::::::::  solution
+
+## Solution
+
+A full list of common issues with spreadsheet data is in the [Data Carpentry Ecology spreadsheet lesson](https://www.datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes/). Not all are present in this example. Discuss with the group what they found. Some problems include datasets that do not all have the same columns, datasets split into their own tables, color used to encode information, inconsistent column names, and spaces in some column names. Here is a "clean" version of the same spreadsheet:
+
+[Cleaned spreadsheet](https://raw.githubusercontent.com/datacarpentry/wrangling-genomics/gh-pages/files/Ecoli_metadata_composite.tsv)
+Download the file using right-click (PC)/command-click (Mac).
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Further notes on data tidiness
+
+Organizing your data properly at this point of your experiment will help your analysis later. It will also prepare your data and notes for data deposition, which is often required by journals and funding agencies. If this is a collaborative project, as most projects are now, it's also vital information for your collaborators. Well-organized data makes communication easier and work more efficient.
+
+Fear not! If you have already started your project and it's not set up this way, there are still opportunities to make updates. One of the biggest challenges is tabular data that is not formatted so computers can use it, or has inconsistencies that make it hard to analyze.
+
+More practice on how to structure data is available in our [Data Carpentry Ecology spreadsheet lesson](https://www.datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes).
+
+Tools like [OpenRefine](https://www.datacarpentry.org/OpenRefine-ecology-lesson/) can help you clean your data.
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Metadata is key for you and others to be able to work with your data.
+- Tabular data needs to be structured so that you can work with it effectively.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/02-project-planning.md b/02-project-planning.md
@@ -0,0 +1,173 @@
+---
+title: Planning for NGS Projects
+teaching: 20
+exercises: 10
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Understand the data we send to and get back from a sequencing center.
+- Make decisions about how (if) data will be stored, archived, shared, etc.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How do I plan and organize a genome sequencing project?
+- What information does a sequencing facility need?
+- What are the guidelines for data storage?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+There are a variety of ways to work with a large sequencing dataset. You may be a novice who has not used
+bioinformatics tools beyond doing BLAST searches. You may have bioinformatics experience with other types of data
+and are working with high-throughput (NGS) sequence data for the first time. In the most important ways, the
+methods and approaches we need in bioinformatics are the same ones we need at the bench or in the field -
+*planning, documenting, and organizing* are the key to good reproducible science.
+
+:::::::::::::::::::::::::::::::::::::::  challenge
+
+## Discussion
+
+Before we go any further, here are some important questions to consider. If you are learning at a workshop,
+please discuss these questions with your neighbor.
+
+**Working with sequence data**
+
+What challenges do you think you'll face (or have already faced) in working with a large sequence dataset?
+What is your strategy for saving and sharing your sequence files?
+How can you be sure that your raw data have not been unintentionally corrupted?
+Where/how will you (did you) analyze your data - what software, what computer(s)?
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Sending samples to the facility
+
+The first step in sending your sample for sequencing will be to complete a form documenting the metadata for the
+facility. Take a look at the following example submission spreadsheet.
+
+[Sample submission sheet](files/sample_submission.txt)
+
+Download the file using right-click (PC)/command-click (Mac). This is a tab-delimited text file. Try opening it
+with Excel or another spreadsheet program.
+
+:::::::::::::::::::::::::::::::::::::::  challenge
+
+## Exercise
+
+1. What are some errors you can spot in the data? Typos, missing data, inconsistencies?
+2. What improvements could be made to the choices in naming?
+3. What are some errors in the spreadsheet that would be difficult to spot? Is there any way you can test this?
+
+:::::::::::::::  solution
+
+## Solution
+
+Errors:
+
+- Sequential order of well\_position changes
+- Format of client\_sample\_id changes; IDs should not contain spaces, slashes, or non-ASCII characters
+- Capitalization of the replicate column changes
+- Volume and concentration column headers have unusual (not allowed) characters
+- Volume, concentration, and RIN column decimal precision changes
+- The prep\_date and ship\_date formats are different, and prep\_date has multiple formats
+- Are there others not mentioned?
+
+Improvements in naming:
+
+- Shorten client\_sample\_id names, and maybe just call them "names"
+  - For example: "wt" for "wild-type". Also, they are all "1hr", so that is superfluous information
+- The prep\_date and ship\_date might not be needed
+- Use "microliters" for "Volume (µL)" etc.
+
+Errors hard to spot:
+
+- No space between "wild" and "type", repeated barcode numbers, missing data, duplicate names
+- Find by sorting, or counting
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
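The "find by sorting, or counting" advice from the solution can also be automated. The sketch below is a minimal example and not part of the lesson materials: the column contents are invented, and the two helper functions are hypothetical names chosen for illustration.

```python
from collections import Counter


def find_duplicates(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    return [value for value in counts if counts[value] > 1]


def inconsistent_case(values):
    """Return groups of values that differ only by capitalization."""
    groups = {}
    for value in values:
        groups.setdefault(value.lower(), set()).add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical column contents, invented for illustration
barcodes = ["BC01", "BC02", "BC03", "BC02"]
replicates = ["A", "b", "B", "a", "A"]

print(find_duplicates(barcodes))      # ['BC02']
print(inconsistent_case(replicates))  # [['A', 'a'], ['B', 'b']]
```

Counting catches repeated barcodes and duplicate names that are easy to miss by eye in a long sheet.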
+## Retrieving sample sequencing data from the facility
+
+When the data come back from the sequencing facility, you will receive some documentation (metadata) as well as
+the sequence files themselves. Download and examine the following example file - here provided as a text file and
+Excel file:
+
+- [Sequencing results - text](files/sequencing_results_metadata.txt)
+- [Sequencing results - Excel](files/sequencing_results_metadata.xls)
+
+:::::::::::::::::::::::::::::::::::::::  challenge
+
+## Exercise
+
+1. How are these samples organized?
+2. If you wanted to associate the sequence file names with their corresponding sample names from the submission sheet, could you do so? How?
+3. What do the \_R1/\_R2 extensions mean in the file names?
+4. What does the '.gz' extension on the filenames indicate?
+5. What is the total file size - what challenges in downloading and sharing these data might exist?
+
+:::::::::::::::  solution
+
+## Solution
+
+1. Samples are organized by sample\_id
+2. To relate file names to sample names, use the sample\_id and do a VLOOKUP on the submission sheet
+3. The \_R1/\_R2 extensions mean "read 1" and "read 2" of each sample. These
+  typically refer to forward and reverse reads of the same DNA fragment from
+  the sequencer, i.e. during paired-end sequencing.
+4. The '.gz' extension means the file is compressed in the gzip format to save disk space
+5. The size of all the files combined is 1113.60 GB (over a terabyte!). When transferring files this large, you should at least verify the file sizes after the transfer. More rigorous file integrity checks (for example, comparing checksums) and methods for faster file transfers are possible but beyond the scope of this lesson.
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
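The VLOOKUP step from the solution has a direct equivalent in code: build a lookup table keyed on sample\_id and use it to label each sequence file. The sketch below uses only the Python standard library. The two tab-delimited excerpts are invented, and the `file_name` column name is an assumption, so adjust the names to match the real files.

```python
import csv
import io

# Hypothetical excerpts standing in for the submission sheet and the results file
submission_tsv = (
    "sample_id\tclient_sample_id\n"
    "S001\twt_rep_A\n"
    "S002\tmut_rep_A\n"
)
results_tsv = (
    "sample_id\tfile_name\n"
    "S001\tS001_R1.fastq.gz\n"
    "S001\tS001_R2.fastq.gz\n"
    "S003\tS003_R1.fastq.gz\n"
)


def read_tsv(text):
    """Parse tab-delimited text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def lookup_names(submission_rows, result_rows):
    """Pair each sequence file with its sample name, matching on sample_id."""
    name_by_id = {row["sample_id"]: row["client_sample_id"] for row in submission_rows}
    return [
        (row["file_name"], name_by_id.get(row["sample_id"], "NOT FOUND"))
        for row in result_rows
    ]


pairs = lookup_names(read_tsv(submission_tsv), read_tsv(results_tsv))
for file_name, sample_name in pairs:
    print(file_name, sample_name)
```

Marking unmatched IDs as "NOT FOUND" instead of dropping them makes mismatches between the two sheets visible.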
+## Storing data
+
+The raw data you get back from the sequencing center is the foundation of your sequencing analysis. You need to keep this data, so that you can always come back to it if there are any questions or you need to re-run an analysis, or try a new analysis approach.
+
+### Guidelines for storing data
+
+- Store the data in a place that is accessible by you and other members of your lab. At a minimum, you and the head of your lab should have access.
+- Store the data in a place that is redundantly backed up. It should be backed up in two locations that are in different physical areas.
+- Leave the raw data raw. You will be working with this data, but you do not want to modify this stored copy of the original data. If you modify the stored copy, you will not be able to recover the original files. We will cover how to avoid accidentally changing files in a later lesson in this workshop [(see File Permissions)](https://datacarpentry.org/shell-genomics/03-working-with-files#file-permissions).
+
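One way to confirm later that a stored raw file has not changed is to record a checksum when the file arrives and compare it against a freshly computed one. The sketch below uses Python's standard `hashlib` module. The choice of SHA-256 and the temporary stand-in file are assumptions made for illustration; a sequencing facility may supply checksums in a different format.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# A small temporary file stands in for a raw sequence file (hypothetical name)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_R1.fastq.gz")
    with open(path, "wb") as handle:
        handle.write(b"raw data")

    recorded = sha256_of_file(path)               # record when the file arrives
    unchanged = sha256_of_file(path) == recorded  # later check: still identical

    with open(path, "ab") as handle:              # simulate an accidental edit
        handle.write(b"!")
    modified = sha256_of_file(path) != recorded   # the checksum no longer matches

print(unchanged, modified)
```

Reading in chunks keeps memory use low even for very large sequence files.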
+#### Some data storage solutions
+
+If you have a local high performance computing center or data storage facility on your campus or with your organization, those are ideal locations. Get in touch with the people who support those facilities to ask for information.
+
+If you do not have access to these resources, you can back up on hard drives. Have two backups, and keep the hard drives in different physical locations.
+
+You can also use resources like [Amazon S3](https://aws.amazon.com/s3/), [Microsoft Azure](https://azure.microsoft.com/en-us/pricing/details/storage/blobs/), [Google Cloud](https://cloud.google.com/storage/) or others for cloud storage. The [Open Science Framework](https://osf.io) is a free option for storing files up to 5 GB. See more in the lesson ["Introduction to Cloud Computing for Genomics"](https://www.datacarpentry.org/cloud-genomics/04-which-cloud).
+
+## Summary
+
+Before analysis of data has begun, there are already many potential areas for errors and omissions. Keeping
+organized and keeping a critical eye can help catch mistakes.
+
+One of Data Carpentry's goals is to help you achieve *competency* in working with bioinformatics. This means that
+you can accomplish routine tasks, under normal conditions, in an acceptable amount of time. An expert might
+be able to get to a solution on instinct alone, but taking your time, using Google or another Internet search engine,
+and asking for help are all valid ways of solving your problems. As you complete the lessons you'll be able to use all of those methods more efficiently.
+
+:::::::::::::::::::::::::::::::::::::::::  callout
+
+## Where to go from here?
+
+More reading about core competencies
+
+L. Welch, F. Lewitter, R. Schwartz, C. Brooksbank, P. Radivojac, B. Gaeta and M. Schneider, '[Bioinformatics Curriculum Guidelines: Toward a Definition of Core Competencies](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3945096/)', PLoS Comput Biol, vol. 10, no. 3, p. e1003496, 2014.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Data being sent to a sequencing center also needs to be structured so you can use it.
+- Raw sequencing data should be kept raw somewhere, so you can always go back to the original files.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+