diff --git a/01-tidiness.md b/01-tidiness.md
new file mode 100644
index 00000000..3dba341b
--- /dev/null
+++ b/01-tidiness.md
@@ -0,0 +1,152 @@
+---
+title: Data Tidiness
+teaching: 20
+exercises: 10
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Think about and understand the types of metadata a sequencing experiment will generate.
+- Understand the importance of metadata and potential metadata standards.
+- Explore common formatting challenges in spreadsheet data.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- What metadata should I collect?
+- How should I structure my sequencing data and metadata?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Introduction
+
+When we think about the data for a sequencing project, we often start by thinking about the sequencing data that we get back from the sequencing center. However, equally or more important is the data you've generated *about* the sequences before your samples ever go to the sequencing center. This is the data about the data, often called the metadata. Without the information about what you sequenced, the sequence data itself is useless.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Discussion
+
+With the person next to you, discuss:
+
+What kinds of data and information have you generated before you sent your DNA/RNA off for sequencing?
+
+::::::::::::::: solution
+
+## Solution
+
+Types of files and information you have generated:
+
+- Spreadsheet or tabular data with the data from your experiment and whatever you were measuring for your study.
+- Lab notebook notes about how you conducted those experiments.
+- Spreadsheet or tabular data about the samples you sent off for sequencing. Sequencing centers often have a particular format they need with the name of the sample, DNA concentration and other information.
+- Lab notebook notes about how you prepared the DNA/RNA for sequencing and what type of sequencing you're doing, e.g. paired end Illumina HiSeq.
+
+There will likely be other ideas here too.
+Was this more information and data than you were expecting?
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+All of the data and information just discussed can be considered metadata, i.e. data about the data. We want to follow a few guidelines for metadata.
+
+## Notes
+
+Notes about your experiment, including how you prepared your samples for sequencing, should be in your lab notebook, whether that's a physical lab notebook or electronic lab notebook. For guidelines on good lab notebooks, see the Howard Hughes Medical Institute "Making the Right Moves: A Practical Guide to Scientific Management for Postdocs and New Faculty" section on
+[Data Management and Laboratory Notebooks](https://www.hhmi.org/sites/default/files/Educational%20Materials/Lab%20Management/Making%20the%20Right%20Moves/moves2_ch8.pdf).
+
+Be sure to include dates on your lab notebook pages, on the samples themselves, and in
+any records about those samples. This will help you correctly associate samples
+with each other later. Using dates also helps create unique identifiers, because even
+if you process the same sample twice, you do not usually do it on the same
+day, or if you do, you're aware of it and give them names like A and B.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Unique identifiers
+
+A unique identifier is a name that refers to exactly one sample or set of sequencing data.
+No other sample or dataset shares that name. Having these
+unique names makes your samples and data much easier to track later.
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Data about the experiment
+
+Data about the experiment is usually collected in spreadsheets, like Excel.
+
+What type of data to collect depends on your experiment and there are often guidelines from metadata standards.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Metadata standards
+
+Many fields have particular ways that they structure their metadata so it's
+consistent and can be used across the field.
+
+The Digital Curation Centre maintains [a list of metadata standards](https://www.dcc.ac.uk/resources/metadata-standards/list) and some that are particularly relevant for genomics data are available from the [Genomic Standards Consortium](https://www.gensc.org/pages/projects.html).
+
+If no metadata standard exists for your field, think about the minimum amount of information someone would need to know about your data to be able to work with it without talking to you.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Structuring data in spreadsheets
+
+Regardless of the type of data you're collecting, there are standard ways to enter that data into the spreadsheet to make it easier to analyze later. We often enter data in a way that makes it easy for us as humans to read and work with it, because we're human! Computers need data structured in a way that they can use it. So to use this data in a computational workflow, we need to think like computers when we use spreadsheets.
+
+The cardinal rules of using spreadsheet programs for data:
+
+- Leave the raw data raw - do not change it!
+- Put each observation or sample in its own row.
+- Put all your variables in columns - the things that vary between samples, like 'strain' or 'DNA-concentration'.
+- Have column names be explanatory, but without spaces. Use '-', '\_' or [camel case](https://en.wikipedia.org/wiki/Camel_case) instead of a space. For instance, 'library-prep-method' or 'LibraryPrep' is better than 'library preparation method', because computers interpret spaces in particular ways, and better than 'prep', which is not explanatory.
+- Do not combine multiple pieces of information in one cell. Sometimes it just seems like one thing, but think if that's the only way
+ you'll want to be able to use or sort that data. For example, instead of having a column with species and strain name (e.g. *E. coli*
+ K12) you would have one column with the species name (*E. coli*) and another with the strain name (K12). Depending on the type of
+ analysis you want to do, you may even separate the genus and species names into distinct columns.
+- Export the cleaned data to a text-based format like CSV (comma-separated values) format. This ensures that anyone can use the data, and is required by most data repositories.
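+
+As a rough illustration of the last two rules, the following Python sketch splits a combined species-and-strain value into separate columns and writes the result as CSV. The sample values and the column names (`sample`, `organism`, `genus`, `species`, `strain`) are hypothetical and chosen only for this example; it also assumes every value has exactly the form "genus species strain".
+
+```python
+import csv
+import io
+
+# Hypothetical messy records: species and strain share one cell.
+messy_rows = [
+    {"sample": "A1", "organism": "E. coli K12"},
+    {"sample": "A2", "organism": "E. coli REL606"},
+]
+
+
+def split_organism(value):
+    """Split a value like 'E. coli K12' into genus, species, and strain."""
+    genus, species, strain = value.split(" ", 2)
+    return {"genus": genus, "species": species, "strain": strain}
+
+
+# One row per sample, one column per variable.
+tidy_rows = [
+    {"sample": row["sample"], **split_organism(row["organism"])}
+    for row in messy_rows
+]
+
+# Write to an in-memory buffer; pass an open file instead to save to disk.
+buffer = io.StringIO()
+writer = csv.DictWriter(buffer, fieldnames=["sample", "genus", "species", "strain"])
+writer.writeheader()
+writer.writerows(tidy_rows)
+csv_text = buffer.getvalue()
+print(csv_text)
+```
+
+Keeping the original records untouched and writing the tidy version to a new file is one way to leave the raw data raw.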
+
+[Messy spreadsheet](files/Ecoli_metadata_composite_messy.xlsx)
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Discussion
+
+This is some potential spreadsheet data generated about a sequencing experiment. With the person next to you, for about 2 minutes, discuss some of the problems with the spreadsheet data shown above. You can look at the image, or download the file to your computer via this [link](files/Ecoli_metadata_composite_messy.xlsx) and open it in a spreadsheet reader like Excel.
+
+::::::::::::::: solution
+
+## Solution
+
+A full list of common issues with spreadsheet data is in the [Data Carpentry Ecology spreadsheet lesson](https://www.datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes/). Not all of them are present in this example. Discuss with the group what they found. Some problems include: not all datasets having the same columns, datasets split into their own tables, color used to encode information, inconsistent column names, and spaces in some column names. Here is a "clean" version of the same spreadsheet:
+
+[Cleaned spreadsheet](https://raw.githubusercontent.com/datacarpentry/wrangling-genomics/gh-pages/files/Ecoli_metadata_composite.tsv)
+Download the file using right-click (PC)/command-click (Mac).
+
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Further notes on data tidiness
+
+Organizing your data properly at this point of your experiment will help your analysis later. It will also prepare your data and notes for data deposition, which is often required by journals and funding agencies. If this is a collaborative project, as most projects are now, it's also vital information for your collaborators. Well-organized data makes communication easier and work more efficient.
+
+Fear not! If you have already started your project and it's not set up this way, there are still opportunities to make updates. One of the biggest challenges is tabular data that is not formatted so computers can use it, or has inconsistencies that make it hard to analyze.
+
+More practice on how to structure data is outlined in our [Data Carpentry Ecology spreadsheet lesson](https://www.datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes).
+
+Tools like [OpenRefine](https://www.datacarpentry.org/OpenRefine-ecology-lesson/) can help you clean your data.
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Metadata is key for you and others to be able to work with your data.
+- Tabular data needs to be structured to be able to work with it effectively.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/02-project-planning.md b/02-project-planning.md
new file mode 100644
index 00000000..8da5d352
--- /dev/null
+++ b/02-project-planning.md
@@ -0,0 +1,173 @@
+---
+title: Planning for NGS Projects
+teaching: 20
+exercises: 10
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Understand the data we send to and get back from a sequencing center.
+- Make decisions about how (if) data will be stored, archived, shared, etc.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How do I plan and organize a genome sequencing project?
+- What information does a sequencing facility need?
+- What are the guidelines for data storage?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+There are a variety of ways to work with a large sequencing dataset. You may be a novice who has not used
+bioinformatics tools beyond doing BLAST searches. You may have bioinformatics experience with other types of data
+and are working with high-throughput (NGS) sequence data for the first time. In the most important ways, the
+methods and approaches we need in bioinformatics are the same ones we need at the bench or in the field -
+*planning, documenting, and organizing* are the key to good reproducible science.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Discussion
+
+Before we go any further, here are some important questions to consider. If you are learning at a workshop,
+please discuss these questions with your neighbor.
+
+**Working with sequence data**
+
+- What challenges do you think you'll face (or have already faced) in working with a large sequence dataset?
+- What is your strategy for saving and sharing your sequence files?
+- How can you be sure that your raw data have not been unintentionally corrupted?
+- Where/how will you (did you) analyze your data - what software, what computer(s)?
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Sending samples to the facility
+
+The first step in sending your sample for sequencing will be to complete a form documenting the metadata for the
+facility. Take a look at the following example submission spreadsheet.
+
+[Sample submission sheet](files/sample_submission.txt)
+
+Download the file using right-click (PC)/command-click (Mac). This is a tab-delimited text file. Try opening it
+with Excel or another spreadsheet program.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise
+
+1. What are some errors you can spot in the data? Typos, missing data, inconsistencies?
+2. What improvements could be made to the choices in naming?
+3. What are some errors in the spreadsheet that would be difficult to spot? Is there any way you can test this?
+
+::::::::::::::: solution
+
+## Solution
+
+Errors:
+
+- Sequential order of well\_position changes
+- Format of client\_sample\_id changes between rows; IDs should not contain spaces, slashes, or non-standard ASCII characters
+- Capitalization of the replicate column changes
+- Volume and concentration column headers have unusual (not allowed) characters
+- Volume, concentration, and RIN column decimal precision changes
+- The prep\_date and ship\_date formats are different, and prep\_date has multiple formats
+- Are there others not mentioned?
+
+Improvements in naming:
+
+- Shorten client\_sample\_id names, and maybe just call them "names"
+ - For example: "wt" for "wild-type". Also, they are all "1hr", so that is superfluous information
+- The prep\_date and ship\_date might not be needed
+- Use "microliters" for "Volume (µL)" etc.
+
+Errors hard to spot:
+
+- No space between "wild" and "type", repeated barcode numbers, missing data, duplicate names
+- These can be found by sorting a column or by counting how often each value occurs
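+
+Counting can also be done outside the spreadsheet. The sketch below is one possible way to flag repeated values in a single column using Python's standard library; the barcode values are made up for illustration and are not taken from the example submission sheet.
+
+```python
+from collections import Counter
+
+# Hypothetical values standing in for one column of a submission sheet.
+barcodes = ["BC01", "BC02", "BC03", "BC02", "BC04", "BC01"]
+
+
+def find_duplicates(values):
+    """Return each value that occurs more than once, with its count."""
+    counts = Counter(values)
+    return {value: n for value, n in counts.items() if n > 1}
+
+
+duplicates = find_duplicates(barcodes)
+print(duplicates)
+```
+
+An empty result means every value in the column is unique, which is what you would want for barcodes or sample names.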
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Retrieving sample sequencing data from the facility
+
+When the data come back from the sequencing facility, you will receive some documentation (metadata) as well as
+the sequence files themselves. Download and examine the following example file - here provided as a text file and
+Excel file:
+
+- [Sequencing results - text](files/sequencing_results_metadata.txt)
+- [Sequencing results - Excel](files/sequencing_results_metadata.xls)
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Exercise
+
+1. How are these samples organized?
+2. If you wanted to associate the sequence file names with their corresponding sample names from the submission sheet, could you do so? How?
+3. What do the \_R1/\_R2 extensions mean in the file names?
+4. What does the '.gz' extension on the filenames indicate?
+5. What is the total file size - what challenges in downloading and sharing these data might exist?
+
+::::::::::::::: solution
+
+## Solution
+
+1. Samples are organized by sample\_id.
+2. To relate file names to sample names, use the sample\_id and do a VLOOKUP against the submission sheet.
+3. The \_R1/\_R2 extensions mean "read 1" and "read 2" of each sample. These
+ typically refer to forward and reverse reads of the same DNA fragment from
+ the sequencer, i.e. during paired-end sequencing.
+4. The '.gz' extension means the file is compressed in the "gzip" format to save disk space.
+5. The size of all the files combined is 1113.60 Gb (over a terabyte!). To transfer files this large you should validate the file size following transfer. Absolute file integrity checks following transfers and methods for faster file transfers are possible but beyond the scope of this lesson.
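+
+The lookup described in answer 2 can also be done in code. The following Python sketch shows the idea under stated assumptions: the two tab-delimited excerpts, the column names `client_sample_id` and `file_name`, and all values are hypothetical stand-ins, not the contents of the real example files.
+
+```python
+import csv
+import io
+
+# Hypothetical, tab-delimited excerpts; real sheets will have more columns.
+submission_text = "sample_id\tclient_sample_id\n001\twt_rep1\n002\tmut_rep1\n"
+results_text = (
+    "sample_id\tfile_name\n"
+    "001\t001_R1.fastq.gz\n"
+    "001\t001_R2.fastq.gz\n"
+    "002\t002_R1.fastq.gz\n"
+)
+
+
+def read_tsv(text):
+    """Parse tab-delimited text into a list of dictionaries, one per row."""
+    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))
+
+
+# Build a lookup from sample_id to the name used on the submission sheet.
+names_by_id = {
+    row["sample_id"]: row["client_sample_id"] for row in read_tsv(submission_text)
+}
+
+# Attach the client name to every sequence file; None marks an unmatched ID.
+matched = [
+    (row["file_name"], names_by_id.get(row["sample_id"]))
+    for row in read_tsv(results_text)
+]
+for file_name, client_name in matched:
+    print(file_name, client_name)
+```
+
+Any file paired with `None` has a sample\_id that does not appear on the submission sheet and deserves a closer look.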
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Storing data
+
+The raw data you get back from the sequencing center is the foundation of your sequencing analysis. You need to keep this data, so that you can always come back to it if there are any questions or you need to re-run an analysis, or try a new analysis approach.
+
+### Guidelines for storing data
+
+- Store the data in a place that is accessible by you and other members of your lab. At a minimum, you and the head of your lab should have access.
+- Store the data in a place that is redundantly backed up. It should be backed up in two locations that are in different physical areas.
+- Leave the raw data raw. You will be working with this data, but you do not want to modify this stored copy of the original data. If you modify the data, you'll never be able to access those original files. We will cover how to avoid accidentally changing files in a later lesson in this workshop [(see File Permissions)](https://datacarpentry.org/shell-genomics/03-working-with-files#file-permissions).
+
+#### Some data storage solutions
+
+If you have a local high performance computing center or data storage facility on your campus or with your organization, those are ideal locations. Get in touch with the people who support those facilities to ask for information.
+
+If you do not have access to these resources, you can back up on hard drives. Have two backups, and keep the hard drives in different physical locations.
+
+You can also use resources like [Amazon S3](https://aws.amazon.com/s3/), [Microsoft Azure](https://azure.microsoft.com/en-us/pricing/details/storage/blobs/), [Google Cloud](https://cloud.google.com/storage/) or others for cloud storage. The [Open Science Framework](https://osf.io) is a free option for storing files up to 5 GB. See more in the lesson ["Introduction to Cloud Computing for Genomics"](https://www.datacarpentry.org/cloud-genomics/04-which-cloud).
+
+## Summary
+
+Before analysis of data has begun, there are already many potential areas for errors and omissions. Keeping
+organized and keeping a critical eye can help catch mistakes.
+
+One of Data Carpentry's goals is to help you achieve *competency* in working with bioinformatics. This means that
+you can accomplish routine tasks, under normal conditions, in an acceptable amount of time. While an expert might
+be able to get to a solution on instinct alone - taking your time, using Google or another Internet search engine,
+and asking for help are all valid ways of solving your problems. As you complete the lessons you'll be able to use all of those methods more efficiently.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Where to go from here?
+
+More reading about core competencies:
+
+L. Welch, F. Lewitter, R. Schwartz, C. Brooksbank, P. Radivojac, B. Gaeta and M. Schneider, '[Bioinformatics Curriculum Guidelines: Toward a Definition of Core Competencies](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3945096/)', PLoS Comput Biol, vol. 10, no. 3, p. e1003496, 2014.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Data being sent to a sequencing center also needs to be structured so you can use it.
+- Raw sequencing data should be kept raw somewhere, so you can always go back to the original files.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/03-ncbi-sra.md b/03-ncbi-sra.md
new file mode 100644
index 00000000..4e207533
--- /dev/null
+++ b/03-ncbi-sra.md
@@ -0,0 +1,161 @@
+---
+title: Examining Data on the NCBI SRA Database
+teaching: 20
+exercises: 10
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Be aware that public genomic data is available.
+- Understand how to access and download this data.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How do I access public sequencing data?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+In our experiments we usually think about generating our own sequencing data. However, almost all analyses use reference data, and you may want to compare your results against publicly available data or use it to annotate your own. You may also want to do a full project or set of analyses using publicly available data. This data is a great, and essential, resource for genomic data analysis.
+
+When you come to publish a paper including your sequencing data, most journals and funders require that you place your data on a public repository. Sharing your data makes it more likely that your work will be re-used and cited. It helps to prepare for this early!
+
+There are many repositories for public data. Some model organisms or fields have specific databases, and there are ones for particular types of data. Two of the most comprehensive public repositories are provided by the [National Center for Biotechnology Information (NCBI)](https://www.ncbi.nlm.nih.gov) and the [European Bioinformatics Institute (EMBL-EBI)](https://www.ebi.ac.uk/). The NCBI's [Sequence Read Archive (SRA)](https://trace.ncbi.nlm.nih.gov/Traces/sra/) is the database we will be using for this lesson, but the EMBL-EBI's European Nucleotide Archive (ENA) is also useful. The general processes are similar for any database.
+
+## Accessing the original archived data
+
+The [sequencing dataset (from Tenaillon, *et al.* 2016) adapted for this lesson](https://www.datacarpentry.org/organization-genomics/data) was obtained from the [NCBI Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra), which is a large (~27 petabasepairs/2.7 x 10^16 basepairs as of April 2019) repository for next-generation sequence data. Like many NCBI databases, it is complex, and mastering its use is beyond the scope of this lesson. Very often there will be a direct link (perhaps in the supplemental information) to where the SRA dataset can be found. We are only using a small part of these data, so a direct link cannot be found. If you have time, go through the following detailed description of finding the data we are using today (otherwise skip to the next section).
+
+### Locate the Run Selector Table for the Lenski Dataset on the SRA
+
+See the figures below for how information about data access is provided within the original paper.
+
+
+
+The **above image** shows the title of the study, as well as the authors.
+
+The excerpt from the paper below includes information on how to locate the sequence data. In this case, the text appears just before the reference section.
+
+> **Author Information** All sequencing data sets are available in the NCBI
+> BioProject database under accession number PRJNA294072. The *breseq*
+> analysis pipeline is available at GitHub ([http://github.com/barricklab/breseq](https://github.com/barricklab/breseq/)).
+> Other analysis scripts are available at the Dryad Digital Repository ([http://dx.doi.org/10.5061/dryad.6226d](https://doi.org/10.5061/dryad.6226d)). R.E.L. will make strains available to qualified
+> recipients, subject to a material transfer agreement. Reprints and permissions
+> information is available at [www.nature.com/reprints](https://www.nature.com/reprints). The authors declare no
+> competing financial interests. Readers are welcome to comment on the online
+> version of the paper. Correspondence and requests for materials should be
+> addressed to R.E.L. (lenski *at* msu.edu)
+
+**At the beginning of this workshop we gave you [experimental information about these data](https://www.datacarpentry.org/organization-genomics/data). This lesson uses a *subset* of SRA files, from a small *subproject* of the BioProject database
+"PRJNA294072". To find these data you can follow the instructions below:**
+
+1. Notice that the paper references "PRJNA294072" as a "BioProject" at NCBI. If you go to the [NCBI website](https://www.ncbi.nlm.nih.gov/) and search for "PRJNA294072" you will be shown a link to the "Long-Term Evolution Experiment with E. coli" BioProject. Here is the link to that database: [https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA294072](https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA294072).
+
+2. Once on the BioProject page, scroll down to the table under **"This project encompasses the
+ following 15 sub-projects:"**.
+
+3. In this table, select **subproject**
+ *"[PRJNA295606](https://www.ncbi.nlm.nih.gov/bioproject/295606) SRA or Trace Escherichia coli B str. REL606 E. coli genome evolution over 50,000 generations (The University of Texas at...)"*.
+
+4. This will take you to a page with the subproject description, and a table **"Project Data"**
+ that has a link to the 224 SRA files for this subproject.
+
+5. Click on the number
+ ["224"](https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=295606) next to "SRA Experiments" and it will take you to the SRA page for this subproject.
+  (Figure: 03\_send\_results.png)
+
+6. For a more organized table, select "Send results to Run selector". This
+ takes you to the Run Selector page for BioProject PRJNA295606 (the BioProject number for the experiment SRP064605) that is used in the next section.
+
+### Download the Lenski SRA data from the SRA Run Selector Table
+
+1. Make sure you access the Tenaillon dataset from the provided link: [https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP064605](https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP064605). This is NCBI's cloud-based SRA interface. You will be presented with a page for the overall SRA accession SRP064605 - this is a collection of all the experimental data.
+
+2. Notice that this page has three sections: "Common Fields", "Select", and "Found 312 Items". Within "Found 312 Items", click on the first Run Number (Column "Run" Row "1").
+  (Figure: ncbi-new-tables2.png)
+
+3. This will take you to a page that is a run browser. Take a few minutes to examine some of the descriptions on the page.
+  (Figure: ncbi-run-browser.png)
+
+4. Use the browser's back button to go back to the 'previous page'. As shown in the figure below, the second section of the page ("Select") has the **Total** row showing you the current number of "Runs", "Bytes", and "Bases" in the dataset to date. On 2022-12-06 there were 312 runs, 109.58 Gb data, and 177.17 Gbases of data.
+  (Figure: ncbi-new-metadata.png)
+
+5. Click on the "Metadata" button to download the data for this lesson. The file is named "SraRunTable.txt"; save it to your computer's Desktop. Although it has a `.txt` extension, this file is actually comma-delimited, so you should rename it to "SraRunTable.csv" so that your spreadsheet software opens it correctly.
+
+**You should now have a file called `SraRunTable.csv`** on your desktop.
+
+> As this example shows, comma-separated and tab-separated files may both carry a generic "text" (`.txt`)
+> extension while using commas or tabs, respectively, as **delimiters**. Sometimes you
+> might need to use a text editor (*e.g.* Notepad) to determine whether a file suffixed with `.txt` is
+> actually comma-delimited or tab-delimited.
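+
+If you prefer not to inspect the file by eye, a small script can make the same check. This is a minimal sketch that simply counts commas and tabs in a header line; the function name `guess_delimiter` and the header text are hypothetical, and the heuristic would be misled by a tab-delimited file whose column names contain several commas.
+
+```python
+def guess_delimiter(header_line):
+    """Guess whether a header line is comma- or tab-delimited by counting."""
+    commas = header_line.count(",")
+    tabs = header_line.count("\t")
+    if commas == 0 and tabs == 0:
+        return None  # neither delimiter appears; inspect the file by hand
+    return "," if commas >= tabs else "\t"
+
+
+# Hypothetical header lines, not copied from the real SraRunTable file.
+print(repr(guess_delimiter("Run,Assay Type,Bases")))
+print(repr(guess_delimiter("Run\tAssay Type\tBases")))
+```
+
+To apply this to a downloaded file, read its first line and pass that line to the function.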
+
+### Review the SraRunTable metadata in a spreadsheet program
+
+Using your choice of spreadsheet program, open the `SraRunTable.csv` file.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Discussion
+
+Discuss with the person next to you:
+
+1. What strain of *E. coli* was used in this experiment?
+2. What was the sequencing platform used for this experiment?
+3. What samples in the experiment contain
+ [paired end](https://www.illumina.com/technology/next-generation-sequencing/paired-end-sequencing_assay.html)
+ sequencing data?
+4. What other kind of data is available?
+5. Why are you collecting this kind of information about your sequencing runs?
+
+::::::::::::::: solution
+
+## Solution
+
+1. *Escherichia coli* B str. REL606, shown in the "organism" column. This is a tricky question because the column labeled "strain" actually contains sample names.
+2. The Illumina sequencing platform was used, as shown in the "Platform" column. Notice, however, that multiple instrument types are listed under "Instrument".
+3. Sort by LibraryLayout. The "DATASTORE\_filetype" column shows that "realign,sra,wgmlst\_sig" were used for paired-end data, while "fastq,sra" were used for all single-end reads. (Also notice that the Illumina Genome Analyzer IIx was never used for paired-end sequencing.)
+4. There are several other columns, including megabases of sequence per sample, Assay type, BioSample Model, and more.
+5. These are examples of "metadata" that you should collect for sequencing projects that are sent to public databases.
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+After answering the questions, you should avoid saving any changes you might have made to the metadata file. We do not want to make any changes. If you were to save this file, make sure you save it as a text-based `.csv` file format.
+
+## Downloading a few sequencing files: EMBL-EBI
+
+The SRA does not support direct download of fastq files from its webpage. However, the [European Nucleotide Archive](https://www.ebi.ac.uk/ena/browser/home) does. Let's see how we can get a download link to a file we are interested in.
+
+1. Navigate to the [ENA](https://www.ebi.ac.uk/ena/browser/home).
+
+2. Near the top right, in the box next to "View", type in `SRR2589044` and click the "View" button.
+
+3. This will take you to a page with information about the data. Near the bottom you will have the option to download the data by FTP. You could download the `.fastq` read files here, but we do not need to download these files right now and they are large. Alternatively, right click and copy the URL to save it for later.
+
+We do not recommend downloading large numbers of sequencing files this way. For that, the NCBI provides a software package called the SRA Toolkit. However, for a couple of files, it's often easier to go through the ENA.
+
+## Where to learn more
+
+### About the Sequence Read Archive
+
+- You can learn more about the SRA by reading the [SRA Documentation](https://www.ncbi.nlm.nih.gov/Traces/sra/).
+- The best way to transfer a large SRA dataset is by using the [SRA Toolkit](https://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc).
+
+## References
+
+Tenaillon O, Barrick JE, Ribeck N, Deatherage DE, Blanchard JL, Dasgupta A, Wu GC, Wielgoss S, Cruveiller S, Médigue C, Schneider D, Lenski RE.
+Tempo and mode of genome evolution in a 50,000-generation experiment (2016) Nature. 536(7615): 165–170.
+[Paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/), [Supplemental materials](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/#)
+Data on NCBI SRA: [https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP064605](https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP064605)
+Data on EMBL-EBI ENA: [https://www.ebi.ac.uk/ena/data/view/PRJNA295606](https://www.ebi.ac.uk/ena/data/view/PRJNA295606)
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Public data repositories are a great source of genomic data.
+- You are likely to put your own data on a public repository.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
new file mode 100644
index 00000000..f19b8049
--- /dev/null
+++ b/CODE_OF_CONDUCT.md
@@ -0,0 +1,13 @@
+---
+title: "Contributor Code of Conduct"
+---
+
+As contributors and maintainers of this project,
+we pledge to follow [The Carpentries Code of Conduct][coc].
+
+Instances of abusive, harassing, or otherwise unacceptable behavior
+may be reported by following our [reporting guidelines][coc-reporting].
+
+
+[coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html
+[coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html
diff --git a/LICENSE.md b/LICENSE.md
new file mode 100644
index 00000000..7632871f
--- /dev/null
+++ b/LICENSE.md
@@ -0,0 +1,79 @@
+---
+title: "Licenses"
+---
+
+## Instructional Material
+
+All Carpentries (Software Carpentry, Data Carpentry, and Library Carpentry)
+instructional material is made available under the [Creative Commons
+Attribution license][cc-by-human]. The following is a human-readable summary of
+(and not a substitute for) the [full legal text of the CC BY 4.0
+license][cc-by-legal].
+
+You are free:
+
+- to **Share**---copy and redistribute the material in any medium or format
+- to **Adapt**---remix, transform, and build upon the material
+
+for any purpose, even commercially.
+
+The licensor cannot revoke these freedoms as long as you follow the license
+terms.
+
+Under the following terms:
+
+- **Attribution**---You must give appropriate credit (mentioning that your work
+ is derived from work that is Copyright (c) The Carpentries and, where
+ practical, linking to ), provide a [link to the
+ license][cc-by-human], and indicate if changes were made. You may do so in
+ any reasonable manner, but not in any way that suggests the licensor endorses
+ you or your use.
+
+- **No additional restrictions**---You may not apply legal terms or
+ technological measures that legally restrict others from doing anything the
+ license permits. With the understanding that:
+
+Notices:
+
+* You do not have to comply with the license for elements of the material in
+ the public domain or where your use is permitted by an applicable exception
+ or limitation.
+* No warranties are given. The license may not give you all of the permissions
+ necessary for your intended use. For example, other rights such as publicity,
+ privacy, or moral rights may limit how you use the material.
+
+## Software
+
+Except where otherwise noted, the example programs and other software provided
+by The Carpentries are made available under the [OSI][osi]-approved [MIT
+license][mit-license].
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of
+this software and associated documentation files (the "Software"), to deal in
+the Software without restriction, including without limitation the rights to
+use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
+of the Software, and to permit persons to whom the Software is furnished to do
+so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+
+## Trademark
+
+"The Carpentries", "Software Carpentry", "Data Carpentry", and "Library
+Carpentry" and their respective logos are registered trademarks of [Community
+Initiatives][ci].
+
+[cc-by-human]: https://creativecommons.org/licenses/by/4.0/
+[cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode
+[mit-license]: https://opensource.org/licenses/mit-license.html
+[ci]: https://communityin.org/
+[osi]: https://opensource.org
diff --git a/config.yaml b/config.yaml
new file mode 100644
index 00000000..126f9958
--- /dev/null
+++ b/config.yaml
@@ -0,0 +1,83 @@
+#------------------------------------------------------------
+# Values for this lesson.
+#------------------------------------------------------------
+
+# Which carpentry is this (swc, dc, lc, or cp)?
+# swc: Software Carpentry
+# dc: Data Carpentry
+# lc: Library Carpentry
+# cp: Carpentries (to use for instructor training for instance)
+# incubator: The Carpentries Incubator
+carpentry: 'dc'
+
+# Overall title for pages.
+title: 'Project Organization and Management for Genomics'
+
+# Date the lesson was created (YYYY-MM-DD, this is empty by default)
+created: '2015-03-24'
+
+# Comma-separated list of keywords for the lesson
+keywords: 'software, data, lesson, The Carpentries'
+
+# Life cycle stage of the lesson
+# possible values: pre-alpha, alpha, beta, stable
+life_cycle: 'stable'
+
+# License of the lesson materials (recommended CC-BY 4.0)
+license: 'CC-BY 4.0'
+
+# Link to the source repository for this lesson
+source: 'https://github.com/datacarpentry/organization-genomics'
+
+# Default branch of your lesson
+branch: 'main'
+
+# Who to contact if there are any issues
+contact: 'team@carpentries.org'
+
+# Navigation ------------------------------------------------
+#
+# Use the following menu items to specify the order of
+# individual pages in each dropdown section. Leave blank to
+# include all pages in the folder.
+#
+# Example -------------
+#
+# episodes:
+# - introduction.md
+# - first-steps.md
+#
+# learners:
+# - setup.md
+#
+# instructors:
+# - instructor-notes.md
+#
+# profiles:
+# - one-learner.md
+# - another-learner.md
+
+# Order of episodes in your lesson
+episodes:
+- 01-tidiness.md
+- 02-project-planning.md
+- 03-ncbi-sra.md
+
+# Information for Learners
+learners:
+
+# Information for Instructors
+instructors:
+
+# Learner Profiles
+profiles:
+
+# Customisation ---------------------------------------------
+#
+# This space below is where custom yaml items (e.g. pinning
+# sandpaper and varnish versions) should live
+
+
+url: 'https://datacarpentry.github.io/organization-genomics'
+analytics: carpentries
+lang: en
diff --git a/data.md b/data.md
new file mode 100644
index 00000000..1c6f6826
--- /dev/null
+++ b/data.md
@@ -0,0 +1,40 @@
+---
+title: Data
+---
+
+# Features of the dataset
+
+This dataset was selected for our exercise on NGS Data Carpentry for several reasons, including:
+
+- Simple, but iconic NGS-problem: Examine a population where we want to characterize changes in sequence *a priori*
+- Dataset publicly available - in this case through the NCBI Sequence Read Archive ([http://www.ncbi.nlm.nih.gov/sra](https://www.ncbi.nlm.nih.gov/sra))
+- Small file sizes - while several of the related files may still be hundreds of MBs, overall we will be able to get through them more quickly than if we worked with a larger eukaryotic genome
+
+# Introduction to the dataset
+
+Microbes are ideal organisms for exploring 'Long-term Evolution Experiments' (LTEEs) - thousands of generations can be generated and stored in a way that would be virtually impossible for more complex eukaryotic systems. In [Tenaillon et al. 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/), 12 populations of *Escherichia coli* were propagated for more than 50,000 generations in a glucose-limited minimal medium. This medium was supplemented with citrate, which *E. coli* cannot metabolize in the aerobic conditions of the experiment. Sequencing of the populations at regular time points revealed that spontaneous citrate-using mutants (Cit+) appeared in a population of *E. coli* (designated Ara-3) at around 31,000 generations. It should be noted that spontaneous Cit+ mutants are extraordinarily rare - inability to metabolize citrate is one of the defining characteristics of the *E. coli* species. Eventually, Cit+ mutants became the dominant population as the experimental growth medium contained a high concentration of citrate relative to glucose. Around the same time that this mutation emerged, another phenotype became prominent in the Ara-3 population. Many *E. coli* began to develop excessive numbers of mutations, meaning they became hypermutable.
+
+Strains from generation 0 to generation 50,000 were sequenced, including both Cit+ and Cit- strains and, in later generations, hypermutable ones.
+
+For the purposes of this workshop we're going to be working with three of the sequencing runs from this experiment.
+
+| SRA Run Number | Clone | Generation | Cit | Hypermutable | Read Length | Sequencing Depth |
+| -------------- | -------- | ---------- | ------- | ------------ | ----------- | ---------------- |
+| SRR2589044 | REL2181A | 5,000 | Unknown | None | 150 | 60\.2 |
+| SRR2584863 | REL7179B | 15,000 | Unknown | None | 150 | 88 |
+| SRR2584866 | REL11365 | 50,000 | Cit+ | plus | 150 | 138\.3 |
+
+We want to be able to look at differences in mutation rates between hypermutable and non-hypermutable strains. We also want to analyze the sequences to figure out what changes occurred in genomes to make the strains Cit+. Ultimately, we will answer the questions:
+
+- How many base pair changes are there between the Cit+ and Cit- strains?
+- What are the base pair changes between strains?
+
+## References
+
+Tenaillon O, Barrick JE, Ribeck N, Deatherage DE, Blanchard JL, Dasgupta A, Wu GC, Wielgoss S, Cruveiller S, Médigue C, Schneider D, Lenski RE.
+Tempo and mode of genome evolution in a 50,000-generation experiment (2016) Nature. 536(7615): 165–170.
+[Paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/), [Supplemental materials](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/#)
+Data on NCBI SRA: [https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP064605](https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP064605)
+Data on EMBL-EBI ENA: [https://www.ebi.ac.uk/ena/data/view/PRJNA295606](https://www.ebi.ac.uk/ena/data/view/PRJNA295606)
+
+
diff --git a/discuss.md b/discuss.md
new file mode 100644
index 00000000..bd4eb222
--- /dev/null
+++ b/discuss.md
@@ -0,0 +1,7 @@
+---
+title: Discussion
+---
+
+No current discussion
+
+
diff --git a/fig/01_tidiness_datasheet_example_clean.png b/fig/01_tidiness_datasheet_example_clean.png
new file mode 100644
index 00000000..9974dac9
Binary files /dev/null and b/fig/01_tidiness_datasheet_example_clean.png differ
diff --git a/fig/01_tidiness_datasheet_example_messy.png b/fig/01_tidiness_datasheet_example_messy.png
new file mode 100644
index 00000000..5b04ec77
Binary files /dev/null and b/fig/01_tidiness_datasheet_example_messy.png differ
diff --git a/fig/03_acc_info.png b/fig/03_acc_info.png
new file mode 100644
index 00000000..2177d929
Binary files /dev/null and b/fig/03_acc_info.png differ
diff --git a/fig/03_ncbi_new_metadata.png b/fig/03_ncbi_new_metadata.png
new file mode 100644
index 00000000..230e01b6
Binary files /dev/null and b/fig/03_ncbi_new_metadata.png differ
diff --git a/fig/03_ncbi_new_run_browser.png b/fig/03_ncbi_new_run_browser.png
new file mode 100644
index 00000000..e3007530
Binary files /dev/null and b/fig/03_ncbi_new_run_browser.png differ
diff --git a/fig/03_ncbi_new_tables2.png b/fig/03_ncbi_new_tables2.png
new file mode 100644
index 00000000..2e91206f
Binary files /dev/null and b/fig/03_ncbi_new_tables2.png differ
diff --git a/fig/03_ncbi_new_top.png b/fig/03_ncbi_new_top.png
new file mode 100644
index 00000000..97d87f9d
Binary files /dev/null and b/fig/03_ncbi_new_top.png differ
diff --git a/fig/03_ncbi_new_top2.png b/fig/03_ncbi_new_top2.png
new file mode 100644
index 00000000..663774a8
Binary files /dev/null and b/fig/03_ncbi_new_top2.png differ
diff --git a/fig/03_ncbi_old_run_selector.png b/fig/03_ncbi_old_run_selector.png
new file mode 100644
index 00000000..9ee99404
Binary files /dev/null and b/fig/03_ncbi_old_run_selector.png differ
diff --git a/fig/03_ncbi_old_runtable_button.png b/fig/03_ncbi_old_runtable_button.png
new file mode 100644
index 00000000..7aad4abc
Binary files /dev/null and b/fig/03_ncbi_old_runtable_button.png differ
diff --git a/fig/03_ncbi_run_browser.png b/fig/03_ncbi_run_browser.png
new file mode 100644
index 00000000..970e3577
Binary files /dev/null and b/fig/03_ncbi_run_browser.png differ
diff --git a/fig/03_ncbi_send_results.png b/fig/03_ncbi_send_results.png
new file mode 100644
index 00000000..963de973
Binary files /dev/null and b/fig/03_ncbi_send_results.png differ
diff --git a/fig/03_paper_header.png b/fig/03_paper_header.png
new file mode 100644
index 00000000..4d3ed4ae
Binary files /dev/null and b/fig/03_paper_header.png differ
diff --git a/fig/2_datasheet_example.jpg b/fig/2_datasheet_example.jpg
new file mode 100644
index 00000000..00e8f53b
Binary files /dev/null and b/fig/2_datasheet_example.jpg differ
diff --git a/files/Ecoli_metadata_composite_messy.pdf b/files/Ecoli_metadata_composite_messy.pdf
new file mode 100644
index 00000000..e36f0371
Binary files /dev/null and b/files/Ecoli_metadata_composite_messy.pdf differ
diff --git a/files/Ecoli_metadata_composite_messy.xlsx b/files/Ecoli_metadata_composite_messy.xlsx
new file mode 100644
index 00000000..df7a40df
Binary files /dev/null and b/files/Ecoli_metadata_composite_messy.xlsx differ
diff --git a/files/SampleSheet_Example_clean.csv b/files/SampleSheet_Example_clean.csv
new file mode 100644
index 00000000..9f69806e
--- /dev/null
+++ b/files/SampleSheet_Example_clean.csv
@@ -0,0 +1,13 @@
+[Data],,,,,,,,,,,,,,,
+Study_ID,Study_Description,BioSample_ID,BioSample_Description,Sample_ID,Sample_Name,Sample_Owner,Index_ID,Index,Index2_ID,Index2,Organism,Host,Gender,Tissue_Source,FACS_Markers
+3T3_L1,Gene Expression Profiling in differentiating cells,1_2,Day 1 Replicate 2,Day_1_Replicate_2,Day_1_Replicate_2,Owner 1,TruSeq_Adapter_Index_2,ATACTACAGAAG,TruSeq_Adapter_Index_21,CTCAAAGTAGGG,Mouse,,,,
+3T3_L1,Gene Expression Profiling in differentiating cells,1_3,Day 1 Replicate 3,Day_1_Replicate_3,Day_1_Replicate_3,Owner 1,TruSeq_Adapter_Index_3,AACGAATCCACT,TruSeq_Adapter_Index_22,TTATGATAGTCC,Mouse,,,,
+3T3_L1,Gene Expression Profiling in differentiating cells,2_1,Day 2 Replicate 1,Day_2_Replicate_1,Day_2_Replicate_1,Owner 1,TruSeq_Adapter_Index_4,TGACAGGTAATC,TruSeq_Adapter_Index_23,TTTGGTTATTGC,Mouse,,,,
+3T3_L1,Gene Expression Profiling in differentiating cells,2_2,Day 2 Replicate 2,Day_2_Replicate_2,Day_2_Replicate_2,Owner 1,TruSeq_Adapter_Index_5,CTACATAGACCT,TruSeq_Adapter_Index_24,TTTCCCTCCGTC,Mouse,,,,
+3T3_L1,Gene Expression Profiling in differentiating cells,2_3,Day 2 Replicate 3,Day_2_Replicate_3,Day_2_Replicate_3,Owner 1,TruSeq_Adapter_Index_6,GTGCGATTTATC,TruSeq_Adapter_Index_25,GCGGTCGCTTAG,Mouse,,,,
+3T3_L1,Gene Expression Profiling in differentiating cells,4_1,Day 4 Replicate 1,Day_4_Replicate_1,Day_4_Replicate_1,Owner 1,TruSeq_Adapter_Index_7,CAGGGCGGGTGT,TruSeq_Adapter_Index_26,TCCGATGGCAGC,Mouse,,,,
+3T3_L1,Gene Expression Profiling in differentiating cells,4_2,Day 4 Replicate 2,Day_4_Replicate_2,Day_4_Replicate_2,Owner 1,TruSeq_Adapter_Index_8,TGATGCCTCGGG,TruSeq_Adapter_Index_27,TCGGTGACTACT,Mouse,,,,
+3T3_L1,Gene Expression Profiling in differentiating cells,4_3,Day 4 Replicate 3,Day_4_Replicate_3,Day_4_Replicate_3,Owner 1,TruSeq_Adapter_Index_9,TTTATTGCTTGT,TruSeq_Adapter_Index_28,TTATGTGAGAAA,Mouse,,,,
+3T3_L1,Gene Expression Profiling in differentiating cells,5_1,Day 5 Replicate 1,Day_5_Replicate_1,Day_5_Replicate_1,Owner 1,TruSeq_Adapter_Index_10,TTCTTTATGAAC,TruSeq_Adapter_Index_29,TTGGTGGGCGTG,Mouse,,,,
+3T3_L1,Gene Expression Profiling in differentiating cells,5_2,Day 5 Replicate 2,Day_5_Replicate_2,Day_5_Replicate_2,Owner 1,TruSeq_Adapter_Index_11,TACGAAGAGGCG,TruSeq_Adapter_Index_30,AAGTTCGCAGAT,Mouse,,,,
+3T3_L1,Gene Expression Profiling in differentiating cells,5_3,Day 5 Replicate 3,Day_5_Replicate_3,Day_5_Replicate_3,Owner 1,TruSeq_Adapter_Index_12,GAAATCGGCGAC,TruSeq_Adapter_Index_31,AGTTCGTGGTGG,Mouse,,,,
\ No newline at end of file
diff --git a/files/SampleSheet_Example_messy.csv b/files/SampleSheet_Example_messy.csv
new file mode 100644
index 00000000..0e072f95
--- /dev/null
+++ b/files/SampleSheet_Example_messy.csv
@@ -0,0 +1,13 @@
+[Data],,,,,,,,,,,,,,,
+Study_ID,Study_Description,BioSample_ID,BioSample_Description,Sample_ID,Sample_Name,Sample_Owner,Index_ID,Index,Index2_ID,Index2,Organism,Host,Gender,Tissue_Source,FACS_Markers
+3T3_L1,"Gene expression profiling in differentiating cells, on different days.",1&2,Day 1 Replicate 2,"Day 1, Replicate 2","Day 1, Replicate 2",Owner 1,TruSeq_Adapter_Index_2,ATACTACAGAAG,TruSeq_Adapter_Index_21,CTCAAAGTAGGG,Mouse,,,,
+3T3_L1,"Gene expression profiling in differentiating cells, on different days.",1&3,Day 1 Replicate 3,"Day 1, Replicate 3","Day 1, Replicate 3",Owner 1,TruSeq_Adapter_Index_3,AACGAATCCACT,TruSeq_Adapter_Index_22,TTATGATAGTCC,Mouse,,,,
+3T3_L1,"Gene expression profiling in differentiating cells, on different days.",2&1,Day 2 Replicate 1,"Day 2, Replicate 1","Day 2, Replicate 1",Owner 1,TruSeq_Adapter_Index_4,TGACAGGTAATC,TruSeq_Adapter_Index_23,TTTGGTTATTGC,Mouse,,,,
+3T3_L1,"Gene expression profiling in differentiating cells, on different days.",2&2,Day 2 Replicate 2,"Day 2, Replicate 2","Day 2, Replicate 2",Owner 1,TruSeq_Adapter_Index_5,CTACATAGACCT,TruSeq_Adapter_Index_24,TTTCCCTCCGTC,Mouse,,,,
+3T3_L1,"Gene expression profiling in differentiating cells, on different days.",2&3,Day 2 Replicate 3,"Day 2, Replicate 3","Day 2, Replicate 3",Owner 1,TruSeq_Adapter_Index_6,GTGCGATTTATC,TruSeq_Adapter_Index_25,GCGGTCGCTTAG,Mouse,,,,
+3T3_L1,"Gene expression profiling in differentiating cells, on different days.",4&1,Day 4 Replicate 1,"Day 4, Replicate 1","Day 4, Replicate 1",Owner 1,TruSeq_Adapter_Index_7,CAGGGCGGGTGT,TruSeq_Adapter_Index_26,TCCGATGGCAGC,Mouse,,,,
+3T3_L1,"Gene expression profiling in differentiating cells, on different days.",4&2,Day 4 Replicate 2,"Day 4, Replicate 2","Day 4, Replicate 2",Owner 1,TruSeq_Adapter_Index_8,TGATGCCTCGGG,TruSeq_Adapter_Index_27,TCGGTGACTACT,Mouse,,,,
+3T3_L1,"Gene expression profiling in differentiating cells, on different days.",4&3,Day 4 Replicate 3,"Day 4, Replicate 3","Day 4, Replicate 3",Owner 1,TruSeq_Adapter_Index_9,TTTATTGCTTGT,TruSeq_Adapter_Index_28,TTATGTGAGAAA,Mouse,,,,
+3T3_L1,"Gene expression profiling in differentiating cells, on different days.",5&1,Water Control,Water Control & Negative Control,Water Control & Negative Control,Owner 1,TruSeq_Adapter_Index_10,TTCTTTATGAAC,TruSeq_Adapter_Index_29,TTGGTGGGCGTG,Mouse,,,,
+3T3_L1,"Gene expression profiling in differentiating cells, on different days.",5&2,Water Control,Water Control & Negative Control,Water Control & Negative Control,Owner 1,TruSeq_Adapter_Index_11,TACGAAGAGGCG,TruSeq_Adapter_Index_30,AAGTTCGCAGAT,Mouse,,,,
+3T3_L1,"Gene expression profiling in differentiating cells, on different days.",5&3,Water Control,Postive Control,Positive Control,Owner 1,TruSeq_Adapter_Index_12,GAAATCGGCGAC,TruSeq_Adapter_Index_31,AGTTCGTGGTGG,Mouse,,,,
\ No newline at end of file
diff --git a/files/sample_submission.txt b/files/sample_submission.txt
new file mode 100644
index 00000000..ecd4398d
--- /dev/null
+++ b/files/sample_submission.txt
@@ -0,0 +1,97 @@
+well_position tube_barcode plate_barcode client_sample_id replicate Volume (µL) concentration (ng/µL) RIN prep_date ship_date
+A1 151017990 LP-10624 wild type 1h1 a 64.2 211.07 8.1 6-Jul-15 20-Jul
+B1 151101577 LP-10624 wild type 1h1 b 63.7 220.21 9.4 6-Jul-15 20-Jul
+C1 151142725 LP-10624 wild type 1h1 c 60.2 207.57 8.9 6-Jul-15 20-Jul
+D1 151232891 LP-10624 wild type 1h-2 A 55.8 180.62 9 6-Jul-15 20-Jul
+E1 151236606 LP-10624 wild type 1h-2 B 60.8 190.86 8.1 6-Jul-15 20-Jul
+F1 151323716 LP-10624 wild type 1h-2 C 57.5 192.97 8.6 6-Jul-15 20-Jul
+G1 151346588 LP-10624 wild type 1h-3 A 64.9 218.88 8.6 6-Jul-15 20-Jul
+H1 151423653 LP-10624 wild type 1h-3 B 62.5 173.44 8.8 6-Jul-15 20-Jul
+A2 151462684 LP-10624 wild type 1h-3 C 53.9 214.11 9.5 6-Jul-15 20-Jul
+B2 151508377 LP-10624 wild type 1h-4 A 62.4 209.63 8.1 6-Jul-15 20-Jul
+C2 151539039 LP-10624 wild type 1h-4 B 66 222.44 8.8 6-Jul-15 20-Jul
+D2 151545962 LP-10624 wild type 1h-4 C 61.5 206.27 8 6-Jul-15 20-Jul
+E2 151588038 LP-10624 wild type 1h-5 A 58.2 157.67 8.9 6-Jul-15 20-Jul
+F2 151666965 LP-10624 wild type 1h-5 B 68 206.45 8.3 6-Jul-15 20-Jul
+G2 151719126 LP-10624 wild type 1h-5 C 56.6 220.84 8.4 6-Jul-15 20-Jul
+H2 151767622 LP-10624 wild type 1h-6 A 54 179.47 8.3 6-Jul-15 20-Jul
+A3 151781088 LP-10624 wild type 1h-6 B 59.6 197.08 8.5 6-Jul-15 20-Jul
+B3 151796026 LP-10624 wild type 1h-6 C 56.8 219.34 8 6-Jul-15 20-Jul
+C3 151882778 LP-10624 wild type 1h-7 A 57.2 182.17 7.9 7-Jun-15 20-Jul
+D3 151944346 LP-10624 wildtype 1h-7 B 630.1 186.98 9.2 7-Jun-15 20-Jul
+E3 151970881 LP-10624 wildtype 1h-7 C 63.4 194.28 8.5 7-Jun-15 20-Jul
+F3 151988549 LP-10624 wild type 1h-8 A 66.1 225.71 9 6-Jul-15 20-Jul
+G3 152065746 LP-10624 wild type 1h-8 B 57.9 166.64 8.8 6-Jul-15 20-Jul
+H3 152123617 LP-10624 wild type 1h-8 C 66.4 194.22 8.4 6-Jul-15 20-Jul
+A4 152123671 LP-10624 wild type 1h-9 A 57.6 237.12 8.7 6-Jul-15 20-Jul
+B4 152198331 LP-10624 wild type 1h-9 B 58.9 224.77 9 6-Jul-15 20-Jul
+C4 152285738 LP-10624 wild type 1h-9 C 51.9 199.85 7.6 6-Jul-15 20-Jul
+D4 152346677 LP-10624 wild type 1h-10 A 63.3 179.52 9.4 6-Jul-15 20-Jul
+E4 152417492 LP-10624 wild type 1h-10 B 61.2 192.97 8.9 6-Jul-15 20-Jul
+F4 152504414 LP-10624 wild type 1h-10 C 59.3 194.48 8.3 6-Jul-15 20-Jul
+G4 152534255 LP-10624 wild type 1h-11 A 53 164.08 9.4 6-Jul-15 20-Jul
+H4 152601388 LP-10624 wild type 1h-11 B 59.4 193.95 8.9 6-Jul-15 20-Jul
+A5 152601390 LP-10624 wild type 1h-11 C 53.6 173.56 8.1 6-Jul-15 20-Jul
+B5 152605954 LP-10624 wild type 1h-12 A 57.4 197.52 8.6 6-Jul-15 20-Jul
+C5 152628849 LP-10624 wild type 1h-12 B 68.1 189.44 9.1 6-Jul-15 20-Jul
+D5 152712999 LP-10624 wild type 1h-12 C 54.4 170.9 9.4 6-Jul-15 20-Jul
+E5 152768132 LP-10624 wild type 1h-13 A 61.7 209.67 8.2 6-Jul-15 20-Jul
+F5 152811001 LP-10624 wild type 1h-13 B 62.9 217.76 8.1 6-Jul-15 20-Jul
+G5 152907755 LP-10624 wild type 1h-13 C 60.9 171 8.8 6-Jul-15 20-Jul
+H5 153005304 LP-10624 wild type 1h-14 A 64.2 213.42 7.3 6-Jul-15 20-Jul
+A6 153016225 LP-10624 wild type 1h-14 B 54.7 190.99 7.7 6-Jul-15 20-Jul
+B6 153068500 LP-10624 wild type 1h-14 C 56.6 233.84 8.2 6-Jul-15 20-Jul
+C6 153072132 LP-10624 wild type 1h-15 A 59.8 200.45 8.9 6-Jul-15 20-Jul
+D6 153101681 LP-10624 wild type 1h-15 B 56.1 175.29 8.3 6-Jul-15 20-Jul
+E6 153185446 LP-10624 wild type 1h-15 C 63.1 185.36 8.7 6-Jul-15 20-Jul
+F6 153260940 LP-10624 wild type 1h-16 A 56.7 212.8 8.1 6-Jul-15 20-Jul
+G6 153355386 LP-10624 wild type 1h-16 B 54.6 200.15 8.4 6-Jul-15 20-Jul
+H6 153378044 LP-10624 wild type 1h-16 C 59.8 218.49 8.3 6-Jul-15 20-Jul
+A7 153395738 LP-10624 k255N_1h-1 A 61.7 176.24 9 7/8/15 20-Jul
+B7 153488303 LP-10624 k255N_1h-1 B 57.6 201.22 8.7 7/8/15 20-Jul
+C7 153494132 LP-10624 k255N_1h-1 C 62.5 196.93 8.1 7/8/15 20-Jul
+D7 153539022 LP-10624 k255M_1h-2 A 64.8 197.46 8.4 7/8/15 20-Jul
+E7 153548916 LP-10624 k255M_1h-2 B 57.5 188.13 8.2 7/8/15 20-Jul
+F7 153599270 LP-10624 k255M_1h-2 C 59.2 230.02 8.6 7/8/15 20-Jul
+G7 153697489 LP-10624 k255N_1h-3 A 57.7 189.79 8.9 7/8/15 20-Jul
+H7 153762036 LP-10624 k255N_1h-3 B 59.8 202.37 8.3 7/8/15 20-Jul
+A8 153807929 LP-10624 k255N_1h-3 C 58.5 208.35 8.7 7/8/15 20-Jul
+B8 153830049 LP-10624 k255N_1h-4 A 63.9 186.9 8.5 7/8/15 20-Jul
+C8 153862046 LP-10624 k255N_1h-4 B 66.3 158.63 8.4 7/8/15 20-Jul
+D8 153907755 LP-10624 k255N_1h-4 C 58.8 161.09 8.6 7/8/15 20-Jul
+E8 153928847 LP-10624 k255N_1h-5 A 58.3 204.73 8.9 7/8/15 20-Jul
+F8 153946500 LP-10624 k255N_1h-5 B 59 184.21 7.8 7/8/15 20-Jul
+G8 153998950 LP-10624 k255N_1h-5 C 63.5 225.36 8.7 7/8/15 20-Jul
+H8 154084806 LP-10624 k255N_1h-6 A 62.2 98.46 6.8 7/8/15 20-Jul
+A9 154140578 LP-10624 k255N_1h-6 B 54.2 18.98 6.8 7/8/15 20-Jul
+B9 154161941 LP-10624 k255N_1h-6 C 54.8 15.82 5.6 7/8/15 20-Jul
+C9 154197341 LP-10624 k255N_1h-7 A 57.9 176.3 8.2 7/8/15 20-Jul
+D9 154243529 LP-10624 k255N_1h-77 B 66.5 193.3 8 7/8/15 20-Jul
+E9 154300938 LP-10624 k255N_1h-7 C 63 217.95 8.3 7/8/15 20-Jul
+F9 154314067 LP-10624 k255N_1h-8 A 61.2 217.17 8.9 7/8/15 20-Jul
+G9 154407877 LP-10624 k255N_1h-8 B 59.3 181 8.7 7/8/15 20-Jul
+H9 154423297 LP-10624 k255N_1h-8 C 60.7 208.92 9.5 7/8/15 20-Jul
+A10 154511591 LP-10624 k255N_1h-9 A 60.7 170.67 8.1 7/8/15 20-Jul
+B10 154516528 LP-10624 k255N_1h-9 B 61.4 206.11 9 7/8/15 20-Jul
+C10 154529002 LP-10624 k255N_1h-9 C 59.5 192.2 8.1 7/8/15 20-Jul
+D10 154544444 LP-10624 k255N_1h-10 A 62.3 207.34 7.9 7/8/15 20-Jul
+E10 154570812 LP-10624 k255N_1h-10 B 63.4 196.96 8.1 7/8/15 20-Jul
+F10 154572077 LP-10624 k255N_1h-10 C 66.4 207.62 8.6 7/8/15 20-Jul
+G10 154670025 LP-10624 k255N_1h-11 A 60.7 212.68 8.6 7/8/15 20-Jul
+H10 154688043 LP-10624 k255N_1h-11 B 55.5 202.99 8.9 7/8/15 20-Jul
+A11 154708451 LP-10624 k255N_1h-11 C 58.8 212.33 8.5 7/8/15 20-Jul
+B11 154734108 LP-10624 k255N_1h-12 A 56.6 189.1 8.6 7/8/15 20-Jul
+C11 154781404 LP-10624 k255N_1h-12 B 61 233.74 8.2 7/8/15 20-Jul
+D11 154853271 LP-10624 k255N_1h-12 C 61.1 221.11 9.6 7/8/15 20-Jul
+E11 154936145 LP-10624 k255N_1h-13 A 57.4 201.63 8.8 7/8/15 20-Jul
+F11 154988540 LP-10624 k255N_1h-13 B 62.1 202.21 8.5 7/8/15 20-Jul
+G11 155057129 LP-10624 k255N_1h-13 C 58.4 199.56 8.3 7/8/15 20-Jul
+H11 155087342 LP-10624 k255N_1h-14 A 52.6 210.89 9 7/8/15 20-Jul
+A12 155185967 LP-10624 k255N_1h-14 B 58.9 172.12 8.6 7/8/15 20-Jul
+B12 155192028 LP-10624 k255N_1h-14 C 57.2 207.28 8.5 7/8/15 20-Jul
+C12 155285966 LP-10624 k255N_1h-15 A 57.2 169.6 8.7 7/8/15 20-Jul
+D12 155350639 LP-10624 k255N_1h-15 B 52.5 185.97 8.6 7/8/15 20-Jul
+E12 155426989 LP-10624 k255N_1h-15 C 59.6 179.59 8.3 7/8/15 20-Jul
+F12 155436477 LP-10624 k255N_1h-16 A 63.5 204.78 7.6 7/8/15 20-Jul
+G12 155526790 LP-10624 k255N_1h-16 B 61.5 191.81 8.4 7/8/15 20-Jul
+H12 155537812 LP-10624 k255N_1h-16 C 0.5 190.04 8.9 7/8/15 20-Jul
\ No newline at end of file
diff --git a/files/sequencing_results_metadata.txt b/files/sequencing_results_metadata.txt
new file mode 100644
index 00000000..0465e8ce
--- /dev/null
+++ b/files/sequencing_results_metadata.txt
@@ -0,0 +1,193 @@
+sample_id seq_platform sequencing layout barcode number_of_reads rRNA_rate(%) filename file_size(gb)
+151017990 ILLUMINA RNA-Seq PE GTTAAG 5469882 3.37 151017990_GTTAAG_ACA4RRCXX_R1.fastq.gz 5.77
+151017990 ILLUMINA RNA-Seq PE GTTAAG 5469882 3.37 151017990_GTTAAG_ACA4RRCXX_R2.fastq.gz 5.77
+151101577 ILLUMINA RNA-Seq PE AAATTG 5789648 2.41 151101577_AAATTG_ACA4RRCXX_R1.fastq.gz 6.09
+151101577 ILLUMINA RNA-Seq PE AAATTG 5789648 2.41 151101577_AAATTG_ACA4RRCXX_R2.fastq.gz 6.09
+151142725 ILLUMINA RNA-Seq PE TGCTAG 5043882 3.08 151142725_TGCTAG_ACA4RRCXX_R1.fastq.gz 5.34
+151142725 ILLUMINA RNA-Seq PE TGCTAG 5043882 3.08 151142725_TGCTAG_ACA4RRCXX_R2.fastq.gz 5.34
+151232891 ILLUMINA RNA-Seq PE CCCCCT 5977039 2.74 151232891_CCCCCT_ACA4RRCXX_R1.fastq.gz 6.28
+151232891 ILLUMINA RNA-Seq PE CCCCCT 5977039 2.74 151232891_CCCCCT_ACA4RRCXX_R2.fastq.gz 6.28
+151236606 ILLUMINA RNA-Seq PE ATGGCC 5771384 1.81 151236606_ATGGCC_ACA4RRCXX_R1.fastq.gz 6.07
+151236606 ILLUMINA RNA-Seq PE ATGGCC 5771384 1.81 151236606_ATGGCC_ACA4RRCXX_R2.fastq.gz 6.07
+151323716 ILLUMINA RNA-Seq PE TCTTTA 5112674 2.01 151323716_TCTTTA_ACA4RRCXX_R1.fastq.gz 5.41
+151323716 ILLUMINA RNA-Seq PE TCTTTA 5112674 2.01 151323716_TCTTTA_ACA4RRCXX_R2.fastq.gz 5.41
+151346588 ILLUMINA RNA-Seq PE CTGAAG 5224770 2.69 151346588_CTGAAG_ACA4RRCXX_R1.fastq.gz 5.52
+151346588 ILLUMINA RNA-Seq PE CTGAAG 5224770 2.69 151346588_CTGAAG_ACA4RRCXX_R2.fastq.gz 5.52
+151423653 ILLUMINA RNA-Seq PE CGAGGG 5382850 3.72 151423653_CGAGGG_ACA4RRCXX_R1.fastq.gz 5.68
+151423653 ILLUMINA RNA-Seq PE CGAGGG 5382850 3.72 151423653_CGAGGG_ACA4RRCXX_R2.fastq.gz 5.68
+151462684 ILLUMINA RNA-Seq PE GAGGGT 5202728 4.7 151462684_GAGGGT_ACA4RRCXX_R1.fastq.gz 5.50
+151462684 ILLUMINA RNA-Seq PE GAGGGT 5202728 4.7 151462684_GAGGGT_ACA4RRCXX_R2.fastq.gz 5.50
+151508377 ILLUMINA RNA-Seq PE CAGCGC 5484301 3.73 151508377_CAGCGC_ACA4RRCXX_R1.fastq.gz 5.78
+151508377 ILLUMINA RNA-Seq PE CAGCGC 5484301 3.73 151508377_CAGCGC_ACA4RRCXX_R2.fastq.gz 5.78
+151539039 ILLUMINA RNA-Seq PE TTTTAA 5370524 3.64 151539039_TTTTAA_ACA4RRCXX_R1.fastq.gz 5.67
+151539039 ILLUMINA RNA-Seq PE TTTTAA 5370524 3.64 151539039_TTTTAA_ACA4RRCXX_R2.fastq.gz 5.67
+151545962 ILLUMINA RNA-Seq PE TGCTCC 5792457 2.33 151545962_TGCTCC_ACA4RRCXX_R1.fastq.gz 6.09
+151545962 ILLUMINA RNA-Seq PE TGCTCC 5792457 2.33 151545962_TGCTCC_ACA4RRCXX_R2.fastq.gz 6.09
+151588038 ILLUMINA RNA-Seq PE AACCGG 5072470 3.25 151588038_AACCGG_ACA4RRCXX_R1.fastq.gz 5.37
+151588038 ILLUMINA RNA-Seq PE AACCGG 5072470 3.25 151588038_AACCGG_ACA4RRCXX_R2.fastq.gz 5.37
+151666965 ILLUMINA RNA-Seq PE ATACCT 5430767 1.65 151666965_ATACCT_ACA4RRCXX_R1.fastq.gz 5.73
+151666965 ILLUMINA RNA-Seq PE ATACCT 5430767 1.65 151666965_ATACCT_ACA4RRCXX_R2.fastq.gz 5.73
+151719126 ILLUMINA RNA-Seq PE GTGGGA 5549234 4.15 151719126_GTGGGA_ACA4RRCXX_R1.fastq.gz 5.85
+151719126 ILLUMINA RNA-Seq PE GTGGGA 5549234 4.15 151719126_GTGGGA_ACA4RRCXX_R2.fastq.gz 5.85
+151767622 ILLUMINA RNA-Seq PE TTGAGT 5894815 2.06 151767622_TTGAGT_ACA4RRCXX_R1.fastq.gz 6.19
+151767622 ILLUMINA RNA-Seq PE TTGAGT 5894815 2.06 151767622_TTGAGT_ACA4RRCXX_R2.fastq.gz 6.19
+151781088 ILLUMINA RNA-Seq PE GGAATA 5554950 3.5 151781088_GGAATA_ACA4RRCXX_R1.fastq.gz 5.85
+151781088 ILLUMINA RNA-Seq PE GGAATA 5554950 3.5 151781088_GGAATA_ACA4RRCXX_R2.fastq.gz 5.85
+151796026 ILLUMINA RNA-Seq PE TCCAGG 5819498 2.32 151796026_TCCAGG_ACA4RRCXX_R1.fastq.gz 6.12
+151796026 ILLUMINA RNA-Seq PE TCCAGG 5819498 2.32 151796026_TCCAGG_ACA4RRCXX_R2.fastq.gz 6.12
+151882778 ILLUMINA RNA-Seq PE CTCCTC 5550894 1.2 151882778_CTCCTC_ACA4RRCXX_R1.fastq.gz 5.85
+151882778 ILLUMINA RNA-Seq PE CTCCTC 5550894 1.2 151882778_CTCCTC_ACA4RRCXX_R2.fastq.gz 5.85
+151944346 ILLUMINA RNA-Seq PE TGAGGC 5194294 3.57 151944346_TGAGGC_ACA4RRCXX_R1.fastq.gz 5.49
+151944346 ILLUMINA RNA-Seq PE TGAGGC 5194294 3.57 151944346_TGAGGC_ACA4RRCXX_R2.fastq.gz 5.49
+151970881 ILLUMINA RNA-Seq PE GGGTTT 5287298 2.38 151970881_GGGTTT_ACA4RRCXX_R1.fastq.gz 5.59
+151970881 ILLUMINA RNA-Seq PE GGGTTT 5287298 2.38 151970881_GGGTTT_ACA4RRCXX_R2.fastq.gz 5.59
+151988549 ILLUMINA RNA-Seq PE TTGTAC 5272721 2.43 151988549_TTGTAC_ACA4RRCXX_R1.fastq.gz 5.57
+151988549 ILLUMINA RNA-Seq PE TTGTAC 5272721 2.43 151988549_TTGTAC_ACA4RRCXX_R2.fastq.gz 5.57
+152065746 ILLUMINA RNA-Seq PE ACGTCT 5103515 2.75 152065746_ACGTCT_ACA4RRCXX_R1.fastq.gz 5.40
+152065746 ILLUMINA RNA-Seq PE ACGTCT 5103515 2.75 152065746_ACGTCT_ACA4RRCXX_R2.fastq.gz 5.40
+152123617 ILLUMINA RNA-Seq PE CTGGTT 5618765 1.87 152123617_CTGGTT_ACA4RRCXX_R1.fastq.gz 5.92
+152123617 ILLUMINA RNA-Seq PE CTGGTT 5618765 1.87 152123617_CTGGTT_ACA4RRCXX_R2.fastq.gz 5.92
+152123671 ILLUMINA RNA-Seq PE TCTAAT 5798694 2.55 152123671_TCTAAT_ACA4RRCXX_R1.fastq.gz 6.10
+152123671 ILLUMINA RNA-Seq PE TCTAAT 5798694 2.55 152123671_TCTAAT_ACA4RRCXX_R2.fastq.gz 6.10
+152198331 ILLUMINA RNA-Seq PE GGGTTA 5148671 2.37 152198331_GGGTTA_ACA4RRCXX_R1.fastq.gz 5.45
+152198331 ILLUMINA RNA-Seq PE GGGTTA 5148671 2.37 152198331_GGGTTA_ACA4RRCXX_R2.fastq.gz 5.45
+152285738 ILLUMINA RNA-Seq PE CCCTTC 5901034 3.83 152285738_CCCTTC_ACA4RRCXX_R1.fastq.gz 6.20
+152285738 ILLUMINA RNA-Seq PE CCCTTC 5901034 3.83 152285738_CCCTTC_ACA4RRCXX_R2.fastq.gz 6.20
+152346677 ILLUMINA RNA-Seq PE GAATTT 5326258 4.71 152346677_GAATTT_ACA4RRCXX_R1.fastq.gz 5.63
+152346677 ILLUMINA RNA-Seq PE GAATTT 5326258 4.71 152346677_GAATTT_ACA4RRCXX_R2.fastq.gz 5.63
+152417492 ILLUMINA RNA-Seq PE ACCCTT 5919299 3.02 152417492_ACCCTT_ACA4RRCXX_R1.fastq.gz 6.22
+152417492 ILLUMINA RNA-Seq PE ACCCTT 5919299 3.02 152417492_ACCCTT_ACA4RRCXX_R2.fastq.gz 6.22
+152504414 ILLUMINA RNA-Seq PE CCCCAT 5031262 3.21 152504414_CCCCAT_ACA4RRCXX_R1.fastq.gz 5.33
+152504414 ILLUMINA RNA-Seq PE CCCCAT 5031262 3.21 152504414_CCCCAT_ACA4RRCXX_R2.fastq.gz 5.33
+152534255 ILLUMINA RNA-Seq PE CTATGT 5640965 4.61 152534255_CTATGT_ACA4RRCXX_R1.fastq.gz 5.94
+152534255 ILLUMINA RNA-Seq PE CTATGT 5640965 4.61 152534255_CTATGT_ACA4RRCXX_R2.fastq.gz 5.94
+152601388 ILLUMINA RNA-Seq PE AGCACA 5387573 2.92 152601388_AGCACA_ACA4RRCXX_R1.fastq.gz 5.69
+152601388 ILLUMINA RNA-Seq PE AGCACA 5387573 2.92 152601388_AGCACA_ACA4RRCXX_R2.fastq.gz 5.69
+152601390 ILLUMINA RNA-Seq PE AAAGGT 5793187 4.81 152601390_AAAGGT_ACA4RRCXX_R1.fastq.gz 6.09
+152601390 ILLUMINA RNA-Seq PE AAAGGT 5793187 4.81 152601390_AAAGGT_ACA4RRCXX_R2.fastq.gz 6.09
+152605954 ILLUMINA RNA-Seq PE GATACT 5659129 3.07 152605954_GATACT_ACA4RRCXX_R1.fastq.gz 5.96
+152605954 ILLUMINA RNA-Seq PE GATACT 5659129 3.07 152605954_GATACT_ACA4RRCXX_R2.fastq.gz 5.96
+152628849 ILLUMINA RNA-Seq PE TTGAAA 5817189 1.73 152628849_TTGAAA_ACA4RRCXX_R1.fastq.gz 6.12
+152628849 ILLUMINA RNA-Seq PE TTGAAA 5817189 1.73 152628849_TTGAAA_ACA4RRCXX_R2.fastq.gz 6.12
+152712999 ILLUMINA RNA-Seq PE CAATAC 5839874 3.91 152712999_CAATAC_ACA4RRCXX_R1.fastq.gz 6.14
+152712999 ILLUMINA RNA-Seq PE CAATAC 5839874 3.91 152712999_CAATAC_ACA4RRCXX_R2.fastq.gz 6.14
+152768132 ILLUMINA RNA-Seq PE CAGCCG 5850150 1.84 152768132_CAGCCG_ACA4RRCXX_R1.fastq.gz 6.15
+152768132 ILLUMINA RNA-Seq PE CAGCCG 5850150 1.84 152768132_CAGCCG_ACA4RRCXX_R2.fastq.gz 6.15
+152811001 ILLUMINA RNA-Seq PE TGTGGG 5689755 3.2 152811001_TGTGGG_ACA4RRCXX_R1.fastq.gz 5.99
+152811001 ILLUMINA RNA-Seq PE TGTGGG 5689755 3.2 152811001_TGTGGG_ACA4RRCXX_R2.fastq.gz 5.99
+152907755 ILLUMINA RNA-Seq PE GCCTGG 5059831 2.07 152907755_GCCTGG_ACA4RRCXX_R1.fastq.gz 5.36
+152907755 ILLUMINA RNA-Seq PE GCCTGG 5059831 2.07 152907755_GCCTGG_ACA4RRCXX_R2.fastq.gz 5.36
+153005304 ILLUMINA RNA-Seq PE CGAGAG 5391345 2.95 153005304_CGAGAG_ACA4RRCXX_R1.fastq.gz 5.69
+153005304 ILLUMINA RNA-Seq PE CGAGAG 5391345 2.95 153005304_CGAGAG_ACA4RRCXX_R2.fastq.gz 5.69
+153016225 ILLUMINA RNA-Seq PE CGACCA 5920348 2.87 153016225_CGACCA_ACA4RRCXX_R1.fastq.gz 6.22
+153016225 ILLUMINA RNA-Seq PE CGACCA 5920348 2.87 153016225_CGACCA_ACA4RRCXX_R2.fastq.gz 6.22
+153068500 ILLUMINA RNA-Seq PE CCCCGT 5254088 2.66 153068500_CCCCGT_ACA4RRCXX_R1.fastq.gz 5.55
+153068500 ILLUMINA RNA-Seq PE CCCCGT 5254088 2.66 153068500_CCCCGT_ACA4RRCXX_R2.fastq.gz 5.55
+153072132 ILLUMINA RNA-Seq PE CCACTA 5599987 3.85 153072132_CCACTA_ACA4RRCXX_R1.fastq.gz 5.90
+153072132 ILLUMINA RNA-Seq PE CCACTA 5599987 3.85 153072132_CCACTA_ACA4RRCXX_R2.fastq.gz 5.90
+153101681 ILLUMINA RNA-Seq PE GCGGGT 5552118 3.85 153101681_GCGGGT_ACA4RRCXX_R1.fastq.gz 5.85
+153101681 ILLUMINA RNA-Seq PE GCGGGT 5552118 3.85 153101681_GCGGGT_ACA4RRCXX_R2.fastq.gz 5.85
+153185446 ILLUMINA RNA-Seq PE TCATGA 5673384 4.47 153185446_TCATGA_ACA4RRCXX_R1.fastq.gz 5.97
+153185446 ILLUMINA RNA-Seq PE TCATGA 5673384 4.47 153185446_TCATGA_ACA4RRCXX_R2.fastq.gz 5.97
+153260940 ILLUMINA RNA-Seq PE CACAGG 5572657 4.17 153260940_CACAGG_ACA4RRCXX_R1.fastq.gz 5.87
+153260940 ILLUMINA RNA-Seq PE CACAGG 5572657 4.17 153260940_CACAGG_ACA4RRCXX_R2.fastq.gz 5.87
+153355386 ILLUMINA RNA-Seq PE TTCGTG 5351718 3.97 153355386_TTCGTG_ACA4RRCXX_R1.fastq.gz 5.65
+153355386 ILLUMINA RNA-Seq PE TTCGTG 5351718 3.97 153355386_TTCGTG_ACA4RRCXX_R2.fastq.gz 5.65
+153378044 ILLUMINA RNA-Seq PE ACCAGG 5446860 3.4 153378044_ACCAGG_ACA4RRCXX_R1.fastq.gz 5.75
+153378044 ILLUMINA RNA-Seq PE ACCAGG 5446860 3.4 153378044_ACCAGG_ACA4RRCXX_R2.fastq.gz 5.75
+153395738 ILLUMINA RNA-Seq PE ATTAGC 5948661 2.98 153395738_ATTAGC_ACA4RRCXX_R1.fastq.gz 6.25
+153395738 ILLUMINA RNA-Seq PE ATTAGC 5948661 2.98 153395738_ATTAGC_ACA4RRCXX_R2.fastq.gz 6.25
+153488303 ILLUMINA RNA-Seq PE GATCTT 5200472 3.87 153488303_GATCTT_ACA4RRCXX_R1.fastq.gz 5.50
+153488303 ILLUMINA RNA-Seq PE GATCTT 5200472 3.87 153488303_GATCTT_ACA4RRCXX_R2.fastq.gz 5.50
+153494132 ILLUMINA RNA-Seq PE CCCCAC 5740352 2.73 153494132_CCCCAC_ACA4RRCXX_R1.fastq.gz 6.04
+153494132 ILLUMINA RNA-Seq PE CCCCAC 5740352 2.73 153494132_CCCCAC_ACA4RRCXX_R2.fastq.gz 6.04
+153539022 ILLUMINA RNA-Seq PE AGGCAA 5039386 3.34 153539022_AGGCAA_ACA4RRCXX_R1.fastq.gz 5.34
+153539022 ILLUMINA RNA-Seq PE AGGCAA 5039386 3.34 153539022_AGGCAA_ACA4RRCXX_R2.fastq.gz 5.34
+153548916 ILLUMINA RNA-Seq PE AATGGT 5028110 2.78 153548916_AATGGT_ACA4RRCXX_R1.fastq.gz 5.33
+153548916 ILLUMINA RNA-Seq PE AATGGT 5028110 2.78 153548916_AATGGT_ACA4RRCXX_R2.fastq.gz 5.33
+153599270 ILLUMINA RNA-Seq PE AACGCG 5859470 2.77 153599270_AACGCG_ACA4RRCXX_R1.fastq.gz 6.16
+153599270 ILLUMINA RNA-Seq PE AACGCG 5859470 2.77 153599270_AACGCG_ACA4RRCXX_R2.fastq.gz 6.16
+153697489 ILLUMINA RNA-Seq PE AGACTA 5372500 3.4 153697489_AGACTA_ACA4RRCXX_R1.fastq.gz 5.67
+153697489 ILLUMINA RNA-Seq PE AGACTA 5372500 3.4 153697489_AGACTA_ACA4RRCXX_R2.fastq.gz 5.67
+153762036 ILLUMINA RNA-Seq PE CGAGTC 5063246 3.44 153762036_CGAGTC_ACA4RRCXX_R1.fastq.gz 5.36
+153762036 ILLUMINA RNA-Seq PE CGAGTC 5063246 3.44 153762036_CGAGTC_ACA4RRCXX_R2.fastq.gz 5.36
+153807929 ILLUMINA RNA-Seq PE CGTCTG 5317406 3.52 153807929_CGTCTG_ACA4RRCXX_R1.fastq.gz 5.62
+153807929 ILLUMINA RNA-Seq PE CGTCTG 5317406 3.52 153807929_CGTCTG_ACA4RRCXX_R2.fastq.gz 5.62
+153830049 ILLUMINA RNA-Seq PE GATGCG 5595130 2.51 153830049_GATGCG_ACA4RRCXX_R1.fastq.gz 5.90
+153830049 ILLUMINA RNA-Seq PE GATGCG 5595130 2.51 153830049_GATGCG_ACA4RRCXX_R2.fastq.gz 5.90
+153862046 ILLUMINA RNA-Seq PE ACGGTG 5766952 3.39 153862046_ACGGTG_ACA4RRCXX_R1.fastq.gz 6.07
+153862046 ILLUMINA RNA-Seq PE ACGGTG 5766952 3.39 153862046_ACGGTG_ACA4RRCXX_R2.fastq.gz 6.07
+153907755 ILLUMINA RNA-Seq PE AGCCTT 5490989 3.99 153907755_AGCCTT_ACA4RRCXX_R1.fastq.gz 5.79
+153907755 ILLUMINA RNA-Seq PE AGCCTT 5490989 3.99 153907755_AGCCTT_ACA4RRCXX_R2.fastq.gz 5.79
+153928847 ILLUMINA RNA-Seq PE CATGAA 5280028 3.08 153928847_CATGAA_ACA4RRCXX_R1.fastq.gz 5.58
+153928847 ILLUMINA RNA-Seq PE CATGAA 5280028 3.08 153928847_CATGAA_ACA4RRCXX_R2.fastq.gz 5.58
+153946500 ILLUMINA RNA-Seq PE TACACG 5757118 2.32 153946500_TACACG_ACA4RRCXX_R1.fastq.gz 6.06
+153946500 ILLUMINA RNA-Seq PE TACACG 5757118 2.32 153946500_TACACG_ACA4RRCXX_R2.fastq.gz 6.06
+153998950 ILLUMINA RNA-Seq PE AGTCGT 5130755 1.95 153998950_AGTCGT_ACA4RRCXX_R1.fastq.gz 5.43
+153998950 ILLUMINA RNA-Seq PE AGTCGT 5130755 1.95 153998950_AGTCGT_ACA4RRCXX_R2.fastq.gz 5.43
+154084806 ILLUMINA RNA-Seq PE TGACGC 5334082 2.49 154084806_TGACGC_ACA4RRCXX_R1.fastq.gz 5.63
+154084806 ILLUMINA RNA-Seq PE TGACGC 5334082 2.49 154084806_TGACGC_ACA4RRCXX_R2.fastq.gz 5.63
+154140578 ILLUMINA RNA-Seq PE CGTGAC 5816112 2.37 154140578_CGTGAC_ACA4RRCXX_R1.fastq.gz 6.12
+154140578 ILLUMINA RNA-Seq PE CGTGAC 5816112 2.37 154140578_CGTGAC_ACA4RRCXX_R2.fastq.gz 6.12
+154161941 ILLUMINA RNA-Seq PE GCCCGG 5307084 3.17 154161941_GCCCGG_ACA4RRCXX_R1.fastq.gz 5.61
+154161941 ILLUMINA RNA-Seq PE GCCCGG 5307084 3.17 154161941_GCCCGG_ACA4RRCXX_R2.fastq.gz 5.61
+154197341 ILLUMINA RNA-Seq PE CAACAT 5197431 1.82 154197341_CAACAT_ACA4RRCXX_R1.fastq.gz 5.50
+154197341 ILLUMINA RNA-Seq PE CAACAT 5197431 1.82 154197341_CAACAT_ACA4RRCXX_R2.fastq.gz 5.50
+154243529 ILLUMINA RNA-Seq PE CCCAAA 5576137 2.11 154243529_CCCAAA_ACA4RRCXX_R1.fastq.gz 5.88
+154243529 ILLUMINA RNA-Seq PE CCCAAA 5576137 2.11 154243529_CCCAAA_ACA4RRCXX_R2.fastq.gz 5.88
+154300938 ILLUMINA RNA-Seq PE GCTAGT 5300084 3.72 154300938_GCTAGT_ACA4RRCXX_R1.fastq.gz 5.60
+154300938 ILLUMINA RNA-Seq PE GCTAGT 5300084 3.72 154300938_GCTAGT_ACA4RRCXX_R2.fastq.gz 5.60
+154314067 ILLUMINA RNA-Seq PE TCACGC 5645685 1.74 154314067_TCACGC_ACA4RRCXX_R1.fastq.gz 5.95
+154314067 ILLUMINA RNA-Seq PE TCACGC 5645685 1.74 154314067_TCACGC_ACA4RRCXX_R2.fastq.gz 5.95
+154407877 ILLUMINA RNA-Seq PE GTAATA 5014452 1.94 154407877_GTAATA_ACA4RRCXX_R1.fastq.gz 5.31
+154407877 ILLUMINA RNA-Seq PE GTAATA 5014452 1.94 154407877_GTAATA_ACA4RRCXX_R2.fastq.gz 5.31
+154423297 ILLUMINA RNA-Seq PE AACACC 5750083 1.36 154423297_AACACC_ACA4RRCXX_R1.fastq.gz 6.05
+154423297 ILLUMINA RNA-Seq PE AACACC 5750083 1.36 154423297_AACACC_ACA4RRCXX_R2.fastq.gz 6.05
+154511591 ILLUMINA RNA-Seq PE GCTTGA 5480978 3.1 154511591_GCTTGA_ACA4RRCXX_R1.fastq.gz 5.78
+154511591 ILLUMINA RNA-Seq PE GCTTGA 5480978 3.1 154511591_GCTTGA_ACA4RRCXX_R2.fastq.gz 5.78
+154516528 ILLUMINA RNA-Seq PE TTTACT 5602873 2.02 154516528_TTTACT_ACA4RRCXX_R1.fastq.gz 5.90
+154516528 ILLUMINA RNA-Seq PE TTTACT 5602873 2.02 154516528_TTTACT_ACA4RRCXX_R2.fastq.gz 5.90
+154529002 ILLUMINA RNA-Seq PE AACGAG 5700829 3.18 154529002_AACGAG_ACA4RRCXX_R1.fastq.gz 6.00
+154529002 ILLUMINA RNA-Seq PE AACGAG 5700829 3.18 154529002_AACGAG_ACA4RRCXX_R2.fastq.gz 6.00
+154544444 ILLUMINA RNA-Seq PE TCTATA 5800069 2.75 154544444_TCTATA_ACA4RRCXX_R1.fastq.gz 6.10
+154544444 ILLUMINA RNA-Seq PE TCTATA 5800069 2.75 154544444_TCTATA_ACA4RRCXX_R2.fastq.gz 6.10
+154570812 ILLUMINA RNA-Seq PE GCTTCA 5214335 3.39 154570812_GCTTCA_ACA4RRCXX_R1.fastq.gz 5.51
+154570812 ILLUMINA RNA-Seq PE GCTTCA 5214335 3.39 154570812_GCTTCA_ACA4RRCXX_R2.fastq.gz 5.51
+154572077 ILLUMINA RNA-Seq PE TTTGCC 5294203 3.14 154572077_TTTGCC_ACA4RRCXX_R1.fastq.gz 5.59
+154572077 ILLUMINA RNA-Seq PE TTTGCC 5294203 3.14 154572077_TTTGCC_ACA4RRCXX_R2.fastq.gz 5.59
+154670025 ILLUMINA RNA-Seq PE TGGTGA 5715935 3.86 154670025_TGGTGA_ACA4RRCXX_R1.fastq.gz 6.02
+154670025 ILLUMINA RNA-Seq PE TGGTGA 5715935 3.86 154670025_TGGTGA_ACA4RRCXX_R2.fastq.gz 6.02
+154688043 ILLUMINA RNA-Seq PE ACGAAA 5762474 2.33 154688043_ACGAAA_ACA4RRCXX_R1.fastq.gz 6.06
+154688043 ILLUMINA RNA-Seq PE ACGAAA 5762474 2.33 154688043_ACGAAA_ACA4RRCXX_R2.fastq.gz 6.06
+154708451 ILLUMINA RNA-Seq PE TCGGAA 5370555 3.24 154708451_TCGGAA_ACA4RRCXX_R1.fastq.gz 5.67
+154708451 ILLUMINA RNA-Seq PE TCGGAA 5370555 3.24 154708451_TCGGAA_ACA4RRCXX_R2.fastq.gz 5.67
+154734108 ILLUMINA RNA-Seq PE GTTCTT 5209055 2.73 154734108_GTTCTT_ACA4RRCXX_R1.fastq.gz 5.51
+154734108 ILLUMINA RNA-Seq PE GTTCTT 5209055 2.73 154734108_GTTCTT_ACA4RRCXX_R2.fastq.gz 5.51
+154781404 ILLUMINA RNA-Seq PE CTCGCC 5171969 3.31 154781404_CTCGCC_ACA4RRCXX_R1.fastq.gz 5.47
+154781404 ILLUMINA RNA-Seq PE CTCGCC 5171969 3.31 154781404_CTCGCC_ACA4RRCXX_R2.fastq.gz 5.47
+154853271 ILLUMINA RNA-Seq PE GAACAA 5353787 1.13 154853271_GAACAA_ACA4RRCXX_R1.fastq.gz 5.65
+154853271 ILLUMINA RNA-Seq PE GAACAA 5353787 1.13 154853271_GAACAA_ACA4RRCXX_R2.fastq.gz 5.65
+154936145 ILLUMINA RNA-Seq PE GGGGGC 5491229 3.2 154936145_GGGGGC_ACA4RRCXX_R1.fastq.gz 5.79
+154936145 ILLUMINA RNA-Seq PE GGGGGC 5491229 3.2 154936145_GGGGGC_ACA4RRCXX_R2.fastq.gz 5.79
+154988540 ILLUMINA RNA-Seq PE TTGAAT 5548455 2.65 154988540_TTGAAT_ACA4RRCXX_R1.fastq.gz 5.85
+154988540 ILLUMINA RNA-Seq PE TTGAAT 5548455 2.65 154988540_TTGAAT_ACA4RRCXX_R2.fastq.gz 5.85
+155057129 ILLUMINA RNA-Seq PE GGAAGC 5451072 3 155057129_GGAAGC_ACA4RRCXX_R1.fastq.gz 5.75
+155057129 ILLUMINA RNA-Seq PE GGAAGC 5451072 3 155057129_GGAAGC_ACA4RRCXX_R2.fastq.gz 5.75
+155087342 ILLUMINA RNA-Seq PE CATTGC 5243239 3.38 155087342_CATTGC_ACA4RRCXX_R1.fastq.gz 5.54
+155087342 ILLUMINA RNA-Seq PE CATTGC 5243239 3.38 155087342_CATTGC_ACA4RRCXX_R2.fastq.gz 5.54
+155185967 ILLUMINA RNA-Seq PE ACCTGT 5436965 3.35 155185967_ACCTGT_ACA4RRCXX_R1.fastq.gz 5.74
+155185967 ILLUMINA RNA-Seq PE ACCTGT 5436965 3.35 155185967_ACCTGT_ACA4RRCXX_R2.fastq.gz 5.74
+155192028 ILLUMINA RNA-Seq PE CTGGAA 5609840 4.2 155192028_CTGGAA_ACA4RRCXX_R1.fastq.gz 5.91
+155192028 ILLUMINA RNA-Seq PE CTGGAA 5609840 4.2 155192028_CTGGAA_ACA4RRCXX_R2.fastq.gz 5.91
+155285966 ILLUMINA RNA-Seq PE TCGGCA 5760880 3.45 155285966_TCGGCA_ACA4RRCXX_R1.fastq.gz 6.06
+155285966 ILLUMINA RNA-Seq PE TCGGCA 5760880 3.45 155285966_TCGGCA_ACA4RRCXX_R2.fastq.gz 6.06
+155350639 ILLUMINA RNA-Seq PE CCTTTG 5719170 3.55 155350639_CCTTTG_ACA4RRCXX_R1.fastq.gz 6.02
+155350639 ILLUMINA RNA-Seq PE CCTTTG 5719170 3.55 155350639_CCTTTG_ACA4RRCXX_R2.fastq.gz 6.02
+155426989 ILLUMINA RNA-Seq PE CCAGGT 5442382 2.28 155426989_CCAGGT_ACA4RRCXX_R1.fastq.gz 5.74
+155426989 ILLUMINA RNA-Seq PE CCAGGT 5442382 2.28 155426989_CCAGGT_ACA4RRCXX_R2.fastq.gz 5.74
+155436477 ILLUMINA RNA-Seq PE AACCGC 5969635 4.51 155436477_AACCGC_ACA4RRCXX_R1.fastq.gz 6.27
+155436477 ILLUMINA RNA-Seq PE AACCGC 5969635 4.51 155436477_AACCGC_ACA4RRCXX_R2.fastq.gz 6.27
+155526790 ILLUMINA RNA-Seq PE TTTCTA 5737554 4.86 155526790_TTTCTA_ACA4RRCXX_R1.fastq.gz 6.04
+155526790 ILLUMINA RNA-Seq PE TTTCTA 5737554 4.86 155526790_TTTCTA_ACA4RRCXX_R2.fastq.gz 6.04
+155537812 ILLUMINA RNA-Seq PE CCCTAA 5827993 3.36 155537812_CCCTAA_ACA4RRCXX_R1.fastq.gz 6.13
+155537812 ILLUMINA RNA-Seq PE CCCTAA 5827993 3.36 155537812_CCCTAA_ACA4RRCXX_R2.fastq.gz 6.13
\ No newline at end of file
diff --git a/files/sequencing_results_metadata.xls b/files/sequencing_results_metadata.xls
new file mode 100644
index 00000000..13e6ac84
Binary files /dev/null and b/files/sequencing_results_metadata.xls differ
diff --git a/index.md b/index.md
new file mode 100644
index 00000000..1d682de1
--- /dev/null
+++ b/index.md
@@ -0,0 +1,51 @@
+---
+site: sandpaper::sandpaper_site
+---
+
+Good data organization is the foundation of any research project. It not only sets you up well for an analysis, but it also makes it easier to come back to the project later and share with collaborators, including your most important collaborator - future you.
+
+Organizing a project that includes sequencing involves many components. There's the experimental setup and conditions metadata, measurements of experimental parameters, sequencing preparation and sample information, the sequences themselves and the files and workflow of any bioinformatics analysis. So much of the information of a sequencing project is digital, and we need to keep track of our digital records in the same way we have a lab notebook and sample freezer. In this lesson, we'll go through the project organization and documentation that will make an efficient bioinformatics workflow possible. Not only will this make you a more effective bioinformatics researcher, it also prepares your data and project for publication, as grant agencies and publishers increasingly require this information.
+
+In this lesson, we'll be using data from a study of experimental evolution using *E. coli*. [More information about this dataset is available here](https://www.datacarpentry.org/organization-genomics/data). In this study there are several types of files:
+
+- spreadsheet data from the experiment that tracks the strains and their phenotype over time
+- spreadsheet data with information on the samples that were sequenced - the names of the samples, how they were prepared and the sequencing conditions
+- the sequence data
+
+Throughout the analysis, we'll also generate files from the steps in the bioinformatics pipeline and documentation on the tools and parameters that we used.
+
+In this lesson you will learn:
+
+- How to structure your metadata, tabular data and information about the experiment. The metadata is the information about the experiment and the samples you're sequencing.
+- How to prepare for, understand, organize and store the sequencing data that comes back from the sequencing center
+- How to access and download publicly available data that may need to be used in your bioinformatics analysis
+- The concepts of organizing the files and documenting the workflow of your bioinformatics analysis
+
+:::::::::::::::::::::::::::::::::::::::::: prereq
+
+## Getting Started
+
+This lesson assumes no prior experience with the tools covered in the workshop.
+However, learners are expected to have some familiarity with biological concepts,
+including the
+concept of genomic variation within a population. Participants should bring their laptops and plan to participate actively.
+
+This lesson is part of a workshop that uses data hosted on an Amazon Machine Image (AMI). Workshop participants will be given
+information on how
+to log in to the AMI during the workshop. Learners using these materials for self-directed study will need to set up their own
+AMI. Information on setting up an AMI and accessing the required data is provided on the [Genomics Workshop setup page](https://www.datacarpentry.org/genomics-workshop/index.html).
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::::: prereq
+
+## For Instructors
+
+If you are teaching this lesson in a workshop, please see the
+[Instructor notes](instructors/instructor-notes.md).
+
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+
diff --git a/instructor-notes.md b/instructor-notes.md
new file mode 100644
index 00000000..1ec471b0
--- /dev/null
+++ b/instructor-notes.md
@@ -0,0 +1,110 @@
+---
+title: Instructor Notes
+---
+
+## Instructor notes
+
+## Lesson motivation and learning objectives
+
+The purpose of this lesson is *not* to teach how to do data analysis in spreadsheets,
+but to teach good data organization and how to do some data cleaning and
+quality control in a spreadsheet program.
+
+## Lesson design
+
+#### [Data tidiness](../episodes/01-tidiness.md)
+
+- Introduce that we're teaching data organization, and that we're using
+ spreadsheets, because most people do data entry in spreadsheets or
+ have data in spreadsheets.
+- Emphasize that we are teaching good practice in data organization and that
+ this is the foundation of their research practice. Without organized and clean
+ data, it will be difficult for them to apply the things we're teaching in the
+ rest of the workshop to their data.
+- Much of their lives as a researcher will be spent on this 'data wrangling' stage, but
+ some of it can be prevented with good strategies for data collection up front.
+- Tell learners that we're not teaching data analysis or plotting in spreadsheets, because it's
+ very manual and also not reproducible. That's why we're teaching bash shell scripting!
+- Now let's talk about spreadsheets, and when we say spreadsheets, we mean any program that
+ does spreadsheets like Excel, LibreOffice, OpenOffice. Most learners are probably using Excel.
+- Ask the audience about things they've accidentally done in spreadsheets. Talk about an example of your own, such as accidentally sorting only a single column and not the rest
+ of the data in the spreadsheet. What are the pain points?
+- As people answer, highlight some of these issues with spreadsheets.
+- Go through the point about keeping track of your steps and keeping raw data raw.
+- Go through the cardinal rule of spreadsheets about columns, rows and cells.
+- Hand them a messy data file and have them pair up and work together to clean up the data.
+
+#### [Planning for NGS projects](../episodes/02-project-planning.md)
+
+- This episode depends on learners discussing exercises with one another. Be sure to give plenty of time for this discussion.
+
+#### [Examining Data on the NCBI SRA Database](../episodes/03-ncbi-sra.md)
+
+- Learners should *not* actually download the ENA files in the "Downloading a few sequencing files: EMBL-EBI" section.
+
+#### Concluding points
+
+- Now your data is organized so that a computer can read and understand it. This
+ lets you use the full power of the computer for your analyses as we'll see in the
+ rest of the workshop.
+
+## Working with participants' level of expertise
+
+Learners may be taking this lesson for many reasons: they may be considering a sequencing experiment, trying to analyse public data, working with data they have already generated, or speculatively acquiring new skills.
+
+You should feel free to "read the room", and it can be helpful to ask more specifics in a pre-workshop survey.
+
+#### [Data tidiness](../episodes/01-tidiness.md)
+
+Discussion 1, "What kinds of data and information have you generated before you sent your DNA/RNA off for sequencing?" can go very differently depending on the participants' background. Many instructors make adjustments to this section, and they should, depending on the learners.
+
+Some instructors have succeeded in adding ice-breaker questions and more on scientific background to discussion 1, such as:
+
+- What's your name?
+- What kind of sequencing data are you collecting?
+- What question is your experiment answering?
+- What kinds of data and information have you generated before you sent your DNA/RNA off for sequencing?
+
+This had some positive points:
+
+- instructors got to see the range of projects being worked on (metagenomics, RNA-seq, DNA-seq, etc).
+- we had a good discussion about linked metadata, e.g. a plant scientist also takes photos of their plants, an ecologist has site sampling data.
+- learners got to share lessons they'd learned.
+- for some learners, it may have been the first time they'd thought about it.
+- it only added 5 minutes.
+
+The drawback:
+
+- only about half of the learners got as far as talking about the file types of that data.
+
+It could be more efficient to ask these questions in the pre-workshop survey, then present the range of answers during the class. It can also be helpful for instructors and helpers to share what they work on.
+
+## Technical tips and tricks
+
+Provide information on setting up your environment for learners to view your
+live coding (increasing text size, changing text color, etc), as well as
+general recommendations for working with coding tools to best suit the
+learning environment.
+
+## Common problems
+
+#### Excel looks and acts different on different operating systems
+
+The main challenge with this lesson is that Excel looks very different, and even the way you
+do things differs, between Mac and PC and between different versions of
+Excel. So, the presenter's environment will match that of only some of the learners.
+
+We need better notes and screenshots of how things work on both Mac and PC. But we
+likely won't be able to cover all the different versions of Excel.
+
+If you have a helper who has more experience with the other OS than you, it would be good
+to prepare them to help with this lesson and tell people how to do things in the other OS.
+
+#### People are not interactive or responsive on the exercises
+
+This lesson depends on people working on the exercise and reporting the things
+they fixed. If your audience is reluctant to participate, start out with
+some things on your own, or ask a helper for their answers. This generally gets
+even a reluctant audience started.
+
+
diff --git a/learner-profiles.md b/learner-profiles.md
new file mode 100644
index 00000000..434e335a
--- /dev/null
+++ b/learner-profiles.md
@@ -0,0 +1,5 @@
+---
+title: FIXME
+---
+
+This is a placeholder file. Please add content here.
diff --git a/md5sum.txt b/md5sum.txt
new file mode 100644
index 00000000..30a35d6a
--- /dev/null
+++ b/md5sum.txt
@@ -0,0 +1,15 @@
+"file" "checksum" "built" "date"
+"CODE_OF_CONDUCT.md" "c93c83c630db2fe2462240bf72552548" "site/built/CODE_OF_CONDUCT.md" "2023-05-02"
+"LICENSE.md" "b24ebbb41b14ca25cf6b8216dda83e5f" "site/built/LICENSE.md" "2023-05-02"
+"config.yaml" "301fc4a15182dcbcdaf4b39112259eac" "site/built/config.yaml" "2023-05-02"
+"index.md" "33e8d0e0bc8849f755a4bc0f598b6a7a" "site/built/index.md" "2023-11-20"
+"episodes/01-tidiness.md" "4df50eab6edeb9e266d3f427fc409c8b" "site/built/01-tidiness.md" "2023-06-14"
+"episodes/02-project-planning.md" "cd5d2ae23412dc61828c0e47387b77b1" "site/built/02-project-planning.md" "2024-04-04"
+"episodes/03-ncbi-sra.md" "21c8c43c3e94862179eb75f1c32379ca" "site/built/03-ncbi-sra.md" "2023-11-20"
+"instructors/data.md" "3bb097fa45131e73e0860e8d4012d509" "site/built/data.md" "2023-05-02"
+"instructors/instructor-notes.md" "f3f7514a8b65d990adba4a5f874c1997" "site/built/instructor-notes.md" "2023-05-02"
+"instructors/old-ncbi.md" "832bcf8a0876a6e571b1e79bfa84437e" "site/built/old-ncbi.md" "2023-05-02"
+"learners/discuss.md" "ee9645b54825a8fcfcf76a59d73c528f" "site/built/discuss.md" "2023-05-02"
+"learners/reference.md" "e587d6ba644f190bb74f4543c99c67df" "site/built/reference.md" "2023-05-02"
+"learners/setup.md" "a4c40b1997442a5f4124349f603099a6" "site/built/setup.md" "2023-05-02"
+"profiles/learner-profiles.md" "60b93493cf1da06dfd63255d73854461" "site/built/learner-profiles.md" "2023-05-02"
diff --git a/old-ncbi.md b/old-ncbi.md
new file mode 100644
index 00000000..06e728f2
--- /dev/null
+++ b/old-ncbi.md
@@ -0,0 +1,27 @@
+---
+title: Old NCBI
+---
+
+## Original (older) NCBI instructions
+
+These will be phased out of our lesson when NCBI stops supporting
+the old page versions.
+
+1. Access the Tenaillon dataset from the provided link: [https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP064605](https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP064605). Click on "Revert to the old Run Selector" at the top of the page.
+
+2. You will be presented with the old page for the overall SRA accession SRP064605 - this is a collection of all the experimental data.
+ *(Figure: ncbi-old-run-selector)*
+
+3. In this window, you will click on the Run Number of the first entry in the "Runs Found" table (see red box above). This will take you to a page that is a run browser. Take a few minutes to examine some of the descriptions on the page.
+ *(Figure: ncbi-run-browser.png)*
+
+4. Use your browser's "Back" button or arrow to go back to the ['previous page'](https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP064605). Above where it lists the "312 Runs found" is a line starting with **Total** and you will see there are 312 runs, 109.43 Gb data, and 168.81 Gbases of data. Click the 'RunInfo Table' button and save the file to your Desktop.
+ *(Figure: ncbi-old-runtable-button.png)*
+ We are not downloading any actual sequence data here! This is only a text file that fully describes the entire
+ dataset.
+
+You should now have a **tab-delimited** file called `SraRunTable.txt`.
+
+**Return to lesson [Examining Data on the NCBI SRA Database](../episodes/03-ncbi-sra.md#you-should-now-have-a-file-called-sraruntabletxt) and continue.**
+
+
diff --git a/reference.md b/reference.md
new file mode 100644
index 00000000..0ba51183
--- /dev/null
+++ b/reference.md
@@ -0,0 +1,74 @@
+---
+title: 'Glossary'
+---
+
+## Glossary
+
+{:auto\_ids}
+accession
+: a unique identifier assigned to each sequence or set of sequences
+
+BLAST
+: The Basic Local Alignment Search Tool at NCBI that searches for similarities between known and unknown biomolecules like DNA
+
+categorical variable
+: Variables can be classified as categorical (aka, qualitative) or quantitative (aka, numerical). Categorical variables take on a fixed number of values that are names or labels.
+
+cleaned data
+: data that has been manipulated post-collection to remove errors or inaccuracies, introduce desired formatting changes, or otherwise prepare the data for analysis
+
+conditional formatting
+: formatting that is applied to a specific cell or range of cells depending on a set of criteria
+
+CSV (comma separated values) format
+: a plain text file format in which values are separated by commas
+
+factor
+: a variable that takes on a limited number of possible values (i.e. categorical data)
+
+Gb
+: gigabyte of file storage or file size
+
+Gbase
+: a gigabase represents one billion nucleic acid bases (Gbp may indicate one billion base pairs of nucleic acid)
+
+headers
+: names at tops of columns that are descriptive about the column contents (sometimes optional)
+
+metadata
+: data which describes other data
+
+NGS
+: common acronym for "Next Generation Sequencing" currently being replaced by "High Throughput Sequencing"
+
+null value
+: a value used to record observations missing from a dataset
+
+observation
+: a single measurement or record of the object being recorded (e.g. the weight of a particular mouse)
+
+plain text
+: unformatted text
+
+quality assurance
+: any process which checks data for validity during entry
+
+quality control
+: any process which removes problematic data from a dataset
+
+raw data
+: data that has not been manipulated and represents actual recorded values
+
+rich text
+: formatted text (e.g. text that appears bolded, colored or italicized)
+
+string
+: a collection of characters (e.g. "thisisastring")
+
+TSV (tab separated values) format
+: a plain text file format in which values are separated by tabs
+
+variable
+: a category of data being collected on the object being recorded (e.g. a mouse's weight)
+
+
diff --git a/setup.md b/setup.md
new file mode 100644
index 00000000..244dced1
--- /dev/null
+++ b/setup.md
@@ -0,0 +1,10 @@
+---
+title: Setup
+---
+
+This workshop is designed to be run on pre-imaged Amazon Web Services
+(AWS) instances. For information about how to
+use the workshop materials, see the
+[setup instructions](https://www.datacarpentry.org/genomics-workshop/setup.html) on the main workshop page.
+
+