diff --git a/00_intro_organization.md b/00_intro_organization.md
deleted file mode 100644
index a8e2f692..00000000
--- a/00_intro_organization.md
+++ /dev/null
@@ -1,24 +0,0 @@
#Getting your project started

Project organization is one of the most important parts of a sequencing project, but it is often overlooked in the excitement to get a first look at new data. While it's best to get yourself organized before you begin analysis, it's never too late to start.

You should approach your sequencing project in a very similar way to how you do a biological experiment, and ideally, that begins with experimental design. We're going to assume that you've already designed a beautiful sequencing experiment to address your biological question, collected appropriate samples, and that you have enough statistical power. For all of those steps (collecting specimens, extracting DNA, prepping your samples), you've likely kept a lab notebook that details how and why you did each step, but documentation doesn't stop at the sequencer!

Every computational analysis you do is going to spawn many files, and inevitably, you'll want to run some of those analyses again. Genomics projects can quickly accumulate hundreds of files across tens of folders. Do you remember what PCR conditions you used to create your sequencing library? Probably not. Similarly, you probably won't remember whether your best alignment results were in Analysis1, AnalysisRedone, or AnalysisRedone2, or which quality cutoff you used.

Luckily, recording your computational experiments is even easier than recording lab data. Copy/paste will become your best friend, sensible file names will make your analysis traversable by you and your collaborators, and writing the methods section for your next paper will be a breeze. Let's look at the best practices for documenting your genomics project.

Your future self will thank you.

###References
[A Quick Guide to Organizing Computational Biology Projects](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424)

diff --git a/GoodBetterBest.md b/GoodBetterBest.md
deleted file mode 100644
index ddf20739..00000000
--- a/GoodBetterBest.md
+++ /dev/null
@@ -1,66 +0,0 @@
#Best practices
If you're starting a brand-new project, it's easy to begin in an organized way. However, many of us inherit data that's organized in idiosyncratic ways that may work for the creator, but might not fit your organizational style, or isn't arranged in a way that public databases like NCBI will accept.

We've compiled a list of common data issues, and some suggestions for Good, Better, and Best practices. If you're beginning with all new data, we suggest you aim for Best in each instance, but we also offer suggestions for improving the organization of legacy data to get it into public database formats. The idea is not only to make your data robust for your own re-use, but to make your data discoverable and re-usable by your collaborators and the worldwide community. In addition to getting your papers cited, use these practices to enhance the chances of *getting your data cited!*

##Naming standards

###Column Headers

Your datasheets and metadata sheets should have column names that are meaningful both to you and to other people who may use your data. If you intend to import this data into R, Python, or another program (or you don't want to make life difficult for other researchers who might), avoid spaces and non-standard characters.
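For example, formatted, space-laden headers like the (hypothetical) first row below import awkwardly, while their plain equivalents do not:

```
Hard to import:  Temp. Reading 1 (°C)   Pop. (blue = north)
Easy to import:  temp_celsius_1         population
```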
#####Good

If some metadata takes the form of highlighting or other formatting, make new columns to house that information in a way that is compatible with the .csv file type. For instance, you may have highlighted the names of individuals from one population in blue, and the other in red. This is helpful for you to differentiate between populations, but that information can't be imported into R (or any other text-based program). Instead, make a column called "Population" and note where each individual came from.

#####Better
Name your columns something as short as possible while still being unique and descriptive. Make column comments with an extended description of each column header and any abbreviations in the data sheet.

#####Best
Use terms from *data standards*. You can use the Biocode Field Information Management System to generate datasheets with globally accepted field names. Click on Tools to get started.

###Unique Sample Names

#####Good

Each of your samples should have a unique identifier, and a set of standardized metadata. Avoid names that look like dates (Dec14), times (AM1245), and other things that Excel might auto-convert.

#####Better

Add a column to your dataset starting with MyLab1, MyLab2...MyLabN, so each sample gets a unique ID.

#####Best

Use genomics data standards. The Global Genome Biodiversity Network (GGBN) was formed in 2011 with the principal aim of making high-quality, well-documented, and vouchered collections that store DNA or tissue samples of biodiversity discoverable for research through a networked community of biodiversity repositories. This is achieved through the GGBN Data Portal, which links globally distributed databases and bridges the gap between biodiversity repositories, sequence databases, and research results. See: Global Genome Biodiversity Network (GGBN): GGBN Data Standard Terms.

###Dates, times and temperatures

In the field, it's often easiest to write collection dates in a familiar format, but programs like R and Excel may do very weird things with dates.

#####Good

Separate dates into three (or more) unambiguous columns, minimally Day, Month, Year. If your data is finer grained, also use Hours, Minutes, Seconds, etc.

#####Better

Whatever the number of your date levels, ensure that every individual has every cell filled in. Be especially careful of numbers like times, which Excel may reformat from "00" to " ", and which may be interpreted as missing data in downstream analysis.

#####Best

Follow the good and better practices, and also: In your README, note important meta-meta-data that seems obvious now, but might be difficult to remember later. For example, whether you used 12- or 24-hour time, what temperature scale you used, and what your zero point means. If the data was collected across different time zones, be sure that the time zone has its own column, and note in your meta-meta-data file whether you've standardized the time (i.e., were your 5pm and 6pm measurements, taken in adjacent time zones, done at the same physical time, or an hour apart?). Add an extended description of each column header and any abbreviations in the data sheet.
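A hypothetical datasheet following the practices above might look like this (the sample IDs and values are made up for illustration):

```
sample_id  year  month  day  hour  minute  time_zone
MyLab1     2015  07     30   14    05      UTC-05
MyLab2     2015  07     30   15    05      UTC-06
```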
###Remove spaces and punctuation
Make all your downstream analysis easier by using computer-friendly naming.

#####Good
Never use spaces. When working on the command line, spaces in file names make everything exponentially more difficult. Replace all your spaces with under_scores or use CamelCase to separate words in complex file names. Similarly, don't use special characters (.,;"*) in file names.

#####Better
Don't stop at file names; get rid of spaces and special characters inside of files too! If a column contains temperature data, remove any degree symbols or letters and just keep the numbers...just make sure your metadata includes the temperature scale.

#####Best
Use atomic units for all of your data. If a value needs a space or punctuation, it probably isn't an atomic unit, and can be separated into two data columns.

diff --git a/InstructionsForEditors.md b/InstructionsForEditors.md
deleted file mode 100644
index b1aa7811..00000000
--- a/InstructionsForEditors.md
+++ /dev/null
@@ -1,19 +0,0 @@
#What's been done

I've created a number of empty "mini-modules" with names that reflect our list of learning objectives, as well as a number of "issues", which are smaller chunks of each section that need to be tackled.

##Proposed plan of action

###Issues
 1. Look through the issues list and claim the ones that you want to work on
 2. Be sure to claim an issue before you start to work on it, so we don't have to do too much untangling of simultaneous changes
 3. Make your own issues! If you get stuck on your lesson, request help with an issue. No request is too small
 4. Check the issues frequently! Even if you don't have a ton of time to help, some issues are as simple as "add a link" or "check my grammar"

###MiniModules
 1. If you're starting from a blank page, feel free to "Commit directly to the master branch" so we get the basics of the lessons
 2. Once there's a basis to a lesson, or **if you've chosen an "edit"-type issue, use "Create a new branch for this commit and start a pull request" so others can comment on your changes before we accept them**; otherwise it can do weird overwrites of other people's edits. If you're sure about your change, you can immediately accept your own pull request after you branch.
 3. If you want to be completely in charge of a mini-module, make it an issue, and then claim it
 4. You can change the name of the mini-modules by clicking **edit** and then changing the path text. We're going to arrange them from 00 to N once they're completed
 5. If you don't see the mini-module you want, create it
 6. If your module has data to be given to learners, it can be inline in the instructor document, but also put a copy in the data folder

diff --git a/datafiles/fakefile.txt b/datafiles/fakefile.txt
deleted file mode 100644
index 8b137891..00000000
--- a/datafiles/fakefile.txt
+++ /dev/null
@@ -1 +0,0 @@

diff --git a/lessons/01_intro_organization.md b/lessons/01_intro_organization.md
new file mode 100644
index 00000000..a8fe981e
--- /dev/null
+++ b/lessons/01_intro_organization.md
@@ -0,0 +1,117 @@
#Getting your project started

Project organization is one of the most important parts of a sequencing project, but it is often overlooked in the excitement to get a first look at new data. While it's best to get yourself organized before you begin analysis, it's never too late to start.

You should approach your sequencing project in a very similar way to how you do a biological experiment, and ideally, that begins with experimental design. We're going to assume that you've already designed a beautiful sequencing experiment to address your biological question, collected appropriate samples, and that you have enough statistical power. For all of those steps (collecting specimens, extracting DNA, prepping your samples), you've likely kept a lab notebook that details how and why you did each step, but documentation doesn't stop at the sequencer!

Every computational analysis you do is going to spawn many files, and inevitably, you'll want to run some of those analyses again. Genomics projects can quickly accumulate hundreds of files across tens of folders. Do you remember what PCR conditions you used to create your sequencing library? Probably not. Similarly, you probably won't remember whether your best alignment results were in Analysis1, AnalysisRedone, or AnalysisRedone2, or which quality cutoff you used.

Luckily, recording your computational experiments is even easier than recording lab data. Copy/paste will become your best friend, sensible file names will make your analysis traversable by you and your collaborators, and writing the methods section for your next paper will be a breeze. Let's look at the best practices for documenting your genomics project.

Your future self will thank you.

##Exercise

In this exercise we will set up a filesystem for the project we will be using over the next few days. We will also introduce you to some helpful shell commands/programs/tools:

* ``mkdir``
* ``history``
* ``tail``
* ``|``
* ``nano``
* ``>>``

#### A. Create a file system for a project

Inspired by the guide below, we will start by creating a directory that we can use for the rest of the workshop.

First, make sure that you are in your home directory:
```bash
$ pwd
/home/dcuser
# Hopefully you got the above output '/home/dcuser'
```

**Tip:** Remember, when we give a command, rather than copying and pasting, just type it out. Also, the '$' indicates we are at the command prompt; do not include it in your command.

**Tip:** If you were not in your home directory, the easiest way to get there is to enter the command ``cd``, which always returns you to home.

Next, try making the following directories using the ``mkdir`` command (one possible approach is sketched after this list):

* dc_workshop
* dc_workshop/docs
* dc_workshop/data
* dc_workshop/results
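If you want to check your approach, here is one way (of several) to do it; note that ``mkdir`` accepts several directory names at once:

```bash
$ mkdir dc_workshop
$ mkdir dc_workshop/docs dc_workshop/data dc_workshop/results
```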
Verify that you have created the directories:

```bash
$ ls -R dc_workshop
```

If you have created these directories, you should get the following output from that command:

```bash
dc_workshop/:
data docs results

dc_workshop/data:

dc_workshop/docs:

dc_workshop/results:
```

#### B. Document your activity on the project

The *history* command is a convenient way to document all the commands you have used while analyzing and manipulating your project. Let's document the work we have done to create these folders.

1. View the commands that you have used so far during this session using ``history``:

   ```bash
   $ history
   ```
   The history likely contains many more commands than you have used just for this project. Let's view the last several commands so that we focus on just what we need for the project.
2. View the last n lines of your history (where n = approximately the last few lines you think relevant; for our example we will use the last 7):

   ```bash
   $ history | tail -n7
   ```
   As you may remember from the shell lesson, the pipe ``|`` sends the output of history to the next program, in this case, tail. We have used the -n option to give the last 7 lines.
3. Using your knowledge of the shell, use the append redirect ``>>`` to create a file called **dc_workshop_log_XXXX_XX_XX.txt** (use the four-digit year, two-digit month, and two-digit day, e.g. dc_workshop_log_2015_07_30.txt). A sketch of one way to do this appears after the questions below.
4. You may have noticed that your history may contain the ``history`` command itself. To remove this redundancy from our log, let's use the ``nano`` text editor to fix the file:

   ```bash
   $ nano dc_workshop_log_2015_07_30.txt
   ```
   From the nano screen, you should be able to use your cursor to navigate, type, and delete any redundant lines.
5. Add a date line and a comment above the line where you created the directories, e.g.:

   ```
   # 2015_07_30
   # Created sample directories for the Data Carpentry workshop
   ```
6. Next, remove any lines of the history that are not relevant. Just navigate to those lines and use your delete key.
7. Close nano by hitting 'Control' and the 'X' key at the same time; notice in nano this is abbreviated '\^X'. nano will ask if you want to save; hit 'Y' for yes. When prompted for the 'File Name to Write', we can hit 'Enter' to keep the same name and save.
8. Now that you have created the file, move the file to 'dc_workshop/docs' using the ``mv`` command.
**Questions**:

1. What is the default number of lines that tail displays?
2. What is the difference between '>' and '>>'?
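Once you have tried step 3 yourself, here is a minimal sketch of one way to do steps 3 and 8, assuming the example date above (your date, file name, and number of relevant lines will differ):

```bash
# Step 3: append the last 7 commands from the history to a dated log file
$ history | tail -n7 >> dc_workshop_log_2015_07_30.txt

# Step 8: after tidying the log in nano, file it with the project docs
$ mv dc_workshop_log_2015_07_30.txt dc_workshop/docs/
```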
###References
[A Quick Guide to Organizing Computational Biology Projects](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424)

diff --git a/mini_module_DataStandards.md b/mini_module_DataStandards.md
deleted file mode 100644
index c4155117..00000000
--- a/mini_module_DataStandards.md
+++ /dev/null
@@ -1 +0,0 @@
What are the standards for organizing and naming your data?

diff --git a/mini_module_Documentation.md b/mini_module_Documentation.md
deleted file mode 100644
index dacf408b..00000000
--- a/mini_module_Documentation.md
+++ /dev/null
@@ -1,3 +0,0 @@
- pass 'leaving academia forever' test
- knowing what kind of data you'll be dealing with
- knowing what kinds of things you need to keep track of for your experiments, and what people in your field care about

diff --git a/mini_module_RawFiles.md b/mini_module_RawFiles.md
deleted file mode 100644
index 5ae94019..00000000
--- a/mini_module_RawFiles.md
+++ /dev/null
@@ -1,30 +0,0 @@
**What do we mean by raw data?**
Depending on what sequencing facility you use and what services you buy, your sequencing data may arrive in one of several formats, but you will likely get a compressed version of your data (a .tar or .gz file). This file (or files) is your raw data. Your data is compressed because modern sequencers give huge amounts of data; a single lane of 100bp paired-end Illumina reads may give you 100 gigabytes of sequence.

**Raw data needs to stay raw**
The raw data you get from your sequencing facility is virtually irreplaceable. Since sequencing facilities process many samples from many labs, most can only afford to store these files for a limited time, often only a few weeks. Once your facility has purged their copy of the file, the only way to get that data is to re-run samples, which often means re-doing the entire experiment. This means that you need to protect your raw data not only from accidental loss, but also from irrevocable changes.

**How to work with your data (without accidentally destroying it)**
First, you should always get a copy of your data from the sequencing facility as soon as possible, and make that copy READ-ONLY. That means that you can open your data, look at it, and even make copies of it, but the computer won't let you save any changes to it. To make your data read-only, navigate to the folder that holds the file and then type:
```bash
$ chmod 444 MyRawData
```
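You can check that the change took effect with ``ls -l``; in this sketch, the owner, size, and date are placeholders for whatever your system shows:

```bash
$ ls -l MyRawData
-r--r--r-- 1 dcuser dcuser 104857600 Jul 30 10:00 MyRawData
# The leading '-r--r--r--' shows read-only access (mode 444) for
# user, group, and others
```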
The chmod command calls a program, *ch*ange *mod*e, that alters the permissions for a file or directory. It accepts three or four digits, where the first digit is optional, and each of the other three corresponds to a different class of computer operator: user, group, and others. The numerical value of each digit (found in the table below) specifies how each of those operators can interact with the file in question. So ``chmod 444 MyRawData`` gives everyone (user, group, and others) the ability to look at the data file, but not to change it.

\# | Permission
---|-------------
7 | read, write and execute
6 | read and write
5 | read and execute
4 | read only
3 | write and execute
2 | write only
1 | execute only
0 | none

When you want to work with the data file, simply copy it to a new location, and work only with the copy. That way any accidental deletions, or other changes, won't affect your raw data. One simple way to do this is to use rsync instead of cp:
```bash
$ rsync MyRawData MyWorkingDirectory/datacopy
$ chmod 744 MyWorkingDirectory/datacopy
```
This will make a copy of MyRawData called datacopy in MyWorkingDirectory, and then update the permissions of datacopy to allow the user (you) to read, write, and execute (do science on!) that copy, all without editing the content or permissions of MyRawData.

**How to best store your raw data**
Remember, the raw data files from your sequencing facility are virtually irreplaceable, so you should always have more than one copy. Discussion of all of the various storage options is beyond the scope of this workshop, but your data storage solution should include at least two (preferably three or more) copies of your data at widely varying locations, and should pass the fire test: if one of your data copies is on fire, the rest shouldn't be. Backing up your desktop computer to a hard drive on your desk will save you if a hard drive fails, but not if your office is on fire.

diff --git a/mini_module_backups.md b/mini_module_backups.md
deleted file mode 100644
index ce3bd095..00000000
--- a/mini_module_backups.md
+++ /dev/null
@@ -1,34 +0,0 @@
Any files that are difficult or impossible to replace should be backed up on a regular basis. Depending on what your project is, this may include raw data files, processed data files, programs and scripts, and metadata. Technology is always changing, so it is impossible for this tutorial to provide a complete guide to backups. However, the following considerations should help you narrow down your backup choices and choose an appropriate one.

But remember: what's most important is *that* you back up, not *how* you back up.

##Backup Basics: The 3-2-1 rule
>You should always have:
>* 3 copies of any important data
>* 2 different types of backup (cloud and tape, for instance)
>* 1 off-site backup (physically as far away from your other backups as possible)

This may sound a bit excessive, but hard drives fail, buildings catch fire, and accidents happen, and there are good reasons for these rules.

### 3: Copies of any important data
If your data is important enough to back up, it's important enough to keep in triplicate. Just like your main copy, backups can fail. The more copies you have, the less likely they'll all fail at once.

### 2: Different types of backup media
There are two good reasons for having multiple media types. First, different media wear out at different rates: about 6% of hard drives fail per year, and they are usually said to have an average lifespan of 3 to 5 years. Consumer-grade CDs and DVDs (from a good brand) can last about 5 years, flash drives can last 10 years, and tapes perhaps 50 years. However, it's important to realize that these are mostly best guesses, and assume that your media is being treated perfectly. Also, consider that if you buy two hard drives today, back up all your data to both, and store them in two different places, even with perfect care they're likely to fail at around the same time.
Second, it hedges your technological bets: theoretically, it will take 50 years for tapes to lose their magnetic charge, but that time can be drastically reduced by temperature, dust...or breaking the reader. Storing all your data in one format increases the chance that you'll have a great archive that you can't access.

### 1: Off-site backup
For an experiment, you may do technical replicates and biological replicates. The first is just a second copy: it can tell you about your pipettes, but doesn't tell you anything about your study system.
You can think of computer backups in a similar way: if you make a second copy of all your data onto another part of your laptop, you technically have a backup...but it won't be any help if you drop your computer in the stairwell. For a backup to actually be helpful, it should be as physically distant as possible from your main copy. Another building is good, another city is better. Extra points for another continent.

So where should you keep all these data copies? The first thing to consider is: who are you storing the data *for*, and *why*?

The level of accessibility you need can help determine where to store it. Are you storing data for your future self? For you and your colleagues to use? For the scientific community?
* If it's just for you or a few collaborators, a few hard drives or your university HPC might be the easiest option.
* If many people will be accessing it, cloud options, like AWS, DRYAD, osf.io, or Figshare, might be best.
* If your data is sensitive (like human genomic data), it may be illegal to store in many of the popular online data repositories. Check with your campus HPC or campus librarian; many universities already have data repositories and management plans in place for sensitive research data.
* Scripts and other non-data files can be stored in a public or private GitHub repository, Bitbucket, Google Code, SourceForge, or others.

##For more discussion
* Where should scientists store their data?
* GitHub for Science
* Ten Simple Rules for Reproducible Computational Research

diff --git a/mini_module_genomics_warmup_excercise.md b/mini_module_genomics_warmup_excercise.md
deleted file mode 100644
index c54f08cf..00000000
--- a/mini_module_genomics_warmup_excercise.md
+++ /dev/null
@@ -1,35 +0,0 @@
#### Goals
* Introduce the concept of data standards, and specifically in this case, data standards for genomics data.
* Using this introduction, reinforce the concept of planning for data and metadata capture before starting a project.
* Using this exercise, stimulate a conversation about other relevant standards and how to find them.
* Reinforce the concept of using these standards, and good metadata, when inheriting work begun by others or merging datasets.
* This exercise is also an easy way to start a discussion about the longevity of a given dataset.

#### Objectives
1. Given the following image as a printed handout, students mark (circle, highlight) terms they recognize. ![the labels handout](https://cloud.githubusercontent.com/assets/2990155/6986653/689377b4-da0d-11e4-9272-e7c45a4b465b.png)
   * Depending on the students and their current level of education, some will have more familiarity with these terms than others.
   * Once recognized terms are marked, share / compare with your neighbors, and report back on any new or unfamiliar terms.
2. Have students share what term is used in their lab for any given term on this label handout.
   * Find a few examples from the students where different labs / individuals refer to the same data by a different label.
   * Discuss what would happen if you tried to merge datasets with different column headers (labels) for the same data.
   * How much time would it take? Would mistakes in "mapping" occur? Would data have to be scrapped because no one can figure out what it means for certain?
3. Discuss the difference between two scenarios:
   * starting from scratch with a collaborative team, deciding on what data and metadata to collect, vs.
   * starting up by inheriting someone else's dataset(s), or merging datasets. What are the challenges? What are the pitfalls?
4. What kind of documentation do you need to keep track of genomics projects?
5. Show how genomics (and other) data standards can help with data and metadata organization and documentation.
   * If you have an example, show how a lack of standard use of terms created a problem for a specific research project.
6. Next, print and hand out the image below. It's a [Global Genome Biodiversity Network (GGBN)](http://www.ggbn.org/) [list of the same terms or labels as above](http://terms.tdwg.org/wiki/GGBN_Data_Standard), but now by "concept." These are the concept names for the general terms used in the above exercise. It is these terms that make data sharing across projects, and across time, more durable. They make the data more re-usable, and less prone to mis-use and mis-interpretation.
   ![GGBN terms by concept name](https://cloud.githubusercontent.com/assets/2990155/6986656/6e8052be-da0d-11e4-9718-72fce5bfe155.PNG)
   * Note: I have found the Rosetta Stone to be a good analogy for students when explaining why a common, standard language is crucial for best practices when sharing and publishing data.

#### Time estimate for this exercise
* Please give feedback here if you try this exercise. I would estimate a maximum of 20 minutes. It's intended to start a conversation (an introduction); it's not a comprehensive lesson about data standards.

#### Read More About It
* [Global Genome Biodiversity Network](http://www.ggbn.org/)
* [GGBN Data Standard](http://terms.tdwg.org/wiki/GGBN_Data_Standard) Quoted from their site:
  > The Global Genome Biodiversity Network (GGBN) is a global network of well-managed collections of genomic tissue samples from across the Tree of Life, benefiting society through biodiversity research, development and conservation. This network will foster collaborations among repositories of molecular biodiversity in order to ensure quality standards, improve best practices, secure interoperability, and harmonize exchange of material in accordance with national and international legislation and conventions.
  >
  > The GGBN Data Standard is a set of vocabularies designed to represent tissue, DNA or RNA samples associated to voucher specimens, tissue samples and collections. Contributors: Gabriele Droege, Birgit Gemeinholzer, Holger Zetzsche, Astrid Schories, Jörg Holetschek, Enrique Arbeláez Cortés, Katharine Barker, Sean Brady, Boyke Bunk, Margaret Casey, Jonathan Coddington, John Deck, René Dekker, Sonya Dyhrman, Elisabeth Haring, Patricia Kelbert, Hans-Peter Klenk, Thomas Knebelsberger, Renzo Kortmann, Christopher Lewis, Jacqueline Mackenzie-Dodds, Christopher Meyer, Jon Norenburg, Thomas Orrell, Michael Raupach, Thomas von Rintelen, Ole Seberg, Larissa Smirnova, Carola Söhngen, Sun Ying, Lee A. Weigt, Jamie Whitacre, Kenneth Wurdack, Pelin Yilmaz, Elizabeth Zimmer, Xin Zhou.
diff --git a/mini_module_samplemetadata.md b/mini_module_samplemetadata.md
deleted file mode 100644
index 5b6e3090..00000000
--- a/mini_module_samplemetadata.md
+++ /dev/null
@@ -1,3 +0,0 @@
#Lesson content
#Exercise
For this exercise, we'll use a set of 56 Escherichia coli O104:H4 genomes in the [PATRIC database](http://patricbrc.org/portal/portal/patric/Home) (Pathosystems Resource Integration Center). We'll look at the metadata associated with these genomes. Click [here](http://patricbrc.org/portal/portal/patric/GenomeList?cType=taxon&cId=1038927&displayMode=&dataSource=PATRIC&pk=#aP0=1&aP1=1&aT=0&key=1349703250&cwG=false&tId=1038927&gName=&kW=) to access the 56 E. coli genomes within PATRIC.

diff --git a/project_setup.md b/project_setup.md
deleted file mode 100644
index bbd42669..00000000
--- a/project_setup.md
+++ /dev/null
@@ -1,124 +0,0 @@
Project setup
=============

Group
-----
Amanda Charbonneau, Tracy Teal, Laura Williams, Juan Ugalde, Deb Paul

Learning Objectives:
-------------------

#### What's the motivation for this lesson?
* Equip learners with knowledge and skills to organize and access data so that they and their colleagues can use it.
* Keeping your data accessible to you. Sequencing centers don't keep the data (for example), and you might need to redo an analysis.
* Keep track of your analysis. In two months, you won't remember what you did with your data.
* Sharing with collaborators is less fraught with anxiety about whether you're giving them the right version or the clean version.
* When you go to publish your paper, you have your data ready to put in the data repository.
* Journals and databases are requiring this information.
* Makes you feel happy.
* Highlights the importance of what to consider when setting up a project from scratch, or how to manage when inheriting someone else's data.
* Introduces learners to best practices for using data standards to share data for *reproducibility*.

#### What mindset change should the learner have? (i.e., awareness of certain techniques)

* Time spent organizing data is time that should be valued (https://xkcd.com/1205/)
* Value data organization
* Good data organization saves you time
* Preventative data care
* The computer is a tool that you're doing experiments with - you can think about the data analysis as a fundamental part of the experiment, and should extend the same good practices to it

#### What skills should the learner leave the room with?
* How to set up their computational environment for the types of experiments they're going to do
* Metadata organization in spreadsheets. For example, keep a master list of genomic data
* Awareness of metadata standards (where is that info). What minimal information do you need
* Management of raw data files, including:
  * Good naming schemes with unique names
  * No spaces or non-normal characters
  * Backed up
  * Accessible to lab mates
  * Don't mess with raw data files
* Knowing levels of data backup
* Knowing places where you can put your data
* Management of analysis files
* Documenting your steps / data exploration (this could be incorporated into data wrangling)
  * What you did and why

#### What would be an exercise:

* Paired exercise - each of you do an analysis, then swap and see if you can reproduce it.
* Paired exercise - work with a partner to develop the beginnings of a scheme for organizing and storing your datasets (each attendee probably has a general project plan, and a partner could be a good sounding board for developing an organizational structure for data)
* Organize the data related to the test data set *E. coli* from [PATRIC](http://patricbrc.org/portal/portal/patric/Home). This can be used as a test example, and can be done in collaboration with a lab partner, where you try to reproduce and understand the data generated.

#### What are the pre-reqs?
* Knowing what kind of data you'll be dealing with
* Knowing what kinds of things you need to keep track of for your experiments, and what people in your field care about
* Vocabulary:
  * what are the different levels of backups
  * metadata
  * what do we mean by raw data

#### Other comments

##### Recommendations (good, better, best)
* meaningful file names
* README files
* automated backups
* pass the 'leaving academia forever' test
* provenance

Opportunities for data storage/management through other resources:
- GitHub (https://github.com/blog/1840-improving-github-for-science)
- dat
- osf.io
- dryad
- Bitbucket (offers free private repos for academic users)

#### Assessment Questions

7-10 key things: attitudes/dispositions vs. declarative knowledge vs. skill ("I feel" vs. "I know this about X" vs. "I know how to do X").
Please remember (where possible): be measurable, specific, precise.

#####Attitudes
* I feel that having #reproducible data is worth the time collating the metadata (rate your feelings from 1 to 5 (most true)):
  + If necessary, it will be easy to re-process my sequencing data as long as I know what programs I used, and in what order
  + If two people analyze sequencing data using the same set of tools, they will always get the same answer
  + I need to write down which programs I use to process my sequencing data, and in what order
  + My computer automatically records the steps of my data processing
  + I can predict what would be in my colleague's "e-coli assembly" folder, without looking
  + A README file should be written only after all the work is finished, so it will be well-written and complete

* I think that documenting my digital data processing is as important as documenting my laboratory data processing (what I did and why).
  + How long do you typically spend documenting your computational processing (sequencing data cleaning at several cutoffs)? Your laboratory processing (PCR experiment with many temps/salt concentrations)?
* I think of my digital data processing as hypothesis testing and a series of experiments
  + How many times do you expect to process your sequencing data before you'll believe it's accurate?
  + How many times do you typically amplify a gene sequence before you'll believe it's accurate?

#####Skills
* I can identify shortcomings in other datasets
  + Give students the PATRIC O104:H4 database and ask them to organize the metadata for their own use.
    Later, have them revisit their results and pick out some data, or pick out data from a neighbor, to demo how easy it is to get lost in your own poorly organized filing system (see below for a detailed example).
    - PATRIC O104:H4 dataset: http://patricbrc.org/portal/portal/patric/GenomeList?cType=taxon&cId=1038927&displayMode=&dataSource=PATRIC&pk=#aP0=1&aP1=1&aT=0&key=1349703250&cwG=false&tId=1038927&gName=&kW=
* I can transform badly structured data into well-structured data
  + Have the students restructure their first try at organizing the PATRIC O104:H4 dataset
* I can set up a cohesive and well-documented data collection system and consistently apply it to projects
  + Have students add a hypothetical new isolate to the PATRIC O104:H4 database
  + From an already populated file structure with data, assemblies, etc., find a particular output (for example, can you find the assembly done with assembly program X and hash length N?)
* I can describe what makes a good file name
  + Give a small set of file names (with issues) and ask students to come up with a list of what the issues are.
    Example issues in file names:
    - month-day ambiguous
    - spaces in file names
    - special characters
* I know enough vocabulary to look for help
* I can redo my analysis from scratch, change one parameter for publication, check that results are reproducible, etc.
  + Documenting workflows/processes can be integrated into modules throughout the workshop

#####Declarative knowledge
* I can list three places to get information about data standards for my field
* I can list the pros and cons of various long-term raw data storage methods
* I can explain the difference between using unique identifiers within my lab, and using globally unique identifiers to name samples

diff --git a/warmup exercise b/warmup exercise
deleted file mode 100644
index 36f0bcf7..00000000
--- a/warmup exercise
+++ /dev/null
@@ -1 +0,0 @@
Take a screenshot of genomics terms and ask students to highlight which ones they know they have recorded for their data