Basic EC2, command line, and BLAST
==================================
Follow the instructions in :doc:`amazon/log-in-with-ssh-mac` or :doc:`amazon/log-in-with-ssh-win`.

----

Two points:

- Your machine name is available `here <https://docs.google.com/spreadsheet/ccc?key=0ArcOEBWnXSBidEVtNURjLU9fSHExbzhIdGhIMl9uc0E#gid=0>`__.
- Download a keyfile here: http://athyra.idyll.org/~t/uw-bootcamp.pem

----

You should now be at a '#' prompt.

Create a directory for yourself
-------------------------------
Type::

   cd /mnt

and then type::

   mkdir <NetID>

but replace ``<NetID>`` with your MSU NetID (or some distinguishing lowercase
name).

Then type::

   cd <NetID>

and ::

   pwd

It should say '/mnt/<NetID>'. Here, you've created your own folder and
made it your current "working directory", which means it's where UNIX
will look for and save files by default.
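If you ever need the same steps inside a script, here is a minimal Python 3 sketch of ``mkdir``, ``cd`` and ``pwd``. The function name ``make_workdir`` and the folder name ``mynetid`` are made up for illustration, and a temporary directory stands in for '/mnt' so the example runs anywhere.

```python
import os
import tempfile


def make_workdir(parent, name):
    """Create parent/name if needed, switch into it, and return the new cwd."""
    path = os.path.join(parent, name)
    os.makedirs(path, exist_ok=True)  # like 'mkdir', but no error if it exists
    os.chdir(path)                    # like 'cd'
    return os.getcwd()                # like 'pwd'


if __name__ == "__main__":
    # Assumption: a temporary directory stands in for /mnt.
    parent = tempfile.mkdtemp()
    print(make_workdir(parent, "mynetid"))
```

The printed path ends in 'mynetid', just as ``pwd`` ends in your NetID above.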
Download some data
------------------

Download the E. coli MG1655 protein data set::

   curl -O http://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.faa

This grabs that URL and saves the contents of 'NC_000913.faa' to the local
disk.

Next, download a Salmonella protein data set::

   curl -O http://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Salmonella_enterica_Serovar_Typhimurium_var__5__CFSAN001921_uid212972/NC_021814.faa

Likewise, this creates a local copy of NC_021814.faa.

Let's take a quick look at these files::

   head NC_000913.faa
   head NC_021814.faa

These files contain protein sequences in FASTA format from two different genomes.
What can we do with them?
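One simple thing to do is count the proteins. The '.faa' files are FASTA files: each record is a header line starting with '>' followed by one or more lines of sequence. Below is a minimal Python 3 sketch of a reader, assuming that plain layout; the function name ``read_fasta`` and the two example records are made up for illustration and are not taken from the NCBI files.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical records, only here so the sketch runs on its own.
EXAMPLE = """>protA hypothetical protein one
MKRISTT
ITTTITG
>protB hypothetical protein two
MRVLKFG
"""

records = list(read_fasta(io.StringIO(EXAMPLE)))
print(len(records))
for header, seq in records:
    print(header.split()[0], len(seq))
```

To try it on the real data, replace the ``io.StringIO(EXAMPLE)`` handle with ``open("NC_000913.faa")``.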
Format for BLAST and run BLAST
------------------------------

Format the E. coli data set for BLAST and run BLAST of the Salmonella proteins
against the MG1655 protein set::

   formatdb -i NC_000913.faa -o T -p T
   blastall -i NC_021814.faa -d NC_000913.faa -p blastp -e 1e-12 -o salm.x.ecoli

Look at the first 50 lines of the output file::

   head -50 salm.x.ecoli

Good, BLAST output! But if you type 'wc salm.x.ecoli' you'll see that
this file has 462,000 lines in it -- surely you don't want to look at
each one?

Let's convert the output to a CSV file instead, which can be opened in Excel::

   python /usr/local/share/ngs-scripts/blast/blast-to-csv-with-names.py NC_021814.faa NC_000913.faa salm.x.ecoli > salm.x.ecoli.csv

Take a look at this file::

   head salm.x.ecoli.csv
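A CSV file is also easy to read from a script. The sketch below uses Python 3's standard ``csv`` module, which copes with quoted fields that contain commas (protein names often do). The helper ``summarize_csv`` and the example rows are hypothetical; the real column layout of 'salm.x.ecoli.csv' may well differ, so treat this as a starting point only.

```python
import csv
import io


def summarize_csv(handle, preview=2):
    """Return (row_count, first_rows) for a CSV file handle."""
    first_rows = []
    count = 0
    for row in csv.reader(handle):
        if count < preview:
            first_rows.append(row)  # keep a few rows to eyeball
        count += 1
    return count, first_rows


# Hypothetical rows; the actual columns in salm.x.ecoli.csv are an assumption.
EXAMPLE = (
    'query1,"subject A, putative",1e-50\n'
    "query2,subject B,3e-20\n"
    "query3,subject C,1e-15\n"
)

count, first_rows = summarize_csv(io.StringIO(EXAMPLE))
print(count)
print(first_rows)
```

On the real file you would pass ``open("salm.x.ecoli.csv", newline="")`` instead of the in-memory example.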
But ... this file is on our remote computer. How do we get this file onto
our *local* computer? There are lots of ways of doing this; for now,
I've set up a Web server on your Amazon computer, so you can just type::

   ln -fs $PWD /var/www

and go to your computer name in your browser plus '/<NetID>/'. You
should see a bunch of files, including 'salm.x.ecoli.csv'. For an example,
go to::

   http://ec2-23-20-239-64.compute-1.amazonaws.com/titus/

Reciprocal BLAST calculation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Be sure to start in "your" directory::

   cd /mnt/<NetID>

Now, let's do the reciprocal BLAST, too::

   formatdb -i NC_021814.faa -o T -p T
   blastall -i NC_000913.faa -d NC_021814.faa -p blastp -e 1e-12 -o ecoli.x.salm

Extract the reciprocal best hits::

   python /usr/local/share/ngs-scripts/blast/blast-to-ortho-csv.py NC_021814.faa NC_000913.faa salm.x.ecoli ecoli.x.salm > ortho.csv

This generates a file 'ortho.csv', containing the ortholog assignments and
their annotations. Now download *that* to your local computer and take
a look at it in Excel.
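To make the idea of a "reciprocal best hit" concrete, here is a minimal Python 3 sketch. It is not necessarily what 'blast-to-ortho-csv.py' does internally. It assumes the hits have already been parsed into (query, subject, score) tuples, ranks hits by score (higher is better), and keeps the first hit seen on ties. All names and numbers are made up for illustration.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _score) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical hits: s* are Salmonella proteins, e* are E. coli proteins.
salm_vs_ecoli = [("s1", "e1", 200.0), ("s1", "e2", 90.0),
                 ("s2", "e2", 150.0), ("s3", "e1", 120.0)]
ecoli_vs_salm = [("e1", "s1", 198.0), ("e2", "s2", 149.0),
                 ("e2", "s1", 88.0)]

print(reciprocal_best_hits(salm_vs_ecoli, ecoli_vs_salm))
```

Here 's3' is dropped: its best hit is 'e1', but the best hit of 'e1' is 's1', so that pair is not reciprocal.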
Time for reflection
-------------------

Get together with those sitting around you and come up with three uses
for this kind of "batch BLAST" in your collective research, whatever
it may be. We'll make a list!

A few post-tutorial links
-------------------------

Explore the NCBI bacterial genome site here: http://ftp.ncbi.nlm.nih.gov/genomes/Bacteria

- '.faa' files are protein data sets;
- '.fna' files are genomic DNA;
- the rest are annotation files of various kinds.