forked from ninas/umonya_notes
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathparsing.html
152 lines (107 loc) · 5.16 KB
/
parsing.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Introductory Programming in Python: Basic Parsing</title>
<link rel='stylesheet' type='text/css' href='style.css' />
<meta http-equiv='Content-Type' content='text/html; charset=utf-8' />
<script src="animation.js" type="text/javascript">
</script>
</head>
<body onload="animate_loop()">
<div class="page">
<h1>Introductory Programming in Python: Lesson 20<br />
Basic Parsing</h1>
<div class="centered">
[<a href="regexp.html">Prev: Regular Expressions</a>] [<a href="index.html">Course Outline</a>] [<a href="os.html">Next: Operating System Functionality</a>]
</div>
<h2>What is Parsing?</h2>
<p>The vast majority of your time as a bioinformaticist will be spent
converting data from one format to another. You invariably want to
massage your raw data into a format which you can plug into an analysis
tool, or set of tools. Parsing is the name given to the reading and
programmatic interpretation of flat file formats. There are two basic
approaches to parsing files, namely the serial access approach, and the
document object model approach.</p>
<h2>General Comments of File Structure</h2>
<p>The majority of file formats have a structure that involves a small
bit of overall information, if any, followed by a collection of
repeated structures. A FastA file for example, has no header
information, but consists entirely of the repeated structure of title
line followed by sequence lines. If we write some code to read the
repeated structures, and put this in a loop, we have a parser. Multiple
levels of nesting of structures can be handled similarly, by writing
some code for the most deeply nested structure, and put that code
inside the loop or code for the outer structure, and so on and so
forth.</p>
<h2>The Serial Access Approach</h2>
<p>The serial approach involves reading the file in question from top
to bottom, in sequence, acting on each structural element of the file
format as it is encountered, then overwriting it with the next one.</p>
<p><strong>Pros:</strong></p>
<ul>
<li>Files are usually accessed in a serial or sequential manner
anyway, so this approach adds no complexity</li>
<li>Because we are only keeping one structural element in memory at
a time, we can handle large files with ease.</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>We have to read all the data in the file up to the structure of
interest, even if it's of no importance to us.</li>
<li>Some file structures contain back references, i.e. structures
that point to formerly encountered structures. If we have thrown
these previously encountered structures away, then the structure we
are dealing with at the moment does not have complete information.
We must reprocess on a second pass to fill out this
information.</li>
<li>The problem of forward referencing is solvable without a second
pass, although the solution is complex. Whenever we encounter a
forward reference we must record where the reference was made, and
to what it is being made. Every structure read from file after that
must be checked against the entire list of 'unfulfilled
references', and if it matches one of them, appropriate data from
the structure is substituted into the structure referencing
it.</li>
</ul>
<h2>The Document Object Model (DOM) Approach</h2>
<p>The document object model approach refers to mapping out the file
format structure in memory using data structures that mirror the
structure in the file. The entire file is read in to memory, then
closed. Henceforth everything is dealt with in memory. So an outline of
the DOM approach is</p>
<ol>
<li>Look at the file</li>
<li>Identify it's structure</li>
<li>Create a data structure in memory that mirrors the file's
structure</li>
<li>Read the whole file into that data structure</li>
<li>Close the file</li>
</ol>
<p><strong>Pros</strong></p>
<ul>
<li>We can access the data in any order and far faster then reading
to it serially, once the entire file is loaded.</li>
<li>We gain complete understanding of the file format's
structure</li>
<li>Because all the structures are available at once, and
accessible directly without any tedious serial access, back
referencing and forward referencing are as simple as looking up the
relevant data.</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Large files or data sets do not conveniently fit into
memory</li>
<li>We have to read the whole file</li>
</ul>
<h2>Exercises</h2>
<div class="centered">
[<a href="regexp.html">Prev: Regular Expressions</a>] [<a href="index.html">Course Outline</a>] [<a href="os.html">Next: Operating System Functionality</a>]
</div>
</div>
<div class="pagefooter">
Copyright © James Dominy 2007-2008; Released under the <a href="http://www.gnu.org/copyleft/fdl.html">GNU Free Documentation License</a><br />
<a href="intropython.tar.gz">Download the tarball</a>
</div>
</body>
</html>