parsing.html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
	<title>Introductory Programming in Python: Basic Parsing</title>
	<link rel='stylesheet' type='text/css' href='style.css' />
	<meta http-equiv='Content-Type' content='text/html; charset=utf-8' />
	<script src="animation.js" type="text/javascript">
	</script>
</head>

<body onload="animate_loop()">
	<div class="page">

		<h1>Introductory Programming in Python: Lesson 20<br />

		Basic Parsing</h1>

		<div class="centered">
			[<a href="regexp.html">Prev: Regular Expressions</a>]&nbsp;[<a href="index.html">Course Outline</a>]&nbsp;[<a href="os.html">Next: Operating System Functionality</a>]
		</div>

		<h2>What is Parsing?</h2>

		<p>The vast majority of your time as a bioinformaticist will be spent
		converting data from one format to another. You invariably want to
		massage your raw data into a format which you can plug into an analysis
		tool, or set of tools. Parsing is the name given to the reading and
		programmatic interpretation of flat file formats. There are two basic
		approaches to parsing files, namely the serial access approach, and the
		document object model approach.</p>

		<h2>General Comments of File Structure</h2>

		<p>The majority of file formats have a structure that involves a small
		bit of overall information, if any, followed by a collection of
		repeated structures. A FastA file for example, has no header
		information, but consists entirely of the repeated structure of title
		line followed by sequence lines. If we write some code to read the
		repeated structures, and put this in a loop, we have a parser. Multiple
		levels of nesting of structures can be handled similarly, by writing
		some code for the most deeply nested structure, and put that code
		inside the loop or code for the outer structure, and so on and so
		forth.</p>

		<h2>The Serial Access Approach</h2>

		<p>The serial approach involves reading the file in question from top
		to bottom, in sequence, acting on each structural element of the file
		format as it is encountered, then overwriting it with the next one.</p>

		<p><strong>Pros:</strong></p>

		<ul>

			<li>Files are usually accessed in a serial or sequential manner
			anyway, so this approach adds no complexity</li>

			<li>Because we are only keeping one structural element in memory at
			a time, we can handle large files with ease.</li>

		</ul>

		<p><strong>Cons:</strong></p>

		<ul>

			<li>We have to read all the data in the file up to the structure of
			interest, even if it's of no importance to us.</li>

			<li>Some file structures contain back references, i.e. structures
			that point to formerly encountered structures. If we have thrown
			these previously encountered structures away, then the structure we
			are dealing with at the moment does not have complete information.
			We must reprocess on a second pass to fill out this
			information.</li>

			<li>The problem of forward referencing is solvable without a second
			pass, although the solution is complex. Whenever we encounter a
			forward reference we must record where the reference was made, and
			to what it is being made. Every structure read from file after that
			must be checked against the entire list of 'unfulfilled
			references', and if it matches one of them, appropriate data from
			the structure is substituted into the structure referencing
			it.</li>

		</ul>

		<h2>The Document Object Model (DOM) Approach</h2>

		<p>The document object model approach refers to mapping out the file
		format structure in memory using data structures that mirror the
		structure in the file. The entire file is read in to memory, then
		closed. Henceforth everything is dealt with in memory. So an outline of
		the DOM approach is</p>

		<ol>

			<li>Look at the file</li>

			<li>Identify it's structure</li>

			<li>Create a data structure in memory that mirrors the file's
			structure</li>

			<li>Read the whole file into that data structure</li>

			<li>Close the file</li>

		</ol>

		<p><strong>Pros</strong></p>

		<ul>

			<li>We can access the data in any order and far faster then reading
			to it serially, once the entire file is loaded.</li>

			<li>We gain complete understanding of the file format's
			structure</li>

			<li>Because all the structures are available at once, and
			accessible directly without any tedious serial access, back
			referencing and forward referencing are as simple as looking up the
			relevant data.</li>

		</ul>

		<p><strong>Cons:</strong></p>

		<ul>

			<li>Large files or data sets do not conveniently fit into
			memory</li>

			<li>We have to read the whole file</li>

		</ul>

		<h2>Exercises</h2>

		<div class="centered">
			[<a href="regexp.html">Prev: Regular Expressions</a>]&nbsp;[<a href="index.html">Course Outline</a>]&nbsp;[<a href="os.html">Next: Operating System Functionality</a>]
		</div>
	</div>

	<div class="pagefooter">
		Copyright &copy; James Dominy 2007-2008; Released under the <a href="http://www.gnu.org/copyleft/fdl.html">GNU Free Documentation License</a><br />
		<a href="intropython.tar.gz">Download the tarball</a>
	</div>
</body>
</html>