Reading csv files with data.csv

jonase edited this page Aug 27, 2011 · 1 revision

This introduction walks through an example of using data.csv. Please try running the examples, and if you find errors (in the code or the text) or want to provide feedback, do not hesitate to contact me via GitHub.

data.csv is a library for reading and writing comma-separated value (CSV) files in the Clojure programming language. It consists of only two public functions, read-csv and write-csv. Both functions live in the namespace clojure.data.csv. A sample namespace declaration might look something like this:

(ns mlx.csv
  (:require [clojure.data.csv :as csv]
            [clojure.java.io  :as io]))
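
Before turning to a real file, here is a minimal sketch of both functions working on in-memory data. It assumes data.csv is on the classpath; the names rows and written are only illustrative, and a StringWriter stands in for a file writer.

```clojure
(require '[clojure.data.csv :as csv])

;; read-csv accepts a string as well as a java.io.Reader,
;; which is handy for quick experiments.
(def rows (csv/read-csv "Date,Open\n2011-06-13,164.44"))

;; Every field comes back as a string; numeric parsing is left to the caller.
(first rows)
;; => ["Date" "Open"]

;; write-csv takes a java.io.Writer and a sequence of rows.
(def written
  (let [writer (java.io.StringWriter.)]
    (csv/write-csv writer [["Date" "Open"] ["2011-06-13" "164.44"]])
    (str writer)))
```

Each row is a vector of strings in both directions, so any conversion to numbers or dates happens in your own code.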

Data

In order to have some data to work with I visited http://finance.yahoo.com and searched for IBM's historical stock prices. Below the table of prices at that URL is a link where you can download the csv file used in this introduction. I saved the file as ibm.csv. Let's take a brief look at the file.

There are a total of 12449 lines in the file.

$ wc -l ibm.csv 
12449 ibm.csv

The file is roughly 624 KB (assuming ls -s reports 1 KB blocks, as GNU ls does by default). Not huge, but large enough to be interesting to work with.

$ ls -s ibm.csv 
624 ibm.csv

Looking at the structure of the file, there are 7 columns. In this example I'm interested in the date (first), open (second) and close (fifth) columns.

$ head -n 4 ibm.csv 
Date,Open,High,Low,Close,Volume,Adj Close
2011-06-13,164.44,164.46,162.73,163.17,5099200,163.17
2011-06-10,164.57,164.84,162.87,163.18,4683300,163.18
2011-06-09,165.01,165.96,164.76,164.84,4288800,164.84
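
The rest of this introduction addresses columns by position. As an aside, the header row can also be used to address fields by name. The sketch below is an assumption of this example rather than a data.csv feature: the library returns plain vectors, and the zipmap step that builds maps is added on top. The names sample and records are illustrative.

```clojure
(require '[clojure.data.csv :as csv])

;; The header and first data row from ibm.csv, as shown above.
(def sample
  (str "Date,Open,High,Low,Close,Volume,Adj Close\n"
       "2011-06-13,164.44,164.46,162.73,163.17,5099200,163.17"))

;; Split off the header, then pair each field with its column name.
(def records
  (let [[header & rows] (csv/read-csv sample)]
    (mapv #(zipmap header %) rows)))

(get (first records) "Close")
;; => "163.17"
```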

We'll need a few helper functions, so let's define them now.

;; High school math
(defn percentage-change [old new]
  (* (/ (- new old) old) 100))

;; Clojure only provides first and second
(def fifth #(nth % 4))

;; Extract the interesting part
(def extract (juxt first second fifth))

;; Parse the data
(defn parse [stock-data-row]
  (let [[date open close] (extract stock-data-row)]
    [date (percentage-change (Double/parseDouble open)
                             (Double/parseDouble close))]))

Given a row of stock data, parse extracts the date and calculates the percentage change between the opening and closing prices:

user=> (parse ["2011-06-10" "164.57" "164.84" 
               "162.87" "163.18" "4683300" "163.18"])
["2011-06-10" -0.844625387373146]
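
To see what each helper contributes, the following sketch calls them one at a time. The definitions are repeated from above so the snippet runs on its own; the name row and the numeric inputs are made up for illustration.

```clojure
(defn percentage-change [old new]
  (* (/ (- new old) old) 100))

(def fifth #(nth % 4))
(def extract (juxt first second fifth))

(def row ["2011-06-13" "164.44" "164.46" "162.73" "163.17" "5099200" "163.17"])

;; extract picks the date, open and close fields, still as strings.
(extract row)
;; => ["2011-06-13" "164.44" "163.17"]

;; With doubles, the result is a double.
(percentage-change 200.0 250.0)
;; => 25.0

;; With integers, Clojure's / yields an exact ratio rather than a double,
;; which is one reason parse converts the fields with Double/parseDouble.
(percentage-change 3 4)
;; => 100/3
```

A negative result means the closing price was lower than the opening price.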

Reading

I want to find out on which date IBM's stock price increased the most between open and close, over a period of almost 50 years of data. Here is the code for achieving that:

(defn max-change [csv-file]
  (with-open [reader (io/reader csv-file)]
    (->> (rest (csv/read-csv reader))
         (map parse)
         (apply max-key second))))

Let's walk through it. The second line creates a reader, and the with-open macro guarantees that the reader is closed when the body finishes, even if an exception is thrown.

The third line, (->> (rest (csv/read-csv reader)) ...), uses the data.csv function read-csv to lazily read the content of the file; rest drops the header row. The threading macro ->> then passes the result on for further processing.

The fourth line parses every row of stock data as we saw in the previous section.

The last line figures out on which day the stock increased the most, percentage-wise.
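
Because read-csv is lazy, rows are only read as they are consumed, so all processing has to finish before with-open closes the reader. max-change is safe because max-key consumes every row inside the with-open body. A function that returns the rows themselves would need to realize them first, for example with doall. A minimal sketch, assuming a hypothetical helper named read-all-rows and an in-memory StringReader in place of a file:

```clojure
(require '[clojure.data.csv :as csv])

;; Hypothetical helper: returns every row, fully realized.
(defn read-all-rows [reader]
  (with-open [r reader]
    ;; doall forces the lazy sequence while the reader is still open.
    (doall (csv/read-csv r))))

(def sample "Date,Open\n2011-06-13,164.44\n2011-06-10,164.57")

(def rows (read-all-rows (java.io.StringReader. sample)))

(count rows)
;; => 3
```

Without the doall, the rows would only be read after the reader had already been closed, and consuming them would fail.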

Let's try it out in the REPL:

user=> (max-change "ibm.csv")
["2001-01-03" 12.979104477611948]

January 3, 2001 was apparently a good day for IBM.

If the last expression, (apply max-key second), is replaced with ((juxt #(apply max-key second %) #(apply min-key second %))), we get both the largest increase and the largest decrease:

user=> (max-change "ibm.csv")
[["2001-01-03" 12.979104477611948] ["1987-10-19" -23.51851851851852]]
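
For reference, here is a sketch of the whole function after that replacement. To make it runnable without downloading ibm.csv, it reads the three sample rows from the Data section through a StringReader; the name max-and-min-change and the reader argument are assumptions of this sketch, not part of the original code.

```clojure
(require '[clojure.data.csv :as csv])

(defn percentage-change [old new]
  (* (/ (- new old) old) 100))

(def fifth #(nth % 4))
(def extract (juxt first second fifth))

(defn parse [stock-data-row]
  (let [[date open close] (extract stock-data-row)]
    [date (percentage-change (Double/parseDouble open)
                             (Double/parseDouble close))]))

;; Header plus the three sample rows shown in the Data section.
(def sample
  (str "Date,Open,High,Low,Close,Volume,Adj Close\n"
       "2011-06-13,164.44,164.46,162.73,163.17,5099200,163.17\n"
       "2011-06-10,164.57,164.84,162.87,163.18,4683300,163.18\n"
       "2011-06-09,165.01,165.96,164.76,164.84,4288800,164.84"))

;; Hypothetical name; takes any java.io.Reader instead of a file name.
(defn max-and-min-change [reader]
  (with-open [r reader]
    (->> (rest (csv/read-csv r))
         (map parse)
         ;; ->> appends the parsed rows as the argument of the juxt-built function.
         ((juxt #(apply max-key second %)
                #(apply min-key second %))))))

(def result (max-and-min-change (java.io.StringReader. sample)))

(mapv first result)
;; => ["2011-06-09" "2011-06-10"]
```

The extra pair of parentheses around the juxt form matters: ->> threads the rows in as the last element of the outer list, which makes them the argument of the function that juxt returns. All three sample days closed lower than they opened, so the "maximum" here is simply the smallest loss.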

You can read about October 19, 1987 on http://en.wikipedia.org/wiki/Black_Monday_(1987).