coderholic

Parsing CSV data in Python

Python provides the csv module for parsing comma separated value files. It allows you to iterate over each line in a csv file and gives you a list of items on that row. For example, given the following csv data:

id, name, date
0, name, 2009-01-01
1, another name, 2009-02-01

You'd end up with something like:

["id", "name", "date"],
["0", "name", "2009-01-01"],
["1", "another name", "2009-02-01"]

In some situations it is nice to have a dictionary of keys and values though, so that instead of a simple list of columns we end up with:

{"id": "0", "name": "name", "date": "2009-01-01"},
{"id": "1", "name": "another name", "date": "2009-02-01"}

This would allow us to refer to fields by name rather than position in the list. Do you really want to remember that date is in position 2? And what happens if the input data changes, and a new column is added between name and date? If we're referring to columns by position then we'll have to change our existing code, but by referring to it by name we won't have to change anything.

It turns out this is pretty easy to achieve, in only a few lines of python:

import csv
data = csv.reader(open('data.csv'))
# Read the column names from the first line of the file
fields = data.next()
for row in data:
        # Zip together the field names and values
    items = zip(fields, row)
    item = {}
        # Add the value to our dictionary
    for (name, value) in items:
        item[name] = value.strip()

The csv module allows you to specify a delimiter, so if your data separated you just need to make a single change:

data = csv.reader(open('data.tsv'), delimiter='\t')

Update

Thanks to several people for mentioning csv.DictReader, which does exactly what I've mentioned here. Having a look at the code it does something very similar, but also takes into account rows of different length, ignores empty columns, and uses the method Tim mentioned in the comments for creating the dictionary:

    # From csv.py
    def next(self):
        if self.line_num == 0:
            # Used only for its side effect.
            self.fieldnames
        row = self.reader.next()
        self.line_num = self.reader.line_num

        # unlike the basic reader, we prefer not to return blanks,
        # because we will typically wind up with a dict full of None
        # values
        while row == []:
            row = self.reader.next()
        d = dict(zip(self.fieldnames, row))
        lf = len(self.fieldnames)
        lr = len(row)
        if lf < lr:
            d[self.restkey] = row[lf:]
        elif lf > lr:
            for key in self.fieldnames[lr:]:
                d[key] = self.restval
        return d
Posted on 03 Sep 2009
If you enjoyed reading this post you might want to follow @coderholic on twitter or browse though the full blog archive.