Saturday, April 26, 2014

Parsing PDFs in Python

I am a big fan of personal finance and I always like to keep my books up to date. My favourite accounting software is GnuCash. It’s free, powerful, and allows you to import transactions in various established financial interchange formats, such as QIF (the Quicken format) and OFX. Unfortunately, some institutions only allow you to export your monthly statements as M$ Excel, or worse, PDF.

In my particular case it was AMEX Canada, which only provides monthly statements as downloadable PDFs. Manually copying the transactions over into GnuCash is not an option for me; I have better things to do with my time. So I set out to find a way to convert my AMEX statements into a format that GnuCash understands, with QIF being the least painful one to convert to.

The pain of making sense of PDFs

PDF is an evil format. Even though it is called a document format, it is closer to an image format: it carries far less structure than, for example, XML, HTML, or EPUB. There have been several attempts to parse PDFs in Python; however, the packages PyPDF and PyPDF2 are completely oblivious to the layout of the PDF. All you get is a stream of characters (without any spacing or formatting information).

Yusuke Shinyama has a three-part video series explaining how to make sense of the raw format. Also feeling the need to make sense of PDF data, he developed a Python package called PDFMiner that allows you to extract strings and layout information from PDFs. He provides elaborate documentation explaining the design of his miner.

After a few tries with PyPDF2 I decided to give PDFMiner a chance. Below you find a code snippet that allows you to parse a PDF and get some structured plain-text content out of it.

#!/usr/bin/env python

import sys
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from StringIO import StringIO
from pdfminer.layout import LAParams
from pdfminer.converter import TextConverter

class MyParser(object):
    def __init__(self, pdf):
        # Snippet adapted from Yusuke Shinyama's
        # PDFMiner documentation
        # Create the document model from the file
        parser = PDFParser(open(pdf, 'rb'))
        document = PDFDocument(parser)
        # Abort if the document does not allow text extraction
        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed
        # Create a PDF resource manager object 
        # that stores shared resources.
        rsrcmgr = PDFResourceManager()
        # Create a buffer for the parsed text
        retstr = StringIO()
        # Spacing parameters for parsing
        laparams = LAParams()
        codec = 'utf-8'

        # Create a PDF device object
        device = TextConverter(rsrcmgr, retstr, 
                               codec = codec, 
                               laparams = laparams)
        # Create a PDF interpreter object
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Process each page contained in the document.
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
        
        self.records = []
        
        lines = retstr.getvalue().splitlines()
        for line in lines:
            self.handle_line(line)
    
    def handle_line(self, line):
        # Customize your line-by-line parser here
        self.records.append(line)

if __name__ == '__main__':
    p = MyParser(sys.argv[1])
    print('\n'.join(p.records))

With this sample it was a piece of cake to develop a simple parsing grammar for the transaction records and dump them into a QIF file that could be imported into GnuCash. Since my QIF implementation became quite elaborate in order to handle all the formatting corner cases, I leave you with the conceptual line-by-line parser shown above to illustrate the approach.
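To give a flavour of what such a line handler could look like, here is a minimal sketch rather than my actual implementation. The statement layout it expects ("Apr 12 COFFEE SHOP TORONTO 4.50", with a trailing "CR" for payments and refunds) is invented for illustration and is not AMEX's real layout. The names RECORD, parse_line, and to_qif are hypothetical, and the statement year has to be passed in because this made-up layout does not carry it. The QIF side uses only the basic record fields (D for date, T for amount, P for payee, ^ as record terminator); writing charges as negative amounts and dates as MM/DD/YYYY are assumptions that you may need to adjust for your own import.

```python
import re
from datetime import datetime
from decimal import Decimal

# Hypothetical statement line layout (an assumption, not AMEX's real format):
#   "<Mon> <day> <description> <amount>[ CR]"
#   e.g. "Apr 12 COFFEE SHOP TORONTO 4.50"
RECORD = re.compile(
    r'^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+'
    r'(?P<payee>.+?)\s+'
    r'(?P<amount>[\d,]+\.\d{2})\s*(?P<credit>CR)?$'
)


def parse_line(line, year):
    """Return (date, payee, amount) or None if the line is not a transaction."""
    match = RECORD.match(line.strip())
    if match is None:
        return None
    try:
        day = datetime.strptime(
            '%s %s %d' % (match.group('month'), match.group('day'), year),
            '%b %d %Y').date()
    except ValueError:
        # Looked like a record but the date is not valid, e.g. "Feb 30"
        return None
    amount = Decimal(match.group('amount').replace(',', ''))
    if not match.group('credit'):
        # Assumption: charges are recorded as negative amounts
        amount = -amount
    return (day, match.group('payee'), amount)


def to_qif(records):
    """Serialize (date, payee, amount) tuples as a credit card QIF document."""
    lines = ['!Type:CCard']
    for day, payee, amount in records:
        lines.append('D' + day.strftime('%m/%d/%Y'))
        lines.append('T%s' % amount)
        lines.append('P' + payee)
        lines.append('^')  # end of record
    return '\n'.join(lines) + '\n'
```

To wire this into the parser above, handle_line could call parse_line(line, year) and append the result to self.records only when it is not None; page headers, totals, and other noise then drop out automatically, and to_qif(p.records) gives you the text to write to a .qif file.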