In my particular case it was AMEX Canada, which only provides
monthly downloadable PDF statements. Manually copying the transactions
into GnuCash is not an option for me; I have better things to do with my time.
So I set out to convert my AMEX statements into a format
that GnuCash understands, with QIF being the least painful one to convert to.
The pain of making sense of PDFs
PDF is an evil format. Even though it is called a document format,
it behaves more like an image format: it carries far less structure
than, for example, XML, HTML, or EPUB. There have been several
attempts to parse PDFs in Python; however, the packages PyPDF and
PyPDF2 are completely oblivious to the layout of the PDF. All you get is a
stream of characters (without any spacing or formatting information).
Yusuke Shinyama has a three-part video series explaining
how to make sense of the raw format. Also feeling the need to make sense of PDF
data, he developed a Python package called PDFMiner that allows you to
extract strings and layout information from PDFs. He has written elaborate
documentation explaining the design of his miner.
After a few tries with PyPDF2 I decided to give PDFMiner a
chance. Below you find a code snippet that allows you to parse a PDF and get
some structured plain-text content out of it.
#!/usr/bin/env python
import sys
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from StringIO import StringIO
from pdfminer.layout import LAParams
from pdfminer.converter import TextConverter


class MyParser(object):
    def __init__(self, pdf):
        ## Snippet adapted from Yusuke Shinyama's
        ## PDFMiner documentation
        # Create the document model from the file
        parser = PDFParser(open(pdf, 'rb'))
        document = PDFDocument(parser)
        # Refuse to continue if the document forbids text extraction
        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed
        # Create a PDF resource manager object
        # that stores shared resources.
        rsrcmgr = PDFResourceManager()
        # Create a buffer for the parsed text
        retstr = StringIO()
        # Spacing parameters for parsing
        laparams = LAParams()
        codec = 'utf-8'
        # Create a PDF device object
        device = TextConverter(rsrcmgr, retstr,
                               codec=codec,
                               laparams=laparams)
        # Create a PDF interpreter object
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Process each page contained in the document.
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)

        # Split the extracted text into lines and parse them one by one
        self.records = []
        lines = retstr.getvalue().splitlines()
        for line in lines:
            self.handle_line(line)

    def handle_line(self, line):
        # Customize your line-by-line parser here
        self.records.append(line)


if __name__ == '__main__':
    p = MyParser(sys.argv[1])
    print '\n'.join(p.records)
With this sample it was a piece of cake to develop a simple parsing grammar for the transaction records and dump them into a QIF file that could be imported into GnuCash. Since my QIF implementation was quite elaborate in order to handle all the formatting corner cases, I leave you with the conceptual line-by-line parser shown above to illustrate the approach.
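To give a rough idea of what such a line parser and QIF dump might look like, here is a minimal sketch. It is not the elaborate implementation mentioned above. The statement line layout (date, description, amount, optional trailing "CR" for credits), the regular expression, and the names parse_line and to_qif are all assumptions made for illustration; a real statement will almost certainly need a different pattern. The QIF side uses only the basic record fields: D for the date, T for the amount, P for the payee, and ^ to end a record, under a !Type:CCard header.

```python
import re
from datetime import datetime
from decimal import Decimal

# Hypothetical line layout (an assumption, not the real AMEX format):
#   "01/15/2014 COFFEE SHOP TORONTO 4.50"
#   "01/20/2014 PAYMENT RECEIVED 1,250.00 CR"   <- trailing CR marks a credit
LINE_RE = re.compile(
    r'^(?P<date>\d{2}/\d{2}/\d{4})\s+'
    r'(?P<payee>.+?)\s+'
    r'(?P<amount>[\d,]+\.\d{2})(?P<credit>\s*CR)?$'
)


def parse_line(line):
    """Return (date, payee, amount), or None if the line is not a transaction."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    date = datetime.strptime(m.group('date'), '%m/%d/%Y').date()
    amount = Decimal(m.group('amount').replace(',', ''))
    # Charges on a credit card are outflows, so store them as negative amounts.
    if not m.group('credit'):
        amount = -amount
    return date, m.group('payee'), amount


def to_qif(records):
    """Render parsed records as the text of a credit-card QIF file."""
    out = ['!Type:CCard']
    for date, payee, amount in records:
        out.append('D' + date.strftime('%m/%d/%Y'))
        out.append('T{:.2f}'.format(amount))
        out.append('P' + payee)
        out.append('^')  # end of record
    return '\n'.join(out) + '\n'
```

In the parser above, handle_line could call parse_line and append only the results that are not None, and the final step would write to_qif(p.records) to a file instead of printing the raw lines. The date is written as month/day/year here; the importer may need to be told which date format to expect.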