Sunday, October 19, 2014

Plex Media Server and Removable Media

About a year ago I got a Samsung Smart TV. One of the neat things about this TV is that it can run apps like Plex. Plex is a client that connects to a corresponding media server, which can run on an old Linux box or a Linux NAS.
While the whole world seems to be moving towards streaming stuff from the cloud (incurring excessive bandwidth charges) and buying terabytes of hard drives (expensive), there is still a small fraction of people like me who embrace physical media. Blu-ray in particular seems to offer a reliable, cost-efficient way to store immutable data (i.e. stuff you never modify) like movies, favourite TV series, etc. Assuming dual-layer Blu-ray discs, you can stuff about 25 TB into a single case. If you buy dual-layer BD-Rs in bulk from Japan, you can get them for as little as $3 to $4 apiece. Compare that to investing in a 25 TB NAS array. In terms of long-term storage characteristics, Blu-ray runs circles around your regular HDD. As you read this, Panasonic and Sony have paired up to release a 300 GB disc by the end of 2015.
Anyway, the assumption that you stream from an HDD is baked into many media servers, Plex included. The obvious way to fix this is to hack around it at the OS level. A decade ago the new kid on the block for this was autofs on Linux, which is still part of many modern distributions. My media server currently still runs Ubuntu 11.04, where I could install it using
apt-get install autofs
Once installed, it creates the following files in /etc:
/etc/auto.master
/etc/auto.misc
/etc/auto.net
/etc/auto.smb
The idea of autofs is the following. Once you touch a directory that points to an external source, autofs tries to mount it. Specifically for my Blu-Ray player I did the following:
#/etc/auto.misc
cdrom -fstype=iso9660,ro,nosuid,nodev :/dev/cdrom
cdrom is the name of the directory to touch in the mount-point directory of autofs, which is declared in:
#/etc/auto.master
#
# Sample auto.master file
# This is an automounter map and it has the following format
# key [ -mount-options-separated-by-comma ] location
# For details of the format look at autofs(5).
#
/vol /etc/auto.misc --timeout 3
In my case I chose /vol, and anything mounted below it is unmounted after 3 seconds of inactivity. In a nutshell, this boils down to mounting /vol/cdrom whenever it gets touched. What is left to do is to create the mount-point directory and give it the appropriate permissions. Make sure that the autofs mount-point is not also listed in /etc/fstab.
sudo mkdir /vol
sudo chmod 0755 /vol
Now restart the autofs service and try it out.
sudo service autofs restart
ls /vol/cdrom
...
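The same "touch the directory and see what comes back" check can be sketched in Python. This is only an illustration: the helper name has_media is made up for this example, and it assumes that a failed automount shows up as an OSError when the directory is listed.

```python
import os


def has_media(mount_path):
    """Return True if listing mount_path yields at least one entry.

    Listing the directory is the access that makes autofs attempt the
    mount. Assumption: with no disc in the drive the lookup fails with
    an OSError, which is treated here as "no media".
    """
    try:
        return len(os.listdir(mount_path)) > 0
    except OSError:
        return False


if __name__ == "__main__":
    # /vol/cdrom is the automount directory configured above
    print(has_media("/vol/cdrom"))
```

An empty directory and a missing one are both reported as "no media", which mirrors what ls piped into wc -l would tell you.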
Now the issue left on the table is to instruct the media server to reset its index every time a disc is removed and to re-index every time a disc is inserted. For Plex, I wrote this little script to do the grunt work. It takes the auto-mount point as its first parameter and the id of the Plex library section that points to the drive as its second parameter. If it runs and detects a medium, a re-index is triggered. If no medium is inserted, the index is wiped.
#!/bin/bash
if [ "$#" -ne 2 ]; then
    echo "Illegal number of parameters"
    exit 1
fi

VOL=$1
SECTION=$2

if [ "$(ls "${VOL}" 2> /dev/null | wc -l)" -ge 1 ]; then
 # Make sure only one indexer is running at the same time
 # (the grep process itself accounts for one matching line)
 if [ "$(ps -eadf | grep "Plex Media Scanner" | wc -l)" -le 1 ]; then
  # Index the section
  /usr/lib/plexmediaserver/Plex\ Media\ Scanner --scan --refresh --section "${SECTION}" &> /dev/null
 fi
else
 if [ "$(ps -eadf | grep "Plex Media Scanner" | wc -l)" -le 1 ]; then
  # Wipe the section
  /usr/lib/plexmediaserver/Plex\ Media\ Scanner --reset --section "${SECTION}" &> /dev/null
 fi
fi
To run it automatically, you can add it to the crontab of root as follows.
sudo crontab -e
# add the following line
* * * * * su plex -c "/usr/sbin/scan_plex /vol/cdrom 4"
Cron runs this script every minute, so within a minute of inserting a disc its contents are indexed. Potential extensions are to trigger such a script from udev on auto-mount, or to write your own little daemon, but that is a little more complex.
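For the "own little daemon" idea, here is a rough sketch in Python of what a polling loop could look like. It is not what I run: the function names, the polling interval and the max_polls parameter are made up for illustration, and unlike the cron script it only calls the scanner when the disc state changes. The scanner path and flags are the ones used in the script above, and the process would still have to run as the plex user.

```python
import subprocess
import time

# Scanner binary and flags as used in the shell script above
SCANNER = "/usr/lib/plexmediaserver/Plex Media Scanner"


def scanner_command(media_present, section):
    """Build the scanner call: re-index if a disc is present, else wipe."""
    if media_present:
        return [SCANNER, "--scan", "--refresh", "--section", str(section)]
    return [SCANNER, "--reset", "--section", str(section)]


def watch(check_media, section, interval=5.0, run=subprocess.call,
          max_polls=None):
    """Poll check_media() and call the scanner only on state changes.

    check_media is any callable returning True or False; run and
    max_polls exist mainly so the loop can be exercised without Plex.
    """
    last_state = None
    polls = 0
    while max_polls is None or polls < max_polls:
        present = check_media()
        if present != last_state:
            run(scanner_command(present, section))
            last_state = present
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(interval)
```

A call such as watch(lambda: bool(os.listdir("/vol/cdrom")), 4) would tie it to the mount point and section id from the crontab line, with the caveat that every poll touches the directory and therefore keeps resetting the autofs timeout.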

Saturday, April 26, 2014

Parsing PDFs in Python

I am a big fan of personal finance and I always like to keep my books up to date. My favourite accounting software is GnuCash. It’s free, powerful, and allows you to import transactions in various established financial interchange formats, such as Quicken (QIF), OFX, etc. Unfortunately, some institutions only allow you to export your monthly statements as M$ Excel, or worse, PDF.

In my particular case it was AMEX Canada, which only provides monthly statements as downloadable PDFs. Manually copying the transactions over into GnuCash is not an option for me. I have better things to do with my time. So I set out to find a way to convert my AMEX statements into a format that GnuCash understands, with QIF being the least painful one to convert to.

The pain of making sense of PDFs

PDF is an evil format. Even though it is called a document format, it is closer to an image format and has far less structure than, for example, XML, HTML, or EPUB. There have been several attempts to parse PDFs in Python in the past; however, the packages PyPDF and PyPDF2 are completely oblivious to the layout of the PDF. All you get is a stream of characters (without any spacing or formatting information).
Yusuke Shinyama has a three-part video series explaining how to make sense of the raw format. Also feeling the need to make sense of PDF data, he developed a Python package called PDFMiner that allows you to extract strings and layout information from PDFs. He has written elaborate documentation explaining the design of his miner.
After a few tries with PyPDF2 I decided to give PDFMiner a chance. Below you find a code snippet that allows you to parse a PDF and get some structured plain-text content out of it.

#!/usr/bin/env python

import sys
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from StringIO import StringIO
from pdfminer.layout import LAParams
from pdfminer.converter import TextConverter

class MyParser(object):
    def __init__(self, pdf):
        ## Snippet adapted from Yusuke Shinyama's
        # PDFMiner documentation
        # Create the document model from the file
        parser = PDFParser(open(pdf, 'rb'))
        document = PDFDocument(parser)
        # Check that the document allows text extraction
        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed
        # Create a PDF resource manager object 
        # that stores shared resources.
        rsrcmgr = PDFResourceManager()
        # Create a buffer for the parsed text
        retstr = StringIO()
        # Spacing parameters for parsing
        laparams = LAParams()
        codec = 'utf-8'

        # Create a PDF device object
        device = TextConverter(rsrcmgr, retstr, 
                               codec = codec, 
                               laparams = laparams)
        # Create a PDF interpreter object
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Process each page contained in the document.
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
        
        self.records = []
        
        lines = retstr.getvalue().splitlines()
        for line in lines:
            self.handle_line(line)
    
    def handle_line(self, line):
        # Customize your line-by-line parser here
        self.records.append(line)

if __name__ == '__main__':
    p = MyParser(sys.argv[1])
    print '\n'.join(p.records)

With this sample it was a piece of cake to develop a simple parsing grammar for the transaction records and dump them into a QIF file that could be imported into GnuCash. Since my QIF implementation is quite elaborate in order to handle all the formatting corner cases, I leave you with the conceptual line-by-line parser shown above to illustrate the approach.
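To give a flavour of what such a line handler might look like, here is a minimal sketch. It is not my actual implementation: the statement line layout (ISO date, payee, amount), the regular expression, the helper names and the sign convention are all assumptions made for illustration, and a real statement will need its own grammar. Only the basic QIF record layout (a !Type: header, D/T/P fields and a ^ terminator per record) is standard.

```python
import re

# Hypothetical statement line: "2014-03-15 COFFEE SHOP TORONTO 4.50"
LINE_RE = re.compile(
    r"^(\d{4})-(\d{2})-(\d{2})\s+(.+?)\s+(-?[\d,]+\.\d{2})$")


def parse_record(line):
    """Return a dict for a transaction line, or None for any other line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    year, month, day, payee, amount = match.groups()
    return {
        "date": "%s/%s/%s" % (month, day, year),
        "payee": payee,
        # Assumption: the statement lists charges as positive numbers,
        # while a credit card account books them as negative amounts.
        "amount": -float(amount.replace(",", "")),
    }


def to_qif(records):
    """Serialize parsed records as a credit card QIF document."""
    out = ["!Type:CCard"]
    for record in records:
        out.append("D" + record["date"])
        out.append("T%.2f" % record["amount"])
        out.append("P" + record["payee"])
        out.append("^")
    return "\n".join(out) + "\n"
```

In the parser above, handle_line would call parse_record and append only the results that are not None, and to_qif(self.records) would then produce the text to write to the import file.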