October 3, 2014

Extract annotations and highlighted passages from PDF files

The problem

Every student, researcher and scientist has to read and make notes and highlight or scribble on texts. In long-gone days I spent ages working on Uni literature and making summary charts of what I read, and I really wish I still had some of those notes at my fingertips.

Then came the electronic age. I pretty soon decided that the only future-proof way of marking up electronic texts is to do it in the actual PDF files themselves. In 20 years I should still be able to open up a PDF from yesterday and find the scribbles I made and the thoughts I had.

But how to know which of the zillions of PDFs on my hard drive or in my portion of the cloud I have actually made notes on, without opening them up?

The wrong solution: special software

Sure, the excellent ZotFile plug-in for the excellent Zotero can extract my notes, but I can only see them in Zotero, not on my hard drive. At some point you can be sure the actual PDFs will get detached from Zotero, or Zotero will get bought by Google and put to death, or whatever. And don’t get me started on Mendeley or even EndNote.

The right solution: use my little script

So I wrote my very first Python script, which you can find on GitHub. You need to have Python installed. Make the script executable and save it at the root of the folder where your subfolders with PDFs reside. When you run it, it finds and extracts the annotations or highlights from every PDF file within that folder and all its subfolders. So if your file myfile.pdf has annotations or highlights, it saves a file myfile.annotations.txt in the same folder. The new text file has the same modification date and the same base name as the PDF, so it should get listed together with its big sister in your file browser, whether your files are sorted by modification date or by title. This means you can see at a glance whether you have already marked up a PDF, and you can read all the notes (together with the approximate page numbers) without having to open the PDF and click through it.
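To make the idea concrete, here is a minimal sketch of that walk-and-write loop. It is not the script from GitHub, just an illustration under assumptions: the function names (`write_sidecars`, `extract_with_pypdf`) and the `p. N: text` line format are made up for this example, and the extractor assumes the third-party pypdf library is installed. It also only reads the note text stored in each annotation's `/Contents` entry, so it will not recover the highlighted passage itself.

```python
import os
from pathlib import Path
from typing import Callable, List, Tuple

Note = Tuple[int, str]  # (page number, note text)


def extract_with_pypdf(pdf_path: Path) -> List[Note]:
    """Assumed extractor: collect the /Contents text of every annotation."""
    from pypdf import PdfReader  # third-party; imported lazily so it stays optional

    notes: List[Note] = []
    reader = PdfReader(str(pdf_path))
    for page_number, page in enumerate(reader.pages, start=1):
        if "/Annots" not in page:
            continue
        for ref in page["/Annots"]:
            annot = ref.get_object()
            if "/Contents" in annot:
                text = str(annot["/Contents"]).strip()
                if text:
                    notes.append((page_number, text))
    return notes


def write_sidecars(
    root: str,
    extract: Callable[[Path], List[Note]] = extract_with_pypdf,
) -> List[Path]:
    """Write NAME.annotations.txt next to every annotated NAME.pdf under root."""
    written: List[Path] = []
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(".pdf"):
                continue
            pdf_path = Path(folder) / name
            notes = extract(pdf_path)
            if not notes:
                continue  # no annotations, so no text file
            sidecar = pdf_path.with_name(pdf_path.stem + ".annotations.txt")
            lines = [f"p. {page}: {text}" for page, text in notes]
            sidecar.write_text("\n".join(lines) + "\n", encoding="utf-8")
            # Copy the PDF's timestamps so both files sort together by date.
            stamp = pdf_path.stat()
            os.utime(sidecar, ns=(stamp.st_atime_ns, stamp.st_mtime_ns))
            written.append(sidecar)
    return written
```

Calling `write_sidecars("/path/to/your/pdfs")` would process the whole tree. Passing the extractor in as a parameter is a design choice for this sketch: it keeps the folder walk independent of whichever PDF library you end up using.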

Include the script in a cron job (or a scheduled task on Windows) to run every 15 minutes or so, so you don’t have to remember to run it. It only takes a few seconds for the 34,000 files in my folder, of which quite a few thousand are PDFs, of which about 250 have annotations.

I would be interested to hear if this works on Windows and Mac OS too (I am on Linux).

For pure happiness and efficiency, combine this with my other tips on mirroring your Zotero collections on your filesystem. So if, say, you are syncing your PDFs with your phone or tablet, you will see the same little text files next to your PDFs on the other device too, so you will know which ones you have read and which ones you haven’t.

Pretty nifty.

Extra feature: extracting to-dos

When I am reading and commenting on PDFs I often realise I need to carry out further tasks like, say, look up a reference mentioned in the document. So I could switch to some other program (or a sticky note) and write down the to-do and copy the information I need to do the task as well as perhaps the name of the PDF etc etc.

If this is part of your workflow too, you will want to know about an extra feature of my script. If you write a short pop-up note in your PDF and include just the characters “xk” (you can change this in the script) anywhere in the note, then when you run the script you should find a folder called xk in the root of your drive, full of little text files, one for each pop-up note which has “xk” somewhere in it.

So if for example you make a note

remember to ask the Professor for more info on the project xk

on p. 6 of the document “mydoc.pdf”, you should find a corresponding note in the folder xk named something like this:

remember to ask the Professor for more info on the project xk - mydoc - p. 6.txt
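A rough sketch of how such a to-do step could work, again as an illustration and not the code from my script. The names `todo_filename` and `write_todos`, the character clean-up, the 100-character cap on the note text and the choice to store the note text inside the file are all assumptions made for this example. The target folder is passed in explicitly here, not fixed to the root of the drive.

```python
from pathlib import Path
from typing import Iterable, List, Tuple

MARKER = "xk"  # the trigger characters; change to taste
UNSAFE = '/\\:*?"<>|\n\r\t'  # characters that are risky in file names


def todo_filename(note: str, pdf_path: str, page: int, max_len: int = 100) -> str:
    """Build 'NOTE - DOCUMENT - p. N.txt' from a pop-up note."""
    cleaned = "".join("_" if ch in UNSAFE else ch for ch in note).strip()
    return f"{cleaned[:max_len]} - {Path(pdf_path).stem} - p. {page}.txt"


def write_todos(
    notes: Iterable[Tuple[int, str]],
    pdf_path: str,
    todo_dir: str,
) -> List[Path]:
    """Write one small text file per note that contains MARKER."""
    out_dir = Path(todo_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for page, text in notes:
        if MARKER not in text:
            continue  # ordinary note, not a to-do
        target = out_dir / todo_filename(text, pdf_path, page)
        target.write_text(text + "\n", encoding="utf-8")
        written.append(target)
    return written
```

With the example note above, `todo_filename(note, "mydoc.pdf", 6)` gives the file name shown. Note that the marker is matched as a plain substring, so any word that happens to contain the characters would also trigger a to-do.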






This blog by Steve Powell is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, syndicated on r-bloggers and powered by Blot.