Recently I needed to migrate thousands of PDF fiscal impact statements for the South Carolina Revenue and Fiscal Affairs Office as part of their migration to Drupal. In their legacy site, the impact statements were browsable as a list of links to files, and the list was updated by hand whenever an impact statement was added or revised.
I wanted a more elegant solution for the new site: a faceted search over key data within the impact statements, such as Legislative Session, Legislative Body, Bill Number, Bill Status, and Bill Author. This data existed in all of the PDFs; we just needed a way to extract it consistently. spaCy and Tesseract to the rescue!
Tesseract is an open source OCR engine, originally developed at Hewlett-Packard and later sponsored by Google. It can recognize and process text in more than 100 languages out of the box.
spaCy is an open source Python library for natural language processing (NLP) that excels at extracting information from unstructured content.
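To give a feel for the kind of extraction involved, here's a minimal sketch of rule-based matching with spaCy's Matcher. The bill-number pattern and the `extract_bill_numbers` helper are hypothetical simplifications, not the exact rules from the project:

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline is enough for rule-based matching;
# no pretrained model download is required.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Illustrative pattern for bill numbers like "S. 419" or "H. 3456",
# tolerant of the period being split into its own token.
matcher.add("BILL_NUMBER", [
    [{"LOWER": {"IN": ["s", "h", "s.", "h."]}},
     {"TEXT": ".", "OP": "?"},
     {"IS_DIGIT": True}],
])

def extract_bill_numbers(text):
    doc = nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]
```

In the real pipeline, similar rules (plus spaCy's statistical entity recognition) covered the other fields such as session, body, status, and author.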
Our first step was to create an Impact Statement content type that provided the discrete fields we wanted, as well as a media field for storing the actual PDF version of the impact statement.
We then created a migration (using the Migrate API) that creates a new Impact Statement entity for each impact statement and populates its media field with the PDF.
The next step was to figure out how to consistently pull the data from each PDF to populate the relevant fields in the content type during the migration.
To make this easy for other developers on the project to install and use, I wrapped Tesseract and spaCy in a Flask app that exposed the key functionality as a REST API. I then Dockerized the app so anyone on the team could just clone the repo and get started without needing to install all the dependencies locally.
We then created a custom process plugin for our Migrate API implementation that fed each PDF to the REST endpoint and received back JSON containing the field data needed for our content type.
The API would first pass each PDF to Tesseract to extract the text, which was then handed to spaCy to pull out the discrete data needed for the content type. This data was returned to the migrate plugin for validation and cleanup before being stored in the content type.
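As an example of that cleanup step (which lived in the migrate plugin, and is sketched here in Python purely for illustration), OCR output for something like a bill number tends to vary in spacing, punctuation, and leading zeros, so it helps to normalize it before storage. The `normalize_bill_number` helper and its exact rules below are hypothetical:

```python
import re

def normalize_bill_number(raw):
    # Collapse OCR variants like "S . 0419" or "s.419" into a
    # canonical "S. 419"; return None for unrecognizable input.
    m = re.match(r"\s*([SH])\s*\.?\s*0*(\d+)", raw, re.IGNORECASE)
    if not m:
        return None
    return f"{m.group(1).upper()}. {m.group(2)}"
```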
I've put the implementation I used up on GitHub. Feel free to use it as a starting point for your own explorations. Just keep in mind that some parts of the implementation were hard-coded for my particular use case, and a lot of best practices were skipped, since the goal was just an internal tool to experiment with and facilitate the migration.
In the future, this could be taken much further by extracting other relevant information from the documents and building a knowledge graph of relationships and recurring entities such as companies, locations, and people. That would give users even more ways to find and explore documents and data.