Ramsay Lewis

How To Extract Table Data from a PDF or Image

Updated: Oct 9


They say that data cleaning is the most time-intensive part of a data scientist’s life.

Can confirm.

Recently, while working on a project for Dr. Keyne Law of the CRISIS Lab at Seattle Pacific University, I found myself needing to extract tables of 911 call data from several .pdf files. It should have been easy: the pdfs were clearly generated from the output of an SQL query, so the data was already nicely formatted in tables.

Unfortunately, it wasn’t possible to simply get the output as a .csv file or an Excel file (because… like… the government). So, instead, I spent some time testing and (in)validating several methods for extracting tables from .pdfs.

Here are the methods I tried, the one that worked, and the ones I learned about after. I’m sharing them all with you in case it’s helpful in your own journey extracting data from images or pdfs.


The best app I found for extracting table data from a PDF or image: Nanonets

Ultimately, what worked was Nanonets.

Nanonets is a SaaS platform that’s designed to automate data entry. It promises AI and machine learning algorithms to help you more efficiently understand what data you need from your pdf and optimally extract it. I didn’t need anything AI-ey, and my set of pdfs was too small to test whether it really got better at extracting over time.

But it did do what I needed: take a table in a .pdf and extract it into a spreadsheet.

What was good

  • It worked: it took a pdf table and turned it into Excel data. Whoo!

  • It had a free trial of 100 pages, so I didn’t have to pay for it before testing it.

Limitations and challenges

  • It took a really long time for the pdfs to process—about 20 minutes for a file with 21 pages.

  • It wasn’t super easy to specify what data I needed or the rules for extracting it.


Screenshot of the Nanonets app, by the author.

Other notes

  • It seems powerful, although I just needed some basic functionality.

  • It lets you “verify” each piece of data. I didn’t use this feature and it seems to undermine the “automation” part of the data entry, but perhaps it’s useful to some people.

The result was a spreadsheet with my data in it. Bless. I did have to clean the data, but it solved my problem.



Here are the apps that I tested that didn’t work for me.

Methods that didn’t work #1: Adobe Acrobat Pro

The Pro version of Adobe Acrobat lets you export a .pdf file to other kinds of files, including Excel files. This was the first option that I tried.

What was good

  • There was a free trial so I could find out that it didn’t work before paying (it’s expensive).

  • It did export my pdf to an Excel file.

What went wrong

  • The resulting file was a mess. Rows were merged together at random throughout the table, and sometimes columns were too. It was unusable.

Methods that didn’t work #2: Docparser

Next, I tried Docparser. I was aware of it because I followed its founder, Moritz Dausinger, on Twitter. It was promising because it offered to “Convert Any Document to Data in Seconds!”

What was good

  • It was easy to set up extraction rules. You just dragged around the lines for where the table data was and it used those to pull out your data.

  • It was quick — much quicker than Nanonets.

  • There was a free trial so I could test it.

What went wrong

  • The resulting file was messy, although not terrible. It took quite a bit of time to clean up. I played around a bit with the rules to try to improve the output but it still resulted in messy data.

  • It was unreliable. I realized after cleaning the data for quite a while that it had failed to extract about 400 of the 1,700 rows of data I had (it skipped nearly a quarter of the data). Worse, there wasn’t an obvious pattern for which data it skipped: sometimes it skipped a whole page, but other times it skipped just a few rows. So I couldn’t trust the results, and I couldn’t easily fix them by re-extracting a couple of pages.
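
A quick completeness check before cleaning might catch this kind of silent data loss sooner. Here is a minimal sketch, assuming the extracted data is saved as CSV with a column recording the source page (called `page` here, a hypothetical name) and that you have counted the expected rows per page in the original pdf by hand. The sample data is made up for illustration.

```python
import csv
import io
from collections import Counter


def count_rows_per_page(csv_text, page_column="page"):
    """Count extracted data rows for each source page."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(int(row[page_column]) for row in reader)


def find_short_pages(per_page, expected_per_page):
    """Return {page: number_of_missing_rows} for pages with too few rows."""
    return {
        page: expected - per_page.get(page, 0)
        for page, expected in expected_per_page.items()
        if per_page.get(page, 0) < expected
    }


# Made-up sample: page 2 lost one row and page 3 was skipped entirely.
sample = "page,call_id\n1,A\n1,B\n2,C\n"
expected = {1: 2, 2: 2, 3: 2}  # hand-counted from the pdf (hypothetical)

per_page = count_rows_per_page(sample)
short = find_short_pages(per_page, expected)
print(short)
```

Any page that shows up in the result is one to re-extract or enter by hand before spending time on cleaning.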

Methods that didn’t work #3: Docsumo

After Docparser, I spent some time Googling and found some other apps. The first one I came across was Docsumo. It seemed promising: it claimed that it could “help you convert unstructured documents such as pay stubs, invoices and bank statements to actionable data” and that it “Works with documents in any format with minimal setup.”

What was good

  • It was easy to set up extraction rules.

  • I did a test with my smallest .pdf and it worked really well for the 23 or so pieces of data in that table.

What went wrong

  • When I tried to use it for my larger .pdfs (13 and 21 pages, respectively), it stopped working. It gave me blank Excel files. I have no idea why it did this with the second and third files, but not with the first.

Methods that didn’t work #4: Rossum

Next, I tried Rossum. It was promising: it billed itself as “The World’s Easiest and Most Accurate OCR system”. When you set up your account, it becomes obvious it’s built for extracting data from invoices, but I figured I’d see if it still worked for my needs.

What was good

  • It has a sleek design.

  • It had a free trial so I could test it.

What went wrong

  • The extraction rules are hard to set up. It’s not very intuitive to use.

  • It didn’t actually extract any data. When I finally went to download my data, it gave me a blank Excel file. Not useful.

Methods that didn’t work #5: PDFTron

Next, I tried PDFTron.

What was good

  • It had a free trial so I could test it.

What went wrong

  • It didn’t actually extract any data; it just gave me blank Excel files.

Methods that didn’t work #6: PDFTables

Next, I tried PDFTables.

What was good

  • It had a free trial so I could test it.

What went wrong

  • It didn’t actually extract any data; it just gave me blank Excel files.



Methods I didn’t try

There are several methods I didn’t try because I found them (or they were suggested to me) after I solved my problem.

Scraping Table Data From PDF Files Using the Tabula Python Library

The first is using the tabula-py library to scrape data from a table in a .pdf. It looks really useful, at least for cleanly structured tables with borders between cells. Unfortunately, I found this after I had finished, so I didn’t test it. Thanks to Satya Ganesh for sharing on Medium.
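
For reference, here is a rough sketch of what that might look like. It has not been run against the 911 call pdfs; it assumes tabula-py and the Java runtime it depends on are installed, and the file names in the usage comments are placeholders. The `lattice=True` option asks tabula to use the ruled lines between cells, which fits the bordered tables described above.

```python
def extract_with_tabula(pdf_path):
    """Return a list of pandas DataFrames, one per table tabula detects."""
    # Imported lazily so this file loads even without tabula-py (and Java).
    import tabula

    # pages="all" scans every page; lattice=True uses cell borders to find tables.
    return tabula.read_pdf(pdf_path, pages="all", lattice=True)


def rows_per_table(tables):
    """Row count of each extracted table, for checking against the pdf."""
    return [len(table) for table in tables]


# Usage (placeholder file names):
# tables = extract_with_tabula("calls.pdf")
# print(rows_per_table(tables))
# tables[0].to_csv("calls_table_1.csv", index=False)
```

Comparing the row counts against the original pdf is worth doing with any extraction tool, since a table that looks fine can still be missing rows.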

The Camelot Python Library

Another that I found after the fact is the Camelot library. The Camelot documentation has instructions for using it.
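
As with tabula-py, this is an untested sketch rather than something I ran on my own files. It assumes Camelot is installed, the file names are placeholders, and the 90% accuracy threshold is an arbitrary choice for illustration. Camelot attaches a parsing report to each table it finds, which could help flag pages that need a second look.

```python
def extract_with_camelot(pdf_path):
    """Return the list of tables Camelot finds in the pdf."""
    # Imported lazily so this file loads even without Camelot installed.
    import camelot

    # flavor="lattice" targets tables with ruled borders between cells.
    return camelot.read_pdf(pdf_path, pages="all", flavor="lattice")


def low_accuracy_pages(reports, min_accuracy=90.0):
    """Pages whose parsing report falls below the chosen accuracy threshold."""
    return sorted({r["page"] for r in reports if r["accuracy"] < min_accuracy})


# Usage (placeholder file names):
# tables = extract_with_camelot("calls.pdf")
# print(low_accuracy_pages([t.parsing_report for t in tables]))
# tables[0].df.to_csv("calls_table_1.csv", index=False)
```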


Conclusion: Nanonets was the best app for extracting tables from my set of .pdfs

I tried several apps to extract tables from .pdf files, but many of them didn’t work very well. The best one I found was Nanonets because it actually extracted the data I needed and it was pretty good about formatting the data cleanly.

It did still take me some time to clean after, but it was the best option I found. The other apps I list here might work for some circumstances, but they didn’t work for me at the time that I tried them.

Have another strategy for getting table data out of a .pdf? I’d love to hear it—feel free to share.



Looking for help extracting or cleaning your data? I do that. Get in touch to let me know how I can help.
