Cool Tools: How to pull data from a PDF for use in Excel

This post is reposted from sites.gsu.edu/data_viz

Have you ever wanted to pull all of the raw data out of, let’s say, a PDF of government data or a journal article? It’s the worst! You can’t just copy and paste from it, so first you try to contact the author, but invariably you never hear back. Second, you try to type the data in by hand, but if it’s a worthwhile amount of data, by the time you are halfway through, you realize that none of your rows match up to what they are supposed to be, or you have carpal tunnel so bad that you are forced to bathe your delicate hands in your kitchen ice maker to reduce the swelling. After these two abominable efforts, you finally decide that you are willing to pay whoever or whatever to get some sort of tool to extract the data.

And this is where you suddenly wonder if the tool you are about to install is a real app, or just a devious technique to take your credit card information, and probably the deed to your house as well.

Have no fear: Tabula is here to help you in your moment of need, without all of those pesky viruses.

Working with Tabula is simple. From the Tabula website, here’s how it works:

  1. Upload a file with tables you would like to copy.
  2. Draw a box around the area of the table you would like to copy. (Note: currently, Tabula can’t select a table that runs across multiple pages.)
  3. You will be given the option to copy the table to your clipboard as CSV (comma-separated values), or to download it as a CSV or TSV (tab-separated values) file. If you notice any errors in the table, you can edit the selected text before copying or downloading.
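
Because of the note in step 2, a table that spills over several pages has to be selected and exported one page at a time. Here is a minimal Python sketch of one way to stitch those per-page CSV exports back together before opening the result in Excel. It is not part of Tabula: the function names and the sample data are made up for illustration, and it assumes every page repeats the same header row.

```python
import csv
import io


def merge_csv_pages(pages):
    """Combine CSV text exported page by page into one list of rows.

    The header row is taken from the first non-empty page. If a later
    page starts with that same header, the repeat is dropped.
    """
    merged = []
    header = None
    for text in pages:
        # Parse the page and skip completely empty lines.
        rows = [row for row in csv.reader(io.StringIO(text)) if row]
        if not rows:
            continue
        if header is None:
            header = rows[0]
            merged.append(header)
            rows = rows[1:]
        elif rows[0] == header:
            rows = rows[1:]
        merged.extend(rows)
    return merged


def rows_to_csv(rows):
    """Turn the merged rows back into CSV text."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


# Made-up exports of one table that was split across two pages.
page1 = "Item,Count\nalpha,1\nbeta,2\n"
page2 = "Item,Count\ngamma,3\n"

combined = merge_csv_pages([page1, page2])
print(rows_to_csv(combined))
```

In practice you would read each downloaded file into a string, pass the strings to `merge_csv_pages` in page order, and save the output of `rows_to_csv` as a single `.csv` file for Excel.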

Wow! All of that comes with a free tool for Windows, Mac, or, for our favorite penguin-lovers out there, Linux. If you haven’t checked it out yet, give it a try and let us know how useful you find it!

Posted in Cool Tools
