Microsoft announced almost a year ago an update to its Microsoft Office suite, which in the case of Excel resulted in the addition of a new function that until then had been one of the most demanded by your users: the option to extract data directly from PDF documents, and import them into Excel documents.
The demand for this particular feature was due to the fact that the PDF format is the most used by companies and public institutions to disseminate reports, studies and press releases, which in many cases contain relevant data arranged in tables ...
…However, resorting to simple copy and paste in these cases is usually useless, since the format and order of the data in these tables is not preserved when pasting it into another document.
We needed a tool capable of interpreting and managing said data as tables and - although professional tools such as Tabula already existed - be able to perform this task directly from our spreadsheet it was a big step for user comfort.
So let's go over how we can accomplish this in just a few steps.
Steps to follow
We open a blank Microsoft Excel document, and click on the 'Data' tab of the window. After that a new toolbar will appear, the first icon of which opens a drop-down menu. We just have to click it and follow this route: 'Get data'> 'From a file'> 'From PDF'.
The typical Explorer window will open from where we will select the PDF that we want to use as the source of the data. Once that is done, a message will be displayed stating that 'connection is being established' with the PDF in question and analyzing the data. The duration of this last process may vary depending on the size of the PDF and the complexity of the data it houses.
In any case, once this step is completed, a window very similar to the following will appear:
This window shows in its sidebar the possible data sources of the document and, on the right, a preview of them. There will be, at least, as many data sources as different tables it detects and —as in the case of the example— it may happen that different interpretations of the same data are shown as different sources.
We choose the one that shows the data we are looking for in the most exact way possible (it will almost always require subsequent readjustments, as we will see) and click on 'Transform data'.
Then, the chosen data will be loaded in a new window of the Power Query Editor, which will allow us an advanced handling of them with respect to the usual Excel options:
We can, if we deem it appropriate, modify the table from this editor (in the following image, for example, we have deleted the two columns of null data). In any case, the important thing is to know the next step: click on 'Close and save' to send all this data to the main Excel window:
Once that is done, we will see the data - already formatted - in our lifelong interface to continue working with it. How can we check, certain aspects (like the extracted headers) still require a little readjustment.
If we compare it with the original, we can see that the data extraction and formatting are not perfect, but in any case There is no doubt that this new feature will save us a lot of time compared to the options available to us only a year ago: