How to use pdfplumber to extract the actual text body for academic articles #1117

junoriosity · 2024-04-04T22:25:05Z

junoriosity
Apr 4, 2024

We are very interested in extracting the text body of academic articles via pdfplumber.

The articles can have one or two columns, and there are parts we do not want to include, such as abstracts, image captions, or references.

Is that possible and if so, how?

jsvine · 2024-04-05T14:31:45Z

jsvine
Apr 5, 2024
Maintainer

This is possible, but will require custom coding/logic on your end to suit the particular PDFs you're parsing. PDFs use a wide range of formatting and layouts. For your particular question about two-column layouts, you can find some similar discussions on the topic, e.g. here: #954
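
As a rough illustration of the kind of custom logic a two-column page needs, here is a minimal sketch. It assumes the columns are split at the horizontal midpoint of the page, which will not hold for every layout. The helper names (`column_lines`, `two_column_text`, `extract_two_column_page`) and the `line_tolerance` value are made up for this example; only `pdfplumber.open`, `page.extract_words()` and `page.width` come from pdfplumber itself.

```python
def column_lines(words, line_tolerance=3.0):
    """Group the words of ONE column into text lines, top to bottom."""
    lines = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        # Words whose tops are within the tolerance are treated as one line.
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]


def two_column_text(words, page_width, line_tolerance=3.0):
    """Read the left column first, then the right column.

    Assumption: the column gutter sits at the middle of the page.
    """
    midpoint = page_width / 2
    left = [w for w in words if (w["x0"] + w["x1"]) / 2 < midpoint]
    right = [w for w in words if (w["x0"] + w["x1"]) / 2 >= midpoint]
    return "\n".join(
        column_lines(left, line_tolerance) + column_lines(right, line_tolerance)
    )


def extract_two_column_page(pdf_path, page_number=0):
    """Hypothetical wrapper that feeds a real page into the helpers above."""
    import pdfplumber  # imported here so the helpers above run without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        return two_column_text(page.extract_words(), page.width)
```

Full-width elements such as titles or wide figures straddle the midpoint and would need separate handling, for example by cropping them out before splitting.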

petermr · 2024-05-21T16:30:56Z

petermr
May 21, 2024

I have done a lot of this and am committed to PDFPlumber, which is great.

Simple answer: there is no simple answer. The most promising approach is per-publisher, through machine learning, heuristics (templates), or both. But note there is no general algorithmic answer. Many fields in an article are not formally defined and we have to guess at the semantics. The PDF discussed in #954 above is a good illustration. It's almost certainly from an Elsevier article. Elsevier is generally consistent across most of its titles (but CellPress and others are different).

Example: the PDF in that discussion contains these words:

A R T I C L E  I N F O 
____________________________
Keywords:
UAV 
Rice panicle
Semantic segmentation 
...
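
A heading like the first line of this sample usually comes out of extraction as single letters. One possible heuristic, sketched below, is to glue together runs of three or more single letters separated by exactly one space. The function name `collapse_letter_spacing` is hypothetical, and the approach assumes the gap between words is wider than one space in the extracted text; if the extractor emits single spaces everywhere, the two words merge into one.

```python
import re

# Three or more single letters, each separated by exactly one space.
_SPACED_RUN = re.compile(r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b")


def collapse_letter_spacing(line):
    """Join letter-spaced runs such as 'I N F O' into 'INFO' (heuristic)."""
    joined = _SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), line)
    # Squeeze the wider gaps that separated the original words.
    return re.sub(r" {2,}", " ", joined).strip()


for raw in ["A R T I C L E  I N F O ", "Keywords:", "Rice panicle"]:
    print(repr(collapse_letter_spacing(raw)))
```

Ordinary lines pass through unchanged, but genuine sequences of single letters (for example panel labels "A B C") would be merged too, so this is only a starting point for a per-publisher rule.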

Most PDF readers will treat "A R T I C L E  I N F O" (note the spaces) as 11 separate letters. It is only our brains and experience that interpret it as 2 words. The only approach is to scan a large body of documents from the same source and build software that can parse them most of the time. I see the following methods:

- Use the ground truth (e.g. the XML or HTML used to create the file). This is rarely available, but often the output HTML is available (e.g. from EuropePMC or OpenAlex), and it may be possible to use it to help generate rules.
- Use machine learning to create (say) HTML. This requires a LOT of humans to validate the output, plus the mindless task of retraining the algorithm.
- Generate rules/templates. This is what I am working on. For example, it should be possible to create per-publisher (or per-journal) templates. These contain:
  - size of headers and footers (PDFPlumber gives coords)
  - font size and style for different regions (PDFPlumber gives styles and fonts - very powerful)
  - indents and outdents for lists and bullet characters

It requires dedication and won't give rise to high-impact-factor publications. BUT if you are interested in helping, then I already have a framework. I'm currently working on UNFCCC climate PDFs because we need better transmission of climate knowledge.
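
To make the template idea concrete, here is a minimal sketch of what a per-publisher template could look like in code. It is not the framework mentioned above. `PublisherTemplate`, `body_words`, `extract_body_words` and all numeric values are hypothetical; the only pdfplumber features assumed are `extract_words(extra_attrs=["size"])`, `page.height`, and the `top`/`bottom` coordinates measured from the top of the page.

```python
from dataclasses import dataclass


@dataclass
class PublisherTemplate:
    """Hypothetical per-publisher layout description (values in PDF points)."""
    header_height: float   # strip at the top of the page to ignore
    footer_height: float   # strip at the bottom of the page to ignore
    body_font_size: float  # font size used for the running text
    size_tolerance: float = 0.5


def body_words(words, page_height, template):
    """Keep words outside header/footer whose font size matches body text."""
    kept = []
    for word in words:
        in_header = word["top"] < template.header_height
        in_footer = word["bottom"] > page_height - template.footer_height
        is_body_size = (
            abs(word["size"] - template.body_font_size) <= template.size_tolerance
        )
        if not in_header and not in_footer and is_body_size:
            kept.append(word)
    return kept


def extract_body_words(pdf_path, template):
    """Hypothetical wrapper applying the template to every page of a PDF."""
    import pdfplumber  # imported here so body_words() runs without it

    result = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            words = page.extract_words(extra_attrs=["size"])
            result.extend(body_words(words, page.height, template))
    return result
```

Captions and references often use a smaller font than the body, so a font-size filter removes many of them, but the right values have to be measured per publisher or per journal.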
