How to use pdfplumber to extract the actual text body for academic articles #1117
Replies: 2 comments
-
This is possible, but it will require custom code and logic on your end to suit the particular PDFs you're parsing. PDFs use a wide range of formatting and layouts. For your particular question about two-column layouts, you can find some similar discussions on the topic, e.g. here: #954
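As a rough starting point for the two-column case, here is a minimal sketch. It assumes the gutter sits at the horizontal middle of the page and that each word is a dict with `text`, `x0`, `x1` and `top` keys, as returned by pdfplumber's `page.extract_words()`. The helper names (`split_columns`, `words_to_lines`, `two_column_text`, `extract_pdf_two_column`) and the tolerance values are made up for illustration and will likely need tuning per publisher.

```python
def split_columns(words, page_width, gutter_ratio=0.5):
    """Assign each word to the left or right column by its horizontal midpoint.

    gutter_ratio=0.5 is an assumption: the gutter is taken to be mid-page.
    """
    split_x = page_width * gutter_ratio
    left, right = [], []
    for w in words:
        mid = (w["x0"] + w["x1"]) / 2
        (left if mid < split_x else right).append(w)
    return left, right


def words_to_lines(words, y_tolerance=3):
    """Group words whose 'top' values are within y_tolerance into one line."""
    lines = []
    for w in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(lines[-1][0]["top"] - w["top"]) <= y_tolerance:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]


def two_column_text(words, page_width):
    """Return the left column's lines followed by the right column's lines."""
    left, right = split_columns(words, page_width)
    return "\n".join(words_to_lines(left) + words_to_lines(right))


def extract_pdf_two_column(path):
    """Apply the column split to every page of a PDF (requires pdfplumber)."""
    import pdfplumber  # imported here so the helpers above work without it

    with pdfplumber.open(path) as pdf:
        return "\n\n".join(
            two_column_text(page.extract_words(), page.width)
            for page in pdf.pages
        )
```

This does nothing about full-width titles, abstracts, captions or references; those would still need per-layout rules, for example cropping known regions with `page.crop(...)` before extracting words.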
-
I have done a lot of this and am committed to pdfplumber, which is great. Simple answer: there is no simple answer. The most promising route is a per-publisher approach, through machine learning, heuristics (templates), or both. But note there is no general algorithmic answer. Many fields in an article are not formally defined, so we have to guess at the semantics. The example in #954 is a good one. It's almost certainly from an Elsevier article; they are generally consistent across most of their titles (but Cell Press and others are different). Example: in the example given there are some words
Most PDF readers will assume that "A R T I C L E I N F O" (note the spaces) is 11 separate letters. It's only our brains and experience that interpret it as 2 words. The only workable approach is to scan a large body of documents from the same source and build software that can parse it most of the time. I see the following methods:
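To illustrate the letter-spaced heading problem described above (this is only a sketch of one possible heuristic, not one of the methods referred to): runs of three or more single capital letters separated by single spaces can be collapsed with a regular expression. The function name `collapse_letter_spacing` and the "three or more letters" threshold are assumptions chosen for the example.

```python
import re

# A run of at least three single capital letters separated by single spaces.
_SPACED_CAPS = re.compile(r"\b(?:[A-Z] ){2,}[A-Z]\b")


def collapse_letter_spacing(text):
    """Remove the spaces inside runs of letter-spaced capitals."""
    return _SPACED_CAPS.sub(lambda m: m.group(0).replace(" ", ""), text)
```

Note the limitation: if the extracted text has the same single space between letters and between words, the two words merge into one token ("ARTICLEINFO"). Recovering the word boundary would need extra information, such as the actual gap widths between characters or a per-publisher list of known headings.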
-
We are very interested in extracting the text body of academic articles via pdfplumber.
The articles can have one or two columns, and there are some parts we want to exclude, such as abstracts, image captions and references.
Is that possible, and if so, how?