Help Extracting PDF with many small tables #1170

---
Basically you want to chop up the page into each individual table area (with `page.crop`). If you look at each "Bid Ask" header row, one possible definition of the table area could be: from the second vertical line to the header's left, across to the next vertical line on its right, and from the horizontal line just above it down to the next "Bid Ask" header (or the page bottom).

You can search for each "Bid Ask" header area with `page.search`, and then find the nearest top/left/right lines to use as the crop borders:

```python
import pdfplumber
from bisect import bisect_left, bisect_right
from operator import itemgetter
from pdfplumber.utils import cluster_objects

pdf = pdfplumber.open("Downloads/Karbone.Prices.July.8.2022.pdf")
page = pdf.pages[0]

# x-coordinate of each distinct vertical line on the page, left to right
vertical_lines = [
    min(col, key=itemgetter("x0"))["x0"]
    for col in cluster_objects(page.vertical_edges, itemgetter("x0"), tolerance=3)
]

# y-coordinate of each distinct horizontal line on the page, top to bottom
horizontal_lines = [
    min(col, key=itemgetter("top"))["top"]
    for col in cluster_objects(page.horizontal_edges, itemgetter("top"), tolerance=3)
]

# Group the "Bid Ask" headers into columns of tables
cols = cluster_objects(
    page.search("Bid Ask"), itemgetter("x0"), tolerance=3
)

for col_num, col in enumerate(cols):
    left = vertical_lines[
        bisect_left(vertical_lines, col[0]["x0"]) - 2  # second line to left
    ]
    right = vertical_lines[
        bisect_right(vertical_lines, col[0]["x0"]) + 1  # next line to right
    ]
    for row_num, row in enumerate(col):
        top = horizontal_lines[
            bisect_left(horizontal_lines, row["top"]) - 1  # line above
        ]
        try:
            bottom = col[row_num + 1]["top"]
        except IndexError:
            bottom = page.bbox[-1]  # if no "Bid Ask" below us, use page bottom
        crop = page.crop((left, top, right, bottom))
        crop.to_image(200, antialias=True).save(f"crop-{col_num}-{row_num}.png")
```

Some example results:

It's not perfect; one result from the final column doesn't catch the top header cleanly. Perhaps a more robust approach is to use the position of the text on the line directly above each "Bid Ask" section instead of the horizontal lines. Some extra work is also needed if you need to group the tables by subsection markers, e.g. PJM, NEPOOL; a rough sketch of that is below.
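For that grouping, one possibility is to search for the section labels and attach each crop to the nearest label above it in the same column. A minimal sketch, assuming the labels are literal strings like "PJM" and "NEPOOL" (hypothetical; adjust to whatever markers the PDF actually uses):

```python
# Sketch only: "PJM" and "NEPOOL" are assumed literal section labels
markers = sorted(
    (m for name in ("PJM", "NEPOOL") for m in page.search(name)),
    key=itemgetter("top"),
)

def section_for(left, right, top):
    # Nearest marker above the crop that overlaps it horizontally
    above = [
        m for m in markers
        if m["top"] < top and m["x0"] < right and m["x1"] > left
    ]
    return above[-1]["text"] if above else None
```

Each crop's `left`/`right`/`top` from the loop above can then be passed to `section_for` to label the table.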

---
I think this could perhaps be a more "accurate" approach than #1170 (comment). Instead of using horizontal lines, it first finds the characters that are "in line" with each "Bid" column. It then finds the nearest word *above* the "Bid", i.e. the table name:

```python
import pdfplumber
from bisect import bisect_left, bisect_right
from operator import itemgetter
from pdfplumber.utils import cluster_objects
from pdfplumber.utils.text import WordExtractor

pdf = pdfplumber.open("Downloads/Karbone.Prices.July.8.2022.pdf")
page = pdf.pages[0]

bid_ask = page.search(r"Bid\s+Ask")

# Group tables into "columns", which makes it easier to locate nearest objects
columns = cluster_objects(bid_ask, itemgetter("x0"), tolerance=3)
tables = [[] for _ in columns]

# Map each char's matrix coords back to the word that contains it
page_words = (
    WordExtractor(keep_blank_chars=True, use_text_flow=True)
    .iter_extract_tuples(page.chars)
)
page_chars = {}
for word, word_chars in page_words:
    for char in word_chars:
        page_chars[char["matrix"]] = dict(word=word, chars=word_chars)

vertical_lines = [
    min(col, key=itemgetter("x0"))["x0"]
    for col in cluster_objects(page.vertical_edges, itemgetter("x0"), tolerance=3)
]

for col_num, col in enumerate(columns):
    left = vertical_lines[
        bisect_left(vertical_lines, col[0]["x0"]) - 2  # second line to left
    ]
    right = vertical_lines[
        bisect_right(vertical_lines, col[0]["x0"]) + 1  # next line to right
    ]
    col = sorted(col, key=itemgetter("top"))
    for row_num, bid in enumerate(col):
        # The cluster of chars vertically "in line" with this "Bid"
        cluster = next(
            cluster for cluster in
            cluster_objects(page.chars + [bid], itemgetter("x0"), tolerance=1)
            if bid in cluster
        )
        words_in_cluster = {}
        for obj in cluster:
            if obj is not bid and obj["matrix"] in page_chars:
                # A word can contain multiple matching chars, so key on the
                # matrix coords of the word's first char to de-duplicate
                word = page_chars[obj["matrix"]]["word"]
                matrix = page_chars[obj["matrix"]]["chars"][0]["matrix"]
                words_in_cluster[matrix] = word
        # Sort by vertical position so bisect searches sorted data
        words_in_cluster = sorted(words_in_cluster.values(), key=itemgetter("bottom"))
        # Nearest word ending above the "Bid" (bisect's key= needs Python 3.10+)
        header = words_in_cluster[
            bisect_left(words_in_cluster, bid["top"], key=itemgetter("bottom")) - 1
        ]
        tables[col_num].append(
            dict(bid=bid, header=header, left=left, top=bid["bottom"], right=right)
        )

for col_num, col in enumerate(tables):
    # The bottom of each table is the top of the next table's header
    for idx in range(len(col) - 1):
        col[idx]["bottom"] = col[idx + 1]["header"]["top"]
    col[-1]["bottom"] = page.bbox[-1]  # last table: use page bottom
    for row_num, table in enumerate(col):
        name = table["header"]["text"]
        crop = page.crop((table["left"], table["top"], table["right"], table["bottom"]))
        print(f"{name=}")
        # crop.extract_table()
        # crop.to_image(200, antialias=True).save(f"crop-{col_num}-{row_num}.png")
```
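From there, pulling the data out is just a matter of calling `extract_table` on each crop. A minimal sketch of that last step (the text-based `table_settings` are an assumption; line-based strategies may suit this layout better):

```python
# Sketch: collect each cropped table's rows, keyed by its header text
results = {}
for col in tables:
    for table in col:
        crop = page.crop(
            (table["left"], table["top"], table["right"], table["bottom"])
        )
        # "text" strategies are an assumption, not a tested setting
        results[table["header"]["text"]] = crop.extract_table(
            {"vertical_strategy": "text", "horizontal_strategy": "text"}
        )
```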

---
I am attempting to scrape a PDF containing a bunch of pricing data, and I am having trouble getting pdfplumber to identify each smaller table in a consistent manner. I have tried filtering based on stroke color and using the curves and edges to define explicit lines, which has definitely gotten me closer to the end goal, but the results are still inconsistent.
This is the unmarked pdf:
This is the output of `debug_tablefinder`:
This is my code so far:
```python
import glob
from operator import itemgetter

import pdfplumber

def inside(self, other):
    return all((
        self['x0'] >= other['x0'],
        self['top'] >= other['top'],
        self['x1'] <= other['x1'],
        self['bottom'] <= other['bottom'],
    ))

def largest_parent_rect(page, self):
    parent_rects = [other for other in page.rects if inside(self, other)]
    if parent_rects:
        parent_rect = max(parent_rects, key=itemgetter('width', 'height'))
        if self != parent_rect:
            return parent_rect

def remove_nested_rects(page, keep_largest=False):
    def filter_condition(other):
        if other['object_type'] == 'rect':
            return tuple(other['pts']) not in rects_to_remove
        return True

def keep_visible_lines(obj):
    if obj['object_type'] == 'rect':
        return obj['non_stroking_color'] == [1]
    return True

frames = []
# Raw string so "\U" in the Windows path isn't treated as an escape
for file in glob.glob(r"C:\Users\jfitzpatrick\Desktop\Fix Karbone\*.pdf"):
    print(file)
    start_date_formatted = ""
    end_date_formatted = ""
    insert = False
    tables = []
    with pdfplumber.open(file, repair=False) as pdf:
        ...
```
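Roughly, I am wiring these helpers together like this (a simplified sketch, not my exact settings; the explicit-lines `table_settings` here are approximate):

```python
# Simplified sketch: filter out invisible rects, then feed the remaining
# curves/edges to the table finder as explicit lines
with pdfplumber.open(file, repair=False) as pdf:
    page = pdf.pages[0]
    filtered = page.filter(keep_visible_lines)
    table_settings = {
        "vertical_strategy": "explicit",
        "horizontal_strategy": "explicit",
        "explicit_vertical_lines": filtered.curves + filtered.edges,
        "explicit_horizontal_lines": filtered.curves + filtered.edges,
    }
    found = filtered.extract_tables(table_settings)
```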
Here is the pdf:
Karbone Prices July 8, 2022.pdf
Any help is greatly appreciated. Thank you for such an awesome package; I have been using it for many other projects.