Table in PDF with checkboxes to Excel #1192
Replies: 3 comments 1 reply
-
Thanks for the kind words! For this particular PDF, this seems to work (simplified to focus just on the checkbox logic and less on the specific Excel flow):

    import pdfplumber

    TARGET_SIDE_LENGTH = 12.5

    def has_checkbox_dimensions(rect):
        return (
            abs(rect["height"] - TARGET_SIDE_LENGTH) < 2
            and abs(rect["width"] - TARGET_SIDE_LENGTH) < 2
        )

    def get_checkboxes(page):
        def process_checkbox(checkbox):
            bbox = pdfplumber.utils.obj_to_bbox(checkbox)
            cropped = page.crop(bbox).lines
            checkbox_lines = list(filter(has_checkbox_dimensions, cropped))
            is_checked = int(bool(checkbox_lines))
            return dict(
                bbox=bbox,
                is_checked=is_checked
            )

        checkbox_rects = filter(has_checkbox_dimensions, page.rects)
        return list(map(process_checkbox, checkbox_rects))

    pdf = pdfplumber.open("../pdfs/17364-725846.pdf")
    checkboxes = get_checkboxes(pdf.pages[0])
    print(checkboxes)

... returns:

    [{'bbox': (78.775, 163.29999999999995, 91.05000000000001, 175.54999999999995),
      'is_checked': 1},
     {'bbox': (78.775, 178.04999999999995, 91.05000000000001, 190.29999999999995),
      'is_checked': 0}]
And using the visual debugging tools:

    im = pdf.pages[0].to_image()
    for c in checkboxes:
        fill = (0, 255, 0, 120) if c["is_checked"] else (255, 0, 0, 120)
        im.draw_rect(c["bbox"], fill=fill)
    im

... displays the page with checked boxes shaded green and unchecked boxes shaded red.
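For anyone following along without the PDF, here is a minimal sketch of the same detection rule run on synthetic objects. It assumes (this is not stated above) that the check mark is drawn as lines whose bounding box roughly fills the checkbox, which would explain why the same size test is applied to the lines inside each box. `make_obj` and `is_inside` are hypothetical helpers, and `is_inside` is only a simplified stand-in for `page.crop(bbox)`, which also clips objects that partially overlap the box.

```python
TARGET_SIDE_LENGTH = 12.5
TOLERANCE = 2


def has_checkbox_dimensions(obj):
    # Same size test as above: width and height both close to the target side length.
    return (
        abs(obj["height"] - TARGET_SIDE_LENGTH) < TOLERANCE
        and abs(obj["width"] - TARGET_SIDE_LENGTH) < TOLERANCE
    )


def make_obj(x0, top, x1, bottom):
    # Hypothetical helper: builds a dict with the keys this sketch relies on.
    return {
        "x0": x0, "top": top, "x1": x1, "bottom": bottom,
        "width": x1 - x0, "height": bottom - top,
    }


def is_inside(obj, bbox):
    # Simplified stand-in for page.crop(bbox): keep only fully contained objects.
    x0, top, x1, bottom = bbox
    return (
        obj["x0"] >= x0 and obj["x1"] <= x1
        and obj["top"] >= top and obj["bottom"] <= bottom
    )


# Synthetic page content: two boxes, one diagonal stroke, one long table rule.
rects = [
    make_obj(78.775, 163.3, 91.05, 175.55),
    make_obj(78.775, 178.05, 91.05, 190.3),
]
lines = [
    make_obj(78.775, 163.3, 91.05, 175.55),  # diagonal stroke filling the first box
    make_obj(70.0, 160.0, 300.0, 160.0),     # horizontal table rule, not a check mark
]

flags = []
for rect in filter(has_checkbox_dimensions, rects):
    bbox = (rect["x0"], rect["top"], rect["x1"], rect["bottom"])
    marks = [
        ln for ln in lines
        if is_inside(ln, bbox) and has_checkbox_dimensions(ln)
    ]
    flags.append(int(bool(marks)))

print(flags)
```

If a PDF draws its check marks differently (for example as a glyph or a curve), the line-based test would presumably need to be swapped for a check on `chars` or `curves` instead.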
-
I have a follow-up question: How do I extract and map the checkbox information to a data frame? The table extraction works; it's mapping the

Here's my working code (for the last page of the PDF I shared earlier):

In my

Thank you for your help so far!
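One possible way to approach the mapping asked about here, sketched under assumptions since the poster's code is not shown: match each checkbox to the table row whose vertical span contains the checkbox's vertical center. The row labels and row bounding boxes below are made up for illustration (in practice they might come from something like `page.find_tables()`), and `attach_checkboxes` is a hypothetical helper. The checkbox dicts have the same shape as the `get_checkboxes` output earlier in the thread.

```python
def vertical_center(bbox):
    # bbox is (x0, top, x1, bottom), the ordering pdfplumber uses.
    _, top, _, bottom = bbox
    return (top + bottom) / 2


def attach_checkboxes(rows, checkboxes):
    # For each row, take the first checkbox whose center falls inside the
    # row's vertical span; rows without a checkbox get None.
    records = []
    for row in rows:
        _, row_top, _, row_bottom = row["bbox"]
        match = next(
            (
                c["is_checked"]
                for c in checkboxes
                if row_top <= vertical_center(c["bbox"]) <= row_bottom
            ),
            None,
        )
        records.append({"label": row["label"], "is_checked": match})
    return records


# Hypothetical rows; labels and coordinates are illustrative only.
rows = [
    {"label": "Option A", "bbox": (70.0, 162.0, 300.0, 177.0)},
    {"label": "Option B", "bbox": (70.0, 177.0, 300.0, 192.0)},
    {"label": "Notes", "bbox": (70.0, 192.0, 300.0, 207.0)},
]
checkboxes = [
    {"bbox": (78.775, 163.3, 91.05, 175.55), "is_checked": 1},
    {"bbox": (78.775, 178.05, 91.05, 190.3), "is_checked": 0},
]

records = attach_checkboxes(rows, checkboxes)
for record in records:
    print(record)
```

A list of dicts like `records` can be passed straight to `pandas.DataFrame(records)`. Rows with several checkboxes (one per column) would need a second match on the horizontal position as well.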
-
I found a solution by first cleaning the extracted nested lists for each table before mapping the

Thank you for all your help.
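The actual cleaning code is not shown, so the following is only a guess at what such a step could look like, not the poster's solution. Table extraction in pdfplumber returns nested lists in which cells can be `None` or contain embedded newlines; this sketch normalizes those and drops rows that end up completely empty. `clean_cell` and `clean_table` are hypothetical names, and the sample table is invented.

```python
def clean_cell(cell):
    # Treat missing cells as empty strings and collapse newlines and
    # repeated whitespace into single spaces.
    if cell is None:
        return ""
    return " ".join(str(cell).split())


def clean_table(table):
    # Normalize every cell, then drop rows in which every cell is empty.
    cleaned_rows = [[clean_cell(cell) for cell in row] for row in table]
    return [row for row in cleaned_rows if any(row)]


# Invented sample in the nested-list shape that table extraction returns.
raw_table = [
    ["Item", None, "Selected"],
    [None, None, None],
    ["Option\nA", "", "  "],
]

cleaned = clean_table(raw_table)
print(cleaned)
```

Cleaning first keeps the row count predictable, which makes it easier to line rows up with checkbox positions afterwards.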
-
TL;DR
I want to extract the tables (in the attached PDF) with checkboxes, encode checked boxes as 1s, and unchecked boxes as 0s, and export to Excel.
The long read
PDFPlumber has been instrumental in my workflow. I've been extracting tables from PDFs using PDFPlumber without any issues—except for the tables with checkboxes. I've read through all discussion posts on here about extracting checkboxes, as well as all StackExchange posts on the same topic. My Excel output is usually just the table with its headers/text, but without any 1s or 0s.
Here's what I've been trying so far:
I'd really appreciate your help.
Thank you for creating such an amazing library!
17364-725846.pdf