Table in PDF with checkboxes to Excel #1192

errolovich · 2024-08-26T18:23:16Z


TL;DR
I want to extract the tables (in the attached PDF) with checkboxes, encode checked boxes as 1s, and unchecked boxes as 0s, and export to Excel.

The long read
PDFPlumber has been instrumental in my workflow. I've been extracting tables from PDFs using PDFPlumber without any issues—except for the tables with checkboxes. I've read through all discussion posts on here about extracting checkboxes, as well as all StackExchange posts on the same topic. My Excel output is usually just the table with its headers/text, but without any 1s or 0s.

Here's what I've been trying so far:

import pdfplumber
import pandas as pd
from openpyxl import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows

def filter_checkboxes(rects):
    # Keep small, roughly square rectangles with no fill color.
    return [
        rect for rect in rects
        if 10 < rect["width"] < 15
        and 10 < rect["height"] < 15
        and int(rect["width"]) == int(rect["height"])
        and rect.get("non_stroking_color") is None
    ]

def determine_if_checked(checkbox, page):
    cropped = page.within_bbox((checkbox["x0"], checkbox["top"], checkbox["x1"], checkbox["bottom"]))
    
    num_edges = len(cropped.edges)
    
    diagonal_lines = [line for line in cropped.edges if abs(line["x1"] - line["x0"]) > 2 and abs(line["y1"] - line["y0"]) > 2]
    num_diagonal_lines = len(diagonal_lines)
    
    if num_edges >= 4 and num_diagonal_lines == 0:
        return True
    return False

def process_pdf_for_checkboxes(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        all_data = []
        for page in pdf.pages:
            rects = page.rects
            checkboxes_all = filter_checkboxes(rects)
            checkboxes_checked = [checkbox for checkbox in checkboxes_all if determine_if_checked(checkbox, page)]
            
            checkbox_data = []
            for checkbox in checkboxes_all:
                is_checked = '1' if checkbox in checkboxes_checked else '0'
                checkbox_data.append([is_checked])

            if checkbox_data:
                checkbox_df = pd.DataFrame(checkbox_data, columns=['Checkbox'])
                all_data.append(checkbox_df)
        
        return all_data

def save_to_excel(all_data, output_excel_path):
    workbook = Workbook()
    workbook.remove(workbook.active)
    
    sheet = workbook.create_sheet(title="Checkboxes")
            
    for item in all_data:
        if isinstance(item, pd.DataFrame):
            for row_data in dataframe_to_rows(item, index=False, header=True):
                sheet.append(row_data)
        else:
            sheet.append([item])
            sheet.append([])  # Add empty row for spacing
    
    workbook.save(output_excel_path)

I'd really appreciate your help.

Thank you for creating such an amazing library!

17364-725846.pdf

jsvine · 2024-08-28T13:04:16Z

Maintainer

Thanks for the kind words! For this particular PDF, this seems to work (simplified to focus just on the checkbox logic and less on the specific Excel flow):

import pdfplumber

TARGET_SIDE_LENGTH = 12.5
def has_checkbox_dimensions(rect):
    return (
        abs(rect["height"] - TARGET_SIDE_LENGTH) < 2
        and abs(rect["width"] - TARGET_SIDE_LENGTH) < 2
    )

def get_checkboxes(page):
    def process_checkbox(checkbox):
        bbox = pdfplumber.utils.obj_to_bbox(checkbox)
        cropped = page.crop(bbox).lines
        checkbox_lines = list(filter(has_checkbox_dimensions, cropped))
        is_checked = int(bool(checkbox_lines))
        return dict(
            bbox=bbox,
            is_checked=is_checked
        )
        
    checkbox_rects = filter(has_checkbox_dimensions, page.rects)
    return list(map(process_checkbox, checkbox_rects))

pdf = pdfplumber.open("../pdfs/17364-725846.pdf")
checkboxes = get_checkboxes(pdf.pages[0])
print(checkboxes)

... returns:

[{'bbox': (78.775, 163.29999999999995, 91.05000000000001, 175.54999999999995),
  'is_checked': 1},
 {'bbox': (78.775, 178.04999999999995, 91.05000000000001, 190.29999999999995),
  'is_checked': 0}]
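
To get from a list like this to the 1/0 spreadsheet column asked for in the original post, one option is to sort the checkboxes into reading order and hand the rows to pandas. This is a minimal sketch, not part of the answer above: `checkboxes_to_rows` and `save_rows_to_excel` are hypothetical helper names, and the Excel step assumes pandas and openpyxl are installed.

```python
def checkboxes_to_rows(checkboxes):
    """Flatten get_checkboxes() output into plain rows in reading order."""
    # bbox is (x0, top, x1, bottom); sort top-to-bottom, then left-to-right.
    ordered = sorted(checkboxes, key=lambda c: (c["bbox"][1], c["bbox"][0]))
    return [
        {"x0": c["bbox"][0], "top": c["bbox"][1], "is_checked": int(c["is_checked"])}
        for c in ordered
    ]


def save_rows_to_excel(rows, output_excel_path):
    """Write the rows to an .xlsx file (assumes pandas and openpyxl are available)."""
    import pandas as pd  # imported lazily so the helper above stays dependency-free

    pd.DataFrame(rows).to_excel(output_excel_path, index=False)


# Sample input in the same shape as the output above, deliberately out of order.
sample = [
    {"bbox": (78.775, 178.05, 91.05, 190.3), "is_checked": 0},
    {"bbox": (78.775, 163.3, 91.05, 175.55), "is_checked": 1},
]
rows = checkboxes_to_rows(sample)
print([row["is_checked"] for row in rows])
```

With a real page, `save_rows_to_excel(checkboxes_to_rows(checkboxes), "checkboxes.xlsx")` would write one row per checkbox; the output filename is only an example.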

And using the visual debugging tools:

im = pdf.pages[0].to_image()
for c in checkboxes:
    fill = (0, 255, 0, 120) if c["is_checked"] else (255, 0, 0, 120)
    im.draw_rect(c["bbox"], fill=fill)
im

... displays the page image with checked boxes overlaid in green and unchecked boxes in red.


errolovich Aug 29, 2024
Author

It worked even on different pages in this PDF! Thank you so much for being so kind in helping me.
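
For running the same check over every page, a small wrapper along these lines may help. It is only a sketch: `collect_checkboxes` is a hypothetical name, and the stand-in pages and extractor below exist just so the example runs without a PDF.

```python
def collect_checkboxes(pages, get_checkboxes):
    """Run a per-page extractor over every page, tagging results with a 1-based page number."""
    results = []
    for page_number, page in enumerate(pages, start=1):
        for checkbox in get_checkboxes(page):
            results.append({"page": page_number, **checkbox})
    return results


# Stand-ins for real pages and for the get_checkboxes function from the answer above.
fake_pages = ["page-a", "page-b"]


def fake_extractor(page):
    return [{"bbox": (0, 0, 12, 12), "is_checked": int(page == "page-b")}]


all_checkboxes = collect_checkboxes(fake_pages, fake_extractor)
print(all_checkboxes)
```

With the code from the answer above, the call would be `collect_checkboxes(pdf.pages, get_checkboxes)`.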

errolovich · 2024-09-02T14:51:38Z

Author

I have a follow-up question: How do I extract and map the checkbox information to a data frame? The table extraction works; it's mapping the is_checked part to its corresponding place in a data frame that's giving me issues.

Here's the code I have so far (for the last page of the PDF I shared earlier):

import numpy as np
import pandas as pd
import pdfplumber

def reshape_checkboxes(checkboxes, rows, cols):
    checkbox_matrix = np.full((rows, cols), None, dtype=object)
    for i, checkbox in enumerate(checkboxes):
        row_idx = i // cols
        col_idx = i % cols
        if row_idx < rows and col_idx < cols:
            checkbox_matrix[row_idx][col_idx] = checkbox['is_checked']
    return checkbox_matrix

def map_checkboxes_to_table(table, checkboxes):
    rows = len(table)
    cols = len(table[0])
    checkbox_matrix = reshape_checkboxes(checkboxes, rows, cols)
    
    for i in range(rows):
        for j in range(cols):
            if table[i][j] == '':
                table[i][j] = checkbox_matrix[i][j]
    return table

def clean_and_map_tables(tables, checkboxes, page):
    cleaned_tables = []
    for table in tables:
        # Convert table to DataFrame first
        df = pd.DataFrame(table)
        
        # Drop columns where the data type is not 'object'
        df = df.select_dtypes(include=['object'])
        
        # Convert back to list of lists for mapping
        cleaned_table = df.values.tolist()
        
        # Map checkboxes to the cleaned table
        cleaned_table = map_checkboxes_to_table(cleaned_table, checkboxes)
        
        # Convert back to DataFrame after mapping
        df_mapped = pd.DataFrame(cleaned_table)
        cleaned_tables.append(df_mapped)
    
    return cleaned_tables

with pdfplumber.open(pdf_path) as pdf:
    last_page = pdf.pages[-1]  # Access the last page
    
    # Extract tables and checkboxes
    tables = last_page.extract_tables()
    checkboxes = get_checkboxes(last_page)
    
    # Clean and map tables with checkboxes
    cleaned_tables = clean_and_map_tables(tables, checkboxes, last_page)

    for idx, df in enumerate(cleaned_tables):
        print(f"Mapped Table {idx}:")
        print(df)

In my group_checkboxes_by_table function (not included here), which runs right before the code I just shared, I first group checkboxes whose bbox top and bottom values match into the same row; if those values differ significantly, I treat the checkboxes as belonging to a different table. I then use the code above to map the is_checked values onto the extracted data frame. This is where my problem comes in: the mapping isn't working.
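
One likely cause, judging only from the code shown: `reshape_checkboxes` places checkbox `i` at row `i // cols` and column `i % cols`, which lines up with the table only if every cell holds a checkbox. A table that mixes text columns with checkbox columns breaks that assumption. A position-based alternative is to write each checkbox into the cell whose bounding box contains its center. The sketch below assumes a grid of cell boxes in the same `(x0, top, x1, bottom)` form as the checkbox `bbox` is already available (pdfplumber's `page.find_tables()` exposes cell boxes, though the exact attribute layout may vary by version); `fill_checkbox_cells` and its helpers are hypothetical names.

```python
def bbox_center(bbox):
    """Return the (x, y) center of an (x0, top, x1, bottom) box."""
    x0, top, x1, bottom = bbox
    return ((x0 + x1) / 2, (top + bottom) / 2)


def contains(cell_bbox, point):
    """True if the point lies inside (or on the edge of) the cell box."""
    x0, top, x1, bottom = cell_bbox
    x, y = point
    return x0 <= x <= x1 and top <= y <= bottom


def fill_checkbox_cells(table, cell_bboxes, checkboxes):
    """Return a copy of table with is_checked written into the cell holding each checkbox."""
    filled = [list(row) for row in table]
    for checkbox in checkboxes:
        center = bbox_center(checkbox["bbox"])
        for i, row in enumerate(cell_bboxes):
            for j, cell in enumerate(row):
                # Missing cells (None) are skipped rather than matched.
                if cell is not None and contains(cell, center):
                    filled[i][j] = checkbox["is_checked"]
    return filled


# Made-up 2x3 table: a text column followed by two checkbox columns.
table = [["Item", "Yes", "No"], ["A", "", ""]]
cell_bboxes = [
    [(0, 0, 100, 20), (100, 0, 150, 20), (150, 0, 200, 20)],
    [(0, 20, 100, 40), (100, 20, 150, 40), (150, 20, 200, 40)],
]
checkboxes = [
    {"bbox": (119, 24, 131, 36), "is_checked": 1},
    {"bbox": (169, 24, 181, 36), "is_checked": 0},
]
mapped = fill_checkbox_cells(table, cell_bboxes, checkboxes)
print(mapped)
```

Because the match is geometric, the number and order of checkboxes no longer have to agree with the table's shape, and text cells are left untouched.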

Thank you for your help so far!


errolovich · 2024-09-11T17:21:11Z

Author

I found a solution by first cleaning the extracted nested lists for each table before mapping the is_checked values to their corresponding positions in a data frame.
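
The post doesn't show what the cleaning involved. As one guess at such a step, the sketch below normalizes the nested lists that `extract_tables()` returns: `None` cells become empty strings, whitespace and embedded newlines are collapsed, and fully empty rows are dropped, so a later `== ''` check finds every blank cell. `clean_table` is a hypothetical name, not the author's function.

```python
def clean_table(table):
    """Normalize a nested-list table: None -> '', collapse whitespace, drop empty rows."""
    cleaned = []
    for row in table:
        cells = [
            " ".join(str(cell).split()) if cell is not None else ""
            for cell in row
        ]
        # Keep the row only if at least one cell still has content.
        if any(cells):
            cleaned.append(cells)
    return cleaned


# Made-up input with None cells, a stray newline, and an all-blank row.
raw = [["Item", "Yes", None], ["First\nrow ", "", None], [None, "", " "]]
cleaned = clean_table(raw)
print(cleaned)
```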

Thank you for all your help.

