Parser Module

parser.py extracts text and tables from PDF files using pdfplumber.

Key Functions

parse_pdf(pdf_path) → Returns a list of page dictionaries, one per page, each with the keys page_num, text, and tables.
_clean_text(text) → Normalizes whitespace and removes page numbers and other extraction artifacts.
pages_to_full_text(pages) → Flattens all pages into a single string, including table contents.