get_summaries.py
I wrote this lengthy post, the third in my metadata workflow series, in org-mode, which helpfully generates a table of contents on exporting to markdown. I hope it’s handy for navigation.
I also hope to convert this file to ipynb format so interested folks can try the code out on their own. The code is currently published on GitHub.
Table of Contents

- Environment Setup
  - The Shebang Line
  - Necessary Libraries
- Constants
  - (No) Magic Numbers
  - The Month Dictionary
  - Big, Ugly Regular Expressions
    - The Summary Header Pattern
    - The Date Pattern
- Functions
  - Argument Parsing
  - Naming the CSV data file
  - Getting the Data
  - Dates
  - Output File Names
- Writing out the files
  - Writing the CSV File
  - Writing the Summary Files
- Putting it all together
- Up next
Environment Setup
The Shebang Line
We add a shebang line in case we want to make this file executable.
#!/usr/bin/env python
Necessary Libraries
Now we import the libraries we need.
import argparse
import sys
import re
import csv
from pathlib import Path
from more_itertools import chunked
from PyPDF2 import PdfReader
The role of each library is explained below:
- `argparse`: handles command-line argument parsing. In this script, those command-line arguments are the path to the PDF file we are processing, the page we will start processing from, and an optional argument for where we will stop processing.
- `sys`: handles a variety of interactions between Python and its host system. In this script, it is used once: to exit the program in case of a fatal error.
- `re`: is Python's regular expression library. In this script, it does the bulk of the work.
- `csv`: is Python's library for dealing with delimited data files. In this script, it is used to write metadata from the bill summaries to a CSV file.
- `from pathlib import Path`: imports the `Path` object for dealing with file paths.
- `from more_itertools import chunked`: imports the `chunked` function from the `more_itertools` library. `more_itertools` is not part of a standard Python distribution and needs to be installed using `pip`.
- `from PyPDF2 import PdfReader`: imports the `PdfReader` class from `PyPDF2`, a third-party library for dealing with PDF files that can be installed using `pip`.
Constants
(No) Magic Numbers
There are three numbers that play key roles in the script. To avoid having magic numbers, we make them constants with descriptive names.
ELEMENT_COUNT = (
7 # full header, bill type, bill number, sponsor(s), action date, committee, text
)
CONGRESS_POSITION = 2
SESSION_POSITION = 3
- `ELEMENT_COUNT`: is adequately explained by the comment in the code block above.
- `CONGRESS_POSITION`: is the index where the Congress number (e.g., `74`, for the 74th Congress) ends in the stem of the file name passed to the script; the main block slices the stem with it, as the quick check below shows.
- `SESSION_POSITION`: is like `CONGRESS_POSITION`, but gives the index of the session of Congress (e.g., `2`, for the second session) in the file stem.
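For a file named `74_2.pdf`, the stem is `74_2`, and the constants slice it apart like this (a quick illustration of the indexing used in the main block):

```python
CONGRESS_POSITION = 2
SESSION_POSITION = 3

file_stem = "74_2"  # the stem of a hypothetical 74_2.pdf

print(file_stem[:CONGRESS_POSITION])  # 74 -- the Congress
print(file_stem[SESSION_POSITION])    # 2  -- the session
```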
The Month Dictionary
We use a Python dictionary to map between names (or abbreviations, or bad OCR text) of months as found in the bill summaries and two-character strings of digits that will be used in naming the files containing the extracted summaries.
MONTHS = {
"January": "01",
"February": "02",
"Feb.": "02",
"March": "03",
"Mar.": "03",
"April": "04",
"Apr.": "04",
"May": "05",
"June": "06",
"July": "07",
"August": "08",
"September": "09",
"October": "10",
"November": "11",
"December": "12",
"Alay": "05", # bad OCR, should be "May"
}
Big, Ugly Regular Expressions
The script relies on two regular expressions to do its work. They’re big, they’re ugly, they’re poorly tested, and they match a lot more than they are intended to match, but this script would be useless without them. In a later draft of this post, and a later version of the script, I will rewrite the regular expressions with the verbose flag, which they obviously need. For now, I’ll show the version of the regexes the script actually uses.
The Summary Header Pattern
This regular expression (usually) matches the first line of a bill summary and captures the parts of the line that will be extracted into a starter CSV metadata file; the groups are listed after the code block. The abundance of punctuation options has to do primarily with inconsistency in how OCR recognizes the punctuation characters in the original text, but there is also some inconsistency in how the original text is punctuated.
HEADER_PATTERN = re.compile(
r"(((?:S|H)\.? ?(?:R\.?)? (?:J\.? Res\. ?)?)(\w{1,5})\.? ((?:M(?:rs?|essrs)\.) .+?)(?:[;,:])? (\w{1,9} \d{1,2}[.,] \d{4})[.—]? ?\n?(?:\((['0-9a-zA-Z ]+)\))?(?:\.|.+\.|:|.+:)?)",
re.MULTILINE,
)
From the outside in, the regex captures the entire first line of the summary, then the bill type, bill number, sponsor(s), date of introduction, and committee. The text of each summary isn’t captured by the regex itself; it falls out of the `re.split()` call in the extraction function below.
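Applied to the first line of the S. 1019 example shown later in this post, the capture groups come out like this (a hypothetical interactive session; group 1, not shown, is the entire line):

```python
>>> line = "S. 1019. Mr. King; January 15, 1935 (District of Columbia)."
>>> m = HEADER_PATTERN.search(line)
>>> m.group(2), m.group(3), m.group(4), m.group(5), m.group(6)
('S. ', '1019', 'Mr. King', 'January 15, 1935', 'District of Columbia')
```

Note the trailing space in the bill type group, which is why the extraction function below calls `.strip()` on it.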
The Date Pattern
This regular expression is intended to match dates like the following: `January 16, 1936`. The year is optional.
DATE_PATTERN = re.compile(r"([JFMASOND][a-z]{2,8}\.?) (\d{1,2})[-—.,;: ]( \d{4})?")
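Here’s a quick check of the pattern against strings adapted from the example summary below (a hypothetical interactive session):

```python
>>> re.findall(DATE_PATTERN, "Mr. King; January 15, 1935")
[('January', '15', ' 1935')]
>>> re.findall(DATE_PATTERN, "As passed by the Senate, February 12, and referred")
[('February', '12', '')]
```

Notice that the year group keeps a leading space when it matches, and comes back empty when no year is given; the date-formatting function below accounts for both.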
Functions
Now we turn our attention to the functions that will be called in the main body of the script.
Argument Parsing
Here, we use the argparse package to handle three command-line arguments: the path to the file, the page to start from, and, optionally, the page to end on.
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument(
"file_path", help="path to a bill digest PDF, e.g., ../74_2.pdf"
)
parser.add_argument(
"start_page",
help="PDF page number (not printed page number) for where program should start working",
type=int,
)
parser.add_argument(
"end_page",
help="last PDF page number (not printed page number) program should process",
type=int,
nargs="?",
)
return parser.parse_args()
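To see what the parser produces, here’s a minimal sketch that simulates a command line (the file path is hypothetical):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("file_path")
parser.add_argument("start_page", type=int)
parser.add_argument("end_page", type=int, nargs="?")

# Simulates: python get_summaries.py ../74_2.pdf 7 8
args = parser.parse_args(["../74_2.pdf", "7", "8"])
print(args.file_path, args.start_page, args.end_page)  # ../74_2.pdf 7 8

# The end page can be omitted, in which case it comes back as None
args = parser.parse_args(["../74_2.pdf", "7"])
print(args.end_page)  # None
```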
Naming the CSV data file
The CSV file we generate needs a meaningful name, e.g., `74_2_7_8.csv`: the Congress and session (`74_2`, which we get from the stem of the PDF file we’re processing), followed by the start page and the end page (if one is specified).
def name_output_file(file_stem, start_page, end_page):
if end_page:
return "_".join([file_stem, str(start_page), str(end_page)]) + ".csv"
else:
return "_".join([file_stem, str(start_page)]) + ".csv"
Getting the Data
This is where most of the work happens. The function could and should probably be broken out into some smaller functions, but it works. See the comments in the function definition for details, and the `more_itertools` `chunked` documentation for information about the use of that function; there’s also a toy illustration of the `re.split()`/`chunked()` combination after the function.
def extract_summaries_and_metadata(file_path, start_page, end_page):
# Set up some empty containers for our data
metadata = []
summaries = []
text_file_names = []
# Instantiate the PdfReader object
reader = PdfReader(file_path)
# Get the file stem, e.g., 74_2
file_stem = file_path.stem
    # Handle the presence or absence of an ending page number
    if end_page is None:
        end = start_page + 1
    else:
        end = end_page
# Iterate over page numbers
for i in range(start_page, end):
# Get the text from each page
page_text = reader.pages[i].extract_text()
        # Find where the first line matching HEADER_PATTERN begins
        # (this assumes every page has at least one header; if none is found,
        # re.search() returns None and the .start() call will raise)
        first_header_pos = re.search(HEADER_PATTERN, page_text).start()
# Get the text from that point on
page_text = page_text[first_header_pos:]
# Split the text based on the header pattern
raw_summaries = re.split(HEADER_PATTERN, page_text)[1:]
# Group the summaries
raw_summaries = list(chunked(raw_summaries, ELEMENT_COUNT))
for item in raw_summaries:
# Give a name to each grouped piece of each summary
header, bill_type, bill_number, sponsor, date, committee, text = item
# Normalize the bill type formatting
formatted_bill_type = bill_type.strip()
# Normalize bill type for the summary text file name
lower_bill_type = bill_type.lower().replace(" ", "").replace(".", "")
# Concatenate the summary header and summary text
summary = header + text
# Add a row of metadata to the list
metadata.append(
[formatted_bill_type, bill_number, sponsor, date, committee]
)
# Add a summary to the list
summaries.append(summary)
# Add the name of the text file to the list
text_file_names.append(f"{file_stem}_{lower_bill_type}{bill_number}")
# return a tuple of lists of data
return (metadata, summaries, text_file_names)
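To make the `re.split()`/`chunked()` combination concrete, here’s a toy illustration (the pattern and text are invented for demonstration; the real script uses `HEADER_PATTERN` and `ELEMENT_COUNT`):

```python
import re

from more_itertools import chunked

# A stand-in for HEADER_PATTERN with three capture groups
toy_pattern = re.compile(r"((H|S)\. (\d+)\.)")
text = "H. 1. First summary. S. 2. Second summary."

# re.split() with capture groups interleaves the groups with the text between
# matches; dropping the leading element leaves repeating runs of
# (full header, bill type, bill number, summary text)
parts = re.split(toy_pattern, text)[1:]
for header, bill_type, bill_number, body in chunked(parts, 4):
    print(bill_type, bill_number, body.strip())
# H 1 First summary.
# S 2 Second summary.
```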
Dates
Here’s an example of a bill summary in the original PDF:

[figure: scanned image of the S. 1019 summary]

And here’s what it looks like after processing:
S. 1019. Mr. King; January 15, 1935 (District of Columbia).
As passed by the Senate, February 12, and referred to House Committee on the District of Columbia. February 15, 1935:
Requires real estate brokers and salesmen in the District of Columbia (including nonresidents but excluding auctioneers, banks, trust companies, building and loan associations or land-mortgage or farm-loan associations) to secure annual licenses from a real estate commission hereby established (composed of three members—two appointed by the Commissioners and the assessor ex-officio). Applicants for licenses must be 21 years old, able to read and write English and must give proof of trustworthiness and competence in the business (not required if applicant has 2 years experience as a broker, etc., or in connection with real estate business in the District of Columbia). Bond required—$2,500 for brokers, $1,000 for salesmen—running to the District of Columbia; and a fee for broker’s license of $25 or $5 for salesman’s license. License may be revoked by the Commission, upon its own motion or on a verified complaint, after a hearing, for any fraudulent or dishonest dealing or conviction of a crime involving fraud or dishonesty. Revocation of a broker’s license suspends the license of every salesman under him.
You’ll notice that multiple dates are specified. The date from the first line goes into the CSV file, but we want the date of the latest action to go into the name of the text file for each summary. The function below handles that.
def format_latest_action_dates(summaries):
actions = []
date_tags = []
intro_dates = []
for summary in summaries:
# Break the summary text into lines
lines = summary.splitlines()
        # Get the date on the first line
intro_dates.append(re.findall(DATE_PATTERN, lines[0])[0])
# If there is more than one line in the summary, store the second line, otherwise, store an empty string
if len(lines) > 1:
actions.append(lines[1])
else:
actions.append("")
    # Process the second line of each summary, paired with its intro date
    # (enumerate() is used rather than list.index(), which would return the
    # wrong intro date whenever two summaries have identical second lines)
    for i, action in enumerate(actions):
        # Get the whole date the bill was introduced
        intro_date = intro_dates[i]
        # Get the year from that date
        intro_year = intro_date[2]
# Find all the dates in the second line of the summary
dates = re.findall(DATE_PATTERN, action)
# If there are dates in the second line, handle that
if dates:
# Sometimes no years are specified in the second line of the summary, as in the example above
default_year = dates[0][2] or intro_year
date = list(dates[-1])
if date[2] == "":
date[2] = default_year
# Default to the year the bill was introduced
else:
date = list(intro_date)
# Convert the month text to a two digit numerical string
date[0] = MONTHS[date[0]]
        # Create the date string that will go into the name of the text file
        # (.strip() removes the leading space captured by the year group)
        date_tag = f"{date[2].strip()}{date[0]}{date[1].zfill(2)}"
date_tags.append(date_tag)
return date_tags
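Run on the S. 1019 example above, the function lands on `February 15, 1935` as the latest action and produces the tag `19350215` (a hypothetical interactive session):

```python
>>> # example_summary holds the processed S. 1019 text shown above
>>> format_latest_action_dates([example_summary])
['19350215']
```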
Output File Names
Here’s how we name the text file associated with each summary.
def format_output_files(text_file_names, date_tags):
# If the number of text file base names and the number of date tags doesn't match, we've got a problem
if len(text_file_names) != len(date_tags):
print("Text files names list and date tags list are not the same length.")
sys.exit(1)
    # Zip the text file base names and date strings together, join each pair
    # with "_", and add the ".txt" extension
    return ["_".join(pair) + ".txt" for pair in zip(text_file_names, date_tags)]
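Continuing the S. 1019 example (a hypothetical interactive session):

```python
>>> format_output_files(["74_2_s1019"], ["19350215"])
['74_2_s1019_19350215.txt']
```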
Writing out the files
Here’s how we write the files to disk.
Writing the CSV File
This is decently straightforward thanks to the `csv` module in the Python standard library. The Congress and session (pulled from the file name in the main block) are filled in automatically; most of the other columns are left blank for manual entry later.
def write_metadata(metadata, output_file_name, congress, session):
headers = [
"congress",
"session",
"pl_num",
"bill_type",
"bill_number",
"sponsor",
"committee",
"action_red",
"summary_version_code",
"report_number",
"action",
"action_date",
"associated_summary_file",
"questions_comments",
]
    # newline="" is recommended when writing files with the csv module
    with open(output_file_name, "w", newline="") as f:
writer = csv.DictWriter(f, fieldnames=headers)
writer.writeheader()
for row in metadata:
writer.writerow(
{
"congress": congress,
"session": "",
"pl_num": "",
"bill_type": row[0],
"bill_number": row[1],
"sponsor": row[2],
"committee": row[4] or "N/A",
"action_red": "",
"summary_version_code": "",
"report_number": "",
"action": "Introduced",
"action_date": row[3],
"associated_summary_file": "",
"questions_comments": "",
}
)
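For the S. 1019 example, and assuming the source file was `74_2.pdf`, the output might begin like this (illustrative values only):

```
congress,session,pl_num,bill_type,bill_number,sponsor,committee,action_red,summary_version_code,report_number,action,action_date,associated_summary_file,questions_comments
74,2,,S.,1019,Mr. King,District of Columbia,,,,Introduced,"January 15, 1935",,
```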
Writing the Summary Files
This is the most straightforward function in the script. We simply pair each summary with the file name created for it and write it to disk.
def write_summaries(summaries, output_file_names):
    # Pair each summary with its file name and write it to disk
    for file_name, summary in zip(output_file_names, summaries):
        with open(file_name, "w") as f:
            f.write(summary)
Putting it all together
Now we actually do what we set out to do.
if __name__ == "__main__":
args = parse_args()
    file_path = Path(args.file_path)
    start_page = args.start_page
    # PdfReader pages are zero-indexed, so shift the 1-based page number down
    start_page_index = start_page - 1
    end_page = args.end_page
    file_stem = file_path.stem
    output_file_name = name_output_file(file_stem, start_page, end_page)
    congress = file_stem[:CONGRESS_POSITION]
    session = file_stem[SESSION_POSITION]
metadata, summaries, text_file_names = extract_summaries_and_metadata(
file_path, start_page_index, end_page
)
action_dates = format_latest_action_dates(summaries)
output_file_names = format_output_files(text_file_names, action_dates)
    write_metadata(metadata, output_file_name, congress, session)
write_summaries(summaries, output_file_names)
Up next
Now it’s time for the manual work. In the next post, I’ll detail how I use GNU Emacs to review each text file and Google Sheets to input additional metadata about each bill.