I am an inexperienced shell scripter, but this project gave me the chance to do some productive work with the shell. My workflow starts with running prep.sh, a script that takes two or three arguments and is explained below.

Shebang

I’m on macOS, so my shebang line references zsh.

#! /usr/bin/env zsh

Arguments

FILE=$(realpath $1) # full path to the file; must be a PDF
START=$2 # first page of the PDF to process
if [[ -n $3 ]]
then
	FINISH=$3 # last page of the PDF to process
fi

Here, I use realpath to take an argument like ../../74.2.pdf and expand it to its full path. The first argument, $1, should be a path to a PDF file; $2 should be the page the Python script begins processing from, and $3 should be the last page it processes.
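
For example, with hypothetical page numbers, a full run looks like this:

./prep.sh ../../74.2.pdf 12 18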

Processing

# pull metadata into csv and write summaries out to text files
# named following this schema: {congress}_{session}_{bill type}_{bill number}.txt
if [[ -n $FINISH ]]
then
	python3 get_summaries.py $FILE $START $FINISH
else
	python3 get_summaries.py $FILE $START
fi

If there is no $3, no worries: we just pass the file path and the start page to the Python script.
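
As an aside, zsh drops an unset, unquoted parameter from the argument list entirely, so the branch above could arguably be collapsed into a single call. A sketch, assuming get_summaries.py is happy either way:

# in zsh, an unset $FINISH expands to nothing and vanishes from
# the argument list, so this behaves like the if/else above
python3 get_summaries.py $FILE $START $FINISH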

# get file's name for name of new directory and iteration purposes
FILE_BASENAME=$(basename $FILE .pdf) 

Here, I pass basename the file’s path along with the .pdf extension, which strips both the leading directories and the extension from the name of the file the script is processing. This will be used to name the directories created by the script.
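
For example, with a hypothetical path:

basename /Users/me/bills/74.2.pdf .pdf # prints 74.2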

# make new directory and move generated files into it
if [[ -n $FINISH ]]
then
	NEW_DIR="${FILE_BASENAME}_${START}_${FINISH}"
else
	NEW_DIR="${FILE_BASENAME}_${START}"
fi

Above, we name the directory that the files generated by the script will be moved into. For example, processing 74.2.pdf from page 12 through page 18 would produce a directory named 74.2_12_18.

Manually reviewing the bill summaries surfaced some common OCR errors, so below I run a sed script to deal with some of those.

for f in *.{txt,csv}
do
	gsed -f common_errors.sed -i $f
done
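
One zsh wrinkle worth noting: if either pattern matches nothing (say, no CSV was generated), zsh complains with a “no matches found” error and the loop never runs. The (N) glob qualifier, which makes an unmatched pattern expand to nothing, would sidestep that; a sketch:

# (N) lets a pattern with no matches expand to nothing
# instead of erroring out
for f in *.txt(N) *.csv(N)
do
	gsed -f common_errors.sed -i $f
done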

Here’s what common_errors.sed looks like:

s/\bJudiciarv\b/Judiciary/ig
s/\bTvdings\b/Tydings/g
s/\bbv\b/by/g
s/\banv\b/any/g
s/\bdenved\b/derived/g
s/\bm\b/in/g
s/\bpiopose\b/propose/g
s/\bwhioh\b/which/g
s/\beto\./etc./g
s/\bwit inn\b/within/g
s/\bwrorks\b/works/g
s/\bLnited\b/United/g
s/\bsolelv\b/solely/g
s/\blavs\b/lays/g
s/\bArmv\b/Army/g
s/\biniured\b/injured/g
s/\bDedares\b/Declares/ig
s/\bcrcp\b/crop/ig
s/\bWilev\b/Wiley/ig
s/\bdomestio\b/domestic/ig

Simple stuff, but it saves some time.
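
To spot-check a rule, you can pipe a sample line through the script (the sample text here is fabricated):

$ echo "Referred to the Committee on the Judiciarv" | gsed -f common_errors.sed
Referred to the Committee on the Judiciary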

The OCR preserved the typesetting of the lines in the original document, so I run another sed script to make the lines flow, to replace some of the entities the Python script brings in, and to normalize some of the punctuation.

for f in *.txt
do
	gsed -i.bak -Ez -f clean_lines.sed $f
done

Here’s that script:

# rejoin words hyphenated across a line break
s/- \n//g
# a line ending in a period or colon marks a paragraph break
s/([.:]) \n/\1\n\n/g
# otherwise, join each line to the one that follows it
s/([^.:] )\n/\1/g
# normalize doubled and curly quotes to straight quotes
s/(‘‘|’’|“|”)/"/g
s/[‘’]/'/g
# turn em dashes into --
s/-?— ?/--/g
# delete a stray character the OCR brings in (it doesn't render here)
s///g
# drop stray # marks and runs of periods floating between spaces
s/ (#|\.+) / /g
# collapse extra periods after a word-ending period
s/(\w\.)\.+/\1./g
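
To see the reflowing in action, here’s a fabricated snippet run through the script; the hyphenated, hard-wrapped lines come back out as a flowing paragraph with a blank line after each sentence-ending period:

$ printf 'Amends the Agricultural Ad- \njustment Act to extend the \nprogram. \nRepeals section 5. \n' | gsed -Ez -f clean_lines.sed
Amends the Agricultural Adjustment Act to extend the program.

Repeals section 5.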

Wrapping up

Finally, I move the files generated by the Python script into a directory named after the file and the page number(s) being processed, and then copy that directory and its contents into a backup directory.

mkdir $NEW_DIR
mv $FILE_BASENAME*.??? $NEW_DIR
cp -R $NEW_DIR "${NEW_DIR}.bak"

Pasting metadata

The data entry for this workflow happens in Google Sheets, so I use Miller to convert the generated CSV file to TSV for easier pasting into Sheets. I could probably go back and have the script spit out TSV instead of CSV.

mlr --c2t --headerless-csv-output cat 74_2_{PAGE_NUM}.csv | pbcopy

--c2t tells Miller to read CSV and write TSV, and --headerless-csv-output leaves the header row out of the output, so only the data rows land on the clipboard.
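
A quick way to see what ends up on the clipboard (rows fabricated for illustration; the output columns are tab-separated):

$ printf 'congress,session,bill\n74,2,hr 8505\n' | mlr --c2t --headerless-csv-output cat
74	2	hr 8505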

Up next

As you can probably tell, this script is really just a harness for the get_summaries.py script that does the bulk of the work. That script will be explained in the next post.