(meta)data(sette)

This post is part of my Metadata Workflow Series.

Intro

After running cleanup.sh I like to make a local copy of it as a tab-separated file I can do various things with, including storing it in SQLite database with Datasette which, among other things, allows me to link the metadata with their associated text files.

The Code

As usual, it’s a shell script.

#!/usr/bin/env zsh

I use wget to save the data to my local machine. Getting the URL right took a little Googling and turned up this great answer on Stack Overflow. The spreadsheet has to be published to the web in order for this to work.

wget -q -O ./all_data.tsv https://docs.google.com/spreadsheets/d/e/2PACX-1vQkoFBuqjcBX-a_xT3IfTmLmUJ0LZme8Mj16sIdqqEC9Z_vPmyHJKiTCRrZHPaSsxI-4lKaLkWD_XSk/pub\?gid\=0\&single\=true\&output\=tsv

The Database

I then (re)create the SQLite database containing the bill metadata and the summary text files using Simon Willison’s extraordinarily easy to use sqlite-utils.

First, I drop the existing tables that hold the core data.

sqlite-utils drop-table summaries.db actions
sqlite-utils drop-table summaries.db files

Then I recreate the actions table, which is the table that holds the bill metadata

sqlite-utils insert summaries.db actions all_data.tsv --tsv -d

I rename the columns for from the spreadsheet for use in the database. There may be a better way to do this, but this works.

sqlite-utils transform summaries.db actions --rename "Congress" "congress" --rename "Session" "session" --rename "PL Num (if applicable)/PVTL Num" "pl_num" --rename "Bill Type" "bill_type" --rename "Bill Number " "bill_number" --rename "Sponsor" "sponsor" --rename "Committee" "committee" --drop "Action Code" --drop "Summary Version Code" --drop "Report Number" --rename "Action" "action" --rename "Action Date" "action_date" --rename "Associated Summary Text File Name" "summary_text_file_name" --rename "Questions/Comments" "questions_comments"

Now I insert all the text file summaries on my local machine into the database.

sqlite-utils insert-files --text summaries.db files 74*/*.txt

This ends up saving the directory name and file name in the files tables’s path column. We just want the file name since that’s what is recorded in the actions table. The following fixes that for us.

sqlite-utils convert summaries.db files path 'value.split("/")[1]'

Now we can link up the two tables

sqlite-utils add-foreign-key summaries.db actions summary_text_file_name files path --ignore

and publish the database as an app on Heroku.

datasette publish heroku summaries.db --name "llc" --tar "/usr/local/bin/gtar" --install=datasette-saved-queries

Local Copies of the Data

Next I massage the data a little bit to allow for some simple analysis on the command line with MIller. The awk command drops the columns that don’t have any data in them and the Miller command fills empty spots in the Sponsor and Committee columns, making the data bit tidy-er.

awk 'BEGIN{FS=OFS="\t"}{print $1, $2, $4, $5, $6, $7, $11, $12}' all_data.tsv | mlr --tsv fill-down -f Sponsor,Committee then cat > filled_metadata.tsv

Finally, to allow for easier reading of the spreadsheet by humans, the following awk command puts a blank line between each bill. $5 refers to the Bill Number column.

# insert blank lines between bills for pasting into final spreadsheet
awk 'BEGIN{FS="\t"} {cur=$5} NR>1 && cur!=prev {print ""} {prev=cur; print}' all_data.tsv > spaced_metadata.tsv

Next Time

This wraps up the core workflow, so now it’s time to move on how I actually use the Datasette web interface and SQL queries to check metadata quality.