A High-Level Metadata Workflow Overview
The semester is winding down, and so is my time as a remote metadata intern for the Law Library of Congress. I ended up learning a ton about Unix text-processing workhorses like aspell, awk, diff, find, grep, sed, and GNU Emacs, as well as newer tools like Miller and Datasette. I incorporated these tools into several shell scripts to expedite my workflow, which is built around a Python script I wrote that uses regular expressions to extract specific text from the PDF I was working with. I also did a fair amount of work with Google Sheets and SQLite, and towards the end of the internship I even dabbled with the Google Drive API.
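Before getting into those details, here is a rough, hypothetical sketch of the kind of regular-expression matching that Python script does, assuming the PDF's text has already been dumped to a plain-text file. The pattern and file name below are made up for illustration and are not the actual ones from my workflow.

```python
import re
from pathlib import Path

# Hypothetical pattern and file name, purely for illustration; the real script
# uses patterns tailored to the PDF I was working with.
HEADING_PATTERN = re.compile(r"^Public Law \d+-\d+.*$", re.MULTILINE)

def extract_headings(text_path):
    """Return every line in the extracted text that matches the pattern."""
    text = Path(text_path).read_text()
    return HEADING_PATTERN.findall(text)

if __name__ == "__main__":
    for heading in extract_headings("74_2_page.txt"):
        print(heading)
```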
My plan now is to write a series of posts detailing the steps in this workflow and my use of the tools mentioned above. Below is a high-level overview of my workflow as it stands towards the end of the semester.
- Extract text with the `prep.sh` script.
- Paste initial metadata into Google Sheets.
  - This initial metadata comes from running the command `mlr --c2t --headerless-csv-output cat 74_2_{PAGE_NUM}.csv | pbcopy`.
  - `74_2_{PAGE_NUM}.csv` is one of the files generated by the `prep.sh` script.
- Review each of the text files generated by `prep.sh` (this includes running `aspell`).
- Run `cleanup.sh` to delete backup files, move the working directory to its final location, and copy the names of the text files to the clipboard for pasting into Google Sheets (a rough sketch of this step follows the list).
- Run `update_tsv_data.sh` to create local copies of the metadata, load the metadata and text files into a SQLite database, and publish that database to Heroku (also sketched below).
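The real `cleanup.sh` is a shell script, but to give a flavor of that step, here is a simplified Python sketch. The directory names are placeholders, I'm assuming Emacs-style `*~` backup files, and `pbcopy` assumes macOS.

```python
import shutil
import subprocess
from pathlib import Path

# Placeholder paths for illustration; the real script's locations differ.
WORK_DIR = Path("working")
FINAL_DIR = Path("finished") / WORK_DIR.name

def cleanup():
    # Delete Emacs-style backup files (*~) left over from editing.
    for backup in WORK_DIR.glob("*~"):
        backup.unlink()

    # Move the working directory to its final location.
    FINAL_DIR.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(WORK_DIR), str(FINAL_DIR))

    # Copy the text file names to the clipboard for pasting into Google Sheets
    # (pbcopy is macOS-specific).
    names = "\n".join(sorted(p.name for p in FINAL_DIR.glob("*.txt")))
    subprocess.run(["pbcopy"], input=names.encode(), check=True)

if __name__ == "__main__":
    cleanup()
```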
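And here is a similarly rough sketch of the loading half of `update_tsv_data.sh` (again, the real thing is a shell script). The file, table, and column names are invented for illustration.

```python
import csv
import sqlite3
from pathlib import Path

# Hypothetical names for illustration; the real script works with the TSV
# exported from Google Sheets and the per-page text files.
METADATA_TSV = "metadata.tsv"
TEXT_DIR = Path("text_files")
DB_PATH = "law_metadata.db"

def load_database():
    conn = sqlite3.connect(DB_PATH)
    conn.execute("CREATE TABLE IF NOT EXISTS metadata (page TEXT, title TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS pages (file_name TEXT, body TEXT)")

    # Load the metadata rows from the TSV copy.
    with open(METADATA_TSV, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            conn.execute(
                "INSERT INTO metadata (page, title) VALUES (?, ?)",
                (row["page"], row["title"]),
            )

    # Load the full text of each reviewed text file.
    for text_file in sorted(TEXT_DIR.glob("*.txt")):
        conn.execute(
            "INSERT INTO pages (file_name, body) VALUES (?, ?)",
            (text_file.name, text_file.read_text()),
        )

    conn.commit()
    conn.close()

if __name__ == "__main__":
    load_database()
```

The publish step then goes to Heroku with Datasette, along the lines of `datasette publish heroku law_metadata.db` (using the hypothetical database name from the sketch above).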
Stay tuned for more posts about each of the above steps, starting with the `prep.sh` script.