Wednesday, December 20, 2017

Why Librarians Scrape Data

It all started here:

Created at: Tue, Nov 28, 2017 at 10:24 AM (Delivered after 87 seconds)
From: Brad Coffield
To: CODE4LIB@lists.clir.org
Subject: [CODE4LIB] Anyone web scraping to benefit their library?


I thought this email thread would be interesting to summarize and report back to the general web in a way that a computer can't, yet. The resulting 10+ emails (as of December 12) were summarized, de-duplicated, and categorized under three main themes:

#1 - Fixing the Web

You could describe all of the ideas in this category as fixing some functionality in a website or (vendor) system that hasn't been built in yet or isn't available for some other reason. Web scraping lets you work around the gap and collect data how and when you want it. In most cases it is pretty easy to do once you have the right software.
  • Creating APIs where there are none
  • Modifying RSS feeds
  • Collecting wishlist info
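
As a rough illustration of "creating an API where there is none," here is a minimal Python sketch that turns the links on a results page into JSON. The sample HTML, the `LinkCollector` class, and the `page_to_json` function are all made up for this example and do not come from the email thread; a real page would need its own parsing rules.

```python
import json
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href and link text of every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag we are currently inside
        self._text = []     # pieces of text seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.links.append({"url": self._href, "title": title})
            self._href = None


def page_to_json(html):
    """Return the page's links as a JSON string (hypothetical helper)."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return json.dumps(parser.links)


# Invented sample standing in for a vendor page that offers no API.
SAMPLE_HTML = """
<ul>
  <li><a href="/record/1">Moby-Dick</a></li>
  <li><a href="/record/2">Walden</a></li>
</ul>
"""

print(page_to_json(SAMPLE_HTML))
```

This sketch sticks to the standard library so it runs anywhere Python does; in practice you would fetch the page first and probably reach for a dedicated scraping library.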

#2 - Doing the Work

The 21st century librarian is steeped in data, not just as a gatekeeper but as a caretaker as well. Web scraping and ETL tools are the backbone of the profession in this sense. The software in this case might be specialized, or at least require a trip to README.txt or finding a walk-through.
  • Collecting and aggregating search results in vendor databases
  • Archiving for Archive.org with Wget
  • QA and double-checking your work when you push up to a larger catalog system or database

#3 - Pure Computation

Third and finally, there are the data nerds out there who use scraping as a way to generate more data for further analysis. They are collecting data in the most general sense of the word and working with it in novel ways. The difficulty can range from super easy if the source is consistent to super difficult if it's not.

  • Scientific data (NOAA and other government agencies)
  • IMPORTXML function in Google Sheets
  • Interacting with Wikis and Wiki-like sites
  • Getting at Pre-combined or Pre-processed Data
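
For readers who have not used it, IMPORTXML in Google Sheets takes a URL and an XPath query and returns the matching values. The Python sketch below imitates that idea on a small, invented XML snippet. The station data and the `import_xml` helper are assumptions for illustration only, not real NOAA output, and the standard library supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# Invented sample data; a real feed would be downloaded first.
SAMPLE_XML = """<stations>
  <station id="A1"><name>Harbor</name><temp_c>4.5</temp_c></station>
  <station id="B2"><name>Ridge</name><temp_c>-1.0</temp_c></station>
</stations>"""


def import_xml(xml_text, path):
    """Return the text of every element matching a simple XPath-style path."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.findall(path)]


names = import_xml(SAMPLE_XML, "./station/name")
temps = [float(t) for t in import_xml(SAMPLE_XML, "./station/temp_c")]

# Pair each station name with its reading for further analysis.
for name, temp in zip(names, temps):
    print(name, temp)
```

When the source is this consistent, the scraping really is the easy part; the difficulty shows up when element names or nesting change from one record to the next.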

One thing I noticed in doing this is that the breakdown seems to coincide with library roles fairly well, in the sense that systems librarians might be doing activities that public services librarians would not. But that idea is for another time, place, and blog post.