Scripts to scrape a set of URLs from Medium and parse out the HTML, body text, links, and metadata, plus a Shiny app for exploring the resulting corpus.
To run all of this code, you will need R >= 3.6.
Package dependencies are tracked with the renv package. When you first load this project, renv will be installed automatically. Next, run renv::restore() to read the required package list from renv.lock and install the recorded package versions.
scraping/scrape_medium.R reads the list of links from scraping/links.txt and downloads the full HTML of each page, storing it in an sqlite3 database so that the script can be interrupted and restarted without losing progress. Once the scrape has finished, the script writes the full HTML data to medium_html.rds.
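The resumable pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in scrape_medium.R: the function name `scrape_resumable`, the `pages` table, and its columns are all hypothetical, and the page fetcher is passed in as an argument so the loop can be tried without network access. It assumes the DBI and RSQLite packages are installed.

```r
library(DBI)
library(RSQLite)

# Hypothetical helper: fetch every URL that is not yet stored and save its HTML.
# `fetch` is any function that takes a URL and returns the page as one string.
scrape_resumable <- function(urls, db_path, fetch) {
  con <- dbConnect(SQLite(), db_path)
  on.exit(dbDisconnect(con), add = TRUE)

  # Table and column names are assumptions, not the project's actual schema.
  dbExecute(con, "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)")

  # Skip anything a previous (possibly interrupted) run already stored.
  done <- dbGetQuery(con, "SELECT url FROM pages")$url
  todo <- setdiff(urls, done)

  for (u in todo) {
    html <- tryCatch(fetch(u), error = function(e) NA_character_)
    if (is.na(html)) next  # leave failed URLs for the next run
    dbExecute(con, "INSERT INTO pages (url, html) VALUES (?, ?)",
              params = list(u, html))
  }

  # Return everything stored so far, e.g. to pass to saveRDS().
  dbGetQuery(con, "SELECT url, html FROM pages")
}
```

With a real fetcher, for example `function(u) httr::content(httr::GET(u), as = "text")`, rerunning the function against the same database file only downloads the pages that are still missing.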
Once the raw HTML has been scraped, scraping/preprocess_medium.R runs an incremental data processing pipeline (based on the extremely useful drake package). The pipeline consumes medium_html.rds, uses helper functions from parse_medium.R to extract metadata, and then uses functions from quanteda to construct a full corpus object with the Medium texts, the core_data metadata data frame, and two document-feature matrices, one stemmed and one unstemmed. These four objects are saved to shiny/data.rda for use by the exploratory Shiny app.
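As a rough sketch of the quanteda step only (the drake pipeline and the HTML parsing are left out), the corpus and the two document-feature matrices might be built along these lines. The toy `core_data` frame, its column names, and the choice to remove English stopwords are assumptions made for illustration, not details taken from preprocess_medium.R.

```r
library(quanteda)

# Toy stand-in for the parsed Medium metadata; column names are assumptions.
core_data <- data.frame(
  doc_id = c("post1", "post2"),
  text   = c("Algorithms are regulating markets.",
             "Regulation of algorithms matters."),
  stringsAsFactors = FALSE
)

# Build the corpus, then tokenize once and reuse the tokens for both matrices.
corp <- corpus(core_data, docid_field = "doc_id", text_field = "text")
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))  # assumption: stopwords dropped

dfm_unstemmed <- dfm(toks)
dfm_stemmed   <- dfm(tokens_wordstem(toks))

# Save the four objects together; a temporary path is used here on purpose.
save(corp, core_data, dfm_unstemmed, dfm_stemmed,
     file = file.path(tempdir(), "data.rda"))
```

Stemming merges inflected forms of a word into one feature, so the stemmed matrix has no more features than the unstemmed one.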
The shiny/ directory contains the Shiny app, an interactive web interface for data exploration and visualization. It is a conventional Shiny app that uses reactivity to update menus, tables, and graph outputs dynamically as input selections change across the site. The UI is built from shinydashboard elements.
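For readers new to Shiny, the reactive pattern mentioned above looks roughly like this. The example is a deliberately tiny sketch and not an excerpt from the app: the `posts` data frame and the input and output IDs are hypothetical, and it uses a plain `fluidPage` layout rather than shinydashboard to keep dependencies minimal.

```r
library(shiny)

# Hypothetical toy data standing in for the corpus metadata.
posts <- data.frame(
  title = c("AI ethics", "Data governance"),
  term  = c("ethics", "governance"),
  stringsAsFactors = FALSE
)

ui <- fluidPage(
  selectInput("term", "Search term", choices = unique(posts$term)),
  tableOutput("matches")
)

server <- function(input, output, session) {
  # Re-evaluated automatically whenever input$term changes.
  filtered <- reactive(posts[posts$term == input$term, , drop = FALSE])
  output$matches <- renderTable(filtered())
}

app <- shinyApp(ui, server)
# shiny::runApp(app) would launch it in a browser.
```

Because `filtered()` is a reactive expression, any output that depends on it is refreshed whenever the selection changes, with no explicit event handling.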
Searches on Medium used to assemble the list of links in scraping/links.txt:
- "artificial intelligence"
- "ethics"
- "ethical"
- "big data"
- "machine learning"
- "algorithm"
- "regulation"
- "governance"
- "robotics"
The links were collected with the webscraper.io browser extension for Firefox, using the following configuration:
{
  "_id": "mediumlinks",
  "startUrl": [
    "https://medium.com/search?q=ethics"
  ],
  "selectors": [
    {
      "id": "postlinks",
      "type": "SelectorElementScroll",
      "parentSelectors": [
        "_root"
      ],
      "selector": "div.postArticle",
      "multiple": true,
      "delay": "3500"
    },
    {
      "id": "link",
      "type": "SelectorElementAttribute",
      "parentSelectors": [
        "postlinks"
      ],
      "selector": "a[data-action=open-post]",
      "multiple": true,
      "extractAttribute": "href",
      "delay": 0
    }
  ]
}
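The configuration above shows a start URL for the "ethics" search only. Assuming the other searches used the same URL pattern (the source does not confirm this), the corresponding start URLs could be generated in base R as in this sketch:

```r
# Search terms as listed in this README.
search_terms <- c(
  "artificial intelligence", "ethics", "ethical", "big data",
  "machine learning", "algorithm", "regulation", "governance", "robotics"
)

# Assumption: every search uses the same pattern as the "ethics" start URL.
encoded <- vapply(search_terms, URLencode, character(1),
                  reserved = TRUE, USE.NAMES = FALSE)
start_urls <- paste0("https://medium.com/search?q=", encoded)

print(start_urls)
```

Each resulting URL could then be substituted into the `startUrl` array for a separate run of the scraper.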