A tool to scrape Vélib website and collect raw trip data.

Velibity

This tool is a web scraper that collects Vélib trip data and inserts it into a Postgres database.

1. Description

The scraper runs on a dockerized Selenium browser. The HTML is parsed with BeautifulSoup4, and the parsed data is inserted into the Postgres database using psycopg2.
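As a self-contained illustration of the parsing step (the project uses BeautifulSoup4 on the rendered pages; the sketch below substitutes the standard library's html.parser so it can run stand-alone, and the trip-row markup is invented, not the real Vélib page layout):

```python
from html.parser import HTMLParser

# Hypothetical trip-row markup; the real Vélib page layout differs.
SAMPLE_HTML = """
<div class="trip"><span class="date">2023-01-05</span><span class="distance">2.4</span></div>
<div class="trip"><span class="date">2023-01-06</span><span class="distance">1.1</span></div>
"""

class TripParser(HTMLParser):
    """Collect {date, distance} dicts from <span> fields inside trip divs."""
    def __init__(self):
        super().__init__()
        self._field = None   # class of the span currently open, if any
        self._current = {}   # fields of the trip being built
        self.trips = []      # completed trip dicts

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class", "")
        if tag == "span" and css_class in ("date", "distance"):
            self._field = css_class

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "div" and self._current:
            self.trips.append(self._current)
            self._current = {}

parser = TripParser()
parser.feed(SAMPLE_HTML)
print(parser.trips)
```

With BeautifulSoup4 the same extraction is shorter (select the trip divs, then read the child spans), but the shape of the output is the same: one dict per scraped trip.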

Python generators are used throughout for efficiency and scalability. At the end of the process, insert queries are grouped into small batches before being executed.
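The generator-plus-batching pattern described above can be sketched as follows. This is a minimal stand-in, not the project's actual code: `trip_rows` fakes the lazily scraped trips, and in the real flow each batch would be handed to psycopg2 (e.g. via `cursor.executemany`):

```python
from itertools import islice

def batched(iterable, size):
    """Yield lists of at most `size` items, consuming the iterable lazily."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def trip_rows():
    """Stand-in generator for scraped trips (date, distance_km tuples)."""
    for i in range(7):
        yield (f"2023-01-{i + 1:02d}", 1.5 * i)

# With psycopg2, each batch would be executed in one round trip, e.g.:
#   cursor.executemany("INSERT INTO trips (date, distance) VALUES (%s, %s)", batch)
for batch in batched(trip_rows(), size=3):
    print(len(batch), batch[0])
```

Because both the scraping and the batching are generators, no more than one batch of rows needs to be held in memory at a time.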

2. How To

Setting things up

To get the scraper to run, you will need to have:

  • A Python environment (3.6+) meeting the requirements specified in the file requirements.txt
  • A "Standalone-Chrome" Selenium docker image (3.14+), with the proper name and version specified in the driver configuration file (conf/driver.yaml by default). Such an image can be pulled from here, or built using this for armhf (Armv7) based devices.
  • A Postgres database with one table, whose creation SQL query can be found in the ddl folder (trips.sql). The table name must match the one in the configuration file conf/scraper.yaml.
  • A credentials file conf/credentials.yaml following the format described in conf/credentials_format.yaml
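Once those pieces are in place, the scraper wires them together from the configuration files. A hedged sketch of that wiring (the dict keys below are assumptions about what conf/driver.yaml and conf/credentials.yaml might contain, not the project's actual schema):

```python
# Values as they might be loaded (e.g. with PyYAML) from conf/driver.yaml
# and conf/credentials.yaml -- the keys here are illustrative assumptions.
driver_conf = {"host": "localhost", "port": 4444}
db_conf = {"host": "localhost", "port": 5432, "dbname": "velib",
           "user": "scraper", "password": "secret"}

def selenium_url(conf):
    """Command-executor URL for a remote Standalone-Chrome container."""
    return f"http://{conf['host']}:{conf['port']}/wd/hub"

def postgres_dsn(conf):
    """libpq-style DSN string, as accepted by psycopg2.connect()."""
    return " ".join(f"{k}={v}" for k, v in conf.items())

print(selenium_url(driver_conf))
print(postgres_dsn(db_conf))
# A remote driver would then be created with something like:
#   webdriver.Remote(command_executor=selenium_url(driver_conf), options=...)
# and the database opened with:
#   psycopg2.connect(postgres_dsn(db_conf))
```

Keeping host names, ports, and credentials in the YAML files means the same code runs against a local docker container or a remote one without modification.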

Running the scraper

Run the file main.py to start the scraper. It has no required arguments; if --user is not specified, all users are scraped.

Example bash command:

workon velibityenv3
python main.py --user raphaelberly
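The --user handling described above can be reproduced with argparse. A minimal sketch (the flag name matches the README; the parser description and default are illustrative):

```python
import argparse

def build_parser():
    """Command-line parser mirroring the README's --user flag."""
    parser = argparse.ArgumentParser(description="Run the Velibity scraper.")
    # Optional: restrict scraping to a single user; default is all users.
    parser.add_argument("--user", default=None,
                        help="username to scrape (all users if omitted)")
    return parser

args = build_parser().parse_args(["--user", "raphaelberly"])
print(args.user)  # -> raphaelberly
```

Leaving the flag optional with a None default lets the scraper treat "no user given" as "scrape everyone", matching the behaviour described above.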

An example of the console output is shown as a screenshot in the repository.
