This tool is a web scraper that collects Vélib data and inserts it into a postgres database.

The scraper runs on a dockerized Selenium browser. The HTML is parsed with BeautifulSoup4, and the parsed data is inserted into the postgres database using psycopg2.

Python generators are used for efficiency and scalability. At the end of the process, insert queries are grouped into small batches before being executed.
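As a rough illustration of that pipeline, here is a minimal sketch of the generator and batching pattern using `psycopg2.extras.execute_batch`; the table name, column layout, and CSS selector are assumptions made for the example, not the project's actual schema or markup.

```python
# A minimal sketch of the generator + batched-insert pattern described above.
# The table name ('trips'), column count, and CSS selector are illustrative
# assumptions, not this project's actual schema or markup.
import psycopg2
from psycopg2.extras import execute_batch
from bs4 import BeautifulSoup

def parse_trips(html_pages):
    """Lazily yield one parsed record at a time instead of building a full list."""
    for html in html_pages:
        soup = BeautifulSoup(html, 'html.parser')
        for row in soup.select('tr.trip'):  # hypothetical selector
            yield tuple(td.get_text(strip=True) for td in row.find_all('td'))

def insert_trips(conn, records, batch_size=100):
    """Group the insert queries into small batches before executing them."""
    query = 'INSERT INTO trips VALUES (%s, %s, %s)'  # illustrative schema
    with conn.cursor() as cursor:
        execute_batch(cursor, query, records, page_size=batch_size)
    conn.commit()

# Usage (connection parameters assumed):
# conn = psycopg2.connect(dbname='velib', user='...', password='...')
# insert_trips(conn, parse_trips(pages))
```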
To get the scraper to run, you will need:

- A python environment (3.6+) meeting the requirements specified in the file `requirements.txt`
- A "Standalone-Chrome" selenium docker image (3.14+), with the proper name and version specified in the driver configuration file (`conf/driver.yaml` by default). Such an image can be pulled from Docker Hub, or built for armhf (Armv7) based devices (see the connection sketch after this list).
- A postgres database with one table, whose creation SQL query can be found in the `ddl` folder (`trips.sql`). The table name must match the one in the configuration file `conf/scraper.yaml`
- A credentials file `conf/credentials.yaml` following the format described in `conf/credentials_format.yaml`
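For orientation, here is a hedged sketch of how a scraper might reach such a Standalone-Chrome container through Selenium's remote endpoint; the YAML keys (`host`, `port`) and the target URL are assumptions, not necessarily what `conf/driver.yaml` actually contains.

```python
# A sketch of connecting to a dockerized Standalone-Chrome container.
# The YAML keys ('host', 'port') are assumed for illustration; see
# conf/driver.yaml for the structure this project actually uses.
import yaml
from selenium import webdriver

with open('conf/driver.yaml') as f:
    conf = yaml.safe_load(f)

# The standalone-chrome container exposes a remote WebDriver endpoint
url = 'http://{host}:{port}/wd/hub'.format(**conf)  # assumed keys
driver = webdriver.Remote(
    command_executor=url,
    desired_capabilities=webdriver.DesiredCapabilities.CHROME,  # Selenium 3.x API
)
try:
    driver.get('https://www.velib-metropole.fr')  # example page only
    print(driver.title)
finally:
    driver.quit()
```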
The file `main.py` may be run in order to launch the scraper. It has no required argument. If `--user` is not specified, data will be scraped for all users.
Example bash command:
```bash
workon velibityenv3
python main.py --user raphaelberly
```
Example console output: