This analysis uses big data tools to answer a series of questions about Wikipedia datasets. Each question is answered with Hive or MapReduce, with the tool chosen based on the context of the question. The output of the analysis includes MapReduce .jar files and Hive .hql scripts, so the analysis is a repeatable process that works on larger datasets, not just an ad hoc calculation. The analysis covers the following tasks:
- Find, organize, and format pageviews on any given day (see the query sketch after this list).
- Follow clickstreams to find relative frequencies of different pages.
- Determine relative popularity of page access methods.
- Compare yearly popularity of pages.
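As a flavor of the first task, a daily pageview ranking reduces to a simple HQL aggregation. This is a minimal sketch, not the repository's actual script: the table name `pageviews` and its columns (`page_title`, `view_count`, `view_date`) are assumptions about how a dump might be loaded into Hive.

```sql
-- Hypothetical schema: pageviews(page_title STRING, view_count INT, view_date STRING)
-- Top 10 most-viewed pages on a given day.
SELECT page_title,
       SUM(view_count) AS total_views
FROM pageviews
WHERE view_date = '2023-01-15'
GROUP BY page_title
ORDER BY total_views DESC
LIMIT 10;
```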
Most of the code is HQL, run against Hive through the DBeaver GUI. To reproduce the analysis:
- Download DBeaver Community Edition
- Install Hive on your machine or virtual machine
- Clone my code:
  `git clone https://github.com/samye760/Wikipedia-Big-Data-Analysis.git`
- Set up a Hive connection in DBeaver, import my script, and start querying the data (a table-creation sketch follows this list).
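Before querying, the downloaded dump files need to be exposed to Hive as a table. The exact DDL depends on which dump you use; the sketch below is an assumed example for a delimited pageviews file, with hypothetical table and column names and a hypothetical HDFS path, not the schema from the repository's script.

```sql
-- Hypothetical external table over a space-delimited pageviews dump.
-- Adjust the delimiter, columns, and LOCATION to match the file you downloaded.
CREATE EXTERNAL TABLE IF NOT EXISTS pageviews_raw (
  domain_code    STRING,  -- e.g. 'en' for English Wikipedia
  page_title     STRING,
  view_count     INT,
  response_bytes BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION '/user/hive/wikipedia/pageviews';
```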
- The HQL commands can be reused on similar large datasets, particularly those published as Wikipedia Dumps.
- The script is designed so its queries can be adapted to answer a wide range of big data questions beyond the four tasks listed above (see the clickstream sketch below).
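As one illustration of that reuse, the clickstream task above follows the same aggregation pattern on a different dump. This is a hedged sketch, not the repository's script: the table name `clickstream` and the columns `prev`, `curr`, and `n` are assumptions loosely modeled on the public clickstream dump format.

```sql
-- Hypothetical schema: clickstream(prev STRING, curr STRING, type STRING, n INT)
-- Relative frequency of each source page among all recorded paths into a given page.
SELECT prev,
       n / SUM(n) OVER () AS relative_frequency  -- share of all inbound traffic to 'Hadoop'
FROM clickstream
WHERE curr = 'Hadoop'
ORDER BY relative_frequency DESC
LIMIT 10;
```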