
Loading Online Event Hits using Sqoop to Hive via Shell Script

Client Requirements

  • A CSV file containing customer hit data arrives from the client every day.

  • Each day's file contains new records, and it may also include old records that need to be updated.

  • The client requires SCD Type 1 logic in the warehouse.

  • At the end of each day's processing, the data needs to be reconciled.

Data Ingestion

  • Data is loaded into the MySQL DBMS from the command line.

  • After some pre-processing, the data is ingested into HDFS using Sqoop (see the sketch below).
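
A minimal sketch of these two steps, assuming a hypothetical database event_db, table online_hits, and staging path; the real names and credentials come from the parameter and credential files described under Script.

    # Load the daily CSV into MySQL from the command line
    # (database, table, and file path are illustrative)
    mysql --local-infile=1 -u "$DB_USER" -p"$DB_PASS" event_db -e "
      LOAD DATA LOCAL INFILE '/data/incoming/online_hits.csv'
      INTO TABLE online_hits
      FIELDS TERMINATED BY ','
      IGNORE 1 LINES;"

    # Ingest the MySQL table into HDFS with Sqoop
    sqoop import \
      --connect "jdbc:mysql://localhost:3306/event_db" \
      --username "$DB_USER" \
      --password-file "$CRED_FILE" \
      --table online_hits \
      --target-dir /user/hadoop/staging/online_hits \
      --fields-terminated-by ',' \
      -m 1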

Data Summarisation and Warehousing

  • Hive is used to manage the warehousing layer.

  • Implemented SCD Type 2 logic.

  • Implemented partitioning on year and month for fast retrieval (see the sketch below).
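
A minimal sketch of the warehouse load as it might be driven from the shell script; the database names, staging table, and columns (hit_id, page_url, hit_ts) are placeholders, not the project's actual schema, and a production run would typically stage the merged result in a temporary table first.

    # Run the HiveQL from the shell script
    hive -e "
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    -- Warehouse table partitioned on year and month
    CREATE TABLE IF NOT EXISTS dw.online_hits (
      hit_id     BIGINT,
      page_url   STRING,
      hit_ts     TIMESTAMP,
      start_dt   DATE,
      end_dt     DATE,
      is_current STRING
    )
    PARTITIONED BY (yr INT, mth INT)
    STORED AS ORC;

    -- SCD Type 2 style rebuild: expire the current version of updated
    -- rows and insert the incoming rows as the new current version
    INSERT OVERWRITE TABLE dw.online_hits PARTITION (yr, mth)
    SELECT hit_id, page_url, hit_ts, start_dt, end_dt, is_current, yr, mth
    FROM (
      SELECT d.hit_id, d.page_url, d.hit_ts, d.start_dt,
             CASE WHEN s.hit_id IS NOT NULL AND d.is_current = 'Y'
                  THEN current_date ELSE d.end_dt END AS end_dt,
             CASE WHEN s.hit_id IS NOT NULL AND d.is_current = 'Y'
                  THEN 'N' ELSE d.is_current END AS is_current,
             d.yr, d.mth
      FROM dw.online_hits d
      LEFT JOIN staging.online_hits s ON d.hit_id = s.hit_id
      UNION ALL
      SELECT s.hit_id, s.page_url, s.hit_ts,
             current_date AS start_dt, CAST(NULL AS DATE) AS end_dt,
             'Y' AS is_current,
             year(s.hit_ts) AS yr, month(s.hit_ts) AS mth
      FROM staging.online_hits s
    ) merged;
    "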

Validation

  • Once the pipeline completes, the warehouse output is reconciled against the input records by comparing record counts.

  • After every operation, success or failure, a log entry is generated for reporting and analysis (see the sketch below).
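
A minimal sketch of that reconciliation check, assuming the same hypothetical table names as above and a LOG_FILE variable supplied by the parameter file.

    # Compare the source count in MySQL with the warehouse count in Hive
    src_count=$(mysql -N -u "$DB_USER" -p"$DB_PASS" event_db \
                  -e "SELECT COUNT(*) FROM online_hits;")
    tgt_count=$(hive -S -e "SELECT COUNT(*) FROM dw.online_hits WHERE is_current = 'Y';")

    # Log success or failure with a timestamp
    if [ "$src_count" -eq "$tgt_count" ]; then
      echo "$(date '+%F %T') SUCCESS: counts match (source=$src_count, warehouse=$tgt_count)" >> "$LOG_FILE"
    else
      echo "$(date '+%F %T') FAILURE: count mismatch (source=$src_count, warehouse=$tgt_count)" >> "$LOG_FILE"
      exit 1
    fi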

Script

  • The entire warehousing solution is automated using Bash scripts.

  • All credentials, output directories, and DBMS details are made dynamic using a parameter file and credential files (see the sketch below).
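
A minimal sketch of how those files keep the script dynamic; the file names and variables shown here are illustrative, not the actual contents of the Env directory.

    # env/pipeline.params (sourced by the main script) might contain:
    #   DB_HOST=localhost
    #   DB_NAME=event_db
    #   HDFS_TARGET_DIR=/user/hadoop/staging/online_hits
    #   LOG_FILE=/var/log/event_pipeline.log

    source ./env/pipeline.params              # load paths and DBMS details
    DB_USER=$(cut -d',' -f1 ./env/cred.file)  # keep credentials out of the script
    DB_PASS=$(cut -d',' -f2 ./env/cred.file)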

Execution

Project Workflow

[Project workflow diagram]

Datasets Folder

  • The location where the daily data arrives; it also contains sub-folders used for testing during development.

Scripts

  • The entire automation lives here; you can find the pipeline logic and the intermediate files it generates.

Env

  • Support files required by the script are kept in this directory.

Reference File

  • This directory contains the schema-level column descriptions.

About

In this project I implemented a Hadoop pipeline using Sqoop for ingestion, Hive for summarisation and the warehouse logic, and MySQL as the database for validation and storage. The entire pipeline is automated with a shell script, and with the help of Bash commands every incident is logged properly.
