- Daily, a CSV file containing customer hit data arrives from the client side
- There will be new records every day, and there might also be old records that need to be updated
- The client requires SCD Type 1 logic in the warehouse
- At the end of each day's processing, the data needs to be reconciled
- Data is loaded into the MySQL DBMS via command-line loading (see the load sketch after this list)
- After some pre-processing, the data is ingested into HDFS using Sqoop (see the import sketch after this list)
- Hive is used to manage the warehousing part
- Implemented SCD Type 2 logic
- Implemented partitioning on year & month for fast retrieval (see the Hive sketch after this list)
- Once the pipeline completes, the loaded data is checked against the input records for a matching count (see the reconciliation sketch after this list)
- After every successful operation or failure, a log entry is generated and can be reviewed for reporting and analysis (see the logging sketch after this list)
- The entire warehousing solution is automated using bash scripts
- All credentials, output directories, and DBMS details are made dynamic through a parameter file and credential files (see the configuration sketch after this list)
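A minimal sketch of the command-line load into MySQL. The database name `customer_db`, table `customer_hits_stg`, and file naming pattern are assumptions for illustration, not the project's actual names, and the server must have `local_infile` enabled.

```bash
# Hypothetical staging load: database, table, and file path are assumptions
mysql --local-infile=1 -u "$MYSQL_USER" -p"$MYSQL_PASS" customer_db -e "
LOAD DATA LOCAL INFILE 'Datasets/customer_hits_$(date +%F).csv'
INTO TABLE customer_hits_stg
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;"
```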
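A sketch of the Sqoop ingestion into HDFS after pre-processing. The JDBC URL, source table, password file, and target directory are assumptions; in the real pipeline these would come from the parameter and credential files.

```bash
# Hypothetical Sqoop import: connection string, table, and target directory are assumptions
sqoop import \
  --connect "jdbc:mysql://localhost:3306/customer_db" \
  --username "$MYSQL_USER" \
  --password-file "$CRED_DIR/mysql.password" \
  --table customer_hits_stg \
  --target-dir "/user/etl/customer_hits/$(date +%F)" \
  --fields-terminated-by ',' \
  --num-mappers 1
```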
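One possible shape of the Hive side: a warehouse table partitioned by year and month, with an SCD Type 2 style load. This is a sketch only; all table and column names are assumptions, and the MERGE path assumes an ACID-enabled Hive setup, which may differ from how this project actually implements it.

```bash
# Hypothetical HiveQL: names are assumptions illustrating SCD Type 2 on a year/month-partitioned table
hive -e "
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE IF NOT EXISTS customer_hits_hist (
  customer_id   STRING,
  hit_detail    STRING,
  eff_start_dt  DATE,
  eff_end_dt    DATE,
  is_current    BOOLEAN
)
PARTITIONED BY (year INT, month INT)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Step 1: close out current rows whose attributes changed in today's feed
MERGE INTO customer_hits_hist AS t
USING customer_hits_stg AS s
ON t.customer_id = s.customer_id
WHEN MATCHED AND t.is_current = true AND t.hit_detail <> s.hit_detail
  THEN UPDATE SET eff_end_dt = current_date, is_current = false;

-- Step 2: insert brand-new customers and the new version of changed customers
INSERT INTO TABLE customer_hits_hist PARTITION (year, month)
SELECT s.customer_id, s.hit_detail, current_date, CAST(NULL AS DATE), true,
       year(current_date), month(current_date)
FROM customer_hits_stg s
LEFT JOIN (SELECT customer_id FROM customer_hits_hist WHERE is_current = true) c
  ON s.customer_id = c.customer_id
WHERE c.customer_id IS NULL;"
```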
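A sketch of the end-of-day count reconciliation, comparing rows in the incoming CSV with records landed in HDFS. The file path, HDFS directory, and log file name are assumptions.

```bash
# Hypothetical reconciliation check: file path, HDFS directory, and log file are assumptions
INPUT_FILE="Datasets/customer_hits_$(date +%F).csv"
HDFS_DIR="/user/etl/customer_hits/$(date +%F)"

# Source count: data rows in the incoming CSV, excluding the header line
src_count=$(( $(wc -l < "$INPUT_FILE") - 1 ))

# Target count: records landed in HDFS by the Sqoop import
tgt_count=$(hdfs dfs -cat "$HDFS_DIR"/part-* | wc -l)

if [ "$src_count" -eq "$tgt_count" ]; then
  echo "$(date '+%F %T') RECON OK source=$src_count target=$tgt_count" >> pipeline.log
else
  echo "$(date '+%F %T') RECON MISMATCH source=$src_count target=$tgt_count" >> pipeline.log
  exit 1
fi
```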
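A sketch of how per-step success/failure logging might look in the bash driver. The log path, function name, and the example step are assumptions.

```bash
# Hypothetical logging helper: log path and function name are assumptions
LOG_FILE="logs/pipeline_$(date +%F).log"
mkdir -p "$(dirname "$LOG_FILE")"

log_step() {
  # $1 = step name, $2 = exit status of the step
  local step="$1" status="$2"
  if [ "$status" -eq 0 ]; then
    echo "$(date '+%F %T') SUCCESS $step" >> "$LOG_FILE"
  else
    echo "$(date '+%F %T') FAILURE $step (exit $status)" >> "$LOG_FILE"
  fi
}

# Usage: run a step, then record its outcome for later report and analysis
hdfs dfs -test -d "/user/etl/customer_hits/$(date +%F)"
log_step "hdfs_landing_check" "$?"
```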
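A sketch of how the parameter and credential files could be wired into the scripts to keep paths and DBMS details dynamic. The file names under Env/ and the variable names are assumptions.

```bash
# Hypothetical parameter file (Env/pipeline.params) -- keys are assumptions
# MYSQL_HOST=localhost
# MYSQL_DB=customer_db
# HDFS_BASE_DIR=/user/etl/customer_hits
# DATASET_DIR=Datasets

# Hypothetical credential file (Env/credentials.conf), kept out of version control
# MYSQL_USER=etl_user
# MYSQL_PASS=********

# In the driver script: source both files so every path and connection detail is dynamic
source Env/pipeline.params
source Env/credentials.conf

mysql -h "$MYSQL_HOST" -u "$MYSQL_USER" -p"$MYSQL_PASS" "$MYSQL_DB" \
      -e "SELECT COUNT(*) FROM customer_hits_stg;"
```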
Datasets Folder here
- The place where the daily data arrives; it also contains some sub-folders used for testing during development
Scripts here
- The entire automation lives here; you can find the full pipeline logic as well as the intermediate files that get generated
Env here
- Support files required by the scripts are kept under this directory
Reference File here
- Under this directory you can find the column descriptions at the schema level