Commit: devrupt submission
Hoang Anh committed Apr 22, 2021
1 parent 927df54 commit be06ad9
Showing 220 changed files with 54,395 additions and 0 deletions.
3 changes: 3 additions & 0 deletions .dockerignore
@@ -0,0 +1,3 @@
/etc
/log
/api/__pycache__
53 changes: 53 additions & 0 deletions .gitignore
@@ -0,0 +1,53 @@
# These are some examples of commonly ignored file patterns.
# You should customize this list as applicable to your project.
# Learn more about .gitignore:
# https://www.atlassian.com/git/tutorials/saving-changes/gitignore

# Node artifact files
node_modules/
dist/

# Compiled Java class files
*.class

# Compiled Python bytecode
*.py[cod]

# Log files
*.log

# Package files
*.jar

# Maven
target/
dist/

# JetBrains IDE
.idea/

# Unit test reports
TEST*.xml

# Generated by macOS
.DS_Store

# Generated by Windows
Thumbs.db

# Applications
*.app
*.exe
*.war

# Large media files
*.mp4
*.tiff
*.avi
*.flv
*.mov
*.wmv

# Python bytecode caches
api/__pycache__/
src/__pycache__/
utils/__pycache__/
40 changes: 40 additions & 0 deletions ai/Dockerfile
@@ -0,0 +1,40 @@
# Download base image ubuntu 18.04
FROM ubuntu:18.04

# Set the home directory and extend the Python module search path
ENV HOME /root
ENV PYTHONPATH "/usr/lib/python3/dist-packages:/usr/local/lib/python3.6/site-packages"

# Install Python 3, pip, and build tools
RUN apt-get update -y && apt-get install -y python3 \
    python3-pip \
    build-essential
RUN python3 --version

# Install Python dependencies in a single layer
RUN pip3 install pandas requests flask numpy lxml gunicorn gevent \
    scikit-learn featuretools fuzzymatcher cmake xgboost

# Copy the application code and make the entry point executable
COPY . /ai
WORKDIR /
RUN chmod +x /ai/app.py

CMD ["gunicorn", "--workers", "3", "--worker-class", "gthread", "--threads", "3", "-b", ":5000", "-t", "900", "--reload", "ai.wsgi:app"]

# To stop all running containers: docker stop $(docker ps -a -q)
43 changes: 43 additions & 0 deletions ai/README.md
@@ -0,0 +1,43 @@
# Workflow Explanation
![Pipeline](../pipeline.png)
## 1. Guest Identification
The first task in the Guest Customer Lifetime Value model is to identify all guest profiles that exist in the hospitality database.

Based on the data provided by the Apaleo API, we propose using the [**Fuzzy Matcher**](https://github.com/RobinL/fuzzymatcher) method for the guest identification task:
- Identification fields: First Name, Last Name, Email
- Threshold of [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance): 0.15
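
A minimal sketch of how [fuzzymatcher](https://github.com/RobinL/fuzzymatcher) can be applied here (file names, column names, and the score cut-off are illustrative assumptions; the library's `best_match_score` is probabilistic rather than a raw Levenshtein distance):

```python
import pandas as pd
import fuzzymatcher

# Hypothetical inputs: raw reservations and a table of known guest profiles
reservations = pd.read_csv("reservations.csv")
guest_profiles = pd.read_csv("guest_profiles.csv")

id_fields = ["FirstName", "LastName", "Email"]  # identification fields

# fuzzy_left_join keeps the single best-scoring profile for each reservation
matched = fuzzymatcher.fuzzy_left_join(reservations, guest_profiles,
                                       left_on=id_fields, right_on=id_fields)

# best_match_score is a probabilistic score, not a raw Levenshtein distance,
# so this cut-off is an illustrative assumption, not the pipeline's exact 0.15
confident = matched[matched["best_match_score"] > 0.15]
```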

## 2. Guest Type Detection
By checking the number of bookings per guest, we build two datasets from the original one (see the sketch after this list):
- **1st** reservations of **ALL** guests
- **ALL** reservations of **returning** guests
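
A rough pandas sketch of this split, assuming one reservation per row and a `GuestId` column produced by the identification step (column names are illustrative):

```python
import pandas as pd

def split_by_guest_type(df: pd.DataFrame):
    """Split reservations into the two datasets described above."""
    df = df.sort_values("ArrivalDate")  # hypothetical booking-date column
    bookings_per_guest = df.groupby("GuestId")["GuestId"].transform("size")

    # Dataset 1: the 1st reservation of ALL guests
    first_reservations = df.drop_duplicates("GuestId", keep="first")
    # Dataset 2: ALL reservations of returning guests (2+ bookings)
    returning_guests = df[bookings_per_guest > 1]
    return first_reservations, returning_guests
```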

## 3. Feature Engineering
To solve the [**CRM**](https://en.wikipedia.org/wiki/Customer_relationship_management) problem, we take [Recency-Frequency-Monetary (**RFM**)](https://en.wikipedia.org/wiki/RFM_(market_research)) attributes as our key features. For fast implementation, we use [**featuretools**](https://github.com/alteryx/featuretools) to apply **aggregation** and **transformation** primitives to the raw features.

- For **returning guests**, we use **domain knowledge** to generate new features and then apply [featuretools](https://github.com/alteryx/featuretools) to one-hot encode **categorical** features:
- **Recency** features: InactiveDays, ...
- **Frequency** features: NumberOfOrders, ...
- **Monetary** features: TotalRevenue, ...
- **Non-RFM** features: AverageNightsInHouse, NumberOfRooms, ...
- For **1st reservations**, our pipeline relies on [featuretools](https://github.com/alteryx/featuretools) to build new features and then automatically remove noisy and correlated features, as sketched below.
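
A hedged sketch of the featuretools step for 1st reservations (assuming the featuretools 1.x API; entity, column, and primitive choices are illustrative, not the exact pipeline):

```python
import featuretools as ft
from featuretools.selection import remove_highly_correlated_features

# Build an EntitySet from the 1st-reservation dataset (columns assumed)
es = ft.EntitySet(id="hotel")
es = es.add_dataframe(dataframe_name="reservations",
                      dataframe=first_reservations,
                      index="ReservationId",
                      time_index="ArrivalDate")
es = es.normalize_dataframe(base_dataframe_name="reservations",
                            new_dataframe_name="guests",
                            index="GuestId")

# Deep Feature Synthesis: aggregation + transformation on the raw features
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name="guests",
                                      agg_primitives=["sum", "mean", "count"],
                                      trans_primitives=["month", "weekday"])

# Automatically drop redundant (highly correlated) features
feature_matrix, feature_defs = remove_highly_correlated_features(
    feature_matrix, features=feature_defs)
```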

## 4. Modeling
### 4.1. Customer Lifetime Value [CLV]
#### 4.1.1. Segmentation
We filter customers whose lifetime exceeds **k=12** months for this model; the others, whose lifetime is less than 12 months, are classified by the next model. For simplicity and speed, we choose **k-means clustering** as the base model.
- <ins>Step 1</ins>: perform clustering on each of the **RFM features** (with adaptive **k** ranging from 2 to 9)
- <ins>Step 2</ins>: calculate the **RFM score** by weighted aggregation of the three models above
- <ins>Step 3</ins>: perform clustering on the **RFM score** (with fixed **k=3** for the Low, Middle, and High CLV classes)
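
A sketch of these three steps with scikit-learn, assuming an `rfm` DataFrame of long-term customers; the silhouette-based choice of k and the equal weights are assumptions, not the pipeline's exact settings:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rfm = pd.read_csv("rfm_features.csv")  # hypothetical feature-engineering output

def best_kmeans_labels(values, k_range=range(2, 10)):
    """Step 1: cluster one RFM feature, picking k in [2, 9] by silhouette."""
    X = np.asarray(values).reshape(-1, 1)
    best_k = max(k_range, key=lambda k: silhouette_score(
        X, KMeans(n_clusters=k, n_init=10).fit_predict(X)))
    labels = KMeans(n_clusters=best_k, n_init=10).fit_predict(X)
    # Order labels by cluster mean so they act as ordinal scores
    # (Recency's inverted sense is glossed over in this sketch)
    order = pd.Series(X.ravel()).groupby(labels).mean().rank().astype(int) - 1
    return order[labels].to_numpy()

weights = {"Recency": 1.0, "Frequency": 1.0, "Monetary": 1.0}  # assumed weights

# Step 2: weighted aggregation of the per-feature cluster labels
rfm["RFM_Score"] = sum(w * best_kmeans_labels(rfm[c]) for c, w in weights.items())

# Step 3: fixed k=3 on the score -> Low / Middle / High CLV classes
rfm["CLV_Class"] = KMeans(n_clusters=3, n_init=10).fit_predict(rfm[["RFM_Score"]])
```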

The result is the **CLV class** of each so-called **long-term** customer; these classes are used as **labels** for the next models.

#### 4.1.2. Classification
For the **short-term** customers (those with less than 12 months as a customer), we remove the **RFM features** and train [**XGBoost**](https://xgboost.readthedocs.io/en/latest/get_started.html) with the labels from the model above.

**Note:** if the database contains no guests with more than 12 months of lifetime, the model automatically takes the top 50% by lifetime for training the CLV classification model.
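
A hedged sketch of this stage (assuming xgboost ≥ 1.6 and illustrative column names; `long_term` carries the labels from 4.1.1 and `short_term` the remaining customers):

```python
import xgboost as xgb
from sklearn.model_selection import train_test_split

rfm_columns = ["Recency", "Frequency", "Monetary"]  # assumed column names

# long_term / short_term: assumed DataFrames from the previous steps
X = long_term.drop(columns=rfm_columns + ["CLV_Class"])
y = long_term["CLV_Class"]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = xgb.XGBClassifier(objective="multi:softprob",
                        n_estimators=200,
                        eval_metric="mlogloss")
clf.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

# Predict CLV classes for the short-term customers (RFM columns removed)
short_term_pred = clf.predict(short_term[X.columns])
```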

### 4.2. Potential VIP Guest Prediction
First-time guests have no RFM information, which means their CLV class cannot be predicted by the classification model above. We therefore propose estimating the probable CLV class of a first-time guest from the correlation between first-time guests and the 1st reservations of returning guests (which already carry CLV classes).

Our approach is the Density-Based Spatial Clustering of Applications with Noise ([**DBSCAN**](https://en.wikipedia.org/wiki/DBSCAN)) algorithm, chosen for its efficiency at retrieving density information, its noise removal, and its good overall performance.
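
A minimal sketch of this step with scikit-learn (feature columns, `eps`, and `min_samples` are assumptions; `returning_first` carries CLV classes, `first_time` does not):

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

feature_columns = ["AverageNightsInHouse", "NumberOfRooms"]  # illustrative

# Cluster labelled 1st reservations of returning guests together with
# first-time guests in one scaled feature space (frames are assumed inputs)
combined = pd.concat([returning_first, first_time], ignore_index=True)
X = StandardScaler().fit_transform(combined[feature_columns])

combined["Cluster"] = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)  # -1 = noise

# Within each dense cluster, the CLV-class mix of the labelled guests gives
# the probability estimate for the first-time guests in that cluster
known = combined[(combined["Cluster"] != -1) & combined["CLV_Class"].notna()]
class_probs = known.groupby("Cluster")["CLV_Class"].value_counts(normalize=True)
```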
192 changes: 192 additions & 0 deletions ai/app.py
@@ -0,0 +1,192 @@
from flask import Flask, request
import json
import pandas as pd
import traceback
from datetime import datetime
import os
import shutil
from glob import glob

from ai.src.guest_identification import guest_identification_process
from ai.src.guest_type_detection import guest_type_detector
from ai.src.feature_engineering import feature_engineering_pipeline
from ai.src.modeling_for_LTV import LTV_model
from ai.src.potential_guest_segmentation import Potential_model

app = Flask(__name__)
parent_directory = "/ai/storage/"
def return_object(dataList_output, return_code=200, return_status="SUCCESS", return_message="Returns success", error_log=""):
    return_body = json.dumps(
        {
            "code": return_code,
            "status": return_status,
            "message": return_message,
            "dataList": dataList_output,
            "errorlog": error_log
        }
    )
    return return_body

def clean_up_storage():
    folder_path = parent_directory + "*"
    list_sub_folder = glob(folder_path)
    for folder in list_sub_folder:
        folder_lifetime = (datetime.now().timestamp() - os.stat(folder).st_ctime) / 3600
        # Remove any working folder older than 1.5 hours; report failures instead of raising
        if folder_lifetime > 1.5:
            try:
                shutil.rmtree(folder)
            except OSError as e:
                print("Error: %s - %s." % (e.filename, e.strerror))

@app.route('/send-input-data', methods=['POST'])
def get_input_data():
    folder_path = parent_directory  # fallback so the except block can always write a log
    # Clean up the storage before handling the new file
    try:
        clean_up_storage()
    except Exception as error_sum:
        print("Error summary: ", error_sum)
        error_log = traceback.format_exc()
        print("Error details: ", str(error_log))

    try:
        req_data = request.get_json()

        file_id = request.headers['FileID']
        # Header values arrive as strings, so cast before comparing to integers
        try:
            force_run = int(request.headers['ForceRun'])
        except (KeyError, ValueError):
            force_run = 0
        if force_run != 0:
            print("\n\n*************************************************************************")
            print("!!! WARNING !!! RUNNING WITH FORCE RUN MAY CAUSE ERRORS DURING PROCESSING")
            print("*************************************************************************\n\n")

        try:
            current_year = int(request.headers['CurrentYear'])
            if len(str(current_year)) != 4 or current_year < 2000:
                current_year = 0
        except (KeyError, ValueError):
            current_year = 0

        folder_path = parent_directory + str(file_id) + "/"
        # Create a working directory named after the FileID header
        print(f"Current directory: {os.getcwd()}")

        try:
            os.mkdir(os.path.join(parent_directory, str(file_id)))
        except Exception as error_sum:
            print("Error summary: ", error_sum)
            error_log = traceback.format_exc()
            print("Error details: ", str(error_log))

        # RUN THE MAIN PROCESS
        df_guest_identification = guest_identification_process(req_data, str(folder_path), force_run)

        if len(df_guest_identification) < 1000 and force_run == 0:
            return return_object(dataList_output=[], return_code=400, return_status="ERROR", return_message="File is too small; at least 1,000 observations are required to run the model.")
        elif len(df_guest_identification) > 75000 and force_run == 0:
            return return_object(dataList_output=[], return_code=400, return_status="ERROR", return_message="File is too large; the maximum number of observations supported is 75,000.")
        elif len(df_guest_identification) < 300 and force_run == 1:
            return return_object(dataList_output=[], return_code=400, return_status="ERROR", return_message="File is too small; force-run requires at least 300 observations.")

        guest_type_detector(str(folder_path))

        feature_engineering_pipeline(str(folder_path), current_year)

        LTV_model(str(folder_path), True)

        Potential_model(str(folder_path))

        json_return_output = return_object(dataList_output=[], return_code=200, return_status="SUCCESS", return_message="Successfully run.")
    except Exception as error_sum:
        print("Error summary: ", error_sum)
        error_log = traceback.format_exc()
        print("Error details: ", str(error_log))

        with open(folder_path + "error_log.txt", "w") as text_file:
            text_file.write(str(error_log))

        json_return_output = return_object(dataList_output=[], return_code=400, return_status="ERROR", return_message="Failed during processing.", error_log=str(error_log))
    return json_return_output

@app.route('/check-process-status', methods=['GET'])
def check_process_status():
    try:
        fileID = request.headers['FileID']
        directory_path = parent_directory + str(fileID)

        if os.path.isfile(directory_path + '/final_output.json'):
            return_status = "READY"
            return_message = "File is ready."
            error_log = ""
        elif os.path.isfile(directory_path + '/error_log.txt'):
            return_status = "ERROR"
            with open(directory_path + '/error_log.txt', "r") as file_content:
                error_log = str(file_content.read())
            return_message = "File encountered an error during processing."
        elif os.path.isfile(directory_path + '/data_input_with_guest_id.csv'):
            return_status = "PROCESSING"
            return_message = "File is processing."
            error_log = ""
        else:
            return_status = "PROCESSING"
            return_message = "File not found yet."
            error_log = ""

        json_return_output = return_object(dataList_output=[], return_code=200, return_status=return_status, return_message=return_message, error_log=str(error_log))
    except Exception as error_sum:
        print("Error summary: ", error_sum)
        error_log = traceback.format_exc()
        print("Error details: ", str(error_log))
        json_return_output = return_object(dataList_output=[], return_code=400, return_status="ERROR", return_message="Failed during processing.", error_log=str(error_log))
    return json_return_output

@app.route('/get-prediction-file', methods=['GET'])
def send_prediction_file():
    try:
        fileID = request.headers['FileID']
        directory_path = parent_directory + str(fileID)
        # Load the final output JSON as a dictionary
        with open(directory_path + '/final_output.json') as file_name:
            data = json.load(file_name)
        json_return_output = return_object(dataList_output=data, return_code=200, return_status="SUCCESS", return_message="Successfully run.")
    except Exception as error_sum:
        print("Error summary: ", error_sum)
        error_log = traceback.format_exc()
        print("Error details: ", str(error_log))
        json_return_output = return_object(dataList_output=[], return_code=400, return_status="ERROR", return_message="Failed during processing.", error_log=str(error_log))
    return json_return_output

@app.route('/get-file-details', methods=['GET'])
def get_file_details():
    try:
        fileID = request.headers['FileID']
        directory_path = parent_directory + str(fileID)
        file_name = directory_path + '/data_input_with_guest_id.csv'
        try:
            df_input = pd.read_csv(file_name)
        except Exception:
            df_input = pd.DataFrame()
        return_total_records = {"TotalRecords": len(df_input)}
        json_return_output = return_object(dataList_output=[return_total_records], return_code=200, return_status="SUCCESS", return_message="Successfully retrieved input file.")
    except Exception as error_sum:
        print("Error summary: ", error_sum)
        error_log = traceback.format_exc()
        print("Error details: ", str(error_log))
        json_return_output = return_object(dataList_output=[], return_code=400, return_status="ERROR", return_message="Failed during processing.", error_log=str(error_log))
    return json_return_output

@app.route('/', methods=['GET'])
def home():
    return "<h1>Distant Reading Archive</h1><p>This site is a prototype API for distant reading of science fiction novels.</p>"

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5000, debug=True)
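
For reference, a client might exercise these endpoints as in the following sketch (host, header values, and the payload are assumptions for illustration):

```python
import requests

BASE = "http://localhost:5000"
headers = {"FileID": "example-123", "CurrentYear": "2021"}

# reservations_payload: a JSON body in the format the model expects (assumed)
# Start the pipeline by posting the reservation data
requests.post(f"{BASE}/send-input-data", json=reservations_payload, headers=headers)

# Poll the processing status until the output file is READY (or ERROR)
status = requests.get(f"{BASE}/check-process-status", headers=headers).json()

if status["status"] == "READY":
    result = requests.get(f"{BASE}/get-prediction-file", headers=headers).json()
    predictions = result["dataList"]
```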