Commit be06ad9 by Hoang Anh, committed Apr 22, 2021 (1 parent: 927df54). 220 changed files with 54,395 additions and 0 deletions.
@@ -0,0 +1,3 @@
/etc
/log
/api/__pycache__
@@ -0,0 +1,53 @@
# These are some examples of commonly ignored file patterns.
# You should customize this list as applicable to your project.
# Learn more about .gitignore:
# https://www.atlassian.com/git/tutorials/saving-changes/gitignore

# Node artifact files
node_modules/
dist/

# Compiled Java class files
*.class

# Compiled Python bytecode
*.py[cod]

# Log files
*.log

# Package files
*.jar

# Maven
target/
dist/

# JetBrains IDE
.idea/

# Unit test reports
TEST*.xml

# Generated by MacOS
.DS_Store

# Generated by Windows
Thumbs.db

# Applications
*.app
*.exe
*.war

# Large media files
*.mp4
*.tiff
*.avi
*.flv
*.mov
*.wmv

../api/__pycache__/
src/__pycache__/
utils/__pycache__/
@@ -0,0 +1,40 @@
# Download base image ubuntu 18.04
FROM ubuntu:18.04

# Environment configuration
ENV HOME /root
ENV PYTHONPATH "/usr/lib/python3/dist-packages:/usr/local/lib/python3.6/site-packages"

RUN apt-get update -y && apt-get install -y python3 \
    python3-pip \
    build-essential
RUN python3 --version

# Python dependencies (cmake is installed first, in case xgboost has to be built from source)
RUN pip3 install pandas \
    requests \
    flask \
    numpy \
    lxml \
    gunicorn \
    gevent \
    scikit-learn \
    featuretools \
    fuzzymatcher \
    cmake
RUN pip3 install xgboost

# Copy the application source into the image
COPY . /ai
WORKDIR /
RUN chmod +x /ai/app.py

CMD ["gunicorn", "--workers", "3", "--worker-class", "gthread", "--threads", "3", "-b", ":5000", "-t", "900", "--reload", "ai.wsgi:app"]

# docker stop $(docker ps -a -q)
@@ -0,0 +1,43 @@
# Workflow Explanation
## 1. Guest Identification
The first task in the Guest Customer Lifetime Value model is to identify all guest profiles that exist in the hospitality database.

Based on the data provided by the Apaleo API, we propose using the [**Fuzzy Matcher**](https://github.com/RobinL/fuzzymatcher) method for the guest identification task (a minimal sketch follows the list below):
- Identification fields: First Name, Last Name, Email
- Threshold of [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance): 0.15
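
The real pipeline lives in `ai/src/guest_identification.py` and uses the Fuzzy Matcher package; the sketch below only illustrates the idea with the standard library's `difflib`, treating `1 - ratio()` as an approximate normalised edit distance and the 0.15 threshold above as the cut-off. The column names and sample data are hypothetical.

```python
# Minimal guest-identification sketch, not the project's guest_identification_process.
from difflib import SequenceMatcher
import pandas as pd

def distance(a: str, b: str) -> float:
    """Approximate normalised string distance in [0, 1]."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()

guests = pd.DataFrame({
    "FirstName": ["Anna", "Ana", "John"],
    "LastName":  ["Schmidt", "Schmid", "Doe"],
    "Email":     ["anna@example.com", "anna@example.com", "john@example.com"],
})

THRESHOLD = 0.15
guest_ids = [None] * len(guests)
next_id = 0
for i, row in guests.iterrows():
    for j in range(i):
        other = guests.loc[j]
        # Two records belong to the same guest if all identification fields
        # fall under the distance threshold.
        if all(distance(row[c], other[c]) <= THRESHOLD
               for c in ["FirstName", "LastName", "Email"]):
            guest_ids[i] = guest_ids[j]
            break
    if guest_ids[i] is None:
        guest_ids[i] = next_id
        next_id += 1

guests["GuestId"] = guest_ids
print(guests)
```

In this toy example the first two records receive the same `GuestId` because all three identification fields are within the 0.15 distance threshold.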

## 2. Guest Type Detection
By checking the number of bookings per guest, we build two datasets from the original one (see the sketch after this list):
- **1st** reservations of **ALL** guests
- **ALL** reservations of **returning** guests
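
A minimal sketch of this split with pandas, assuming a reservation table with hypothetical `GuestId`, `ArrivalDate` and `Revenue` columns; the project's actual logic is in `ai/src/guest_type_detection.py` and is not shown here.

```python
import pandas as pd

reservations = pd.DataFrame({
    "GuestId":     [1, 1, 2, 3, 3, 3],
    "ArrivalDate": pd.to_datetime(
        ["2020-01-05", "2020-06-10", "2020-02-01",
         "2020-03-15", "2020-07-20", "2021-01-02"]),
    "Revenue":     [120, 150, 90, 200, 180, 220],
})
reservations = reservations.sort_values(["GuestId", "ArrivalDate"])

# Dataset 1: the 1st reservation of ALL guests.
first_reservations = reservations.groupby("GuestId", as_index=False).first()

# Dataset 2: ALL reservations of returning guests (more than one booking).
booking_counts = reservations.groupby("GuestId")["ArrivalDate"].transform("count")
returning_reservations = reservations[booking_counts > 1]
```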

## 3. Feature Engineering
To solve the [**CRM**](https://en.wikipedia.org/wiki/Customer_relationship_management) problem, we use [Recency-Frequency-Monetary (**RFM**)](https://en.wikipedia.org/wiki/RFM_(market_research)) as our key features. For fast implementation, we use [**featuretools**](https://github.com/alteryx/featuretools) to apply **aggregation** and **transformation** primitives to the raw features (a sketch follows the list below).

- For **returning guests**, we use **domain knowledge** to generate new features and then apply [featuretools](https://github.com/alteryx/featuretools) for one-hot encoding of **categorical** features:
  - **Recency** features: InactiveDays, ...
  - **Frequency** features: NumberOfOrders, ...
  - **Monetary** features: TotalRevenue, ...
  - **Non-RFM** features: AverageNightsInHouse, NumberOfRooms, ...
- For the **1st reservation** dataset, our pipeline relies on [featuretools](https://github.com/alteryx/featuretools) to build new features and then automatically remove noisy and correlated features.
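
A sketch of RFM-style aggregation with featuretools, assuming the featuretools >= 1.0 API and hypothetical column names; the project's own pipeline is `ai/src/feature_engineering.py` and may differ in its primitives and selection steps.

```python
import featuretools as ft
import pandas as pd
from featuretools.selection import remove_highly_correlated_features

reservations = pd.DataFrame({
    "ReservationId": [1, 2, 3, 4],
    "GuestId":       [1, 1, 2, 3],
    "ArrivalDate":   pd.to_datetime(
        ["2020-01-05", "2020-06-10", "2020-02-01", "2020-03-15"]),
    "Revenue":       [120.0, 150.0, 90.0, 200.0],
    "Nights":        [2, 3, 1, 4],
})

es = ft.EntitySet(id="hotel")
es = es.add_dataframe(dataframe_name="reservations", dataframe=reservations,
                      index="ReservationId", time_index="ArrivalDate")
# Derive a per-guest dataframe so that aggregations are computed per guest.
es = es.normalize_dataframe(base_dataframe_name="reservations",
                            new_dataframe_name="guests", index="GuestId")

# Frequency (count), Monetary (sum/mean) and Recency (time since last booking)
# style aggregations over each guest's reservations.
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="guests",
    agg_primitives=["count", "sum", "mean", "time_since_last"],
    trans_primitives=[],
    cutoff_time=pd.Timestamp("2021-01-01"),
)

# Drop highly correlated columns, as described for the 1st-reservation dataset.
feature_matrix = remove_highly_correlated_features(feature_matrix)
print(feature_matrix.head())
```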

## 4. Modeling
### 4.1. Customer Lifetime Value (CLV)
#### 4.1.1. Segmentation
We select customers whose lifetime is more than **k = 12** months for this model; the others, whose lifetime is less than 12 months, are classified by the next model. For simplicity and speed, we choose **k-means clustering** as the base model (a sketch follows the steps below):
- <ins>Step 1</ins>: perform clustering on each of the **RFM features** (with adaptive **k** ranging from 2 to 9)
- <ins>Step 2</ins>: calculate the **RFM score** by weighted aggregation of the 3 models above
- <ins>Step 3</ins>: perform clustering on the **RFM score** (with fixed **k = 3** for the Low, Middle and High CLV classes)
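
A sketch of the three steps with scikit-learn k-means, under simplified assumptions: an `rfm` table with one column per RFM feature, silhouette score used to pick the adaptive k, and equal weights for the RFM score. The project's implementation lives in `ai/src/modeling_for_LTV.py`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
rfm = pd.DataFrame({
    "Recency":   rng.integers(1, 365, 200),
    "Frequency": rng.integers(1, 20, 200),
    "Monetary":  rng.uniform(50, 5000, 200),
})

def best_kmeans_labels(values, k_range=range(2, 10)):
    """Step 1: cluster one RFM feature, choosing k by silhouette score."""
    X = values.to_numpy().reshape(-1, 1)
    best_labels, best_score = None, -1.0
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels

scores = pd.DataFrame({
    col: best_kmeans_labels(rfm[col]) for col in ["Recency", "Frequency", "Monetary"]
})

# Step 2: weighted aggregation of the three per-feature cluster indices
# (equal weights here, purely illustrative).
rfm["RFMScore"] = scores.mean(axis=1)

# Step 3: final clustering with fixed k = 3 -> Low / Middle / High CLV classes.
rfm["CLVClass"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    rfm[["RFMScore"]]
)
print(rfm["CLVClass"].value_counts())
```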

The result is the set of **CLV classes** for the so-called **long-term** customers; this result is used as the **label** for the next models.

#### 4.1.2. Classification
For the **short-term** customers (who have spent less than 12 months as customers), we remove the **RFM features** and train [**XGBoost**](https://xgboost.readthedocs.io/en/latest/get_started.html) with the labels produced by the model above (a sketch follows the note below).

**Note:** If the database does not contain any guest with a lifetime over 12 months, the model automatically picks the top 50% of guests by lifetime for training the CLV classification model.
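
A sketch of training the classifier, with synthetic stand-ins for the non-RFM features and for the CLV class labels produced by the segmentation step (feature names and hyper-parameters are illustrative, not the project's).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))   # stand-ins for AverageNightsInHouse, NumberOfRooms, ...
y = rng.integers(0, 3, 500)     # 0 = Low, 1 = Middle, 2 = High CLV class

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="mlogloss")
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
```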

### 4.2. Potential VIP Guest Prediction
First-time guests do not have RFM information, which means they cannot be predicted by the LTV classification model. We therefore propose a method for estimating the probable LTV classes of first-time guests based on the correlation between the 1st reservations of returning guests (which already have LTV classes) and those of first-time guests.

Our approach uses the Density-Based Spatial Clustering of Applications with Noise ([**DBSCAN**](https://en.wikipedia.org/wiki/DBSCAN)) algorithm for its efficiency in density information retrieval, its noise removal, and its good performance. A sketch follows below.
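
A sketch of the idea under simplified assumptions: DBSCAN is run on the 1st-reservation features of returning guests (which already carry a CLV class) together with those of first-time guests, and each first-time guest inherits the class distribution observed among the labelled members of its cluster. The real logic lives in `ai/src/potential_guest_segmentation.py`; the parameters and synthetic data below are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
returning = pd.DataFrame(rng.normal(size=(300, 4)), columns=["f1", "f2", "f3", "f4"])
returning["CLVClass"] = rng.integers(0, 3, 300)
first_time = pd.DataFrame(rng.normal(size=(100, 4)), columns=["f1", "f2", "f3", "f4"])

# Cluster returning and first-time guests together on their 1st-reservation features.
features = pd.concat([returning.drop(columns="CLVClass"), first_time], ignore_index=True)
X = StandardScaler().fit_transform(features)
clusters = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
returning["cluster"] = clusters[: len(returning)]
first_time["cluster"] = clusters[len(returning):]

# The class distribution of labelled (returning) guests per cluster becomes the
# probability estimate for first-time guests in the same cluster; DBSCAN noise
# points (cluster == -1) are skipped.
probs = (returning[returning["cluster"] != -1]
         .groupby("cluster")["CLVClass"]
         .value_counts(normalize=True)
         .unstack(fill_value=0)
         .reset_index())
first_time_probs = first_time.merge(probs, on="cluster", how="left")
print(first_time_probs.head())
```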
@@ -0,0 +1,192 @@
from flask import Flask, request
import ast
import json
import pandas as pd
import traceback
from datetime import datetime
import os
import sys
import shutil
from glob import glob

from ai.src.guest_identification import guest_identification_process
from ai.src.guest_type_detection import guest_type_detector
from ai.src.feature_engineering import feature_engineering_pipeline
from ai.src.modeling_for_LTV import LTV_model
from ai.src.potential_guest_segmentation import Potential_model

app = Flask(__name__)
parent_directory = "/ai/storage/"


def return_object(dataList_output, return_code=200, return_status="SUCCESS", return_message="Returns success", error_log=""):
    return_body = json.dumps(
        {
            "code": return_code,
            "status": return_status,
            "message": return_message,
            "dataList": dataList_output,
            "errorlog": error_log
        }
    )
    return return_body


def clean_up_storage():
    folder_path = parent_directory + "*"
    list_sub_folder = glob(folder_path)
    for folder in list_sub_folder:
        folder_lifetime = (datetime.now().timestamp() - os.stat(folder).st_ctime) / 3600
        # Remove sub-folders older than 1.5 hours; print the error instead of raising.
        if folder_lifetime > 1.5:
            try:
                shutil.rmtree(folder)
            except OSError as e:
                print("Error: %s - %s." % (e.filename, e.strerror))


@app.route('/send-input-data', methods=['POST'])
def get_input_data():
    # Clean up the storage
    try:
        clean_up_storage()
    except Exception as error_sum:
        print("Error summary: ", error_sum)
        error_log = traceback.format_exc()
        print("Error Details: ", str(error_log))

    try:
        req_data = request.get_json()

        file_id = request.headers['FileID']
        try:
            force_run = int(request.headers['ForceRun'])
        except (KeyError, ValueError):
            force_run = 0
        if force_run != 0:
            print("\n\n*************************************************************************")
            print("!!! WARNING !!! RUNNING WITH FORCE RUN MAY CAUSE ERRORS DURING PROCESSING")
            print("*************************************************************************\n\n")

        try:
            current_year = int(request.headers['CurrentYear'])
            if len(str(current_year)) != 4 or current_year < 2000:
                current_year = 0
        except (KeyError, ValueError):
            current_year = 0

        folder_path = parent_directory + str(file_id) + "/"
        # Create a working directory named after the FileID header
        print(f"Current directory: {os.getcwd()}")
        try:
            os.mkdir(os.path.join(parent_directory, str(file_id)))
        except Exception as error_sum:
            print("Error summary: ", error_sum)
            error_log = traceback.format_exc()
            print("Error Details: ", str(error_log))

        # RUN THE MAIN PROCESS
        df_guest_identification = guest_identification_process(req_data, str(folder_path), force_run)

        if len(df_guest_identification) < 1000 and force_run == 0:
            return return_object(dataList_output=[], return_code=400, return_status="ERROR", return_message="File is too small, at least 1,000 observations are required to run the model.")
        elif len(df_guest_identification) > 75000 and force_run == 0:
            return return_object(dataList_output=[], return_code=400, return_status="ERROR", return_message="File is too large, the maximum number of observations supported is 75,000.")
        elif len(df_guest_identification) < 300 and force_run == 1:
            return return_object(dataList_output=[], return_code=400, return_status="ERROR", return_message="File is too small, force-run requires at least 300 observations.")

        guest_type_detector(str(folder_path))

        feature_engineering_pipeline(str(folder_path), current_year)

        LTV_model(str(folder_path), True)

        Potential_model(str(folder_path))

        json_return_output = return_object(dataList_output=[], return_code=200, return_status="SUCCESS", return_message="Successfully Run.")
    except Exception as error_sum:
        print("Error summary: ", error_sum)
        error_log = traceback.format_exc()
        print("Error Details: ", str(error_log))

        with open(folder_path + "error_log.txt", "w") as text_file:
            text_file.write(str(error_log))

        json_return_output = return_object(dataList_output=[], return_code=400, return_status="ERROR", return_message="Fail when processing", error_log=str(error_log))
    return json_return_output


@app.route('/check-process-status', methods=['GET'])
def check_process_status():
    try:
        fileID = request.headers['FileID']
        directory_path = parent_directory + str(fileID)

        if os.path.isfile(directory_path + '/final_output.json'):
            return_status = "READY"
            return_message = "File is ready."
            error_log = ""
        elif os.path.isfile(directory_path + '/error_log.txt'):
            return_status = "ERROR"
            with open(directory_path + '/error_log.txt', "r") as file_content:
                error_log = str(file_content.read())
            return_message = "File encountered an error during processing."
        elif os.path.isfile(directory_path + '/data_input_with_guest_id.csv'):
            return_status = "PROCESSING"
            return_message = "File is processing."
            error_log = ""
        else:
            return_status = "PROCESSING"
            return_message = "File not found yet."
            error_log = ""

        json_return_output = return_object(dataList_output=[], return_code=200, return_status=return_status, return_message=return_message, error_log=str(error_log))
    except Exception as error_sum:
        print("Error summary: ", error_sum)
        error_log = traceback.format_exc()
        print("Error Details: ", str(error_log))
        json_return_output = return_object(dataList_output=[], return_code=400, return_status="ERROR", return_message="Fail when processing", error_log=str(error_log))
    return json_return_output


@app.route('/get-prediction-file', methods=['GET'])
def send_prediction_file():
    try:
        fileID = request.headers['FileID']
        directory_path = parent_directory + str(fileID)
        # Open the JSON result file and parse it into a dictionary
        with open(directory_path + '/final_output.json') as json_file:
            data = json.load(json_file)
        json_return_output = return_object(dataList_output=data, return_code=200, return_status="SUCCESS", return_message="Successfully Run.")
    except Exception as error_sum:
        print("Error summary: ", error_sum)
        error_log = traceback.format_exc()
        print("Error Details: ", str(error_log))
        json_return_output = return_object(dataList_output=[], return_code=400, return_status="ERROR", return_message="Fail when processing.", error_log=str(error_log))
    return json_return_output


@app.route('/get-file-details', methods=['GET'])
def get_file_details():
    try:
        fileID = request.headers['FileID']
        directory_path = parent_directory + str(fileID)
        file_name = directory_path + '/data_input_with_guest_id.csv'
        try:
            df_input = pd.read_csv(file_name)
        except (FileNotFoundError, pd.errors.EmptyDataError):
            df_input = pd.DataFrame()
        return_total_records = {"TotalRecords": len(df_input)}
        json_return_output = return_object(dataList_output=[return_total_records], return_code=200, return_status="SUCCESS", return_message="Successfully Retrieved Input File.")
    except Exception as error_sum:
        print("Error summary: ", error_sum)
        error_log = traceback.format_exc()
        print("Error Details: ", str(error_log))
        json_return_output = return_object(dataList_output=[], return_code=400, return_status="ERROR", return_message="Fail when processing.", error_log=str(error_log))
    return json_return_output


@app.route('/', methods=['GET'])
def home():
    return "<h1>Distant Reading Archive</h1><p>This site is a prototype API for distant reading of science fiction novels.</p>"


if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5000, debug=True)