Data Scientist Coding Exercise
- All coding exercises have been verified using Google Colab https://colab.research.google.com/
- We suggest that you use Google Colab as well, but feel free to use whatever platform or coding environment you are most comfortable with.
- Ensure that your coding platform (e.g., Google Colab) is set up and fully functional.
- Confirm that the development environment has the necessary tools and libraries installed (e.g., Python, pandas, scikit-learn, Jupyter Notebook).
- Download the repository by cloning it:
git clone https://github.com/stordco/data-science-coding-exercise.git
- There are 3 problem statements below. You are expected to develop code that fulfills the requirements and produces the desired outputs.
- The 3 CSV files needed for this exercise are included in this repository.
- You will have 48 hours to complete the assignment.
- Once completed, please share your Google Colab notebook and outputs, or zip up the code and outputs, and email them to the Stord/Maxima talent acquisition team member that you have been working with.
- If you have any questions, please send them via email to the Stord/Maxima talent acquisition team member that you have been working with.
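One way to confirm that the libraries mentioned above are available is a quick import check. This is only a minimal sketch: the helper name `installed_versions` is made up for illustration, and the package list is an assumption based on the libraries named above (scikit-learn is imported under the name `sklearn`).

```python
import importlib


def installed_versions(packages):
    """Map each package name to its version string, or None if it cannot be imported."""
    versions = {}
    for name in packages:
        try:
            module = importlib.import_module(name)
            # Most scientific packages expose __version__; fall back to "unknown".
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = None
    return versions


# Assumed package list; extend it with anything else your solution needs.
versions = installed_versions(["pandas", "sklearn"])
for name, version in versions.items():
    print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
```

Any package reported as not installed can usually be added with `pip install` (Colab ships with pandas and scikit-learn by default).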
Model Building and Evaluation
- You are provided with a dataset from a marketing campaign. The dataset includes user_id, age, income, previous_purchases, and purchase (1 if the user made a purchase, 0 otherwise).
- Your task is to build a logistic regression model to predict whether a user will make a purchase based on the provided features.
Instructions
- Load the dataset.
- Split the dataset into training and testing sets.
- Train a logistic regression model on the training set.
- Evaluate the model on the testing set using accuracy, precision, recall, and F1 score.
- Output the evaluation metrics.
Expected Output
- Accuracy, precision, recall, and F1 score
Dataset
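The steps in this problem could be sketched roughly as follows. This is an illustration under stated assumptions, not a reference solution: the function name `evaluate_purchase_model`, the 80/20 split, the feature scaling step, and the random seed are all choices made here, and the data is a synthetic stand-in generated in the snippet rather than the CSV from the repository. Only the column names come from the problem description.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ["age", "income", "previous_purchases"]  # user_id is an identifier, not a feature
TARGET = "purchase"


def evaluate_purchase_model(df, test_size=0.2, seed=42):
    """Train a logistic regression on df and return test-set metrics as a dict."""
    X, y = df[FEATURES], df[TARGET]
    # Stratify so both splits keep a similar purchase rate.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    # Scaling is an assumption; it helps the solver converge when income >> age.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
    }


# Synthetic stand-in data (NOT the exercise CSV). For the real exercise,
# load the provided file with pd.read_csv(...) instead.
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    "user_id": np.arange(n),
    "age": rng.integers(18, 70, n),
    "income": rng.normal(60000, 15000, n),
    "previous_purchases": rng.integers(0, 10, n),
})
demo["purchase"] = (demo["previous_purchases"] + rng.normal(0, 2, n) > 5).astype(int)

metrics = evaluate_purchase_model(demo)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Wrapping the scaler and the classifier in one pipeline keeps the scaler fitted on the training split only, which avoids leaking test-set statistics into the model.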
Model Building and Data Analysis
- You are given a dataset of customer transactions in a CSV file.
- The dataset includes transaction_id, customer_id, transaction_amount, and transaction_date.
- Your task is to write a function that reads the dataset and returns the total transaction amount for each customer.
Instructions
- Write a function named total_transaction_amount_per_customer.
- The function should take a file path as input.
- Read the CSV file into a pandas DataFrame.
- Group the data by customer_id and calculate the total transaction amount for each customer.
- Return a dictionary where the keys are customer_id and the values are the total transaction amounts.
Expected Output
- A dictionary with customer IDs and their corresponding total transaction amounts.
Dataset
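A minimal sketch of the function described in this problem might look like the following. The function name and column names come from the instructions; the tiny CSV written to a temporary file is made-up sample data used only so the snippet runs on its own, and is not the dataset from the repository.

```python
import os
import tempfile

import pandas as pd


def total_transaction_amount_per_customer(file_path):
    """Return {customer_id: total transaction_amount} for the CSV at file_path."""
    df = pd.read_csv(file_path)
    # Sum transaction_amount within each customer_id group.
    totals = df.groupby("customer_id")["transaction_amount"].sum()
    return totals.to_dict()


# Made-up sample rows (NOT the exercise CSV), written to a temporary file
# so the function can be exercised through a real file path.
sample_csv = (
    "transaction_id,customer_id,transaction_amount,transaction_date\n"
    "1,101,20.0,2024-01-01\n"
    "2,102,15.5,2024-01-02\n"
    "3,101,30.0,2024-01-03\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(sample_csv)
    sample_path = tmp.name

totals = total_transaction_amount_per_customer(sample_path)
os.remove(sample_path)  # clean up the temporary file
print(totals)
```

For the sample rows, customer 101 totals 50.0 and customer 102 totals 15.5. Whether rows with missing amounts should be dropped or flagged is not specified in the problem, so this sketch relies on the pandas default of skipping missing values when summing.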
Model Building and Hyperparameter Tuning
- You are given a dataset containing customer churn information for a telecommunications company.
- The dataset includes customer_id, tenure, monthly_charges, total_charges, contract_type, and churn (1 if the customer churned, 0 otherwise).
- Your task is to build a logistic regression model to predict customer churn and perform hyperparameter tuning using cross-validation.
Instructions
- Load the dataset.
- Perform any necessary data cleaning and feature engineering.
- Split the dataset into training and testing sets.
- Train a logistic regression model and use cross-validation to find the best hyperparameters.
- Evaluate the final model on the testing set using accuracy, precision, recall, and F1 score.
Expected Output
- Accuracy, precision, recall, and F1 score for the test set
Dataset
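One possible shape for the steps in this problem is sketched below, using scikit-learn's `GridSearchCV` for the cross-validated search. Several things here are assumptions rather than requirements: the function name `tune_churn_model`, the cleaning rule (coerce `total_charges` to numeric and drop incomplete rows), the grid of `C` values, 5-fold cross-validation, and F1 as the selection metric. The data is synthetic, and the contract type labels are invented for the demo.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["tenure", "monthly_charges", "total_charges"]
CATEGORICAL = ["contract_type"]


def tune_churn_model(df, seed=42):
    """Clean df, tune C by cross-validation, and return (best_params, test metrics)."""
    df = df.copy()
    # Assumed cleaning: force total_charges to numeric and drop incomplete rows.
    df["total_charges"] = pd.to_numeric(df["total_charges"], errors="coerce")
    df = df.dropna(subset=NUMERIC + CATEGORICAL + ["churn"])
    X, y = df[NUMERIC + CATEGORICAL], df["churn"].astype(int)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
    # Assumed search space: only the inverse regularization strength C.
    search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=5, scoring="f1")
    search.fit(X_train, y_train)  # cross-validation uses the training split only
    y_pred = search.predict(X_test)  # predicts with the refitted best estimator
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
    }
    return search.best_params_, metrics


# Synthetic stand-in data (NOT the exercise CSV); contract labels are invented.
rng = np.random.default_rng(0)
n = 500
tenure = rng.integers(1, 72, n)
monthly = rng.uniform(20, 120, n)
contract = rng.choice(["month-to-month", "one-year", "two-year"], n)
demo = pd.DataFrame({
    "customer_id": np.arange(n),
    "tenure": tenure,
    "monthly_charges": monthly,
    "total_charges": tenure * monthly,
    "contract_type": contract,
})
score = (contract == "month-to-month") * 1.5 - tenure / 36 + rng.normal(0, 1, n)
demo["churn"] = (score > 0).astype(int)

best_params, metrics = tune_churn_model(demo)
print("best params:", best_params)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Keeping the test split out of the search means the reported metrics come from data the tuning never saw. A wider grid (for example, different penalties or class weights) would follow the same pattern.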