Purpose
- I want to understand how a product's badge (such as Etsy’s Pick) relates to its discount rate and its number of reviews. The target variable is the badge. The predictor variables (used to predict the target variable) are the discount rate and the number of reviews.
Content
- I web-scraped the product name, original price, discounted price, number of reviews, and badge for products in the 'books movies and music' category on Etsy. I scraped 100 pages in total, and the resulting dataset contains 6,400 entries.
To calculate Discount Rate
- I used the original price and the discounted price to find the discount rate of the product. The equation is as follows.
(Original Price - Discounted Price) / Original Price * 100
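As a minimal sketch of the formula above in pandas: the column names `Original Price`, `Discounted Price`, and `Discount Rate`, as well as the sample rows, are assumptions made for illustration and may not match the actual scraped DataFrame.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from the scraped Etsy pages.
df = pd.DataFrame({
    "Original Price": [20.0, 50.0, 10.0],
    "Discounted Price": [15.0, 50.0, 4.0],
})

# (Original Price - Discounted Price) / Original Price * 100
df["Discount Rate"] = (
    (df["Original Price"] - df["Discounted Price"]) / df["Original Price"] * 100
)

print(df)
```

A product that is not on sale has equal original and discounted prices, so its discount rate comes out as 0.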
To get numeric Review Counts
- I wrote code that removes the parentheses and the commas between digits so that the number of reviews contains only numeric values.
df['Review Counts'] = pd.to_numeric(df['Review Counts'].apply(lambda x: re.sub(r'[^\d.]', '', x)), errors='coerce')
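To illustrate what the line above does, here is a small self-contained run on made-up input. The raw strings such as `(1,234)` are an assumption about how the scraped review counts looked, based on the description of removing parentheses and commas.

```python
import re

import pandas as pd

# Hypothetical raw values, assumed to look like the scraped text.
df = pd.DataFrame({"Review Counts": ["(1,234)", "(12)", "(3,456,789)"]})

# Keep only digits and dots, then convert the cleaned strings to numbers.
# errors='coerce' turns anything that still cannot be parsed into NaN.
df["Review Counts"] = pd.to_numeric(
    df["Review Counts"].apply(lambda x: re.sub(r"[^\d.]", "", x)),
    errors="coerce",
)

print(df["Review Counts"].tolist())
```

Note that `re.sub` expects a string, so rows with a missing review count would need to be filled or converted with `astype(str)` before this step.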
To find out the correlation between the categorical variable and numeric variables
- Since the badge was stored as a categorical variable, either 'No badge' or 'Etsy's Pick', I converted these values to 0 and 1 respectively to turn the badge into a numeric variable.
- I measured the strength and direction of the linear relationship between the two variables using the Pearson correlation coefficient. The value was 0.10971606884428392, which showed little linear relationship between the two variables because a value close to zero came out.
- I calculated the Point-Biserial Correlation Coefficient to find out the correlation between the badge and the discount rate.
- This value ranges from -1 to 1, and the closer it is to zero, the weaker the correlation.
- The P-value in the picture indicates statistical significance. If this value is less than 0.05, it is considered statistically significant.
- The Point-Biserial Correlation Coefficient of the two variables is quite large in magnitude at -0.5944. This indicates a fairly strong negative correlation. The P-value is reported as 0.0 (that is, too small to display), indicating that the relationship is statistically very significant. So, products with a badge tended to have a lower Discount Rate than products without one.
- In other words, a product with a badge is likely to have a low discount rate.
- For the badge and the number of reviews, the Point-Biserial Correlation Coefficient is much smaller in magnitude, at -0.034238652504330114. This indicates a very weak negative correlation. The P-value of 0.006155887265821593 is below 0.05, so the result is statistically significant, but the magnitude of the correlation is very small.
- Since there is almost no correlation between these two variables, I conclude that there is no meaningful linear relationship between them.
- EDA Guide: PS-HW1-EDA-Guide-EunSuSeo.pdf
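The badge encoding and correlation steps described above can be sketched as follows. This is only an illustration on made-up data: the column names `Badge`, `Badge Code`, and `Discount Rate` and the sample values are assumptions, and using `scipy.stats` is one possible way to get the coefficient together with its P-value, not necessarily the one used in the analysis.

```python
import pandas as pd
from scipy import stats

# Hypothetical sample; the real analysis uses the scraped Etsy data.
df = pd.DataFrame({
    "Badge": ["No badge", "Etsy's Pick", "No badge", "Etsy's Pick",
              "No badge", "No badge", "Etsy's Pick", "No badge"],
    "Discount Rate": [40.0, 10.0, 55.0, 0.0, 30.0, 60.0, 20.0, 45.0],
})

# Encode the categorical badge as 0 ('No badge') and 1 ('Etsy's Pick').
df["Badge Code"] = df["Badge"].map({"No badge": 0, "Etsy's Pick": 1})

# Point-biserial correlation between the binary badge and the discount rate.
r_pb, p_pb = stats.pointbiserialr(df["Badge Code"], df["Discount Rate"])

# Pearson correlation on the same 0/1 coding, for comparison.
r_pearson, p_pearson = stats.pearsonr(df["Badge Code"], df["Discount Rate"])

print(f"point-biserial r = {r_pb:.4f}, p = {p_pb:.4f}")
print(f"pearson        r = {r_pearson:.4f}, p = {p_pearson:.4f}")
```

The point-biserial coefficient is mathematically the Pearson coefficient computed with one variable coded as 0/1, so both calls give the same value here. A negative coefficient means the group coded 1 has the lower average discount rate.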