This repository shows how to compute the correlation between feature pairs of the Iris dataset. I obtained the dataset from this URL: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
POC steps: I uploaded the dataset to ADLS (Azure Data Lake Storage), ran all the Python code in a Databricks notebook, and used Spark ML through the PySpark interface.
The code in the file corr_distro computes the correlation on the whole dataset.
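
As a rough illustration of what "correlation between feature pairs" means, here is a minimal plain-Python sketch of the Pearson correlation for every pair of columns. It is not the code from corr_distro, which runs on Spark ML; the helper names `pearson` and `pairwise_correlations` are hypothetical, and the rows are made-up values, not taken from the dataset. The column names assume the attribute order documented for iris.data, since the file has no header row.

```python
import math
from itertools import combinations

# Assumed column names: iris.data has no header row, so these follow the
# attribute order documented alongside the dataset.
FEATURES = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Made-up rows for illustration only; they are NOT taken from the dataset.
ROWS = [
    (5.0, 3.4, 1.5, 0.2),
    (6.4, 3.0, 4.5, 1.5),
    (5.8, 2.7, 5.1, 1.9),
    (4.9, 3.1, 1.4, 0.1),
    (6.9, 3.1, 5.4, 2.1),
    (5.5, 2.4, 3.8, 1.1),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlations(rows, names):
    """Map each unordered pair of column names to its Pearson correlation."""
    columns = list(zip(*rows))  # transpose rows into columns
    return {
        (names[i], names[j]): pearson(columns[i], columns[j])
        for i, j in combinations(range(len(names)), 2)
    }


correlations = pairwise_correlations(ROWS, FEATURES)
for pair, r in correlations.items():
    print(pair, round(r, 3))
```

On a Spark cluster, the same matrix is usually obtained with `pyspark.ml.stat.Correlation.corr` on a vector column built by `VectorAssembler`. Whether corr_distro takes exactly that route is an assumption here.
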
Often the dataset is very large. In that situation, bringing the whole dataset to the master node to calculate the correlation is not a good idea, so bootstrap sampling (drawing a random sample with replacement) can be used instead. Refer to the code in the file corr_with_bootstrap_sample.
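
The idea can be sketched in plain Python as follows. This is not the code from corr_with_bootstrap_sample: the data is synthetic, and the population size (5000), sample size (1000), and seed are arbitrary assumptions chosen only to show that a sample drawn with replacement gives a correlation close to the one computed on all rows.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic stand-in for a large dataset: y depends linearly on x plus noise.
population = []
for _ in range(5000):
    x = rng.gauss(0.0, 1.0)
    y = 0.8 * x + rng.gauss(0.0, 0.5)
    population.append((x, y))

# Bootstrap-style sample: rows drawn at random WITH replacement.
sample = rng.choices(population, k=1000)

full_r = pearson(*zip(*population))
sample_r = pearson(*zip(*sample))

print("correlation on all rows:  ", round(full_r, 3))
print("correlation on the sample:", round(sample_r, 3))
```

In PySpark, a comparable sample can be drawn with `DataFrame.sample(withReplacement=True, fraction=...)` before computing the correlation, so that only the sampled rows are processed.
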
This is the correlation obtained on the bootstrap sample. The results from the sample are very close to the results from the whole dataset.
![image](https://private-user-images.githubusercontent.com/104931628/253931903-5f8cd71d-9ae8-4f0d-8c10-cb6d8d9ab4ba.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk2NDEyOTUsIm5iZiI6MTczOTY0MDk5NSwicGF0aCI6Ii8xMDQ5MzE2MjgvMjUzOTMxOTAzLTVmOGNkNzFkLTlhZTgtNGYwZC04YzEwLWNiNmQ4ZDlhYjRiYS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjE1JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIxNVQxNzM2MzVaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1mY2RmNjk4ZjVkMWVhZGIyYmY3OWE1NWFkNTU1ZjE1NjJlZTgwMTFlZmQ5NGYwMmNlOTBkMzk4ZGNhZWVmYjU4JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.TZLriKjLHkehNsD0Svyy2f19JQ6WOxl0cnWWwCFoSfo)