This repository shows how to compute the correlation between feature pairs of the iris dataset. I obtained the iris dataset from this URL -
POC steps - I uploaded the dataset to ADLS (Azure Data Lake Storage). I ran all the Python code in a Databricks notebook. I used Spark ML through the PySpark interface.
This is the correlation computed on the whole dataset. Refer to the code in the file corr_distro.
Often the dataset is very large. In that situation, bringing the whole dataset onto the master node to calculate the correlation is not a good idea, so bootstrap sampling (sampling rows with replacement) can be used instead. Refer to the code in the file corr_with_bootstrap_sample.
This is the correlation obtained on the bootstrap sample. The result from the sample is very close to the result from the whole dataset.
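The idea that a bootstrap sample gives a correlation close to the full-data value can be illustrated without Spark. The sketch below uses plain Python and synthetic data generated on the spot (not the iris dataset). The sizes 10000 and 1000, the seed, and the helper name pearson are arbitrary choices for illustration. It is not the code from corr_with_bootstrap_sample. In PySpark, the analogous sampling step would likely be DataFrame.sample with withReplacement=True.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic "large" dataset: y depends linearly on x plus noise.
xs = [rng.uniform(0, 10) for _ in range(10000)]
ys = [2 * x + rng.gauss(0, 2) for x in xs]
pairs = list(zip(xs, ys))

# Correlation on the whole dataset.
full_corr = pearson(xs, ys)

# Bootstrap sample: draw rows WITH replacement, keeping each (x, y) pair intact.
sample = rng.choices(pairs, k=1000)
sample_xs, sample_ys = zip(*sample)
sample_corr = pearson(sample_xs, sample_ys)

print("whole dataset   :", round(full_corr, 4))
print("bootstrap sample:", round(sample_corr, 4))
```

Sampling whole rows rather than each column separately matters here: it keeps the pairing between features, which is what the correlation measures.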