Importing the necessary libraries, reading the data and performing basic checks on the data
# importing the required libraries
import numpy as np
import pandas as pd
pd.set_option("display.precision", 2)
import seaborn as sns
sns.set_style('whitegrid')
import matplotlib.pyplot as plt
import re
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
# importing and reading the .csv file
df = pd.read_csv('ResumeDataSet.csv')
print("The number of rows are", df.shape[0],"and the number of columns are", df.shape[1])
df.head()
The number of rows are 962 and the number of columns are 2
   Category      Resume
0  Data Science  Skills * Programming Languages: Python (pandas...
1  Data Science  Education Details \r\nMay 2013 to May 2017 B.E...
2  Data Science  Areas of Interest Deep Learning, Control Syste...
3  Data Science  Skills • R • Python • SAP HANA • Table...
4  Data Science  Education Details \r\n MCA YMCAUST, Faridab...
# Checking the information of the dataframe (i.e. the dataset)
df.info()
# Checking the number of unique values in each column
df.nunique()
Category 25
Resume 166
dtype: int64
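The counts above show 166 distinct resumes across 962 rows, so most rows repeat a text that already appears elsewhere. Below is a minimal sketch of how such repeats can be counted, using only the standard library. The `rows` list is a made-up three-row sample, not the dataset, and `duplicate_summary` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical stand-in for the (Category, Resume) rows of the dataset.
rows = [
    ("Data Science", "python pandas numpy"),
    ("Data Science", "python pandas numpy"),   # exact repeat of the first row
    ("Testing", "manual testing selenium"),
]

def duplicate_summary(rows):
    """Return (total rows, distinct rows, rows that repeat an earlier row)."""
    counts = Counter(rows)
    total = len(rows)
    distinct = len(counts)
    return total, distinct, total - distinct

total, distinct, repeats = duplicate_summary(rows)
print(f"{total} rows, {distinct} distinct, {repeats} repeats")
```

On the real dataframe, `df['Resume'].duplicated().sum()` should give the corresponding figure for the Resume column. Repeats are worth keeping in mind later, since a random train/test split can place copies of the same resume on both sides.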
Plotting the share of each Category as a count plot and pie plot
# Plotting the distribution of Categories as a Count Plot
plt.figure(figsize=(15, 15))
sns.countplot(y="Category", data=df)
df["Category"].value_counts()
# Plotting the distribution of Categories as a Pie Plot
plt.figure(figsize=(18, 18))
# value_counts() returns the counts indexed by category name; reading both from the
# Series avoids relying on reset_index() column names, which differ between pandas versions
counts = df['Category'].value_counts()
plt.title("Categorywise Distribution", fontsize=20)
plt.pie(counts.values, labels=counts.index, autopct='%1.2f%%', shadow=True)
df["Category"].value_counts()*100/df.shape[0]
Cleaning out all the unnecessary content from the Resume column
# Function to clean the data
def clean(data):
    data = re.sub(r'http\S+\s*', ' ', data)  # Removing the links
    data = re.sub(r'RT|cc', ' ', data)  # Removing RT and cc (also matches inside words, e.g. "accept" becomes "a ept")
    data = re.sub(r'#\S+', ' ', data)  # Removing the hashtags
    data = re.sub(r'@\S+', ' ', data)  # Removing the mentions
    data = data.lower()  # Changing the text to lowercase
    data = ''.join([i if 32 < ord(i) < 128 else ' ' for i in data])  # Replacing non-ASCII and control characters with spaces
    data = re.sub(r'\s+', ' ', data)  # Collapsing extra whitespace
    data = re.sub('[%s]' % re.escape("""!"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"""), ' ', data)  # Removing punctuation
    return data

cleaned_df = df['Category'].to_frame()
cleaned_df['Resume'] = df['Resume'].apply(clean)  # Applying the clean function
cleaned_df
     Category      Resume
0    Data Science  skills programming languages python pandas...
1    Data Science  education details may 2013 to may 2017 b e ...
2    Data Science  areas of interest deep learning control syste...
3    Data Science  skills r python sap hana table...
4    Data Science  education details mca ymcaust faridabad...
...  ...           ...
957  Testing       computer skills proficient in ms office ...
958  Testing       willingnes to a ept the challenges po...
959  Testing       personal skills quick learner eagerne...
960  Testing       computer skills software knowledge ms power ...
961  Testing       skill set os windows xp 7 8 8 1 10 database my...

962 rows × 2 columns
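Row 958 of the cleaned output reads "a ept", which is what is left of "accept" once the `RT|cc` pattern removes the "cc" inside the word. One possible variant, sketched here as an assumption rather than the notebook's method, anchors those tokens with word boundaries so that only standalone "RT" and "cc" are dropped. The function name `clean_tokens_only` is hypothetical.

```python
import re

def clean_tokens_only(text):
    """Variant of the cleaning step that removes 'RT' and 'cc' only as whole words."""
    text = re.sub(r'http\S+\s*', ' ', text)      # links
    text = re.sub(r'\b(?:RT|cc)\b', ' ', text)   # standalone RT / cc tokens only
    text = re.sub(r'[#@]\S+', ' ', text)         # hashtags and mentions
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', ' ', text)     # anything that is not a letter, digit or whitespace
    return re.sub(r'\s+', ' ', text).strip()     # collapse whitespace last

print(clean_tokens_only("Willingness to accept the challenges, cc @hr http://example.com"))
```

Collapsing whitespace as the final step also avoids the runs of spaces that punctuation removal would otherwise leave behind.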
Encoding the Category data
# Encoding the Category column using LabelEncoder
encoder = LabelEncoder()
cleaned_df['Category'] =encoder.fit_transform(cleaned_df['Category'])
cleaned_df
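`LabelEncoder` replaces each distinct category name with an integer, numbering the classes in sorted order. The dependency-free sketch below mimics that mapping to make the encoded column easier to read. The sample labels and both helper names are illustrative assumptions, not part of the notebook.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, numbering the labels in sorted order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}

def inverse_mapping(mapping):
    """Build the reverse lookup, from integer code back to label."""
    return {index: label for label, index in mapping.items()}

sample = ["Testing", "Data Science", "HR", "Data Science"]  # illustrative values only
mapping = fit_label_mapping(sample)
encoded = [mapping[label] for label in sample]
decoded = [inverse_mapping(mapping)[code] for code in encoded]
print(mapping)
print(encoded)
```

In the notebook itself, `encoder.classes_` and `encoder.inverse_transform` serve the same purpose when the integer codes in the classification report need to be read back as category names.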
# Creating a Word Vectorizer and transforming it
Resume = cleaned_df['Resume'].values
Category = cleaned_df['Category'].values
word_vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english', max_features=1000)
word_vectorizer.fit(Resume)
WordFeatures=word_vectorizer.transform(Resume)
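To make the vectorizer settings more concrete, here is a rough pure-Python sketch of the weighting that `sublinear_tf=True` implies together with scikit-learn's default smoothed IDF and L2 normalisation: term frequency becomes `1 + ln(tf)`, IDF is `ln((1 + n) / (1 + df)) + 1`, and each row is scaled to unit length. It is a simplification. It splits on whitespace and ignores stop-word removal and the `max_features` cap, and the toy corpus and the name `tfidf_rows` are assumptions.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Sublinear TF-IDF with smoothed IDF and L2-normalised rows."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Sublinear scaling: a term seen twice weighs 1 + ln(2), not 2.
        weights = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = ["python python sql", "python testing", "manual testing"]  # toy corpus
rows = tfidf_rows(docs)
print(rows[0])
```

The logarithm dampens repeated keywords, which seems a sensible choice for resumes where the same skill name can appear many times.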
Training our Machine Learning Model
Splitting the dataset into train and test data
# Splitting the data into train and test sets, printing the shape of each and running KNeighborsClassifier with the OneVsRest method
X_train, X_test, y_train, y_test = train_test_split(WordFeatures, Category, random_state=2, test_size=0.2)
print(f'The shape of the training data {X_train.shape}')
print(f'The shape of the test data {X_test.shape}')
clf=OneVsRestClassifier(KNeighborsClassifier())
clf.fit(X_train, y_train)
The shape of the training data (769, 1000)
The shape of the test data (193, 1000)
OneVsRestClassifier(estimator=KNeighborsClassifier())
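As a rough illustration of what the fitted model does at prediction time, the sketch below scores each class by the share of the k nearest training points carrying that label and returns the best-scoring class, which is the one-vs-rest idea applied to k-nearest neighbours. It is a toy version under stated assumptions: dense 2-D points instead of sparse TF-IDF rows, Euclidean distance, ties resolved in favour of the lowest label, and hypothetical helper names. scikit-learn's `KNeighborsClassifier()` uses five neighbours by default.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ovr_knn_predict(train_X, train_y, query, k=5):
    """Score every class by its share among the k nearest neighbours, then pick the top score."""
    nearest = sorted(range(len(train_X)), key=lambda i: euclidean(train_X[i], query))[:k]
    neighbour_labels = [train_y[i] for i in nearest]
    scores = {label: neighbour_labels.count(label) / k for label in sorted(set(train_y))}
    best = max(scores, key=scores.get)
    return best, scores

# Toy training set: two well-separated clusters with labels 0 and 1.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]
print(ovr_knn_predict(train_X, train_y, (0.5, 0.5), k=3))
```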
Computing the accuracy metrics and classification report
# Predicting the values using the model built with train data and checking the appropriate metrics
prediction = clf.predict(X_test)
print(f'Accuracy of KNeighbors Classifier on test set: {clf.score(X_test, y_test):.2f}\n')
print(f'The classification report \n{metrics.classification_report(y_test, prediction)}\n\n')
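For readers who want to see how the figures in the report are derived, the following sketch computes accuracy and per-class precision, recall, F1 and support by hand. The label lists are invented for illustration and the function names are hypothetical. The real report is produced by `metrics.classification_report` as shown above.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_report(y_true, y_pred):
    """Precision, recall, F1 and support for every label seen in either list."""
    report = {}
    pairs = list(zip(y_true, y_pred))
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}
    return report

y_true = [0, 0, 1, 1, 2, 2]  # invented labels
y_pred = [0, 1, 1, 1, 2, 0]  # invented predictions
print(f"accuracy: {accuracy(y_true, y_pred):.2f}")
for label, row in per_class_report(y_true, y_pred).items():
    print(label, row)
```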