1) Sentence Tokenization (paragraph to sentences, split on . ! ? ;)
2) Word Tokenization (sentence to words, split on spaces and separators such as _ :)
3) Punctuation and special-character removal, and converting text to lowercase
4) Stop-word removal (is, a, an, the, them, couldn't, ....)
5) Lemmatization and Stemming (reduce words to their root form)
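The five preprocessing steps above can be sketched in plain Python. This is a minimal, standard-library-only illustration (a real pipeline would use NLTK or spaCy for tokenization, stop words, and stemming); the tiny stop-word set and the "strip -ing" stemming rule are simplifications for demonstration.

```python
import re

# Tiny illustrative stop-word list (real lists have ~100+ entries)
STOP_WORDS = {"is", "a", "an", "the", "them", "couldn't"}

def preprocess(paragraph):
    # 1) Sentence tokenization: split on . ! ? ;
    sentences = [s.strip() for s in re.split(r"[.!?;]", paragraph) if s.strip()]
    cleaned = []
    for sentence in sentences:
        # 2) Word tokenization: split on whitespace
        words = sentence.split()
        # 3) Lowercase and strip punctuation / special characters
        words = [re.sub(r"[^a-z0-9]", "", w.lower()) for w in words]
        # 4) Stop-word removal
        words = [w for w in words if w and w not in STOP_WORDS]
        # 5) Crude stemming: drop a common suffix (a real stemmer is smarter)
        words = [w[:-3] if w.endswith("ing") and len(w) > 5 else w for w in words]
        cleaned.append(words)
    return cleaned

print(preprocess("The cat is sleeping. A dog barked!"))
# → [['cat', 'sleep'], ['dog', 'barked']]
```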
1) Generate a Word Cloud to visualize the most frequent words in the data.
- TF-IDF (Term Frequency-Inverse Document Frequency)
- Score of a word in a particular row = (count of the word in the row / total number of words in the row) * log(total number of rows / number of rows containing the word)
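The TF-IDF formula above can be computed by hand for a small corpus. The sketch below follows the formula in these notes exactly; note that scikit-learn's TfidfVectorizer uses a smoothed variant, so its numbers differ slightly.

```python
import math

def tf_idf(rows):
    """score = (count of word in row / total words in row)
               * log(number of rows / number of rows containing the word)"""
    n_rows = len(rows)
    scores = []
    for row in rows:
        row_scores = {}
        for word in set(row):
            tf = row.count(word) / len(row)               # term frequency
            df = sum(1 for r in rows if word in r)        # rows containing the word
            row_scores[word] = tf * math.log(n_rows / df) # inverse document frequency
        scores.append(row_scores)
    return scores

rows = [["cat", "sat", "mat"], ["cat", "ran"], ["dog", "ran"]]
scores = tf_idf(rows)
# "cat" appears in 2 of 3 rows, so its score in row 0 is (1/3) * log(3/2)
```

Words that appear in every row get a score of 0 (log(1) = 0), which is exactly why TF-IDF down-weights common words.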
Features (X-axis) (2D Matrix)
Targets (Y-axis) (1D Array)
Train/test split, random state
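A train/test split with a fixed random state can be sketched as below. This is a hand-rolled illustration of what scikit-learn's train_test_split does (the real function also handles arrays, stratification, and more); the function name here mirrors it for familiarity.

```python
import random

def train_test_split(X, y, test_size=0.25, random_state=42):
    indices = list(range(len(X)))
    # Seeding with random_state makes the shuffle (and the split) reproducible
    random.Random(random_state).shuffle(indices)
    n_test = int(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

X = [[i] for i in range(8)]  # features: 2D matrix (list of rows)
y = list(range(8))           # targets: 1D array
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
```

Re-running with the same random_state always yields the same split, which keeps experiments comparable.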
1) Import model
2) Initialize
3) Fit (Learning process)
4) Transform
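The import → initialize → fit → transform pattern above is how scikit-learn transformers (CountVectorizer, StandardScaler, TfidfVectorizer) work. A minimal hand-written transformer in the same shape, for illustration only:

```python
class CountVectorizer:
    """Toy bag-of-words transformer following the fit/transform pattern."""

    def fit(self, docs):
        # 3) Fit: the learning step -- build the vocabulary from training data
        vocab = sorted({word for doc in docs for word in doc.split()})
        self.vocabulary_ = {word: i for i, word in enumerate(vocab)}
        return self

    def transform(self, docs):
        # 4) Transform: apply what was learned -- map docs to count vectors
        matrix = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for word in doc.split():
                if word in self.vocabulary_:
                    row[self.vocabulary_[word]] += 1
            matrix.append(row)
        return matrix

vectorizer = CountVectorizer()             # 2) Initialize
vectorizer.fit(["cat sat", "dog ran"])     # 3) Fit
X = vectorizer.transform(["cat cat dog"])  # 4) Transform
# → [[2, 1, 0, 0]] over the vocabulary {cat, dog, ran, sat}
```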
1) Import model
2) Initialize
3) Fit (learning process)
4) Predict
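The model workflow above differs from the transformer workflow only in the last step: predict instead of transform. A minimal estimator in the same shape, assuming a trivial "predict the most frequent class" strategy (scikit-learn ships this as DummyClassifier(strategy="most_frequent")):

```python
from collections import Counter  # 1) Import

class MajorityClassifier:
    """Toy classifier following the fit/predict pattern."""

    def fit(self, X, y):
        # 3) Fit: the learning step -- remember the most common target value
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # 4) Predict: return the learned class for every input row
        return [self.majority_ for _ in X]

clf = MajorityClassifier()                          # 2) Initialize
clf.fit([[1], [2], [3]], ["spam", "ham", "spam"])   # 3) Fit
preds = clf.predict([[4], [5]])                     # 4) Predict
# → ["spam", "spam"]
```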
1) Regression - The evaluation metric for regression is R^2, which ranges from negative infinity to 1.
A higher R^2 means a better model.
2) Classification - The evaluation metrics for classification are:
1) Accuracy score [ higher accuracy is better; the value should be near 1 ]
2) F1 score [ ranges from 0 (low) to 1 (high); a higher F1 score is better ]
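The three metrics above can be computed by hand to see what they measure. These are plain-Python sketches; in practice scikit-learn provides r2_score, accuracy_score, and f1_score in sklearn.metrics.

```python
def r2_score(y_true, y_pred):
    # 1 - (residual sum of squares / total sum of squares);
    # at most 1, and can be arbitrarily negative for a bad model
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def accuracy_score(y_true, y_pred):
    # Fraction of predictions that match the true labels
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    # Harmonic mean of precision and recall for the positive class
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(accuracy_score([1, 0, 1, 1], [1, 0, 0, 1]))  # → 0.75
print(f1_score([1, 0, 1, 1], [1, 0, 0, 1]))        # → 0.8
```

F1 is useful when classes are imbalanced: a model that always predicts the majority class can score high accuracy yet have an F1 of 0 on the minority class.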