NLP initial feature engineering

This notebook creates some initial features from the dataset of a Kaggle essay scoring competition and assesses their efficacy with a random forest model, using the quadratic weighted Cohen's kappa score as the metric. A number of useful starting features are identified, including:

  • Total words
  • Average word length
  • Paragraph count
  • Comma to full stop ratio
  • Conjunctions count
  • Conjunctive adverb count
  • Academic words count
  • Words per sentence
  • No space after comma count
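
As a rough illustration of how features like these could be computed, here is a minimal sketch in plain Python. It is not the notebook's code: the function name `extract_features`, the tokenisation rules, and the three small word lists are all assumptions made for this example, and the notebook may define each feature differently.

```python
import re

# Hypothetical, deliberately tiny word lists for illustration only.
CONJUNCTIONS = {"and", "but", "or", "nor", "for", "so", "yet"}
CONJUNCTIVE_ADVERBS = {"however", "therefore", "moreover", "furthermore", "consequently"}
ACADEMIC_WORDS = {"analyze", "concept", "data", "evidence", "research"}


def extract_features(essay: str) -> dict:
    """Compute simple surface features for one essay (illustrative sketch)."""
    # Words are runs of letters and apostrophes (an assumed tokenisation).
    words = re.findall(r"[A-Za-z']+", essay)
    lower = [w.lower() for w in words]
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", essay) if p.strip()]
    # Sentences are assumed to end with '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    commas = essay.count(",")
    full_stops = essay.count(".")

    return {
        "total_words": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "paragraph_count": len(paragraphs),
        "comma_to_full_stop_ratio": commas / full_stops if full_stops else 0.0,
        "conjunction_count": sum(w in CONJUNCTIONS for w in lower),
        "conjunctive_adverb_count": sum(w in CONJUNCTIVE_ADVERBS for w in lower),
        "academic_word_count": sum(w in ACADEMIC_WORDS for w in lower),
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        # A comma immediately followed by a non-whitespace character.
        "no_space_after_comma_count": len(re.findall(r",(?=\S)", essay)),
    }
```

Applying `extract_features` to every essay gives one row of numeric features per essay, which can then be passed to a model such as a random forest.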

The Kaggle competition score for this notebook is 0.73.
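
For reference, the quadratic weighted kappa metric mentioned above can be sketched in plain Python as follows. This is an illustrative implementation, not the notebook's code or the competition's scorer: the function name and the `min_rating`/`max_rating` parameters are assumptions, and ratings are assumed to be integers in that range.

```python
def quadratic_weighted_kappa(y_true, y_pred, min_rating, max_rating):
    """Quadratic weighted Cohen's kappa for integer ratings (illustrative sketch)."""
    n = max_rating - min_rating + 1
    if n < 2:
        raise ValueError("need at least two distinct rating levels")

    # Observed confusion matrix: rows are true ratings, columns are predictions.
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_rating][p - min_rating] += 1

    total = len(y_true)
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            # Penalty grows with the squared distance between the two ratings.
            weight = (i - j) ** 2 / (n - 1) ** 2
            # Count expected if true and predicted ratings were independent.
            expected = hist_true[i] * hist_pred[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected

    # With no expected disagreement, treat agreement as perfect.
    return 1.0 - numerator / denominator if denominator else 1.0
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean worse than chance. In practice, scikit-learn's `cohen_kappa_score` with `weights="quadratic"` computes the same metric.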

The notebook can be found here.