Topic Modeling State of the Union Addresses 1790-2022

The goal of this project is to distill the main topics of each State of the Union speech delivered by a president between 1790 and 2022.

MVP Script

As an initial pass, I preprocessed the text with spaCy, vectorized it with scikit-learn's CountVectorizer, reduced the dimensionality with TruncatedSVD, and pulled out the top 10 topics:

Topic 0: government, year, congress, united, states, country, state, great, law, people

Topic 1: program, year, world, new, work, need, help, america, nation, federal

Topic 2: program, dollar, year, fiscal, united, war, expenditure, policy, administration, states

Topic 3: man, law, court, service, business, department, dollar, legislation, national, need

Topic 4: war, dollar, man, expenditure, power, people, great, peace, public, state

Topic 5: nation, administration, state, policy, man, energy, effort, continue, program, power

Topic 6: mexico, united, war, states, american, texas, mexican, man, peace, army

Topic 7: mexico, country, nof, nthe, texas, mexican, nto, nand, army, public

Topic 8: state, constitution, dollar, american, government, program, department, business, canal, united

Topic 9: world, government, nof, nthe, american, free, shall, nand, nto, great

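The results were vague and repetitive: terms such as 'program', 'dollar', and 'war' recur across several topics, and Topics 7 and 9 contain artifacts like 'nof' and 'nthe'. In practical terms, the topics are largely meaningless. I need to rethink my approach.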

  • Text Preprocessing: I should add more custom stop words and clean up the dataset with regex to remove artifacts like 'nto' and 'nof', which look like leftover newline characters fused onto the following word.

  • Vectorization: I should try something other than CountVectorizer for this dataset, as the raw counts produce a lot of noise. TF-IDF seems like a good option.

  • Dimensionality Reduction: I should try additional techniques for dimensionality reduction beyond TruncatedSVD.

  • Results: Instead of 10 topics overall, I should produce 3-5 topics per document so they are more distinct.
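
As a rough illustration of the first, second, and fourth points, the sketch below cleans the text with regex and then ranks terms per document by a hand-rolled TF-IDF score. It rests on several assumptions: that tokens like 'nto' come from literal `\n` sequences left in the raw text, that a short hard-coded list of fused tokens is enough to catch them, and that top TF-IDF terms are an acceptable stand-in for per-document topics. All names here (`clean`, `tokenize`, `top_terms_per_document`, `STOP_WORDS`) are hypothetical, and in practice scikit-learn's TfidfVectorizer would likely replace the manual scoring.

```python
import math
import re
from collections import Counter

# Assumption: artifacts such as 'nto' and 'nof' come from a literal
# backslash + n left in the raw text, or from that sequence after the
# backslash was stripped.
LITERAL_NEWLINE = re.compile(r"\\n")
FUSED_TOKEN = re.compile(r"\bn(to|of|the|and)\b")

# Hypothetical, deliberately tiny stop-word list for the sketch.
STOP_WORDS = {"the", "of", "to", "and", "a", "in", "we", "our", "must"}


def clean(text):
    """Lowercase the text and repair newline artifacts."""
    text = LITERAL_NEWLINE.sub(" ", text)
    return FUSED_TOKEN.sub(r"\1", text.lower())


def tokenize(text):
    """Split cleaned text into alphabetic tokens, dropping stop words."""
    return [t for t in re.findall(r"[a-z]+", clean(text)) if t not in STOP_WORDS]


def top_terms_per_document(docs, k=3):
    """Return the k highest TF-IDF terms for each document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:k])
    return results
```

A term that appears in every speech scores zero under this weighting, which is the property that should push down words like 'year' and 'government' that dominate the count-based topics.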

There is a lot of work to be done here. It may also be that presidents are prone to generalities in their State of the Union addresses, which would make it hard to find signal in the noise.