The goal of this project is to distill the main topics of each State of the Union speech delivered by a president between 1790 and 2022.
As an initial pass, I preprocessed the data with spaCy, vectorized it with scikit-learn's CountVectorizer, did some dimensionality reduction with TruncatedSVD, and pulled out the top 10 topics (a rough sketch of this pipeline appears after the results):
Topic 0: government, year, congress, united, states, country, state, great, law, people
Topic 1: program, year, world, new, work, need, help, america, nation, federal
Topic 2: program, dollar, year, fiscal, united, war, expenditure, policy, administration, states
Topic 3: man, law, court, service, business, department, dollar, legislation, national, need
Topic 4: war, dollar, man, expenditure, power, people, great, peace, public, state
Topic 5: nation, administration, state, policy, man, energy, effort, continue, program, power
Topic 6: mexico, united, war, states, american, texas, mexican, man, peace, army
Topic 7: mexico, country, nof, nthe, texas, mexican, nto, nand, army, public
Topic 8: state, constitution, dollar, american, government, program, department, business, canal, united
Topic 9: world, government, nof, nthe, american, free, shall, nand, nto, great
The results were vague, with a lot of repetition, and several topics are dominated by artifacts like 'nof' and 'nthe'. In practical terms, this is largely meaningless. I need to rethink my approach.
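For reference, a minimal sketch of that first pass looks roughly like the following; the `speeches` list of raw address strings and the vectorizer parameters are placeholders rather than the exact values I used:

```python
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def preprocess(text):
    # Lemmatize and drop stop words, punctuation, and whitespace tokens.
    doc = nlp(text)
    return " ".join(
        tok.lemma_.lower()
        for tok in doc
        if not tok.is_stop and not tok.is_punct and not tok.is_space
    )

# `speeches` stands in for the list of raw address strings.
cleaned = [preprocess(s) for s in speeches]

vectorizer = CountVectorizer(max_df=0.95, min_df=2)
counts = vectorizer.fit_transform(cleaned)

svd = TruncatedSVD(n_components=10, random_state=42)
svd.fit(counts)

# Print the 10 highest-loading terms for each of the 10 components.
terms = vectorizer.get_feature_names_out()
for i, component in enumerate(svd.components_):
    top = component.argsort()[::-1][:10]
    print(f"Topic {i}:", ", ".join(terms[j] for j in top))
```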
- Text Preprocessing: I should try adding more custom stop words, and maybe cleaning up the dataset with regex to catch errors like 'nto' and 'nof'.
- Vectorization: I should try something other than a CountVectorizer for this dataset, since those results carry a lot of noise; TF-IDF seems like a good option (both of these changes are sketched after this list).
- Dimensionality Reduction: I should try additional techniques beyond TruncatedSVD, such as NMF or LDA.
- Results: Instead of 10 topics overall, I should produce 3-5 topics on a per-document basis so they are more distinct (also sketched after this list).
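A rough sketch of the first two changes, with `speeches` as before and a purely hypothetical set of custom stop words (the real additions would come from inspecting the topics above):

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

# Hypothetical additions based on the repetitive terms in the first-pass topics.
CUSTOM_STOP_WORDS = {"year", "state", "government", "congress", "united", "states"}

def clean(text):
    # Guess: tokens like 'nto' and 'nof' look like literal "\n" sequences fused
    # onto the following word, so strip escaped newlines before anything else.
    text = re.sub(r"\\n", " ", text)
    return re.sub(r"\s+", " ", text).strip()

cleaned = [clean(s) for s in speeches]   # `speeches` as in the sketch above

vectorizer = TfidfVectorizer(
    stop_words=list(ENGLISH_STOP_WORDS | CUSTOM_STOP_WORDS),
    max_df=0.95,
    min_df=2,
)
tfidf = vectorizer.fit_transform(cleaned)
```

The regex assumes the stray 'nto'/'nof' tokens come from escaped newlines; if the artifacts turn out to look different in the raw files, the cleanup pattern would change accordingly.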
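And a sketch of the last two changes, using NMF as one possible alternative to TruncatedSVD and reading per-document topics off the document-topic matrix (`tfidf` and `vectorizer` carry over from the sketch above; the component count is an arbitrary starting point):

```python
from sklearn.decomposition import NMF

nmf = NMF(n_components=15, init="nndsvd", random_state=42)
doc_topic = nmf.fit_transform(tfidf)          # shape: (n_documents, n_topics)
terms = vectorizer.get_feature_names_out()

def topic_label(topic_idx, n_terms=5):
    # Summarize a topic by its highest-weighted terms.
    weights = nmf.components_[topic_idx]
    top = weights.argsort()[::-1][:n_terms]
    return ", ".join(terms[j] for j in top)

def top_topics_for_doc(doc_idx, n=3):
    # Return the strongest few topics for a single speech (aiming for 3-5).
    weights = doc_topic[doc_idx]
    ranked = weights.argsort()[::-1][:n]
    return [(int(t), topic_label(t)) for t in ranked if weights[t] > 0]

print(top_topics_for_doc(0))
```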
A lot of work to be done here. But it may also be that presidents are prone to generalities in their State of the Union addresses, making it hard to find signal in a lot of noise.