This project is an intelligent system that learns to answer Test of English as a Foreign Language (TOEFL) synonym questions. The system approximates the semantic similarity (closeness of meaning) of words using data from English texts.
The program answers multiple-choice questions of the form: given a word w, which of the words s1, s2, s3, s4 is a synonym of w?
The program computes the semantic similarities of (w, s1), (w, s2), (w, s3), (w, s4) and chooses the word whose similarity to w is the highest.
To measure the semantic similarity of a pair of words, the program computes a semantic descriptor vector for each word, and then takes the similarity measure to be the cosine similarity between the two vectors.
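As a rough illustration of the cosine similarity step, here is a minimal sketch. It assumes each vector is represented as a dictionary mapping a word to a count. The name `sparse_cosine` and the choice to return 0.0 for an empty vector are assumptions made for this example, not necessarily what synonyms.py does.

```python
import math


def sparse_cosine(vec1, vec2):
    """Cosine similarity of two sparse vectors stored as {word: count} dicts.

    Hypothetical helper for illustration only.
    """
    # Only words present in both vectors contribute to the dot product.
    dot = sum(count * vec2[word] for word, count in vec1.items() if word in vec2)
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    if norm1 == 0 or norm2 == 0:
        # Assumption: treat an empty vector as having no similarity to anything.
        return 0.0
    return dot / (norm1 * norm2)


# Made-up vectors: they share the keys "b" and "c".
u = {"a": 1, "b": 2, "c": 3}
v = {"b": 4, "c": 5, "d": 6}
similarity = sparse_cosine(u, v)  # 23 / sqrt(14 * 77)
```

Two vectors with no words in common get a similarity of 0, and a vector compared with itself gets a similarity of 1 (up to floating-point rounding).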
The semantic descriptor vector (descw) of a word w, computed using a text with n words denoted by (w1, w2, ..., wn), has n dimensions. The i-th coordinate of descw is the number of sentences in which both w and wi occur. In this program the semantic descriptor vector is stored as a dictionary, so that the zeros corresponding to words that never co-occur with w are not stored.
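The definition above can be sketched on a tiny input. This example assumes the text has already been split into sentences, each given as a list of lowercase words. The function name `build_descriptors`, the sample sentences, and the choice to leave a word out of its own descriptor are assumptions for illustration.

```python
from collections import defaultdict


def build_descriptors(sentences):
    """Return {word: {other_word: number of sentences containing both}}.

    Hypothetical helper for illustration only.
    """
    descriptors = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        # Use a set so a word repeated within one sentence is counted once.
        unique_words = set(sentence)
        for word in unique_words:
            for other in unique_words:
                if other != word:  # assumption: skip the word itself
                    descriptors[word][other] += 1
    # Convert to plain dicts; words that never co-occur are simply absent.
    return {word: dict(counts) for word, counts in descriptors.items()}


# Made-up sentences, already tokenized.
sentences = [
    ["i", "am", "a", "sick", "man"],
    ["i", "am", "a", "spiteful", "man"],
    ["i", "believe", "my", "liver", "is", "diseased"],
]
desc = build_descriptors(sentences)
# "man" shares two sentences with "i" and none with "liver",
# so desc["man"]["i"] is 2 and "liver" is not a key of desc["man"].
```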
math
- synonyms.py. A Python file containing the program.
- war_and_peace.txt and swann's_way.txt. Works of English literature in .txt format from http://www.gutenber.org, used to create the semantic descriptor vectors.
- testing.txt. A file where each line represents a TOEFL question. The first word on a line is the given word w, the second word is the answer to the TOEFL question, and the remaining words on the line are the choices of possible synonyms.
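To make the testing.txt line format concrete, the sketch below parses one question per line and scores a guessing function against the recorded answers. The names `parse_question` and `score`, the sample lines, and the `always_first` strategy are all hypothetical and only illustrate the layout described above.

```python
def parse_question(line):
    """Split a question line into (word, answer, choices)."""
    words = line.split()
    if len(words) < 3:
        raise ValueError("expected a word, an answer and at least one choice")
    return words[0], words[1], words[2:]


def score(lines, pick):
    """Return the fraction of questions for which pick(word, choices) is right."""
    correct = 0
    total = 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        word, answer, choices = parse_question(line)
        total += 1
        if pick(word, choices) == answer:
            correct += 1
    return correct / total if total else 0.0


# Made-up lines in the described layout: word, answer, then the choices.
sample_lines = [
    "feline cat dog cat horse",
    "vexed annoyed annoyed amused excited",
]


def always_first(word, choices):
    """Trivial placeholder strategy: always guess the first choice."""
    return choices[0]


accuracy = score(sample_lines, always_first)  # right on the second line only
```

In practice the `pick` argument would be a function that compares semantic descriptor vectors rather than the placeholder used here.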
- Build semantic descriptor vectors
In this example I build the semantic descriptor vectors using the War and Peace and Swann's Way texts; however, this can be done with any English literature text in the form of a .txt file.
sem = build_semantic_descriptors_from_files(["war_and_peace.txt", "swann's_way.txt"])
- Identify which word is a synonym using semantic descriptor vectors
''' word --> a string, the word given in the question
choices --> a list of strings, the options given in the question (one of which is a synonym of _word_)
sem --> a dictionary of dictionaries, which holds the semantic descriptor vectors built in the first step
cosine_similarity --> a function defined in the program for determining the similarity of semantic descriptor vectors; always pass this function unchanged
answer --> the string in choices which is synonymous with _word_
'''
answer = most_similar_word(word, choices, sem, cosine_similarity)
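For readers curious about what a call like this involves, here is a minimal, self-contained sketch of the selection logic. It is not the code from synonyms.py: the names `cosine` and `pick_most_similar`, the toy descriptors, and the rule that a word without a descriptor scores -1 are assumptions made for this example.

```python
import math


def cosine(vec1, vec2):
    """Cosine similarity of two {word: count} dicts (illustrative helper)."""
    dot = sum(count * vec2[word] for word, count in vec1.items() if word in vec2)
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


def pick_most_similar(word, choices, descriptors, similarity_fn):
    """Return the choice whose descriptor is most similar to that of word."""
    best_choice = None
    best_score = float("-inf")
    for choice in choices:
        if word in descriptors and choice in descriptors:
            current = similarity_fn(descriptors[word], descriptors[choice])
        else:
            # Assumption: a word never seen in the texts scores -1.
            current = -1.0
        # Strict comparison, so ties go to the earliest choice.
        if current > best_score:
            best_choice, best_score = choice, current
    return best_choice


# Made-up descriptors; "horse" is deliberately missing.
toy_sem = {
    "feline": {"purr": 3, "whiskers": 2},
    "cat": {"purr": 4, "whiskers": 1, "milk": 1},
    "dog": {"bark": 5, "bone": 2},
}
toy_answer = pick_most_similar("feline", ["dog", "cat", "horse"], toy_sem, cosine)
```

With these toy descriptors, "cat" shares context words with "feline" while "dog" shares none and "horse" has no descriptor, so `toy_answer` is "cat".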