This project is an intelligent system that learns to answer synonym questions from the Test of English as a Foreign Language (TOEFL). The system approximates the semantic similarity (closeness of meaning) of words using data from English texts.

clara-fleisig/Semantic-Similarity

Semantic-Similarity

This project is an intelligent system that learns to answer synonym questions from the Test of English as a Foreign Language (TOEFL). The system approximates the semantic similarity (closeness of meaning) of words using data from English texts.

Describing the Problem

The program answers multiple-choice questions of the form: given a word w, find which of the words s1, s2, s3, s4 is a synonym of w.

How it Works

The program computes the semantic similarities of (w, s1), (w, s2), (w, s3), (w, s4) and chooses the word whose similarity to w is the highest.

To measure the semantic similarity of a pair of words, the program computes a semantic descriptor vector for each word, and then takes the similarity measure to be the cosine similarity between the two vectors.
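As a rough illustration of the cosine similarity step, here is a minimal sketch for vectors stored as dictionaries. The function name `cosine_similarity_sketch` and the choice to return 0.0 when either vector is empty are assumptions made for this example; the implementation in synonyms.py may differ.

```python
import math


def cosine_similarity_sketch(vec1, vec2):
    """Cosine similarity of two sparse vectors stored as {word: count} dicts."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(count * vec2[word] for word, count in vec1.items() if word in vec2)
    norm1 = math.sqrt(sum(count * count for count in vec1.values()))
    norm2 = math.sqrt(sum(count * count for count in vec2.values()))
    if norm1 == 0 or norm2 == 0:
        # Assumption: treat similarity with an empty vector as 0.0.
        return 0.0
    return dot / (norm1 * norm2)


u = {"a": 1, "b": 2, "c": 3}
v = {"b": 4, "c": 5, "d": 6}
similarity = cosine_similarity_sketch(u, v)  # 23 / sqrt(14 * 77)
```

Because missing keys stand for zeros, the dot product only needs to visit the keys of one vector.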

Semantic Descriptor Vector

The semantic descriptor vector (descw) of a word w, computed from a text with n distinct words (w1, w2, ..., wn), has n dimensions. The i-th coordinate of descw is the number of sentences in which both w and wi occur. In this program the semantic descriptor vector is stored as a dictionary, so the zeros corresponding to words that never co-occur with w are not stored.
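The definition above can be sketched as follows. This is an illustrative example, not the code from synonyms.py: the name `build_descriptors_sketch` is hypothetical, the input is assumed to be already split into sentences and lowercase words, and a word's co-occurrence with itself is assumed to be left out of its own descriptor.

```python
def build_descriptors_sketch(sentences):
    """Map each word to a sparse co-occurrence vector.

    sentences: a list of sentences, each a list of lowercase words.
    Returns {word: {other_word: number of sentences containing both}}.
    """
    descriptors = {}
    for sentence in sentences:
        # A set ensures each sentence counts at most once per word pair.
        unique_words = set(sentence)
        for word in unique_words:
            row = descriptors.setdefault(word, {})
            for other in unique_words:
                if other != word:  # assumption: skip the word itself
                    row[other] = row.get(other, 0) + 1
    return descriptors


sentences = [["i", "am", "sick"], ["i", "am", "happy"]]
desc = build_descriptors_sketch(sentences)
```

Here `desc["i"]` holds a count of 2 for "am" and 1 each for "sick" and "happy", while words that never share a sentence with "i" have no entry at all.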

Modules Used

math

Files

  • synonyms.py. A Python file containing the program.
  • war_and_peace.txt and swann's_way.txt. Works of English literature in .txt format from http://www.gutenberg.org, used to create the semantic descriptor vectors.
  • testing.txt. A file where each line represents a TOEFL question. The first word on a line is the given word w, the second word is the answer to the TOEFL question, and the remaining words on the line are the choices of possible synonyms.
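To make the testing.txt line format concrete, here is a small sketch that parses question lines and scores a guessing function against them. The helper names `parse_question` and `percent_correct`, and the sample line, are hypothetical and only meant to illustrate the format described above.

```python
def parse_question(line):
    """Split one question line into (given word, answer, choices)."""
    words = line.split()
    return words[0], words[1], words[2:]


def percent_correct(lines, pick):
    """Score pick(word, choices) -> chosen word against the question lines."""
    questions = [parse_question(line) for line in lines if line.strip()]
    if not questions:
        return 0.0
    correct = sum(1 for word, answer, choices in questions
                  if pick(word, choices) == answer)
    return 100.0 * correct / len(questions)


# Hypothetical sample line: given word, answer, then the choices.
sample = "feline cat dog cat horse"
word, answer, choices = parse_question(sample)
```

With the program's functions available, `pick` could be something like `lambda w, c: most_similar_word(w, c, sem, cosine_similarity)`.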

How to run the program

  1. Build semantic descriptor vectors

In this example I build the semantic descriptor vectors using the War and Peace and Swann's Way texts, but this can be done with any English text in the form of a .txt file.

sem = build_semantic_descriptors_from_files(["war_and_peace.txt", "swann's_way.txt"])
  2. Identify which word is a synonym using semantic descriptor vectors
''' word --> a string, the word given in the question
choices  --> a list of strings, the options given in the question (one of which is a synonym of _word_)
sem --> a dictionary of dictionaries, which holds the semantic descriptor vectors built in step 1
cosine_similarity --> a function defined in the program for determining the similarity of semantic descriptor vectors; do not change this argument
answer --> the string in choices which is synonymous with _word_
'''

answer = most_similar_word(word, choices, sem, cosine_similarity)
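For readers curious about what a call like this might do internally, here is a hedged sketch of the selection step. It is not the author's implementation: `most_similar_word_sketch`, the small `cosine` helper, the toy `toy_sem` dictionary, and the rule that words missing from the descriptors score -1 are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two {word: count} dicts; -1.0 if either is empty."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else -1.0


def most_similar_word_sketch(word, choices, sem, similarity_fn):
    """Return the choice whose descriptor is most similar to that of word."""
    best_choice, best_score = None, float("-inf")
    for choice in choices:
        if word in sem and choice in sem:
            score = similarity_fn(sem[word], sem[choice])
        else:
            score = -1.0  # assumption: unknown words get the lowest score
        if score > best_score:  # strict comparison: ties keep the earlier choice
            best_choice, best_score = choice, score
    return best_choice


# Toy descriptors, invented for this example.
toy_sem = {
    "cat": {"pet": 3, "fur": 2},
    "feline": {"pet": 2, "fur": 2},
    "car": {"road": 4, "pet": 1},
}
toy_answer = most_similar_word_sketch("feline", ["car", "cat"], toy_sem, cosine)
```

Passing the similarity function as an argument keeps the selection logic independent of the particular similarity measure.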
