Analogy is one of the properties of dense word embedding vectors.
I evaluated analogy completion, using cosine similarity as the metric, on the word embedding vectors learned from the twenty newsgroups dataset.
I saved the word and topic vectors at the end of each epoch.
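For reference, cosine similarity is cos(u, v) = (u · v) / (||u|| ||v||), and the analogy a : b :: c : ? is completed by picking the vocabulary word w whose vector maximizes cos(e_b - e_a, e_w - e_c).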
Here's my test code:
import numpy as np
from lda2vec import utils, model

# Word and topic vectors saved at the end of each epoch
topic_embedding_198 = np.load("topic_weights_198.npy")
topic_embedding_199 = np.load("topic_weights_199.npy")
word_embedding_199 = np.load("word_weights_199.npy")

data_path = "data/clean_data_twenty_newsgroups"
load_embeds = False

# Load the preprocessed twenty newsgroups data
(idx_to_word, word_to_idx, freqs, pivot_ids,
 target_ids, doc_ids) = utils.load_preprocessed_data(data_path, load_embed_matrix=load_embeds)
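Before running the analogy test below, it's worth confirming that the probe words actually occur in the vocabulary (a quick check, assuming word_to_idx maps lowercase tokens to rows of the word embedding matrix; the word list just mirrors the triads used below):

probe_words = ['king', 'man', 'woman', 'lady', 'boy', 'small', 'smaller', 'large']
missing = [w for w in probe_words if w not in word_to_idx]
if missing:
    print("Not in vocabulary:", missing)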
def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v.

    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v
    """
    # Dot product between u and v
    dot = np.dot(u, v)
    # L2 norms of u and v
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    # cos(u, v) = (u . v) / (||u|| * ||v||)
    cosine_similarity = dot / (norm_u * norm_v)
    return cosine_similarity
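As a quick sanity check on the metric itself (toy vectors, not from the dataset), identical vectors should score 1.0 and orthogonal vectors 0.0:

assert np.isclose(cosine_similarity(np.array([1.0, 2.0]), np.array([1.0, 2.0])), 1.0)
assert np.isclose(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])), 0.0)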
def complete_analogy(word_a, word_b, word_c, embeddings):
    """
    Performs the word analogy task: a is to b as c is to ____.

    Arguments:
        word_a -- a word, string
        word_b -- a word, string
        word_c -- a word, string
        embeddings -- word embedding matrix of shape (vocab_size, embedding_dim),
                      indexed via the module-level word_to_idx

    Returns:
        best_word -- the word such that v_b - v_a is closest to v_best_word - v_c,
                     as measured by cosine similarity
    """
    # convert words to lower case
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()

    # Look up the embedding vectors for the three input words
    e_a = embeddings[word_to_idx[word_a]]
    e_b = embeddings[word_to_idx[word_b]]
    e_c = embeddings[word_to_idx[word_c]]

    max_cosine_sim = -100  # Initialize max_cosine_sim to a large negative number
    best_word = None       # Keeps track of the word to output

    # loop over the whole vocabulary
    for w in word_to_idx.keys():
        # to avoid best_word being one of the input words, skip them
        if w in [word_a, word_b, word_c]:
            continue
        # Compare the offset (e_b - e_a) with (w's vector - e_c)
        cosine_sim = cosine_similarity(e_b - e_a, embeddings[word_to_idx[w]] - e_c)
        # Keep the word with the highest cosine similarity seen so far
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w
    return best_word
triads_to_try = [('king', 'man', 'lady'), ('man', 'woman', 'boy'), ('small', 'smaller', 'large')]
for triad in triads_to_try:
    print('{} -> {} :: {} -> {}'.format(*triad, complete_analogy(*triad, word_embedding_199)))
king -> man :: lady -> x
man -> woman :: boy -> deny
small -> smaller :: large -> kind
This isn't what I expected ... I will test this using the GloVe embeddings.
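A rough sketch of what that GloVe check could look like (the glove.6B.100d.txt filename and the plain "word v1 v2 ..." text format are assumptions about a locally downloaded GloVe file, not something from this repo). It just rebuilds a word-to-index mapping and an embedding matrix so the same complete_analogy call can be reused:

glove_word_to_idx = {}
glove_vectors = []
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        parts = line.rstrip().split(' ')
        glove_word_to_idx[parts[0]] = i
        glove_vectors.append(np.array(parts[1:], dtype=np.float32))
glove_embedding = np.vstack(glove_vectors)

# Point word_to_idx at the GloVe vocabulary so complete_analogy picks it up
word_to_idx = glove_word_to_idx
for triad in triads_to_try:
    print('{} -> {} :: {} -> {}'.format(*triad, complete_analogy(*triad, glove_embedding)))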
Nearest word embedding vector to topic #9:
idx = np.array([cosine_similarity(x, topic_embedding_199[9]) for x in word_embedding_199]).argmax()
print(idx_to_word[idx])
sure
... and what was 'learned' between epoch 198 and epoch 199 for topic #9
idx = np.array([cosine_similarity(x, topic_embedding_198[9]) for x in word_embedding_199]).argmax()
print(idx_to_word[idx])
den
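For a broader view of topic #9 than a single nearest word, a small helper along these lines (reusing the same arrays and cosine_similarity as above) lists the top-k words for a topic vector:

def nearest_words(topic_vector, word_embedding, k=10):
    # Rank every word vector by cosine similarity to the topic vector, highest first
    sims = np.array([cosine_similarity(w, topic_vector) for w in word_embedding])
    return [idx_to_word[i] for i in sims.argsort()[::-1][:k]]

print(nearest_words(topic_embedding_199[9], word_embedding_199))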