
<UNK> Returned for Multiple Topics #56

Open

dbl001 opened this issue Jun 30, 2019 · 1 comment

Comments

@dbl001
Collaborator

dbl001 commented Jun 30, 2019

"UNK " is added to the tokenizer word lists in nlppip.py because the
from keras.preprocessing.text import Tokenizer is one-based.

self.tokenizer.word_index["<UNK>"] = 0
self.tokenizer.word_docs["<UNK>"] = 0
self.tokenizer.word_counts["<UNK>"] = 0

The TensorFlow word-embedding and embedding-lookup implementations, by contrast, are zero-based.
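A minimal sketch of the mismatch (the vocabulary size and embedding dimension below are illustrative, not the repo's settings): Keras starts its word indexes at 1, while a TensorFlow embedding matrix has a real, trainable row 0, so mapping "<UNK>" to index 0 gives the unknown token an ordinary trained vector.

import numpy as np
import tensorflow as tf
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(["the quick brown fox jumps over the lazy dog"])
print(min(tokenizer.word_index.values()))  # 1 -- Keras indexes are one-based

# A TensorFlow embedding matrix is zero-based: row 0 is a valid, trainable row.
embedding = tf.Variable(
    np.random.uniform(-1.0, 1.0, (5000, 300)).astype(np.float32))
unk_vector = tf.nn.embedding_lookup(embedding, tf.constant([0]))  # the "<UNK>" row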

Row 0 of the embedding matrix before training (the "<UNK>" slot):

word_embedding_000[0]
array([-0.72940636,  0.7893076 , -0.5647843 , -0.73255396,  0.7901778 ,
       -0.49344468,  0.11772466, -0.5727272 ,  0.527349  , -0.06881762,
        0.44169998, -0.20452452,  0.3124647 ,  0.86845255, -0.9390068 ,
       -0.6195681 ,  0.89950705,  0.3356259 , -0.8492527 ,  0.45032454,
        0.6324513 ,  0.75457215,  0.21222615, -0.44409204, -0.06979871,
       -0.6462743 , -0.36795807,  0.27780175,  0.94171906,  0.40449977,
       -0.16222072, -0.34851456, -0.9734571 , -0.46344304, -0.80052805,
        0.39213514, -0.23919392, -0.60179496,  0.34500718, -0.6585071 ,
        0.18976736, -0.49871182, -0.31101155,  0.8082261 ,  0.5178263 ,
       -0.9620471 , -0.98253274,  0.5575602 , -0.5283928 , -0.05512738,
       -0.46859574, -0.9827881 ,  0.4550724 , -0.4175427 , -0.6799257 ,
        0.32043505, -0.60924935, -0.08730078, -0.76487565, -0.11529756,
       -0.05081773, -0.423831  , -0.69595194, -0.39993382,  0.01512861,
        0.82286215, -0.96196485, -0.96162105, -0.69300675, -0.23160791,
       -0.8725774 , -0.62869287, -0.21675658,  0.22361946, -0.7145815 ,
        0.25228357,  0.300138  ,  0.1944983 , -0.20161653, -0.00947928,
       -0.50661993,  0.24620843,  0.8336489 , -0.6433666 ,  0.4633739 ,
        0.42356896, -0.2927196 ,  0.7726562 , -0.77078557,  0.42736077,
        0.2361381 ,  0.8253889 , -0.03234029,  0.16903758,  0.64719176,
        0.12639523,  0.468915  ,  0.36462903, -0.63329506,  0.46308804,
        0.9785025 , -0.60487294, -0.8659482 ,  0.80265903,  0.08614421,
       -0.6846776 , -0.2840774 , -0.05165243,  0.7902992 ,  0.7554364 ,
        0.07603502, -0.82541203, -0.03127742, -0.45349932, -0.6321502 ,
       -0.75881124,  0.10189629,  0.7766483 , -0.02184248,  0.30532098,
        0.40934992, -0.3520453 , -0.4991796 ,  0.89320135, -0.5294213 ,
        0.08958745, -0.2862544 ,  0.694613  , -0.2933941 , -0.2711556 ,
       -0.778697  , -0.90801215, -0.4771154 ,  0.9393649 ,  0.02598763,
       -0.6128385 ,  0.6687329 , -0.00300312,  0.39082742, -0.62328243,
       -0.1326313 , -0.04318118,  0.5147674 ,  0.30447197, -0.15042996,
       -0.29966593, -0.19948554, -0.15503025, -0.07965088, -0.18107772,
       -0.6654799 ,  0.16734552, -0.6545446 , -0.19038987,  0.11273432,
       -0.37501454, -0.01779771,  0.10266089,  0.6059449 ,  0.53478146,
        0.8791959 , -0.71896863, -0.50831914,  0.51859474,  0.7803166 ,
        0.85757375,  0.58769774, -0.01653957,  0.35751534, -0.66742086,
        0.09473515, -0.89558864,  0.5007875 ,  0.6572523 ,  0.47241664,
        0.5635514 ,  0.32414556, -0.53437877,  0.84779453,  0.6378653 ,
        0.81033015, -0.9580946 ,  0.4329822 ,  0.7842884 , -0.02432752,
       -0.26144147,  0.51170826,  0.18752575,  0.716552  ,  0.19081879,
        0.76230717,  0.95465493,  0.587734  ,  0.9609244 , -0.95637846,
       -0.8732126 , -0.4947157 ,  0.4163556 ,  0.08395147,  0.48358202,
        0.6750531 ,  0.6933727 , -0.66409326, -0.6555612 , -0.77092767,
        0.77507496,  0.6416006 , -0.10126472, -0.20890045,  0.12876058,
       -0.7351172 ,  0.68103194, -0.575778  ,  0.1444602 , -0.42351747,
       -0.81415844, -0.58244324, -0.6112335 , -0.16471076,  0.5918329 ,
        0.6705165 , -0.9932399 ,  0.1535554 ,  0.02513838, -0.6433432 ,
        0.0850389 , -0.10692096,  0.21783972, -0.00443554, -0.5312202 ,
        0.16654754,  0.1691029 ,  0.9144945 , -0.20212364, -0.7347467 ,
        0.1740458 , -0.8262415 , -0.05594969, -0.04339361,  0.439353  ,
       -0.00228357, -0.6715636 ,  0.879483  ,  0.10999107,  0.8576815 ,
       -0.38673759, -0.2496996 ,  0.8718543 ,  0.77182436, -0.91532016,
        0.8322928 , -0.95677876,  0.11354065,  0.31194258, -0.7994232 ,
        0.8070309 , -0.12008953, -0.555902  , -0.6638913 ,  0.4023559 ,
       -0.77688384,  0.12601566, -0.3632667 , -0.6541252 ,  0.10901499,
        0.3102548 , -0.40334034,  0.03114676, -0.7885685 , -0.20401645,
        0.939183  ,  0.17131758,  0.47609544, -0.17927122, -0.5007596 ,
        0.9717326 , -0.0057416 ,  0.81249833,  0.39427924,  0.18702984,
       -0.4081514 , -0.47332573, -0.0909853 , -0.5931864 ,  0.7257166 ,
        0.18550944,  0.21591997, -0.02170038, -0.0661478 , -0.67937946,
       -0.28355837,  0.7463348 , -0.32689762,  0.9659898 , -0.54855466,
        0.72903705, -0.32373667, -0.92316556,  0.01121569,  0.17884326],
      dtype=float32)

Curiously, the embedding vector after training for 200 epochs that is closest (by cosine similarity) to pre-training embedding vector 0 is:

import numpy as np
from scipy.spatial.distance import cosine  # cosine distance = 1 - cosine similarity

word_embedding_000 = np.load("word_weights_000.npy")
word_embedding_199 = np.load("word_weights_199.npy")

# The smallest cosine distance marks the closest vector.
idx = np.array([cosine(x, word_embedding_000[0]) for x in word_embedding_199]).argmin()
print(idx)
2905
print(idx_to_word[2905])  # idx_to_word maps indexes back to vocabulary words
disc

How could one embedding vector appear in so many [orthogonal?] topics?
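One way to check is to look at the pairwise cosine similarities of the learned topic matrix (a sketch, not the repo's code; the topic-weights file name is hypothetical):

import numpy as np

# Hypothetical file name; assumes a saved (n_topics, embedding_dim) matrix.
topics = np.load("topic_weights_199.npy")
unit = topics / np.linalg.norm(topics, axis=1, keepdims=True)
sims = unit @ unit.T  # pairwise cosine similarities between topic vectors
print(np.round(sims, 2))  # near-zero off-diagonal entries => near-orthogonal topics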

EPOCH: 85
LOSS 950.43896 w2v 8.754408 lda 941.6846 lda-sim 3.299621869012659
---------Closest 10 words to given indexes----------
Topic 0 : <UNK>, vending, confidential, offender, drainage, terrace, overtime, unintended, documentation, yan
Topic 1 : <UNK>, meaning, refrain, spent, largely, ran, equally, considered, decade, exact
Topic 2 : mim, lite, lea, recalibration, sonny, l, skip, unsold, vive, allen
Topic 3 : marathi, recalibration, assamese, uzbek, tagalog, gaelic, romansh, galician, razoo, recast
Topic 4 : loophole, vive, estonian, gaelic, slovenian, maracaibo, slovak, faroese, magyar, romansh
Topic 5 : <UNK>, vacant, jos, bleeding, kivu, bye, aunt, sundar, whilst, cowboy
Topic 6 : depending, closely, decided, applied, considered, spent, contrary, isolated, especially, frequently
Topic 7 : <UNK>, spinach, frost, slew, confined, yakan, ironically, dusty, shelf, bleeding
Topic 8 : basque, assamese, azerbaijani, razoo, haitian, kiswahili, recast, icelandic, nederlands, mommy
Topic 9 : spiky, recast, andhra, tauranga, revoke, recalibration, thread, mull, menacing, motoring
Topic 10 : <UNK>, rightly, inflammatory, severity, owen, incitement, disappearance, forge, magistrate, campaigner
Topic 11 : burke, chronicle, resend, fico, activation, tauranga, fetish, interstitial, unspoken, mommy
Topic 12 : <UNK>, confined, labrador, rope, modeling, shane, terrace, downpour, vernon, nutritional
Topic 13 : nederlands, suomi, allen, icelandic, seed, afrikaans, razoo, assamese, latvian, gaelic
Topic 14 : unacceptable, practically, exact, impression, mixture, certainly, hardly, toxic, younger, capture
Topic 15 : <UNK>, emergence, nose, straw, abundant, confined, copper, decreasing, ironically, litigation
Topic 16 : <UNK>, charlotte, straw, ironically, taps, spinach, yakan, confined, slew, maduro
Topic 17 : <UNK>, aggregate, crushing, knockout, versatile, distinctive, admired, pleasure, applause, wishing
Topic 18 : gaelic, romansh, slovenian, folder, resend, recast, assamese, creole, slovak, unspoken
Topic 19 : kiswahili, newsstand, ossetic, banat, assamese, faroese, creole, oriya, confucianism, romansh
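A listing like the one above is typically produced by ranking every word embedding against each topic vector by cosine similarity (a sketch under that assumption, not the repo's exact code). Since the "<UNK>" row at index 0 is trainable like any other row, nothing stops it from drifting close to several topic vectors at once:

import numpy as np

def closest_words(topic_vec, word_embedding, idx_to_word, k=10):
    # Normalize rows so dot products become cosine similarities.
    w = word_embedding / np.linalg.norm(word_embedding, axis=1, keepdims=True)
    t = topic_vec / np.linalg.norm(topic_vec)
    top = np.argsort(w @ t)[::-1][:k]  # k most similar word indexes
    return [idx_to_word[i] for i in top]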

@stalhaa

stalhaa commented Aug 1, 2019

@dbl001 What do you mean?
