This repository was archived by the owner on Sep 11, 2020. It is now read-only.

UnicodeDecodeError when loading embeddings using nlppipe Preprocessor method load_glove #64

Open
BrutishGuy opened this issue May 24, 2020 · 1 comment

Comments


BrutishGuy commented May 24, 2020

Hi,
I am running into an issue (on Windows, Python 3.7) where the Preprocessor method load_glove raises a UnicodeDecodeError when loading pre-trained word embeddings.

To reproduce, instantiate a Preprocessor with any pandas DataFrame and try to load one of the GloVe embedding files. In my case, I am using glove.6B.300d (Wikipedia 2014 + Gigaword 5) from the official site, which this repository also links to: https://github.com/stanfordnlp/GloVe. I have tried other embeddings as well, to no avail. I used 7-Zip to unpack the zip file and retrieve the .txt embeddings, as in the lda2vec example provided.

from lda2vec.nlppipe import Preprocessor
P = Preprocessor(YOUR_DF, "ANY_TEXT_COLUMN", max_features=30000, maxlen=10000, min_count=30)
embedding_matrix = P.load_glove(EMBEDDING_DIR + "/" + "glove.6B.300d.txt")

The specific error thrown is as below:

Traceback (most recent call last):

  File "<ipython-input-4-e5cf0a369051>", line 3, in <module>
    embedding_matrix = P.load_glove(EMBEDDING_DIR + "/" + "glove.6B.300d.txt")

  File "C:\ProgramData\Anaconda3\lib\site-packages\lda2vec\nlppipe.py", line 118, in load_glove
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE))

  File "C:\ProgramData\Anaconda3\lib\site-packages\lda2vec\nlppipe.py", line 118, in <genexpr>
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE))

  File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6148: character maps to <undefined>
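
For context, here is a minimal sketch of why the codec fails. On Windows, open() without an encoding argument typically falls back to the locale encoding (cp1252 here, per the traceback), and byte 0x9d has no mapping in cp1252. The right double quotation mark (U+201D) is used below only as an illustrative character whose UTF-8 encoding contains 0x9d; which character in the GloVe file actually triggers the error is not known. The helper name decodes_with is hypothetical.

```python
def decodes_with(raw: bytes, encoding: str) -> bool:
    """Return True if `raw` can be decoded with `encoding`, else False."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Illustrative character only: its UTF-8 form is the bytes e2 80 9d.
raw = "\u201d".encode("utf8")

print(raw)                          # the raw bytes, including 0x9d
print(decodes_with(raw, "utf8"))    # UTF-8 round-trips cleanly
print(decodes_with(raw, "cp1252"))  # cp1252 has no mapping for 0x9d
```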

I believe this can be fixed by passing encoding="utf8" to the open call in load_glove, on line 127 of nlppipe.py.
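
The suggested fix can be sketched roughly as follows. This is not the library's actual code: get_coefs appears in the traceback, but its body here is an assumption, load_embeddings_index is a hypothetical stand-in for the relevant line of load_glove, and plain Python lists are used in place of whatever array type the library builds.

```python
def get_coefs(word, *values):
    # Assumed shape of the helper seen in the traceback:
    # first token is the word, the remaining tokens are vector components.
    return word, [float(v) for v in values]


def load_embeddings_index(embedding_file):
    # Hypothetical stand-in for the failing line in load_glove.
    # Passing encoding="utf8" stops open() from falling back to the
    # platform default (cp1252 on the Windows setup in the traceback).
    with open(embedding_file, encoding="utf8") as f:
        return dict(get_coefs(*line.rstrip().split(" ")) for line in f)
```

Using a with block also closes the file handle once the dictionary is built, which the original one-liner leaves to the garbage collector.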

Owner

nateraw commented May 24, 2020

Feel free to make a PR if you have a fix - I don't have the bandwidth to work on this repo anymore. Cheers
