Hi,
I am currently experiencing an issue (on Windows, Python 3.7) where the Preprocessor class throws a UnicodeDecodeError when loading pre-trained word embeddings.
To reproduce, it is enough to instantiate a Preprocessor with some Pandas dataframe and attempt to load one of the GloVe embedding files. In my case, I am using glove.6B.300d (Wikipedia 2014 + Gigaword 5), taken from the official site linked in this repository as well: https://github.com/stanfordnlp/GloVe. I have tried other embeddings too, to no avail. I used 7-Zip to unpack the zip file to retrieve the .txt embeddings, as per the lda2vec example provided.
from lda2vec.nlppipe import Preprocessor

# YOUR_DF is any Pandas dataframe; "ANY_TEXT_COLUMN" names the column of raw text
P = Preprocessor(YOUR_DF, "ANY_TEXT_COLUMN", max_features=30000, maxlen=10000, min_count=30)
# EMBEDDING_DIR points at the unpacked GloVe files
embedding_matrix = P.load_glove(EMBEDDING_DIR + "/" + "glove.6B.300d.txt")
The specific error thrown is shown below:
Traceback (most recent call last):
  File "<ipython-input-4-e5cf0a369051>", line 3, in <module>
    embedding_matrix = P.load_glove(EMBEDDING_DIR + "/" + "glove.6B.300d.txt")
  File "C:\ProgramData\Anaconda3\lib\site-packages\lda2vec\nlppipe.py", line 118, in load_glove
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE))
  File "C:\ProgramData\Anaconda3\lib\site-packages\lda2vec\nlppipe.py", line 118, in <genexpr>
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE))
  File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6148: character maps to <undefined>
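For context, here is a minimal sketch of what goes wrong, using only the standard library: on Windows, open() without an explicit encoding falls back to the locale's default codec (cp1252 in this traceback), and cp1252 has no mapping for byte 0x9d, whereas the GloVe .txt files are UTF-8 encoded.

# cp1252 leaves 0x9d unmapped, so decoding it raises exactly the error seen above:
try:
    b"\x9d".decode("cp1252")
except UnicodeDecodeError as e:
    print(e)  # 'charmap' codec can't decode byte 0x9d in position 0: character maps to <undefined>

# In UTF-8 the same byte is valid as part of a multi-byte sequence:
print(b"\xe2\x80\x9d".decode("utf8"))  # '”' (right double quotation mark)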
I believe this can be fixed by adding
, encoding="utf8"
to the open call on line 127 of the nlppipe.py code.
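As a sketch, the patched line would look like the one below. The surrounding code is taken from the traceback above, and the exact line number may differ between the installed package and the repository source.

# lda2vec/nlppipe.py, inside load_glove, before:
#   embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE))
# after: decode the GloVe file as UTF-8 instead of the platform default
embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, encoding="utf8"))

With this change, load_glove reads the embeddings as UTF-8 on every platform, so the result no longer depends on the Windows locale.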