1 parent 6f541c6, commit d9467be
Showing 2 changed files with 34 additions and 32 deletions.
@@ -1,55 +1,57 @@
-# EEA Corpus (alpha)
+# EEA Corpus (alpha stage)

-This docker image is based on spaCy, Textacy and pyLDAvis to analyse the
-EEA Corpus (the collection of all published EEA documents).
+This docker image is based on spaCy, Textacy, pyLDAvis & others to analyse the
+EEA Corpus (the collection of all published EEA documents) or any other CSV
+file with a column of text.

-It provides a number of Machine Learning and Natural Language Processing algorithms
-that can be run on top of the EEA Corpus or a subset of it.
+It provides a number of Machine Learning and Natural Language Processing
+algorithms that can be run on top of the EEA Corpus or a subset of it.

-The idea is to provide these methods over a REST API when possible.
+~The idea is to provide these methods over a REST API when possible.~

 ## Current features

-Create and visualise topic models via pyLDAvis.
+### Compose a text transformation pipeline to prepare a corpus

-The topics are found via a text-mining technique called [Topic Modeling](https://en.wikipedia.org/wiki/Topic_model).
+First upload a CSV file, then use the "Create a corpus" button to enter
+the pipeline composition page.

-In machine learning and natural language processing, a topic model is a
+### Create and visualise topic models via pyLDAvis.

+The topics are found via a text-mining technique called
+[Topic Modeling](https://en.wikipedia.org/wiki/Topic_model).

+In machine learning and natural language processing, a topic model is a
 type of statistical model for discovering the abstract "topics" that occur in a
 collection of documents.

 [Video demonstration](https://www.youtube.com/watch?v=IksL96ls4o0&t=255s)

 

-How to run:
+## How to run:

 ```
 docker-compose build
 docker-compose up -d
 ```

 Enter the shell of eeacorpus container:

 ```
 docker exec -it eeacorpus_shell_1 bash
 ```

 Inside the shell you run:

 ```
 python load_eea_corpus.py --normalize --data data-small.csv
 ```

-This will (after some time) start the visualisation browser on 0.0.0.0:8888
+This will (after some time) start the EEA Corpus application server on
+[localhost:8181](http://0.0.0.0:8181)
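
Because the server only comes up "after some time", it can help to poll the port rather than guess when it is ready. The following is a minimal sketch, not part of the repository: the helper name `wait_for_port` and its `timeout` and `interval` parameters are hypothetical, and it assumes the application accepts plain TCP connections on the port named in the new README text.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0,
                  interval: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds.

    Returns False if no connection could be made before `timeout` seconds.
    This helper is illustrative and is not shipped with the project.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

Under these assumptions, `wait_for_port("localhost", 8181, timeout=120)` would return `True` as soon as the application server is reachable, and `False` if it has not started within two minutes.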

 ## EEA Corpus Data

-The latest EEA Corpus dataset can be produced by visiting
-[global catalogue](http://search.apps.eea.europa.eu/) > See all results > download csv.
+The latest EEA Corpus dataset can be produced by visiting [global
+catalogue](http://search.apps.eea.europa.eu/) > See all results > download
+csv.

-Once the csv file is downloaded, you can pass it to this application to be analysed. Make sure your
-first column is the "document text" to be analysed. The other columns are considered metadata.
+Once the csv file is downloaded, you can pass it to this application to be
+analysed. Make sure your first column is the "document text" to be analysed.
+The other columns are considered metadata.
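
The expected CSV layout (first column is the document text, remaining columns are metadata) can be checked before handing a file to the application. This is a rough sketch using only the Python standard library. The function name `split_text_and_metadata`, the assumption that the file has a header row, and the sample rows are all illustrative and do not come from the project.

```python
import csv
import io
from typing import Iterable, Iterator, List, Tuple


def split_text_and_metadata(
    lines: Iterable[str], has_header: bool = True
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (document_text, metadata_values) for each CSV row.

    The first column is treated as the text to analyse; every other
    column is returned untouched as metadata.
    """
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # skip the header row (assumed to exist)
    for row in reader:
        if not row:
            continue  # ignore blank lines
        yield row[0], row[1:]


# Made-up sample data, only to show the shape of the result.
sample = io.StringIO(
    "text,title,year\n"
    '"Air quality in Europe, 2017 report",Air quality,2017\n'
    '"Water and marine environment",Water,2016\n'
)
docs = list(split_text_and_metadata(sample))
print(docs[0])
```

For a real file, passing `open("data.csv", newline="", encoding="utf-8")` instead of the in-memory sample should work the same way, assuming the file is UTF-8 encoded.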

-You may download an already generated large EEA corpus data for testing like this:
+You may download an already generated large EEA corpus data for testing like
+this:

 ```
 curl -L -o data.csv https://www.dropbox.com/s/sihmoc4wwpl0kr2/data_all.csv?dl=1
 ```