Update instructions
tiberiuichim committed Aug 18, 2017
1 parent 6f541c6 commit d9467be
Showing 2 changed files with 34 additions and 32 deletions.
58 changes: 30 additions & 28 deletions README.md
@@ -1,55 +1,57 @@
# EEA Corpus (alpha)
# EEA Corpus (alpha stage)

This docker image is based on spaCy, Textacy and pyLDAvis to analyse the
EEA Corpus (the collection of all published EEA documents).
This docker image is based on spaCy, Textacy, pyLDAvis & others to analyse the
EEA Corpus (the collection of all published EEA documents) or any other CSV
file with a column of text.

It provides a number of Machine Learning and Natural Language Processing algorithms
that can be run on top of the EEA Corpus or a subset of it.
It provides a number of Machine Learning and Natural Language Processing
algorithms that can be run on top of the EEA Corpus or a subset of it.

The idea is to provide these methods over a REST API when possible.
~The idea is to provide these methods over a REST API when possible.~

## Current features

Create and visualise topic models via pyLDAvis.
### Compose a text transformation pipeline to prepare a corpus

The topics are found via a text-mining technique called [Topic Modeling](https://en.wikipedia.org/wiki/Topic_model).
First upload a CSV file, then use the "Create a corpus" button to enter
the pipeline composition page.

In machine learning and natural language processing, a topic model is a
### Create and visualise topic models via pyLDAvis.

The topics are found via a text-mining technique called
[Topic Modeling](https://en.wikipedia.org/wiki/Topic_model).

In machine learning and natural language processing, a topic model is a
type of statistical model for discovering the abstract "topics" that occur in a
collection of documents.

[Video demonstration](https://www.youtube.com/watch?v=IksL96ls4o0&t=255s)

![LDA visualisation example](ldavis.png?raw=true "LDA visualisation example")

How to run:
## How to run:

```
docker-compose build
docker-compose up -d
```

Enter the shell of eeacorpus container:

```
docker exec -it eeacorpus_shell_1 bash
```

Inside the shell you run:

```
python load_eea_corpus.py --normalize --data data-small.csv
```

This will (after some time) start the visualisation browser on 0.0.0.0:8888
This will (after some time) start the EEA Corpus application server on
[localhost:8181](http://0.0.0.0:8181)

## EEA Corpus Data

The latest EEA Corpus dataset can be produced by visiting
[global catalogue](http://search.apps.eea.europa.eu/) > See all results > download csv.
The latest EEA Corpus dataset can be produced by visiting [global
catalogue](http://search.apps.eea.europa.eu/) > See all results > download
csv.

Once the csv file is downloaded, you can pass it to this application to be analysed. Make sure your
first column is the "document text" to be analysed. The other columns are considered metadata.
Once the csv file is downloaded, you can pass it to this application to be
analysed. Make sure your first column is the "document text" to be analysed.
The other columns are considered metadata.
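
As a rough illustration of the layout described above, the following sketch reads a CSV and treats the first column as the document text and the remaining columns as metadata. It is not code from this repository: the function name `split_text_and_metadata`, the sample column names, and the assumption that the file starts with a header row are all hypothetical.

```python
import csv
import io


def split_text_and_metadata(csv_file):
    """Yield (text, metadata) pairs from an open CSV file.

    Assumption: the first row is a header, the first column holds the
    document text and every other column is metadata.
    """
    reader = csv.reader(csv_file)
    header = next(reader, None)
    if header is None:
        return  # empty file, nothing to yield
    meta_names = header[1:]
    for row in reader:
        if not row:
            continue  # skip blank lines
        text, meta_values = row[0], row[1:]
        yield text, dict(zip(meta_names, meta_values))


# Hypothetical sample data standing in for a downloaded CSV file.
sample = io.StringIO(
    "text,title,year\n"
    '"Air quality in Europe improved.",Air report,2016\n'
    '"Water stress is increasing.",Water report,2017\n'
)
docs = list(split_text_and_metadata(sample))
```

For a real file, `open("data.csv", newline="", encoding="utf-8")` could be passed in place of the in-memory sample.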

You may download an already generated large EEA corpus data for testing like this:
You may download an already generated large EEA corpus data for testing like
this:

```
curl -L -o data.csv https://www.dropbox.com/s/sihmoc4wwpl0kr2/data_all.csv?dl=1
```
8 changes: 4 additions & 4 deletions docker-compose.yml
@@ -12,16 +12,16 @@ services:
      - corpus-data:/corpus
    command: sh -c "pserve production.ini"

  redis:
    image: redis

  worker:
    image: eeacms/corpus:pyramid_service
    image: eeacms/corpus:latest
    command: sh -c "worker production.ini"
    environment:
      - REDIS_URL=redis://redis:6379/0
    volumes:
      - corpus-data:/corpus

  redis:
    image: redis

volumes:
  corpus-data:
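
The `REDIS_URL` value passed to the worker follows the usual `redis://host:port/db` form. The sketch below shows one way a Python process might read and split such a URL using only the standard library. The helper `parse_redis_url` is hypothetical and is not taken from the `eeacms/corpus` image, whose worker may handle the variable differently.

```python
import os
from urllib.parse import urlparse


def parse_redis_url(url):
    """Split a redis:// URL into (host, port, db).

    Hypothetical helper for illustration; defaults to port 6379 and
    database 0 when they are not given in the URL.
    """
    parts = urlparse(url)
    if parts.scheme != "redis":
        raise ValueError("expected a redis:// URL, got %r" % url)
    db = int(parts.path.lstrip("/") or 0)
    return parts.hostname, parts.port or 6379, db


# Fall back to the value from docker-compose.yml when the variable is unset.
redis_url = os.environ.get("REDIS_URL", "redis://redis:6379/0")
host, port, db = parse_redis_url(redis_url)
```

Inside the compose network, the host part `redis` resolves to the `redis` service defined in the same file.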
