- [1300+ Towards DataScience Medium Articles Dataset](https://www.kaggle.com/datasets/meruvulikith/1300-towards-datascience-medium-articles-dataset)
Note: startup of the app, as well as LLM querying, can take a long time, especially without a GPU.
If you don't have llama2 downloaded in Ollama (or you're running with Docker), the first LLM query can be very time-consuming, since Ollama downloads the llama2 model on demand (and the download is known to have connection issues).
Reference timings on a machine with 16GB RAM, an Intel Core i7-9750H CPU, and an Nvidia GeForce GTX 1660 Ti:
- ~3 minutes startup with GPU (no cache)
- ~10 minutes startup CPU-only (no cache)
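If you want to verify up front that the model is available, the sketch below checks for llama2 and pulls it if missing. This is a hypothetical helper, not part of the repository; it assumes Ollama is running on its default port (11434) and exposes the `/api/tags` and `/api/pull` routes:

```python
# check_llama2.py -- hypothetical helper; assumes Ollama is running locally
# on its default port and exposes the /api/tags and /api/pull routes.
import json

import requests

OLLAMA_URL = "http://localhost:11434"
MODEL = "llama2"


def model_is_present() -> bool:
    # /api/tags lists the models already downloaded locally
    resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=10)
    resp.raise_for_status()
    names = [m["name"] for m in resp.json().get("models", [])]
    return any(name.startswith(MODEL) for name in names)


def pull_model() -> None:
    # /api/pull streams progress updates as JSON lines
    with requests.post(
        f"{OLLAMA_URL}/api/pull", json={"name": MODEL}, stream=True, timeout=None
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:
                print(json.loads(line).get("status", ""))


if __name__ == "__main__":
    if model_is_present():
        print(f"{MODEL} is already available")
    else:
        print(f"{MODEL} not found locally, pulling (this can take a while)...")
        pull_model()
```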
To run with Docker:

- Run `git clone https://github.com/AmevinLS/ds-article-rag`
- Change working directory to the cloned repository (`cd ds-article-rag`)
- Download the dataset and extract it to `./data/medium.csv`, creating the `./data` directory if needed (a quick sanity-check sketch follows this list)
- Run one of the following:
  - `docker-compose -f docker-compose_cpu.yml up --build` to run on CPU, if you don't have an Nvidia GPU
  - `docker-compose -f docker-compose_gpu.yml up --build` to run using your Nvidia GPU
- Go to `http://localhost:8501` in your browser to open the Streamlit app
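Once the dataset is in place, you can sanity-check it before starting the containers. A minimal sketch (the row-count threshold is an assumption of mine based on the dataset's name; the actual column names depend on the Kaggle export):

```python
# sanity-check the extracted dataset; requires pandas
import pandas as pd

df = pd.read_csv("./data/medium.csv")
print(f"{len(df)} rows, columns: {list(df.columns)}")

# The Kaggle dataset should contain roughly 1300+ articles;
# anything far off suggests a bad extraction.
assert len(df) > 1000, "unexpectedly few rows -- was the archive extracted correctly?"
```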
To run locally (without Docker):

Make sure you have Ollama installed (you can download it from the Ollama website).
Optionally, you can run `ollama pull llama2` yourself beforehand to avoid download issues at runtime.
- Run `git clone https://github.com/AmevinLS/ds-article-rag`
- Change working directory to the cloned repository (`cd ds-article-rag`)
- Download the dataset and extract it to `./data/medium.csv`, creating the `./data` directory if needed (see the sanity-check sketch above)
- Run `pip install -r requirements.txt` (tested with Python 3.9)
- Run `streamlit run ds_article_rag/app.py`
- Go to `http://localhost:8501` in your browser to open the Streamlit app (to wait for readiness programmatically, see the sketch after this list)
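Since startup can take several minutes (see the timings above), you can poll the server instead of refreshing the browser. A minimal sketch, assuming the default port 8501 and Streamlit's `/_stcore/health` route (older Streamlit versions used `/healthz`):

```python
# poll the Streamlit server until it reports healthy; requires requests
import time

import requests

HEALTH_URL = "http://localhost:8501/_stcore/health"  # /healthz on older Streamlit

for attempt in range(60):  # wait up to ~10 minutes
    try:
        if requests.get(HEALTH_URL, timeout=5).status_code == 200:
            print("Streamlit app is up at http://localhost:8501")
            break
    except requests.ConnectionError:
        pass  # server not accepting connections yet
    time.sleep(10)
else:
    print("App did not become healthy in time -- check the server logs")
```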
To use the precomputed cache (optional):

- Download the `cache` folder from Google Drive (here)
- Extract it into the `./data` directory, so the resulting structure looks as follows (a quick verification sketch follows):

```
data/
    medium.csv
    cache/
        code_reduced_par_embeds.npy
        code_reduced_paragraphs.csv
```
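To verify the cache landed in the right place, you can load both files and check that they line up. A minimal sketch (the assumption that the embeddings are row-aligned, one vector per paragraph in the CSV, is mine, inferred from the file names):

```python
# verify the downloaded cache; requires numpy and pandas
import numpy as np
import pandas as pd

embeds = np.load("./data/cache/code_reduced_par_embeds.npy")
paragraphs = pd.read_csv("./data/cache/code_reduced_paragraphs.csv")

print(f"embeddings: {embeds.shape}, paragraphs: {len(paragraphs)} rows")

# If the embeddings are one vector per paragraph, the counts should match.
assert embeds.shape[0] == len(paragraphs), "embedding/paragraph count mismatch"
```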
You can find details on how the whole system is structured, along with other relevant information, in the `./docs/report.md` file in this repository.