feat: add support for 2024; add bash scripts; bugfix log-transform line-plots; restructure NGBoost model to simplify handling for 2024
fif911 committed Aug 31, 2024
1 parent 0f15827 commit f3ef68b
Showing 14 changed files with 72,882 additions and 2,561 deletions.
42 changes: 36 additions & 6 deletions README.md
@@ -2,7 +2,7 @@

Can we predict wars? How certain would we be in our predictions?
This research presents the first publicly available and explainable
early conflict forecasting model capable of forecasting distribution
early conflict forecasting model capable of forecasting the distribution
of conflict-related fatalities on a country-month level. The model seeks
to be maximally transparent, uses publicly available data and produces
predictions up to 14 months into the future. Our model improves over
@@ -17,19 +17,22 @@ a reference against which future improvements can be evaluated.
The report with details is
available [here](https://drive.google.com/file/d/1r63S5BRPRl8G2HuTjyWtFpOxvVNsNV7o/view?usp=sharing).

The shortened version of the report is available
at [Medium](https://medium.com/@zakotianskyi/predicting-wars-explainable-probabilistic-forecasting-of-conflict-related-fatalities-50c00cac02e4).

## Model

The NGBoost model code is stored in the model folder in two representations: `.py` and `.ipynb`. Simply run the script,
and it will produce plots along with submission files.

## Model Evaluation
### Model Evaluation

The model is evaluated using `evaluate_submissions.py` file, and the aggregated statistics about the model can be
gathered via `compare_submissions.ipynb`.

## Development
## For developers

### Set up environment
### Install dependencies

Set up the environment using poetry by running the following command:

@@ -47,5 +50,32 @@ pre-commit install

Ensure that you have the following dependencies installed:

1) black (for formatting)
2) jupiter (for removing output from notebooks)
1) Black (for Python code formatting)
2) Jupyter (for removing output from notebooks)

### Jupytext

For better development experience and version control, the Jupytext library is used to generate `.py` files based on
their `.ipynb` representation and vice-versa. Additionally, Jupytext provides a convenient syncing logic between both
representations.

### Bash scripts

There are two bash scripts available:

- **data_preprocessing_pipeline.sh** - script for running all steps of the data preprocessing pipeline. Note that this
requires both an R and a Python environment, because the pipeline uses some libraries that are available
only in R.
- **jupytext_sync.sh** - script to create a Jupyter model file and sync it with its Python representation.

Run the following command to make a bash script executable:

```bash
chmod +x [file].sh
```

Run the following command to execute the bash script:

```bash
./[file].sh [args]
```
Binary file not shown.
12 changes: 12 additions & 0 deletions changelog.md
@@ -23,3 +23,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Changed

- Migrated to the most recent version of `cm_features` published by ViEWS.

## [1.1] - 2023-08-31

### Added

- Support for the year 2024.
- Bash scripts for running data preprocessing pipeline and syncing Jupyter notebooks with Python files.

### Changed

- Fixed the line-plot logic so that values are converted correctly when the log-transform flag is on.
- Restructured the NGBoost model to simplify handling for 2024.
66,927 changes: 66,927 additions & 0 deletions data/cm_features_v2.5_Y2024.csv

Large diffs are not rendered by default.

6 changes: 1 addition & 5 deletions data_preprocessing_pipeline/5. create_vdem_PCAs.py
@@ -6,10 +6,6 @@
# format_name: light
# format_version: '1.5'
# jupytext_version: 1.16.4
# kernelspec:
# display_name: Python 3 (ipykernel)
# language: python
# name: python3
# ---

import numpy as np
@@ -29,7 +25,7 @@
scaler = StandardScaler()
vdem_columns_centered = scaler.fit_transform(vdem_columns)

pca = PCA()
pca = PCA(random_state=42)
pca.fit(vdem_columns_centered)

cumulative_explained_variance = np.cumsum(pca.explained_variance_ratio_)
61 changes: 52 additions & 9 deletions data_preprocessing_pipeline/6. shift yearly cm_features.py
@@ -6,14 +6,11 @@
# format_name: light
# format_version: '1.5'
# jupytext_version: 1.16.4
# kernelspec:
# display_name: Python 3 (ipykernel)
# language: python
# name: python3
# ---

# +
import pandas as pd
import sys

version = "v2.5"
cm_features = pd.read_csv(f"../data/cm_features_{version}.csv")
@@ -22,11 +19,18 @@
country_id_to_ccode = cm_features[["country_id", "ccode"]].drop_duplicates()
cm_features

# +
month_to_date = lambda x: f"{1980 + (x - 1) // 12}-{((x - 1) % 12) + 1}-01"

last_month_id = cm_features["month_id"].max()
print(f"Last month_id: {last_month_id}")
print(f"Last month: {month_to_date(last_month_id)}")

# +
import numpy as np
import pandas as pd

# TODO: Figure out partial creation for 2024
prediction_years = [2018, 2019, 2020, 2021, 2022, 2023]
prediction_years = [2018, 2019, 2020, 2021, 2022, 2023, 2024]
prediction_window = 14
column_name = f"ged_sb_{prediction_window}"
for prediction_year in prediction_years:
@@ -38,6 +42,7 @@
# get last month_id
last_month_id = cm_features_year["month_id"].max()
print(f"Last month_id: {last_month_id}")
print(f"Last month: {month_to_date(last_month_id)}")
last_month_cm_features = cm_features_year[
cm_features_year["month_id"] == last_month_id
]
@@ -65,12 +70,40 @@
# add ccode column to actuals_year
actuals_year = actuals_year.merge(country_id_to_ccode, on="country_id", how="left")
actuals_year = actuals_year[~actuals_year["ccode"].isnull()]
print(f"Expected actuals: {last_month_id + 3} to {last_month_id + 3 + 11}")
print(
f"Expected actuals: {month_to_date(last_month_id + 3)} to {month_to_date(last_month_id + 3 + 11)}"
)
if prediction_year == 2024:
# append missing actuals for 2024
# pad the missing months with placeholder rows (ged_sb is set to sys.maxsize below)
last_month_actuals = actuals_year[
actuals_year["month_id"] == actuals_year["month_id"].max()
]
amount_of_missing_months = 12 - actuals_year["month_id"].nunique()
actuals_month_buffer_features = []
for counter in range(1, amount_of_missing_months + 1):
temp_month = last_month_actuals.copy()
temp_month["month_id"] = last_month_id + counter
temp_month[
"ged_sb"
] = sys.maxsize # CAREFUL WITH THIS; NOT TO EVALUATE AGAINST IT LATER!
actuals_month_buffer_features.append(temp_month)

# Concatenate the list of DataFrames into a single DataFrame
actuals_month_buffer_df = pd.concat(actuals_month_buffer_features)

# Then concatenate this DataFrame with the actuals_year DataFrame
actuals_year = pd.concat([actuals_year, actuals_month_buffer_df])

_gap_months = two_month_buffer_features["month_id"].unique() - 11 - 3
test_set_months_min = cm_features_year["month_id"].max() - 11
test_set_months_max = cm_features_year["month_id"].max()
print(f"_gap_months: expected empty months because of the gap: {_gap_months}")
print(f"test set is from {test_set_months_min} to {test_set_months_max}")
print(
f"test set is from {month_to_date(test_set_months_min)} to {month_to_date(test_set_months_max)}"
)
print(f"two month buffer months: {two_month_buffer_features['month_id'].unique()}")

cm_features_year = pd.concat(
@@ -95,16 +128,26 @@
month_ids_is_null = cm_features_year[cm_features_year[column_name].isnull()][
"month_id"
].unique()

print("month_ids_is_null: ", month_ids_is_null)
print("month_ids_is_null to date: ", [month_to_date(x) for x in month_ids_is_null])

# Prediction year = 2024
# Test set input begins in November 2022
# Test set input ends in October 2023
# November 2022 -> predicts January 2024 (month + 14)
# October 2023 -> predicts December 2024 (month + 14)
# Test set span -> November 2022 --> October 2023 = 12 months
# Months we skip initially - September 2022, October 2022
# Months we skip later - November 2023, December 2023
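
The window arithmetic spelled out in the comments above can be checked with a short, standalone sketch. It assumes, as `month_to_date` implies, that `month_id` 1 is January 1980. `date_to_month_id` and `input_window` are hypothetical helpers that do not appear in the script; they only illustrate the 14-month shift.

```python
from datetime import date

BASE_YEAR = 1980  # month_id 1 corresponds to January 1980, as in month_to_date
PREDICTION_WINDOW = 14  # months between an input month and the month it predicts


def month_id_to_date(month_id: int) -> date:
    """Convert a 1-based month_id to the first day of that month."""
    year = BASE_YEAR + (month_id - 1) // 12
    month = (month_id - 1) % 12 + 1
    return date(year, month, 1)


def date_to_month_id(year: int, month: int) -> int:
    """Inverse conversion (hypothetical helper, not part of the script)."""
    return (year - BASE_YEAR) * 12 + month


def input_window(prediction_year: int) -> tuple[int, int]:
    """First and last input month_id whose targets fall in prediction_year."""
    first_target = date_to_month_id(prediction_year, 1)
    last_target = date_to_month_id(prediction_year, 12)
    return first_target - PREDICTION_WINDOW, last_target - PREDICTION_WINDOW


first_input, last_input = input_window(2024)
# Expected from the comments above: November 2022 and October 2023.
print(month_id_to_date(first_input), month_id_to_date(last_input))
```

Under these assumptions the input window for 2024 spans twelve months, and shifting either end by 14 months lands on January 2024 and December 2024 respectively.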

assert all(_gap_months == month_ids_is_null), "Unexpected missing months"

# drop gap months
cm_features_year = cm_features_year[~cm_features_year["month_id"].isin(_gap_months)]

cm_features_year.to_csv(
f"../data/cm_features_{version}_ica_Y{prediction_year}.csv", index=False
f"../data/cm_features_{version}_Y{prediction_year}.csv", index=False
)

print("All done!")

# -
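
The 2024 branch of the script above pads the months without actuals with `sys.maxsize` and warns against evaluating on those rows. A minimal sketch of that idea follows, using plain dictionaries instead of a DataFrame so that it runs without pandas. The function names are hypothetical, and the sketch numbers the placeholder months consecutively after the last observed month, which is an assumption and may differ from the script's own numbering.

```python
import sys

PLACEHOLDER = sys.maxsize  # same sentinel value the script writes into "ged_sb"


def pad_missing_months(rows, months_expected=12):
    """Append placeholder rows until months_expected distinct month_ids exist.

    Each row is a dict with "country_id", "month_id" and "ged_sb" keys.
    """
    observed_months = sorted({row["month_id"] for row in rows})
    last_month = observed_months[-1]
    # Rows of the last observed month serve as a template for the padding.
    template = [row for row in rows if row["month_id"] == last_month]
    missing = months_expected - len(observed_months)
    padded = list(rows)
    for offset in range(1, missing + 1):
        for row in template:
            padded.append(
                {**row, "month_id": last_month + offset, "ged_sb": PLACEHOLDER}
            )
    return padded


def drop_placeholders(rows):
    """Keep only real observations, for example before computing metrics."""
    return [row for row in rows if row["ged_sb"] != PLACEHOLDER]


# Four observed months for one hypothetical country, so eight are missing.
actuals = [{"country_id": 1, "month_id": 529 + i, "ged_sb": i} for i in range(4)]
padded = pad_missing_months(actuals)
print(len(padded), len(drop_placeholders(padded)))
```

With a pandas DataFrame the same filter is a boolean mask such as `df[df["ged_sb"] != sys.maxsize]`, applied before any evaluation step reads the actuals.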