There have not been many peaceful years in our history. Let's at least forecast when the next unrest comes.
A new wave of violent conflicts around the world raises concerns about global security. According to the ACLED Conflict Index, 12% more conflicts occurred in 2023 than in 2022, and the trend shows no sign of halting.
The goal of our research is to build a robust model for early military conflict prediction that is available to everyone. Awareness of emerging risks should be a right, not a privilege.
This repository presents the first publicly available and explainable early conflict forecasting model capable of predicting the distribution of conflict-related fatalities at the country-month level. The model aims to be maximally transparent and produces predictions up to 14 months into the future. It improves over the benchmarks in 4 out of 6 benchmark years, but so far misses important violence spikes.
Keywords: Interstate conflict modelling · Early Conflict Warning System · Fatalities prediction · Predicting with uncertainty.
Before you proceed with running the model and iterating over existing implementations, it's important to understand the inputs and outputs of the model.
The model takes as input a dataset from the ViEWS prediction competition 2024, augmented with additional features and PCA-derived features. The full data preprocessing pipeline is stored in the data_preprocessing_pipeline folder.
The original dataset combines the UCDP Georeferenced Event Dataset (GED), the V-Dem dataset, the World Development Indicators, the ACLED dataset, and several others.
The model outputs a predicted distribution of conflict-related fatalities for each country-month pair (a regression problem). The default prediction window is 14 months ahead, as required by the ViEWS competition rules, but this can easily be adjusted in the 6. shift yearly cm_features.py data pipeline file.
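The core of that adjustment is dependent-variable shifting: the target is moved backwards in time so that features observed in month t are paired with fatalities observed t + 14 months later. A minimal sketch of the idea with pandas (the column names and the 2-month horizon are illustrative, not the pipeline's actual configuration):

```python
import pandas as pd

# toy country-month frame; "ged_sb" stands in for the fatalities column
df = pd.DataFrame({
    "country_id": [1, 1, 1, 1, 1],
    "month_id": [100, 101, 102, 103, 104],
    "ged_sb": [0, 2, 5, 1, 3],
})

HORIZON = 2  # the real pipeline uses 14; 2 keeps the toy frame readable
df = df.sort_values(["country_id", "month_id"])

# pair each month's features with fatalities HORIZON months ahead
df["target"] = df.groupby("country_id")["ged_sb"].shift(-HORIZON)

# the last HORIZON months have no future target yet and are dropped
df = df.dropna(subset=["target"])
```

Grouping by country before shifting ensures the target never leaks across country boundaries.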
The best sources of information about the model are the technical report and its shortened version. They describe the model in detail, provide insights into its performance, and suggest possible improvements. The technical report, with implementation details and nuances of the model, is available on Google Drive; for your convenience, its structure is shown below. The shortened version is available on Medium and provides a high-level overview of the model and its performance.
While the code is flexible and any model can be plugged in, we build our model using the Natural Gradient Boosting (NGBoost) framework. Other models are in development.
The NGBoost model code is stored in the model folder in two representations: .py and .ipynb. Only the .py files are pushed to GitHub; the .ipynb files are generated using Jupytext (see the bash scripts in the section below).
Simply run the .py or .ipynb script: it will train an NGBoost model based on the parameters specified in the header of the file and produce plots and submission files that can be evaluated to derive model accuracy.
The model is evaluated using the evaluate_submissions.py file, and aggregated statistics about the model can be gathered via compare_submissions.ipynb.
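One of the competition's scoring criteria, the Continuous Ranked Probability Score (CRPS), can be estimated directly from samples of a predicted distribution. A minimal sketch of the standard sample-based estimator (not the repository's exact implementation):

```python
import numpy as np

def crps_from_samples(samples, observed):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|.

    Lower is better; 0 means the predicted distribution is a point
    mass exactly on the observed value.
    """
    s = np.asarray(samples, dtype=float)
    accuracy = np.mean(np.abs(s - observed))            # distance to the outcome
    spread = 0.5 * np.mean(np.abs(s[:, None] - s))      # internal spread penalty credit
    return accuracy - spread

print(crps_from_samples([3, 3, 3], 3))  # 0.0 - point mass on the observation
print(crps_from_samples([0, 2], 1))     # 0.5 - spread around the observation
```

Because CRPS rewards both calibration and sharpness, it suits a task where the whole fatality distribution, not just a point forecast, is evaluated.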
Set up the environment with Poetry (install Poetry if you don't have it yet), using Python 3.10 as the base interpreter:

```bash
poetry install
```
Run the following command to install pre-commit hooks:
```bash
pre-commit install
```

Ensure that you have the following dependencies installed:
- Black (for Python code formatting)
- Jupyter (for removing output from notebooks)
For a better development experience and version control, the Jupytext library is used to generate .py files from their .ipynb representation and vice versa. Additionally, Jupytext provides convenient syncing logic between both representations.
There are two bash scripts available:
- data_preprocessing_pipeline.sh - script for running all steps of the data preprocessing pipeline. Note that this requires both an R and a Python environment, because the pipeline uses some libraries available only in R.
- jupytext_sync.sh - script to create a Jupyter model file and sync it with its Python representation.
Run the following command to give execute permission to a bash script:

```bash
chmod +x [file].sh
```
Run the following command to execute the bash script:

```bash
./[file].sh [args]
```
Run the following command to generate a Jupyter notebook:

```bash
jupytext --to ipynb [file_name].py
```
Run the following command to turn a Jupyter notebook into a paired ipynb/py notebook:

```bash
jupytext --set-formats ipynb,py [file_name].ipynb
```
Run the following command to synchronize the Jupyter notebook with changes in the Python file:

```bash
jupytext --sync [file_name].ipynb
```
The technical report is structured as follows:
- Introduction
- Related Work
- Summary of contributions
- Methodology
- Level of analysis and prediction window
- Original Competition Dataset
- Data preprocessing
- Data cleaning
- Dependent variable shifting
- Regions addition
- Parametrization
- Least Important Features Drop
- Natural Gradient Boosting
- Handling Negative Predictions
- Handling of removed countries
- Scoring Criteria
- Continuous Ranked Probability Score
- Ignorance Score
- Mean Interval Score
- Metrics Implementation
- Model fine-tuning
- Competition Benchmarks
- Last Historical Poisson
- Bootstraps from actuals
- Results
- General Performance
- Additional evaluation for the 2022 year
- Model accuracy dependency on input fatalities distribution of the month
- Feature Importance
- Analysis of country forecasts
- Discussion
- Future work
- Appendix with tables and figures
I hope you have fun reading it :P