Introduction to Docker and Containerization
Containers are portable, lightweight, and efficient tools for application deployment. Unlike virtual machines, they allow many isolated environments to run on a single host operating system (OS), often supporting hundreds or thousands of containers simultaneously. By decoupling software from its runtime environment, containers let developers build an application once and run it unchanged on any machine with Docker installed, whether that is a Linux server or a Windows workstation running Docker Desktop, without facing configuration issues.
Docker is a platform that simplifies the creation, provisioning, and execution of containers. A container bundles an application with everything it needs to run, including libraries, configuration files, and dependencies. Instead of requiring separate operating systems for each application, containers share the underlying OS services of the host system, making them highly resource-efficient.
How Do Containers Differ from Virtual Machines?
Unlike virtual machines (VMs), which include a full operating system along with the application and its dependencies, containers share the host OS kernel. This makes containers much lighter and faster to start compared to VMs, which require hardware-level virtualization and more resources. Containers focus on isolating applications, while VMs isolate entire operating systems.
Docker Installation Guide
To get started with Docker, follow the installation instructions based on your operating system:
- macOS and Windows: Install Docker Desktop by following the official installation guide.
- Linux: Follow the instructions for your distribution, such as Ubuntu.
If you're using Windows, it's recommended to enable the Windows Subsystem for Linux (WSL) for better performance and compatibility; see Docker's WSL integration documentation for details.
Fixing Permissions on Linux
To avoid permission issues when running Docker commands on Linux, add your user to the `docker` group:

```bash
sudo usermod -aG docker $USER
```

Log out and back in (or run `newgrp docker`) for the group change to take effect.
Testing Docker with "Hello World"
Run the following command to verify your Docker installation:
```bash
docker run hello-world
```
This will download and execute a simple Docker image, confirming that Docker is set up correctly.
Experiment with Docker Online
You can try Docker without installing it by using the Play With Docker platform.
Running an Ubuntu Container
To launch an interactive Ubuntu container, use:

```bash
docker run -it ubuntu bash
```

The `-it` flag enables interactive mode, allowing you to access the Ubuntu container's command line directly.
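As a quick, hypothetical session inside such a container (the exact prompt will differ on your machine), you might verify the distribution and then leave:

```bash
# Inside the interactive Ubuntu container:
cat /etc/os-release   # confirm you are running Ubuntu
apt-get update        # package commands work as on a normal Ubuntu system
exit                  # leave (and stop) the container
```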
Docker from Scratch
To deeply understand Docker and containers, you can explore how to build containers from scratch. Check out this resource: Containers From Scratch.
What Are Containers?
A container is a lightweight, standalone, and executable unit of software that includes everything needed to run an application: the code, runtime, libraries, and dependencies. Containers are created from images and can be managed using the Docker API or CLI.
With containers, you can:
- Create, start, stop, move, or delete instances.
- Connect containers to networks or attach storage volumes.
- Build new images based on a container's current state.
Containers are isolated by default, meaning their network, storage, and subsystems are separate from the host machine and other containers. However, you can configure the level of isolation based on your needs.
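As a small sketch of that lifecycle (the container, volume, network, and image names here are arbitrary examples, not from this article), the Docker CLI covers each of these operations:

```bash
docker create --name web -v webdata:/usr/share/nginx/html nginx  # create a container with a named volume attached
docker start web                      # start the container
docker network create appnet          # create a user-defined network
docker network connect appnet web     # connect the running container to it
docker stop web                       # stop the container
docker commit web my-nginx:snapshot   # build a new image from its current state
docker rm web                         # delete the container
```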
Docker Architecture
Docker operates using a client-server architecture:
- Docker Client: This is the interface used to interact with Docker. Commands like `docker run` or `docker build` are sent from the client to the daemon.
- Docker Daemon: The daemon handles the heavy lifting of building, running, and managing containers.
- Communication: The client and daemon communicate using a REST API over UNIX sockets or network interfaces.
Docker Compose is another client that helps manage multi-container applications. It allows you to define and run applications consisting of multiple interconnected containers.
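For illustration, a minimal hypothetical `docker-compose.yml` (the service names and images are examples, not from this article) might define a web application and its cache:

```yaml
services:
  web:
    build: .              # build the image from the Dockerfile in this directory
    ports:
      - "5000:5000"       # publish container port 5000 on the host
  redis:
    image: redis:7-alpine # a second, interconnected container
```

Running `docker compose up` then starts both containers on a shared network, where each service can reach the other by its service name.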
Let’s break down the following Dockerfile and understand its key concepts:
```dockerfile
# Our base image
FROM python:3.10.5-alpine

# Set working directory inside the image
WORKDIR /app

# Copy our requirements
COPY requirements.txt requirements.txt

# Install dependencies
RUN pip3 install -r requirements.txt

# Copy this folder's contents to the image
COPY . .

# Tell the port number the container should expose
EXPOSE 5000
```
Every instruction in the Dockerfile creates a layer. Layers are intermediate images that store changes compared to the previous state of the image.
- `FROM python:3.10.5-alpine`: The base image layer. It provides a lightweight Python 3.10.5 environment built on Alpine Linux.
- `WORKDIR /app`: Sets the working directory inside the container to `/app`. Subsequent instructions like `COPY` or `RUN` execute relative to this directory.
- `COPY requirements.txt requirements.txt`: Adds the `requirements.txt` file from the local system to the container's `/app` directory.
- `RUN pip3 install -r requirements.txt`: Installs the Python dependencies listed in `requirements.txt`, forming another layer that stores the installed packages.
- `COPY . .`: Copies all the files from the current directory on the host machine into the container's `/app` directory.
- `EXPOSE 5000`: Informs Docker that the container will listen on port 5000. This doesn't automatically publish the port; it acts as documentation for users.
Each instruction (e.g., `FROM`, `COPY`, `RUN`) creates a layer. Layers optimize the build process by reusing unchanged layers when the Dockerfile is rebuilt. Think of it like saving "checkpoints" during a build process.
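This is why the Dockerfile above copies `requirements.txt` and installs dependencies before copying the rest of the source: as long as `requirements.txt` is unchanged, a rebuild after editing application code reuses the cached dependency layer. A sketch of the anti-pattern to avoid:

```dockerfile
# Anti-pattern (illustrative only): copying everything first means any
# source-code change also invalidates the cache for the pip install layer,
# forcing dependencies to be reinstalled on every rebuild.
FROM python:3.10.5-alpine
WORKDIR /app
COPY . .
RUN pip3 install -r requirements.txt
```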
- `COPY`: Used for basic file copying from the local machine to the container.
  - Example: `COPY requirements.txt requirements.txt`
- `ADD`: Provides extra functionality, such as extracting local `.tar` archives or downloading files from a URL.
  - Example: `ADD myfiles.tar.xz /app`

Best Practice: Use `COPY` for simple file operations and `ADD` only when its additional features are required.
- `CMD`: Specifies the default command to execute when the container starts.
  - Example: `CMD ["python", "app.py"]`
  - This runs the Python script `app.py` by default.
- `ENTRYPOINT`: Specifies the command that always runs when the container starts.
  - Example: `ENTRYPOINT ["python"]` combined with `CMD ["app.py"]`
  - This sets `python` as the main executable, with `app.py` as the default argument.

Best Practice: Use `ENTRYPOINT` for fixed commands and `CMD` for configurable arguments.
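The payoff of that split is runtime flexibility. Assuming an image built with the `ENTRYPOINT`/`CMD` pair above and tagged `myapp` (a hypothetical name, as is `other.py`), arguments passed to `docker run` replace `CMD` while `ENTRYPOINT` stays fixed:

```bash
docker run myapp            # runs: python app.py   (the CMD default)
docker run myapp other.py   # runs: python other.py (CMD overridden at run time)
```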
- Exec Form: Directly specifies the executable and its arguments as a JSON array.
  - Example: `CMD ["python", "app.py"]`
  - Advantage: Signals like `CTRL-C` (SIGINT) are delivered directly to the running process, ensuring graceful termination.
- Shell Form: Runs commands through a shell (e.g., `/bin/sh -c`).
  - Example: `CMD python app.py`
  - Limitation: The shell often doesn't forward signals to the child process, causing issues with process management.

Best Practice: Always use the exec form to ensure proper signal handling.
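You can observe the difference at runtime: with the shell form, PID 1 inside the container is typically the shell wrapper rather than your application. A quick check (a sketch; `<container_id>` is a placeholder as in the commands later in this article):

```bash
docker top <container_id>   # shell form: the process list typically shows "/bin/sh -c python app.py"
                            # exec form: PID 1 is the python process itself
```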
- `docker stop`: Sends a `SIGTERM` signal to the container's main process, allowing it to shut down gracefully.
  - Example: A Python application can register a handler for `SIGTERM` (or catch `KeyboardInterrupt` for `SIGINT`) and clean up resources before exiting.
- `docker kill`: Sends a `SIGKILL` signal, immediately terminating the process without cleanup.

Best Practice: Use `docker stop` whenever possible to allow the application to exit cleanly.
This Dockerfile demonstrates how to set up a Python application in a lightweight container. By understanding concepts like layers, `ADD` vs `COPY`, and `CMD` vs `ENTRYPOINT`, you can build efficient, reusable Docker images while following best practices.
Let’s break down this Dockerfile and the Python script `main.py` that handles signals gracefully, along with key concepts about Docker signals.
```dockerfile
FROM python:3.7.13-alpine

# Copy the Python script into the container
COPY main.py main.py

# Define the default command to execute
CMD ./main.py

# Send SIGINT instead of SIGTERM when stopping the container
STOPSIGNAL SIGINT
```
The `main.py` script is designed to handle system signals like `SIGTERM` and `SIGINT`.
```python
#!/usr/local/bin/python3 -u
import sys
import signal
import time

# Define the signal handler function
def signal_handler(signum, frame):
    print(f"Gracefully shutting down after receiving signal {signum}")
    sys.exit(0)

if __name__ == "__main__":
    # Attach signal handlers for SIGTERM and SIGINT
    signal.signal(signal.SIGTERM, signal_handler)
    signal.signal(signal.SIGINT, signal_handler)

    # Simulate work in a loop
    while True:
        time.sleep(0.5)  # Simulating some task
        print("Interrupt me")
```
- Building the Docker Image: Run the following command to build the image:

  ```bash
  docker build -t signal-handling-example .
  ```

- Running the Container: Start the container:

  ```bash
  docker run signal-handling-example
  ```

- Stopping the Container Gracefully: Use `docker stop`, which sends the container's configured stop signal (`SIGINT` here, as defined by `STOPSIGNAL` in the Dockerfile; `SIGTERM` by default):

  ```bash
  docker stop <container_id>
  ```

  The container terminates gracefully, and you'll see the message:

  ```
  Gracefully shutting down after receiving signal 2
  ```

- Forcefully Killing the Container: Use `docker kill` to send a `SIGKILL` signal, which terminates the container immediately without cleanup:

  ```bash
  docker kill <container_id>
  ```

  The exit status will be `137` (128 + 9, where `9` is the `SIGKILL` signal number).
The `STOPSIGNAL` instruction in the Dockerfile allows you to customize the signal sent when stopping the container.

- Default Behavior: By default, `docker stop` sends a `SIGTERM` signal.
- Customizing with `STOPSIGNAL`: Adding the following line to the Dockerfile changes the stop signal to `SIGINT`:

  ```dockerfile
  STOPSIGNAL SIGINT
  ```

  Now, stopping the container sends `SIGINT` instead of `SIGTERM`, ensuring proper handling by Python.
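To check which stop signal a container is configured with, you can query it with `docker inspect` (a quick sketch; the field is empty when the image doesn't set `STOPSIGNAL`):

```bash
docker inspect --format '{{.Config.StopSignal}}' <container_id>
```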
You can send any signal to a container using the `docker kill` command with the `--signal` flag.

- Send `SIGTERM`:

  ```bash
  docker kill --signal=SIGTERM <container_id>
  ```

- Send `SIGINT`:

  ```bash
  docker kill --signal=SIGINT <container_id>
  ```
| Signal Name | Signal Number | Description |
| --- | --- | --- |
| SIGHUP | 1 | Hangup detected on the controlling terminal. |
| SIGINT | 2 | Issued when the user sends an interrupt (Ctrl + C). |
| SIGQUIT | 3 | Issued when the user sends a quit signal (Ctrl + \). |
| SIGFPE | 8 | Issued for erroneous arithmetic operations, such as division by zero. |
| SIGKILL | 9 | Immediately terminates the process without cleanup. |
| SIGALRM | 14 | Alarm clock signal (used for timers). |
| SIGTERM | 15 | Default termination signal (sent by `docker stop`). |
- Graceful Shutdown: Python applications should handle `SIGTERM` or `SIGINT` gracefully to clean up resources and exit properly.
- Use `STOPSIGNAL`: Customize the default stop signal in the Dockerfile to align with your application's requirements.
- Avoid Forceful Termination (`SIGKILL`): Only use `docker kill` when absolutely necessary, as it doesn't allow the application to perform cleanup.
- Version Pinning: Always specify exact versions in the Dockerfile (e.g., `python:3.7.13-alpine`) to ensure reproducibility.
For the full set of instructions and options, refer to the complete Dockerfile syntax guide.
This project implements a Convolutional Neural Network (CNN) on the MNIST dataset using PyTorch. The project is containerized using Docker to ensure easy setup and consistent environments across different machines. The script allows training the model from scratch, resuming training from a checkpoint, and evaluating the model's performance.
- Overview
- What is Docker?
- Why use Docker?
- Requirements
- Docker Setup
- Training Script Arguments
- Model Architecture
- Data Loading and Transformations
- Model Initialization
- Checkpoint Loading and Saving
- Training and Evaluation Loop
- Results
The goal of this project is to classify handwritten digits (0-9) from the MNIST dataset using a Convolutional Neural Network (CNN). The project uses PyTorch for the model implementation, and Docker is used to containerize the application for ease of use and portability.
The MNIST (Modified National Institute of Standards and Technology) dataset is a database of handwritten digits that is commonly used for training and benchmarking image processing systems. Here are some key details about the dataset:
- Content: 28x28 grayscale images of handwritten digits (0-9)
- Size:
- 60,000 training images
- 10,000 test images
- Format: Each image is represented as a 2D PyTorch tensor
- Labels: Each image is associated with a label (0-9)
- Source: The dataset is built into PyTorch and can be easily downloaded using `torchvision.datasets.MNIST`
In this project, we use PyTorch's `torchvision.datasets.MNIST` to download and load the MNIST dataset. The data is normalized and transformed into PyTorch tensors for training and testing.
Docker is an open-source platform that automates the deployment of applications in lightweight, portable containers. These containers package an application and all of its dependencies, ensuring it runs the same regardless of the environment. Docker provides a way to isolate applications from the underlying system, preventing dependency conflicts and making it easier to manage and deploy applications across different systems.
Setting up environments for machine learning and deep learning projects can be challenging because of dependencies on hardware (such as CUDA for GPUs) and incompatibilities across Python versions and libraries. Docker offers a self-contained environment that resolves such issues.
For this project, Docker is especially useful because of the following:
- Environment Consistency: Every user runs the project in exactly the same environment, solving the "it works on my machine" problem.
- Easy Setup: PyTorch, torchvision, and other dependencies don't need to be manually installed when using Docker.
- Reproducibility: By specifying dependencies in a `Dockerfile`, you can duplicate the environment required to run the training pipeline.
To run this project, you need to have Docker installed on your system. The installation process varies depending on your operating system. Once installed, verify by running:
```bash
docker --version
```
Here's the Dockerfile for containerizing the MNIST training:
```dockerfile
FROM python:3.9-slim

WORKDIR /workspace

COPY requirements.txt requirements.txt

RUN pip3 --no-cache-dir install torch==1.9.0+cpu torchvision==0.10.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
RUN pip3 --no-cache-dir install numpy==1.23.4

COPY train.py /workspace/

CMD ["python", "train.py"]
```
To build the Docker image for this project, navigate to the root directory of your project and run:
```bash
docker build --tag mnist-classifier .
```
To run the container for training, use the following command:
```bash
docker run --name mnist-container --rm -v $(pwd):/workspace mnist-classifier python /workspace/train.py
```
To resume training from a saved checkpoint, mount the directory where the checkpoint is stored and pass the `--resume` argument:

```bash
docker run --name mnist-container --rm -v $(pwd):/workspace mnist-classifier python /workspace/train.py --resume
```
You can specify the following command-line arguments while running the training script:
| Argument | Default | Type | Description |
| --- | --- | --- | --- |
| `--batch-size` | 64 | `int` | Input batch size for training. |
| `--test-batch-size` | 1000 | `int` | Input batch size for testing. |
| `--epochs` | 15 | `int` | Number of epochs to train. |
| `--lr` | 0.001 | `float` | Learning rate for the optimizer. |
| `--gamma` | 0.7 | `float` | Step gamma for the learning rate scheduler. |
| `--no-cuda` | `False` | `bool` | Disables CUDA (GPU) training. |
| `--no-mps` | `False` | `bool` | Disables macOS GPU training (MPS backend). |
| `--dry-run` | `False` | `bool` | Quickly check a single pass for debugging purposes. |
| `--seed` | 1 | `int` | Random seed for reproducibility. |
| `--log-interval` | 10 | `int` | Number of batches to wait before logging training status. |
| `--save-model` | `True` | `bool` | Save the model after each epoch. |
| `--resume` | `True` | `bool` | Resume training from the last checkpoint if available. |
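A minimal sketch of how these flags could be declared with `argparse` (the names and defaults mirror the table; the actual `train.py` may differ):

```python
import argparse

def get_args():
    parser = argparse.ArgumentParser(description="MNIST CNN training")
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--test-batch-size", type=int, default=1000)
    parser.add_argument("--epochs", type=int, default=15)
    parser.add_argument("--lr", type=float, default=0.001)
    parser.add_argument("--gamma", type=float, default=0.7)
    parser.add_argument("--no-cuda", action="store_true")
    parser.add_argument("--no-mps", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--seed", type=int, default=1)
    parser.add_argument("--log-interval", type=int, default=10)
    parser.add_argument("--save-model", action="store_true", default=True)  # defaults mirror the table above
    parser.add_argument("--resume", action="store_true", default=True)
    return parser.parse_args()

args = get_args()
```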
Here’s the CNN architecture for MNIST digit classification:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # Defining the model architecture
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.1)
        self.dropout2 = nn.Dropout(0.2)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        # Define the forward pass
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output
```
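The `9216` input size of `fc1` follows from the tensor shapes: a 28x28 input becomes 26x26 after the first 3x3 convolution, 24x24 after the second, and 12x12 after 2x2 max pooling, giving 64 x 12 x 12 = 9216 features. A quick sanity check, assuming `Net` is defined as above:

```python
import torch

# Verify the architecture produces the expected output shape
net = Net()                    # Net as defined above
x = torch.randn(1, 1, 28, 28)  # one dummy MNIST-sized image
print(net(x).shape)            # expected: torch.Size([1, 10])
```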
The dataset undergoes normalization and is converted into tensors for PyTorch training and testing. Here’s the data-loading and transformation setup:
```python
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

training_data = datasets.MNIST(
    "../data", train=True, download=True, transform=transform
)
test_data = datasets.MNIST("../data", train=False, transform=transform)

train_loader = DataLoader(training_data, batch_size=args.batch_size, shuffle=True)
test_loader = DataLoader(test_data, batch_size=args.test_batch_size, shuffle=False)
```
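To confirm the loaders produce what the model expects, here is a quick check (an illustrative snippet, not part of the original script):

```python
# Fetch one batch and inspect its shape
images, labels = next(iter(train_loader))
print(images.shape)  # torch.Size([64, 1, 28, 28]) with the default batch size
print(labels.shape)  # torch.Size([64])
```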
Initializing the model and optimizer with parameters for training:
```python
import torch
import torch.optim as optim
from model import Net  # Assuming Net is your CNN architecture

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Net().to(device)
optimizer = optim.AdamW(model.parameters(), lr=args.lr)
```
To resume training from a checkpoint, the `--resume` argument is provided:
```python
import os
import torch

model_checkpoint_path = "model_checkpoint.pth"
start_epoch = 1

# Load the model checkpoint if the 'resume' argument is True and a checkpoint exists
if args.resume and os.path.exists(model_checkpoint_path):
    checkpoint = torch.load(model_checkpoint_path)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    start_epoch = checkpoint['epoch'] + 1

# Save the model and optimizer state after each epoch
def save_checkpoint(model, optimizer, epoch, path):
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict()
    }, path)
```
The main training and evaluation loop, including checkpoint saving:
```python
from torch.optim.lr_scheduler import StepLR

scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)

for epoch in range(start_epoch, args.epochs + 1):
    train_epoch(epoch, args, model, device, train_loader, optimizer)  # Train for one epoch
    test_epoch(model, device, test_loader)  # Evaluate on the test set
    scheduler.step()
    if args.save_model:
        save_checkpoint(model, optimizer, epoch, model_checkpoint_path)
```
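The loop assumes `train_epoch` and `test_epoch` helpers. A minimal sketch of what they might look like (hypothetical, shown for completeness; the project's own definitions may differ):

```python
import torch
import torch.nn.functional as F

def train_epoch(epoch, args, model, device, train_loader, optimizer):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        loss = F.nll_loss(model(data), target)  # matches the model's log_softmax output
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print(f"Train Epoch: {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)}]"
                  f"\tLoss: {loss.item():.6f}")
        if args.dry_run:
            break  # single pass for debugging

def test_epoch(model, device, test_loader):
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction="sum").item()
            correct += (output.argmax(dim=1) == target).sum().item()
    test_loss /= len(test_loader.dataset)
    print(f"Test set: Average loss: {test_loss:.4f}, "
          f"Accuracy: {correct}/{len(test_loader.dataset)}")
```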
After training the model for 15 epochs, the results are:
```
Train Epoch: 15 [59520/60000 (99%)]
Loss: 0.000001

Test set:
Average loss: 0.0306,
Accuracy: 9927/10000 (99%)
```