diff --git a/.gitignore b/.gitignore
index 619d98e..156b785 100644
--- a/.gitignore
+++ b/.gitignore
@@ -21,4 +21,5 @@ yarn-error.log*
 
 /tutorials/*.md
 /tutorials/data
+/tutorials/*.pth
 
diff --git a/docs/01-Introduction.md b/docs/01-Introduction.md
new file mode 100644
index 0000000..6e643bb
--- /dev/null
+++ b/docs/01-Introduction.md
@@ -0,0 +1,43 @@
+**Learn the Basics** ||
+[Quickstart](Quickstart.html) ||
+[Tensors](Tensors.html) ||
+[Datasets & DataLoaders](Data.html) ||
+[Transforms](transforms_tutorial.html) ||
+[Build Model](buildmodel_tutorial.html) ||
+[Autograd](autogradqs_tutorial.html) ||
+[Optimization](optimization_tutorial.html) ||
+[Save & Load Model](saveloadrun_tutorial.html)
+
+# Learn the Basics
+
+Authors:
+[Suraj Subramanian](https://github.com/suraj813),
+[Seth Juarez](https://github.com/sethjuarez/),
+[Cassie Breviu](https://github.com/cassieview/),
+[Dmitry Soshnikov](https://soshnikov.com/),
+[Ari Bornstein](https://github.com/aribornstein/)
+
+Most machine learning workflows involve working with data, creating models, optimizing model
+parameters, and saving the trained models. This tutorial introduces you to a complete ML workflow
+implemented in PyTorch, with links to learn more about each of these concepts.
+
+We'll use the FashionMNIST dataset to train a neural network that predicts which of the following
+classes an input image belongs to: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker,
+Bag, or Ankle boot.
+
+*This tutorial assumes a basic familiarity with Python and deep learning concepts.*
+
+
+## Running the Tutorial Code
+You can run this tutorial in a couple of ways:
+
+- **In the cloud**: This is the easiest way to get started! Each section has a "Run in Microsoft Learn" link at the top, which opens an integrated notebook in Microsoft Learn with the code in a fully-hosted environment.
+- **Locally**: This option requires you to set up PyTorch and TorchVision on your local machine first ([installation instructions](https://pytorch.org/get-started/locally/)). Download the notebook or copy the code into your favorite IDE.
+
+
+## How to Use this Guide
+If you're familiar with other deep learning frameworks, check out the [0. Quickstart](quickstart_tutorial.html) first
+to quickly familiarize yourself with PyTorch's API.
+
+If you're new to deep learning frameworks, head right into the first section of our step-by-step guide: [1. Tensors](tensor_tutorial.html).
+
diff --git a/docs/02-Quickstart.md b/docs/02-Quickstart.md
new file mode 100644
index 0000000..c529ed9
--- /dev/null
+++ b/docs/02-Quickstart.md
@@ -0,0 +1,385 @@
+[Learn the Basics](Introduction.html) ||
+**Quickstart** ||
+[Tensors](Tensors.html) ||
+[Datasets & DataLoaders](Data.html) ||
+[Transforms](transforms_tutorial.html) ||
+[Build Model](buildmodel_tutorial.html) ||
+[Autograd](autogradqs_tutorial.html) ||
+[Optimization](optimization_tutorial.html) ||
+[Save & Load Model](saveloadrun_tutorial.html)
+
+# Quickstart
+This section runs through the API for common tasks in machine learning. Refer to the links in each section to dive deeper.
+
+## Working with data
+PyTorch has two [primitives to work with data](https://pytorch.org/docs/stable/data.html):
+``torch.utils.data.DataLoader`` and ``torch.utils.data.Dataset``.
+``Dataset`` stores the samples and their corresponding labels, and ``DataLoader`` wraps an iterable around
+the ``Dataset``.
+
+
+
+```python
+import torch
+from torch import nn
+from torch.utils.data import DataLoader
+from torchvision import datasets
+from torchvision.transforms import ToTensor
+```
+
+PyTorch offers domain-specific libraries such as [TorchText](https://pytorch.org/text/stable/index.html),
+[TorchVision](https://pytorch.org/vision/stable/index.html), and [TorchAudio](https://pytorch.org/audio/stable/index.html),
+all of which include datasets. For this tutorial, we will be using a TorchVision dataset.
+
+The ``torchvision.datasets`` module contains ``Dataset`` objects for many real-world vision datasets such as
+CIFAR and COCO ([full list here](https://pytorch.org/vision/stable/datasets.html)). In this tutorial, we
+use the FashionMNIST dataset. Every TorchVision ``Dataset`` accepts two arguments, ``transform`` and
+``target_transform``, to modify the samples and labels respectively.
+
+
+
+
+```python
+# Download training data from open datasets.
+training_data = datasets.FashionMNIST(
+    root="data",
+    train=True,
+    download=True,
+    transform=ToTensor(),
+)
+
+# Download test data from open datasets.
+test_data = datasets.FashionMNIST(
+    root="data",
+    train=False,
+    download=True,
+    transform=ToTensor(),
+)
+```
+
+We pass the ``Dataset`` as an argument to ``DataLoader``. This wraps an iterable over our dataset, and supports
+automatic batching, sampling, shuffling and multiprocess data loading. Here we define a batch size of 64, i.e. each element
+in the dataloader iterable will return a batch of 64 features and labels.
+
+
+
+
+```python
+batch_size = 64
+
+# Create data loaders.
+train_dataloader = DataLoader(training_data, batch_size=batch_size)
+test_dataloader = DataLoader(test_data, batch_size=batch_size)
+
+for X, y in test_dataloader:
+    print(f"Shape of X [N, C, H, W]: {X.shape}")
+    print(f"Shape of y: {y.shape} {y.dtype}")
+    break
+```
+
+    Shape of X [N, C, H, W]: torch.Size([64, 1, 28, 28])
+    Shape of y: torch.Size([64]) torch.int64
+
+
+Read more about [loading data in PyTorch](data_tutorial.html).
+
+
+
+
+--------------
+
+
+
+
+## Creating Models
+To define a neural network in PyTorch, we create a class that inherits
+from [nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html). We define the layers of the network
+in the ``__init__`` function and specify how data will pass through the network in the ``forward`` function. To accelerate
+operations in the neural network, we move it to a GPU or MPS device if one is available.
+
+
+
+
+```python
+# Get cpu, gpu or mps device for training.
+device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
+print(f"Using {device} device")
+
+# Define model
+class NeuralNetwork(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.flatten = nn.Flatten()
+        self.linear_relu_stack = nn.Sequential(
+            nn.Linear(28*28, 512),
+            nn.ReLU(),
+            nn.Linear(512, 512),
+            nn.ReLU(),
+            nn.Linear(512, 10)
+        )
+
+    def forward(self, x):
+        x = self.flatten(x)
+        logits = self.linear_relu_stack(x)
+        return logits
+
+model = NeuralNetwork().to(device)
+print(model)
+```
+
+    Using mps device
+    NeuralNetwork(
+      (flatten): Flatten(start_dim=1, end_dim=-1)
+      (linear_relu_stack): Sequential(
+        (0): Linear(in_features=784, out_features=512, bias=True)
+        (1): ReLU()
+        (2): Linear(in_features=512, out_features=512, bias=True)
+        (3): ReLU()
+        (4): Linear(in_features=512, out_features=10, bias=True)
+      )
+    )
+
+
+Read more about [building neural networks in PyTorch](buildmodel_tutorial.html).
+
+
+
+
+--------------
+
+
+
+
+## Optimizing the Model Parameters
+To train a model, we need a [loss function](https://pytorch.org/docs/stable/nn.html#loss-functions)
+and an [optimizer](https://pytorch.org/docs/stable/optim.html).
+
+
+
+
+```python
+loss_fn = nn.CrossEntropyLoss()
+optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
+```
+
+In a single training loop, the model makes predictions on the training dataset (fed to it in batches), and
+backpropagates the prediction error to adjust the model's parameters.
+
+
+
+
+```python
+def train(dataloader, model, loss_fn, optimizer):
+    size = len(dataloader.dataset)
+    model.train()
+    for batch, (X, y) in enumerate(dataloader):
+        X, y = X.to(device), y.to(device)
+
+        # Compute prediction error
+        pred = model(X)
+        loss = loss_fn(pred, y)
+
+        # Backpropagation
+        optimizer.zero_grad()
+        loss.backward()
+        optimizer.step()
+
+        if batch % 100 == 0:
+            loss, current = loss.item(), batch * len(X)
+            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")
+```
+
+We also check the model's performance against the test dataset to ensure it is learning.
+
+
+
+
+```python
+def test(dataloader, model, loss_fn):
+    size = len(dataloader.dataset)
+    num_batches = len(dataloader)
+    model.eval()
+    test_loss, correct = 0, 0
+    with torch.no_grad():
+        for X, y in dataloader:
+            X, y = X.to(device), y.to(device)
+            pred = model(X)
+            test_loss += loss_fn(pred, y).item()
+            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
+    test_loss /= num_batches
+    correct /= size
+    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")
+```
+
+The training process is conducted over several iterations (*epochs*). During each epoch, the model learns
+parameters to make better predictions. We print the model's accuracy and loss at each epoch; we'd like to see the
+accuracy increase and the loss decrease with every epoch.
+
+
+
+
+```python
+epochs = 5
+for t in range(epochs):
+    print(f"Epoch {t+1}\n-------------------------------")
+    train(train_dataloader, model, loss_fn, optimizer)
+    test(test_dataloader, model, loss_fn)
+print("Done!")
+```
+
+    Epoch 1
+    -------------------------------
+    loss: 2.300704  [    0/60000]
+    loss: 2.294491  [ 6400/60000]
+    loss: 2.270792  [12800/60000]
+    loss: 2.270757  [19200/60000]
+    loss: 2.246651  [25600/60000]
+    loss: 2.223734  [32000/60000]
+    loss: 2.230299  [38400/60000]
+    loss: 2.197789  [44800/60000]
+    loss: 2.186385  [51200/60000]
+    loss: 2.171854  [57600/60000]
+    Test Error: 
+     Accuracy: 40.4%, Avg loss: 2.158354 
+    
+    Epoch 2
+    -------------------------------
+    loss: 2.157282  [    0/60000]
+    loss: 2.157837  [ 6400/60000]
+    loss: 2.098653  [12800/60000]
+    loss: 2.123712  [19200/60000]
+    loss: 2.070209  [25600/60000]
+    loss: 2.017735  [32000/60000]
+    loss: 2.044564  [38400/60000]
+    loss: 1.971302  [44800/60000]
+    loss: 1.963748  [51200/60000]
+    loss: 1.920766  [57600/60000]
+    Test Error: 
+     Accuracy: 55.5%, Avg loss: 1.902382 
+    
+    Epoch 3
+    -------------------------------
+    loss: 1.919148  [    0/60000]
+    loss: 1.903148  [ 6400/60000]
+    loss: 1.782882  [12800/60000]
+    loss: 1.834309  [19200/60000]
+    loss: 1.722989  [25600/60000]
+    loss: 1.676954  [32000/60000]
+    loss: 1.698752  [38400/60000]
+    loss: 1.602475  [44800/60000]
+    loss: 1.614792  [51200/60000]
+    loss: 1.532669  [57600/60000]
+    Test Error: 
+     Accuracy: 61.7%, Avg loss: 1.533873 
+    
+    Epoch 4
+    -------------------------------
+    loss: 1.585873  [    0/60000]
+    loss: 1.560321  [ 6400/60000]
+    loss: 1.407954  [12800/60000]
+    loss: 1.488211  [19200/60000]
+    loss: 1.364034  [25600/60000]
+    loss: 1.362447  [32000/60000]
+    loss: 1.370802  [38400/60000]
+    loss: 1.302972  [44800/60000]
+    loss: 1.327800  [51200/60000]
+    loss: 1.235748  [57600/60000]
+    Test Error: 
+     Accuracy: 63.4%, Avg loss: 1.260575 
+    
+    Epoch 5
+    -------------------------------
+    loss: 1.331637  [    0/60000]
+    loss: 1.313866  [ 6400/60000]
+    loss: 1.153163  [12800/60000]
+    loss: 1.257744  [19200/60000]
+    loss: 1.137783  [25600/60000]
+    loss: 1.162715  [32000/60000]
+    loss: 1.172138  [38400/60000]
+    loss: 1.120971  [44800/60000]
+    loss: 1.149632  [51200/60000]
+    loss: 1.069323  [57600/60000]
+    Test Error: 
+     Accuracy: 64.6%, Avg loss: 1.093657 
+    
+    Done!
+
+
+Read more about [Training your model](optimization_tutorial.html).
+
+
+
+
+--------------
+
+
+
+
+## Saving Models
+A common way to save a model is to serialize the internal state dictionary (containing the model parameters).
+
+
+
+
+```python
+torch.save(model.state_dict(), "model.pth")
+print("Saved PyTorch Model State to model.pth")
+```
+
+    Saved PyTorch Model State to model.pth
+
+
+## Loading Models
+
+The process for loading a model includes re-creating the model structure and loading
+the state dictionary into it.
+
+
+
+
+```python
+model = NeuralNetwork()
+model.load_state_dict(torch.load("model.pth"))
+```
+
+
+
+
+    <All keys matched successfully>
+
+
+
+This model can now be used to make predictions.
+
+
+
+
+```python
+classes = [
+    "T-shirt/top",
+    "Trouser",
+    "Pullover",
+    "Dress",
+    "Coat",
+    "Sandal",
+    "Shirt",
+    "Sneaker",
+    "Bag",
+    "Ankle boot",
+]
+
+model.eval()
+x, y = test_data[0][0], test_data[0][1]
+with torch.no_grad():
+    pred = model(x)
+    predicted, actual = classes[pred[0].argmax(0)], classes[y]
+    print(f'Predicted: "{predicted}", Actual: "{actual}"')
+```
+
+    Predicted: "Ankle boot", Actual: "Ankle boot"
+
+
+Read more about [Saving & Loading your model](saveloadrun_tutorial.html).
+
+
+
diff --git a/docs/03-Tensors.md b/docs/03-Tensors.md
new file mode 100644
index 0000000..d87b048
--- /dev/null
+++ b/docs/03-Tensors.md
@@ -0,0 +1,355 @@
+[Learn the Basics](intro.html) ||
+[Quickstart](quickstart_tutorial.html) ||
+**Tensors** ||
+[Datasets & DataLoaders](data_tutorial.html) ||
+[Transforms](transforms_tutorial.html) ||
+[Build Model](buildmodel_tutorial.html) ||
+[Autograd](autogradqs_tutorial.html) ||
+[Optimization](optimization_tutorial.html) ||
+[Save & Load Model](saveloadrun_tutorial.html)
+
+# Tensors
+
+Tensors are a specialized data structure that is very similar to arrays and matrices.
+In PyTorch, we use tensors to encode the inputs and outputs of a model, as well as the model’s parameters.
+
+Tensors are similar to [NumPy’s](https://numpy.org/) ndarrays, except that tensors can run on GPUs or other hardware accelerators. In fact, tensors and
+NumPy arrays can often share the same underlying memory, eliminating the need to copy data (see the Bridge with NumPy section below). Tensors
+are also optimized for automatic differentiation (we'll see more about that later in the [Autograd](autogradqs_tutorial.html)
+section). If you’re familiar with ndarrays, you’ll be right at home with the Tensor API. If not, follow along!
+
+
+
+```python
+import torch
+import numpy as np
+```
+
+## Initializing a Tensor
+
+Tensors can be initialized in various ways. Take a look at the following examples:
+
+**Directly from data**
+
+Tensors can be created directly from data. The data type is automatically inferred.
+
+
+
+
+```python
+data = [[1, 2],[3, 4]]
+x_data = torch.tensor(data)
+```
+
+**From a NumPy array**
+
+Tensors can be created from NumPy arrays (and vice versa; see the Bridge with NumPy section below).
+
+
+
+
+```python
+np_array = np.array(data)
+x_np = torch.from_numpy(np_array)
+```
+
+**From another tensor:**
+
+The new tensor retains the properties (shape, datatype) of the argument tensor, unless explicitly overridden.
+
+
+
+
+```python
+x_ones = torch.ones_like(x_data) # retains the properties of x_data
+print(f"Ones Tensor: \n {x_ones} \n")
+
+x_rand = torch.rand_like(x_data, dtype=torch.float) # overrides the datatype of x_data
+print(f"Random Tensor: \n {x_rand} \n")
+```
+
+    Ones Tensor: 
+     tensor([[1, 1],
+            [1, 1]]) 
+    
+    Random Tensor: 
+     tensor([[0.0504, 0.9505],
+            [0.6485, 0.6105]]) 
+    
+
+
+**With random or constant values:**
+
+``shape`` is a tuple of tensor dimensions. In the functions below, it determines the dimensionality of the output tensor.
+
+
+
+
+```python
+shape = (2,3,)
+rand_tensor = torch.rand(shape)
+ones_tensor = torch.ones(shape)
+zeros_tensor = torch.zeros(shape)
+
+print(f"Random Tensor: \n {rand_tensor} \n")
+print(f"Ones Tensor: \n {ones_tensor} \n")
+print(f"Zeros Tensor: \n {zeros_tensor}")
+```
+
+    Random Tensor: 
+     tensor([[0.6582, 0.2838, 0.1244],
+            [0.1692, 0.0394, 0.2638]]) 
+    
+    Ones Tensor: 
+     tensor([[1., 1., 1.],
+            [1., 1., 1.]]) 
+    
+    Zeros Tensor: 
+     tensor([[0., 0., 0.],
+            [0., 0., 0.]])
+
+
+--------------
+
+
+
+
+## Attributes of a Tensor
+
+Tensor attributes describe their shape, datatype, and the device on which they are stored.
+
+
+
+
+```python
+tensor = torch.rand(3,4)
+
+print(f"Shape of tensor: {tensor.shape}")
+print(f"Datatype of tensor: {tensor.dtype}")
+print(f"Device tensor is stored on: {tensor.device}")
+```
+
+    Shape of tensor: torch.Size([3, 4])
+    Datatype of tensor: torch.float32
+    Device tensor is stored on: cpu
+
+
+--------------
+
+
+
+
+## Operations on Tensors
+
+Over 100 tensor operations, including arithmetic, linear algebra, matrix manipulation (transposing,
+indexing, slicing), sampling and more are
+comprehensively described [here](https://pytorch.org/docs/stable/torch.html).
+
+Each of these operations can be run on the GPU (at typically higher speeds than on a
+CPU). If you’re using Colab, allocate a GPU by going to Runtime > Change runtime type > GPU.
+
+By default, tensors are created on the CPU. We need to explicitly move tensors to the GPU using the
+``.to`` method (after checking for GPU availability). Keep in mind that copying large tensors
+across devices can be expensive in terms of time and memory!
+
+
+
+
+```python
+# We move our tensor to the GPU if available
+if torch.cuda.is_available():
+    tensor = tensor.to("cuda")
+```
+
+Try out some of the operations from the list.
+If you're familiar with the NumPy API, you'll find the Tensor API a breeze to use.
+
+
+
+
+**Standard numpy-like indexing and slicing:**
+
+
+
+
+```python
+tensor = torch.ones(4, 4)
+print(f"First row: {tensor[0]}")
+print(f"First column: {tensor[:, 0]}")
+print(f"Last column: {tensor[..., -1]}")
+tensor[:,1] = 0
+print(tensor)
+```
+
+    First row: tensor([1., 1., 1., 1.])
+    First column: tensor([1., 1., 1., 1.])
+    Last column: tensor([1., 1., 1., 1.])
+    tensor([[1., 0., 1., 1.],
+            [1., 0., 1., 1.],
+            [1., 0., 1., 1.],
+            [1., 0., 1., 1.]])
+
+
+**Joining tensors** You can use ``torch.cat`` to concatenate a sequence of tensors along a given dimension.
+See also [torch.stack](https://pytorch.org/docs/stable/generated/torch.stack.html),
+another tensor joining op that is subtly different from ``torch.cat``.
+
+
+
+
+```python
+t1 = torch.cat([tensor, tensor, tensor], dim=1)
+print(t1)
+```
+
+    tensor([[1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1.],
+            [1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1.],
+            [1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1.],
+            [1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1.]])
+
+
+**Arithmetic operations**
+
+
+
+
+```python
+# This computes the matrix multiplication between two tensors. y1, y2, y3 will have the same value
+# ``tensor.T`` returns the transpose of a tensor
+y1 = tensor @ tensor.T
+y2 = tensor.matmul(tensor.T)
+
+y3 = torch.rand_like(y1)
+torch.matmul(tensor, tensor.T, out=y3)
+
+
+# This computes the element-wise product. z1, z2, z3 will have the same value
+z1 = tensor * tensor
+z2 = tensor.mul(tensor)
+
+z3 = torch.rand_like(tensor)
+torch.mul(tensor, tensor, out=z3)
+```
+
+
+
+
+    tensor([[1., 0., 1., 1.],
+            [1., 0., 1., 1.],
+            [1., 0., 1., 1.],
+            [1., 0., 1., 1.]])
+
+
+
+**Single-element tensors** If you have a one-element tensor, for example by aggregating all
+values of a tensor into one value, you can convert it to a Python
+numerical value using ``item()``:
+
+
+
+
+```python
+agg = tensor.sum()
+agg_item = agg.item()
+print(agg_item, type(agg_item))
+```
+
+    12.0 <class 'float'>
+
+
+**In-place operations**
+Operations that store the result into the operand are called in-place. They are denoted by a ``_`` suffix.
+For example, ``x.copy_(y)`` and ``x.t_()`` will change ``x``.
+
+
+
+
+```python
+print(f"{tensor} \n")
+tensor.add_(5)
+print(tensor)
+```
+
+    tensor([[1., 0., 1., 1.],
+            [1., 0., 1., 1.],
+            [1., 0., 1., 1.],
+            [1., 0., 1., 1.]]) 
+    
+    tensor([[6., 5., 6., 6.],
+            [6., 5., 6., 6.],
+            [6., 5., 6., 6.],
+            [6., 5., 6., 6.]])
+
+
+<div class="alert alert-info"><h4>Note</h4><p>In-place operations save some memory, but can be problematic when computing derivatives because of an immediate loss
+     of history. Hence, their use is discouraged.</p></div>
+
+
+
+--------------
+
+
+
+
+
+## Bridge with NumPy
+Tensors on the CPU and NumPy arrays can share their underlying memory
+locations, and changing one will change the other.
+
+
+
+### Tensor to NumPy array
+
+
+
+
+```python
+t = torch.ones(5)
+print(f"t: {t}")
+n = t.numpy()
+print(f"n: {n}")
+```
+
+    t: tensor([1., 1., 1., 1., 1.])
+    n: [1. 1. 1. 1. 1.]
+
+
+A change in the tensor is reflected in the NumPy array.
+
+
+
+
+```python
+t.add_(1)
+print(f"t: {t}")
+print(f"n: {n}")
+```
+
+    t: tensor([2., 2., 2., 2., 2.])
+    n: [2. 2. 2. 2. 2.]
+
+
+### NumPy array to Tensor
+
+
+
+
+```python
+n = np.ones(5)
+t = torch.from_numpy(n)
+```
+
+Changes in the NumPy array are reflected in the tensor.
+
+
+
+
+```python
+np.add(n, 1, out=n)
+print(f"t: {t}")
+print(f"n: {n}")
+```
+
+    t: tensor([2., 2., 2., 2., 2.], dtype=torch.float64)
+    n: [2. 2. 2. 2. 2.]
+
diff --git a/docs/04-Data.md b/docs/04-Data.md
new file mode 100644
index 0000000..8cd12e9
--- /dev/null
+++ b/docs/04-Data.md
@@ -0,0 +1,286 @@
+```python
+%matplotlib inline
+```
+
+
+[Learn the Basics](intro.html) ||
+[Quickstart](quickstart_tutorial.html) ||
+[Tensors](tensorqs_tutorial.html) ||
+**Datasets & DataLoaders** ||
+[Transforms](transforms_tutorial.html) ||
+[Build Model](buildmodel_tutorial.html) ||
+[Autograd](autogradqs_tutorial.html) ||
+[Optimization](optimization_tutorial.html) ||
+[Save & Load Model](saveloadrun_tutorial.html)
+
+# Datasets & DataLoaders
+
+
+Code for processing data samples can get messy and hard to maintain; we ideally want our dataset code
+to be decoupled from our model training code for better readability and modularity.
+PyTorch provides two data primitives: ``torch.utils.data.DataLoader`` and ``torch.utils.data.Dataset``
+that allow you to use pre-loaded datasets as well as your own data.
+``Dataset`` stores the samples and their corresponding labels, and ``DataLoader`` wraps an iterable around
+the ``Dataset`` to enable easy access to the samples.
+
+PyTorch domain libraries provide a number of pre-loaded datasets (such as FashionMNIST) that
+subclass ``torch.utils.data.Dataset`` and implement functions specific to the particular data.
+They can be used to prototype and benchmark your model. You can find them
+here: [Image Datasets](https://pytorch.org/vision/stable/datasets.html),
+[Text Datasets](https://pytorch.org/text/stable/datasets.html), and
+[Audio Datasets](https://pytorch.org/audio/stable/datasets.html).
+
+
+
+
+## Loading a Dataset
+
+Here is an example of how to load the [Fashion-MNIST](https://research.zalando.com/project/fashion_mnist/fashion_mnist/) dataset from TorchVision.
+Fashion-MNIST is a dataset of Zalando’s article images consisting of 60,000 training examples and 10,000 test examples.
+Each example comprises a 28×28 grayscale image and an associated label from one of 10 classes.
+
+We load the [FashionMNIST Dataset](https://pytorch.org/vision/stable/datasets.html#fashion-mnist) with the following parameters:
+ - ``root`` is the path where the train/test data is stored,
+ - ``train`` specifies training or test dataset,
+ - ``download=True`` downloads the data from the internet if it's not available at ``root``,
+ - ``transform`` and ``target_transform`` specify the feature and label transformations.
+
+
+
+
+```python
+import torch
+from torch.utils.data import Dataset
+from torchvision import datasets
+from torchvision.transforms import ToTensor
+import matplotlib.pyplot as plt
+
+
+training_data = datasets.FashionMNIST(
+    root="data",
+    train=True,
+    download=True,
+    transform=ToTensor()
+)
+
+test_data = datasets.FashionMNIST(
+    root="data",
+    train=False,
+    download=True,
+    transform=ToTensor()
+)
+```
+
+## Iterating and Visualizing the Dataset
+
+We can index ``Datasets`` manually like a list: ``training_data[index]``.
+We use ``matplotlib`` to visualize some samples in our training data.
+
+
+
+
+```python
+labels_map = {
+    0: "T-Shirt",
+    1: "Trouser",
+    2: "Pullover",
+    3: "Dress",
+    4: "Coat",
+    5: "Sandal",
+    6: "Shirt",
+    7: "Sneaker",
+    8: "Bag",
+    9: "Ankle Boot",
+}
+figure = plt.figure(figsize=(8, 8))
+cols, rows = 3, 3
+for i in range(1, cols * rows + 1):
+    sample_idx = torch.randint(len(training_data), size=(1,)).item()
+    img, label = training_data[sample_idx]
+    figure.add_subplot(rows, cols, i)
+    plt.title(labels_map[label])
+    plt.axis("off")
+    plt.imshow(img.squeeze(), cmap="gray")
+plt.show()
+```
+
+
+    
+![png](../docs/04-Data_files/../docs/04-Data_6_0.png)
+    
+
+
+
+
+
+
+
+
+--------------
+
+
+
+
+## Creating a Custom Dataset for your files
+
+A custom Dataset class must implement three functions: `__init__`, `__len__`, and `__getitem__`.
+Take a look at this implementation; the FashionMNIST images are stored
+in a directory ``img_dir``, and their labels are stored separately in a CSV file ``annotations_file``.
+
+In the next sections, we'll break down what's happening in each of these functions.
+
+
+
+
+```python
+import os
+import pandas as pd
+from torchvision.io import read_image
+
+class CustomImageDataset(Dataset):
+    def __init__(self, annotations_file, img_dir, transform=None, target_transform=None):
+        self.img_labels = pd.read_csv(annotations_file)
+        self.img_dir = img_dir
+        self.transform = transform
+        self.target_transform = target_transform
+
+    def __len__(self):
+        return len(self.img_labels)
+
+    def __getitem__(self, idx):
+        img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
+        image = read_image(img_path)
+        label = self.img_labels.iloc[idx, 1]
+        if self.transform:
+            image = self.transform(image)
+        if self.target_transform:
+            label = self.target_transform(label)
+        return image, label
+```
+
+### `__init__`
+
+The `__init__` function is run once when instantiating the Dataset object. We initialize
+the directory containing the images, the annotations file, and both transforms (covered
+in more detail in the next section).
+
+The labels.csv file looks like:
+
+    tshirt1.jpg, 0
+    tshirt2.jpg, 0
+    ......
+    ankleboot999.jpg, 9
+
+
+
+
+```python
+def __init__(self, annotations_file, img_dir, transform=None, target_transform=None):
+    self.img_labels = pd.read_csv(annotations_file)
+    self.img_dir = img_dir
+    self.transform = transform
+    self.target_transform = target_transform
+```
+
+### `__len__`
+
+The `__len__` function returns the number of samples in our dataset.
+
+Example:
+
+
+
+
+```python
+def __len__(self):
+    return len(self.img_labels)
+```
+
+### `__getitem__`
+
+The `__getitem__` function loads and returns a sample from the dataset at the given index ``idx``.
+Based on the index, it identifies the image's location on disk, converts that to a tensor using ``read_image``, retrieves the
+corresponding label from the csv data in ``self.img_labels``, calls the transform functions on them (if applicable), and returns the
+tensor image and corresponding label in a tuple.
+
+
+
+
+```python
+def __getitem__(self, idx):
+    img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
+    image = read_image(img_path)
+    label = self.img_labels.iloc[idx, 1]
+    if self.transform:
+        image = self.transform(image)
+    if self.target_transform:
+        label = self.target_transform(label)
+    return image, label
+```
+
+--------------
+
+
+
+
+## Preparing your data for training with DataLoaders
+The ``Dataset`` retrieves our dataset's features and labels one sample at a time. While training a model, we typically want to
+pass samples in "minibatches", reshuffle the data at every epoch to reduce model overfitting, and use Python's ``multiprocessing`` to
+speed up data retrieval.
+
+``DataLoader`` is an iterable that abstracts this complexity for us in an easy API.
+
+
+
+
+```python
+from torch.utils.data import DataLoader
+
+train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
+test_dataloader = DataLoader(test_data, batch_size=64, shuffle=True)
+```
+
+## Iterate through the DataLoader
+
+We have loaded that dataset into the ``DataLoader`` and can iterate through the dataset as needed.
+Each iteration below returns a batch of ``train_features`` and ``train_labels`` (containing ``batch_size=64`` features and labels respectively).
+Because we specified ``shuffle=True``, after we iterate over all batches the data is shuffled (for finer-grained control over
+the data loading order, take a look at [Samplers](https://pytorch.org/docs/stable/data.html#data-loading-order-and-sampler)).
+
+
+
+
+```python
+# Display image and label.
+train_features, train_labels = next(iter(train_dataloader))
+print(f"Feature batch shape: {train_features.size()}")
+print(f"Labels batch shape: {train_labels.size()}")
+img = train_features[0].squeeze()
+label = train_labels[0]
+plt.imshow(img, cmap="gray")
+plt.show()
+print(f"Label: {label}")
+```
+
+    Feature batch shape: torch.Size([64, 1, 28, 28])
+    Labels batch shape: torch.Size([64])
+
+
+
+    
+![png](../docs/04-Data_files/../docs/04-Data_21_1.png)
+    
+
+
+    Label: 1
+
+
+--------------
+
+
+
+
+## Further Reading
+- [torch.utils.data API](https://pytorch.org/docs/stable/data.html)
+
+
diff --git a/docs/05-Transforms.md b/docs/05-Transforms.md
new file mode 100644
index 0000000..20d50be
--- /dev/null
+++ b/docs/05-Transforms.md
@@ -0,0 +1,77 @@
+[Learn the Basics](intro.html) ||
+[Quickstart](quickstart_tutorial.html) ||
+[Tensors](tensorqs_tutorial.html) ||
+[Datasets & DataLoaders](data_tutorial.html) ||
+**Transforms** ||
+[Build Model](buildmodel_tutorial.html) ||
+[Autograd](autogradqs_tutorial.html) ||
+[Optimization](optimization_tutorial.html) ||
+[Save & Load Model](saveloadrun_tutorial.html)
+
+# Transforms
+
+Data does not always come in the final processed form required for
+training machine learning algorithms. We use **transforms** to perform some
+manipulation of the data and make it suitable for training.
+
+All TorchVision datasets have two parameters - ``transform`` to modify the features and
+``target_transform`` to modify the labels - that accept callables containing the transformation logic.
+The [torchvision.transforms](https://pytorch.org/vision/stable/transforms.html) module offers
+several commonly-used transforms out of the box.
+
+The FashionMNIST features are in PIL Image format, and the labels are integers.
+For training, we need the features as normalized tensors, and the labels as one-hot encoded tensors.
+To make these transformations, we use ``ToTensor`` and ``Lambda``.
+
+
+
+```python
+%matplotlib inline
+
+import torch
+from torchvision import datasets
+from torchvision.transforms import ToTensor, Lambda
+
+ds = datasets.FashionMNIST(
+    root="data",
+    train=True,
+    download=True,
+    transform=ToTensor(),
+    target_transform=Lambda(lambda y: torch.zeros(10, dtype=torch.float).scatter_(0, torch.tensor(y), value=1))
+)
+```
+
+## ToTensor()
+
+[ToTensor](https://pytorch.org/vision/stable/transforms.html#torchvision.transforms.ToTensor)
+converts a PIL image or NumPy ``ndarray`` into a ``FloatTensor`` and scales
+the image's pixel intensity values to the range [0., 1.].
+
+
+
+
+## Lambda Transforms
+
+Lambda transforms apply any user-defined lambda function. Here, we define a function
+to turn the integer label into a one-hot encoded tensor.
+It first creates a zero tensor of size 10 (the number of labels in our dataset) and calls
+[scatter_](https://pytorch.org/docs/stable/generated/torch.Tensor.scatter_.html), which assigns
+``value=1`` at the index given by the label ``y``.
+
+
+
+
+```python
+target_transform = Lambda(lambda y: torch.zeros(
+    10, dtype=torch.float).scatter_(dim=0, index=torch.tensor(y), value=1))
+```
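+
+To see what this transform produces, here is a minimal sketch that runs the same ``scatter_`` call
+on a single label and compares it with ``torch.nn.functional.one_hot``. The label value ``3`` is an
+arbitrary choice for illustration, and the comparison with ``one_hot`` is an addition, not part of
+the tutorial's pipeline.
+
+```python
+import torch
+import torch.nn.functional as F
+
+label = 3  # hypothetical class index, chosen only for illustration
+
+# Same construction as the Lambda transform above
+one_hot_scatter = torch.zeros(10, dtype=torch.float).scatter_(
+    dim=0, index=torch.tensor(label), value=1
+)
+
+# Built-in alternative; it returns integers, so we cast to float to compare
+one_hot_builtin = F.one_hot(torch.tensor(label), num_classes=10).float()
+
+print(one_hot_scatter)
+print(torch.equal(one_hot_scatter, one_hot_builtin))
+```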
+
+--------------
+
+
+
+
+### Further Reading
+- [torchvision.transforms API](https://pytorch.org/vision/stable/transforms.html)
+
+
diff --git a/docs/06-BuildModel.md b/docs/06-BuildModel.md
new file mode 100644
index 0000000..92a7fca
--- /dev/null
+++ b/docs/06-BuildModel.md
@@ -0,0 +1,312 @@
+[Learn the Basics](intro.html) ||
+[Quickstart](quickstart_tutorial.html) ||
+[Tensors](tensorqs_tutorial.html) ||
+[Datasets & DataLoaders](data_tutorial.html) ||
+[Transforms](transforms_tutorial.html) ||
+**Build Model** ||
+[Autograd](autogradqs_tutorial.html) ||
+[Optimization](optimization_tutorial.html) ||
+[Save & Load Model](saveloadrun_tutorial.html)
+
+# Build the Neural Network
+
+Neural networks consist of layers/modules that perform operations on data.
+The [torch.nn](https://pytorch.org/docs/stable/nn.html) namespace provides all the building blocks you need to
+build your own neural network. Every module in PyTorch subclasses [nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html).
+A neural network is itself a module that consists of other modules (layers). This nested structure allows for
+building and managing complex architectures easily.
+
+In the following sections, we'll build a neural network to classify images in the FashionMNIST dataset.
+
+
+
+```python
+import os
+import torch
+from torch import nn
+from torch.utils.data import DataLoader
+from torchvision import datasets, transforms
+```
+
+## Get Device for Training
+We want to be able to train our model on a hardware accelerator like the GPU,
+if it is available. Let's check whether
+[torch.cuda](https://pytorch.org/docs/stable/notes/cuda.html) is available; otherwise we
+continue to use the CPU.
+
+
+
+
+```python
+device = "cuda" if torch.cuda.is_available() else "cpu"
+print(f"Using {device} device")
+```
+
+    Using cpu device
+
+
+## Define the Class
+We define our neural network by subclassing ``nn.Module``, and
+initialize the neural network layers in ``__init__``. Every ``nn.Module`` subclass implements
+the operations on input data in the ``forward`` method.
+
+
+
+
+```python
+class NeuralNetwork(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.flatten = nn.Flatten()
+        self.linear_relu_stack = nn.Sequential(
+            nn.Linear(28*28, 512),
+            nn.ReLU(),
+            nn.Linear(512, 512),
+            nn.ReLU(),
+            nn.Linear(512, 10),
+        )
+
+    def forward(self, x):
+        x = self.flatten(x)
+        logits = self.linear_relu_stack(x)
+        return logits
+```
+
+We create an instance of ``NeuralNetwork``, move it to the ``device``, and print
+its structure.
+
+
+
+
+```python
+model = NeuralNetwork().to(device)
+print(model)
+```
+
+    NeuralNetwork(
+      (flatten): Flatten(start_dim=1, end_dim=-1)
+      (linear_relu_stack): Sequential(
+        (0): Linear(in_features=784, out_features=512, bias=True)
+        (1): ReLU()
+        (2): Linear(in_features=512, out_features=512, bias=True)
+        (3): ReLU()
+        (4): Linear(in_features=512, out_features=10, bias=True)
+      )
+    )
+
+
+To use the model, we pass it the input data. This executes the model's ``forward``,
+along with some [background operations](https://github.com/pytorch/pytorch/blob/270111b7b611d174967ed204776985cefca9c144/torch/nn/modules/module.py#L866).
+Do not call ``model.forward()`` directly!
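+
+One of those background operations is running registered hooks. The sketch below is a minimal
+illustration, using a standalone ``nn.Linear`` layer and a hypothetical ``calls`` list rather than
+the tutorial's model: the forward hook fires when the module is called, but not when ``forward``
+is invoked directly.
+
+```python
+import torch
+from torch import nn
+
+layer = nn.Linear(4, 2)
+calls = []  # hypothetical log of output shapes seen by the hook
+
+# The hook returns None (list.append returns None), so the output is left unchanged
+handle = layer.register_forward_hook(
+    lambda module, inputs, output: calls.append(tuple(output.shape))
+)
+
+x = torch.rand(3, 4)
+layer(x)          # goes through __call__, so the hook runs
+layer.forward(x)  # bypasses __call__, so the hook is skipped
+
+print(calls)
+handle.remove()
+```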
+
+Calling the model on the input returns a 2-dimensional tensor: dim=0 indexes the samples in the batch, and dim=1 holds the 10 raw predicted values (one per class) for each sample.
+We get the prediction probabilities by passing it through an instance of the ``nn.Softmax`` module.
+
+
+
+
+```python
+X = torch.rand(1, 28, 28, device=device)
+logits = model(X)
+pred_probab = nn.Softmax(dim=1)(logits)
+y_pred = pred_probab.argmax(1)
+print(f"Predicted class: {y_pred}")
+```
+
+    Predicted class: tensor([9])
+
+
+--------------
+
+
+
+
+## Model Layers
+
+Let's break down the layers in the FashionMNIST model. To illustrate it, we
+will take a sample minibatch of 3 images of size 28x28 and see what happens to it as
+we pass it through the network.
+
+
+
+
+```python
+input_image = torch.rand(3,28,28)
+print(input_image.size())
+```
+
+    torch.Size([3, 28, 28])
+
+
+### nn.Flatten
+We initialize the [nn.Flatten](https://pytorch.org/docs/stable/generated/torch.nn.Flatten.html)
+layer to convert each 2D 28x28 image into a contiguous array of 784 pixel values
+(the minibatch dimension, at dim=0, is maintained).
+
+
+
+
+```python
+flatten = nn.Flatten()
+flat_image = flatten(input_image)
+print(flat_image.size())
+```
+
+    torch.Size([3, 784])
+
+
+### nn.Linear
+The [linear layer](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html)
+is a module that applies a linear transformation on the input using its stored weights and biases.
+
+
+
+
+
+```python
+layer1 = nn.Linear(in_features=28*28, out_features=20)
+hidden1 = layer1(flat_image)
+print(hidden1.size())
+```
+
+    torch.Size([3, 20])
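+
+To make "linear transformation" concrete, the following sketch recomputes the layer's output by
+hand as ``x @ W.T + b`` and compares it with what the module returns. It recreates ``flat_image``
+and ``layer1`` with random values so that it runs on its own; the name ``manual`` is introduced
+here only for the comparison.
+
+```python
+import torch
+from torch import nn
+
+torch.manual_seed(0)
+flat_image = torch.rand(3, 28 * 28)
+layer1 = nn.Linear(in_features=28 * 28, out_features=20)
+
+hidden1 = layer1(flat_image)
+
+# weight has shape (out_features, in_features), hence the transpose
+manual = flat_image @ layer1.weight.T + layer1.bias
+
+print(layer1.weight.shape, layer1.bias.shape)
+print(torch.allclose(hidden1, manual, atol=1e-5))
+```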
+
+
+### nn.ReLU
+Non-linear activations are what create the complex mappings between the model's inputs and outputs.
+They are applied after linear transformations to introduce *nonlinearity*, helping neural networks
+learn a wide variety of phenomena.
+
+In this model, we use [nn.ReLU](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html) between our
+linear layers, but there are other activations that introduce non-linearity in your model.
+
+
+
+
+```python
+print(f"Before ReLU: {hidden1}\n\n")
+hidden1 = nn.ReLU()(hidden1)
+print(f"After ReLU: {hidden1}")
+```
+
+    Before ReLU: tensor([[-5.5712e-01,  4.1135e-01, -7.4510e-03, -5.4891e-02,  7.3538e-02,
+              4.6617e-01,  5.3287e-01,  7.2283e-02, -3.7471e-01, -3.9285e-01,
+             -6.7889e-01,  2.1088e-01,  1.8742e-01,  4.0150e-01, -5.6422e-02,
+             -4.8977e-02, -1.6230e-01,  3.0556e-01, -7.1455e-01, -6.6180e-02],
+            [-4.2601e-01,  6.2487e-01, -5.9415e-02,  2.3934e-02,  3.9810e-01,
+              3.2441e-01,  7.0026e-01, -1.2423e-01, -5.2260e-01, -1.7234e-01,
+             -5.5835e-01,  2.2128e-01,  2.7830e-01,  2.4191e-01, -7.7681e-02,
+             -2.4954e-01,  1.5836e-01,  1.9990e-01, -1.1715e-01, -3.2138e-01],
+            [-4.9225e-01,  4.1050e-01, -1.5492e-01,  8.9106e-03,  3.5985e-01,
+              3.1355e-01,  6.2615e-01, -1.9053e-04, -5.7080e-01, -1.7064e-01,
+             -6.5802e-01,  3.3700e-01,  4.5726e-01,  3.1022e-01, -4.0316e-01,
+             -3.8029e-01, -1.2243e-01,  3.6732e-01, -5.6789e-01, -9.4490e-02]],
+           grad_fn=<AddmmBackward0>)
+    
+    
+    After ReLU: tensor([[0.0000, 0.4113, 0.0000, 0.0000, 0.0735, 0.4662, 0.5329, 0.0723, 0.0000,
+             0.0000, 0.0000, 0.2109, 0.1874, 0.4015, 0.0000, 0.0000, 0.0000, 0.3056,
+             0.0000, 0.0000],
+            [0.0000, 0.6249, 0.0000, 0.0239, 0.3981, 0.3244, 0.7003, 0.0000, 0.0000,
+             0.0000, 0.0000, 0.2213, 0.2783, 0.2419, 0.0000, 0.0000, 0.1584, 0.1999,
+             0.0000, 0.0000],
+            [0.0000, 0.4105, 0.0000, 0.0089, 0.3599, 0.3136, 0.6262, 0.0000, 0.0000,
+             0.0000, 0.0000, 0.3370, 0.4573, 0.3102, 0.0000, 0.0000, 0.0000, 0.3673,
+             0.0000, 0.0000]], grad_fn=<ReluBackward0>)
+
+
+### nn.Sequential
+[nn.Sequential](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html) is an ordered
+container of modules. The data is passed through all the modules in the same order as defined. You can use
+sequential containers to put together a quick network like ``seq_modules``.
+
+
+
+
+```python
+seq_modules = nn.Sequential(
+    flatten,
+    layer1,
+    nn.ReLU(),
+    nn.Linear(20, 10)
+)
+input_image = torch.rand(3,28,28)
+logits = seq_modules(input_image)
+```
+
+### nn.Softmax
+The last linear layer of the neural network returns `logits`, raw values in $(-\infty, \infty)$, which are passed to the
+[nn.Softmax](https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html) module. The logits are scaled to values
+in [0, 1] representing the model's predicted probabilities for each class. The ``dim`` parameter indicates the dimension along
+which the values must sum to 1.
+
+
+
+
+```python
+softmax = nn.Softmax(dim=1)
+pred_probab = softmax(logits)
+```
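+
+As a quick sanity check on the role of ``dim``, the sketch below applies softmax to random logits
+(standing in for the network output) and confirms that each row sums to 1. The hand-written
+``manual`` formula is included only for comparison and is not numerically stabilized.
+
+```python
+import torch
+from torch import nn
+
+torch.manual_seed(0)
+logits = torch.randn(3, 10)  # stand-in for a batch of 3 outputs over 10 classes
+
+pred_probab = nn.Softmax(dim=1)(logits)
+row_sums = pred_probab.sum(dim=1)
+
+# Softmax written out: exponentiate, then normalize along the class dimension
+manual = torch.exp(logits) / torch.exp(logits).sum(dim=1, keepdim=True)
+
+print(row_sums)
+print(torch.allclose(pred_probab, manual))
+```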
+
+## Model Parameters
+Many layers inside a neural network are *parameterized*, i.e. have associated weights
+and biases that are optimized during training. Subclassing ``nn.Module`` automatically
+tracks all fields defined inside your model object, and makes all parameters
+accessible using your model's ``parameters()`` or ``named_parameters()`` methods.
+
+In this example, we iterate over each parameter, and print its size and a preview of its values.
+
+
+
+
+
+```python
+print(f"Model structure: {model}\n\n")
+
+for name, param in model.named_parameters():
+    print(f"Layer: {name} | Size: {param.size()} | Values : {param[:2]} \n")
+```
+
+    Model structure: NeuralNetwork(
+      (flatten): Flatten(start_dim=1, end_dim=-1)
+      (linear_relu_stack): Sequential(
+        (0): Linear(in_features=784, out_features=512, bias=True)
+        (1): ReLU()
+        (2): Linear(in_features=512, out_features=512, bias=True)
+        (3): ReLU()
+        (4): Linear(in_features=512, out_features=10, bias=True)
+      )
+    )
+    
+    
+    Layer: linear_relu_stack.0.weight | Size: torch.Size([512, 784]) | Values : tensor([[ 0.0211,  0.0168,  0.0334,  ..., -0.0151, -0.0033,  0.0032],
+            [-0.0022,  0.0293, -0.0090,  ..., -0.0044, -0.0147, -0.0251]],
+           grad_fn=<SliceBackward0>) 
+    
+    Layer: linear_relu_stack.0.bias | Size: torch.Size([512]) | Values : tensor([0.0128, 0.0086], grad_fn=<SliceBackward0>) 
+    
+    Layer: linear_relu_stack.2.weight | Size: torch.Size([512, 512]) | Values : tensor([[-0.0165, -0.0068, -0.0016,  ..., -0.0098,  0.0119,  0.0326],
+            [ 0.0330, -0.0306, -0.0129,  ..., -0.0371, -0.0291, -0.0273]],
+           grad_fn=<SliceBackward0>) 
+    
+    Layer: linear_relu_stack.2.bias | Size: torch.Size([512]) | Values : tensor([ 0.0024, -0.0164], grad_fn=<SliceBackward0>) 
+    
+    Layer: linear_relu_stack.4.weight | Size: torch.Size([10, 512]) | Values : tensor([[ 0.0046,  0.0249,  0.0123,  ...,  0.0352, -0.0170,  0.0232],
+            [ 0.0038,  0.0283,  0.0235,  ..., -0.0416,  0.0304,  0.0217]],
+           grad_fn=<SliceBackward0>) 
+    
+    Layer: linear_relu_stack.4.bias | Size: torch.Size([10]) | Values : tensor([0.0118, 0.0417], grad_fn=<SliceBackward0>) 
+    
+
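+A common use of ``parameters()`` is counting how many values the model learns. The sketch below
+rebuilds the same layer stack with ``nn.Sequential`` so that it runs on its own; the names
+``total`` and ``trainable`` are introduced here for illustration.
+
+```python
+from torch import nn
+
+# Same layers as NeuralNetwork, written as a plain Sequential for brevity
+model = nn.Sequential(
+    nn.Flatten(),
+    nn.Linear(28 * 28, 512),
+    nn.ReLU(),
+    nn.Linear(512, 512),
+    nn.ReLU(),
+    nn.Linear(512, 10),
+)
+
+total = sum(p.numel() for p in model.parameters())
+trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
+
+# (784*512 + 512) + (512*512 + 512) + (512*10 + 10) = 669706
+print(total, trainable)
+```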
+
+--------------
+
+
+
+
+## Further Reading
+- [torch.nn API](https://pytorch.org/docs/stable/nn.html)
+
+
diff --git a/docs/07-Autograd.md b/docs/07-Autograd.md
new file mode 100644
index 0000000..f8e1eee
--- /dev/null
+++ b/docs/07-Autograd.md
@@ -0,0 +1,294 @@
+[Learn the Basics](intro.html) ||
+[Quickstart](quickstart_tutorial.html) ||
+[Tensors](tensorqs_tutorial.html) ||
+[Datasets & DataLoaders](data_tutorial.html) ||
+[Transforms](transforms_tutorial.html) ||
+[Build Model](buildmodel_tutorial.html) ||
+**Autograd** ||
+[Optimization](optimization_tutorial.html) ||
+[Save & Load Model](saveloadrun_tutorial.html)
+
+# Automatic Differentiation with ``torch.autograd``
+
+When training neural networks, the most frequently used algorithm is
+**back propagation**. In this algorithm, parameters (model weights) are
+adjusted according to the **gradient** of the loss function with respect
+to the given parameter.
+
+To compute those gradients, PyTorch has a built-in differentiation engine
+called ``torch.autograd``. It supports automatic computation of gradients for any
+computational graph.
+
+Consider the simplest one-layer neural network, with input ``x``,
+parameters ``w`` and ``b``, and some loss function. It can be defined in
+PyTorch in the following manner:
+
+
+
+```python
+import torch
+
+x = torch.ones(5)  # input tensor
+y = torch.zeros(3)  # expected output
+w = torch.randn(5, 3, requires_grad=True)
+b = torch.randn(3, requires_grad=True)
+z = torch.matmul(x, w)+b
+loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
+```
+
+## Tensors, Functions and Computational Graph
+
+This code defines the following **computational graph**:
+
+![comp-graph](/_static/img/basics/comp-graph.png)
+
+In this network, ``w`` and ``b`` are **parameters**, which we need to
+optimize. Thus, we need to be able to compute the gradients of the loss
+function with respect to those variables. To do that, we set
+the ``requires_grad`` property of those tensors.
+
+
+
+<div class="alert alert-info"><h4>Note</h4><p>You can set the value of ``requires_grad`` when creating a
+          tensor, or later by using the ``x.requires_grad_(True)`` method.</p></div>
+
+
+
+A function that we apply to tensors to construct a computational graph is
+in fact an object of class ``Function``. This object knows how to
+compute the function in the *forward* direction, and also how to compute
+its derivative during the *backward propagation* step. A reference to
+the backward propagation function is stored in the ``grad_fn`` property of a
+tensor. You can find more information about ``Function`` [in the
+documentation](https://pytorch.org/docs/stable/autograd.html#function).
+
+
+
+
+
+```python
+print(f"Gradient function for z = {z.grad_fn}")
+print(f"Gradient function for loss = {loss.grad_fn}")
+```
+
+    Gradient function for z = <AddBackward0 object at 0x10fa1ee80>
+    Gradient function for loss = <BinaryCrossEntropyWithLogitsBackward0 object at 0x10fa1e430>
+
+
+## Computing Gradients
+
+To optimize the parameters of the neural network, we need to
+compute the derivatives of our loss function with respect to those parameters,
+namely, we need $\frac{\partial loss}{\partial w}$ and
+$\frac{\partial loss}{\partial b}$ under some fixed values of
+``x`` and ``y``. To compute those derivatives, we call
+``loss.backward()``, and then retrieve the values from ``w.grad`` and
+``b.grad``:
+
+
+
+
+
+```python
+loss.backward()
+print(w.grad)
+print(b.grad)
+```
+
+    tensor([[0.3244, 0.2353, 0.0700],
+            [0.3244, 0.2353, 0.0700],
+            [0.3244, 0.2353, 0.0700],
+            [0.3244, 0.2353, 0.0700],
+            [0.3244, 0.2353, 0.0700]])
+    tensor([0.3244, 0.2353, 0.0700])
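+
+These numbers can be cross-checked by hand. The sketch below assumes the default ``mean``
+reduction of ``binary_cross_entropy_with_logits``, for which the derivative of the loss with
+respect to each logit is $(\sigma(z_i) - y_i)/3$; the names ``dz`` and ``expected_w_grad`` are
+introduced here for illustration. Because ``x`` is all ones, every row of ``w.grad`` equals
+``b.grad``, which matches the pattern in the printed output.
+
+```python
+import torch
+
+x = torch.ones(5)   # input tensor
+y = torch.zeros(3)  # expected output
+w = torch.randn(5, 3, requires_grad=True)
+b = torch.randn(3, requires_grad=True)
+z = torch.matmul(x, w) + b
+loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
+loss.backward()
+
+# d(loss)/dz for mean-reduced BCE-with-logits: (sigmoid(z) - y) / number of outputs
+dz = (torch.sigmoid(z.detach()) - y) / y.numel()
+
+# Chain rule: z = x @ w + b, so d(loss)/dw = outer(x, dz) and d(loss)/db = dz
+expected_w_grad = torch.outer(x, dz)
+
+print(torch.allclose(w.grad, expected_w_grad, atol=1e-6))
+print(torch.allclose(b.grad, dz, atol=1e-6))
+```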
+
+
+<div class="alert alert-info"><h4>Note</h4><p>- We can only obtain the ``grad`` properties for the leaf
+    nodes of the computational graph, which have ``requires_grad`` property
+    set to ``True``. For all other nodes in our graph, gradients will not be
+    available.
+  - We can only perform gradient calculations using
+    ``backward`` once on a given graph, for performance reasons. If we need
+    to do several ``backward`` calls on the same graph, we need to pass
+    ``retain_graph=True`` to the ``backward`` call.</p></div>
+
+
+
+
+## Disabling Gradient Tracking
+
+By default, all tensors with ``requires_grad=True`` track their
+computational history and support gradient computation. However, there
+are some cases when we do not need to do that, for example, when we have
+trained the model and just want to apply it to some input data, i.e. we
+only want to do *forward* computations through the network. We can stop
+tracking computations by surrounding our computation code with
+a ``torch.no_grad()`` block:
+
+
+
+
+
+```python
+z = torch.matmul(x, w)+b
+print(z.requires_grad)
+
+with torch.no_grad():
+    z = torch.matmul(x, w)+b
+print(z.requires_grad)
+```
+
+    True
+    False
+
+
+Another way to achieve the same result is to use the ``detach()`` method
+on the tensor:
+
+
+
+
+
+```python
+z = torch.matmul(x, w)+b
+z_det = z.detach()
+print(z_det.requires_grad)
+```
+
+    False
+
+
+There are reasons you might want to disable gradient tracking:
+  - To mark some parameters in your neural network as **frozen parameters**.
+  - To **speed up computations** when you are only doing forward pass, because computations on tensors that do
+    not track gradients would be more efficient.
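+
+As a minimal sketch of the first case, the snippet below freezes the first layer of a small,
+made-up ``nn.Sequential`` model by turning off ``requires_grad`` on its parameters. After a
+backward pass, only the unfrozen layer has gradients. The model and its sizes are hypothetical and
+chosen only to keep the example short.
+
+```python
+import torch
+from torch import nn
+
+# Hypothetical two-layer model used only for this illustration
+model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
+
+# Freeze the first layer: autograd will no longer track its parameters
+for param in model[0].parameters():
+    param.requires_grad_(False)
+
+loss = model(torch.rand(5, 4)).sum()
+loss.backward()
+
+print(model[0].weight.grad)        # frozen layer: no gradient is stored
+print(model[2].weight.grad.shape)  # trainable layer: gradient is populated
+```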
+
+
+
+## More on Computational Graphs
+Conceptually, autograd keeps a record of data (tensors) and all executed
+operations (along with the resulting new tensors) in a directed acyclic
+graph (DAG) consisting of
+[Function](https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function)
+objects. In this DAG, leaves are the input tensors, roots are the output
+tensors. By tracing this graph from roots to leaves, you can
+automatically compute the gradients using the chain rule.
+
+In a forward pass, autograd does two things simultaneously:
+
+- run the requested operation to compute a resulting tensor
+- maintain the operation’s *gradient function* in the DAG.
+
+The backward pass kicks off when ``.backward()`` is called on the DAG
+root. ``autograd`` then:
+
+- computes the gradients from each ``.grad_fn``,
+- accumulates them in the respective tensor’s ``.grad`` attribute
+- using the chain rule, propagates all the way to the leaf tensors.
+
+<div class="alert alert-info"><h4>Note</h4><p>**DAGs are dynamic in PyTorch**
+  An important thing to note is that the graph is recreated from scratch; after each
+  ``.backward()`` call, autograd starts populating a new graph. This is
+  exactly what allows you to use control flow statements in your model;
+  you can change the shape, size and operations at every iteration if
+  needed.</p></div>
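+
+The sketch below illustrates this with ordinary Python control flow. The hypothetical ``forward``
+function multiplies by ``x`` a variable number of times, so each call records a different graph,
+and the gradient follows whichever graph was actually built.
+
+```python
+import torch
+
+x = torch.tensor(2.0, requires_grad=True)
+
+def forward(x, n_steps):
+    # The loop length decides how many multiply nodes end up in the graph
+    out = x
+    for _ in range(n_steps):
+        out = out * x
+    return out  # equals x ** (n_steps + 1)
+
+grads = {}
+for n_steps in (1, 3):
+    x.grad = None  # clear the accumulated gradient between runs
+    forward(x, n_steps).backward()
+    grads[n_steps] = x.grad.item()
+
+# d/dx of x**2 at x=2 is 4; d/dx of x**4 at x=2 is 32
+print(grads)
+```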
+
+
+
+## Optional Reading: Tensor Gradients and Jacobian Products
+
+In many cases, we have a scalar loss function, and we need to compute
+the gradient with respect to some parameters. However, there are cases
+when the output function is an arbitrary tensor. In this case, PyTorch
+allows you to compute a so-called **Jacobian product**, and not the actual
+gradient.
+
+For a vector function $\vec{y}=f(\vec{x})$, where
+$\vec{x}=\langle x_1,\dots,x_n\rangle$ and
+$\vec{y}=\langle y_1,\dots,y_m\rangle$, the gradient of
+$\vec{y}$ with respect to $\vec{x}$ is given by the **Jacobian
+matrix**:
+
+$$J=\left(\begin{array}{ccc}
+      \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\
+      \vdots & \ddots & \vdots\\
+      \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
+      \end{array}\right)$$
+
+Instead of computing the Jacobian matrix itself, PyTorch allows you to
+compute the **Jacobian product** $v^T\cdot J$ for a given input vector
+$v=(v_1 \dots v_m)$. This is achieved by calling ``backward`` with
+$v$ as an argument. The size of $v$ should match the size of the output
+tensor on which ``backward`` is called (``out`` in the code that follows),
+since $v$ has one entry per output component:
+
+
+
+
+
+```python
+inp = torch.eye(4, 5, requires_grad=True)
+out = (inp+1).pow(2).t()
+out.backward(torch.ones_like(out), retain_graph=True)
+print(f"First call\n{inp.grad}")
+out.backward(torch.ones_like(out), retain_graph=True)
+print(f"\nSecond call\n{inp.grad}")
+inp.grad.zero_()
+out.backward(torch.ones_like(out), retain_graph=True)
+print(f"\nCall after zeroing gradients\n{inp.grad}")
+```
+
+    First call
+    tensor([[4., 2., 2., 2., 2.],
+            [2., 4., 2., 2., 2.],
+            [2., 2., 4., 2., 2.],
+            [2., 2., 2., 4., 2.]])
+    
+    Second call
+    tensor([[8., 4., 4., 4., 4.],
+            [4., 8., 4., 4., 4.],
+            [4., 4., 8., 4., 4.],
+            [4., 4., 4., 8., 4.]])
+    
+    Call after zeroing gradients
+    tensor([[4., 2., 2., 2., 2.],
+            [2., 4., 2., 2., 2.],
+            [2., 2., 4., 2., 2.],
+            [2., 2., 2., 4., 2.]])
+
+
+Notice that when we call ``backward`` for the second time with the same
+argument, the value of the gradient is different. This happens because
+when doing ``backward`` propagation, PyTorch **accumulates the
+gradients**, i.e. the value of computed gradients is added to the
+``grad`` property of all leaf nodes of the computational graph. If you want
+to compute the proper gradients, you need to zero out the ``grad``
+property first. In real-life training, an *optimizer* helps us to do
+this.
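+
+To connect ``backward(v)`` with the product $v^T\cdot J$, the sketch below builds the full
+Jacobian of a small made-up function with ``torch.autograd.functional.jacobian`` and compares
+$v^T\cdot J$ against the gradient that ``backward`` stores. The function ``f`` and the vectors
+``x`` and ``v`` are hypothetical and chosen so that the result is easy to verify by hand.
+
+```python
+import torch
+from torch.autograd.functional import jacobian
+
+def f(x):
+    # Hypothetical function from R^3 to R^2
+    return torch.stack([x[0] * x[1], x[1] + x[2]])
+
+x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
+v = torch.tensor([1.0, 2.0])  # one entry per output component
+
+y = f(x)
+y.backward(v)  # stores v^T J in x.grad
+
+# Full Jacobian, shape (2, 3): [[x1, x0, 0], [0, 1, 1]] = [[2, 1, 0], [0, 1, 1]]
+J = jacobian(f, x.detach())
+
+print(x.grad)  # v^T J = [2, 3, 2]
+print(torch.allclose(x.grad, v @ J))
+```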
+
+
+
+<div class="alert alert-info"><h4>Note</h4><p>Previously we were calling ``backward()`` function without
+          parameters. This is essentially equivalent to calling
+          ``backward(torch.tensor(1.0))``, which is a useful way to compute the
+          gradients in case of a scalar-valued function, such as loss during
+          neural network training.</p></div>
+
+
+
+
+--------------
+
+
+
+
+### Further Reading
+- [Autograd Mechanics](https://pytorch.org/docs/stable/notes/autograd.html)
+
+
diff --git a/docs/08-Optimization.md b/docs/08-Optimization.md
new file mode 100644
index 0000000..aeb9aa0
--- /dev/null
+++ b/docs/08-Optimization.md
@@ -0,0 +1,369 @@
+[Learn the Basics](intro.html) ||
+[Quickstart](quickstart_tutorial.html) ||
+[Tensors](tensorqs_tutorial.html) ||
+[Datasets & DataLoaders](data_tutorial.html) ||
+[Transforms](transforms_tutorial.html) ||
+[Build Model](buildmodel_tutorial.html) ||
+[Autograd](autogradqs_tutorial.html) ||
+**Optimization** ||
+[Save & Load Model](saveloadrun_tutorial.html)
+
+# Optimizing Model Parameters
+
+Now that we have a model and data, it's time to train, validate and test our model by optimizing its parameters on
+our data. Training a model is an iterative process; in each iteration the model makes a guess about the output, calculates
+the error in its guess (*loss*), collects the derivatives of the error with respect to its parameters (as we saw in
+the [previous section](autogradqs_tutorial.html)), and **optimizes** these parameters using gradient descent. For a more
+detailed walkthrough of this process, check out this video on [backpropagation from 3Blue1Brown](https://www.youtube.com/watch?v=tIeHLnjs5U8).
+
+## Prerequisite Code
+We load the code from the previous sections on [Datasets & DataLoaders](data_tutorial.html)
+and [Build Model](buildmodel_tutorial.html).
+
+
+
+```python
+import torch
+from torch import nn
+from torch.utils.data import DataLoader
+from torchvision import datasets
+from torchvision.transforms import ToTensor
+
+training_data = datasets.FashionMNIST(
+    root="data",
+    train=True,
+    download=True,
+    transform=ToTensor()
+)
+
+test_data = datasets.FashionMNIST(
+    root="data",
+    train=False,
+    download=True,
+    transform=ToTensor()
+)
+
+train_dataloader = DataLoader(training_data, batch_size=64)
+test_dataloader = DataLoader(test_data, batch_size=64)
+
+class NeuralNetwork(nn.Module):
+    def __init__(self):
+        super(NeuralNetwork, self).__init__()
+        self.flatten = nn.Flatten()
+        self.linear_relu_stack = nn.Sequential(
+            nn.Linear(28*28, 512),
+            nn.ReLU(),
+            nn.Linear(512, 512),
+            nn.ReLU(),
+            nn.Linear(512, 10),
+        )
+
+    def forward(self, x):
+        x = self.flatten(x)
+        logits = self.linear_relu_stack(x)
+        return logits
+
+model = NeuralNetwork()
+```
+
+## Hyperparameters
+
+Hyperparameters are adjustable parameters that let you control the model optimization process.
+Different hyperparameter values can impact model training and convergence rates
+([read more](https://pytorch.org/tutorials/beginner/hyperparameter_tuning_tutorial.html) about hyperparameter tuning).
+
+We define the following hyperparameters for training:
+ - **Number of Epochs** - the number of times to iterate over the dataset
+ - **Batch Size** - the number of data samples propagated through the network before the parameters are updated
+ - **Learning Rate** - how much to update model parameters at each batch/epoch. Smaller values yield slow learning speed, while large values may result in unpredictable behavior during training.
+
+
+
+
+
+```python
+learning_rate = 1e-3
+batch_size = 64
+epochs = 5
+```
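+
+These values determine how many parameter updates training performs. The arithmetic sketch below
+assumes the 60,000-sample FashionMNIST training set (the size shown in the training log later on
+this page) and a ``DataLoader`` that keeps the final partial batch, which is its default behavior.
+The variable names are introduced here for illustration.
+
+```python
+import math
+
+dataset_size = 60_000  # FashionMNIST training samples
+batch_size = 64
+epochs = 5
+
+# One optimizer update per batch; the last batch of an epoch may be smaller
+batches_per_epoch = math.ceil(dataset_size / batch_size)
+last_batch_size = dataset_size - (batches_per_epoch - 1) * batch_size
+total_updates = epochs * batches_per_epoch
+
+print(batches_per_epoch, last_batch_size, total_updates)
+```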
+
+## Optimization Loop
+
+Once we set our hyperparameters, we can then train and optimize our model with an optimization loop. Each
+iteration of the optimization loop is called an **epoch**.
+
+Each epoch consists of two main parts:
+ - **The Train Loop** - iterate over the training dataset and try to converge to optimal parameters.
+ - **The Validation/Test Loop** - iterate over the test dataset to check if model performance is improving.
+
+Let's briefly familiarize ourselves with some of the concepts used in the training loop. Jump ahead to
+see the [Full Implementation](#full-implementation) of the optimization loop.
+
+### Loss Function
+
+When presented with some training data, our untrained network is unlikely to give the correct
+answer. A **loss function** measures the degree of dissimilarity between the obtained result and the target value,
+and it is the loss function that we want to minimize during training. To calculate the loss, we make a
+prediction using the inputs of our given data sample and compare it against the true data label value.
+
+Common loss functions include [nn.MSELoss](https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html#torch.nn.MSELoss) (Mean Square Error) for regression tasks, and
+[nn.NLLLoss](https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html#torch.nn.NLLLoss) (Negative Log Likelihood) for classification.
+[nn.CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss) combines ``nn.LogSoftmax`` and ``nn.NLLLoss``.
+
+We pass our model's output logits to ``nn.CrossEntropyLoss``, which will normalize the logits and compute the prediction error.
+
+
+
+
+```python
+# Initialize the loss function
+loss_fn = nn.CrossEntropyLoss()
+```
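+
+The claim that ``nn.CrossEntropyLoss`` combines ``nn.LogSoftmax`` and ``nn.NLLLoss`` can be checked
+numerically. The sketch below uses random logits and made-up integer targets in place of real model
+output; both paths should give the same loss value up to floating-point tolerance.
+
+```python
+import torch
+from torch import nn
+
+torch.manual_seed(0)
+logits = torch.randn(4, 10)            # stand-in for model output: 4 samples, 10 classes
+targets = torch.tensor([1, 0, 9, 3])   # hypothetical class indices
+
+# One step: cross entropy applied directly to the raw logits
+ce = nn.CrossEntropyLoss()(logits, targets)
+
+# Two steps: log-softmax over the class dimension, then negative log likelihood
+nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)
+
+print(ce.item(), nll.item())
+print(torch.allclose(ce, nll))
+```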
+
+### Optimizer
+
+Optimization is the process of adjusting model parameters to reduce model error in each training step. **Optimization algorithms** define how this process is performed (in this example we use Stochastic Gradient Descent).
+All optimization logic is encapsulated in the ``optimizer`` object. Here, we use the SGD optimizer; additionally, there are many [different optimizers](https://pytorch.org/docs/stable/optim.html)
+available in PyTorch, such as Adam and RMSProp, that work better for different kinds of models and data.
+
+We initialize the optimizer by registering the model's parameters that need to be trained, and passing in the learning rate hyperparameter.
+
+
+
+
+```python
+optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
+```
+
+Inside the training loop, optimization happens in three steps:
+ * Call ``optimizer.zero_grad()`` to reset the gradients of model parameters. Gradients by default add up; to prevent double-counting, we explicitly zero them at each iteration.
+ * Backpropagate the prediction loss with a call to ``loss.backward()``. PyTorch deposits the gradients of the loss w.r.t. each parameter.
+ * Once we have our gradients, we call ``optimizer.step()`` to adjust the parameters by the gradients collected in the backward pass.
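+
+The three steps above can be sketched on a tiny, made-up regression problem. For plain SGD without
+momentum or weight decay, ``optimizer.step()`` amounts to ``param -= lr * param.grad``, which the
+snippet verifies for the weight tensor. The model, data, and the ``expected`` variable are
+hypothetical and exist only for this check.
+
+```python
+import torch
+from torch import nn
+
+torch.manual_seed(0)
+lr = 0.1
+model = nn.Linear(3, 1)  # hypothetical tiny model
+optimizer = torch.optim.SGD(model.parameters(), lr=lr)
+X, y = torch.rand(8, 3), torch.rand(8, 1)
+
+optimizer.zero_grad()                # 1. reset accumulated gradients
+loss = nn.MSELoss()(model(X), y)
+loss.backward()                      # 2. populate .grad for each parameter
+
+# What plain SGD should produce for the weights after one step
+expected = model.weight.detach() - lr * model.weight.grad
+
+optimizer.step()                     # 3. apply the update in place
+
+print(torch.allclose(model.weight, expected))
+```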
+
+
+
+
+## Full Implementation
+We define ``train_loop`` that loops over our optimization code, and ``test_loop`` that
+evaluates the model's performance against our test data.
+
+
+
+
+```python
+def train_loop(dataloader, model, loss_fn, optimizer):
+    size = len(dataloader.dataset)
+    for batch, (X, y) in enumerate(dataloader):
+        # Compute prediction and loss
+        pred = model(X)
+        loss = loss_fn(pred, y)
+
+        # Backpropagation
+        optimizer.zero_grad()
+        loss.backward()
+        optimizer.step()
+
+        if batch % 100 == 0:
+            loss, current = loss.item(), (batch + 1) * len(X)
+            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")
+
+
+def test_loop(dataloader, model, loss_fn):
+    size = len(dataloader.dataset)
+    num_batches = len(dataloader)
+    test_loss, correct = 0, 0
+
+    with torch.no_grad():
+        for X, y in dataloader:
+            pred = model(X)
+            test_loss += loss_fn(pred, y).item()
+            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
+
+    test_loss /= num_batches
+    correct /= size
+    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")
+```
+
+We initialize the loss function and optimizer, and pass them to ``train_loop`` and ``test_loop``.
+Feel free to increase the number of epochs to track the model's improving performance.
+
+
+
+
+```python
+loss_fn = nn.CrossEntropyLoss()
+optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
+
+epochs = 10
+for t in range(epochs):
+    print(f"Epoch {t+1}\n-------------------------------")
+    train_loop(train_dataloader, model, loss_fn, optimizer)
+    test_loop(test_dataloader, model, loss_fn)
+print("Done!")
+```
+
+    Epoch 1
+    -------------------------------
+    loss: 2.310308  [   64/60000]
+    loss: 2.291682  [ 6464/60000]
+    loss: 2.282847  [12864/60000]
+    loss: 2.278148  [19264/60000]
+    loss: 2.259573  [25664/60000]
+    loss: 2.246842  [32064/60000]
+    loss: 2.237948  [38464/60000]
+    loss: 2.221490  [44864/60000]
+    loss: 2.215676  [51264/60000]
+    loss: 2.186174  [57664/60000]
+    Test Error: 
+     Accuracy: 50.1%, Avg loss: 2.185173 
+    
+    Epoch 2
+    -------------------------------
+    loss: 2.192464  [   64/60000]
+    loss: 2.176265  [ 6464/60000]
+    loss: 2.138019  [12864/60000]
+    loss: 2.155484  [19264/60000]
+    loss: 2.096774  [25664/60000]
+    loss: 2.064352  [32064/60000]
+    loss: 2.073422  [38464/60000]
+    loss: 2.019561  [44864/60000]
+    loss: 2.018754  [51264/60000]
+    loss: 1.944076  [57664/60000]
+    Test Error: 
+     Accuracy: 56.9%, Avg loss: 1.951974 
+    
+    Epoch 3
+    -------------------------------
+    loss: 1.979550  [   64/60000]
+    loss: 1.944613  [ 6464/60000]
+    loss: 1.850896  [12864/60000]
+    loss: 1.885921  [19264/60000]
+    loss: 1.766024  [25664/60000]
+    loss: 1.721881  [32064/60000]
+    loss: 1.732149  [38464/60000]
+    loss: 1.646069  [44864/60000]
+    loss: 1.663508  [51264/60000]
+    loss: 1.542335  [57664/60000]
+    Test Error: 
+     Accuracy: 60.8%, Avg loss: 1.575167 
+    
+    Epoch 4
+    -------------------------------
+    loss: 1.641383  [   64/60000]
+    loss: 1.597785  [ 6464/60000]
+    loss: 1.460881  [12864/60000]
+    loss: 1.522893  [19264/60000]
+    loss: 1.394849  [25664/60000]
+    loss: 1.381750  [32064/60000]
+    loss: 1.389999  [38464/60000]
+    loss: 1.324359  [44864/60000]
+    loss: 1.359623  [51264/60000]
+    loss: 1.242349  [57664/60000]
+    Test Error: 
+     Accuracy: 63.2%, Avg loss: 1.281596 
+    
+    Epoch 5
+    -------------------------------
+    loss: 1.364956  [   64/60000]
+    loss: 1.337699  [ 6464/60000]
+    loss: 1.179997  [12864/60000]
+    loss: 1.276043  [19264/60000]
+    loss: 1.145318  [25664/60000]
+    loss: 1.163051  [32064/60000]
+    loss: 1.179221  [38464/60000]
+    loss: 1.127842  [44864/60000]
+    loss: 1.170320  [51264/60000]
+    loss: 1.072596  [57664/60000]
+    Test Error: 
+     Accuracy: 64.8%, Avg loss: 1.102368 
+    
+    Epoch 6
+    -------------------------------
+    loss: 1.181124  [   64/60000]
+    loss: 1.175671  [ 6464/60000]
+    loss: 0.999543  [12864/60000]
+    loss: 1.125861  [19264/60000]
+    loss: 0.994338  [25664/60000]
+    loss: 1.020635  [32064/60000]
+    loss: 1.052101  [38464/60000]
+    loss: 1.005876  [44864/60000]
+    loss: 1.050259  [51264/60000]
+    loss: 0.969423  [57664/60000]
+    Test Error: 
+     Accuracy: 65.8%, Avg loss: 0.989962 
+    
+    Epoch 7
+    -------------------------------
+    loss: 1.055653  [   64/60000]
+    loss: 1.073796  [ 6464/60000]
+    loss: 0.878792  [12864/60000]
+    loss: 1.027988  [19264/60000]
+    loss: 0.902191  [25664/60000]
+    loss: 0.923560  [32064/60000]
+    loss: 0.970771  [38464/60000]
+    loss: 0.927402  [44864/60000]
+    loss: 0.969056  [51264/60000]
+    loss: 0.901827  [57664/60000]
+    Test Error: 
+     Accuracy: 66.8%, Avg loss: 0.914991 
+    
+    Epoch 8
+    -------------------------------
+    loss: 0.964512  [   64/60000]
+    loss: 1.004631  [ 6464/60000]
+    loss: 0.793878  [12864/60000]
+    loss: 0.959500  [19264/60000]
+    loss: 0.842306  [25664/60000]
+    loss: 0.854395  [32064/60000]
+    loss: 0.914801  [38464/60000]
+    loss: 0.875149  [44864/60000]
+    loss: 0.910963  [51264/60000]
+    loss: 0.853945  [57664/60000]
+    Test Error: 
+     Accuracy: 67.8%, Avg loss: 0.861828 
+    
+    Epoch 9
+    -------------------------------
+    loss: 0.895530  [   64/60000]
+    loss: 0.953656  [ 6464/60000]
+    loss: 0.731293  [12864/60000]
+    loss: 0.908750  [19264/60000]
+    loss: 0.800252  [25664/60000]
+    loss: 0.803487  [32064/60000]
+    loss: 0.873069  [38464/60000]
+    loss: 0.838708  [44864/60000]
+    loss: 0.867891  [51264/60000]
+    loss: 0.817475  [57664/60000]
+    Test Error: 
+     Accuracy: 68.9%, Avg loss: 0.821918 
+    
+    Epoch 10
+    -------------------------------
+    loss: 0.841097  [   64/60000]
+    loss: 0.913210  [ 6464/60000]
+    loss: 0.683007  [12864/60000]
+    loss: 0.869649  [19264/60000]
+    loss: 0.768555  [25664/60000]
+    loss: 0.764901  [32064/60000]
+    loss: 0.839639  [38464/60000]
+    loss: 0.811697  [44864/60000]
+    loss: 0.834432  [51264/60000]
+    loss: 0.788075  [57664/60000]
+    Test Error: 
+     Accuracy: 70.1%, Avg loss: 0.790321 
+    
+    Done!
+
+
+## Further Reading
+- [Loss Functions](https://pytorch.org/docs/stable/nn.html#loss-functions)
+- [torch.optim](https://pytorch.org/docs/stable/optim.html)
+- [Warmstart Training a Model](https://pytorch.org/tutorials/recipes/recipes/warmstarting_model_using_parameters_from_a_different_model.html)
+
+
+
diff --git a/docs/09-SaveLoad.md b/docs/09-SaveLoad.md
new file mode 100644
index 0000000..d07b56e
--- /dev/null
+++ b/docs/09-SaveLoad.md
@@ -0,0 +1,141 @@
+```python
+%matplotlib inline
+```
+
+
+[Learn the Basics](intro.html) ||
+[Quickstart](quickstart_tutorial.html) ||
+[Tensors](tensorqs_tutorial.html) ||
+[Datasets & DataLoaders](data_tutorial.html) ||
+[Transforms](transforms_tutorial.html) ||
+[Build Model](buildmodel_tutorial.html) ||
+[Autograd](autogradqs_tutorial.html) ||
+[Optimization](optimization_tutorial.html) ||
+**Save & Load Model**
+
+# Save and Load the Model
+
+In this section we will look at how to persist model state by saving and loading it, and how to run model predictions.
+
+
+
+```python
+import torch
+import torchvision.models as models
+```
+
+## Saving and Loading Model Weights
+PyTorch models store the learned parameters in an internal
+state dictionary, called ``state_dict``. This dictionary can be persisted with the ``torch.save``
+function:
+
+
+
+
+```python
+# `pretrained=True` is deprecated since torchvision 0.13; request the same weights explicitly
+model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
+torch.save(model.state_dict(), 'model_weights.pth')
+```
+
+
+To load model weights, you need to create an instance of the same model first, and then load the parameters
+using the ``load_state_dict()`` method.
+
+
+
+
+```python
+model = models.vgg16()  # no pretrained weights are requested, so the model starts with untrained parameters
+model.load_state_dict(torch.load('model_weights.pth'))
+model.eval()
+```
+
+
+
+
+    VGG(
+      (features): Sequential(
+        (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
+        (1): ReLU(inplace=True)
+        (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
+        (3): ReLU(inplace=True)
+        (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
+        (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
+        (6): ReLU(inplace=True)
+        (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
+        (8): ReLU(inplace=True)
+        (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
+        (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
+        (11): ReLU(inplace=True)
+        (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
+        (13): ReLU(inplace=True)
+        (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
+        (15): ReLU(inplace=True)
+        (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
+        (17): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
+        (18): ReLU(inplace=True)
+        (19): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
+        (20): ReLU(inplace=True)
+        (21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
+        (22): ReLU(inplace=True)
+        (23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
+        (24): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
+        (25): ReLU(inplace=True)
+        (26): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
+        (27): ReLU(inplace=True)
+        (28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
+        (29): ReLU(inplace=True)
+        (30): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
+      )
+      (avgpool): AdaptiveAvgPool2d(output_size=(7, 7))
+      (classifier): Sequential(
+        (0): Linear(in_features=25088, out_features=4096, bias=True)
+        (1): ReLU(inplace=True)
+        (2): Dropout(p=0.5, inplace=False)
+        (3): Linear(in_features=4096, out_features=4096, bias=True)
+        (4): ReLU(inplace=True)
+        (5): Dropout(p=0.5, inplace=False)
+        (6): Linear(in_features=4096, out_features=1000, bias=True)
+      )
+    )
+
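+The same save-and-load round trip can be checked on a much smaller model. The sketch below is illustrative only: ``make_model`` and the file name ``tiny_weights.pth`` are hypothetical and not part of this tutorial. It lists the entries of a ``state_dict``, saves them, loads them into a fresh instance, and compares the two sets of parameters.
+
+```python
+import os
+import tempfile
+
+import torch
+from torch import nn
+
+
+def make_model():
+    # Hypothetical two-layer network, used only for this illustration
+    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
+
+
+model = make_model()
+
+# The state_dict maps parameter names to tensors
+for name, tensor in model.state_dict().items():
+    print(name, tuple(tensor.shape))
+
+with tempfile.TemporaryDirectory() as tmp:
+    path = os.path.join(tmp, "tiny_weights.pth")
+    torch.save(model.state_dict(), path)
+
+    # A fresh instance starts with different random weights...
+    restored = make_model()
+    # ...until the saved parameters are loaded into it
+    restored.load_state_dict(torch.load(path))
+
+# Compare every tensor in the original and restored state dictionaries
+same = all(
+    torch.equal(a, b)
+    for a, b in zip(model.state_dict().values(), restored.state_dict().values())
+)
+print(same)
+```
+
+Because only tensors are stored, the code that builds the architecture (here ``make_model``) must still be available when loading.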
+
+
+<div class="alert alert-info"><h4>Note</h4><p>Be sure to call the ``model.eval()`` method before running inference to set the dropout and batch normalization layers to evaluation mode. Failing to do this will yield inconsistent inference results.</p></div>
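+
+A minimal sketch of what evaluation mode changes, using a standalone ``nn.Dropout`` layer rather than the VGG model above (the input tensor and seed are arbitrary choices for illustration):
+
+```python
+import torch
+from torch import nn
+
+torch.manual_seed(0)
+layer = nn.Dropout(p=0.5)
+x = torch.ones(1, 1000)
+
+# Training mode: elements are randomly zeroed and the survivors are scaled by 1 / (1 - p)
+layer.train()
+train_out = layer(x)
+
+# Evaluation mode: dropout is disabled and the input passes through unchanged
+layer.eval()
+eval_out = layer(x)
+
+print(layer.training)
+print(torch.equal(eval_out, x))
+print(train_out.unique())
+```
+
+In training mode the output changes from call to call, which is why predictions made without ``model.eval()`` can be inconsistent.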
+
+
+
+## Saving and Loading Models with Shapes
+When loading model weights, we needed to instantiate the model class first, because the class
+defines the structure of a network. We might want to save the structure of this class together with
+the model, in which case we can pass ``model`` (and not ``model.state_dict()``) to the saving function:
+
+
+
+
+```python
+torch.save(model, 'model.pth')
+```
+
+We can then load the model like this:
+
+
+
+
+```python
+# On PyTorch 2.6 and later, pass weights_only=False to unpickle a full model (trusted files only)
+model = torch.load('model.pth')
+```
+
+<div class="alert alert-info"><h4>Note</h4><p>This approach uses the Python [pickle](https://docs.python.org/3/library/pickle.html) module when serializing the model, so it relies on the actual class definition being available when loading the model.</p></div>
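+
+To see why the class definition must be available, here is a minimal sketch that uses only the standard ``pickle`` module. ``TinyModel`` is a hypothetical stand-in for a model class, not a PyTorch API; the sketch assumes it runs as a normal script or notebook cell, so that the class can be looked up by name.
+
+```python
+import pickle
+
+
+class TinyModel:
+    """Hypothetical stand-in for a model class, used only for illustration."""
+
+    def __init__(self, scale):
+        self.scale = scale
+
+    def predict(self, x):
+        return self.scale * x
+
+
+payload = pickle.dumps(TinyModel(scale=3))
+
+# The payload records the class by name plus the instance attributes,
+# but not the code of its methods
+stores_class_name = b"TinyModel" in payload
+stores_method_code = b"predict" in payload
+
+# Loading works while the class definition is available
+restored = pickle.loads(payload)
+result = restored.predict(2)
+
+# Simulate loading in an environment where the class is not defined
+del TinyModel
+try:
+    pickle.loads(payload)
+    load_failed = False
+except AttributeError:
+    load_failed = True
+
+print(stores_class_name, stores_method_code, result, load_failed)
+```
+
+Saving only the ``state_dict`` avoids this coupling to a class name and module path, at the cost of having to construct the model in code before loading.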
+
+
+
+## Related Tutorials
+[Saving and Loading a General Checkpoint in PyTorch](https://pytorch.org/tutorials/recipes/recipes/saving_and_loading_a_general_checkpoint.html)
+
+
diff --git a/docs/docs/04-Data_21_1.png b/docs/docs/04-Data_21_1.png
new file mode 100644
index 0000000..01d7edb
Binary files /dev/null and b/docs/docs/04-Data_21_1.png differ
diff --git a/docs/docs/04-Data_6_0.png b/docs/docs/04-Data_6_0.png
new file mode 100644
index 0000000..5f07fe2
Binary files /dev/null and b/docs/docs/04-Data_6_0.png differ
diff --git a/tutorials/01-Introduction.ipynb b/tutorials/01-Introduction.ipynb
index 751f83d..8fcbf9f 100644
--- a/tutorials/01-Introduction.ipynb
+++ b/tutorials/01-Introduction.ipynb
@@ -1,6 +1,7 @@
 {
  "cells": [
   {
+   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -46,13 +47,7 @@
     "If you're familiar with other deep learning frameworks, check out the [0. Quickstart](quickstart_tutorial.html) first\n",
     "to quickly familiarize yourself with PyTorch's API.\n",
     "\n",
-    "If you're new to deep learning frameworks, head right into the first section of our step-by-step guide: [1. Tensors](tensor_tutorial.html).\n",
-    "\n",
-    "\n",
-    ".. include:: /beginner_source/basics/qs_toc.txt\n",
-    "\n",
-    ".. toctree::\n",
-    "   :hidden:\n"
+    "If you're new to deep learning frameworks, head right into the first section of our step-by-step guide: [1. Tensors](tensor_tutorial.html).\n"
    ]
   }
  ],