\documentclass{article}
\usepackage[utf8]{inputenc}
\title{COL216 A-5 Write-up}
\author{Sayam Sethi 2019CS10399 \\ Mallika Prabhakar 2019CS50440}
\date{April 2021}
\begin{document}
\maketitle
\section{Assumptions}
The following hardware assumptions were made in the design of the \textbf{MIPS Processor}:
\begin{enumerate}
\item There are \textbf{three ports} for communication with the processor, along with some additional hardware:
\begin{enumerate}
\item Two \textbf{output ports}: one to send the DRAM request to the \textbf{DRAM queue} (explained in detail in the next point), and one to signal whether a ``priority load'' is required by the processor while it is being stalled.
\item An \textbf{input port} to receive, and either accept or reject, the result of a DRAM request after its delay.
\item An \textbf{\{address, value\} pair} for every register, which stores the latest DRAM address from which the register's value is to be loaded.
\item The remaining hardware is the same as for a single-pipeline MIPS processor.
\end{enumerate}
\item The \textbf{Memory Request Manager (MRM)} has the following hardware components and ports:
\begin{enumerate}
\item $N$ \textbf{buffer queues} (each a fixed-size data segment, $size = 32$), one for every core, which store the instructions sent to the \textbf{MRM}. The detailed hardware structure is explained in the next section. A \textbf{counter} is also used to keep track of the number of elements present in each segment.
\item An \texttt{sw}--\texttt{lw} forwarding \textbf{pipeline register} which stores the latest \textbf{\{issue clock cycle, value\}} for the address(es) currently present in the \textbf{buffer queue}.
\item Two \textbf{cyclic priority encoders} which help decide the core whose instruction is to be processed next.
\item $2\times N$ \textbf{input ports} which receive the DRAM requests and the priority signals from the cores.
\item $N$ \textbf{output ports} which return the DRAM load results to the cores.
\item Two more \textbf{counters} which keep track of the number of instructions processed in the current row and of the cycles remaining for the current request to be processed.
\item Two \textbf{registers} which store the current core and row whose request is being serviced.
\end{enumerate}
\item The \textbf{DRAM architecture} is as follows:
\begin{enumerate}
\item It contains a \textbf{2D square array} of $1024$ rows with each row having $1024$ bytes ($256$ words). This memory segment is divided into $N$ blocks to separate the memory locations accessible by the cores.
\item It contains another storage element called the \textbf{row buffer}, which stores the row currently being accessed.
\end{enumerate}
\end{enumerate}
Note: $N$ is taken to be $16$ in our implementation; however, the value can be changed to any number that is a factor of $1024$.
\section{Implementation}
\subsection{Algorithm}
Note: The hardware specifications and decision delays are discussed in the next subsection.\par
The idea behind our implementation is to ensure that all cores are serviced as soon as possible and that no core is left waiting too long. The \textbf{MRM} has a separate buffer for every core, each with a maximum size of $32$ requests. The buffer is implemented as an \textit{unordered map} of \textit{queues}, which helps service the requests sequentially for every row.\par
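As a minimal sketch of this structure (assuming a C++ implementation; the type and member names are hypothetical, not taken from the actual source):
\begin{verbatim}
#include <cstdint>
#include <queue>
#include <unordered_map>

// One DRAM request as seen by the MRM (hypothetical field layout).
struct Request {
    bool     isLoad;     // true for lw, false for sw
    uint32_t address;    // DRAM byte address
    uint32_t regOrValue; // destination register (lw) or value (sw)
    uint64_t issueCycle; // clock cycle at which the core issued it
};

// 1024 bytes per DRAM row, as per the assumptions above.
inline uint32_t rowOf(uint32_t address) { return address / 1024; }

// One buffer per core: requests grouped by row, serviced FIFO per row.
struct CoreBuffer {
    static constexpr int MAX_REQUESTS = 32;
    std::unordered_map<uint32_t, std::queue<Request>> byRow;
    int count = 0; // mirrors the hardware counter for the segment

    bool push(const Request& r) {
        if (count == MAX_REQUESTS) return false; // full: core stalls
        byRow[rowOf(r.address)].push(r);
        ++count;
        return true;
    }
};
\end{verbatim}
Grouping requests by row means that once a row is open in the DRAM row buffer, every pending request for it can be drained without paying the row-switch cost again.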
When the requests are sent to the \textbf{MRM} for the first time (or after the \textbf{MRM} was emptied), it selects the first core that has a request in its buffer. In subsequent selections, it keeps selecting requests from the same row (and hence the same core), as long as such requests are available.\par
Now, if all the requests of the current row have been processed, or $maxToProcess$ ($maxToProcess = \frac{rowAccessDelay}{colAccessDelay}$) instructions have been serviced from the current row, the \textbf{MRM} decides which (core, row) pair to service next. This is determined by cyclically selecting the next core that has a priority load, if any. If no core has a priority load, the next core with a pending request is selected cyclically. If a core was selected using priority, the $maxToProcess$ cap ensures that no core is left \textbf{starving}, and cyclic selection ensures that no core gets undue priority when deciding which request to serve.
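A behavioural sketch of this selection policy (C++; \texttt{hasPriority} and \texttt{hasPending} are hypothetical stand-ins for the signals fed to the two encoders):
\begin{verbatim}
#include <vector>

// Pick the next core cyclically, preferring cores that assert a
// priority-load signal; returns -1 if no core has pending work.
int nextCore(int currentCore, int N,
             const std::vector<bool>& hasPriority,
             const std::vector<bool>& hasPending) {
    for (int i = 1; i <= N; ++i) {        // priority loads first
        int c = (currentCore + i) % N;
        if (hasPriority[c]) return c;
    }
    for (int i = 1; i <= N; ++i) {        // then any pending request
        int c = (currentCore + i) % N;
        if (hasPending[c]) return c;
    }
    return -1;                            // all buffers empty
}
\end{verbatim}
Starting both scans just after the current core is what makes the selection cyclic rather than fixed-priority.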
\subsubsection*{Other Optimisations}
\begin{itemize}
\item \textbf{Forwarding:} \texttt{sw} to \texttt{lw} forwarding is implemented using a register file (pipelining registers) present in the \textbf{MRM}.
\item \textbf{Skipping Redundant Instructions:} If there are multiple \texttt{lw} instructions for the same register, only the most recent request is sent to the DRAM and the others are skipped. The same is done for multiple \texttt{sw} instructions to the same address (see the sketch after this list).
\item \textbf{Dirty-only Write-back:} Write-back happens only when the row buffer is dirty; otherwise it is skipped.
\end{itemize}
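The redundancy-skipping bookkeeping can be pictured as follows (a sketch reusing the hypothetical \texttt{Request} struct from the earlier snippet; the real implementation may differ):
\begin{verbatim}
#include <cstdint>
#include <unordered_map>

// Only the most recent pending lw per destination register, and the
// most recent pending sw per address, are actually sent to the DRAM.
struct RedundancyFilter {
    std::unordered_map<uint32_t, uint64_t> latestLoadFor; // reg  -> cycle
    std::unordered_map<uint32_t, uint64_t> latestStoreTo; // addr -> cycle

    // Record a newly arrived request as the latest of its kind.
    void record(const Request& r) {
        if (r.isLoad) latestLoadFor[r.regOrValue] = r.issueCycle;
        else          latestStoreTo[r.address]    = r.issueCycle;
    }

    // A request is skipped if a newer one has superseded it.
    bool isRedundant(const Request& r) const {
        const auto& m = r.isLoad ? latestLoadFor : latestStoreTo;
        auto it = m.find(r.isLoad ? r.regOrValue : r.address);
        return it != m.end() && it->second > r.issueCycle;
    }
};
\end{verbatim}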
\subsection{Hardware Perspective}
The delays have been evaluated under the following assumptions:
\begin{enumerate}
\item The processor sends requests to the \textbf{MRM} before the falling edge of the current clock cycle (so that the information is available during the next clock edge).
\item The clock cycle is long enough that about $5$ sequential operations can be performed within it. This is necessary since the processor performs instruction fetch, instruction decode, register read, ALU control, and the ALU operation in a single cycle.
\end{enumerate}
\subsubsection*{Buffer Queue}
The buffer queue is a data segment with a maximum size of $32$ entries for every core. The requests sent by the processor are ``pushed'' onto this queue. It has a linked-list-like implementation, where the position of the next element of the queue is stored in the current element. When creating a new queue, an algorithm similar to that used for hashing is used. On average this is an $O(1)$ operation and completes in one clock cycle. Thus, when the \textbf{MRM} selects the next instruction, the selection happens towards the end of the clock cycle and hence takes one complete cycle (when staying within the same row).\par
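One way to realise such a fixed-size, linked-list-like segment is sketched below (hypothetical; linear probing stands in for the ``similar to hashing'' allocation described above):
\begin{verbatim}
#include <cstdint>

// Fixed 32-slot segment: each slot stores its payload plus the index
// of the next element in its queue, mimicking a linked list.
struct BufferSegment {
    static constexpr int SIZE = 32;
    struct Slot { uint32_t payload = 0; int next = -1; bool used = false; };
    Slot slots[SIZE];

    // Hash-like allocation: probe from a hashed start index until a
    // free slot is found -- O(1) on average, one clock cycle.
    int allocate(uint32_t key) {
        int start = static_cast<int>(key % SIZE);
        for (int i = 0; i < SIZE; ++i) {
            int s = (start + i) % SIZE;
            if (!slots[s].used) { slots[s].used = true; return s; }
        }
        return -1; // segment full: the core must stall
    }
};
\end{verbatim}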
When the priority encoder is involved in selecting the next core, another clock cycle is used to obtain the result from the encoders, after which the optimal row is selected for the core output by the encoder.
\subsubsection*{Forwarding Pipeline Register}
It is similar to the pipeline register taught in the lectures and stores the \textbf{\{issue clock cycle, value\}} for addresses that have pending \texttt{sw} requests yet to be processed by the DRAM. The mapping is similar to that used by a cache. The result is returned one cycle after the request is sent to the \textbf{MRM}; during that cycle the \textbf{MRM} determines the forwarding value.
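A cache-like mapping for this register file might look as follows (a sketch; the direct-mapped indexing is an assumption, since the write-up only states that the mapping is cache-like):
\begin{verbatim}
#include <cstdint>
#include <optional>

// Per address: the latest pending {issue clock cycle, value} written
// by an sw that is still waiting in the buffer.
struct ForwardingFile {
    static constexpr int ENTRIES = 32;
    struct Entry {
        bool valid = false;
        uint32_t addr = 0; uint64_t cycle = 0; uint32_t value = 0;
    };
    Entry entries[ENTRIES];

    void onStore(uint32_t addr, uint64_t cycle, uint32_t value) {
        // Direct-mapped on the word address (assumed).
        entries[(addr / 4) % ENTRIES] = {true, addr, cycle, value};
    }

    // Forwarded value for an lw, if a pending sw covers the address.
    std::optional<uint32_t> lookup(uint32_t addr) const {
        const Entry& e = entries[(addr / 4) % ENTRIES];
        if (e.valid && e.addr == addr) return e.value;
        return std::nullopt;
    }
};
\end{verbatim}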
\subsubsection*{Cyclic Priority Encoders}
If the core whose request is being processed has to be changed, the values from the cyclic priority encoders are read. It differs from an ordinary priority encoder in that the priority rotates cyclically, with the core just after the current core having the highest priority. This uses a combinational circuit and hence returns the result with a delay comparable only to logic-gate delays.
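Behaviourally, the encoder can be modelled with the usual rotate-then-encode trick (a C++ sketch of the combinational behaviour; in hardware this is gates, not a loop):
\begin{verbatim}
#include <bit>
#include <cstdint>

// Cyclic priority encoder: rotate the request lines so that the line
// just after `current` lands at bit 0, take the lowest set bit with a
// fixed-priority encode, then undo the rotation. Up to 32 lines.
int cyclicEncode(uint32_t req, int current, int N) {
    uint32_t mask = (N == 32) ? 0xFFFFFFFFu : ((1u << N) - 1u);
    req &= mask;
    if (req == 0) return -1;             // no core is requesting
    int shift = (current + 1) % N;
    uint32_t rot = (shift == 0)
        ? req
        : (((req >> shift) | (req << (N - shift))) & mask);
    int low = std::countr_zero(rot);     // fixed-priority encoder
    return (low + shift) % N;
}
\end{verbatim}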
\section{Strengths and Weaknesses}
\subsection{Strengths}
\begin{itemize}
\item Forwarding saves many cycles, especially when there are more pending instructions; this helps ensure that stalls do not cascade and that the number of pending instructions does not keep increasing.
\item Redundant instructions are not processed, which greatly increases the instructions per cycle. When a redundant instruction is already being processed by the DRAM, its result is rejected by the processor (this also saves cycles by not stalling for the request to finish processing).
\item Write-back happens only when the row buffer is dirty; this saves clock cycles in many cases and decreases the time taken to switch rows by $50\%$.
\item Starvation handling was done, which helped improve the performance of the \textbf{MRM} for a large number of cores and a large number of cycles. As the total number of instructions per core increases, the efficiency of the manager improves.
\item The decision-making operations of the \textbf{MRM} happen independently of core execution, which avoids stalling processor instructions. This also ensures that, on an asymptotic scale, the $rowAccessDelay$ increases only by a factor of $1.5$, while the number of instructions processed per row switch becomes $5\times$.
\item Every core is allocated an individual \textbf{buffer queue}, which ensures that requests can be pushed independently of other cores and that pending requests in other cores do not limit the number of requests a particular core can send.
\end{itemize}
\subsection{Weaknesses}
\begin{itemize}
\item To ensure that all cores have equal priority and are minimally affected by other cores, a lot of additional memory is needed in the \textbf{MRM}. Thus, it has a slightly larger component cost, proportional to $N$.
\item Since the data structure for DRAM requests is somewhat complicated, pushing instructions is relatively slow compared to the other processes executed in the \textbf{MRM}. This is the limiting process for the \textbf{MRM} delays.
\item Our implementation provides sufficient ``bubble time'' to ensure that the \textbf{MRM} operations execute within a single clock cycle. However, more operations could have been squeezed into a single \textbf{MRM} clock cycle by decreasing the ``bubble time''.
\end{itemize}
\section{Test Cases}
\begin{enumerate}
\item \textbf{Test1 -} Initial test case to test multi-core functionality
\item \textbf{Test2 -} Test case to test execution when row changes are needed (no dependent instructions)
\item \textbf{Test3 -} Tests execution when instructions are skipped + stopping a single core on error
\item \textbf{Test4 -} The files consist primarily of store instructions
\item \textbf{Test5 -} Each core respectively: dependent loads, no DRAM, forwarding (reduction of cycles from about 450 cycles to 271 cycles compared to Assignment 4 implementation)
\item \textbf{Test6 -} Unsafe instructions in all cores (forwarding happens in some of the cores)
\item \textbf{Test7 -} Collection of files provided in the Assignment 4 demonstration, run in parallel
\item \textbf{Test8 -} Random test case 1 (random)
\item \textbf{Test9 -} Random test case 2 (negative address)
\item \textbf{Test10 -} Random test case 3 (normal files and erroneous files run in parallel)
\end{enumerate}
\end{document}