-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathcollections.tex
114 lines (100 loc) · 11 KB
/
collections.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
\chapter{Collections in COMPSs}
\label{sec:col}
As we have seen in previous sections and examples, a COMPSs parameter is basically a regular user-code object, as a \verb|numpy.ndarray|, with additional metadata to help the COMPSs Runtime to compute any dependency between tasks induced by this particular object, keep track of its locations, and so on.\\
\\
A very common issue reported by COMPSs users is that the programming model is not able to detect dependencies induced by attributes or contents. Many examples are valid: an array \verb|[object(), some_future_object]|, an instance of a class with some attribute that is a future object... or some object that has been used in a super-object. If the container is used as a COMPSs parameter, no process of the sub-object will ever happen, as the programming model won't know about it, so the user should expect tasks to receive outdated or future objects.\\
\\
The ideal solution, a generic introspection algorithm, is very hard, if not impossible, to implement. Python is a dynamically typed language, some objects can be modified if iterated, many others have no easy way to list its internal attributes, circular references can happen... the list is almost endless. Another obstacle is object reconstruction. Let's consider the following code:
\begin{verbatim}
A = MyClass()
A.attribute = some_pycompss_task()
another_pycompss_task(A)
\end{verbatim}
Ideally, we would like to detect the dependency induced by \verb|A.attribute| with no synchronizations in the master, and then get the full object in the worker. This means that the programming should:
\begin{enumerate}
\item Detect the data dependency (introspection)
\item Ask for \verb|A|, and \verb|A.attribute|
\item Deserialize \verb|A| and \verb|A.attribute|, realize that one object is an attribute of the other, and add it
\end{enumerate}
These steps require a heavy implementation with a noticeable performance impact. For example, a $2000 \times 2000$ \verb|numpy.matrix| can make the programming model iterate through $4000000$ elements unnecessarily.\\
\\
It was decided that support for arrays should be implemented in the COMPSs Programming Model. Given that many COMPSs users find words like \textit{array, hash map, reflection, inheritance} complicated and misleading it was decided to call this feature as \textit{support for collections}, as collection is a word that seemed more understandable by non computer science researchers. This name also gives the opportunity to extend this implementation to other iterable data structures such as sets, hash maps and so on.\\
\\
This feature should cover these two cases:
\begin{verbatim}
L = [future_object_1, future_object_2, ...]
y = f(L) # Future objects should be synced and available
\end{verbatim}
\begin{verbatim}
L = [object_1]
modify_object_1()
f(L) # object_1 should be updated and synced properly
\end{verbatim}
In other words, collections should support both IN and INOUT objects.\\
\\
This feature is especially interesting because its implementation serves two purposes: usability and performance. With no collections users are forced to use some \textit{alternative tricks} such as functions with the signature \verb|f(*args)| to deceive the programming model into believing that it is receiving multiple, distinct arguments instead of an array of them. As we will see later, this particular trick has its own problems and issues.
\section{Collections as Input Parameters}
\label{subsec:col_in}
The first step consists of enabling the COMPSs Programming Model to accept collections composed of input parameters. This step will also help us to identify all the COMPSs components that require some implementation and/or modification when dealing with this feature.\\
\\
The easiest way to implement this feature is with recursion. Note that, in terms of dependencies, if some object $x$ is contained in some collection $C$ then any dependency that affects $C$ must also affect $x$, and vice versa. Let's consider the following pseudo-code:
\begin{verbatim}
c1 = f() # c1 is a future object
C = [c1, ...] # C contains a future object and possibly more things
g(C)
\end{verbatim}
There is a clear dependency between the tasks $g(C)$ and $f()$. The easiest way to implement this is to recursively iterate collection objects and to process each object as a single parameter. This approach offers some room to improve performance. For example, it is not necessary to transfer certain metadata of each single element of a collection, as it can be deduced or inherited from the global collection object. In this first case, it is not necessary to specify that the direction of all the elements $c_1, ..., c_n$ of some input collection $C$ is \verb|IN|, as it can be deduced from the fact that $C$ is an input collection. The same applies with some other information such as locations, and so on.\\
\\
Our chosen approach is something similar to what is called TDD (Test Driven Development) \footnote{https://en.wikipedia.org/wiki/Test-driven\_development}: we wrote a PyCOMPSs app that uses collections as input parameters and now we want to make it work. The source code can be found in appendix \ref{subsec:col_in_program}.
As we mentioned in section \ref{subsec:compss_components}, the design of the programming model forces the developer to go through many layers of the software just to implement a single feature. Any PyCOMPSs parameter will go through the pipeline shown in figure \ref{fig:parameter_pipeline}.
\begin{figure}[ht!]
\centering
\includegraphics[scale = 0.5]{figures/parameter_pipeline.png}
\caption{The journey of a Python parameter, from the user's function call until the task is finished in the worker}
\label{fig:parameter_pipeline}
\end{figure}
Our implementation can be generalized with this pattern:
\inputminted{python}{snippets/collection_pattern.py}
This pattern allows us to implicitly define collections of collections. In fact, a \verb|Depth| field can be defined when decorating a task. This field has a default value of 1, and it determines the allowed levels of recursion before considering any object a regular COMPSs parameter. For example, if \verb|Depth = 2| and a $2 \times 2 \times 2$ matrix is passed as a parameter, COMPSs will interpret it as $2 \times 2$ collection.\\
\\
Following the schema from figure \ref{fig:parameter_pipeline}, a collection parameter will be processed as follows:
\begin{itemize}
\item \textbf{Python} When a collection is detected it is recursively processed. This means that all of its objects are individually identified and serialized. However, we will only forward the collection parameter to the COMPSs Runtime. This parameter will contain an array of identifiers of its contents. This saves us some memory and computing time, as the contents of a collection will inherit many attributes from the collection parameter itself, such as direction, name and so on.
\item \textbf{C/C++} This step is very straightforward as our only concern is to convert the Python String which contains all the identifiers of the contents to a Java Array.
\item \textbf{Runtime/Master (Java)} Parameter registration and dependency calculation are done by recursively calling already existing code (see figure \ref{fig:col_in_graph_example}). However, data transfers are optimized. More specifically, only the collection parameter with the list of its content is marked as a transferrable unit here. This saves us many unnecessary connections between nodes.
\item \textbf{Runtime/Worker (Java)} It receives a collection parameter and is responsible of asking the master for the corresponding resources. This approach allows us to ask for collection parameters lazily, as an eager approach may cause a huge bottleneck if many tasks are simultaneously launched. It also constructs the \textit{collection file structure} from figure \ref{fig:collection_file_hierarchy}, which will be used later by the Python Worker.
\item \textbf{Python Worker} When a collection parameter is received it will simply traverse the file hierarchy from figure \ref{fig:collection_file_hierarchy} and reconstruct the collection, while checking that no file is desearialized twice by using the resource memoizer we have mentioned before.
\end{itemize}
\begin{figure}[ht!]
\centering
\includegraphics[scale = 0.4]{figures/dependency_col_in_graph.png}
\caption{A dependency graph generated by a task with a COLLECTION\_IN parameter with 10 elements. Nodes are tasks, an edge from A to B means that needs some data generated or modified by A. As we can see, COMPSs will see 11 dependencies, making no difference between collections and parameters.}
\label{fig:col_in_graph_example}
\end{figure}
\begin{figure}
\centering
\includegraphics[scale = 0.5]{figures/collection_representation.png}
\caption{The in-worker representation of a collection. If a collection contains another collection, a reference to this file will appear, forming a DAG.}
\label{fig:collection_file_hierarchy}
\end{figure}
We have also implemented a \textit{serialization memoizer} to avoid serializing and deserializing parameters more than once. Let's consider the following code:
\begin{verbatim}
obj1, obj2 = object(), object()
A = [obj1, obj2]
f(obj1, A)
\end{verbatim}
After the improvement from section \ref{sec:task_overhead} all COMPSs objects are identified by their memory address, so we are able to realize that \verb|obj1| and \verb|A[0]| are the same object. We have taken advantage of this knowledge to avoid doing extra serializations and deserializations by keeping track of the objects that are already processed. Some stranger cases such as \verb|f([obj1, [obj1]])| also benefit from it.\\
\section{Collections as INOUT Parameters}
\label{subsec:col_inout}
Once we have collections of input parameters the next logical step consists of allowing COMPSs have collections of parameters that can be updated inside some task. Apart from adding some usable notation we needed to implement some extra steps such as notifying the runtime that the versions of the contents have been updated. The same optimization tricks apply here, as many metadata can be deduced from the parent data structure. Two example applications can be found in appendix \ref{subsec:col_inout_program}.
\section{Practical Applications}
\label{subsec:col_examples}
\input{hyperloglog.tex}
\subsection{Usage of collections in other projects}
\label{subsubsec:col_projects}
The collection feature was very welcome by some other research groups and projects. Some of these research groups are:
\begin{enumerate}
\item Dislib, a distributed computing library highly focused on machine learning on top of PyCOMPSs \footnote{https://www.bsc.es/research-and-development/software-and-apps/software-list/dislib/related-software}
\item DDS (Distributed Data Set) \footnote{https://github.com/bsc-wdc/compss/tree/stable/compss/programming\_model/bindings/python/src/pycompss/dds}, a PyCOMPSs implementation of Google Spark's RDDs. It also allows the user to write Spark-like codes for PyCOMPSs .
\item exaQute (\textbf{Exa}scale \textbf{Q}uantification of Uncertainties for \textbf{Te}chnology and Science Simulation)\footnote{http://exaqute.eu/}.
\end{enumerate}