\subsection{Approximating cardinalities of huge sets}
\label{subsec:wcountproblem}
The Count-Distinct or Word Count problem can be formulated as follows: given a sequence of elements $s_{1}, \ldots, s_{n}$, compute the number of \textbf{distinct} elements in it. For example, for the sequence dog, cat, dog, bird, bird the answer is 3 (the distinct elements are bird, cat, and dog).
If the sequence is not too large, this problem can be solved in expected linear time and space using hash tables, or in $\mathcal{O}(n \log n)$ time and linear space using data structures such as red-black trees. However, this space requirement becomes unacceptable when datasets are very large. In this section we describe a probabilistic algorithm named HyperLogLog \cite{Flajolet07hyperloglog:the}.\\
\\
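As a point of reference, the exact solution is a one-liner over a hash-based set. The following minimal Python sketch (illustrative only, not the code used in our experiments) solves the example above:
\begin{verbatim}
# Exact count-distinct: expected linear time, but also linear space,
# since every distinct element must be kept in the set.
def exact_count_distinct(sequence):
    return len(set(sequence))

assert exact_count_distinct(["dog", "cat", "dog", "bird", "bird"]) == 3
\end{verbatim}
The set grows linearly with the number of distinct elements, which is exactly the cost HyperLogLog avoids.\\
\\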
HyperLogLog is very simple and yet very powerful. The core idea is the following: for each element $s_{i}$ of our sequence, use a hash function $h: \{0, 1\}^{*} \mapsto \{0, 1\}^b$ to compute a value $h(s_{i})$ and estimate the cardinality as $2^m$, where $m$ is the maximum number of leading zeros among all the $h(s_{i})$. Note that if all hash values are equally likely, each with probability $\frac{1}{2^{b}}$, then the probability that a given value has at least $k$ leading zeros is $2^{-k}$. This means that the expected number of observations needed before seeing a value with $k$ leading zeros is $2^{k}$. Since a single hash value is not precise enough and computing multiple hash functions is too expensive, the algorithm proceeds as follows:
\begin{enumerate}
\item Given a token $t$, compute $h(t)$
\item Take the first $p$ bits of $h(t)$ and use them as an index into an array of $2^p$ elements $a_{0}, \ldots, a_{2^p - 1}$
\item Update this position according to the remaining $b - p$ bits so that it keeps the maximum number of leading zeros seen so far
\item Once all tokens are processed, output the harmonic mean of $2^{a_{0}}, \ldots, 2^{a_{2^p - 1}}$ as the answer.
\end{enumerate}
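The four steps above can be sketched compactly in Python. This is a minimal illustration rather than the implementation used in our experiments (see appendix \ref{subsec:hyperloglog_source_code}); the hash function (a truncated SHA-1 digest), the choice $p = 12$, and the helper names are assumptions made only for this example.
\begin{verbatim}
import hashlib

P = 12                 # prefix bits: 2**P registers (illustrative choice)
B = 64                 # total number of hash bits used

def _hash(token):
    # 64-bit value taken from a SHA-1 digest; any uniform hash would do.
    return int.from_bytes(hashlib.sha1(token.encode()).digest()[:8], "big")

def _leading_zeros(value, bits):
    # Leading zeros of `value` when written as a `bits`-bit string.
    return bits - value.bit_length()

def hll_registers(tokens):
    # Steps 1-3: hash, route by the first P bits, and keep the maximum
    # number of leading zeros of the remaining B - P bits per position.
    registers = [0] * (1 << P)
    for token in tokens:
        h = _hash(token)
        index = h >> (B - P)
        rest = h & ((1 << (B - P)) - 1)
        registers[index] = max(registers[index],
                               _leading_zeros(rest, B - P))
    return registers

def hll_estimate(registers):
    # Step 4: harmonic mean of 2**a_i.  The published algorithm further
    # multiplies by the number of registers and a bias-correction constant.
    return len(registers) / sum(2.0 ** -a for a in registers)
\end{verbatim}
Calling \verb|hll_estimate(hll_registers(tokens))| on a stream of tokens approximates the number of distinct tokens while storing only $2^{p}$ small counters.\\
\\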
An interesting fact is that while we need $\mathcal{O}(\log n)$ bits in our hash values to be able to count up to $n$, we only need $\mathcal{O}(\log \log n)$ bits to store the number of leading zeros of such a hash value. This is where the name HyperLogLog comes from.\\
\\
A very nice property of HyperLogLog is that two distinct runs on two different datasets can be merged, provided they used the same parameters (hash function, $b$, and $p$), in such a way that the result approximates the cardinality of the union of the two datasets. Given two register arrays $a$ and $b$, each corresponding to a run of HyperLogLog, we can obtain a fictional run $c$ that represents the union of both datasets by computing $c_{i} = \max(a_{i}, b_{i})$ for each of the $2^{p}$ positions. This makes sense, as it produces exactly the same registers as a single HyperLogLog run over the concatenation of the two datasets. This property allows us to parallelize or distribute the algorithm, giving us a potential performance boost. The source code we use for our experiments can be found in appendix \ref{subsec:hyperloglog_source_code}.\\
\\
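Expressed in code, the merge is a single element-wise maximum over the two register arrays (again only a sketch, reusing the hypothetical helpers from the previous listing):
\begin{verbatim}
def hll_merge(registers_a, registers_b):
    # Both runs must have used the same hash function, b and p, so the
    # register arrays have the same length and are directly comparable.
    assert len(registers_a) == len(registers_b)
    return [max(x, y) for x, y in zip(registers_a, registers_b)]

# Approximate the cardinality of the union without re-reading either dataset:
# hll_estimate(hll_merge(hll_registers(dataset_1), hll_registers(dataset_2)))
\end{verbatim}
Each worker can therefore build its own register array independently, and only these small arrays need to be combined at the end.\\
\\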
Note that this application is a classical \textit{map-reduce} workflow. Without collections we are forced to implement any reduce function as \verb|reduce(f, *args)|. Each extra argument is passed as an input parameter via socket and pipe, which implies a huge overhead. With collections, only the collection object and the list of identifiers are transferred; the other properties of the contents, such as direction, locations, and so on, are deduced or requested on the destination node.\\
\\
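As an illustration, the two task declarations could look roughly as follows. This is a hedged sketch of the PyCOMPSs interface as we understand it; the function names and the merging logic are placeholders and do not correspond to the actual experiment code.
\begin{verbatim}
from pycompss.api.task import task
from pycompss.api.parameter import COLLECTION_IN

def merge_registers(partials):
    # Element-wise maximum over all partial HyperLogLog register arrays.
    result = list(partials[0])
    for regs in partials[1:]:
        result = [max(x, y) for x, y in zip(result, regs)]
    return result

# Classical reduce: every partial result is a separate task parameter,
# so its metadata travels through the socket and pipe one by one.
@task(returns=1)
def reduce_star(*partials):
    return merge_registers(partials)

# Collection-based reduce: a single collection parameter is declared and
# only the collection object and the list of identifiers are transferred.
@task(returns=1, partials=COLLECTION_IN)
def reduce_collection(partials):
    return merge_registers(partials)
\end{verbatim}
Both tasks compute the same result; they differ only in how the partial results are declared and transferred.\\
\\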
The elimination of this overhead is noticeable even with a very small number of parameters. As we can see in figure \ref{fig:collection_vs_normal}, the collection feature reduces the overhead drastically. An important observation is that a PyCOMPSs task of the form \verb|f(*args)| usually starts to misbehave and crash when more than $60$ arguments are passed, since each argument carries metadata that must be transferred via socket and pipe.
\begin{figure}[ht!]
\centering
\includegraphics[scale = 0.5]{figures/collection_vs_normal.png}
\caption{Execution time of the reduce functions with and without collections. Each point is the average of 5 executions. Although the samples are noisy, since they are small, a consistent improvement from the collection feature can be appreciated. The non-collection versions started to crash and show strange behaviour at around 60 parameters.}
\label{fig:collection_vs_normal}
\end{figure}\\
\\
A comparison between the amount of metadata generated and sent by the classical reduce implementation and by the collections version can be found in appendix \ref{subsec:reduce_data_comparison}.
This improvement benefits many applications, as the map-reduce scheme is very common.