-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path1_introduzione.tex
125 lines (86 loc) · 7.41 KB
/
1_introduzione.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
\chapter{Introduction}
With the spread of the Internet and more importantly of the social networks, efficient data analysis on graphs becomes increasingly important.
Graphs are a powerful data structure that natural model the interactions between objects.
\section{Basic definitions}
\begin{definizione}\label{def:graph}
A \textbf{graph} is a pair of sets $G=(V,E)$, where $V$ is the set of vertices (or nodes) and $E \subseteq V \times V$ is the set of edges.
\end{definizione}
If two vertices $u, v \in V$ are connected by an edge they are called the extremes of the edge, in this case we denote the edge with the pair $(u, v) \in E$.
If $(u,v) \in E \Leftrightarrow (v,u) \in E$ the graph is called undirected; wherever not specified we will only deal with undirected graphs.
A sequence of nodes $v_{1}, v_{2}, \ldots, v_{k}$ is called path if $(v_{i}, v_{i+1}) \in E$ for $i = 1, \ldots k-1$; a path is called simple if $v_{i} \neq v_{j}$ for all $i,j$ such that $1 \leq i < j \leq k$. A cycle is a path where $(v_{k}, v_{1}) \in E$.
We denote by $N(u) = \{ v : (u,v) \in E \}$ the set of neighbors of the vertex $u$. The cardinality of this set is called degree of $u$ (i.e. \textit{deg} $u$ = $|N(u)|)$.
With $N^{<k}(u)$ we indicate the set of vertices connected to $u$ by a simple path of length less than $k$ (note that $N(u)$ = $N^{<2}(u)$). We can extend the previous definition to a set of vertices $V' \subseteq V$ as $N^{<k}(V') = \cup_{v \in V'}{ N^{<k}(v) }$.
\begin{definizione}\label{def:subgraph}
A graph $G' = (V', E')$ is called \textbf{subgraph} of $G=(V,E)$ if $V' \subset V$ and $E' \subset E$. A subgraph is called \textbf{induced subgraph} if $E' = (V' \times V') \cap E$.
\end{definizione}
We use $G' \subset G$ to indicate that the graph $G'$ is a subgraph of $G$ and $G' < G$ to indicate that the graph $G'$ is an induced subgraph of $G$.\medskip
Note that an induced subgraph $G' = (V', E')$ can be uniquely identified by the set of its vertices $V'$.
\begin{definizione}\label{def:labeledgraph}
A \textbf{labeled graph} is a triple $(V,E,L)$ where $(V,E)$ is a graph and $L : V \rightarrow \Sigma$
is a function that assigns a symbol of the alphabet $\Sigma$ to every node $v$. We call $L(u) \in \Sigma$ the label of the node $u$.
\end{definizione}
Given a path $\pi = v_{1}, v_{2}, \ldots, v_{k}$ we extend the function $L$ and define the label sequence of $\pi$ the string $L(\pi) = L(v_{1}) L(v_{2}) \ldots L(v_{k}) \in \Sigma^{k}$ obtained by the concatenation of the labels of the nodes in the path.\medskip
In this thesis, we mainly focus on analyzing complex networks: special graphs with a non-trivial topology, different from random graphs. Complex networks occurs in graphs that model real system like social networks or computer networks, and are characterized by a specific structural feature:
\begin{definizione}\label{def:power-law-graph}
We define as \textbf{power-law degree distribution} a network where the degree of the nodes $u$ follow, for some $\gamma$ (usually $2 < \gamma < 3$), the probability distribution:
\begin{equation}
P(deg(u) = k) \sim k^{-\gamma}
\end{equation}
\end{definizione}
\begin{figure}[h]
\centering
\begin{minipage}[t]{.45\textwidth}
\centering
\includegraphics[width=5cm,height=3cm]{figure/figure-1-1}
\caption{Degree distribution of a random network}
\end{minipage}\hfill
\begin{minipage}[t]{.45\textwidth}
\centering
\includegraphics[width=5cm,height=3cm]{figure/figure-1-2}
\caption{Degree distribution of a complex network}
\end{minipage}
\end{figure}
\begin{figure}[h]
\centering
\begin{minipage}[t]{.45\textwidth}
\centering
\includegraphics[width=3cm,height=3cm]{figure/figure-1-3}
\caption{Random network}
\end{minipage}\hfill
\begin{minipage}[t]{.45\textwidth}
\centering
\includegraphics[width=2.8cm,height=3cm]{figure/figure-1-4}
\caption{Complex network}
\end{minipage}
\end{figure}
\section{The problem}
\begin{problema}
Given an undirected labeled graph $G=(V,E,L)$ over an alphabet $\Sigma$, an integer $q$
and two set of nodes $V_{1}, V_{2} \subset V$, we want to estimate the similarity between the two induced subgraphs $V_{1}, V_{2} < G$, based on the labels frequency of simple paths with nodes in $V_{1} \cup N^{<q}(V_{1})$ and $V_{2} \cup N^{<q}(V_{2})$ that ends in $V_1$ and $V_2$.
\end{problema}
We will talk about a more formal and rigorous definition of subgraph similarity in chapter 2.\medskip
In the definition we use $V_{1} \cup N^{<q}(V_{1})$ and $V_{2} \cup N^{<q}(V_{2})$ instead of simply $V_{1}$ and $V_{2}$ because, in a complex graph, we also want to keep in mind the interaction between each subgraph and the rest of the graph.\medskip
The difficulty we must face is that, in a complex network, the labels can exponentially explode for increasing values of $q$ and $|\Sigma|$, to $|\Sigma|^{q} \gg |V|$. Even worse, the number of simple paths can exponentially explode to $|V|^{q}$. The simple reason is that in complex networks the average separation is very low (the famous idea of \textit{six degrees of separation}).\medskip
In this thesis, we exploit the problem using randomized techniques and parallelization, which makes the problem suitable even for large networks.
\section{Practical applications}
The problem presented can be applied to a lot of contexts.
We illustrate some examples of practical application by referring to famous websites,
in order to better understand.
In all the examples we assume to work on a \textit{friendship graph} where the nodes are the registered users and
the edges are the friendship relations between two users.
\paragraph*{Netflix} In order to produce and translate the right television series, we want to estimate the similarity between two geographical groups of viewers, where each viewer is labeled with its favorite film genre.
\paragraph*{Amazon} In order to improve advertising campaigns, we want to estimate the similarity between two sets of clients of the same age group, where each client is labeled with the category of objects he buys more frequently.
\paragraph*{Facebook} In order to suggest new friends, we want to estimate the similarity between two users, where each user is labeled with its favorite musical genres.
\clearpage
\section{Thesis organization}
The thesis is divided into five chapters and one appendix, where
the topics covered are respectively the following ones.\medskip
\begin{itemize}
\item Chapter 1: \textbf{Introduction}, where we present some basic definitions in order to introduce the problem we are analyzing.
\item Chapter 2: \textbf{Basic tools}, where we present the definition of some similarity indices already existing in the literature, along with two methods we will use.
\item Chapter 3: \textbf{Computation of subgraph similarity}, where we give a description of different approaches to compute the subgraph similarity.
\item Chapter 4: \textbf{Project development}, where we perform implementation choices and steps of the project development with some experimental results.
\item Chapter 5: \textbf{Conclusion and future works}, where we draw some conclusions from theoretical and practical analysis of the presented approaches, with some hypothesis of future works.
\item Appendix A: \textbf{Code snippets}, where we give some important code snippets of the project used for the experimental results.
\end{itemize}
The results in this thesis are joint work with Alessio Conte, Roberto Grossi, Andrea Marino, Kunihiko Sadakane and Takeaki Uno~\cite{SubSim}.\medskip