\documentclass[8pt,fleqn,openany]{book}
\usepackage[utf8]{inputenc}
\usepackage[top=3cm,bottom=3cm,left=3.2cm,right=3.2cm,headsep=16pt,letterpaper]{geometry}
\usepackage{xcolor}
\definecolor{defaultcolor}{RGB}{38,131,255}
\definecolor{ForestGreen}{RGB}{34,139,34}
\usepackage{tabulary}
\usepackage{hyperref} \def\UrlBreaks{\do\/\do-}
\usepackage{graphicx}
\usepackage[english]{babel}
\usepackage{listings}
\usepackage{listings-golang}
\usepackage{listings-protobuf}
\usepackage{color}
\usepackage{tikz}
\usepackage[labelfont=it,textfont={it},singlelinecheck=on,justification=centering]{caption}
\usepackage{amsmath}
\usepackage{float}
\usepackage{comment}
\usepackage{cite}
\usepackage{placeins}
\usepackage{pdfrender}
\usepackage{csquotes}
\usepackage{inconsolata}
\usepackage[defaultfam,light]{montserrat}
\usepackage[italic]{mathastext}
\setlength{\parskip}{1em}
\urlstyle{rm}
\renewcommand{\rmdefault}{Montserrat-LF}
\renewcommand{\baselinestretch}{1.15}
\setlength{\emergencystretch}{3pt}
\newcommand{\code}[1]{{\em #1}}
\newcommand{\semibold}[1]{\textpdfrender{TextRenderingMode=FillStroke,LineWidth=.1pt,FillColor=black}{#1}}
\newcommand{\TODO}[1]{\textbf{\color{red} TODO: #1}}
\lstset{
basicstyle=\footnotesize\tt,
keywordstyle=\color{blue},
numbers=left,
numbersep=5pt,
showstringspaces=false,
stringstyle=\color{red},
commentstyle=\color{ForestGreen},
tabsize=2,
language=Golang
}
\title{\textbf{\sffamily\color{white} \
Storj: A Decentralized Cloud Storage Network Framework}}
\author{\small\sffamily\color{white}
Storj Labs, Inc.}
\date{\small\sffamily\color{white}
October 30, 2018\\v3.0\\
March 7, 2024\\v3.1\\
\small\colorlet{urllinkcolor}{white}\url{https://github.com/storj/whitepaper}
}
\input{structure}
\begin{document}
\raggedbottom
\widowpenalty10000
\clubpenalty10000
\thispagestyle{fancy}
\chapterimage{images/header.eps}
\AddToShipoutPicture*{\put(0,0){\includegraphics[width=\paperwidth]{images/front.eps}}}
\maketitle
\nobreakspace
\vfill{\footnotesize\noindent{}Copyright \textcopyright\ 2024 Storj Labs, Inc.
and Subsidiaries\\
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0
license (CC BY-SA 3.0).\\\\
All product names, logos, and brands used or cited in this document are
property of their respective owners. All company, product, and service names
used herein are for identification purposes only. Use of these names, logos,
and brands does not imply endorsement.}
\newpage
\tableofcontents\newpage
\section{Abstract}
Decentralized cloud storage represents a fundamental shift in
the efficiency and economics of large-scale storage.
Eliminating central control allows users to store and share data
without reliance on a third-party storage provider. Decentralization mitigates
the risk of data failures and outages while simultaneously increasing
the security and privacy of object storage. It also
allows market forces to optimize for less expensive storage
at a greater rate than any single provider could afford.
Although there are many ways to build such a system, there are some specific
responsibilities any given implementation should address.
Based on our experience with petabyte-scale
storage systems, we introduce a modular framework for considering these
responsibilities and for building our distributed storage network.
Additionally, we describe an initial
concrete implementation for the entire framework.
\section{Contributors}
This paper represents the combined efforts of many individuals.
Contributors affiliated with Storj Labs, Inc. include but are not limited to:
Tim Adams,
Kishore Aligeti,
Cameron Ayer,
Atikh Bana,
Alexander Bender,
Stefan Benten,
Maximillian von Briesen,
Paul Cannon,
Gina Cooley,
Dennis Coyle,
Egon Elbre,
Nadine Farah,
Patrick Gerbes,
John Gleeson,
Ben Golub,
James Hagans,
Jens Heimbürge,
Faris Huskovic,
Philip Hutchins,
Brandon Iglesias,
Viktor Ihnatiuk,
Jennifer Johnson,
Kevin Leffew,
Alexander Leitner,
Richard Littauer,
Dylan Lott,
JT Olio,
Kaloyan Raev,
Garrett Ransom,
Matthew Robinson,
Jon Sanderson,
Benjamin Sirb,
Dan Sorensen,
Helene Unland,
Natalie Villasana,
Bryan White,
and Shawn Wilkinson.
We'd also like to thank the other authors and contributors of the
previous Storj and Metadisk white papers:
Tome Boshevski,
Josh Brandoff,
Vitalik Buterin,
Braydon Fuller,
Gordy Hall,
Jim Lowry,
Chris Pollard,
and James Prestwich.
We'd like to especially thank
Petar Maymounkov,
Anand Babu Periasamy,
Tim Kosse,
Roberto Galoppini,
Steven Willoughby,
and Aaron Boodman
for their helpful review of and contributions to an early draft of this paper.
We would like to acknowledge the efforts, white papers, and communications of
others in the distributed computing, blockchain, distributed storage, and
decentralized storage space, whose work has informed our efforts. A more
comprehensive list of sources is in the bibliography, but we would like to
provide particular acknowledgement for the guidance and inspiration provided
by the teams that designed and built
Allmydata,
Ceph,
CoralCDN,
Ethereum,
Farsite,
Filecoin,
Freenet,
Gluster,
GFS,
Hadoop,
IPFS,
Kademlia,
Lustre,
Maidsafe,
Minio,
MojoNation,
OceanStore,
Scality,
Siacoin,
and Tahoe-LAFS.
Finally, we extend a huge thank you to everyone we talked to during the
design and architecture of this system for their valuable thoughts, feedback,
input, and suggestions.
Please address correspondence to \href{mailto:paper@storj.io}{paper@storj.io}.
\section{Changelog}
This section describes updates from the past editions of this white paper.
Beyond a few trivial wording tweaks, we changed the following aspects in version 3.1:
\begin{itemize}
\item Clarified {\em encryption blocks} in section \ref{data-definitions}.
\item Replaced Kademlia for storage node discovery with a direct node-to-satellite
indication of network participation (section \ref{sec:concrete-node-discovery}).
\item Simplified the Audits service and containment mode (section \ref{sec:concrete-audits}).
\item Added the ability for storage node operators to select the Satellites they would like to work with,
eliminating the need for Satellite vetting and opt-out (section \ref{sec:concrete-satellite-reputation}).
\item Removed section 4.21 {\em Quality Control and Branding} about obsolete branding ideas.
\item Updated appendix \ref{chap:attacks} about anticipated attacks to reflect the removal of Kademlia.
\end{itemize}
With these changes in mind, we expect that this paper once again matches our
in-production service at the time of publication.
For a more detailed changelog, please see
\url{https://github.com/storj/whitepaper/compare/v3.0-merged...v3.1}.
\chapter{Introduction}\label{chap:intro}
The Internet is a massive decentralized and distributed network consisting of
billions of devices which are not controlled by a single group or entity.
Much of the data currently available through the Internet is quite centralized
and is stored with a handful of technology companies that have the
experience and capital to build massive data centers capable of handling this
vast amount of information.
A few of the challenges faced by data centers are: data breaches, periods of
unavailability on a grand scale, storage costs, and expanding and upgrading
quickly enough to meet user demand for faster data and larger formats.
Decentralized storage has emerged as an answer to the challenge of
providing a performant, secure, private, and economical cloud storage solution.
Decentralized storage is better positioned to achieve these outcomes as the
architecture has a more natural alignment to the decentralized architecture of
the Internet as a whole, as opposed to massive centralized data centers.
News coverage over the past few years has shown that the frequency of
data breaches increased by as much as a factor of 10
between 2005 and 2017 \cite{breaches}.
Decentralized storage's process of protecting data makes data breaches more
difficult than current methods used by data centers while, at the same time,
costing less than current storage methods.
This model can address the rapidly
expanding amount of data for which current solutions struggle.
With 44 zettabytes of data anticipated to exist by 2020 and a
market expected to grow to \$92 billion USD in the same time frame
\cite{storage-growth}, we have identified several key market segments that decentralized cloud storage
has the potential to address.
As decentralized cloud storage capabilities evolve, the technology will be able to address a much wider range of use cases, from basic
object storage to content delivery networks (CDNs).
Decentralized cloud storage is rapidly advancing in maturity, but its evolution
is subject to a specific set of design constraints which
define the overall requirements and implementation of
the network. When designing a distributed storage system,
there are many parameters to be optimized such as speed, capacity,
trustlessness, Byzantine fault tolerance, cost, bandwidth, and latency.
We propose a framework that scales horizontally to exabytes of data storage
across the globe. Our system, the Storj Network, is a robust object store
that encrypts, shards, and distributes data to nodes around the world for
storage.
Data is stored and served in a manner purposefully designed to prevent
breaches.
In order to accomplish this task, we've designed our system to be modular,
consisting of independent components with task-specific jobs.
We've integrated these components to implement a decentralized object storage
system that is not only secure, performant, and reliable but also significantly
more economical than either on-premise or traditional, centralized cloud storage.
We have organized the rest of this paper into six additional
chapters. Chapter \ref{chap:design-constraints} discusses the design space
in which Storj operates and the specific constraints on which
our optimization efforts are based.
Chapter \ref{chap:framework} covers our framework. Chapter
\ref{chap:concrete}
describes the concrete implementation of the framework,
while chapter \ref{chap:walkthroughs} explains what happens
during each operation in the network. Chapter
\ref{chap:future-work} covers future work.
Finally, chapter \ref{chap:selected-calcs} covers selected calculations.
\chapter{Storj design constraints}\label{chap:design-constraints}
Before designing a system, it's important to first define its requirements.
There are many different ways to design a decentralized storage system. However,
with the addition of a few requirements, the potential design space shrinks
significantly.
Our design constraints are heavily influenced by our product and
market fit goals.
By carefully considering each requirement, we ensure the framework
we choose is as universal as possible, given the constraints.
\section{Security and privacy}
Any object storage platform must ensure both the privacy and
security of data stored regardless of whether it is centralized or decentralized.
Decentralized storage platforms must mitigate an additional layer of
complexity and risk associated with the storage of data on inherently
untrusted nodes. Because decentralized storage platforms cannot take many
of the same shortcuts data center based approaches can (e.g. firewalls, DMZs,
etc.), decentralized storage must be designed from the ground up to support
not only end-to-end encryption but also enhanced security and privacy at all levels of the
system.
Certain categories of data are also subject to specific regulatory compliance.
For example, the United States legislation for
the Health Insurance Portability and
Accountability Act (HIPAA) has specific requirements for data center
compatibility. European countries have to consider the General Data Protection
Regulation (GDPR) regarding
how individual information must be protected and secured.
Many customers outside of the United States may feel they have significant
geopolitical reasons to consider storing data in a way that limits the ability
for US-based entities to impact their privacy \cite{kopano}.
There are many other regulations in other sectors regarding users' data privacy.
Customers should be able to
evaluate that our software is implemented correctly, is resistant to
attack vectors (known or unknown), is secure, and otherwise fulfills all
of the customers' requirements.
The code for the Storj network is open source software and
provides the level of transparency and assurance needed to prove that the
behaviors of the system are as advertised.
\section{Decentralization}\label{sec:decentralization}
Informally, a decentralized application is a service that has no single
operator. Furthermore, no single entity should be solely responsible for the
cost associated with running the service or be able to cause a service
interruption for other users.
One of the main motivations for preferring decentralization is to drive
down infrastructure costs for maintenance, utilities, and bandwidth.
We believe that there
are significant underutilized resources at the edge of the network
for many smaller operators. In our experience building decentralized
storage networks, we have found a long tail of resources that are presently
unused or underused that could provide affordable and
geographically distributed cloud storage. Conceivably, some small operator
might have access to less-expensive electricity than standard data centers or another small
operator could have access to less-expensive cooling. Many of these small operator
environments are not substantial enough to run an entire datacenter-like
storage system. For example, perhaps a small business or home Network Attached
Storage (NAS) operator has
enough excess electricity to run ten hard drives but not more. We have found that
in aggregate, enough small operator environments exist such that their
combination over the internet constitutes
significant opportunity and advantage for less-expensive and faster storage.
Our decentralization goals for fundamental infrastructure, such as storage, are also driven by our desire to provide a viable alternative to the few major
centralized storage entities who dominate the market at present.
We believe that there exists inherent risk in trusting a single entity,
company, or organization with a significant percentage of the world's data.
In fact, we believe that there is an implicit cost associated with the risk of
trusting any third party with custodianship of personal data.
Some possible costly outcomes include changes to the company's roadmap that make
the product less useful, changes to the company's position on data collection
that lead it to sell customer metadata to advertisers, or the company going
out of business or otherwise failing to keep customer data safe.
By creating an equivalent or better decentralized
system, many users concerned about single-entity risk will have a viable
alternative.
With decentralized architecture, Storj could cease operating and the data
would continue to be available.
We have decided to adopt a decentralized architecture because, despite the trade-offs, we believe decentralization better addresses the needs of cloud storage
and resolves many core limitations, risks, and cost factors that result from
centralization. Within this context,
decentralization results in a globally distributed network that can
serve a wide range of storage use cases from archival to CDN. However,
centralized storage systems require different architectures, implementations,
and infrastructure to address each of those same use cases.
\section{Marketplace and economics}
Public cloud computing, and public cloud storage in particular, has
proven to be an attractive business model for the large centralized cloud
providers. Cloud computing is estimated to be a \$186.4 billion market
in 2018, and is expected to reach \$302.5 billion by 2021 \cite{gartner-cloud-growth}.
The public cloud storage model has provided a compelling economic model to end
users. Not only does it enable end users to scale on demand but also allows them to avoid the significant fixed costs of facilities, power, and data center
personnel. Public cloud storage has generally proven to be an economical,
durable, and performant option for many end users when compared to
on-premise solutions.
However, the public cloud storage model has, by its nature, led to a high
degree of concentration. Fixed costs are borne by the network operators, who
invest billions of dollars in building out a network of data centers and
then enjoy significant economies of scale. The combination of large upfront
costs and economies of scale means that there is an extremely limited number
of viable suppliers of public cloud storage (arguably, fewer than five major
operators worldwide). These few suppliers are also the primary beneficiaries of
the economic return.
We believe that decentralized storage can provide a viable alternative to
centralized cloud.
However, to encourage partners or customers to bring data to the network,
the price charged for storage and bandwidth---combined with the other
benefits of decentralized storage---must be
more compelling and economically beneficial than competing storage solutions.
In our design of Storj, we seek to create an economically advantageous
situation for four different groups:
\begin{description}
\item[End users] - We must provide the same economically compelling
characteristics of public cloud storage: no upfront costs and the ability to
scale on demand.
In addition, end users must experience meaningfully better value for given
levels of capacity, durability, security, and performance.
\item[Storage node operators] - It must be economically attractive for storage
node operators to help build out the network.
They must be paid fairly, transparently, and be able to make a
reasonable profit relative to any marginal costs they incur.
It should be economically advantageous to be a storage
node operator not only by utilizing underused capacity but also by creating
new capacity, so that we can grow the network beyond the capacity that
currently exists.
Since node availability and reliability have a large impact on network
availability, cost, and durability, it is required that storage node
operators have sufficient incentive to maintain reliable and continuous
connections to the network.
\item[Demand providers] - It must be economically attractive for developers and
businesses to drive customers and data onto the Storj network. We must design
the system to fairly and transparently deliver margin to partners. We believe
that there is a unique opportunity to provide open-source software (OSS)
companies and projects, which drive over two-thirds of the public cloud workloads
today without receiving direct revenue, a source of sustainable revenue.
\item[Network operator] - To sustain continued investment in code,
functionality, network maintenance, and demand generation, the network
operator, currently Storj Labs, Inc., must be able to retain a reasonable profit. The operator must maintain this profit while
not only charging end users less than the public cloud providers but also margin sharing
with storage node operators and demand providers.
\end{description}
Additionally, the network must ensure efficient, timely billing
and payment processes as well as regulatory compliance for tax and other reporting.
To be as globally versatile as possible with payments, our network must be robust enough to accommodate several types of transactions (such as cryptocurrency, bank payments, and other forms of barter).
Lastly, the Storj roadmap must be aligned with the economic drivers of the
network.
New features and changes to the concrete implementations of framework
components must be driven by applicability to specific object storage use cases
and the relationship between features and performance to the price of storage
and bandwidth relative to those use cases.
\section{Amazon S3 compatibility}\label{constraint-amazon}
At the time of this paper's publication, the most widely deployed public cloud
is Amazon Web Services \cite{aws-dominates}. Amazon Web Services not only
is the largest cloud services ecosystem but also has the benefit of first mover
advantage. Amazon's first cloud services product was Amazon Simple Storage
Service, or Amazon S3 for short. Public numbers are hard to come by but
Amazon S3 is likely the most widely deployed cloud storage protocol in existence.
Most cloud storage products provide some form of compatibility with the
Amazon S3 application programming interface (API) architecture.
Our objective is to aggressively compete in the wider cloud
storage industry and bring decentralized cloud storage into the mainstream.
Until a decentralized cloud storage protocol becomes widely adopted,
Amazon S3 compatibility creates a graceful transition path from centralized
providers by alleviating many switching costs for our users.
To achieve this, the Storj implementation allows
applications previously built against Amazon S3 to work with Storj with
minimal friction or changes.
S3 compatibility adds aggressive requirements for feature set, performance, and
durability.
At a bare minimum, this requires the methods described in
Figure \ref{fig:s3-api-code} to be implemented.
\begin{figure}[!htbp]
\lstset{language=Golang}
\begin{lstlisting}
// Bucket operations
CreateBucket(bucketName)
DeleteBucket(bucketName)
ListBuckets()
// Object operations
GetObject(bucketName, objectPath, offset, length)
PutObject(bucketName, objectPath, data, metadata)
DeleteObject(bucketName, objectPath)
ListObjects(bucketName, prefix, startKey, limit, delimiter)
\end{lstlisting}
\caption{Minimum S3 API}
\label{fig:s3-api-code}
\end{figure}
The Storj service supports the majority of the S3 protocol via two different integration patterns that emerged from specific customer requirements:
\begin{description}
\item[Single tenant gateway] - a self-hosted, end-to-end encrypted S3 compatible gateway where the customer's equipment is responsible for encryption, erasure coding and direct peer-to-peer transmission of data to storage nodes, including data expansion from erasure coded redundancy and long tail mitigation.
\item[Multi-tenant gateway] - a hosted S3 compatible gateway where a trusted provider is responsible for encryption, erasure coding, and transmission of data to storage nodes. This gateway requires a key from the customer as part of every request, since the customer's encryption keys are not preserved in the hosted environment.
\end{description}
\section{Durability, device failure, and churn}\label{sec:constraint-churn}
A storage platform is useless unless it also functions as a retrieval platform.
For any storage platform to be valuable, it must be careful not to lose the
data it was given, even in the presence of a variety of possible failures within
the system. Our system must store data with high durability and have negligible
risk of data loss.
For all devices, component failure is a guarantee.
All hard drives fail after enough wear
\cite{backblaze-hd-2018-q1} and servers providing network access to
these hard drives will also eventually fail. Network links may die, power
failures could cause havoc sporadically,
and storage media become unreliable over time.
Data must be stored with enough redundancy to recover from
individual component failures.
Perhaps more importantly, no data can be left in a single location
indefinitely. In such an environment, redundancy, data
maintenance, repair, and replacement of lost redundancy must be considered
inevitable, and the system must account for these issues.
Furthermore, decentralized systems are susceptible to high churn rates
where participants join the network and then leave for
various reasons, well before their hardware has actually failed.
For instance, Rhea {\em et al.} found that in many real world peer-to-peer
systems, the median time a participant lasts in the network ranges from hours
to mere minutes \cite{dht-churn}.
Maymounkov {\em et al.} found that the probability of a node staying connected
to a decentralized network for an additional hour is an {\em increasing}
function of uptime (Figure \ref{fig:kad-uptime}) \cite{kad}.
In other words, nodes that have been online for a long time are less likely
to contribute to overall node churn.
Churn could be caused by any number of factors. Storage nodes
may go offline due to hardware or software failure, intermittent internet
connectivity, power loss, complete disk failure, or software shutdown or
removal. The more
network churn that exists, the more redundancy is required to make up for
the greater rate of node loss. The more redundancy that is required, the more
bandwidth is needed for correct operation of the system. In fact, there is
a tight relationship between network churn, additional redundancy,
and bandwidth availability \cite{pick2-churn}. To
keep background bandwidth usage and redundancy low, our network must have
low network churn and a strong incentive to favor long-lived, stable
nodes.
\begin{figure}[!htbp]
\centering
\includegraphics[width=.6\textwidth]{images/uptime.png}
\caption{Probability of remaining online an additional hour as a function of
uptime.
The $x$ axis represents minutes. The $y$ axis shows the fraction of nodes
that stayed online at least $x$ minutes that also stayed online at least
$x+60$ minutes. Source: Maymounkov {\em et al.} \cite{kad}}
\label{fig:kad-uptime}
\end{figure}
See section \ref{sec:RS-sim} and Blake {\em et al.} \cite{pick2-churn} for a
discussion of how repair bandwidth varies as a function of node churn.
\section{Latency}
Decentralized storage systems can potentially capitalize on
massive opportunities for parallelism.
Some of these opportunities include increased transfer rates, processing
capabilities, and overall throughput even when individual
network links are slow. However, parallelism cannot, by itself, improve {\em
latency}. If an individual network link is utilized as part of an operation,
its latency will be the lower bound for the overall operation.
Therefore, any distributed system
intended for high performance applications must continuously and aggressively
optimize for low latency not only on an individual process scale but also for
the system's entire architecture.
\section{Bandwidth}\label{sec:req-bandwidth}
Global bandwidth availability is increasing year after year. Unfortunately,
access to
high-bandwidth internet connections is unevenly distributed across the world.
While some users can easily access symmetric, high-speed, unlimited bandwidth
connections, others have significant difficulty obtaining the same type of
access.
In the United States and other countries,
the method in which many residential internet service providers (ISPs)
operate presents two specific challenges for designers of a
decentralized network protocol. The first challenge is
the asymmetric internet connections offered by many ISPs.
Customers subscribe to internet service
based on an advertised download speed, but the upload speed is potentially an
order of magnitude or two slower. The second challenge is that bandwidth is
sometimes ``capped'' by the ISP at a fixed amount of allowed traffic per month.
For example, in many
US markets, the ISP Comcast imposes a one terabyte per month bandwidth cap
with stiff fines for customers who go over this limit \cite{comcast-cap}.
An internet connection with a cap of 1 TB/month cannot average more than
385 KB/s over the month without exceeding the monthly bandwidth allowance,
even if the ISP advertises speeds of 10 MB/s or higher.
Such caps impose
significant limitations on the bandwidth available to the network
at any given moment.
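For reference, the arithmetic behind that figure is a quick sanity check (our
own, assuming a 30-day month and decimal units):
\begin{align*}
&& \frac{10^{12}\ \text{bytes}}{30 \times 86{,}400\ \text{s}} \approx 385{,}800\ \text{bytes/s} \approx 385\ \text{KB/s} &&
\end{align*}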
With device failure and churn guaranteed, any decentralized system will have a
corresponding amount of repair traffic. As a result, it is important to account
for the bandwidth required not only for data storage and retrieval but also
for data maintenance and repair \cite{pick2-churn}. Designing a
storage system that is careless with bandwidth usage would not only give undue
preference to storage node operators with access to unlimited high-speed
bandwidth but also centralize the system to some degree. In order to keep the storage
system as decentralized as possible and working in as many environments
as possible, bandwidth usage must be aggressively minimized.
Please see section \ref{sec:bandwidth-space-limits} for a discussion on how
bandwidth availability and repair traffic limit usable space.
\section{Object size}
We can broadly classify large storage systems into two groups by average
object size. To differentiate between the two groups, we classify a ``large'' file as a
few megabytes or greater in size. A database is the
preferred solution for storing many small pieces of information,
whereas an object store or file system is ideal for storing many large files.
The initial product offering by Storj Labs is designed to function primarily as
a decentralized object store for larger files.
While future improvements may enable
database-like use cases, object storage is the predominant initial use case described in
this paper. We made protocol design decisions with the assumption that the
vast majority of stored objects will be 4MB or larger. While smaller files are
supported, they may simply be more costly to store.
It is worth noting that this will not negatively impact use cases that
require reading lots of files smaller than a megabyte. Users can address this
with a packing strategy by aggregating and storing many small files as one
large file.
The protocol supports seeking and streaming, which will allow users to download small files
without requiring full retrieval of the aggregated object.
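As an illustration of such a packing strategy, the following Go sketch
(hypothetical; \code{packEntry} and \code{pack} are our own names, not Storj
APIs) concatenates small files into one blob and records offsets, so any one
of them can later be fetched with a ranged read such as the
\code{GetObject(bucketName, objectPath, offset, length)} method from
Figure \ref{fig:s3-api-code}:
\begin{lstlisting}
// Hypothetical packing sketch: aggregate many small files into one
// large object and keep an index for later ranged reads.
type packEntry struct {
  Offset int64
  Length int64
}

func pack(files map[string][]byte) ([]byte, map[string]packEntry) {
  var blob []byte
  index := make(map[string]packEntry)
  for name, data := range files {
    index[name] = packEntry{
      Offset: int64(len(blob)),
      Length: int64(len(data)),
    }
    blob = append(blob, data...)
  }
  return blob, index
}

// Reading one small file back is then a ranged download of the
// packed object, e.g.:
//   entry := index["small.txt"]
//   GetObject(bucket, packedPath, entry.Offset, entry.Length)
\end{lstlisting}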
\section{Byzantine fault tolerance}
Unlike centralized solutions like Amazon S3, Storj operates in an untrusted
environment where individual storage providers are not necessarily assumed to be
trustworthy. Storj operates over the public internet, allowing anyone to sign
up to become a storage provider.
We adopt the Byzantine, Altruistic, Rational (BAR) model \cite{bar} to discuss
participants in the network.
\begin{itemize}
\item {\em Byzantine} nodes may deviate arbitrarily from the suggested
protocol for any reason. Some examples include nodes that are broken or nodes
that are actively trying to sabotage the protocol. In general, a
{\em Byzantine} node is a bad actor, or one that optimizes for a utility
function that is independent of the one given for the suggested protocol.
\item Inevitable hardware failures aside, {\em Altruistic} nodes are good
actors and participate in a proposed protocol even if the rational choice is
to deviate.
\item {\em Rational} nodes are neutral actors and participate or deviate only
when it is in their net best interest.
\end{itemize}
Some distributed storage systems (e.g. datacenter-based cloud object storage systems)
operate in an environment
where all nodes are considered {\em altruistic}. For example, absent hardware failure
or security breaches, Amazon's storage nodes
will not do anything besides what they were explicitly programmed to do,
because Amazon owns and runs all of them.
In contrast, Storj operates in an environment where every node is
managed by its own independent operator.
In this environment, we can expect that a majority
of storage nodes are {\em rational} and a minority are {\em Byzantine}. Storj assumes no
{\em altruistic} nodes.
We must include incentives that encourage the network to ensure that the
rational nodes on the network (the majority of operators) behave as similarly
as possible to the expected behavior of altruistic nodes.
Likewise, the effects of Byzantine behavior must be minimized or eliminated.
Note that creating a system that is robust in the face of Byzantine behavior
does not require a Byzantine fault tolerant consensus protocol---we avoid
Byzantine consensus. See sections \ref{sec:concrete-metadata},
\ref{sec:distributed-metadata}, and appendix \ref{chap:dist-consensus} for
more details.
\section{Coordination avoidance}\label{sec:coordination-avoidance}
A growing body of distributed database research shows that systems that
avoid coordination wherever possible have far better throughput than systems
where subcomponents are forced to coordinate to achieve correctness
\cite{cap1, cap2, consistency-vs-latency, hat, i-confluence, anna,
calm1, calm2}.
We use Bailis {\em et al.}'s informal definition
that coordination is the requirement that concurrently executing operations
synchronously communicate or otherwise stall in order to complete
\cite{i-confluence}.
This observation happens at all scales and applies not only to distributed
networks but also to
concurrent threads of execution coordinating within the same computer.
As soon as coordination is needed, actors in the system will need to wait for
other actors, and waiting---due to coordination issues---can have a significant
cost.
While many types of operations in a network may require coordination
(e.g., operations that require linearizability\footnote{
Linearizable operations are atomic operations on a specific object where
the order of operations is equivalent to the order given by ``wall clock''
time.
}
\cite{jepsen-consistency, hat, vv-consistency}), choosing strategies that
avoid coordination (such as Highly Available Transactions \cite{hat}) can offer
performance gains of two to three orders of magnitude over wide area networks.
In fact, by carefully avoiding coordination as much as possible, the Anna
database \cite{anna} is able to be 10 times faster than both Cassandra and Redis in their
corresponding environments and 700 to 800 times faster than
performance-focused in-memory databases such as Masstree or Intel's TBB
\cite{anna-announce}.
Not all coordination can be avoided, but new frameworks (such as Invariant
Confluence \cite{i-confluence} or the CALM principle \cite{calm1, calm2})
allow system architects to understand when coordination is required for
consistency and correctness. As evidenced
by Anna's performance successes, it is most efficient to avoid coordination
where possible.
Systems that minimize coordination are
much better at scaling from small
to large workloads. Adding more resources to a coordination-avoidant system
will directly increase throughput and performance. However,
adding more resources to a coordination-dependent system
(such as Bitcoin \cite{bitcoin} or even Raft \cite{raft}) will not result in
much additional throughput or overall performance.
To get to exabyte scale, minimizing coordination is one of the key components
of our strategy.
Surprisingly, many decentralized storage platforms are working towards
architectures that require significant amounts of coordination,
where most if not all operations must be accounted for by a single global
ledger. For us to achieve exabyte scale, it is a fundamental requirement to
limit hotpath coordination domains to small spheres which are entirely
controllable by each user.
This limits the applicability of blockchain-like solutions for our use case.
\chapter{Framework}\label{chap:framework}
After having considered our design constraints, this chapter outlines the
design of a framework consisting of only the most fundamental components.
The framework describes
all of the components that must exist to satisfy our constraints.
As long as our design constraints remain constant, this framework will, as
much as is feasible, describe Storj both now and ten years from now.
While there will be some design freedom within the framework,
this framework will obviate the need for future rearchitectures entirely, as
independent components can be replaced without affecting the others.
\section{Framework overview}
All designs within our framework will do the following things:
\begin{description}
\item[Store data] When data is stored with the network, a client encrypts
and breaks it up into multiple pieces. The pieces are distributed
to peers across the network. When this occurs, metadata is generated that
contains information on where to find the data again.
\item[Retrieve data] When data is retrieved from the network,
the client will first reference the metadata to identify the locations of the
previously stored pieces.
Then the pieces will be retrieved and the original data will be reassembled
on the client's local machine.
\item[Maintain data] When the amount of redundancy drops below a certain
threshold, the necessary data for the missing pieces is regenerated and
replaced.
\item[Pay for usage] A unit of value should be sent in exchange for
services rendered.
\end{description}
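These four responsibilities can be pictured as a narrow interface. The
following Go listing is purely illustrative; the types, names, and signatures
are hypothetical and are not part of the Storj codebase:
\begin{lstlisting}
// Illustrative only: placeholder types standing in for the real
// metadata and node identifier representations.
type Metadata []byte
type NodeID [32]byte

// Network captures the four operations every design within the
// framework must support.
type Network interface {
  // Store encrypts data, breaks it into pieces, distributes the
  // pieces to peers, and returns metadata locating them.
  Store(path string, data []byte) (Metadata, error)

  // Retrieve locates previously stored pieces via metadata,
  // fetches them, and reassembles the original data locally.
  Retrieve(meta Metadata) ([]byte, error)

  // Repair regenerates and replaces missing pieces when
  // redundancy drops below a threshold.
  Repair(meta Metadata) error

  // Pay sends a unit of value in exchange for services rendered.
  Pay(node NodeID, amount int64) error
}
\end{lstlisting}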
To improve understandability, we break up the design into a collection of eight
independent components and then combine them to form the desired framework.
The individual components are:
\begin{enumerate}
\item Storage nodes
\item Peer-to-peer communication
\item Redundancy
\item Metadata
\item Encryption
\item Audits and reputation
\item Data repair
\item Payments
\end{enumerate}
\section{Storage nodes}
The storage node's role is to store and return data.
Aside from reliably storing data, nodes should provide
network bandwidth and appropriate responsiveness.
Storage nodes are selected to store data based on various criteria: ping time,
latency, throughput, bandwidth caps, sufficient disk space,
geographic location, uptime, history of responding accurately to audits, and
so forth. In return for their service, nodes are paid for both data egress and
data at rest.
Because storage
nodes are selected via changing variables external to the protocol, node
selection is an explicit, non-deterministic process in our framework. This means
that we must keep track of which nodes were selected for each upload via a
small amount of metadata; we can't select nodes for storing data implicitly or
deterministically as in a system like Dynamo \cite{dynamo}. As with
GFS \cite{gfs}, HDFS \cite{hdfs}, or Lustre \cite{lustre}, this decision
implies the requirement of a metadata storage system to keep track
of selected nodes (see section \ref{sec:framework-metadata}).
\section{Peer-to-peer communication}
All peers on the network communicate via a standardized protocol. The
framework requires that this protocol:
\begin{itemize}
\item provides peer reachability, even in the face of firewalls
and NATs where possible.
This may require techniques like STUN \cite{stun}, UPnP \cite{upnp},
NAT-PMP \cite{natpmp}, etc.
\item provides authentication as in S/Kademlia \cite{skad},
where each participant cryptographically
proves the identity of the peer with whom they are speaking to avoid
man-in-the-middle attacks.
\item provides complete privacy. In cases such as bandwidth measurement
(see section \ref{baer}), the client and storage node must be able
to communicate without any risk of eavesdroppers. The protocol should
ensure that all communications are private by default.
\end{itemize}
Additionally, the framework requires a way to look up peer network addresses
by a unique identifier so that, given a peer's unique identifier, any other
peer can connect to it. This responsibility is similar to the internet's
standard domain name system (DNS) \cite{dns},
which is a mapping of an identifier to an
ephemeral connection address, but unlike DNS, there can be no centralized
registration process.
A network overlay can be built on top of our chosen peer-to-peer communication protocol
to achieve these goals.
See Section \ref{sec:concrete-node-discovery} for
implementation details.
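One way to picture these requirements is as two small interfaces. The Go
sketch below is our own illustration; every name in it is hypothetical and
does not correspond to the Storj codebase:
\begin{lstlisting}
import (
  "context"
  "io"
  "net"
)

// NodeID is a placeholder for a peer's unique, cryptographically
// verifiable identifier.
type NodeID [32]byte

// Dialer opens authenticated, private connections to peers,
// traversing firewalls and NATs where possible.
type Dialer interface {
  Dial(ctx context.Context, peer NodeID) (io.ReadWriteCloser, error)
}

// Resolver maps a stable node identifier to its current network
// address, with no centralized registration process (unlike DNS).
type Resolver interface {
  Resolve(ctx context.Context, peer NodeID) (net.Addr, error)
}
\end{lstlisting}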
\section{Redundancy}\label{sec:framework-redundancy}
We assume that at any moment, any storage node could go offline permanently.
Our redundancy
strategy must store data in a way that provides access to the data with high
probability, even though any given number of individual nodes may be in
an offline state. To
achieve a specific level of {\em durability} (defined as the probability that
data remains available in the face of failures), many products in this space use
simple replication. Unfortunately, this ties durability to the network {\em
expansion factor}, which is the storage overhead for reliably storing data. This
significantly increases the total cost relative to the stored data.
For example, suppose a certain desired level of durability requires a
replication strategy that makes eight copies of the data. This yields an
expansion factor of 8x, or 800\%. This data then needs to be stored on the
network, using bandwidth in the process. Thus, more replication results in more
bandwidth usage for a fixed amount of data. As discussed in the protocol design
constraints (section \ref{sec:req-bandwidth}) and Blake {\em et al.}
\cite{pick2-churn},
high bandwidth usage prevents scaling, so this is an undesirable
strategy for ensuring a high degree of file durability.
As an alternative to simple replication, {\em erasure codes} provide a much
more efficient method to achieve redundancy.
Erasure codes are well-established in use for both distributed and peer-to-peer
storage systems \cite{p2p-lazy, storj-v2, rs-cd, rs-intro, rs-stragglers, hail,
filefec-packing}.
Erasure codes are an encoding scheme for manipulating
data durability without tying it to bandwidth usage, and have been found to
improve repair traffic significantly over replication \cite{pick2-churn}.
Importantly, they allow changes in durability without changes in expansion
factor.
An erasure code is often described by two numbers, $k$ and $n$. If a block of
data is encoded with a $(k,n)$ erasure code, there are $n$ total generated {\em
erasure shares}, of which any $k$ are sufficient to recover the original
block of data. If a block of data is $s$ bytes, each of the $n$ erasure shares
is roughly $s/k$ bytes. Besides the case when $k=1$ (replication), all erasure
shares are unique.
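To make this concrete, the following sketch shows a $(k{=}20, n{=}40)$ encode
and decode round trip using an off-the-shelf Reed-Solomon library for Go
(\code{github.com/klauspost/reedsolomon}). It illustrates the concept only and
is not a description of the concrete implementation in chapter
\ref{chap:concrete}:
\begin{lstlisting}
import "github.com/klauspost/reedsolomon"

func roundTrip(data []byte) error {
  // (k=20, n=40): 20 data shares plus 20 parity shares.
  enc, err := reedsolomon.New(20, 20)
  if err != nil {
    return err
  }
  // Split produces n=40 shares, each roughly len(data)/20 bytes.
  shards, err := enc.Split(data)
  if err != nil {
    return err
  }
  if err := enc.Encode(shards); err != nil {
    return err
  }
  // Simulate losing n-k = 20 of the 40 shares...
  for i := 0; i < 20; i++ {
    shards[i] = nil
  }
  // ...any k = 20 remaining shares still recover the data.
  return enc.Reconstruct(shards)
}
\end{lstlisting}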
Interestingly, the durability of a $(k=20,n=40)$ erasure code
is better than a $(k=10,n=20)$ erasure code, even though the expansion factor
(2x) is the same for both. This is because the risk is spread
across more nodes in the $(k=20,n=40)$ case. These considerations make erasure
codes an important part of our general framework.
To better understand how erasure codes increase durability without
increasing expansion factors, the following table shows various choices of
$k$ and $n$, along with the expansion factor and associated durability:
\begin{center}
\begin{tabular}{c c c l}
$k$ & $n$ & Exp. factor & $P(D \mid p = 10\%)$ \\ \hline
2 & 4 & 2 & 99.207366813274616\%\\
4 & 8 & 2 & 99.858868985411326\%\\
8 & 16 & 2 & 99.995462406878260\%\\
16 & 32 & 2 & 99.999994620652776\%\\
20 & 40 & 2 & 99.999999807694154\%\\
32 & 64 & 2 & 99.999999999990544\%\\
\end{tabular}
\end{center}
In contrast, replication requires significantly higher expansion factors for
the same durability. The following table shows durability with a replication
scheme:
\begin{center}
\begin{tabular}{c c c l}
$k$ & $n$ & Exp. factor & $P(D \mid p = 10\%)$ \\ \hline
1 & 1 & 1 & 90.483741803595962\%\\
1 & 2 & 2 & 98.247690369357827\%\\
1 & 3 & 3 & 99.640050681691051\%\\
1 & 10 & 10 & 99.999988857452166\%\\
1 & 16 & 16 & 99.999999998036174\%\\
\end{tabular}
\end{center}
To see how these tables were calculated, we'll start
with the simplifying assumption that $p$ is the monthly node
churn rate (that is, the fraction of nodes that will go offline in a month on
average).
Mathematically, such time-dependent processes are modeled with
the Poisson distribution, which assumes an average of $\lambda$ events
in a given unit of time.
As a result, we model durability
as the cumulative distribution function (CDF) of the Poisson distribution with mean $\lambda=pn$,
where we expect $\lambda$ pieces of the file to be lost monthly.
To estimate
durability, we consider the CDF up to $n-k$,
looking at the probability that at most $n-k$ pieces
of the file are lost in a month and the file can still be rebuilt.
The CDF is given by:
\begin{align*}
&& P(D) = e^{-\lambda} \sum_{i=0}^{n-k} \frac{\lambda^i}{i!} &&
\end{align*}
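As a concrete check, the following short Go program (our own sketch) evaluates
this CDF for the first row of the erasure-code table above and reproduces the
listed durability:
\begin{lstlisting}
package main

import (
  "fmt"
  "math"
)

// durability returns the Poisson CDF up to n-k with mean
// lambda = p*n: the probability that at most n-k pieces are
// lost in a month, so the file can still be rebuilt.
func durability(k, n int, p float64) float64 {
  lambda := p * float64(n)
  sum, term := 0.0, 1.0 // term tracks lambda^i / i!
  for i := 0; i <= n-k; i++ {
    sum += term
    term *= lambda / float64(i+1)
  }
  return math.Exp(-lambda) * sum
}

func main() {
  // (k=2, n=4), p = 10%: prints 0.992074 (99.207...%).
  fmt.Printf("%.6f\n", durability(2, 4, 0.10))
}
\end{lstlisting}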
The expansion factor still plays a big role in durability, as seen in the
following table:
\begin{center}
\begin{tabular}{c c c l}
$k$ & $n$ & Exp. factor & $P(D \mid p = 10\%)$ \\ \hline
4 & 6 & 1.5 & 97.688471224736705\%\\
4 & 12 & 3 & 99.999514117129605\%\\
20 & 30 & 1.5 & 99.970766304935266\%\\
20 & 50 & 2.5 & 99.999999999999548\%\\
100 & 150 & 1.5 & 99.999999999973570\%\\
\end{tabular}
\end{center}
By being able to tweak the durability independently of the expansion factor,
erasure coding allows very high durability to be achieved with surprisingly
low expansion factors.
Because of how limited bandwidth is as a resource, completely eliminating
replication as a
strategy and using erasure codes only for redundancy causes a drastic
decrease in bandwidth footprint.
Erasure coding also results in storage nodes getting paid more.
High expansion factors dilute the incoming funds per byte across more storage
nodes; therefore, low expansion factors, such as those provided by erasure
coding, allow for a much more direct passthrough of income to storage node
operators.
\subsection{Erasure codes' effect on streaming}
Erasure codes are used in many streaming contexts such as audio CDs and
satellite communications \cite{rs-cd},
so it's important to point out that using erasure coding in general does not
make our streaming design requirement (required by Amazon S3 compatibility,
see section \ref{constraint-amazon}) more challenging.
Whatever erasure code is chosen for our framework, as with CDs, streaming can
be added on top by encoding small portions at a time, instead of attempting to
encode a file all at once. See section \ref{sec:structured-file-storage} for
more details.
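A minimal sketch of that idea in Go, reusing the \code{reedsolomon} encoder
from the earlier example (the segment size here is an assumption for
illustration, not a protocol parameter):
\begin{lstlisting}
import (
  "io"

  "github.com/klauspost/reedsolomon"
)

const segmentSize = 4 << 20 // assumed 4 MiB segments

// encodeStream erasure-codes a stream one segment at a time, so
// transfer can begin before the whole file has been processed.
func encodeStream(r io.Reader, enc reedsolomon.Encoder) error {
  buf := make([]byte, segmentSize)
  for {
    n, err := io.ReadFull(r, buf)
    if n > 0 {
      shards, serr := enc.Split(buf[:n])
      if serr != nil {
        return serr
      }
      if eerr := enc.Encode(shards); eerr != nil {
        return eerr
      }
      // ... hand this segment's shares to the uploader ...
    }
    if err == io.EOF || err == io.ErrUnexpectedEOF {
      return nil // end of stream
    }
    if err != nil {
      return err
    }
  }
}
\end{lstlisting}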
\subsection{Erasure codes' effect on long tails}\label{sec:long-tail}
Erasure codes enable an enormous performance benefit, which is the ability to
avoid waiting for ``long-tail'' response times \cite{tail-at-scale}. A
long-tail response occurs in situations where a needed server has an