-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathsri-addressable-caching.src.html
209 lines (172 loc) · 10.8 KB
/
sri-addressable-caching.src.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
<pre class='metadata'>
Title: Subresource Integrity Addressable Caching
Status: DREAM
ED: https://hillbrad.github.io/sri-addressable-caching/sri-addressable-caching.html
Shortname: sri-addressable-caching
Level: None
Editor: Brad Hill, Facebook Inc., hillbrad@gmail.com
Abstract:
This note describes some security and privacy risks associated with
content addressable caching in Web user agents, particularly those
that may arise from the use of Subresource Integrity metadata
to identify cache-eligible resources.
Nothing stops user agents from building such a cache. This note encourages
implementations following proper precautions, while also indicating why
a correct and normative specification for such a cache is likely unnecessary
and perhaps impossible.
Indent: 2
Markup Shorthands: css off, markdown on
Boilerplate: omit conformance, omit feedback-header
!Participate: <a href="https://github.com/hillbrad/sri-addressable-caching/issues/new">File an issue</a> (<a href="https://github.com/hillbrad/sri-addressable-caching/issues">open issues</a>)
</pre>
<pre boilerplate="copyright">©2016 Facebook, Inc.</pre>
<pre class="link-defaults">
spec:html; type:element; text:link
spec:html; type:dfn; text:case-sensitive
spec:html; type:dfn; text:ascii case-insensitive
</pre>
<pre class="anchors">
spec: SRI; urlPrefix: https://w3c.github.io/webappsec-subresource-integrity/
type: dfn
text: Subresource Integrity
spec: CSP3; urlPrefix: https://w3c.github.io/webappsec-csp/
type: dfn
text: Content Security Policy
</pre>
Introduction {#intro}
=====================
A small number of popular JavaScript frameworks and libraries account for a
substantial portion of the network, battery, and time budgets of users of
the contemporary Web. Many applications include these libraries even if they
only use a small portion of the functionality they provide. Large improvements
seem possible if Web User Agents ("UAs") could fetch and compile these libraries a
single time. The benefit would be especially large for UAs that do not currently
share cached content across origins. (typically to protect user privacy)
If a content addressable cache could be populated prior to user
browsing activity, the UA could identify opportunities when a device is idle,
charging, and has unmetered network available, to download,
parse, and compile script files in wide use in order to increase
performance later, especially under slow network conditions.
A historical difficulty of accomplishing this has been that resources may
respond differently in different contexts, update at any time, and the same public
resource may be hosted in many locations without any way to determine (without
actually fetching it) whether it is actually the same content.
The `Subresource Integrity` [[SRI]] ("SRI") specification now allows for
applications to tag their script dependencies with cryptographic hashes that
strongly identify the exact content they want. This naturally raises the
possibility of using the SRI hash to key into a content addressable cache in
the UA.
Version 1 of SRI explicitly avoided specifying a means of doing this. This
document explores why content addressable caching is difficult on the Web
platform, the sometimes subtle security and privacy issues it raises,
and how UAs might sidestep these difficulties in a non-normative fashion.
Sharing cached resources across origins based on other signals of content identity
(such as the "immutable" value for the <tt>Cache-Control</tt> header) may have similar
privacy and security concerns and find this document useful.
Timing Leaks {#cache_timing}
============================
The first difficulty of implementing cross-origin, content addressable caching on
the Web platform is that it may leak information about the user's past browsing.
For example, site "A" could load a script with a given hash, then later when
the user visits site "B", it could attempt to observe the time it takes to load
that same resource and infer whether the user had previously been at "A".
Such timing leaks may exist today in UAs that do not segregate their cache
based on the requesting origin or registerable domain, but as a content addressasble cache is
shared by design, this issue requires explicit consideration in this context.
Deterministic History Leaks {#history_leaks}
============================================
In the presence of a user-specific content addressable cache, "B" might also do the following:
<tt><script src='/probe.js?unique_id=1IAMTRACKINGY0U' integrity='sha-256:hash-of-resource-at-a' /></tt>
If "B" serves a resource with this markup and observes no request to the probe resource is
made, it can infer with near certainty that the user has previously visited "A" because the
cache is primed.
Origin Laundering {#origin_laundering}
======================================
In general, it would seem that the use of a cryptographic hash
to identify content to be loaded should prevent cache poisoning. However, there
are some circumstances where the properties of a script (or other content which
might be similarly identified in the future) depend on the origin from
which it is loaded.
To reduce attack surface, `Content-Security-Policy` [[CSP3]] ("CSP")
allows a resource to specify to the UA from which origins it should load content
like scripts. If https://example.com specifies
<tt>Content-Security-Policy: script-src 'scripts.example.com'</tt> for
its resources in order to only allow known scripts from a source it controls,
it would be bad if an attacker could inject the following code into a document:
<tt><script src='https://scripts.example.com/angular.min.js' integrity='sha-256:deadbeef1234badcafe...' /></tt>
And by so doing get the UA to load an old version of the Angular framework from a content
addressable cache if the server at https://scripts.example.com origin wouldn't
actually serve that resource.
Origin Laundering of this sort might also allow bypassing restrictions on the
loading of Workers or ServiceWorkers, which must be at the same-origin as the
resource loading them.
Origin Laundering could create immediate Universal Cross-Site Scripting (UXSS) if
it was applied to <tt><plugin></tt>, <tt><object></tt>, or similar
subresource types that create a plugin document, as PDFs and other plugin document
types often use variant access control policies which allow these document types
to assume the origin in their own URL, rather than the origin of the document to
which they are a subresource.
A similar UXSS risk would exist if top-level documents
(e.g. HTML in an <tt><iframe></tt>) were eligible for loading from a content
aware cache in a manner subject to Origin Laundering.
Suggestions for a Solution {#solution}
======================================
Because of these issues, it is difficult to specify SRI-based content addressable
caching behavior that depends entirely on the interactions between the user agent
and resources.
The difficulty of normatively specifying a neutral algorithm does not mean that
we must forgo the potential improvements of such a cache. So long as they do not
violate an application's semantics (including security), UAs have wide latitude to
make performance improvements. The performance benefits of a cache need not be
uniform, deterministic or even entirely predictable by applications.
A simple solution for timing and history leaks is to not base the population of a
content addressable cache on individual user behavior. If loading a script is fast
because it is always fast in a given UA, there is no privacy leak.
One possible answer to building SRI addressable caching is for user agents to enable
it for pieces of content based on large-scale content heuristics, and never
individual user behavior. A number of strategies are possible: crawling and analysis of
content on the Web at large and aggregate user browsing behavior based on UA
telemetry are two obvious ones. Either approach could enable a user agent to approximate
which resources referenced with SRI hashes would provide the most user benefit as
a member of a content addressable cache.
A user agent might also simply "pick winners" manually to avoid the security and
privacy traps, but it is likely a more sophisticated strategy will produce better
results. The exact nature of such a strategy is inherently non-normative and which
strategy will achieve the best results may be unstable over time and for different
user populations.
If the content addressable cache is pre-populated identically for large user populations,
the only information that might be leaked via side-channels is the last update time of
the SRI cache, and possibly the size of the storage allotted to this type of cache
(if that is allowed to vary statically or dynamically based on the available
storage on the device) by examining whether resources at the time and space edges of the
cache are primed.
Managing origin laundering requires the cache to be annotated with a list of
URLs observed to actually host a given resource, and double-keying the content
addressable cache with both the resource hash and its location.
Either a crawling or user telemetry based cache population strategy
provides the necessary data for this double-keying.
In many circumstances, it may be possible to safely load from the content addressable
cache using only the hash key, for example:
<ul>
<li>If no Content-Security-Policy rules apply to the resource fetch</li>
<li>If a resource's Content-Security-Policy header explicitly lists the hash
of an external resource as allowed, the UA could interpret that as an authoritative
statement that the resource's origin provenance is irrelevant</li>
<li>If the origin provenance of a resource is otherwise irrelevant because the
resource is not granted any special privileges based on that information</li>
</ul>
<h3>Second-Order Effects</h3>
Decisions on what to cache are not content-neutral, or if they begin so, they cannot
remain so. Any algorithm will create a performance bias towards things included
in the cache, which may have other, possibly detrimental, effects in the long term.
It may entrench incumbent, popular frameworks at the expense of innovation, or it
might discourage early adoption of security fixes if newly-patched libraries are
slower because they have not been pre-cached yet. Conversely, a policy of only
caching the latest few versions of a given library might encourage more rapid
patching and obsolesence of vulnerable versions.
A privileged place in the cache might encourage providers of popular libraries
and frameworks to bundle unrelated features to gain a competitive
advantage, or just to evict less popular competing frameworks from a
size-limited cache. Whether such effects would materialize, and their magnitude, is
impossible to accurately predict, but implementers may wish to take these concerns
into consideration when deciding whether and how to implement such a caching algorithm.