Commit b8c4108

WARC/1.1 annotated: hopsFromSeed metadata field #59
1 parent bccb0e7 commit b8c4108

1 file changed: +19 -0 lines changed

specifications/warc-format/warc-1.1-annotated/index.md

@@ -1110,6 +1110,25 @@ optional.
 - 'fetchTimeMs': time in milliseconds that it took to collect the
 archived URI, starting from the initiation of network traffic.

+> **Community recommendation:** #59
+> The `hopsFromSeed` field comes from the [discovery path](https://heritrix.readthedocs.io/en/latest/glossary.html#discovery-path)
+> concept in the Heritrix web crawler. The value is a string containing
+> one character for each link or embed followed from the seed, for
+> example "LLLE" might be an image on a page that's 3 links away from
+> a seed. The value of `hopsFromSeed` for a seed URI should be the
+> empty string.
+>
+> | Symbol | Meaning | Examples |
+> |--------|----------------------------------------------------------|----------------------------------------------------------|
+> | `L` | Link | `<a href=...>` |
+> | `E` | Embedded | `<img src=...>`<br>`<script src=...>` |
+> | `X` | Speculative embed | `<script>var url = 'http://example.org/foo.js';</script>` |
+> | `R` | Redirect | `HTTP/1.0 302 Found`<br>`Location: ...` |
+> | `P` | Prerequisite | robots.txt, DNS lookup |
+> | `I` | Implicit/Implied | favicon.ico |
+> | `M` | Manifest | URLs listed in sitemap files |
+> | `S` | Form submission | `<form action=...>` |
+
 A 'metadata' record may be associated with other records derived from
 the same capture event using the WARC-Concurrent-To header. A 'metadata'
 record may be associated to another record which it describes, using the
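
To make the `hopsFromSeed` encoding described in this change concrete, here is a minimal Python sketch of how a consumer might decode a value. The symbol table is copied from the recommendation; the names `HOP_TYPES`, `describe_hops`, `is_seed`, and `link_depth` are hypothetical and not part of WARC or Heritrix. Rejecting unknown symbols is an assumption of this sketch, not something the recommendation requires.

```python
# Hop symbols and meanings, copied from the recommendation's table.
HOP_TYPES = {
    "L": "link",
    "E": "embedded",
    "X": "speculative embed",
    "R": "redirect",
    "P": "prerequisite",
    "I": "implicit/implied",
    "M": "manifest",
    "S": "form submission",
}


def describe_hops(hops_from_seed: str) -> list[str]:
    """Translate each character of a hopsFromSeed value into its meaning.

    Assumption: a character outside the table is treated as an error.
    A tolerant reader could instead skip or pass through unknown symbols.
    """
    meanings = []
    for index, symbol in enumerate(hops_from_seed):
        if symbol not in HOP_TYPES:
            raise ValueError(f"unknown hop symbol {symbol!r} at index {index}")
        meanings.append(HOP_TYPES[symbol])
    return meanings


def is_seed(hops_from_seed: str) -> bool:
    """A seed URI is recorded with the empty string."""
    return hops_from_seed == ""


def link_depth(hops_from_seed: str) -> int:
    """Count only the 'L' hops, i.e. how many links away from the seed."""
    return hops_from_seed.count("L")
```

With the example from the text, `describe_hops("LLLE")` yields three `"link"` entries followed by `"embedded"`, and `link_depth("LLLE")` is 3, matching "an image on a page that's 3 links away from a seed".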
