parcex extracts content from WARC files and rebuilds web resources

netzliteratur/parcex

# Description

parcex extracts content from WARC files.

# Usage

    ./parcex.py WARC-FILE

WARC-FILE must be a WARC file that conforms to the WARC file format specification (ISO 28500, draft version):

http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
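
For orientation, a WARC record starts with a version line such as `WARC/1.0`, followed by `Name: value` header fields, a blank line, and a content block whose size is given by `Content-Length`. The sketch below reads one such record from an uncompressed byte stream. It is only an illustration of the record layout, not parcex's own parser: the function name `read_warc_record` and the sample record are assumptions made for this example, and folded (multi-line) header fields and gzip-compressed files are not handled.

```python
import io


def read_warc_record(stream):
    """Read one uncompressed WARC record from a binary stream.

    Hypothetical helper for illustration; not part of parcex.
    Returns (version, headers, payload), or None at end of input.
    """
    line = stream.readline()
    # Skip the blank lines that separate consecutive records.
    while line in (b"\r\n", b"\n"):
        line = stream.readline()
    if not line:
        return None

    version = line.strip().decode("utf-8")
    if not version.startswith("WARC/"):
        raise ValueError("not a WARC record: %r" % version)

    # Header fields run until the first blank line.
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()

    # The content block is exactly Content-Length bytes long.
    payload = stream.read(int(headers["Content-Length"]))
    return version, headers, payload


# Made-up sample record, used only to exercise the function.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello\r\n"
    b"\r\n"
)
sample_stream = io.BytesIO(sample)
record = read_warc_record(sample_stream)
```

Calling `read_warc_record` again on the same stream returns `None` once the input is exhausted, so a caller can loop until that happens.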

# Output

  • A directory structure with the URL scheme (for example, http) as its root.
  • The directory structure mirrors the structure of the web resource.
  • Resources with an empty file name are saved as index.html.N, where N >= 0.
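
The mapping described in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not parcex's actual code: the helper name `url_to_path`, the inclusion of the host name as the second path component, and the use of N to distinguish repeated captures of the same location are all guesses made for the example.

```python
import posixpath

try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2.7


def url_to_path(url, used_paths):
    """Map a URL to a relative output path (hypothetical helper, not parcex's code).

    used_paths is a set of paths already handed out; it is updated in place.
    """
    parts = urlsplit(url)
    # Scheme is the root directory, then (assumed) the host, then the URL path segments.
    segments = [parts.scheme, parts.netloc]
    segments += [s for s in parts.path.split("/") if s]

    if parts.path == "" or parts.path.endswith("/"):
        # Empty file name: use the first free index.html.N with N >= 0.
        n = 0
        while True:
            path = posixpath.join(*(segments + ["index.html.%d" % n]))
            if path not in used_paths:
                break
            n += 1
    else:
        path = posixpath.join(*segments)

    used_paths.add(path)
    return path


# example.org is a placeholder host used only for this demonstration.
seen = set()
first = url_to_path("http://example.org/a/b.html", seen)
second = url_to_path("http://example.org/a/", seen)
third = url_to_path("http://example.org/a/", seen)
```

The sketch joins components with `/` via `posixpath` so the result is the same on every platform; code that writes to disk would more likely use `os.path.join` and create the intermediate directories.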

# Requirements

Python 2.7 or 3.x

# Status

Testing
