Skip to content

An fast and efficient web spider that visits all links on a given domain using bloom filters to store visited links and multiple threads for concurrency.

License

Notifications You must be signed in to change notification settings

PenguinGabe/pantopoda

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Pantopoda

Pantopoda is a ruby spidering library that was built out of sheer frustration at the lack of good, modern web crawling tools that Python enjoys. Pantopoda uses bloom filters to store the list of visited urls for efficient querying up to hundreds of thousands of urls, and the requests are handled by Typhoeus to allow for multi-threaded crawling.

Pantopoda will crawl every single page it can find on a particular domain.

Installation

Add this line to your application's Gemfile:

gem 'pantopoda'

And then execute:

$ bundle

Or install it yourself as:

$ gem install pantopoda

Usage

TODO: Write usage instructions here

Contributing

  1. Fork it ( https://github.com/[my-github-username]/pantopoda/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

About

An fast and efficient web spider that visits all links on a given domain using bloom filters to store visited links and multiple threads for concurrency.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages