Categories: None [Edit]
medusa-crawler
== Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://rubygems.org/gems/medusa-crawler] rdoc-image:https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push
Medusa is a framework for the ruby language to crawl and collect useful information about the pages
it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
=== Features
* Choose the links to follow on each page with +focus_crawl+
* Multi-threaded design for high performance
* Tracks +301+ HTTP redirects
* Allows exclusion of URLs based on regular expressions
* Records response time for each page
* Obey _robots.txt_ directives (optional, but recommended)
* In-memory or persistent storage of pages during crawl, provided by Moneta[https://github.com/moneta-rb/moneta]
* Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options).
Do you have an idea or a suggestion? {Open an issue and talk about it}[https://github.com/brutuscat/medusa-crawler/issues/new]
=== Examples
Medusa is versatile and to be used programatically, you can start with one or multiple URIs:
require 'medusa'
Medusa.crawl('https://www.example.com', depth_limit: 2)
Or you can pass a block and it will yield the crawler back, to manage configuration or drive its crawling focus:
require 'medusa'
Medusa.crawl('https://www.example.com', depth_limit: 2) do |crawler|
crawler.discard_page_bodies = some_flag
# Persist all the pages state across crawl-runs.
crawler.clear_on_startup = false
crawler.storage = Medusa::Storage.Moneta(:Redis, 'redis://redis.host.name:6379/0')
crawler.skip_links_like(/private/)
crawler.on_pages_like(/public/) do |page|
logger.debug "[public page] #{page.url} took #{page.response_time} found #{page.links.count}"
end
# Use an arbitrary logic, page by page, to continue customize the crawling.
crawler.focus_crawl(/public/) do |pa...
Total
Ranking: 125,663 of 183,539
Downloads: 4,596
Daily
Ranking: 30,826 of 183,521
Downloads: 6
Downloads Trends
Ranking Trends
Num of Versions Trends
Popular Versions (Major)
Popular Versions (Major.Minor)
Depended by
Rank | Downloads | Name |
---|
Owners
# | Gravatar | Handle |
---|---|---|
1 | brutuscat |