Categories: None
scrapetor
Scrapetor is a Ruby HTML parsing and scraping toolkit. The parser is a native C arena DOM with structural indexes built at parse time and NEON SIMD scanners in the SAX hot loop. A streaming extraction engine compiles the schema DSL into a single forward pass: no DOM is materialised, and there is one Ruby boundary crossing per document. On builds where libcurl is available, Scrapetor::Fetcher adds an HTTP/2-capable fetch layer with a per-thread connection cache, a shared DNS and TLS session pool, in-process gzip / deflate / brotli / zstd decoding, iconv charset transcoding, retry with exponential backoff, an ETag / Last-Modified disk cache with bulk revalidation, per-host throttling, a cookie jar, basic and bearer auth, proxy support, and three bulk concurrency models (parallel_fetch, multi_fetch, and streaming multi_each). Scrapetor::Session ties the cookie, auth, throttle, and retry policies together. The gem also ships robots.txt and sitemap.xml parsers, a bounded-memory streaming HTML parser, and structured-data extractors (JSON-LD, OpenGraph, Schema.org, Microdata, RDFa, Twitter Cards). The Net::HTTP-based Scrapetor.fetch is preserved as the fallback for builds without libcurl.
Total
Ranking: 193,427 of 193,553
Downloads: 181
Daily
Ranking: 63,203 of 193,540
Downloads: 1
[Charts: Downloads Trends; Ranking Trends; Number of Versions Trends; Popular Versions (Major); Popular Versions (Major.Minor)]
Depended by
| Rank | Downloads | Name |
|---|---|---|
Depends on
| Rank | Downloads | Name |
|---|---|---|
| 8 | 1,341,379,803 | rake |
| 14 | 1,231,898,944 | minitest |
| 15 | 1,208,695,699 | nokogiri |
| 110 | 447,634,925 | webrick |
| 654 | 85,903,326 | benchmark-ips |
| 9,758 | 351,566 | nokolexbor |
Owners
| # | Handle |
|---|---|
| 1 | scrapetor |