
Releases: spider-rs/spider

v1.11.0

26 Jul 12:08
  • fix user-agent memory leak between calls on configuration
  • fix user-agent ownership http client
  • fix robot-parser memory leaks
  • add shared connection between robotparser client
  • perf(req): enable brotli
  • chore(statics): add initial static media ignore - remove unsafe macro
  • chore(tls): add ALPN tls defaults

Crawl time improved from 1.9s to 1.5s compared with the previous benchmarks.

Full Changelog: v1.10.0...v1.11.0

v1.10.0

11 Jul 00:10

What's Included

  • feat(ua): add random spoof User-Agent flag
  • chore(minor): add scoped lazy media ignore statics

To enable random User-Agents, turn on the ua_generator feature in Cargo.toml.

[dependencies]
spider = { version = "1.10.0", features = ["ua_generator"] }

Full Changelog: v1.9.0...v1.10.0

Spider v1.9.0

16 May 14:59
760b3ce

What's Changed

  • feat(crawl): add subdomain and tld crawling by @j-mendez in #59

You can now gather the content of all your pages in a single crawl, across TLDs and subdomains.

-- Example

extern crate spider;

use spider::website::Website;

fn main() {
  let mut website: Website = Website::new("https://rsseau.fr");
  website.configuration.subdomains = true;
  website.configuration.tld = true;
  website.crawl();

  for page in website.get_pages() {
    println!("- {}", page.get_url());
  }
}

Full Changelog: v1.8.0...v1.9.0

v1.8.0

05 Sep 15:03
675c040

What's Changed

  • feat(time): add page duration uptime tracking with the feature flag time.
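
As a rough illustration of what per-page duration tracking involves, the sketch below times a fetch with `std::time::Instant`. It does not use the spider crate: `TimedPage` and `fetch_timed` are hypothetical names, and the closure stands in for a real request.

```rust
use std::time::{Duration, Instant};

/// Hypothetical page record; this is not the spider crate's `Page` type.
struct TimedPage {
    url: String,
    duration: Duration,
}

/// Runs `fetch` for `url` and records how long the call took.
fn fetch_timed<F: FnOnce(&str)>(url: &str, fetch: F) -> TimedPage {
    let start = Instant::now();
    fetch(url);
    TimedPage {
        url: url.to_string(),
        duration: start.elapsed(),
    }
}

fn main() {
    // The sleep is a stand-in for a real network request.
    let page = fetch_timed("https://rsseau.fr", |_url| {
        std::thread::sleep(Duration::from_millis(10));
    });
    println!("{} took {:?}", page.url, page.duration);
}
```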

Full Changelog: v1.7.22...v1.8.0

Spider v1.7.22

30 Apr 11:44
9b8f0c3

What's Changed

  • fix(selectors): add resources to ignore list to handle main formats by @j-mendez in #47
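
One way to picture a resource ignore list is a check on the link's file extension. The sketch below is an assumption about the general approach, not the crate's selector code: `IGNORED_EXTENSIONS` and `is_ignored_resource` are hypothetical, and the extension list is only illustrative.

```rust
/// Illustrative extension list; the crate's actual ignore list may differ.
const IGNORED_EXTENSIONS: [&str; 8] = ["css", "js", "png", "jpg", "jpeg", "gif", "svg", "pdf"];

/// Returns true when the link points at a file with an ignored extension.
fn is_ignored_resource(url: &str) -> bool {
    // Drop the query string and fragment before looking at the extension.
    let path = url.split(|c: char| c == '?' || c == '#').next().unwrap_or(url);
    let last_segment = match path.split_once("://") {
        // Absolute URL: skip the host, which may itself contain dots.
        Some((_, rest)) => match rest.split_once('/') {
            Some((_, p)) => p.rsplit('/').next().unwrap_or(p),
            None => return false,
        },
        // Relative link: the last path segment is enough.
        None => path.rsplit('/').next().unwrap_or(path),
    };
    match last_segment.rsplit_once('.') {
        Some((_, ext)) => IGNORED_EXTENSIONS.iter().any(|e| e.eq_ignore_ascii_case(ext)),
        None => false,
    }
}

fn main() {
    for link in ["https://rsseau.fr/logo.png", "https://rsseau.fr/blog/post"] {
        println!("{} ignored: {}", link, is_ignored_resource(link));
    }
}
```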

-- Other Changes
Updated the docs on respecting robots.txt and added missing rustdocs.

Full Changelog: v1.7.19...v1.7.22

Spider v1.7.8

26 Apr 19:43

What's Changed

  • chore(concurrency): add simultaneous multithreading detection by @j-mendez in #45
  • perf(concurrency): increase default concurrency limit by @j-mendez in #46
  • feat(cli): add comma-separated list support for the blacklist
  • chore(cli): fix rust verbose log output
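
For illustration, a comma-separated blacklist value can be split into entries roughly as follows. This is a sketch, not the CLI's actual parsing code: `parse_blacklist` is a hypothetical helper, and the prefix match in `main` is an assumption about how entries might be applied.

```rust
/// Hypothetical helper: splits a comma-separated value into blacklist entries,
/// trimming whitespace and skipping empty items.
fn parse_blacklist(raw: &str) -> Vec<String> {
    raw.split(',')
        .map(str::trim)
        .filter(|entry| !entry.is_empty())
        .map(String::from)
        .collect()
}

fn main() {
    let blacklist = parse_blacklist("https://rsseau.fr/resume, https://rsseau.fr/books");
    let url = "https://rsseau.fr/books/rust";
    // Assumed matching rule: a URL is blocked when it starts with any entry.
    let blocked = blacklist.iter().any(|entry| url.starts_with(entry.as_str()));
    println!("{} blocked: {}", url, blocked);
}
```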

--
About 0.5s shaved off the benchmark compared with Spider v1.6.1.


Full Changelog: v1.7.3...v.1.7.7

Full Changelog: v.1.7.7...v.1.7.8

Crawl sync option

24 Apr 19:20
  1. Ability to crawl links in sync.

use spider::website::Website;

fn main() {
    // Crawl the pages one by one instead of concurrently.
    let mut website: Website = Website::new("https://choosealicense.com");
    website.crawl_sync();
}

Or via the CLI:

spider -d https://rsseau.fr crawl -s

What's Changed

  • chore(log): add crate log default logger by @j-mendez in #42
  • feat(delay): add non blocking delay scheduling by @j-mendez in #43

Full Changelog: v1.6.1...v1.7.3

Spider v1.6.1

22 Apr 18:22

Performance Tuned

The crawler's speed has been cranked up a notch, and it is now the fastest open-source spider crawler available. View the benchmarks in the CI action for results. If you know of any alternative crawlers, feel free to open an issue so we can add them to the benchmark comparisons.

What's Changed

  • test(bench): add self task execution of bench by @j-mendez in #40
  • perf(links): filter dup links after async batch
  • chore(delay): fix crawl delay thread groups
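
Filtering duplicate links once a batch has finished can be sketched with a `HashSet`, as below. This is a minimal illustration of the idea rather than the crate's implementation; `filter_new_links` is a hypothetical name.

```rust
use std::collections::HashSet;

/// Merges a finished batch of links into the set of known links and returns
/// only the links that were not seen before, in first-seen order.
fn filter_new_links(visited: &mut HashSet<String>, batch: Vec<String>) -> Vec<String> {
    batch
        .into_iter()
        // `insert` returns false when the value was already present.
        .filter(|link| visited.insert(link.clone()))
        .collect()
}

fn main() {
    let mut visited = HashSet::new();
    visited.insert("https://rsseau.fr/".to_string());

    // Links collected by one batch of page fetches, duplicates included.
    let batch = vec![
        "https://rsseau.fr/".to_string(),
        "https://rsseau.fr/blog".to_string(),
        "https://rsseau.fr/blog".to_string(),
    ];
    let fresh = filter_new_links(&mut visited, batch);
    println!("new links to crawl: {:?}", fresh);
}
```

Deduplicating once per batch, rather than on every discovered link, keeps the shared set out of the concurrent part of the crawl.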

Full Changelog: v1.6.0...v1.6.1

Perf increased after commit 053eea4.

Benchmarks against crolly and node-crawler (cases are about identical in implementation).


Crawl sync api fix

21 Apr 17:05
1efc75e

This release fixes the thread handling of async tasks that use the client established on the main thread.
Crawl speed improves drastically because the client pool was previously released incorrectly between threads.

What's Changed

  • chore(log): add log util by @j-mendez in #31
  • feat(regex): add optional regex black listing by @j-mendez in #36
  • perf(crawl): improve crawl link exclusion by @j-mendez in #37
  • perf(parsing): add async parallel page handling by @j-mendez in #38
  • perf(client): fix blocking and async mixture by @j-mendez in #39

Full Changelog: v1.5.1...v1.6.0

v1.5.1

15 Apr 19:27

What's Changed

  • chore(ua): add cargo env ua defaults by @j-mendez in #27
  • chore(links): fix parsing valid website pages by @j-mendez in #30

Full Changelog: v1.5.0...v1.5.1