Releases: spider-rs/spider
v1.11.0
- fix user-agent memory leak between calls on configuration
- fix user-agent ownership in the http client
- fix robot-parser memory leaks
- add shared connection for the robot-parser client
- perf(req): enable brotli (see the sketch below)
- chore(statics): add initial static media ignore - remove unsafe macro
- chore(tls): add ALPN tls defaults
Crawl time improved from 1.9s to 1.5s against the previous benchmarks.
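For reference, Brotli decompression is a reqwest client option; a minimal sketch of how such a client can be built, assuming reqwest with its brotli feature enabled (an illustration, not the crate's internal code):

use reqwest::Client;

fn main() -> Result<(), reqwest::Error> {
    // brotli(true) enables Brotli response decompression;
    // requires reqwest's brotli feature in Cargo.toml
    let client = Client::builder().brotli(true).build()?;
    let _ = client;
    Ok(())
}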
Full Changelog: v1.10.0...v1.11.0
v1.10.0
What's Included
- feat(ua): add random spoof User-Agent flag
- chore(minor): add scoped lazy media ignore statics
To enable random User-Agents, set the ua_generator feature in your Cargo.toml:
[dependencies]
spider = { version = "1.10.0", features = ["ua_generator"] }
Full Changelog: v1.9.0...v1.10.0
Spider v1.9.0
What's Changed
You can now gather the content for all of your pages in one go, across TLDs and subdomains.
Example:
extern crate spider;

use spider::website::Website;

fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    // also crawl subdomains and other top-level domain variants
    website.configuration.subdomains = true;
    website.configuration.tld = true;
    website.crawl();

    for page in website.get_pages() {
        println!("- {}", page.get_url());
    }
}
Full Changelog: v1.8.0...v1.9.0
v1.8.0
What's Changed
- feat(time): add page duration tracking behind the time feature flag.
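A minimal Cargo.toml sketch enabling it, following the same pattern as the ua_generator example above:

[dependencies]
spider = { version = "1.8.0", features = ["time"] }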
Full Changelog: v1.7.22...v1.8.0
Spider v1.7.22
What's Changed
Other Changes
Updated the docs on respecting robots.txt and added missing rust-docs.
Full Changelog: v1.7.19...v1.7.22
Spider v1.7.8
What's Changed
- chore(concurrency): add simultaneous multithreading detection by @j-mendez in #45
- perf(concurrency): increase default concurrency limit by @j-mendez in #46
- feat(cli): add comma-separated blacklist support (see the sketch below)
- chore(cli): fix rust verbose log output
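A sketch of passing a comma-separated blacklist from the CLI; the -b flag name and the example paths are assumptions for illustration:

# the -b blacklist flag and the paths are illustrative assumptions
spider -d https://rsseau.fr -b /license,/contact crawl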
About 0.5s of performance shaved off the benchmark compared to Spider v1.6.1.
Full Changelog: v1.7.3...v1.7.7
Full Changelog: v1.7.7...v1.7.8
Crawl sync option
- ability to crawl links synchronously, one by one.
extern crate spider;

use spider::website::Website;

fn main() {
    // crawl one by one
    let mut website: Website = Website::new("https://choosealicense.com");
    website.crawl_sync();
}
Or via the CLI:
spider -d https://rsseau.fr crawl -s
What's Changed
- chore(log): add crate log default logger by @j-mendez in #42
- feat(delay): add non-blocking delay scheduling by @j-mendez in #43 (see the sketch below)
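A minimal sketch of scheduling a crawl delay, assuming the configuration exposes a delay field in milliseconds (the field name and value here are assumptions for illustration):

extern crate spider;

use spider::website::Website;

fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    // delay between page requests in milliseconds; field name is an assumption
    website.configuration.delay = 250;
    website.crawl();
}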
Full Changelog: v1.6.1...v1.7.3
Spider v1.6.1
Performance Tuned
Crawler speed has been cranked up a notch, making this the fastest open-source crawler we have benchmarked. View the benchmarks in the CI action for results. If you know of any alternative crawlers, feel free to open an issue so we can add benchmark comparisons.
What's Changed
- test(bench): add self task execution of bench by @j-mendez in #40
- perf(links): filter dup links after async batch
- chore(delay): fix crawl delay thread groups
Full Changelog: v1.6.0...v1.6.1
Perf increased after commit 053eea4. Benchmarks run against crolly and node-crawler (the cases are about identical in implementation).
Crawl sync API fix
This release fixes the thread handling of async tasks with the client that is established on the main thread. Crawl speed improves drastically now that the client pool is released correctly between threads.
What's Changed
- chore(log): add log util by @j-mendez in #31
- feat(regex): add optional regex blacklisting by @j-mendez in #36 (see the sketch after this list)
- perf(crawl): improve crawl link exclusion by @j-mendez in #37
- perf(parsing): add async parallel page handling by @j-mendez in #38
- perf(client): fix blocking and async mixture by @j-mendez in #39
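A minimal sketch of the optional regex blacklisting, assuming the regex feature is enabled in Cargo.toml and that the configuration exposes a blacklist_url list (the field name and type here are assumptions for illustration):

extern crate spider;

use spider::website::Website;

fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // with the regex feature enabled, entries are treated as patterns;
    // blacklist_url as a plain Vec<String> is an assumption
    website.configuration.blacklist_url.push("/licenses/.*".to_string());
    website.crawl();
}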
Full Changelog: v1.5.1...v1.6.0