Releases: spider-rs/spider
v1.46.5
v1.46.4
What's Changed
Crawling all domains found on the website is now possible by passing * in external_domains.
- feat(external): add wildcard handling all domains found #135
Example:
let mut website = Website::new("https://choosealicense.com");

website
    .with_external_domains(Some(Vec::from(["*"].map(|d| d.to_string())).into_iter()));
Use the crawl budget and blacklist features to help prevent infinite crawls:
website
    .with_blacklist_url(Some(Vec::from(["^/blog/".into()])))
    .with_budget(Some(spider::hashbrown::HashMap::from([("*", 300), ("/licenses", 10)])));
Thank you @sebs for the help!
Full Changelog: v1.45.10...v1.46.0
v1.45.10
What's Changed
You can now use crawl budgeting and external domain grouping with the CLI.
- feat(cli): add crawl budgeting
- feat(cli): add external domains grouping
Example:
spider --domain https://choosealicense.com --budget "*,1" crawl -o
# ["https://choosealicense.com"]
Example of grouping domains when crawling with the CLI:
spider --domain https://choosealicense.com -E https://loto.rsseau.fr/ crawl -o
Full Changelog: v1.45.8...v1.45.9
v1.45.8
What's Changed
Crawl budgets limit the number of pages crawled per path so a crawl does not exceed the page limit. It is possible to set the budget at a path depth such as /a/b/c. Use the budget feature flag to enable it.
- feat(budget): add crawl budgeting pages
- chore(chrome): add fast chrome redirect determination without the extra page.url().await CDP call
Example:
use spider::tokio;
use spider::website::Website;
use std::io::Error;

#[tokio::main]
async fn main() -> Result<(), Error> {
    let mut website = Website::new("https://rsseau.fr")
        .with_budget(Some(spider::hashbrown::HashMap::from([
            ("*", 15),
            ("en", 11),
            ("fr", 3),
        ])))
        .build()?;

    website.crawl().await;

    let links = website.get_links();

    for link in links.iter() {
        println!("- {:?}", link.as_ref());
    }

    println!("Total pages: {}", links.len());

    Ok(())
}
// - "https://rsseau.fr/en/tag/google"
// - "https://rsseau.fr/en/blog/debug-nodejs-with-vscode"
// - "https://rsseau.fr/en/books"
// - "https://rsseau.fr/en/blog"
// - "https://rsseau.fr/en/tag/zip"
// - "https://rsseau.fr/books"
// - "https://rsseau.fr/en/resume"
// - "https://rsseau.fr/en/"
// - "https://rsseau.fr/en/tag/wpa"
// - "https://rsseau.fr/en/blog/express-typescript"
// - "https://rsseau.fr/en/blog/zip-active-storage"
// - "https://rsseau.fr"
// - "https://rsseau.fr/blog"
// - "https://rsseau.fr/en"
// - "https://rsseau.fr/fr"
// Total pages: 15
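For a nested path depth, the budget key can include the full path. Below is a minimal sketch assuming nested keys such as /en/blog behave like the single-segment keys above; the limits are arbitrary.
use spider::tokio;
use spider::website::Website;
use std::io::Error;

#[tokio::main]
async fn main() -> Result<(), Error> {
    // "*" caps the whole crawl while the nested "/en/blog" entry caps that subtree.
    let mut website = Website::new("https://rsseau.fr")
        .with_budget(Some(spider::hashbrown::HashMap::from([
            ("*", 30),
            ("/en/blog", 5),
        ])))
        .build()?;

    website.crawl().await;

    println!("Total pages: {}", website.get_links().len());

    Ok(())
}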
Full Changelog: v1.43.1...v1.45.8
v1.43.1
What's Changed
You can now get the final redirect destination from pages with page.get_url_final. If you need to check whether a redirect was performed, you can access page.final_redirect_destination to get the Option.
- feat(page): add page redirect destination exposure #127
Example:
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            // return final redirect if found or the url used for the request
            println!("{:?}", res.get_url_final());
        }
    });

    website.crawl().await;
}
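If you only care about the redirect Option itself, here is a minimal sketch; it assumes final_redirect_destination is accessible on the received page as described above and is None when no redirect happened.
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            // Some(url) only when the request was redirected.
            if let Some(destination) = page.final_redirect_destination.as_ref() {
                println!("{} redirected to {}", page.get_url(), destination);
            }
        }
    });

    website.crawl().await;
}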
Thank you @matteoredaelli and @joksas for the issue and help!
Full Changelog: v1.42.1...v1.43.1
v1.42.3
What's Changed
- feat(external): add external domains grouping #135
- chore(website): website build method to perform validations with the builder chain
- chore(cli): fix links json output
- chore(glob): fix link callback #136
The example below gathers links from different domains.
use spider::tokio;
use spider::website::Website;
use std::io::Error;
use std::time::Instant;

#[tokio::main]
async fn main() -> Result<(), Error> {
    let mut website = Website::new("https://rsseau.fr")
        .with_external_domains(Some(Vec::from(["http://loto.rsseau.fr/"].map(|d| d.to_string())).into_iter()))
        .build()?;

    let start = Instant::now();
    website.crawl().await;
    let duration = start.elapsed();

    let links = website.get_links();

    for link in links.iter() {
        println!("- {:?}", link.as_ref());
    }

    println!(
        "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
        duration,
        links.len()
    );

    Ok(())
}
Thank you @roniemartinez and @sebs for the help!
Full Changelog: v1.41.1...v1.42.3
v1.41.1
What's Changed
The sitemap feature flag was added to include pages found in the sitemap in the results. Currently the links found on those pages are not crawled.
- feat(sitemap): add sitemap crawling feature flag
If you want to set a custom sitemap location, use website.configuration.sitemap_url.
Example
website.configuration.sitemap_url = Some(Box::new("sitemap.xml".into()));
The builder method to adjust the location will be available in the next version; it was accidentally left out.
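A minimal end-to-end sketch, assuming the sitemap feature flag is enabled and that a crawl picks up the sitemap pages once the location is configured as above:
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    // Custom sitemap location, relative to the domain.
    website.configuration.sitemap_url = Some(Box::new("sitemap.xml".into()));

    website.crawl().await;

    for link in website.get_links() {
        println!("- {:?}", link.as_ref());
    }
}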
Full Changelog: v1.40.6...v1.41.1
v1.40.6
What's Changed
- feat(chrome): enable chrome rendering page content [experimental]
- chore(crawl): remove crawl sync method for Sequential crawls
If you need crawls to be sequential, use configuration.delay or website.with_delay(1) and set any value greater than 0.
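A minimal sketch of a sequential crawl using the delay, assuming the builder call shown above; the value just needs to be greater than 0.
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    // Any delay greater than 0 makes the requests run one after another.
    website.with_delay(1);

    website.crawl().await;

    println!("Total pages: {}", website.get_links().len());
}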
Headless
Use the chrome feature flag for headless crawling and chrome_headed for headful crawling.
Chrome installations are detected automatically on the OS. The current implementation uses chromiumoxide and handles HTML as raw strings, so downloading media will not be ideal since the bytes may be invalid. The chrome feature does not work with the decentralized flag at the moment.
The video below shows 200+ pages being handled within a couple of seconds; headless runs drastically faster. Try to only use headed mode for debugging.
Screen.Recording.2023-09-10.at.8.41.44.PM.mov
Full Changelog: v1.37.7...v1.40.6
v1.37.7
What's Changed
- feat(page): add byte storing resource by @j-mendez in #131
- chore(pages): fix full_resource flag gathering scripts
- chore(cli): fix resource extensions [130]
- chore(full_resources): fix capturing link tag
- chore(page): fix trailing slash url getter
You can now get the bytes from Page to store as a valid resource.
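A short sketch of reading the stored bytes from subscribed pages; the get_bytes accessor name is an assumption here and may differ from the actual Page API.
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            // get_bytes is assumed for illustration: the raw response bytes
            // that can be written to disk as a valid resource.
            if let Some(bytes) = page.get_bytes() {
                println!("{} -> {} bytes", page.get_url(), bytes.len());
            }
        }
    });

    website.crawl().await;
}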
Thank you @Byter09 for the help!
Full Changelog: v1.36.5...v1.37.7
v1.36.5
What's Changed
- feat(sync): add broadcast channel subscriptions
- perf(css): improve link tree parsing nodes by @j-mendez in #126
- chore(cli): add regex and full_resources flags
Subscriptions
Available with the sync feature [enabled by default].
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            println!("{:?}", res.get_url());
        }
    });

    website.crawl().await;
}
If you need the events to finish first, you can spawn the website.crawl().await call and then await rx2.recv() on the main task.
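A minimal sketch of that ordering, assuming website can be moved into the spawned task: the crawl runs in the background while the receiver is drained on the main task until the channel closes.
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    let mut rx2 = website.subscribe(16).unwrap();

    // Crawl in a spawned task so the main task can drain every event.
    let crawl = tokio::spawn(async move {
        website.crawl().await;
    });

    // Ends once the crawl finishes and the sender side is dropped.
    while let Ok(res) = rx2.recv().await {
        println!("{:?}", res.get_url());
    }

    let _ = crawl.await;
}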
Full Changelog: v1.34.2...v1.36.3