
Releases: spider-rs/spider

v1.46.5

28 Oct 17:52
c14cd6c

What's Changed

  • chore(page): fix subdomain entry point handling root by @j-mendez in #146

Full Changelog: v1.46.4...v1.46.5

v1.46.4

25 Sep 16:29

What's Changed

Crawling all domains found on a website is now possible by using * in external_domains.

  1. feat(external): add wildcard handling all domains found #135

Example:

let mut website = Website::new("https://choosealicense.com");

website
    .with_external_domains(Some(Vec::from(["*"].map(|d| d.to_string())).into_iter()));

Use the crawl budget and blacklist features to help prevent infinite crawls:

website
    .with_blacklist_url(Some(Vec::from(["^/blog/".into()])))
    .with_budget(Some(spider::hashbrown::HashMap::from([("*", 300), ("/licenses", 10)])));

Thank you @sebs for the help!

Full Changelog: v1.45.10...v1.46.0

v1.45.10

23 Sep 23:06

What's Changed

You can now use crawl budgeting and external domain grouping with the CLI.

  1. feat(cli): add crawl budgeting
  2. feat(cli): add external domains grouping

Example:

spider --domain https://choosealicense.com --budget "*,1" crawl -o
# ["https://choosealicense.com"]

Example of grouping external domains when crawling with the CLI:

spider --domain https://choosealicense.com -E https://loto.rsseau.fr/ crawl -o

Full Changelog: v1.45.8...v1.45.9

v1.45.8

16 Sep 16:52

What's Changed

Crawl budgets limit how many pages are crawled per path, preventing a path from exceeding its page limit. A budget can also be set for a nested path such as /a/b/c.
Use the budget feature flag to enable it.

  • feat(budget): add crawl budgeting pages
  • chore(chrome): add fast chrome redirect determination without using page.url().await extra CDP call

Example:

use spider::tokio;
use spider::website::Website;
use std::io::Error;

#[tokio::main]
async fn main() -> Result<(), Error> {
    let mut website = Website::new("https://rsseau.fr")
        .with_budget(Some(spider::hashbrown::HashMap::from([
            ("*", 15),
            ("en", 11),
            ("fr", 3),
        ])))
        .build()?;

    website.crawl().await;

    let links = website.get_links();

    for link in links.iter() {
        println!("- {:?}", link.as_ref());
    }

    println!("Total pages: {}", links.len());

    Ok(())
}
// - "https://rsseau.fr/en/tag/google"
// - "https://rsseau.fr/en/blog/debug-nodejs-with-vscode"
// - "https://rsseau.fr/en/books"
// - "https://rsseau.fr/en/blog"
// - "https://rsseau.fr/en/tag/zip"
// - "https://rsseau.fr/books"
// - "https://rsseau.fr/en/resume"
// - "https://rsseau.fr/en/"
// - "https://rsseau.fr/en/tag/wpa"
// - "https://rsseau.fr/en/blog/express-typescript"
// - "https://rsseau.fr/en/blog/zip-active-storage"
// - "https://rsseau.fr"
// - "https://rsseau.fr/blog"
// - "https://rsseau.fr/en"
// - "https://rsseau.fr/fr"
// Total pages: 15

Full Changelog: v1.43.1...v1.45.8

v1.43.1

16 Sep 11:57

What's Changed

You can now get the final redirect destination of a page with page.get_url_final. If you need to check whether a redirect was performed, access page.final_redirect_destination to get the Option.

  • feat(page): add page redirect destination exposure #127

Example:

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            // return final redirect if found or the url used for the request
            println!("{:?}", res.get_url_final());
        }
    });

    website.crawl().await;
}
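
If you only need to know whether a redirect happened, the minimal sketch below reads the Option directly; it assumes final_redirect_destination is a public Option<String> field on the received page, per the note above.

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            // Assumed to be Some(url) only when the request was redirected.
            match &res.final_redirect_destination {
                Some(destination) => println!("redirected to {}", destination),
                None => println!("no redirect for {:?}", res.get_url()),
            }
        }
    });

    website.crawl().await;
}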

Thank you @matteoredaelli and @joksas for the issue and help!

Full Changelog: v1.42.1...v1.43.1

v1.42.3

15 Sep 14:57

What's Changed

  • feat(external): add external domains grouping #135
  • chore(website): website build method to perform validations with the builder chain
  • chore(cli): fix links json output
  • chore(glob): fix link callback #136

The example below gathers links from different domains.

use spider::tokio;
use spider::website::Website;
use std::io::Error;
use std::time::Instant;

#[tokio::main]
async fn main() -> Result<(), Error>{
    let mut website = Website::new("https://rsseau.fr")
        .with_external_domains(Some(Vec::from(["http://loto.rsseau.fr/"].map(|d| d.to_string())).into_iter()))
        .build()?;

    let start = Instant::now();
    website.crawl().await;
    let duration = start.elapsed();

    let links = website.get_links();

    for link in links.iter() {
        println!("- {:?}", link.as_ref());
    }

    println!(
        "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
        duration,
        links.len()
    );

    Ok(())
}

Thank you @roniemartinez and @sebs for the help!

Full Changelog: v1.41.1...v1.42.3

v1.41.1

14 Sep 12:56

What's Changed

The sitemap feature flag was added to include pages found in the sitemap in the results. Currently, the links found on those pages are not crawled.

  1. feat(sitemap): add sitemap crawling feature flag

If you want to set a custom sitemap location, set website.configuration.sitemap_url.

Example:

website.configuration.sitemap_url = Some(Box::new("sitemap.xml".into()));

The builder method to adjust the location will be available in the next version; it was accidentally left out.

Full Changelog: v1.40.6...v1.41.1

v1.40.6

11 Sep 00:56

What's Changed

  • feat(chrome): enable chrome rendering page content [experimental]
  • chore(crawl): remove crawl sync method for Sequential crawls

If you need crawls to be sequential, set configuration.delay or use website.with_delay(1) with any value greater than 0, as in the sketch below.
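
A minimal sketch of the delay approach, assuming with_delay takes the delay value directly as described above:

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");

    // Any delay greater than 0 forces page requests to run one after another.
    website.with_delay(1);

    website.crawl().await;
}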

Headless

Use the feature flag chrome for headless and chrome_headed for headful crawling.

Chrome installations are detected automatically on the OS. The current implementation uses chromiumoxide and handles HTML as raw strings, so downloading media is not ideal since the bytes may be invalid. The chrome feature does not work with the decentralized flag at the moment.

The video below shows 200+ pages being handled within a couple of seconds; headless runs drastically faster.
Try to use headed mode only for debugging.

Screen.Recording.2023-09-10.at.8.41.44.PM.mov

Full Changelog: v1.37.7...v1.40.6

v1.37.7

29 Aug 20:52

What's Changed

  • feat(page): add byte storing resource by @j-mendez in #131
  • chore(pages): fix full_resource flag gathering scripts
  • chore(cli): fix resource extensions [130]
  • chore(full_resources): fix capturing link tag
  • chore(page): fix trailing slash url getter

You can now get the bytes from Page to store as a valid resource.
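
A minimal sketch of reading the stored bytes over a subscription; the accessor name get_bytes and its return type are assumptions here, so check the Page docs for the exact API:

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            // Assumed accessor: returns the raw response body when it was stored.
            if let Some(bytes) = page.get_bytes() {
                println!("{:?} -> {} bytes", page.get_url(), bytes.len());
            }
        }
    });

    website.crawl().await;
}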

Thank you @Byter09 for the help!

Full Changelog: v1.36.5...v1.37.7

v1.36.5

27 Aug 15:18

What's Changed

  • feat(sync): add broadcast channel subscriptions
  • perf(css): improve link tree parsing nodes by @j-mendez in #126
  • chore(cli): add regex and full_resources flags

Subscriptions 🚀

With the sync feature enabled (enabled by default):

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            println!("{:?}", res.get_url());
        }
    });

    website.crawl().await;
}

If you need the receiver to handle every event, you can instead spawn website.crawl().await and call rx2.recv().await in the current task, as sketched below.
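
A minimal sketch of that inverted setup; it assumes the broadcast sender is dropped when the spawned crawl task finishes, which closes the receiver and ends the loop:

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    let mut rx2 = website.subscribe(16).unwrap();

    // Run the crawl in a task and receive events in the current task.
    let crawl = tokio::spawn(async move {
        website.crawl().await;
    });

    // recv() returns Err once the crawl task completes and the sender is dropped (assumed).
    while let Ok(res) = rx2.recv().await {
        println!("{:?}", res.get_url());
    }

    let _ = crawl.await;
}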

Full Changelog: v1.34.2...v1.36.3