Releases: spider-rs/spider
v1.46.5
v1.46.4
What's Changed
Crawling all domains found on the website is now possible by passing * in external_domains.
- feat(external): add wildcard handling all domains found #135
Example:
let mut website = Website::new("https://choosealicense.com");

website
    .with_external_domains(Some(Vec::from(["*"].map(|d| d.to_string())).into_iter()));
Use the crawl budget and blacklist features to help prevent infinite crawls:
website
    .with_blacklist_url(Some(Vec::from(["^/blog/".into()])))
    .with_budget(Some(spider::hashbrown::HashMap::from([("*", 300), ("/licenses", 10)])));
Thank you @sebs for the help!
Full Changelog: v1.45.10...v1.46.0
v1.45.10
What's Changed
You can now use crawl budgeting and external domain grouping with the CLI.
- feat(cli): add crawl budgeting
- feat(cli): add external domains grouping
Example:
spider --domain https://choosealicense.com --budget "*,1" crawl -o
# ["https://choosealicense.com"]
Example of grouping domains when crawling with the CLI:
spider --domain https://choosealicense.com -E https://loto.rsseau.fr/ crawl -o
Full Changelog: v1.45.8...v1.45.9
v1.45.8
What's Changed
Crawl budgets limit the number of pages crawled per path so a crawl does not exceed the page limit. It is possible to set the budget at a path depth such as /a/b/c. Use the budget feature flag to enable it.
- feat(budget): add crawl budgeting pages
- chore(chrome): add fast chrome redirect determination without the extra page.url().await CDP call
Example:
use spider::tokio;
use spider::website::Website;
use std::io::Error;

#[tokio::main]
async fn main() -> Result<(), Error> {
    let mut website = Website::new("https://rsseau.fr")
        .with_budget(Some(spider::hashbrown::HashMap::from([
            ("*", 15),
            ("en", 11),
            ("fr", 3),
        ])))
        .build()?;

    website.crawl().await;

    let links = website.get_links();

    for link in links.iter() {
        println!("- {:?}", link.as_ref());
    }

    println!("Total pages: {}", links.len());

    Ok(())
}
// - "https://rsseau.fr/en/tag/google"
// - "https://rsseau.fr/en/blog/debug-nodejs-with-vscode"
// - "https://rsseau.fr/en/books"
// - "https://rsseau.fr/en/blog"
// - "https://rsseau.fr/en/tag/zip"
// - "https://rsseau.fr/books"
// - "https://rsseau.fr/en/resume"
// - "https://rsseau.fr/en/"
// - "https://rsseau.fr/en/tag/wpa"
// - "https://rsseau.fr/en/blog/express-typescript"
// - "https://rsseau.fr/en/blog/zip-active-storage"
// - "https://rsseau.fr"
// - "https://rsseau.fr/blog"
// - "https://rsseau.fr/en"
// - "https://rsseau.fr/fr"
// Total pages: 15
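For a nested path depth, the budget key can include the full path. Below is a minimal sketch assuming nested keys such as /en/blog behave like the single-segment keys above; the limits are arbitrary.
use spider::tokio;
use spider::website::Website;
use std::io::Error;

#[tokio::main]
async fn main() -> Result<(), Error> {
    // "*" caps the whole crawl while the nested "/en/blog" entry caps that subtree.
    let mut website = Website::new("https://rsseau.fr")
        .with_budget(Some(spider::hashbrown::HashMap::from([
            ("*", 30),
            ("/en/blog", 5),
        ])))
        .build()?;

    website.crawl().await;

    println!("Total pages: {}", website.get_links().len());

    Ok(())
}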
Full Changelog: v1.43.1...v1.45.8
v1.43.1
What's Changed
You can now get the final redirect destination from pages with page.get_url_final. If you need to check whether a redirect was performed, you can access page.final_redirect_destination to get the Option.
- feat(page): add page redirect destination exposure #127
Example:
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            // return final redirect if found or the url used for the request
            println!("{:?}", res.get_url_final());
        }
    });

    website.crawl().await;
}
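If you only care about the redirect Option itself, here is a minimal sketch; it assumes final_redirect_destination is accessible on the received page as described above and is None when no redirect happened.
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            // Some(url) only when the request was redirected.
            if let Some(destination) = page.final_redirect_destination.as_ref() {
                println!("{} redirected to {}", page.get_url(), destination);
            }
        }
    });

    website.crawl().await;
}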
Thank you @matteoredaelli and @joksas for the issue and help!
Full Changelog: v1.42.1...v1.43.1
v1.42.3
What's Changed
- feat(external): add external domains grouping #135
- chore(website): website build method to perform validations with the builder chain
- chore(cli): fix links json output
- chore(glob): fix link callback #136
The example below gathers links from different domains.
use spider::tokio;
use spider::website::Website;
use std::io::Error;
use std::time::Instant;

#[tokio::main]
async fn main() -> Result<(), Error> {
    let mut website = Website::new("https://rsseau.fr")
        .with_external_domains(Some(Vec::from(["http://loto.rsseau.fr/"].map(|d| d.to_string())).into_iter()))
        .build()?;

    let start = Instant::now();
    website.crawl().await;
    let duration = start.elapsed();

    let links = website.get_links();

    for link in links.iter() {
        println!("- {:?}", link.as_ref());
    }

    println!(
        "Time elapsed in website.crawl() is: {:?} for total pages: {:?}",
        duration,
        links.len()
    );

    Ok(())
}
Thank you @roniemartinez and @sebs for the help!
Full Changelog: v1.41.1...v1.42.3
v1.41.1
What's Changed
The sitemap feature flag was added to include pages found in the sitemap in the results. Currently the links found on those pages are not crawled.
- feat(sitemap): add sitemap crawling feature flag
If you want to set a custom sitemap location, use website.configuration.sitemap_url.
Example
website.configuration.sitemap_url = Some(Box::new("sitemap.xml".into()));
The builder method to adjust the location will be available in the next version; it was accidentally left out.
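A minimal end-to-end sketch, assuming the sitemap feature flag is enabled and that a crawl picks up the sitemap pages once the location is configured as above:
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    // Custom sitemap location, relative to the domain.
    website.configuration.sitemap_url = Some(Box::new("sitemap.xml".into()));

    website.crawl().await;

    for link in website.get_links() {
        println!("- {:?}", link.as_ref());
    }
}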
Full Changelog: v1.40.6...v1.41.1
v1.40.6
What's Changed
- feat(chrome): enable chrome rendering page content [experimental]
- chore(crawl): remove crawl sync method for Sequential crawls
If you need crawls to be sequential, use configuration.delay or website.with_delay(1) and set any value greater than 0.
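A minimal sketch of a sequential crawl using the delay, assuming the builder call shown above; the value just needs to be greater than 0.
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    // Any delay greater than 0 makes the requests run one after another.
    website.with_delay(1);

    website.crawl().await;

    println!("Total pages: {}", website.get_links().len());
}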
Headless
Use the chrome feature flag for headless crawling and chrome_headed for headful crawling.
Chrome installations are detected automatically on the OS. The current implementation uses chromiumoxide and handles HTML as raw strings, so downloading media will not be ideal since the bytes may be invalid. The chrome feature does not work with the decentralized flag at the moment.
The video below shows 200+ pages being handled within a couple of seconds; headless runs drastically faster. Try to only use headed mode for debugging.
Screen.Recording.2023-09-10.at.8.41.44.PM.mov
Full Changelog: v1.37.7...v1.40.6
v1.37.7
What's Changed
- feat(page): add byte storing resource by @j-mendez in #131
- chore(pages): fix full_resource flag gathering scripts
- chore(cli): fix resource extensions [130]
- chore(full_resources): fix capturing link tag
- chore(page): fix trailing slash url getter
You can now get the bytes from Page to store as a valid resource.
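A short sketch of reading the stored bytes from subscribed pages; the get_bytes accessor name is an assumption here and may differ from the actual Page API.
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(page) = rx2.recv().await {
            // get_bytes is assumed for illustration: the raw response bytes
            // that can be written to disk as a valid resource.
            if let Some(bytes) = page.get_bytes() {
                println!("{} -> {} bytes", page.get_url(), bytes.len());
            }
        }
    });

    website.crawl().await;
}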
Thank you @Byter09 for the help!
Full Changelog: v1.36.5...v1.37.7
v1.36.5
What's Changed
- feat(sync): add broadcast channel subscriptions
- perf(css): improve link tree parsing nodes by @j-mendez in #126
- chore(cli): add regex and full_resources flags
Subscriptions
Available with the sync feature [enabled by default].
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            println!("{:?}", res.get_url());
        }
    });

    website.crawl().await;
}
If you need the events to finish first, you can spawn the website.crawl().await call and then await rx2.recv() on the main task.
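A minimal sketch of that ordering, assuming website can be moved into the spawned task: the crawl runs in the background while the receiver is drained on the main task until the channel closes.
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    let mut rx2 = website.subscribe(16).unwrap();

    // Crawl in a spawned task so the main task can drain every event.
    let crawl = tokio::spawn(async move {
        website.crawl().await;
    });

    // Ends once the crawl finishes and the sender side is dropped.
    while let Ok(res) = rx2.recv().await {
        println!("{:?}", res.get_url());
    }

    let _ = crawl.await;
}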
Full Changelog: v1.34.2...v1.36.3