feat(sitemap): add sitemap crawling feature flag

j-mendez committed Sep 14, 2023 (parent 3b5e227, commit 3680887)

Showing 10 changed files with 292 additions and 31 deletions.
93 changes: 88 additions & 5 deletions Cargo.lock

(generated lockfile; diff not rendered)

4 changes: 2 additions & 2 deletions examples/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "spider_examples"
-version = "1.40.13"
+version = "1.41.0"
authors = ["madeindjs <[email protected]>", "j-mendez <[email protected]>"]
description = "Multithreaded web crawler written in Rust."
repository = "https://github.com/spider-rs/spider"
@@ -22,7 +22,7 @@ htr = "0.5.27"
flexbuffers = "2.0.0"

[dependencies.spider]
-version = "1.40.13"
+version = "1.41.0"
path = "../spider"
features = ["serde"]

4 changes: 3 additions & 1 deletion spider/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "spider"
-version = "1.40.13"
+version = "1.41.0"
authors = ["madeindjs <[email protected]>", "j-mendez <[email protected]>"]
description = "The fastest web crawler written in Rust."
repository = "https://github.com/spider-rs/spider"
@@ -43,6 +43,7 @@ itertools = { version = "0.10.5", optional = true }
case_insensitive_string = { version = "0.1.6", features = [ "compact", "serde" ]}
jsdom = { version = "0.0.11-alpha.1", optional = true, features = [ "hashbrown", "tokio" ] }
chromiumoxide_fork = { version = "0.5.9", optional = true, features = ["tokio-runtime", "bytes"], default-features = false }
+sitemap = { version = "0.4.1", optional = true }

[target.'cfg(all(not(windows), not(target_os = "android"), not(target_env = "musl")))'.dependencies]
tikv-jemallocator = { version = "0.5.0", optional = true }
@@ -62,6 +63,7 @@ serde = ["dep:serde", "hashbrown/serde", "compact_str/serde"]
fs = ["tokio/fs"]
full_resources = []
socks = ["reqwest/socks"]
+sitemap = ["dep:sitemap"]
js = ["dep:jsdom"]
chrome = ["dep:chromiumoxide_fork"]
chrome_headed = ["chrome"]
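The feature wiring in this Cargo.toml follows Cargo's optional-dependency pattern. As a minimal sketch of just the relevant pieces (version numbers copied from the diff above):

```toml
[dependencies]
sitemap = { version = "0.4.1", optional = true }

[features]
# `dep:sitemap` pulls in the optional crate only when this feature is
# enabled, without exposing an implicit feature named after the dependency.
sitemap = ["dep:sitemap"]
```

Building with `cargo build --features sitemap` then compiles the `sitemap` crate in; a plain `cargo build` leaves it out entirely.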
13 changes: 7 additions & 6 deletions spider/README.md
@@ -16,7 +16,7 @@ This is a basic async example crawling a web page, add spider to your `Cargo.tom

```toml
[dependencies]
-spider = "1.40.13"
+spider = "1.41.0"
```

And then the code:
@@ -87,7 +87,7 @@ We have a couple optional feature flags. Regex blacklisting, jemaloc backend, gl

```toml
[dependencies]
-spider = { version = "1.40.13", features = ["regex", "ua_generator"] }
+spider = { version = "1.41.0", features = ["regex", "ua_generator"] }
```

1. `ua_generator`: Enables auto generating a random real User-Agent.
@@ -102,6 +102,7 @@ spider = { version = "1.40.13", features = ["regex", "ua_generator"] }
1. `glob`: Enables [url glob](https://everything.curl.dev/cmdline/globbing) support.
1. `fs`: Enables storing resources to disk for parsing (may greatly increase performance at the cost of temp storage). Enabled by default.
1. `js`: Enables javascript parsing links created with the alpha [jsdom](https://github.com/a11ywatch/jsdom) crate.
+1. `sitemap`: Include sitemap pages in results.
1. `time`: Enables duration tracking per page.
1. `chrome`: Enables chrome headless rendering, use the env var `CHROME_URL` to connect remotely [experimental].
1. `chrome_headed`: Enables headful chrome rendering [experimental].
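With the flag published, the new feature is opted into like any other entry in the list above; a minimal sketch using the version number from this commit:

```toml
[dependencies]
spider = { version = "1.41.0", features = ["sitemap"] }
```

Combining it with other flags (e.g. `features = ["sitemap", "regex"]`) works the same way as the examples elsewhere in this README.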
@@ -114,7 +115,7 @@ Move processing to a worker, drastically increases performance even if worker is

```toml
[dependencies]
-spider = { version = "1.40.13", features = ["decentralized"] }
+spider = { version = "1.41.0", features = ["decentralized"] }
```

```sh
@@ -135,7 +136,7 @@ Use the subscribe method to get a broadcast channel.

```toml
[dependencies]
-spider = { version = "1.40.13", features = ["sync"] }
+spider = { version = "1.41.0", features = ["sync"] }
```

```rust,no_run
@@ -165,7 +166,7 @@ Allow regex for blacklisting routes

```toml
[dependencies]
-spider = { version = "1.40.13", features = ["regex"] }
+spider = { version = "1.41.0", features = ["regex"] }
```

```rust,no_run
@@ -192,7 +193,7 @@ If you are performing large workloads you may need to control the crawler by ena

```toml
[dependencies]
-spider = { version = "1.40.13", features = ["control"] }
+spider = { version = "1.41.0", features = ["control"] }
```

```rust
3 changes: 3 additions & 0 deletions spider/src/configuration.rs
@@ -32,6 +32,9 @@ pub struct Configuration {
pub proxies: Option<Box<Vec<String>>>,
/// Headers to include with request.
pub headers: Option<Box<reqwest::header::HeaderMap>>,
+#[cfg(feature = "sitemap")]
+/// Include a sitemap in response of the crawl
+pub sitemap_url: Option<Box<CompactString>>,
}

/// Get the user agent from the top agent list randomly.
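The pattern this diff uses is a feature-gated struct field: `sitemap_url` exists only when the crate is built with `--features sitemap`, so any code touching it must be gated the same way. Below is a hedged, standalone sketch of that pattern, not the crate's real `Configuration` (other fields are elided and a plain `String` stands in for `CompactString`):

```rust
/// Simplified stand-in for spider's `Configuration` (illustrative only).
#[derive(Debug, Default)]
pub struct Configuration {
    /// Proxies, headers, etc. elided; one plain field kept for context.
    pub respect_robots_txt: bool,
    /// Compiled in only when the `sitemap` feature is enabled.
    #[cfg(feature = "sitemap")]
    pub sitemap_url: Option<Box<String>>,
}

fn main() {
    let conf = Configuration::default();
    // Uses of the gated field must sit behind the same cfg gate, or
    // builds without `--features sitemap` fail to compile.
    #[cfg(feature = "sitemap")]
    println!("sitemap_url = {:?}", conf.sitemap_url);
    println!("{:?}", conf);
}
```

Compiled without the feature (e.g. plain `rustc`), both the field and the gated `println!` are simply absent from the resulting binary.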
5 changes: 4 additions & 1 deletion spider/src/lib.rs
@@ -55,10 +55,13 @@
//! - `socks`: Enables socks5 proxy support.
//! - `glob`: Enables [url glob](https://everything.curl.dev/cmdline/globbing) support.
//! - `fs`: Enables storing resources to disk for parsing (may greatly increase performance at the cost of temp storage). Enabled by default.
+//! - `sitemap`: Include sitemap pages in results.
//! - `js`: Enables javascript parsing links created with the alpha [jsdom](https://github.com/a11ywatch/jsdom) crate.
//! - `time`: Enables duration tracking per page.
-//! - `chrome`: Enables chrome headless rendering [experimental].
+//! - `chrome`: Enables chrome headless rendering, use the env var `CHROME_URL` to connect remotely [experimental].
//! - `chrome_headed`: Enables headful chrome rendering [experimental].
+//! - `chrome_cpu`: Disable gpu usage for chrome browser.
+//! - `chrome_stealth`: Enables stealth mode to make it harder to be detected as a bot.
pub extern crate bytes;
pub extern crate compact_str;
(diffs for the remaining changed files were not loaded in this view)
