0.0.21
j-mendez committed Nov 29, 2023
1 parent 97188f8 commit b7e235b
Showing 24 changed files with 76 additions and 38 deletions.
20 changes: 11 additions & 9 deletions README.md
@@ -31,6 +31,8 @@ const website = new Website("https://rsseau.fr")
.withBudget({
// limit up to 200 pages crawled for the entire website
"*": 200,
// limit only 10 pages on the docs page
"/docs": 10
})
.withBlacklistUrl([new RegExp("/books").source, "/resume"])
.build();
@@ -78,17 +80,17 @@ const website = new Website("https://choosealicense.com").withCron(
"1/5 * * * * *",
);
// sleep function to test cron
const stopCron = (time: number, handle: Cron) => {
const stopCron = (time: number, handle) => {
return new Promise((resolve) => {
setTimeout(() => {
resolve(handle.stop());
}, time);
});
};

const links: NPage[] = [];
const links = [];

const onPageEvent = (err: Error | null, value: NPage) => {
const onPageEvent = (err, value) => {
links.push(value);
};

@@ -103,12 +105,14 @@ Use the crawl shortcut to get the page content and url.
```ts
import { crawl } from "@spider-rs/spider-rs";

const { links, pages } = new crawl("https://rsseau.fr");
const { links, pages } = await crawl("https://rsseau.fr");
console.log(pages);
```

## Benchmarks

Spider is about 1,000x (small websites) 10,000x (medium websites), and 100,000x (production grade websites) times faster than the popular crawlee library even with the node port performance hits.

```sh
----------------------
mac Apple M1 Max
@@ -119,19 +123,17 @@ mac Apple M1 Max
```

Test url: `https://choosealicense.com` (small)

32 pages

| | `libraries` |
| `libraries` | `speed` |
| :-------------------------------- | :-------------------- |
| **`spider-rs: crawl 10 samples`** | `286ms`(✅ **1.00x**) |
| **`crawlee: crawl 10 samples`** | `1s` (✅ **1.00x**) |
| **`crawlee: crawl 10 samples`** | `1.7s` (✅ **1.00x**) |

Test url: `https://rsseau.fr` (medium)

211 pages

| | `libraries` |
| `libraries` | `speed` |
| :-------------------------------- | :-------------------- |
| **`spider-rs: crawl 10 samples`** | `2.5s` (✅ **1.00x**) |
| **`crawlee: crawl 10 samples`** | `75s` (✅ **1.00x**) |
2 changes: 1 addition & 1 deletion book/src/README.md
@@ -7,7 +7,7 @@ Spider powers some big tools and helps bring the crawling aspect to almost no do
```ts
import { Website } from "@spider-rs/spider-rs";

const website = new Website("https://rsseau.fr");
const website = new Website("https://choosealicense.com");

await website.crawl();

4 changes: 4 additions & 0 deletions book/src/SUMMARY.md
@@ -18,3 +18,7 @@
- [Crawl](./crawl.md)
- [Scrape](./scrape.md)
- [Cron Job](./cron-job.md)

# Benchmarks

- [Compare](./benchmarks.md)
34 changes: 34 additions & 0 deletions book/src/benchmarks.md
@@ -0,0 +1,34 @@
# Benchmarks

The speed of Spider-RS ported compared to other tools.

Spider is about 1,000x (small websites) 10,000x (medium websites), and 100,000x (production grade websites) times faster than the popular crawlee library even with the node port performance hits.

```sh
----------------------
mac Apple M1 Max
10-core CPU
64 GB of RAM memory
1 TB of SSD disk space
-----------------------
```

Test url: `https://choosealicense.com` (small)
32 pages

| `libraries` | `speed` |
| :-------------------------------- | :-------------------- |
| **`spider-rs: crawl 10 samples`** | `286ms`(✅ **1.00x**) |
| **`crawlee: crawl 10 samples`** | `1.7s` (✅ **1.00x**) |

Test url: `https://rsseau.fr` (medium)
211 pages

| `libraries` | `speed` |
| :-------------------------------- | :-------------------- |
| **`spider-rs: crawl 10 samples`** | `2.5s` (✅ **1.00x**) |
| **`crawlee: crawl 10 samples`** | `75s` (✅ **1.00x**) |

The performance scales the larger the website and if throttling is needed.

Linux benchmarks are about 10x faster than macOS for spider-rs.
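
As a rough cross-check of the two tables above, the ratio of crawlee time to spider-rs time can be computed directly from the listed timings. The sketch below is illustrative only: `timingsMs` and `speedup` are hypothetical names introduced here, not part of the spider-rs API, and the numbers are simply the table values converted to milliseconds.

```ts
// Timings copied from the benchmark tables, converted to milliseconds.
// The object shape and names are assumptions made for this example.
const timingsMs: Record<string, { spider: number; crawlee: number }> = {
  "choosealicense.com (small, 32 pages)": { spider: 286, crawlee: 1_700 },
  "rsseau.fr (medium, 211 pages)": { spider: 2_500, crawlee: 75_000 },
};

// Speedup factor = crawlee time / spider-rs time, rounded to one decimal.
const speedup = (spider: number, crawlee: number): number =>
  Math.round((crawlee / spider) * 10) / 10;

for (const [site, t] of Object.entries(timingsMs)) {
  console.log(`${site}: ${speedup(t.spider, t.crawlee)}x`);
}
```

For these two sites the listed timings work out to roughly 5.9x and 30x; any larger multiplier would have to come from measurements that are not shown in these tables.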
6 changes: 3 additions & 3 deletions book/src/crawl.md
@@ -70,9 +70,9 @@ website.unsubscribe(subscriptionID);

## Headless Chrome

Headless Chrome rendering can be done by setting the third param in `crawl` or `scrape` to `true`.
It will attempt to connect to chrome running remotely if the `CHROME_URL` env variable is set with chrome launching as a fallback. Using a remote connection with `CHROME_URL` will
drastically speed up runs.
Headless Chrome rendering can be done by setting the third param in `crawl` or `scrape` to `true`.
It will attempt to connect to chrome running remotely if the `CHROME_URL` env variable is set with chrome launching as a fallback. Using a remote connection with `CHROME_URL` will
drastically speed up runs.

```ts
import { Website } from "@spider-rs/spider-rs";
2 changes: 1 addition & 1 deletion book/src/env.md
@@ -8,4 +8,4 @@ You can set the chrome URL to connect remotely.

```sh
CHROME_URL=http://localhost:9222
```
```
5 changes: 1 addition & 4 deletions book/src/page.md
@@ -22,7 +22,6 @@ await page.fetch();
get all the links related to a page.

```ts

const page = new Page("https://choosealicense.com", false, false);
await page.fetch();
const links = await page.getLinks();
@@ -34,7 +33,6 @@ console.log(links);
Get the markup for the page or HTML.

```ts

const page = new Page("https://choosealicense.com", false, false);
await page.fetch();
const html = page.getHtml();
@@ -46,9 +44,8 @@ console.log(html);
Get the raw bytes of a page to store the files in a database.

```ts

const page = new Page("https://choosealicense.com", false, false);
await page.fetch();
const bytes = page.getBytes();
console.log(bytes);
```
```
6 changes: 3 additions & 3 deletions book/src/scrape.md
@@ -16,8 +16,8 @@ console.log(website.getPages());

## Headless Chrome

Headless Chrome rendering can be done by setting the third param in `crawl` or `scrape` to `true`.
It will attempt to connect to chrome running remotely if the `CHROME_URL` env variable is set with chrome launching as a fallback. Using a remote connection with `CHROME_URL` will
Headless Chrome rendering can be done by setting the third param in `crawl` or `scrape` to `true`.
It will attempt to connect to chrome running remotely if the `CHROME_URL` env variable is set with chrome launching as a fallback. Using a remote connection with `CHROME_URL` will
drastically speed up runs.

```ts
@@ -31,4 +31,4 @@ const onPageEvent = (err, value) => {

// all params are optional. The third param determines headless rendering.
await website.scrape(onPageEvent, false, true);
```
```
5 changes: 3 additions & 2 deletions book/src/simple.md
@@ -15,7 +15,7 @@ A basic example.
```ts
import { Website } from "@spider-rs/spider-rs";

const website = new Website("https://rsseau.fr");
const website = new Website("https://choosealicense.com");

await website.crawl();
console.log(website.getLinks());
@@ -28,14 +28,15 @@ You can pass a function that could be async as param to `crawl` and `scrape`.
```ts
import { Website, type NPage } from "@spider-rs/spider-rs";

const website = new Website("https://rsseau.fr");
const website = new Website("https://choosealicense.com");

const links: NPage[] = [];

const onPageEvent = (err: Error | null, value: NPage) => {
links.push(value);
};

// params in order event, background, and headless chrome
await website.crawl(onPageEvent);
console.log(website.getLinks());
```
2 changes: 1 addition & 1 deletion npm/android-arm-eabi/package.json
@@ -1,6 +1,6 @@
{
"name": "@spider-rs/spider-rs-android-arm-eabi",
"version": "0.0.20",
"version": "0.0.21",
"repository": "https://github.com/spider-rs/spider-nodejs",
"os": [
"android"
2 changes: 1 addition & 1 deletion npm/android-arm64/package.json
@@ -1,6 +1,6 @@
{
"name": "@spider-rs/spider-rs-android-arm64",
"version": "0.0.20",
"version": "0.0.21",
"repository": "https://github.com/spider-rs/spider-nodejs",
"os": [
"android"
2 changes: 1 addition & 1 deletion npm/darwin-arm64/package.json
@@ -1,6 +1,6 @@
{
"name": "@spider-rs/spider-rs-darwin-arm64",
"version": "0.0.20",
"version": "0.0.21",
"repository": "https://github.com/spider-rs/spider-nodejs",
"os": [
"darwin"
2 changes: 1 addition & 1 deletion npm/darwin-universal/package.json
@@ -1,6 +1,6 @@
{
"name": "@spider-rs/spider-rs-darwin-universal",
"version": "0.0.20",
"version": "0.0.21",
"repository": "https://github.com/spider-rs/spider-nodejs",
"os": [
"darwin"
2 changes: 1 addition & 1 deletion npm/darwin-x64/package.json
@@ -1,6 +1,6 @@
{
"name": "@spider-rs/spider-rs-darwin-x64",
"version": "0.0.20",
"version": "0.0.21",
"repository": "https://github.com/spider-rs/spider-nodejs",
"os": [
"darwin"
2 changes: 1 addition & 1 deletion npm/freebsd-x64/package.json
@@ -1,6 +1,6 @@
{
"name": "@spider-rs/spider-rs-freebsd-x64",
"version": "0.0.20",
"version": "0.0.21",
"repository": "https://github.com/spider-rs/spider-nodejs",
"os": [
"freebsd"
2 changes: 1 addition & 1 deletion npm/linux-arm-gnueabihf/package.json
@@ -1,6 +1,6 @@
{
"name": "@spider-rs/spider-rs-linux-arm-gnueabihf",
"version": "0.0.20",
"version": "0.0.21",
"repository": "https://github.com/spider-rs/spider-nodejs",
"os": [
"linux"
2 changes: 1 addition & 1 deletion npm/linux-arm64-gnu/package.json
@@ -1,6 +1,6 @@
{
"name": "@spider-rs/spider-rs-linux-arm64-gnu",
"version": "0.0.20",
"version": "0.0.21",
"repository": "https://github.com/spider-rs/spider-nodejs",
"os": [
"linux"
2 changes: 1 addition & 1 deletion npm/linux-arm64-musl/package.json
@@ -1,6 +1,6 @@
{
"name": "@spider-rs/spider-rs-linux-arm64-musl",
"version": "0.0.20",
"version": "0.0.21",
"repository": "https://github.com/spider-rs/spider-nodejs",
"os": [
"linux"
2 changes: 1 addition & 1 deletion npm/linux-x64-gnu/package.json
@@ -1,6 +1,6 @@
{
"name": "@spider-rs/spider-rs-linux-x64-gnu",
"version": "0.0.20",
"version": "0.0.21",
"repository": "https://github.com/spider-rs/spider-nodejs",
"os": [
"linux"
2 changes: 1 addition & 1 deletion npm/linux-x64-musl/package.json
@@ -1,6 +1,6 @@
{
"name": "@spider-rs/spider-rs-linux-x64-musl",
"version": "0.0.20",
"version": "0.0.21",
"repository": "https://github.com/spider-rs/spider-nodejs",
"os": [
"linux"
2 changes: 1 addition & 1 deletion npm/win32-arm64-msvc/package.json
@@ -1,6 +1,6 @@
{
"name": "@spider-rs/spider-rs-win32-arm64-msvc",
"version": "0.0.20",
"version": "0.0.21",
"repository": "https://github.com/spider-rs/spider-nodejs",
"os": [
"win32"
2 changes: 1 addition & 1 deletion npm/win32-ia32-msvc/package.json
@@ -1,6 +1,6 @@
{
"name": "@spider-rs/spider-rs-win32-ia32-msvc",
"version": "0.0.20",
"version": "0.0.21",
"repository": "https://github.com/spider-rs/spider-nodejs",
"os": [
"win32"
2 changes: 1 addition & 1 deletion npm/win32-x64-msvc/package.json
@@ -1,6 +1,6 @@
{
"name": "@spider-rs/spider-rs-win32-x64-msvc",
"version": "0.0.20",
"version": "0.0.21",
"repository": "https://github.com/spider-rs/spider-nodejs",
"os": [
"win32"
2 changes: 1 addition & 1 deletion package.json
@@ -1,6 +1,6 @@
{
"name": "@spider-rs/spider-rs",
"version": "0.0.20",
"version": "0.0.21",
"main": "index.js",
"types": "index.d.ts",
"napi": {