# chore(builder): add depth and caching

Showing 12 changed files with 275 additions and 20 deletions.
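
The commit title points at new builder options for crawl depth and response caching. A minimal sketch of how they might be used, assuming chainable `with_depth` and `with_caching` methods on `Website` (names inferred from the commit title; they do not appear in the docs below):

```py
import asyncio
from spider_rs import Website

async def main():
    # Assumed builder methods: with_depth caps how many link levels
    # are followed, with_caching enables response caching.
    website = Website("https://rsseau.fr").with_depth(3).with_caching(True)
    website.crawl()
    print(website.get_links())

asyncio.run(main())
```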

# Usage

- [Crawl](./crawl.md)
- [Scrape](./scrape.md)
- [Cron Job](./cron-job.md)

# Crawl

Crawl a website concurrently.

```py
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://rsseau.fr")
    website.crawl()
    print(website.get_links())

asyncio.run(main())
```

## Async Event

You can pass an async function (or any callable) as the first argument to `crawl` to stream real-time updates as pages are processed.

```py
import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl(Subscription())

asyncio.run(main())
```
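
The example above uses a callable class; a plain `async def` callback is a minimal alternative sketch, assuming `crawl` accepts any callable as the text above states:

```py
import asyncio
from spider_rs import Website

async def on_page(page):
    # same page fields as in the class-based example above
    print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl(on_page)

asyncio.run(main())
```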

## Background

You can run the crawl in the background and still receive events by setting the second argument to `True`.

```py
import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl(Subscription(), True)
    # returns immediately since the crawl runs in the background

asyncio.run(main())
```
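
Because the call returns immediately, the process can exit before any events arrive. A sketch that keeps the event loop alive long enough to observe them; the sleep duration is an arbitrary placeholder, not part of the API:

```py
import asyncio
from spider_rs import Website

class Subscription:
    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl(Subscription(), True)
    # placeholder wait so background events have time to stream in
    await asyncio.sleep(10)

asyncio.run(main())
```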

## Subscriptions

You can set up multiple subscriptions to run callbacks while a crawl happens.

```py
import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl()
    subscription_id = website.subscribe(Subscription())
    website.crawl()
    website.unsubscribe(subscription_id)

asyncio.run(main())
```
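
Since `subscribe` returns an id per callback, several subscriptions can be attached and detached independently. A sketch using a hypothetical named variant of the `Subscription` class:

```py
import asyncio
from spider_rs import Website

class NamedSubscription:
    def __init__(self, name):
        self.name = name
    def __call__(self, page):
        print(self.name + ": " + page.url)

async def main():
    website = Website("https://choosealicense.com")
    logger_id = website.subscribe(NamedSubscription("logger"))
    audit_id = website.subscribe(NamedSubscription("audit"))
    website.crawl()
    website.unsubscribe(logger_id)
    website.unsubscribe(audit_id)

asyncio.run(main())
```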

## Headless Chrome

Headless Chrome rendering can be enabled by setting the third argument of `crawl` or `scrape` to `True`. If the `CHROME_URL` environment variable is set, the crawler attempts to connect to that remote Chrome instance, falling back to launching Chrome locally. Using a remote connection via `CHROME_URL` can drastically speed up runs.

```py
import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl(Subscription(), False, True)

asyncio.run(main())
```
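
A sketch of pointing the crawler at a remote Chrome instance via `CHROME_URL`; the address below is a placeholder for a running DevTools endpoint, and passing `None` for the callback mirrors the `scrape(None, None, True)` call shown further down:

```py
import asyncio
import os
from spider_rs import Website

async def main():
    # placeholder address; set this to a real remote Chrome endpoint
    os.environ["CHROME_URL"] = "http://localhost:9222"
    website = Website("https://choosealicense.com")
    website.crawl(None, False, True)

asyncio.run(main())
```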

# Scrape

Scrape a website and collect the resource data.

```py
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com")
    website.scrape()
    print(website.get_pages())
    # [ { url: "https://rsseau.fr/blog", html: "<html>...</html>"}, ...]

asyncio.run(main())
```
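
A short sketch of iterating the collected pages, assuming each page exposes `url` and `html` attributes as the sample output above suggests:

```py
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com")
    website.scrape()
    for page in website.get_pages():
        # attribute names assumed from the sample output above
        print(page.url, len(page.html))

asyncio.run(main())
```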

## Headless Chrome

Headless Chrome rendering can be enabled by setting the third argument of `crawl` or `scrape` to `True`. If the `CHROME_URL` environment variable is set, the crawler attempts to connect to that remote Chrome instance, falling back to launching Chrome locally. Using a remote connection via `CHROME_URL` can drastically speed up runs.

```py
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com")
    website.scrape(None, None, True)
    print(website.get_pages())
    # [ { url: "https://rsseau.fr/blog", html: "<html>...</html>"}, ...]

asyncio.run(main())
```

# Simple Example

We use [pyo3](https://pyo3.rs/v0.20.0/) to port the Rust project to Python.

The bindings add some performance overhead, but even so the crawls are lightning fast and efficient.

## Usage

The examples below can help you get started with spider.

### Basic

```python
import asyncio

from spider_rs import Website

async def main():
    website = Website("https://jeffmendez.com")
    website.crawl()
    print(website.links)
    # print(website.pages)

asyncio.run(main())
```

### Events

You can pass a callable object (which may be async) as an argument to `crawl` and `scrape`.

```py
import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl(Subscription())

asyncio.run(main())
```

### Selector

The `title` method allows you to extract the title of the page.

```py
import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - title: " + str(page.title()))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl(Subscription())

asyncio.run(main())
```
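
A variation that collects the titles into a list instead of printing them, reusing the `page.title()` method from the example above:

```py
import asyncio
from spider_rs import Website

class TitleCollector:
    def __init__(self):
        self.titles = []
    def __call__(self, page):
        # page.title() as shown in the selector example above
        self.titles.append(page.title())

async def main():
    website = Website("https://choosealicense.com")
    collector = TitleCollector()
    website.crawl(collector)
    print(collector.titles)

asyncio.run(main())
```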

## Shortcut

You can use the `crawl` shortcut function to collect contents quickly without configuration.

```py
import asyncio

from spider_rs import crawl

async def main():
    website = await crawl("https://jeffmendez.com")
    print(website.links)
    # print(website.pages)

asyncio.run(main())
```
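
Because the shortcut is awaitable, several sites can be crawled concurrently with `asyncio.gather`; a sketch assuming the same signature as above:

```py
import asyncio

from spider_rs import crawl

async def main():
    # crawl multiple sites concurrently via the awaitable shortcut
    results = await asyncio.gather(
        crawl("https://jeffmendez.com"),
        crawl("https://rsseau.fr"),
    )
    for website in results:
        print(len(website.links))

asyncio.run(main())
```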