Using storage_state to Pre-Load Cookies and LocalStorage

Crawl4ai’s AsyncWebCrawler lets you preserve and reuse session data, including cookies and localStorage, across multiple runs. By providing a storage_state, you can start your crawls already “logged in” or with any other necessary session data—no need to repeat the login flow every time.

What is storage_state?

storage_state can be:

  • A dictionary containing cookies and localStorage data.
  • A path to a JSON file that holds this information.

When you pass storage_state to the crawler, it applies these cookies and localStorage entries before loading any pages. This means your crawler effectively starts in a known authenticated or pre-configured state.

Example Structure

Here’s an example storage state:

{
  "cookies": [
    {
      "name": "session",
      "value": "abcd1234",
      "domain": "example.com",
      "path": "/",
      "expires": 1675363572.037711,
      "httpOnly": false,
      "secure": false,
      "sameSite": "None"
    }
  ],
  "origins": [
    {
      "origin": "https://example.com",
      "localStorage": [
        { "name": "token", "value": "my_auth_token" },
        { "name": "refreshToken", "value": "my_refresh_token" }
      ]
    }
  ]
}

This JSON sets a session cookie and two localStorage entries (token and refreshToken) for https://example.com.
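
If you want to keep this state on disk for the file-based approach shown below, one way is to build the same structure in Python and dump it as JSON. This is a minimal sketch, assuming you want to produce the mystate.json file used in a later example:

import json

# The example state from above, expressed as a Python dict
state = {
    "cookies": [{
        "name": "session", "value": "abcd1234",
        "domain": "example.com", "path": "/",
        "expires": 1675363572.037711,
        "httpOnly": False, "secure": False, "sameSite": "None"
    }],
    "origins": [{
        "origin": "https://example.com",
        "localStorage": [
            {"name": "token", "value": "my_auth_token"},
            {"name": "refreshToken", "value": "my_refresh_token"}
        ]
    }]
}

# Write it out so it can later be passed as storage_state="mystate.json"
with open("mystate.json", "w") as f:
    json.dump(state, f, indent=2)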


Passing storage_state as a Dictionary

You can directly provide the data as a dictionary:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    storage_dict = {
        "cookies": [
            {
                "name": "session",
                "value": "abcd1234",
                "domain": "example.com",
                "path": "/",
                "expires": 1675363572.037711,
                "httpOnly": False,
                "secure": False,
                "sameSite": "None"
            }
        ],
        "origins": [
            {
                "origin": "https://example.com",
                "localStorage": [
                    {"name": "token", "value": "my_auth_token"},
                    {"name": "refreshToken", "value": "my_refresh_token"}
                ]
            }
        ]
    }

    async with AsyncWebCrawler(
        headless=True,
        storage_state=storage_dict
    ) as crawler:
        result = await crawler.arun(url='https://example.com/protected')
        if result.success:
            print("Crawl succeeded with pre-loaded session data!")
            print("Page HTML length:", len(result.html))

if __name__ == "__main__":
    asyncio.run(main())

Passing storage_state as a File

If you prefer a file-based approach, save the JSON above to mystate.json and reference it:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(
        headless=True,
        storage_state="mystate.json"  # Uses a JSON file instead of a dictionary
    ) as crawler:
        result = await crawler.arun(url='https://example.com/protected')
        if result.success:
            print("Crawl succeeded with pre-loaded session data!")
            print("Page HTML length:", len(result.html))

if __name__ == "__main__":
    asyncio.run(main())

Using storage_state to Avoid Repeated Logins (Sign In Once, Use Later)

A common scenario is needing to log in to a site (entering a username/password, etc.) to reach protected pages. Repeating that login on every crawl is cumbersome. Instead, you can:

  1. Perform the login once in a hook.
  2. After login completes, export the resulting storage_state to a file.
  3. On subsequent runs, provide that storage_state to skip the login step.

Step-by-Step Example:

First Run (Perform Login and Save State):

import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def on_browser_created_hook(browser):
    # Access the default context and create a page
    context = browser.contexts[0]
    page = await context.new_page()
    
    # Navigate to the login page
    await page.goto("https://example.com/login", wait_until="domcontentloaded")
    
    # Fill in credentials and submit
    await page.fill("input[name='username']", "myuser")
    await page.fill("input[name='password']", "mypassword")
    await page.click("button[type='submit']")
    await page.wait_for_load_state("networkidle")
    
    # Now the site sets tokens in localStorage and cookies
    # Export this state to a file so we can reuse it
    await context.storage_state(path="my_storage_state.json")
    await page.close()

async def main():
    # First run: perform login and export the storage_state
    async with AsyncWebCrawler(
        headless=True,
        verbose=True,
        hooks={"on_browser_created": on_browser_created_hook},
        use_persistent_context=True,
        user_data_dir="./my_user_data"
    ) as crawler:
        
        # After on_browser_created_hook runs, we have storage_state saved to my_storage_state.json
        result = await crawler.arun(
            url='https://example.com/protected-page',
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True}),
        )
        print("First run result success:", result.success)
        if result.success:
            print("Protected page HTML length:", len(result.html))

if __name__ == "__main__":
    asyncio.run(main())

Second Run (Reuse Saved State, No Login Needed):

import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Second run: no need to hook on_browser_created this time.
    # Just provide the previously saved storage state.
    async with AsyncWebCrawler(
        headless=True,
        verbose=True,
        use_persistent_context=True,
        user_data_dir="./my_user_data",
        storage_state="my_storage_state.json"  # Reuse previously exported state
    ) as crawler:
        
        # Now the crawler starts already logged in
        result = await crawler.arun(
            url='https://example.com/protected-page',
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True}),
        )
        print("Second run result success:", result.success)
        if result.success:
            print("Protected page HTML length:", len(result.html))

if __name__ == "__main__":
    asyncio.run(main())

What’s Happening Here?

  • During the first run, the on_browser_created_hook logs into the site.
  • After logging in, the crawler exports the current session (cookies, localStorage, etc.) to my_storage_state.json.
  • On subsequent runs, passing storage_state="my_storage_state.json" starts the browser context with these tokens already in place, skipping the login steps.
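
If you want to double-check what the first run actually captured, you can open my_storage_state.json and list the cookie names and localStorage keys it contains. A minimal sketch, using only the standard library:

import json

# Load the state exported by the first run and summarize its contents
with open("my_storage_state.json") as f:
    state = json.load(f)

for cookie in state.get("cookies", []):
    print("cookie:", cookie["name"], "for domain", cookie["domain"])

for origin in state.get("origins", []):
    keys = [entry["name"] for entry in origin.get("localStorage", [])]
    print("localStorage for", origin["origin"], "->", keys)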

Sign Out Scenario:
If the website allows you to sign out by clearing tokens or by navigating to a sign-out URL, you can also run a script that uses on_browser_created_hook or arun to simulate signing out, then export the resulting storage_state again. That would give you a baseline “logged out” state to start fresh from next time.
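
Here is a minimal sketch of that idea, following the same hook pattern as the first-run example. The sign-out URL (https://example.com/logout) and the output file name are placeholders; adjust them to match your site's actual sign-out flow:

import asyncio
from crawl4ai import AsyncWebCrawler

async def on_browser_created_sign_out(browser):
    # Same pattern as the login hook, but navigate to the sign-out URL instead
    context = browser.contexts[0]
    page = await context.new_page()

    # Placeholder URL: use whatever clears the session on your site
    await page.goto("https://example.com/logout", wait_until="networkidle")

    # Export the now "logged out" state as a clean baseline for future runs
    await context.storage_state(path="logged_out_state.json")
    await page.close()

async def main():
    async with AsyncWebCrawler(
        headless=True,
        hooks={"on_browser_created": on_browser_created_sign_out},
        use_persistent_context=True,
        user_data_dir="./my_user_data",
        storage_state="my_storage_state.json"  # Start from the logged-in state
    ) as crawler:
        result = await crawler.arun(url="https://example.com")
        print("Sign-out run success:", result.success)

if __name__ == "__main__":
    asyncio.run(main())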


Conclusion

By using storage_state, you can skip repetitive actions, like logging in, and jump straight into crawling protected content. Whether you provide a file path or a dictionary, this powerful feature helps maintain state between crawls, simplifying your data extraction pipelines.