
Slow performance of crawl4AI in Docker compared to pip installation outside Docker environment #329

Open
QuangTQV opened this issue Dec 9, 2024 · 3 comments

@QuangTQV

QuangTQV commented Dec 9, 2024

crawl4AI runs noticeably slower for me inside Docker than it does from a regular pip installation outside Docker. Could a configuration or environment issue be causing this difference? Please let me know if there are errors or optimizations I may have overlooked.

@unclecode
Owner

@QuangTQV Can you share your specs? Is the machine AMD or ARM, how much memory have you assigned to Docker, and what hardware are you running it on?

@unclecode unclecode self-assigned this Dec 9, 2024
@QuangTQV
Author

> @QuangTQV Can you share with me the specs; is it AMD or ARM, and also how much memory you assign to your Docker? Do you know that, and on which hardware are you running it? I'm curious to know it.

Sorry, I was mistaken about the performance.
I have a different question now: how can I get markdown back? The response contains only HTML, and the markdown field is empty. This is the request I send:


{
  "urls": "https://www.dienmayxanh.com/",
  "word_count_threshold": 1,
  "extraction_config": {
    "type": "basic",
    "params": {}
  },
  "chunking_strategy": {
    "type": "string",
    "params": {}
  },
  "content_filter": {
    "type": "bm25",
    "params": {}
  },
  "js_code": [
    "string"
  ],
  "wait_for": "string",
  "css_selector": "string",
  "screenshot": false,
  "magic": false,
  "extra": {},
  "session_id": "string",
  "cache_mode": "enabled",
  "priority": 5,
  "ttl": 3600,
  "crawler_params": {}
}
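
Several fields in the request above (`chunking_strategy.type`, `js_code`, `wait_for`, `css_selector`, `session_id`) hold the literal value "string", which looks like an API-explorer placeholder rather than an intended setting. Whether that explains the empty markdown is not confirmed anywhere in this thread. As a minimal sketch, assuming those placeholders are unintended, the helper below drops every top-level field that contains one before the request is sent. The names `contains_placeholder` and `drop_placeholder_fields` are hypothetical and not part of crawl4AI.

```python
import json

# Assumption: "string" is an unintended placeholder value, not a real setting.
PLACEHOLDER = "string"


def contains_placeholder(value):
    """Return True if value is the placeholder or holds it anywhere inside."""
    if isinstance(value, str):
        return value == PLACEHOLDER
    if isinstance(value, dict):
        return any(contains_placeholder(v) for v in value.values())
    if isinstance(value, list):
        return any(contains_placeholder(v) for v in value)
    return False


def drop_placeholder_fields(payload):
    """Return a copy of payload without top-level fields that hold placeholders."""
    return {k: v for k, v in payload.items() if not contains_placeholder(v)}


# A shortened version of the request from the comment above.
request = {
    "urls": "https://www.dienmayxanh.com/",
    "word_count_threshold": 1,
    "chunking_strategy": {"type": "string", "params": {}},
    "js_code": ["string"],
    "wait_for": "string",
    "css_selector": "string",
    "session_id": "string",
    "cache_mode": "enabled",
}

print(json.dumps(drop_placeholder_fields(request), indent=2))
```

Dropping the whole top-level field, rather than only the placeholder inside it, avoids leaving behind half-configured objects such as a `chunking_strategy` with no `type`.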

@unclecode
Owner

@QuangTQV Here is how to get markdown with the new version:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Configure the browser settings
    browser_config = BrowserConfig(
        headless=True,
        verbose=True,
        user_agent_mode="random",
    )

    # Set run configurations, including cache mode and markdown generator
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=DefaultMarkdownGenerator(
            # content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0),
            # options={"ignore_links": True}
        )
    )

    async with AsyncWebCrawler(browser_config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://www.kidocode.com/degrees/technology',
            config=crawl_config
        )

        if result.success:
            print("Raw Markdown Length:", len(result.markdown_v2.raw_markdown))
            print("Citations Markdown Length:", len(result.markdown_v2.markdown_with_citations))
            # Fit markdown is only populated when a content filter is passed to the generator
            # print("Fit Markdown Length:", len(result.markdown_v2.fit_markdown))
        else:
            # Report why the crawl failed instead of exiting silently
            print("Crawl failed:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
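
To check which markdown variants actually came back, the three attributes used in the snippet above (`raw_markdown`, `markdown_with_citations`, `fit_markdown`) can be inspected defensively. This is only a sketch: `markdown_lengths` is a hypothetical helper, and the `SimpleNamespace` object is a stand-in for a real crawl result so the example runs without a browser or network access.

```python
from types import SimpleNamespace


def markdown_lengths(result):
    """Return the length of each markdown variant, treating missing ones as 0."""
    md = getattr(result, "markdown_v2", None)
    fields = ("raw_markdown", "markdown_with_citations", "fit_markdown")
    # getattr with a default also covers md being None; "or ''" covers None values
    return {name: len(getattr(md, name, None) or "") for name in fields}


# Stand-in for a crawl result (assumption: mirrors only the attributes used above).
stub = SimpleNamespace(
    markdown_v2=SimpleNamespace(
        raw_markdown="# Title\n",
        markdown_with_citations="# Title\n",
        fit_markdown=None,  # empty because no content filter was configured
    )
)

print(markdown_lengths(stub))
```

A zero for `fit_markdown` alongside non-zero values for the other two would be consistent with the content filter being left commented out in the generator.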
