Slow performance of crawl4AI in Docker compared to pip installation outside Docker environment #329

QuangTQV · 2024-12-09T02:32:24Z

I am encountering slow performance when using crawl4AI in a Docker environment, whereas when I test it outside of Docker using the regular pip installation, the speed is significantly faster. Could there be any configuration or environment issues causing this discrepancy in performance? Please let me know if there are any errors or optimizations I may have overlooked.

unclecode · 2024-12-09T12:41:47Z

@QuangTQV Can you share your specs? Is the machine AMD or ARM, how much memory have you assigned to Docker, and what hardware are you running it on?

QuangTQV · 2024-12-12T08:02:18Z

I was mistaken, sorry.
But now, how can I get markdown back? The response only contains HTML, and the markdown field is empty.

{
    "urls": "https://www.dienmayxanh.com/",
    "word_count_threshold": 1,
    "extraction_config": {
        "type": "basic",
        "params": {}
    },
    "chunking_strategy": {
        "type": "string",
        "params": {}
    },
    "content_filter": {
        "type": "bm25",
        "params": {}
    },
    "js_code": [
        "string"
    ],
    "wait_for": "string",
    "css_selector": "string",
    "screenshot": false,
    "magic": false,
    "extra": {},
    "session_id": "string",
    "cache_mode": "enabled",
    "priority": 5,
    "ttl": 3600,
    "crawler_params": {}
}
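
Several fields in the request body above (`js_code`, `wait_for`, `css_selector`, `session_id`, and the `chunking_strategy` type) still hold the literal value `"string"`, which looks like a schema placeholder rather than an intended setting. Whether the server treats these as harmful is an assumption, not something confirmed in this thread, but a selector or wait condition named `string` is unlikely to match anything on the page. A minimal sketch, using only the standard library, that drops such entries before the body is sent (the helper names `is_placeholder` and `strip_placeholders` are hypothetical):

```python
import json

# Request body as posted above, including the "string" placeholders.
payload = {
    "urls": "https://www.dienmayxanh.com/",
    "word_count_threshold": 1,
    "extraction_config": {"type": "basic", "params": {}},
    "chunking_strategy": {"type": "string", "params": {}},
    "content_filter": {"type": "bm25", "params": {}},
    "js_code": ["string"],
    "wait_for": "string",
    "css_selector": "string",
    "screenshot": False,
    "magic": False,
    "extra": {},
    "session_id": "string",
    "cache_mode": "enabled",
    "priority": 5,
    "ttl": 3600,
    "crawler_params": {},
}

# Assumption: the literal "string" is a leftover placeholder, not a real value.
PLACEHOLDER = "string"


def is_placeholder(value):
    """True if value is the placeholder, a non-empty list made only of
    placeholders, or a dict whose "type" is the placeholder."""
    if isinstance(value, str):
        return value == PLACEHOLDER
    if isinstance(value, list):
        return len(value) > 0 and all(is_placeholder(v) for v in value)
    if isinstance(value, dict):
        return value.get("type") == PLACEHOLDER
    return False


def strip_placeholders(body):
    """Return a copy of body without top-level placeholder entries."""
    return {key: value for key, value in body.items() if not is_placeholder(value)}


cleaned = strip_placeholders(payload)
print(json.dumps(cleaned, indent=2))
```

The cleaned body keeps `urls`, the extraction and filter settings, and the cache options, and omits the five placeholder fields, so the server falls back to whatever defaults it defines for them.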

unclecode · 2024-12-13T12:28:49Z

@QuangTQV The code below is how to use the new version:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Configure the browser settings
    browser_config = BrowserConfig(
        headless=True,
        verbose=True,
        user_agent_mode="random",
    )

    # Set run configurations, including cache mode and markdown generator
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=DefaultMarkdownGenerator(
            # content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0),
            # options={"ignore_links": True}
        )
    )

    async with AsyncWebCrawler(browser_config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://www.kidocode.com/degrees/technology',
            config=crawl_config
        )

        if result.success:
            print("Raw Markdown Length:", len(result.markdown_v2.raw_markdown))
            print("Citations Markdown Length:", len(result.markdown_v2.markdown_with_citations))
            # Fit markdown exists if you pass content filter
            # print("Fit Markdown Length:", len(result.markdown_v2.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())
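
As the comment in the snippet above notes, `fit_markdown` is only populated when a content filter is passed, so reading it unconditionally can fail or return nothing. A minimal sketch of a fallback, assuming only the attribute names used above (`raw_markdown`, `fit_markdown`); the helper `pick_markdown` is hypothetical, and `SimpleNamespace` objects stand in for `result.markdown_v2` so the example runs without a crawl:

```python
from types import SimpleNamespace


def pick_markdown(md):
    """Return (variant_name, text): fit_markdown when it is populated,
    otherwise raw_markdown, otherwise an empty string."""
    fit = getattr(md, "fit_markdown", None)
    if fit:
        return "fit_markdown", fit
    raw = getattr(md, "raw_markdown", None)
    if raw:
        return "raw_markdown", raw
    return "none", ""


# Stand-ins for result.markdown_v2 (illustrative values, not real crawl output).
without_filter = SimpleNamespace(raw_markdown="# Title\n\nFull page text", fit_markdown=None)
with_filter = SimpleNamespace(raw_markdown="# Title\n\nFull page text", fit_markdown="# Title")

for md in (without_filter, with_filter):
    name, text = pick_markdown(md)
    print(name, len(text))
```

In the crawler code this would be called as `pick_markdown(result.markdown_v2)` after checking `result.success`.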

unclecode self-assigned this Dec 9, 2024
