
When running in docker container #354

Open
Jevli opened this issue Dec 17, 2024 · 4 comments
Labels: question (Further information is requested)

Jevli commented Dec 17, 2024

Hi,

When I'm running crawl4ai in a Docker container I get two odd errors. The first one is from the logger:

Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/PROJECT_NAME-9TtSrW0h-py3.13/lib/python3.13/site-packages/crawl4ai/async_crawler_strategy.py", line 123, in _monitor_browser_process
    self.logger.error(
    ^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'error'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/PROJECT_NAME-9TtSrW0h-py3.13/lib/python3.13/site-packages/crawl4ai/async_crawler_strategy.py", line 141, in _monitor_browser_process
    self.logger.error(
    ^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'error'
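For context, this looks like the common optional-logger failure mode: a monitor task dereferences self.logger without checking for None. A minimal sketch of the pattern and a defensive fallback (hypothetical class, not crawl4ai's actual internals):

import logging

class BrowserMonitorSketch:
    def __init__(self, logger=None):
        # Falling back to a stdlib logger keeps .error() callable even when
        # no logger was injected; leaving the attribute None reproduces the
        # AttributeError above.
        self.logger = logger or logging.getLogger("crawl4ai.sketch")

    def report_crash(self, message: str) -> None:
        self.logger.error("Browser process error: %s", message)

BrowserMonitorSketch().report_crash("browser exited unexpectedly")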

Also the other one:

THIS vvv IS solved, see my comments below.

Problem with asyncio.gather() when running multiple crawlers...

Error in crawler collect_PAGE_gig_data: BrowserType.connect_over_cdp: connect ECONNREFUSED 127.0.0.1:9222
Call log:
  - <ws preparing> retrieving websocket url from http://localhost:9222

Call log:
  - <ws preparing> retrieving websocket url from http://localhost:9222
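Worth noting: ECONNREFUSED on 127.0.0.1:9222 means nothing was serving the Chrome DevTools Protocol on that port when connect_over_cdp() tried to attach. A quick way to make that explicit is to probe the standard CDP HTTP endpoint first; a hedged sketch (endpoint and timeout are assumptions, not crawl4ai behavior):

import urllib.request

def cdp_is_up(endpoint: str = "http://localhost:9222") -> bool:
    try:
        # /json/version is the standard CDP discovery endpoint
        with urllib.request.urlopen(f"{endpoint}/json/version", timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

if not cdp_is_up():
    print("No browser is listening on 9222; start one or fix the port mapping.")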
Jevli commented Dec 17, 2024

Also, when trying to debug with screenshot=True in crawler.arun(), I got the following error:
[screenshot of the error attached in the original issue]

unclecode (Owner) commented Dec 17, 2024
@Jevli Hi, you're right, it's definitely a weird bug. For the second one, can you share your code snippet with me? As for the Docker issue, I will check tomorrow or the day after tomorrow to see why it behaves that way.

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Configure the browser settings
    browser_config = BrowserConfig(headless=True, verbose=True)

    # Set run configurations
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        screenshot=True,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://kidocode.com/',
            config=crawl_config
        )

        if result.success:
            print("Raw Markdown Length:", len(result.markdown_v2.raw_markdown))
            print("Citations Markdown Length:", len(result.markdown_v2.markdown_with_citations))
            # print("Fit Markdown Length:", len(result.markdown_v2.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())

And this is the output:

[INIT].... → Crawl4AI 0.4.23
[WARNING]. ⚠ Both crawler_config and legacy parameters provided. crawler_config will take precedence.
[EXPORT].. ℹ Exporting PDF and taking screenshot took 0.18s
[FETCH]... ↓ https://kidocode.com/... | Status: True | Time: 3.22s
[SCRAPE].. ◆ Processed https://kidocode.com/... | Time: 2035ms
[COMPLETE] ● https://kidocode.com/... | Status: True | Total: 5.26s
Raw Markdown Length: 118984
Citations Markdown Length: 107444
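Side note: if the point of screenshot=True is debugging, the capture can be written to disk from the result. A minimal sketch, assuming result.screenshot holds a base64-encoded image as in the crawl4ai 0.4.x docs:

import base64

if result.success and result.screenshot:
    with open("kidocode.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))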

unclecode self-assigned this Dec 17, 2024
unclecode added the question (Further information is requested) label Dec 17, 2024
Jevli commented Dec 17, 2024

Dammit, I think the Chrome DevTools error was my own fault: it came from using asyncio.gather(). I removed that and got it working outside of Docker.
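A hedged sketch of that workaround: reuse one AsyncWebCrawler and await the runs one after another instead of launching several crawlers concurrently with asyncio.gather() (URLs and config values here are placeholders):

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def crawl_all(urls):
    browser_config = BrowserConfig(headless=True)
    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    results = []
    async with AsyncWebCrawler(config=browser_config) as crawler:
        for url in urls:  # sequential, instead of asyncio.gather()
            results.append(await crawler.arun(url=url, config=run_config))
    return results

# asyncio.run(crawl_all(["https://example.com", "https://example.org"]))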

Though I still have a problem in Docker. I will paste snippets tomorrow (I don't have more time today to "clean" the code :)

For the screenshot problem, here is the code that triggers it. If I set screenshot=False, the code works. I'm sorry this code looks awful; hopefully it's readable. I haven't had time to clean it up...

async def authenticate_and_collect(
    self, url: str, second_crawler_hooks={}, **kwargs
):
    # Setup crawler strategy
    browser_config = BrowserConfig(
        headless=True,
        use_persistent_context=True,
        user_data_dir="./states/browser_data",
        storage_state=self.storage_path,
    )

    crawler_config = CrawlerRunConfig(
        magic=True,
        screenshot=True,
        cache_mode=CacheMode.BYPASS,
        wait_until="domcontentloaded",
        page_timeout=10000,
        **kwargs,
    )

    async def after_goto(page: Page, context: BrowserContext):
        await page.wait_for_load_state("networkidle")

    crawler_strategy = AsyncPlaywrightCrawlerStrategy(browser_config=browser_config)
    hooks = {
        "on_browser_created": self.on_browser_created,
        "after_goto": after_goto,
        **second_crawler_hooks,
    }
    crawler_strategy.hooks = hooks

    # First crawler for authentication
    async with AsyncWebCrawler(
        config=browser_config, crawler_strategy=crawler_strategy
    ) as crawler:
        return await crawler.arun(
            url=url,
            config=crawler_config,
        )

Btw, **kwargs contains css_selector and extraction_strategy when I'm running it.
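For completeness, a hypothetical call site (inside the same class) matching that description; the selector and strategy are placeholders that travel through **kwargs into CrawlerRunConfig:

rows = await self.authenticate_and_collect(
    "https://example.com/gigs",
    css_selector="table.gigs",
    extraction_strategy=my_extraction_strategy,  # e.g. a JsonCssExtractionStrategy
)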

Jevli commented Dec 18, 2024

Update for Docker: I get the following error with the Dockerfile (outside of Docker it runs as it should):

ERROR:asyncio:Task exception was never retrieved
future: <Task finished name='Task-134' coro=<ManagedBrowser._monitor_browser_process() done, defined at /root/.cache/pypoetry/virtualenvs/PROJECT-NAME-9TtSrW0h-py3.13/lib/python3.13/site-packages/crawl4ai/async_crawler_strategy.py:111> exception=AttributeError("'NoneType' object has no attribute 'error'")>
Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/PROJECT-NAME-9TtSrW0h-py3.13/lib/python3.13/site-packages/crawl4ai/async_crawler_strategy.py", line 123, in _monitor_browser_process
    self.logger.error(
    ^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'error'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/PROJECT-NAME-9TtSrW0h-py3.13/lib/python3.13/site-packages/crawl4ai/async_crawler_strategy.py", line 141, in _monitor_browser_process
    self.logger.error(
    ^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'error'
Traceback (most recent call last):
  File "/app/src/main.py", line 47, in <module>
    PAGE_gigs = asyncio.run(collect_PAGE_gig_data())
  File "/usr/local/lib/python3.13/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ~~~~~~~~~~^^^^^^
  File "/usr/local/lib/python3.13/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "/usr/local/lib/python3.13/asyncio/base_events.py", line 721, in run_until_complete
    return future.result()
           ~~~~~~~~~~~~~^^
  File "/app/src/provider_crawler/PAGE_gigs.py", line 12, in collect_PAGE_gig_data
    gigs = await collect_PAGE_links()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/src/provider_crawler/PAGE_gigs.py", line 93, in collect_PAGE_links
    rows = await crawler.authenticate_and_collect(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<3 lines>...
    )
    ^
  File "/app/src/crawler/crawler.py", line 107, in authenticate_and_collect
    async with AsyncWebCrawler(
               ~~~~~~~~~~~~~~~^
        config=browser_config, crawler_strategy=crawler_strategy
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ) as crawler:
    ^
  File "/root/.cache/pypoetry/virtualenvs/PROJECT-NAME-9TtSrW0h-py3.13/lib/python3.13/site-packages/crawl4ai/async_webcrawler.py", line 131, in __aenter__
    await self.crawler_strategy.__aenter__()
  File "/root/.cache/pypoetry/virtualenvs/PROJECT-NAME-9TtSrW0h-py3.13/lib/python3.13/site-packages/crawl4ai/async_crawler_strategy.py", line 501, in __aenter__
    await self.start()
  File "/root/.cache/pypoetry/virtualenvs/PROJECT-NAME-9TtSrW0h-py3.13/lib/python3.13/site-packages/crawl4ai/async_crawler_strategy.py", line 508, in start
    await self.browser_manager.start()
  File "/root/.cache/pypoetry/virtualenvs/PROJECT-NAME-9TtSrW0h-py3.13/lib/python3.13/site-packages/crawl4ai/async_crawler_strategy.py", line 268, in start
    self.browser = await self.playwright.chromium.connect_over_cdp(cdp_url)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/PROJECT-NAME-9TtSrW0h-py3.13/lib/python3.13/site-packages/playwright/async_api/_generated.py", line 14779, in connect_over_cdp
    await self._impl_obj.connect_over_cdp(
    ...<4 lines>...
    )
  File "/root/.cache/pypoetry/virtualenvs/PROJECT-NAME-9TtSrW0h-py3.13/lib/python3.13/site-packages/playwright/_impl/_browser_type.py", line 174, in connect_over_cdp
    response = await self._channel.send_return_as_dict("connectOverCDP", params)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/pypoetry/virtualenvs/PROJECT-NAME-9TtSrW0h-py3.13/lib/python3.13/site-packages/playwright/_impl/_connection.py", line 67, in send_return_as_dict
    return await self._connection.wrap_api_call(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<2 lines>...
    )
    ^
  File "/root/.cache/pypoetry/virtualenvs/PROJECT-NAME-9TtSrW0h-py3.13/lib/python3.13/site-packages/playwright/_impl/_connection.py", line 528, in wrap_api_call
    raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.Error: BrowserType.connect_over_cdp: connect ECONNREFUSED ::1:9222
Call log:
  - <ws preparing> retrieving websocket url from http://localhost:9222

Exception ignored in: <function BaseSubprocessTransport.__del__ at 0x7f6199ed7920>
Traceback (most recent call last):
  File "/usr/local/lib/python3.13/asyncio/base_subprocess.py", line 130, in __del__
  File "/usr/local/lib/python3.13/asyncio/base_subprocess.py", line 107, in close
  File "/usr/local/lib/python3.13/asyncio/unix_events.py", line 802, in close
  File "/usr/local/lib/python3.13/asyncio/unix_events.py", line 788, in write_eof
  File "/usr/local/lib/python3.13/asyncio/base_events.py", line 829, in call_soon
  File "/usr/local/lib/python3.13/asyncio/base_events.py", line 552, in _check_closed
RuntimeError: Event loop is closed

I'm probably just missing some Docker-specific setup versus running natively... (The screenshot problem still exists too, but that happens both outside and inside Docker.)
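One detail in the traceback worth noting: the refusal is on ::1:9222, i.e. localhost resolved to IPv6 inside the container. A hedged sketch of a Docker-side workaround, assuming you launch the debug browser yourself (standard Chromium switches; the binary name is an assumption about the image):

import subprocess
from playwright.async_api import async_playwright

subprocess.Popen([
    "chromium",
    "--headless=new",
    "--no-sandbox",                        # commonly required in containers
    "--remote-debugging-port=9222",
    "--remote-debugging-address=0.0.0.0",  # listen beyond ::1 / 127.0.0.1
    "--user-data-dir=/tmp/cdp-profile",
])

async def connect():
    async with async_playwright() as p:
        # connect via explicit IPv4 so the client cannot resolve localhost to ::1
        # (in practice, poll the endpoint until it answers before connecting)
        browser = await p.chromium.connect_over_cdp("http://127.0.0.1:9222")
        return browser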
