Code blocks lose formatting when converting from HTML to markdown #325

Open
AswanthManoj opened this issue Dec 6, 2024 · 5 comments

@AswanthManoj

When converting HTML <pre> and <code> tags to markdown, the current implementation removes newlines and modifies code block formatting, making it difficult to preserve the original code structure. This affects the readability and usability of the generated markdown.

Example:
Original HTML:

<pre><code>
def example():
    print("Hello")
    return True
</code></pre>

Current markdown output:

def example(): print("Hello") return True


Expected markdown output:

```python
def example():
    print("Hello")
    return True
```

Impact

  • Code indentation is lost
  • Line breaks are removed
  • Code blocks become harder to read

Note: To reproduce, crawl https://www.firecrawl.dev/blog/automated-web-scraping-free-2025

Current Workaround

I implemented a custom solution using markdownify with a custom pre tag conversion:

from markdownify import MarkdownConverter

class CustomMarkdownify(MarkdownConverter):
    def convert_pre(self, el, text, convert_as_inline=False):
        # Look for a "language-xxx" class on the <pre> tag to label the fence.
        language = ''
        pre_class = el.get('class', '')
        if isinstance(pre_class, str):
            pre_class = pre_class.split()
        for class_name in pre_class:
            if class_name.startswith('language-'):
                # Strip only the leading prefix, not later occurrences.
                language = class_name[len('language-'):]
                break
        # An empty language produces a plain, unlabeled fence.
        return f'```{language}\n{text}\n```\n'

Then extracted code snippets using regex:

import re

def extract_code_snippets(markdown: str) -> list[tuple[str, str]]:
    # Match fenced blocks; group 1 is the code without the fence lines.
    pattern = r'\n```.*?\n(.*?)\n```\n'
    snippets = []
    for match in re.finditer(pattern, markdown, re.DOTALL):
        full_block = match.group(0)  # fence lines included
        content = match.group(1)     # code only
        snippets.append((full_block, content))
    return snippets

This workaround involves:

  1. Extracting snippets from both the original markdown and newly converted markdown
  2. Using token-based similarity matching to identify corresponding code blocks
  3. Replacing the poorly formatted blocks with properly formatted ones
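
Step 2 above could be sketched roughly as follows. This is a minimal illustration, not the exact code from the workaround: the helper names (`tokenize`, `jaccard`, `match_blocks`), the Jaccard measure, and the 0.8 threshold are all assumptions chosen for the example.

```python
import re


def tokenize(code: str) -> set[str]:
    """Split code into a set of identifier-like and numeric tokens.

    Whitespace is ignored, so a flattened block and its properly
    formatted counterpart produce the same token set.
    """
    return set(re.findall(r"[A-Za-z_]\w*|\d+", code))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def match_blocks(
    flattened: list[str], formatted: list[str], threshold: float = 0.8
) -> dict[int, int]:
    """Map each flattened block index to its best-matching formatted block.

    Blocks whose best score is below `threshold` (an assumed cutoff)
    are left unmatched.
    """
    formatted_tokens = [tokenize(block) for block in formatted]
    matches: dict[int, int] = {}
    for i, block in enumerate(flattened):
        tokens = tokenize(block)
        scores = [jaccard(tokens, other) for other in formatted_tokens]
        if not scores:
            continue
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            matches[i] = best
    return matches
```

Because only tokens are compared, two different snippets that happen to use the same identifiers would be treated as identical, which is one reason this approach does not cover every case.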

While this solution helps in some cases, it doesn't address all formatting issues and requires additional processing steps that shouldn't be necessary.

I can help implement a fix for this issue. Could you please point me to:

  1. The relevant code handling HTML to markdown conversion for code blocks
  2. Any specific patterns/conventions to follow

I'll submit a PR with the necessary changes once I have this guidance.

@unclecode
Owner

@AswanthManoj Thanks for reporting this. It has already been resolved in 0.4.1; could you give it a try and let me know? Much appreciated.

@unclecode unclecode self-assigned this Dec 9, 2024
@unclecode unclecode added the bug Something isn't working label Dec 9, 2024
@AswanthManoj
Author

@unclecode Thanks for the update. I've checked it out and it's working great.
However, I've identified two main areas for improvement in the code:

  1. Currently the code successfully extracts the snippets but struggles with language detection. Documentation sites typically define languages in HTML classes following a pattern like:

    • language-{language}
    • lang-{language}
    • sp-{language}
      These patterns appear in the pre tag's class attribute in the raw HTML, so we could easily identify them and include them in the markdown.
  2. I would also like to propose a new feature, enabled through an extract_all_code_languages bool parameter.
    For any given code block in the HTML:

    • Find the nearest div that contains button elements
    • Check whether the buttons' innerText matches known programming languages
    • For each language button found:
      • Click it and store the language and the updated pre tag contents in an array.
    • When finished, map the pre tag that is finally visible in the DOM to this array; do this for all code blocks.
    • After markdown conversion is complete, replace the original single code block with multiple blocks containing all language variations

This would allow us to capture all available language variations of the code example from documentation, rather than just the default/currently displayed version.
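
The class-based detection in point 1 could look something like the sketch below. It uses only the three prefixes listed above; the function names `detect_language` and `opening_fence` are hypothetical and not part of the library.

```python
import re

# Prefixes taken from the list above; any other class is treated as "no language".
_LANGUAGE_CLASS = re.compile(r"^(?:language|lang|sp)-([\w+#.-]+)$")


def detect_language(class_attr) -> str:
    """Return the language named in a class attribute, or '' if none is found.

    Accepts a raw string ("hljs language-python") or a list of class names,
    since HTML parsers differ in how they expose the attribute.
    """
    if not class_attr:
        return ""
    names = class_attr.split() if isinstance(class_attr, str) else class_attr
    for name in names:
        match = _LANGUAGE_CLASS.match(name)
        if match:
            return match.group(1).lower()
    return ""


def opening_fence(class_attr) -> str:
    """Build the opening fence line, labeled with the language when known."""
    return "`" * 3 + detect_language(class_attr)
```

One caveat: a short prefix such as sp- may also match unrelated utility classes on some sites, so a real implementation might want to check the captured name against a list of known languages.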

What's your opinion on this? Do let me know...

@unclecode
Owner

@AswanthManoj Hi, it's a very interesting idea, for sure. I'm going to put it in the backlog to detect the language as well and add it to the markdown; definitely, it would be very good. Thanks for the suggestion, and I'm happy that the previous issue has already been resolved.

@AswanthManoj
Author

@unclecode I'd be happy to help implement these language detection and multi-language code extraction features. To ensure I maintain consistency with the project, could you guide me on:

  • Which files I should focus on for implementing these changes
  • Any specific code conventions or formatting standards I should follow

I could give it a try...

@unclecode
Owner

@AswanthManoj I appreciate your interest in contributing to the library. I created a very detailed explanation of what must be done. I will share it here, but feel free to let me know if you think it's too much. I can handle it myself, but if you are interested, please go ahead and make a pull request. I will also invite you to our Discord channel so I can communicate easily with you; you can share your email address for that. One very important aspect, explained in the following text: performance is crucial, and we cannot afford to lose even 10 milliseconds. Let me know what you think.

Updated Guidance Including the CustomHTML2Text Class

With the CustomHTML2Text code now visible, here are the additional points and exact places to focus on, along with performance considerations:

  1. Where to Integrate the Multi-Language Logic
    We still follow the same file structure mentioned before:

    • async_configs.py for the new extract_all_code_languages flag.
    • content_scraping_strategy.py (WebScrapingStrategy) for parsing and extracting multiple language variants from the HTML. This will likely produce a custom data structure that pairs each original <pre> block with multiple (language, code_html) tuples.
    • markdown_generation_strategy.py (e.g., DefaultMarkdownGenerator) for final Markdown formatting. Before we run CustomHTML2Text, you would modify the HTML content so that for each original <pre> block, you insert multiple code snippets (one per language variant). Essentially, you should inject multiple <pre><code class="language-xxx">...</code></pre> blocks so that the conversion step can process them cleanly.
  2. Incorporating into CustomHTML2Text
    The CustomHTML2Text class, shown above, transforms HTML into Markdown. It currently:

    • Wraps <pre> blocks in triple backticks
    • Wraps inline <code> in backticks
    • Handles a handle_code_in_pre flag to decide how to treat nested <code> tags within <pre>.

    To support multi-language code:

    • After you inject the multi-language versions into the HTML (in markdown_generation_strategy.py), each <pre>/<code> block should already carry a class attribute like class="language-python" or class="lang-java". The CustomHTML2Text class currently doesn’t parse that language info.
    • You can enhance handle_tag('pre', ...) and handle_tag('code', ...) to detect the language from the attrs and append it after the opening triple backticks. For example, if class="language-python" is found, output ```python instead of just ```.

    The logic:

    • In handle_tag, when you encounter a <pre> start tag, check if it has class="language-xxx" or similar attributes and store the detected language.
    • Print something like ```python if a python block is found.
    • Same for handle_code_in_pre. If the code inside <pre> is multiline and you’ve got multiple variants, you’ll see multiple <pre> blocks in sequence. Each will get its own language annotation.

    Make sure to keep the changes minimal and conditional on the language class. If no language is found, fall back to the current behavior. The preserve_tags and other logic remain unchanged.

  3. Performance and Overhead Considerations
    The main performance cost is likely in the additional DOM queries or complexity in extracting multiple code versions. To stay within ~15% overhead:

    • In content_scraping_strategy.py:

      • Limit the complexity of lookups. When you find a code block with language toggles, store these results once and avoid scanning the DOM multiple times.
      • Consider caching: If multiple code blocks share the same language buttons pattern, reuse the parsing logic.
      • Don’t do unnecessary HTML transformations until you are sure they are needed. If extract_all_code_languages is False, skip all the multi-language logic.
    • In markdown_generation_strategy.py:

      • Pre-process and inject all code variants before calling CustomHTML2Text.
      • Keep string operations O(n) and avoid multiple repeated string concatenations in a loop. Use join or build the final HTML in-memory once.
    • In CustomHTML2Text:

      • The changes are minimal: just read attributes and print the language name. This should add negligible overhead.

    Testing Approach:

    • Benchmark your current extraction (without the new feature) to get a baseline (~180ms for https://crawl4ai.com/mkdocs/basic/simple-crawling/).
    • Enable extract_all_code_languages on a page with multiple code blocks. Ensure you only add minimal overhead operations:
      • Possibly measure how many code blocks you have and how much extra HTML processing occurs.
      • If you do need complex DOM manipulations, do them once per block, not repeatedly.
    • Use a profiler or simple timing checks around your new logic to ensure it doesn’t exceed ~15% overhead. If you find it too slow, try reducing DOM queries or caching results.
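
The "simple timing checks" mentioned above could be done with a small helper like this one. It is only a sketch: `median_ms`, `overhead_pct`, and `within_budget` are hypothetical names, and the functions being timed would be whatever wraps the extraction with and without extract_all_code_languages.

```python
import statistics
import time


def median_ms(func, *args, repeats: int = 20) -> float:
    """Median wall-clock time of func(*args) in milliseconds.

    The median is used so that a single slow run (GC pause, cold cache)
    does not skew the comparison.
    """
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def overhead_pct(baseline_ms: float, candidate_ms: float) -> float:
    """Relative slowdown of the candidate versus the baseline, in percent."""
    return (candidate_ms - baseline_ms) / baseline_ms * 100.0


def within_budget(
    baseline_ms: float, candidate_ms: float, budget_pct: float = 15.0
) -> bool:
    """True when the slowdown stays at or under the budget (15% by default)."""
    return overhead_pct(baseline_ms, candidate_ms) <= budget_pct
```

With the ~180ms baseline quoted above, a 15% budget allows the feature-enabled run to take up to roughly 207ms.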

In summary:

  • Add a flag in async_configs.py.
  • Extend WebScrapingStrategy in content_scraping_strategy.py to detect and store all language variants for each code block.
  • Modify DefaultMarkdownGenerator or a similar markdown pipeline step to inject multiple <pre><code class="language-...">...</code></pre> blocks for each variant before CustomHTML2Text runs.
  • In CustomHTML2Text, just lightly detect language attributes and print a language annotation in the fenced code block. Keep all enhancements lean to maintain performance within the ~15% overhead target.
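
As an illustration of the injection step in the third bullet, the variants for one block could be rendered into HTML in a single pass. This is a sketch under assumptions: `render_variants` is a hypothetical helper, and it takes plain-text code, whereas the scraping step described above may hand over already-escaped code_html.

```python
from html import escape


def render_variants(variants: list[tuple[str, str]]) -> str:
    """Build one <pre><code class="language-..."> block per (language, code) pair.

    `code` is treated as plain text and escaped here. If the scraper already
    holds escaped inner HTML, the escape() call on the code should be dropped.
    Parts are collected in a list and joined once to keep string work O(n).
    """
    parts = []
    for language, code in variants:
        # Omit the class entirely when no language is known, so the
        # converter falls back to an unlabeled fence.
        class_attr = (
            f' class="language-{escape(language, quote=True)}"' if language else ""
        )
        parts.append(
            f"<pre><code{class_attr}>{escape(code, quote=False)}</code></pre>"
        )
    return "\n".join(parts)
```

The resulting string would replace the original single `<pre>` block in the HTML before the conversion step runs.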
