Code blocks lose formatting when converting from HTML to markdown #325
Comments
@AswanthManoj Thanks for reporting this. It has already been resolved in 0.4.1; would you give it a try and let me know? Much appreciated.
@unclecode Thanks for the update. I've checked it out and it's working great.
This would allow us to capture all available language variations of the code example from documentation, rather than just the default/currently displayed version. What's your opinion on this? Do let me know...
@AswanthManoj Hi, it's a very interesting idea, for sure. I'm going to put it in the backlog to detect the language as well and add it to the markdown; it would definitely be very good. Thanks for the suggestion, and I'm happy that the previous issue has already been resolved.
@unclecode I'd be happy to help implement these language detection and multi-language code extraction features. To ensure I maintain consistency with the project, could you guide me on like
I could give it a try...
@AswanthManoj I appreciate your interest in contributing to the library. I created a very detailed explanation of what must be done. I will share it here, but feel free to let me know if you think it's too much. I can handle it myself, but if you are interested, please go ahead and make a pull request. I will also invite you to our Discord channel so I can communicate easily with you. Additionally, you can share your email address for one very important aspect I explained in the following text: performance is crucial, and we cannot afford to lose even 10 milliseconds. Let me know what you think.
Updated Guidance Including the CustomHTML2Text Class
With the
In summary:
When converting HTML <pre> and <code> tags to markdown, the current implementation removes newlines and modifies code block formatting, making it difficult to preserve the original code structure. This affects the readability and usability of the generated markdown.
Example:
Original HTML:
def example():
    print("Hello")
    return True
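The behavior being asked for can be sketched with the standard library alone. This is only an illustration of keeping `<pre>` content verbatim; it is not the library's converter, and the names `PreservingConverter` and `html_to_markdown` are hypothetical.

```python
from html.parser import HTMLParser

# Built at runtime so this sketch can itself live inside a fenced block.
FENCE = "`" * 3


class PreservingConverter(HTMLParser):
    """Hypothetical minimal converter: copies <pre> content verbatim into a fence."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.in_pre = False

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.in_pre = True
            self.parts.append("\n" + FENCE + "\n")

    def handle_endtag(self, tag):
        if tag == "pre":
            self.in_pre = False
            self.parts.append("\n" + FENCE + "\n")

    def handle_data(self, data):
        if self.in_pre:
            # Inside <pre>: keep newlines and indentation untouched.
            self.parts.append(data)
        else:
            # Outside <pre>: whitespace is not significant, so collapse it.
            self.parts.append(" ".join(data.split()))


def html_to_markdown(html):
    parser = PreservingConverter()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


if __name__ == "__main__":
    sample = (
        "<p>Example:</p>"
        '<pre><code>def example():\n    print("Hello")\n    return True</code></pre>'
    )
    print(html_to_markdown(sample))
```

The key design choice is tracking whether the parser is inside a `<pre>` element and skipping whitespace normalization there, since that normalization is what flattens the code onto one line.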
Impact
Note: To reproduce, crawl the link "https://www.firecrawl.dev/blog/automated-web-scraping-free-2025"
Current Workaround
I implemented a custom solution using markdownify with a custom pre tag conversion:
Then extracted code snippets using regex:
This workaround involves:
While this solution helps in some cases, it doesn't address all formatting issues and requires additional processing steps that shouldn't be necessary.
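The original snippets did not survive on this page, so the following is not the reporter's code. It is a minimal sketch of what the regex extraction step could look like, assuming the converted markdown uses triple-backtick fences with an optional language tag; `CODE_BLOCK_RE` and `extract_code_blocks` are hypothetical names.

```python
import re

# Built at runtime so this sketch can itself live inside a fenced block.
FENCE = "`" * 3

# Opening fence with optional language tag, lazily matched body, closing fence
# that must start at the beginning of a line.
CODE_BLOCK_RE = re.compile(
    rf"^{FENCE}([\w+-]*)[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_code_blocks(markdown):
    """Return (language or None, code) pairs for each fenced block found."""
    return [
        (match.group(1) or None, match.group(2).rstrip("\n"))
        for match in CODE_BLOCK_RE.finditer(markdown)
    ]


if __name__ == "__main__":
    doc = (
        "Intro\n"
        f"{FENCE}python\ndef f():\n    return 1\n{FENCE}\n"
        "Some text\n"
        f"{FENCE}\nplain\n{FENCE}\n"
    )
    for language, code in extract_code_blocks(doc):
        print(language, repr(code))
```

A regex like this only works if the fences and the newlines inside them survived conversion, which is why the extraction step depends on fixing the `<pre>` handling first.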
I can help implement a fix for this issue. Could you please point me to:
I'll submit a PR with the necessary changes once I have this guidance.