HTML of the content extracted #59
Comments
So you want it to figure out the main content area, but not strip any HTML?
Actually, when I call .content it works well, but it does not retain images in the main content. How can I extract the main content along with the images?
In short, can I extract the main content while keeping most of the formatting and content as it is?
Have you tried to pass a list of all tags that you want to keep into
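To make the "list of tags to keep" idea concrete, here is a minimal, library-independent sketch in plain Ruby. The helper name keep_tags and its tags/attributes parameters are hypothetical and are not the API of the library discussed in this thread; the regex approach is only an illustration and is not a robust HTML parser. It shows why images survive only when img is in the tag list and src is in the attribute list.

```ruby
# Hypothetical helper (an assumption for illustration, not a library API):
# keeps only whitelisted tags and attributes, and drops every other tag
# while preserving the text that was inside it.
def keep_tags(html, tags:, attributes:)
  html.gsub(%r{<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>}) do
    closing, name, rest = $1, $2.downcase, $3
    # Tags outside the whitelist are removed; their inner text stays.
    next "" unless tags.include?(name)
    # Closing tags carry no attributes.
    next "</#{name}>" unless closing.empty?
    # Keep only double-quoted attributes whose names are whitelisted.
    kept = rest.scan(/([a-zA-Z-]+)\s*=\s*"([^"]*)"/)
               .select { |attr, _| attributes.include?(attr.downcase) }
               .map { |attr, value| %( #{attr.downcase}="#{value}") }
               .join
    "<#{name}#{kept}>"
  end
end

# Made-up sample markup, not taken from the pages mentioned in this thread.
sample = '<div class="post"><p style="x">Bubble sort</p><span>note</span>' \
         '<img src="sort.png" alt="diagram" onclick="x()"></div>'

result = keep_tags(sample, tags: %w[div p img], attributes: %w[src alt])
puts result
```

With img left out of the tag list, or src left out of the attribute list, the image reference disappears from the output, which matches the symptom described in this thread.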
Hmm, yes I have. For example, to extract the main content of https://www.cs.auckland.ac.nz/~jmor159/PLDS210/qsort.html, I used source = open('https://www.cs.auckland.ac.nz/~jmor159/PLDS210/qsort.html').read but the result does not contain the image or the side-by-side code. Is there a way to extract everything in the main content?
You could try calling
I tried calling prepare_candidates, but it returned a single element. I don't understand what you are suggesting. Sometimes an img or a table is left out of the content. Can this be tailored?
Hi, I applied .content to http://www.algolist.net/Algorithms/Sorting/Bubble_sort and got fairly good content, but the images were missing. Is there a way to get them?
Did you include
Sorry to sound repetitive, but could this be due to #51?
The main content of the article is extracted using .content. But how can the main content of the article be extracted with its original HTML and CSS formatting intact?