-
Hi! First off, to the devs: thank you for making this software. I'm new to web scraping and 4CAT has been incredible for me so far. My question is: how do I scrape the 4plebs archive? I notice that older versions of 4CAT have an option to search the 4plebs archive with (seemingly) no additional plugins necessary. When I access the 4CAT localhost site, I don't see an option for 4chan anywhere. Can anyone point me in the right direction?
Replies: 4 comments 1 reply
-
Hey there, thanks for the compliment and happy to hear 4CAT has been of use!

The 4plebs archive is not a data source, so 4CAT doesn't directly "scrape" this site. Rather, you can import data published by 4plebs into the 4chan data source. The easiest way of doing this is by downloading and importing one of their data dumps. You can then use this import script to move the 4plebs data to the 4CAT PostgreSQL database. Our helper scripts include a range of other scripts that can import imageboard data from other archives as well.

If it's really needed, we also have a script to scrape data from 4plebs, but this is generally only advisable if you need a small dataset: they've had quite a few problems with bots, so they have (understandably) put strict rate limits in place. Hope that helps.
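If you only need a narrow slice of a dump (a date window and one keyword), a pre-filter along these lines might help before importing. This is only a sketch, not one of the 4CAT helper scripts: the function name `filter_posts`, the column names `num`, `timestamp` and `comment`, the header row, and the assumption that timestamps are Unix epoch seconds in UTC are all illustrative, so check them against the actual dump file first.

```python
import csv
import io
from datetime import datetime, timezone


def filter_posts(rows, start, end, keyword,
                 ts_field="timestamp", text_field="comment"):
    """Yield rows with start <= timestamp < end whose text contains keyword.

    Field names are assumptions; adjust them to the real dump layout.
    """
    start_ts = start.timestamp()
    end_ts = end.timestamp()
    needle = keyword.lower()
    for row in rows:
        try:
            ts = int(row[ts_field])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without a usable timestamp
        text = row.get(text_field) or ""
        if start_ts <= ts < end_ts and needle in text.lower():
            yield row


# Tiny made-up sample standing in for a dump file (not real data).
sample = io.StringIO(
    "num,timestamp,comment\n"
    "1,1640995200,first post about elections\n"
    "2,1643673600,Elections again\n"
    "3,1646092800,elections but outside the window\n"
    "4,1641000000,unrelated\n"
)

start = datetime(2022, 1, 1, tzinfo=timezone.utc)  # inclusive
end = datetime(2022, 3, 1, tzinfo=timezone.utc)    # exclusive

matches = list(filter_posts(csv.DictReader(sample), start, end, "elections"))
print([row["num"] for row in matches])  # ['1', '2']
```

For a real file you would pass `csv.DictReader(open(path, newline="", encoding="utf-8"))` instead of the sample, and write the matching rows out with `csv.DictWriter`. Filtering as a generator keeps memory use flat, which matters because board dumps can be very large.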
-
Hi there, I only want to import data published by 4plebs for /pol/, from January 1, 2022 through February 2022, matching a specific keyword. How do I do this?
-
If the 4chan option isn't showing up in your local 4CAT interface, the feature may have been deprecated or moved in a newer release. Your best bets are to make sure you're on the latest version of 4CAT and to check the documentation. And don't hesitate to reach out to the developer community; a quick question can save you a lot of time. Hope that helps you find your way!
-
Thanks Dale. Unfortunately I need February 2022, and the dumps end in January 2022.

On Tue, Oct 31, 2023 at 8:19 AM Dale Wahl wrote:
4plebs <https://archive.4plebs.org/_/articles/credits/#4plebsdumps> has data dumps, so you would need to see if they have one with /pol/ data from that date and then import it via the helper script mentioned above by sal-uva. You'd then enable 4chan in 4CAT's settings (data source section in the control panel), and it will be a data source you can search by keyword.