Tried running the command several times, but only about 10 items (out of ~5k) get synced; none of the broadcasts from before October 29 have ever been synced. #84
Comments
Same situation here: several years of broadcasts, and only about 10 of them get backed up.
I sent doufen a private message but got no reply, so it feels like the project may no longer be maintained. I can see one of the developers' personal accounts still posting every day, but I don't want to bother him; maybe you could ask him (facepalm). Haha, or contact me and we can work on it together: [email protected].
https://www.douban.com/people/doufen-org/status/3664628436/?start=20#comments&_i=6415496nMRa4o7 |
Ugh, how am I supposed to screenshot all of this? I have far too many.
I can't exactly crawl them all by hand…
OK, I wrote a quick Python scraper. For now it only grabs the broadcasts and their timestamps; comments, reviews and the like are not covered. Anyone who wants to try it can take a look:

import random
import time
import json

import requests
from bs4 import BeautifulSoup

# Douban user's statuses page
user_url = 'https://www.douban.com/people/xxx/statuses'
# Output file name
file_name = 'douban_user_broadcast.json'
# Request headers (fill in your own cookie)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36',
    'Cookie': 'xx'
}

# Collected broadcasts
broadcast_list = []

# Crawl pages 1 through 59 of broadcasts
for i in range(1, 60):
    print(f"Page {i}")
    # Pagination parameter
    params = {'p': i}
    # Fetch the page
    response = requests.get(user_url, headers=headers, params=params)
    # Parse the response
    soup = BeautifulSoup(response.text, 'lxml')
    # Broadcast elements on this page
    items = soup.find_all('div', class_='new-status status-wrapper saying')
    # Extract each broadcast
    for item in items:
        broadcast = {}
        action_time = item.find('span', class_='created_at')['title']
        broadcast['action_time'] = action_time.strip()
        broadcast['content'] = item.find('p').text
        # Store this broadcast
        broadcast_list.append(broadcast)
    # Pause between pages to avoid hammering the site
    time.sleep(1 + 5 * random.random())

# Dump the list to a JSON file
with open(file_name, 'w', encoding='utf-8') as f:
    json.dump(broadcast_list, f, ensure_ascii=False)
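A possible follow-up, not part of the script above: instead of hard-coding 59 pages, you could let the loop stop itself when a page returns no broadcast items. The sketch below assumes Douban keeps the same `new-status status-wrapper saying` markup and `p` pagination parameter used above; the `page` counter and `while True` structure are my own additions, and the cookie/user URL placeholders still need to be filled in.

import random
import time
import json

import requests
from bs4 import BeautifulSoup

user_url = 'https://www.douban.com/people/xxx/statuses'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36',
    'Cookie': 'xx',
}

broadcast_list = []
page = 1
while True:
    response = requests.get(user_url, headers=headers, params={'p': page})
    soup = BeautifulSoup(response.text, 'lxml')
    items = soup.find_all('div', class_='new-status status-wrapper saying')
    if not items:
        # No broadcasts on this page: assume we've reached the end of the timeline.
        break
    for item in items:
        time_tag = item.find('span', class_='created_at')
        text_tag = item.find('p')
        if time_tag is None or text_tag is None:
            # Skip reshares or other layouts this sketch doesn't handle.
            continue
        broadcast_list.append({
            'action_time': time_tag.get('title', '').strip(),
            'content': text_tag.text.strip(),
        })
    page += 1
    # Randomized pause between page requests.
    time.sleep(1 + 5 * random.random())

with open('douban_user_broadcast.json', 'w', encoding='utf-8') as f:
    json.dump(broadcast_list, f, ensure_ascii=False, indent=2)

Note that the early break relies on an empty result page marking the end of the timeline; if Douban serves an error or login page instead, the loop will also stop, so it is worth sanity-checking the number of broadcasts written out.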