
Ran the command several times; only about 10 items (out of ~5k) get synced, and no broadcasts from before October 29 have ever been synced. #84

Open
asourorange opened this issue Nov 1, 2022 · 6 comments

Comments

@asourorange

No description provided.

@flowfayefly

Same situation here: years' worth of broadcasts, and only about 10 get backed up.

@asourorange
Author

> Same situation here: years' worth of broadcasts, and only about 10 get backed up.

I DMed doufen but got no reply; I suspect the project is no longer maintained. I did notice one of the developers' personal accounts still posts updates daily, but I don't want to bother him. Maybe you could ask him (facepalm). Haha, or contact me and we can work it out together: [email protected]

@flowfayefly

> Same situation here: years' worth of broadcasts, and only about 10 get backed up.
>
> I DMed doufen but got no reply; I suspect the project is no longer maintained. I did notice one of the developers' personal accounts still posts updates daily, but I don't want to bother him. Maybe you could ask him (facepalm). Haha, or contact me and we can work it out together: [email protected]

https://www.douban.com/people/doufen-org/status/3664628436/?start=20#comments&_i=6415496nMRa4o7
I later saw a status the author posted saying broadcast backup no longer works 😭 I'm just going to take screenshots instead.

@asourorange
Author

Ugh, how am I supposed to screenshot them? I have way too many.

@zhaobenx

Surely we can't crawl them by hand…

@zhaobenx

OK, I threw together a simple crawler script in Python. For now it only grabs the broadcasts and their timestamps; comments, reviews, and the like are not covered. If you want to try it, take a look:

```python
import json
import random
import time

import requests
from bs4 import BeautifulSoup

# Douban user statuses page
user_url = 'https://www.douban.com/people/xxx/statuses'
# Output file name
file_name = 'douban_user_broadcast.json'

# Request headers: fill in the Cookie from your own logged-in session
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36',
    'Cookie': 'xx'
}

# Collected broadcasts
broadcast_list = []

# Crawl the first 59 pages of broadcasts (10 per page);
# adjust the range to however many pages your account has
for i in range(1, 60):
    print(f"Page {i}")
    # Pagination parameter
    params = {
        'p': i
    }
    # Fetch the page
    response = requests.get(user_url, headers=headers, params=params)
    # Parse the response
    soup = BeautifulSoup(response.text, 'lxml')
    # Find the broadcast elements on this page
    items = soup.find_all('div', class_='new-status status-wrapper saying')
    # Extract the timestamp and text of each broadcast
    for item in items:
        broadcast = {}

        action_time = item.find('span', class_='created_at')['title']
        broadcast['action_time'] = action_time.strip()

        status_saying = item.find('p').text
        broadcast['content'] = status_saying
        # Append the record to the list
        broadcast_list.append(broadcast)
    # Random delay between pages to avoid being rate-limited
    time.sleep(1 + 5 * random.random())

# Dump the list to a JSON file
with open(file_name, 'w', encoding='utf-8') as f:
    json.dump(broadcast_list, f, ensure_ascii=False)
```
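Not from the thread, just a quick sanity check: a minimal sketch of the record structure the script writes and how to load the backup afterwards. The sample record and the `douban_backup_sample.json` file name below are made up for illustration; point it at `douban_user_broadcast.json` to inspect a real backup.

```python
import json

# Hypothetical sample record, mirroring what the scraper stores per broadcast
sample = [{'action_time': '2022-10-29 12:00:00', 'content': 'hello douban'}]

# Round-trip through a throwaway file (made-up name; use
# douban_user_broadcast.json for a backup produced by the script above)
demo_file = 'douban_backup_sample.json'
with open(demo_file, 'w', encoding='utf-8') as f:
    json.dump(sample, f, ensure_ascii=False)

with open(demo_file, 'r', encoding='utf-8') as f:
    broadcasts = json.load(f)

print(f"Loaded {len(broadcasts)} broadcast(s)")
for b in broadcasts:
    print(b['action_time'], '-', b['content'])
```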
