JWT crawler question #11

Open
aaa71541367 opened this issue May 28, 2024 · 8 comments

Comments

aaa71541367 commented May 28, 2024

https://live.shopee.tw/share
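After a redesign, I can no longer fetch the data I want from this site with a plain GET request.
A direct GET returns what is shown in the screenshot below (image 2).
I'd like to know how to retrieve this JSON with Python or a crawler.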
image
image
image

Owner

joshhu commented May 28, 2024

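Hello,

Since I can't reproduce your screen, here are my guesses at the key points:

  1. If the site requires login (like Shopee, which you mentioned), a plain requests call will not work.
  2. JSON data is usually loaded dynamically by JavaScript, so requests often cannot get it either.
  3. Most people use the Selenium package with XPath expressions to locate the data they need.

That's my rough understanding. Here is the Selenium tutorial I taught earlier at Minghsin University of Science and Technology, for reference. Thanks.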
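
As a rough illustration of point 3, a Selenium lookup by XPath might be sketched as follows. The helper names `xpath_for_class` and `fetch_texts`, the 10-second timeout, and whatever tag or class you pass in are assumptions for illustration, not taken from the Shopee page. The sketch also assumes Selenium 4 and a local Chrome install.

```python
def xpath_for_class(tag: str, class_name: str) -> str:
    """Build an XPath matching <tag> elements whose class attribute contains class_name.

    Hypothetical helper. contains() does substring matching, so "item"
    also matches a class such as "item-card".
    """
    return f"//{tag}[contains(@class, '{class_name}')]"


def fetch_texts(url: str, xpath: str, timeout: float = 10.0) -> list:
    """Open url in Chrome, wait for elements matching xpath, return their visible text."""
    # Imported lazily so the XPath helper works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Wait until the JavaScript-rendered elements are present in the DOM.
        elements = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.XPATH, xpath))
        )
        return [element.text for element in elements]
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

A call would look like `fetch_texts("https://live.shopee.tw/share", xpath_for_class("div", "item"))`, where the tag and class are placeholders you would replace after inspecting the page.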

Author

aaa71541367 commented May 28, 2024

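This is data I captured with the F12 developer tools.
This page does not require login.
The dynamically loaded data can be fetched with the request shown in the developer tools (just change the page number in the request URL).
The problem now is that when I send a GET directly to the URL captured in the developer tools, I get this instead,
and the data I want never comes back.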
image
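
One way to see what the server actually returns, rather than calling `.json()` right away, is to check the status code and `Content-Type` first. This is only a sketch: `describe_response` and `fetch_and_describe` are hypothetical helper names, and nothing here is specific to Shopee's API.

```python
import json


def describe_response(status_code: int, content_type: str, body: str) -> str:
    """Classify a response body as 'json', 'html', or 'other' and summarize it."""
    kind = "other"
    lowered_type = content_type.lower()
    if "json" in lowered_type:
        try:
            json.loads(body)
            kind = "json"
        except json.JSONDecodeError:
            kind = "other"  # header claims JSON but the body does not parse
    elif "html" in lowered_type or body.lstrip().lower().startswith(("<!doctype", "<html")):
        kind = "html"
    return f"status={status_code} kind={kind} preview={body[:60]!r}"


def fetch_and_describe(url: str, headers: dict) -> str:
    """Send a GET with requests (imported lazily) and describe what came back."""
    import requests

    response = requests.get(url, headers=headers, timeout=10)
    return describe_response(
        response.status_code,
        response.headers.get("Content-Type", ""),
        response.text,
    )
```

If the result says `kind=html` or shows an error status, the server is probably answering with a block page or a challenge rather than the data, which narrows down what to compare against the browser's request.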

Owner

joshhu commented May 28, 2024

The F12 screen shows a lot of things, so I can't really guess what you are trying to fetch. Please provide your complete code. Thanks.

Author

aaa71541367 commented May 28, 2024

```python
import requests

# Browser-like headers copied from the developer tools.
headers = {
    'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 OPR/105.0.0.0 (Edition GX-CN)",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-TW,zh;q=0.9,en-US;q=0.8,en;q=0.7",
    "Cache-Control": "no-cache",
    "Pragma": "no-cache",
    "Sec-Ch-Ua": "\"Opera GX\";v=\"105\", \"Chromium\";v=\"119\", \"Not?A_Brand\";v=\"24\"",
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": "\"macOS\"",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1"
}

# Maps item_id -> coins_per_claim.
coin = {}

# Walk pages 1 to 15 of the item list.
for i in range(15):
    url = f"https://live.shopee.tw/api/v1/lptab/item?tab_id=592037738960896&tab_type=1&device_id=6d910b16-b590-40bb-9157-5517f2923aa3&ctx_id=6d910b16-b590-40bb-9157-5517f2923aa3-1706454481206-697317&page_no={i+1}&offset=0&limit=50"
    response = requests.get(url, headers=headers).json()

    # Read the first six entries on each page; skip entries with missing fields.
    for x in range(6):
        try:
            coin[response["data"]["list"][x]["item_id"]] = response["data"]["list"][x]["item"]["coins_per_claim"]
        except (KeyError, IndexError, TypeError):
            pass

# Sort by coins (descending), then by item_id (descending), and take the top item.
sorted_items = sorted(coin.items(), key=lambda item: (-item[1], -item[0]))
sorted_dict = dict(sorted_items)
first_key = next(iter(sorted_dict.keys()))

print(first_key)
```
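
The query string in the loop above could also be passed as a `params` dict, and the extraction written without indexing by position. The following is a sketch under assumptions: `build_params`, `collect_coins`, and `best_item` are hypothetical helpers, the field names (`data`, `list`, `item_id`, `item`, `coins_per_claim`) are copied from the code above, and `device_id` / `ctx_id` must come from your own session.

```python
def build_params(page_no: int, tab_id: str, device_id: str, ctx_id: str, limit: int = 50) -> dict:
    """Return the query parameters for one page of the item list."""
    return {
        "tab_id": tab_id,
        "tab_type": 1,
        "device_id": device_id,
        "ctx_id": ctx_id,
        "page_no": page_no,
        "offset": 0,
        "limit": limit,
    }


def collect_coins(payload: dict) -> dict:
    """Map item_id -> coins_per_claim for every entry that has both fields."""
    coins = {}
    entries = (payload.get("data") or {}).get("list") or []
    for entry in entries:
        item_id = entry.get("item_id")
        per_claim = (entry.get("item") or {}).get("coins_per_claim")
        if item_id is not None and per_claim is not None:
            coins[item_id] = per_claim
    return coins


def best_item(coins: dict):
    """Return the item_id with the most coins; ties go to the larger item_id."""
    if not coins:
        return None
    return max(coins.items(), key=lambda pair: (pair[1], pair[0]))[0]
```

With requests, a page would then be fetched as `requests.get("https://live.shopee.tw/api/v1/lptab/item", params=build_params(1, tab_id, device_id, ctx_id), headers=headers)`, and the parsed JSON handed to `collect_coins`.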

Owner

joshhu commented May 28, 2024

Please format the code using Markdown syntax. Thanks.

Owner

joshhu commented May 28, 2024

Before submitting, click "Preview" to check that the code is formatted. The Markdown syntax uses three ` characters, as follows:
```
<your code here>
```

Author

Fixed.

Owner

joshhu commented May 28, 2024

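Hello,

I haven't tried crawling Shopee. Sites generally block crawlers, so there may be quite a few restrictions, but here are some directions to try:

  1. Capture the XHR requests to find the actual JSON structure of the items.
  2. The headers must include whatever information the server requires; I'm not sure whether your headers are sufficient.
  3. JSON requests often use POST rather than GET.
  4. You will probably still need Selenium.

Please try following this tutorial; you will have to check these one by one.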
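
For point 2, one low-effort approach is to copy the request headers from the browser's Network tab, turn them into a dict, and compare them with what the script sends. This is a sketch that assumes the headers are pasted as plain `Name: value` lines; `parse_raw_headers` is a hypothetical helper, not part of requests.

```python
def parse_raw_headers(raw: str) -> dict:
    """Convert header lines copied from the developer tools into a dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ":authority".
        if not line or line.startswith(":"):
            continue
        # Split on the first colon only, so URLs in values stay intact.
        name, separator, value = line.partition(":")
        if not separator:
            continue  # not a "Name: value" line
        headers[name.strip()] = value.strip()
    return headers
```

The result can be passed straight to `requests.get(url, headers=parse_raw_headers(raw))`. If the Network tab shows the request as a POST (point 3), the equivalent call is `requests.post(url, headers=..., json=body)`, with `body` copied from the request payload.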
