Headless scraper for YouTube using Puppeteer, based on our YouTube parser harke. It was mainly used for testing the parsers and for monitoring some playlists on a server (without making API requests).
Check out this new repo for more scrapers: https://github.com/algorithmwatch/dataskop-scrapers
git clone --depth 1 --branch v0.3.0 https://github.com/algorithmwatch/harke.git
cd harke
yarn install
yarn build
cd ..
git clone https://github.com/algorithmwatch/harke-puppeteer.git
cd harke-puppeteer
yarn install
yarn add ../harke
yarn run cli -l
yarn run cli -a
yarn run cli -w
yarn run cli -h
yarn run cli -v https://www.youtube.com/watch?v=IWlGDbMEtxM
yarn run cli -s antifa
To monitor YouTube's news playlists:
- install npm, yarn, and all missing dependencies for Puppeteer on your server
- run ./deploy.sh (see script below)
- then:
cd ../harke-puppeteer
yarn add ../harke
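Before running `yarn add ../harke` on the server, it can help to confirm that the deploy step actually copied the harke build output and JSON files. The following is a minimal sketch, not part of the repository: the function name `harke_ready` is hypothetical, and it assumes that harke's `package.json` is among the `*.json` files synced by deploy.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of harke-puppeteer): checks that a harke
# directory contains the build output and a package.json.
# Assumption: package.json is one of the *.json files copied by deploy.sh.
harke_ready() {
  local dir="$1"
  [ -d "$dir/build" ] && [ -f "$dir/package.json" ]
}

# Usage: run from inside harke-puppeteer, next to the harke directory.
if harke_ready ../harke; then
  echo "harke looks ready; run: yarn add ../harke"
else
  echo "harke is missing build/ or package.json; run deploy.sh first"
fi
```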
deploy.sh:
#!/usr/bin/env bash
set -e
set -x
rsync --recursive --verbose --exclude .git --exclude node_modules --exclude html --exclude data --exclude user_data . server:~/code/harke-puppeteer
rsync --recursive --verbose ../harke/build server:~/code/harke
rsync --recursive --verbose ../harke/*.json server:~/code/harke
run.sh:
#!/bin/bash
set -e
set -x
sleep $((RANDOM % 120))
date=$(date '+%Y-%m-%d')
db_location="/root/yt-playlists-data/db${date}.json"
echo "$date"
cd /root/code/harke-puppeteer
yarn run cli -m --dbLocation "$db_location"
Crontab:
*/7 * * * * /root/run.sh
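With this setup, cron starts run.sh at minutes 0, 7, 14, ..., 56 of every hour, each run waits a random 0 to 119 seconds, and all runs on the same day write to the same date-stamped database file. The sketch below isolates those two pieces so they can be tried without starting a scrape. The function names `db_path_for` and `jitter_seconds` are hypothetical and do not exist in the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helpers mirroring the logic in run.sh (not part of the repo).

# Build the per-day database path, e.g. <data_dir>/db2021-03-01.json.
# The default data directory is the one used in run.sh.
db_path_for() {
  local day="$1"
  local data_dir="${2:-/root/yt-playlists-data}"
  printf '%s/db%s.json\n' "$data_dir" "$day"
}

# Pick a random delay between 0 and max-1 seconds, like RANDOM % 120 in run.sh.
jitter_seconds() {
  local max="${1:-120}"
  echo $((RANDOM % max))
}

# Usage: print the database path that a run started today would use.
db_path_for "$(date '+%Y-%m-%d')"
```

Because the file name contains only the date, a new database file is started at the first run after midnight.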
We have to use some obfuscation to make the Google login work. We are using: https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth
Alternatively, we could use Puppeteer with Firefox.
To set it up, specify this in the .launch options:
product: 'firefox',
yarn remove puppeteer
PUPPETEER_PRODUCT=firefox yarn add puppeteer
Unfortunately, using it with Firefox was buggy (but Google did not block the login).
MIT