Skip to content
This repository has been archived by the owner on Mar 21, 2023. It is now read-only.

algorithmwatch/harke-puppeteer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

harke-puppeteer

Headless scraper for YouTube using Puppeteer based on our YouTube parser harke. It was mainly used for testing the parsers as well as monitoring some playlists on a server (without doing API requests).

Check out this new repo for more scrapers: https://github.com/algorithmwatch/dataskop-scrapers

Installation

git clone --depth 1 --branch v0.3.0 https://github.com/algorithmwatch/harke.git
cd harke
yarn install
yarn build
cd ..
git clone https://github.com/algorithmwatch/harke-puppeteer.git
cd harke-puppeteer
yarn install
yarn add ../harke

Usage

Login

yarn run cli -l

Run all parsers to check if they are working

yarn run cli -a

Watch history

yarn run cli -w

Search history

yarn run cli -h

Single video

yarn run cli -v https://www.youtube.com/watch?v=IWlGDbMEtxM

Search results

yarn run cli -s antifa

Monitoring

To monitor YouTube's news playlists.

  • install npm, yarn, and all missing deps for puppeeter on your server
  • ./deploy.sh (see script below)
  • then:
cd ../harke-puppeteer
yarn add ../harke

deploy.sh

#!/usr/bin/env bash
set -e
set -x

rsync --recursive --verbose --exclude .git --exclude node_modules --exclude html --exclude data --exclude user_data . server:~/code/harke-puppeteer
rsync --recursive --verbose ../harke/build server:~/code/harke
rsync --recursive --verbose ../harke/*.json server:~/code/harke

run.sh

#!/bin/bash
set -e
set -x

sleep $((RANDOM % 120))
date=$(date '+%Y-%m-%d')
db_location="/root/yt-playlists-data/db${date}.json"
echo $date
cd /root/code/harke-puppeteer
yarn run cli -m --dbLocation $db_location

Crontab:

*/7 * * * * /root/run.sh

Documentation

Log in to Google (complicated)

We have to use some obfuscation to make the Google login work. We are using: https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth

Firefox

Alternativly, we could use puppeteer with Firefox. To setup, specify this in .launch:

  product: 'firefox',
yarn remove puppeteer
PUPPETEER_PRODUCT=firefox yarn add puppeteer

Unfortunatly, using it with Firefox was buggy. (But Google was not blocking the login)

License

MIT

About

Headless scraper for YouTube using Puppeteer

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published