`harke-puppeteer`

Headless YouTube scraper built on Puppeteer and our YouTube parser, harke. It was mainly used to test the parsers and to monitor some playlists from a server without making API requests.

Check out this new repo for more scrapers: https://github.com/algorithmwatch/dataskop-scrapers

Installation

git clone --depth 1 --branch v0.3.0 https://github.com/algorithmwatch/harke.git
cd harke
yarn install
yarn build
cd ..
git clone https://github.com/algorithmwatch/harke-puppeteer.git
cd harke-puppeteer
yarn install
yarn add ../harke

Usage

Login

yarn run cli -l

Run all parsers to check if they are working

yarn run cli -a

Watch history

yarn run cli -w

Search history

yarn run cli -h

Single video

yarn run cli -v "https://www.youtube.com/watch?v=IWlGDbMEtxM"

Search results

yarn run cli -s antifa

Monitoring

To monitor YouTube's news playlists:

Install npm, yarn, and all missing dependencies for Puppeteer on your server.
Run ./deploy.sh (see the script below).
Then:

cd ../harke-puppeteer
yarn add ../harke

deploy.sh

#!/usr/bin/env bash
set -e
set -x

rsync --recursive --verbose --exclude .git --exclude node_modules --exclude html --exclude data --exclude user_data . server:~/code/harke-puppeteer
rsync --recursive --verbose ../harke/build server:~/code/harke
rsync --recursive --verbose ../harke/*.json server:~/code/harke

run.sh

#!/bin/bash
set -e
set -x

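# Wait a random 0-119 seconds before starting
sleep $((RANDOM % 120))
# One database file per day, named after the current date
date=$(date '+%Y-%m-%d')
db_location="/root/yt-playlists-data/db${date}.json"
echo "$date"
cd /root/code/harke-puppeteer
yarn run cli -m --dbLocation "$db_location"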

Crontab:

*/7 * * * * /root/run.sh
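
Because run.sh names the database after the current date, every run that cron starts on the same day passes the same file to --dbLocation. The following is a minimal sketch of that naming scheme, assuming the directory and db<date>.json pattern from run.sh. The function name db_location_for is hypothetical and not part of this repo.

```shell
#!/usr/bin/env bash
set -e

# Hypothetical helper (not part of harke-puppeteer): print the database path
# that run.sh would use for a given date in YYYY-MM-DD form.
# Falls back to today's date when no argument is given.
db_location_for() {
  local day="${1:-$(date '+%Y-%m-%d')}"
  # Directory and file-name pattern copied from run.sh
  printf '/root/yt-playlists-data/db%s.json\n' "$day"
}

# Example: the file that all runs on 2021-01-31 would share
db_location_for 2021-01-31
```

Calling the helper with a past date is a quick way to find the file that holds that day's monitoring data.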

Documentation

Log in to Google (complicated)

We have to use some obfuscation to make the Google login work. We are using: https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth

Firefox

Alternatively, we could use Puppeteer with Firefox. To set it up, specify this in .launch:

  product: 'firefox',

yarn remove puppeteer
PUPPETEER_PRODUCT=firefox yarn add puppeteer

Unfortunately, using it with Firefox was buggy (although Google did not block the login).

License

MIT
