Provide example with authentication #83

Open
tacman opened this issue Jul 15, 2022 · 6 comments

@tacman
Contributor

tacman commented Jul 15, 2022

How can I scrape a website that requires authentication?

That is, I want to start at https://jardinado.herokuapp.com/login, fill in my credentials, and THEN start scraping the site.

In other words, I want the Goutte client to execute something like this first, then scrape:

            if ($username) {
                // load the login page
                $crawler = $gouteClient->request('GET', $url = $baseUrl . "/login");

                // select the form and fill in some values
                $form = $crawler->selectButton('login-btn')->form();
                $form['_username'] = 'user';
                $form['_password'] = 'pass';

                // submit that form
                $crawler = $gouteClient->submit($form);
                $response = $gouteClient->getResponse();
            }

Now that the cookies are set, when I fetch a URL that requires login I should get the page instead of a 302 (redirect to login).

I'm not sure how to implement this within the context of PHPScraper. One idea would be to expose the Goutte client.

@spekulatius
Owner

Hmmm, while this should work, it's quite a bit of work to debug with the site being down (it doesn't load for me). Can you bring it back up, @tacman?

@tacman
Contributor Author

tacman commented Aug 11, 2022

Try now. It's a slow site, at least initially, because it's running on a free Heroku dyno. It can take up to 30 seconds to "wake up" if it's been inactive for a while.

I set up a login for you -- [email protected], password: spekulatius

@spekulatius
Owner

Hey @tacman

can you share some more code on how you add this to PHPScraper?

            if ($username) {
                // load the login page
                $crawler = $gouteClient->request('GET', $url = $baseUrl . "/login");

                // select the form and fill in some values
                $form = $crawler->selectButton('login-btn')->form();
                $form['_username'] = 'user';
                $form['_password'] = 'pass';

                // submit that form
                $crawler = $gouteClient->submit($form);
                $response = $gouteClient->getResponse();
            }

Thanks :)

@tacman
Contributor Author

tacman commented Aug 15, 2022

Well, that's kind of the point of this issue -- I don't know how to do that. I only see how to click links with PHPScraper:

https://github.com/spekulatius/PHPScraper/blob/master/src/phpscraper.php#L918

I was hoping there was a way to submit a form, which would keep the cookies for that session. So instead of ->clickLink(), there would be a method like ->submitForm(), where I could send in the credentials and then load a page and follow links that require authentication.
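
For reference, here is a minimal sketch of that flow using the underlying browser client directly, not PHPScraper. It assumes `$client` is a `Goutte\Client` (which builds on Symfony BrowserKit, where `submitForm()` and `followRedirects()` are available); the parameter is left untyped so the snippet carries no hard dependency. The function name `statusAfterLogin` and the `$path` argument are hypothetical, while the button name `login-btn` and the fields `_username` and `_password` are taken from the snippet earlier in this thread.

```php
<?php
// Sketch only: $client is assumed to be a Goutte\Client or another
// Symfony BrowserKit browser. Nothing here is part of PHPScraper.

/**
 * Logs in through the login form, then requests a protected page with the
 * same client and returns the HTTP status code of that second request.
 */
function statusAfterLogin($client, string $baseUrl, string $user, string $pass, string $path): int
{
    // Load the login page so the client has a form to submit.
    $client->request('GET', $baseUrl . '/login');

    // Fill in and submit the form identified by its submit button.
    // The client keeps any cookies the server sets in response.
    $client->submitForm('login-btn', [
        '_username' => $user,
        '_password' => $pass,
    ]);

    // Stop following redirects so that a bounce back to the login page
    // is visible as a 302 instead of being followed silently.
    $client->followRedirects(false);
    $client->request('GET', $baseUrl . $path);

    return $client->getResponse()->getStatusCode();
}
```

With a real client this would be called as `statusAfterLogin(new \Goutte\Client(), $baseUrl, $username, $password, '/some/protected/page')`, where the path is a placeholder. A 200 suggests the session cookie was accepted; a 302 suggests the login did not succeed.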

@spekulatius
Owner

Ah okay, now we are getting a bit closer. I've wondered how you did it. Did you get it working with Goutte only?

@tacman
Contributor Author

tacman commented Aug 15, 2022

I have a Symfony bundle that crawls a website: https://github.com/survos/SurvosCrawlerBundle

The idea is that if it can create a set of links that are visible (based on different logins), those links can then be used in a simple PHPUnit test. It basically does what almost all testers do in the beginning -- log in, and click blindly on every link. It's amazing how often someone finds a broken page that way.

So I was trying to use PHPScraper to do that. In the end, I couldn't, so I just used the other tools I had available:

    public function authenticateClient(?string $username = null, ?string $plainPassword = null): void
    {
        // might be worth checking out: https://github.com/liip/LiipTestFixturesBundle/pull/62#issuecomment-622191412
        // one client per username, kept in a static array between calls
        static $clients = [];
        if (!array_key_exists($username, $clients)) {
            $gouteClient = new Client();
            $gouteClient->setMaxRedirects(0);
            $this->username = $username;
            $baseUrl = $this->baseUrl;
            $clients[$username] = $gouteClient;
            if ($username) {
                // load the login page
                $crawler = $gouteClient->request('GET', $url = $baseUrl . trim($this->loginPath, '/'), [
                    'proxy' => '127.0.0.1:7080'
                ]);

                $response = $gouteClient->getResponse();
                assert($response->getStatusCode() === 200, "Invalid route: " . $url);

                // select the form and fill in some values
                try {
                    $form = $crawler->selectButton($this->submitButtonSelector)->form();
                } catch (\Exception $exception) {
                    throw new \Exception($this->submitButtonSelector . ' does not find a form on ' . $this->loginPath);
                }
                $form['_username'] = $username;
                $form['_password'] = $plainPassword;

                // submit that form
                $crawler = $gouteClient->submit($form);
                $response = $gouteClient->getResponse();
                assert($response->getStatusCode() == 200, substr($response->getContent(), 0, 512) . "\n\n" . $url);
            }
        }
    }

https://github.com/survos/SurvosCrawlerBundle/blob/main/src/Services/CrawlerService.php#L108

I don't love the code, though it's functional. If I could drop it all and replace it with PHPScraper, I would. Of course, if there's anything of value you can grab from my bundle, please do so!
