Provide example with authentication #83
Comments
Hmmm, while this should work, it's quite a bit of work to debug with the site being down (it doesn't load for me). Can you bring it back up, @tacman?
Try now. It's a slow site, at least initially, because it's running on a free Heroku dyno. It can take up to 30 seconds to "wake up" if it's been inactive for a while. I set up a login for you -- [email protected], password: spekulatius
Hey @tacman, can you share some more code on how you add this to PHPScraper?
if ($username) {
    $crawler = $gouteClient->request('GET', $url = $baseUrl . "/login", [
    ]);
    // select the form and fill in some values
    $form = $crawler->selectButton('login-btn')->form();
    $form['_username'] = 'user';
    $form['_password'] = 'pass';
    // submit that form
    $crawler = $gouteClient->submit($form);
    $response = $gouteClient->getResponse();
}
Thanks :)
Well, that's kind of the point of this issue -- I don't know how to do that. I only see how to click links with PHPScraper: https://github.com/spekulatius/PHPScraper/blob/master/src/phpscraper.php#L918
I was hoping there was a way to submit a form, which would keep the cookies for that session. So instead of ->clickLink(), a method like ->submitForm(), where I could send in the credentials and then load a page and follow links that require authentication.
Ah okay, now we are getting a bit closer. I've wondered how you did it. Did you get it working with Goutte only?
I have a Symfony bundle that crawls a website: https://github.com/survos/SurvosCrawlerBundle
The idea is that if it can create a set of links that are visible (based on different logins), those links can then be used in a simple PHPUnit test. It basically does what almost all testers do in the beginning -- log in, and click blindly on every link. It's amazing how often someone finds a broken page that way.
So I was trying to use PHPScraper to do that. In the end, I couldn't, so I just used the other tools I had available:
public function authenticateClient(?string $username = null, ?string $plainPassword = null): void
{
    // might be worth checking out: https://github.com/liip/LiipTestFixturesBundle/pull/62#issuecomment-622191412
    // one client per username, reused across calls
    static $clients = [];
    if (!array_key_exists($username, $clients)) {
        $gouteClient = new Client();
        $gouteClient
            ->setMaxRedirects(0);
        $this->username = $username;
        $baseUrl = $this->baseUrl;
        $clients[$username] = $gouteClient;
        if ($username) {
            $crawler = $gouteClient->request('GET', $url = $baseUrl . trim($this->loginPath, '/'), [
                'proxy' => '127.0.0.1:7080'
            ]);
            // dd($crawler, $url);
            $response = $gouteClient->getResponse();
            assert($response->getStatusCode() === 200, "Invalid route: " . $url);
            // dd(substr($response->getContent(),0, 1024), $url, $baseUrl);
            // select the form and fill in some values
            // $form = $crawler->filter('login_form')->form();
            try {
                $form = $crawler->selectButton($this->submitButtonSelector)->form();
            } catch (\Exception $exception) {
                throw new \Exception($this->submitButtonSelector . ' does not find a form on ' . $this->loginPath);
            }
            // assert($form, $this->submitButtonSelector . ' does not find a form on ' . $this->loginPath);
            $form['_username'] = $username;
            $form['_password'] = $plainPassword;
            // submit that form
            $crawler = $gouteClient->submit($form);
            $response = $gouteClient->getResponse();
            assert($response->getStatusCode() === 200, substr($response->getContent(), 0, 512) . "\n\n" . $url);
        }
    }
}
https://github.com/survos/SurvosCrawlerBundle/blob/main/src/Services/CrawlerService.php#L108
I don't love the code, though it's functional. If I could drop it all and replace it with PHPScraper, I would. Of course, if there's anything of value you can grab from my bundle, please do so!
How can I scrape a website that requires authentication?
That is, I want to start at https://jardinado.herokuapp.com/login, fill in my credentials, and THEN start scraping the site.
That is, I want the $goutteClient to execute something like this first, then scrape:
Now that cookies are set, when I fetch a url that requires login I should get the page instead of the 302 (redirect to login).
I'm not sure how to implement this within the context of PHPScraper. One idea would be to expose the Goutte client.
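For illustration, the flow described above (log in first, then fetch a page with the same client so the session cookie is reused) could look roughly like the sketch below. This is not PHPScraper's API: `fetchAfterLogin()` is a hypothetical helper, the `login-btn` button and the `_username` / `_password` field names are assumptions about the login page, and `$client` is assumed to be a Goutte client (or another Symfony BrowserKit browser) offering the `request()`, `submit()` and `getResponse()` calls used elsewhere in this thread.

```php
<?php

declare(strict_types=1);

/**
 * Log in through an HTML form, then fetch a page with the same client.
 *
 * $client is assumed to be a Goutte client (or another Symfony BrowserKit
 * browser). It is typed as object so the sketch is not tied to one version.
 * The button label and field names are assumptions about the login page.
 *
 * @return int HTTP status code of the response for $protectedUrl
 */
function fetchAfterLogin(
    object $client,
    string $loginUrl,
    string $protectedUrl,
    string $username,
    string $password,
    string $submitButton = 'login-btn'
): int {
    // Load the login page and locate the form through its submit button.
    $crawler = $client->request('GET', $loginUrl);
    $form = $crawler->selectButton($submitButton)->form();

    // Fill in the credentials (hypothetical field names).
    $form['_username'] = $username;
    $form['_password'] = $password;

    // Submit the form; the client keeps any cookies set by the response.
    $client->submit($form);

    // Same client instance, so those cookies accompany this request.
    $client->request('GET', $protectedUrl);

    return $client->getResponse()->getStatusCode();
}
```

With Goutte this might be called as `fetchAfterLogin(new \Goutte\Client(), 'https://jardinado.herokuapp.com/login', $protectedUrl, $user, $pass)`, where `$protectedUrl` is whichever page requires a login. Because BrowserKit clients follow redirects by default, an unauthenticated request may end up on the login page with a 200 rather than showing the 302, so comparing the final URL may be a more reliable check than the status code alone.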