Sinama is a simple web scraping library for PHP, built on top of Goutte, Guzzle, and Symfony's DomCrawler.
Requirements:
- PHP 7.0

Install it with Composer:

composer require rafaelglikis/sinama
Create a Sinama Client (which extends Goutte\Client):
use Sinama\Client;
$client = new Client();
Make requests with the request() method:
// Go to motherfuckingwebsite.com
$crawler = $client->request('GET', 'https://motherfuckingwebsite.com/');
The method returns a Crawler object (which extends Symfony\Component\DomCrawler\Crawler).
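Since the returned Crawler extends Symfony's DomCrawler, all of its methods are available directly. A small sketch (the selectors are just illustrative):

// Any Symfony DomCrawler method works on the result
echo $crawler->filter('h1')->text()."\n";               // text of the first <h1>
echo $crawler->filter('a')->first()->attr('href')."\n"; // href of the first link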
To use your own Guzzle settings, create a new Guzzle 6 instance and pass it to the Sinama Client. For example, to add a 60-second request timeout:
use Sinama\Client;
use GuzzleHttp\Client as GuzzleClient;
$client = new Client(new GuzzleClient([
    'timeout' => 60
]));
$crawler = $client->request('GET', 'https://github.com/trending');
For more options, see the Guzzle documentation.
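As a sketch, a few other common Guzzle request options (the values here are illustrative):

use Sinama\Client;
use GuzzleHttp\Client as GuzzleClient;

$client = new Client(new GuzzleClient([
    'timeout'         => 30,    // total request timeout in seconds
    'connect_timeout' => 5,     // abort if connecting takes longer than 5 seconds
    'verify'          => false, // skip SSL certificate verification
]));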
Click on links:
$link = $crawler->selectLink('PHP')->link();
$crawler = $client->click($link);
echo $crawler->getUri()."\n";
Extract data the Symfony way:
$crawler->filter('h3 > a')->each(function ($node) {
    print trim($node->text())."\n";
});
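each() also returns an array of the callback's return values, so you can collect the data instead of printing it:

$titles = $crawler->filter('h3 > a')->each(function ($node) {
    return trim($node->text());
});
print_r($titles); // array of the trending repository names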
Or use Sinama's special methods:
$crawler = $client->request('GET', 'https://github.com/trending');
echo '<html>';
echo '<head>';
echo '<title>'.$crawler->findTitle().'</title>';
echo '</head>';
echo '<body>';
echo '<h1>'.$crawler->findTitle().'</h1>';
echo '<p>Main Image: '.$crawler->findMainImage().'</p>';
echo $crawler->findMainContent();
echo '<pre>';
echo 'Links: ';
print_r($crawler->findLinks());
echo 'Emails: ';
print_r($crawler->findEmails());
echo 'Images: ';
print_r($crawler->findImages());
echo '</pre>';
echo '</body>';
echo '</html>';
Submit forms:
$crawler = $client->request('GET', 'https://www.google.com/');
$form = $crawler->selectButton('Google Search')->form();
$crawler = $client->submit($form, ['q' => 'rafaelglikis/sinama']);
$crawler->filter('h3 > a')->each(function ($node) {
    print trim($node->text())."\n";
});
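The Form object also supports array access, so you can fill the fields first and submit without a values array (a sketch using the same form):

$form = $crawler->selectButton('Google Search')->form();
$form['q'] = 'rafaelglikis/sinama'; // set the search field
$crawler = $client->submit($form);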
Now that we have learned enough, let's scrape a site with a Sinama Spider:
use Sinama\Crawler;
use Sinama\Spider as BaseSpider;

class Spider extends BaseSpider
{
    public function parse(Crawler $crawler)
    {
        // Scrape every article linked from the listing page
        $crawler->filter('div.read-more > a')->each(function (Crawler $node) {
            $this->scrape($node->attr('href'));
        });

        // Follow the pagination links to the next listing pages
        $crawler->filter('div.blog-pagination > a')->each(function ($node) {
            $this->follow($node->attr('href'));
        });
    }

    public function scrape($url)
    {
        echo "*************************************************** ".$url."\n";
        $crawler = $this->client->request('GET', $url);
        echo "Title: " . $crawler->findTitle() . "\n";
        echo "Main Image: " . $crawler->findMainImage() . "\n";
        echo "Main Content: \n" . $crawler->findMainContent() . "\n";
        echo "Emails: \n";
        print_r($crawler->findEmails());
        echo "Links: \n";
        print_r($crawler->findLinks());
    }

    public function getStartUrls(): array
    {
        return [
            'https://blog.scrapinghub.com'
        ];
    }
}
$spider = new Spider([
    'start_urls' => ['https://blog.scrapinghub.com'],
    'max_depth' => 2,
    'verbose' => true
]);

$spider->run();
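Calling run() starts the crawl: each start URL is fetched and the resulting Crawler is passed to parse(). From there, scrape() extracts the data from article pages, while follow() presumably queues further listing pages, bounded by the 'max_depth' option.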
Special methods available on Sinama\Crawler:
- Crawler::findTitle()
- Crawler::findMainImage()
- Crawler::findMainContent()
- Crawler::findLinks()
- Crawler::findEmails()
- Crawler::findImages()
- Crawler::findTags()