v2.0.0

@otsch otsch released this 15 Oct 15:08
· 19 commits to main since this release

Changed

  • BREAKING: Removed methods BaseStep::addToResult(), BaseStep::addLaterToResult(), BaseStep::addsToOrCreatesResult(), BaseStep::createsResult(), and BaseStep::keepInputData(). These methods were deprecated in v1.8.0 and should be replaced with Step::keep(), Step::keepAs(), Step::keepFromInput(), and Step::keepInputAs().
  • BREAKING: Added the following keep methods to the StepInterface: StepInterface::keep(), StepInterface::keepAs(), StepInterface::keepFromInput(), StepInterface::keepInputAs(), as well as StepInterface::keepsAnything(), StepInterface::keepsAnythingFromInputData() and StepInterface::keepsAnythingFromOutputData(). If you have a class that implements this interface without extending Step (or BaseStep), you will need to implement these methods yourself. However, it is strongly recommended to extend Step instead.
  • BREAKING: With the removal of the addToResult() method, the library no longer uses toArrayForAddToResult() methods on output objects. Instead, please use toArrayForResult(). Consequently, RespondedRequest::toArrayForAddToResult() has been renamed to RespondedRequest::toArrayForResult().
  • BREAKING: Removed the result and addLaterToResult properties from Io objects (Input and Output). These properties were part of the addToResult feature. Use the keep property instead, which holds the kept data.
  • BREAKING: The signature of the Crawler::addStep() method has changed. You can no longer provide a result key as the first parameter. Previously, this key was passed to the Step::addToResult() method internally. Since that method is removed, call the appropriate keep method (for example Step::keepAs()) on the step yourself.
  • BREAKING: The return type of the Crawler::loader() method no longer allows array. This means it's no longer possible to provide multiple loaders from the crawler. Instead, use the new functionality, described below, to provide a custom loader directly to a step. As part of this change, the now obsolete UnknownLoaderKeyException was also removed. If you have any references to this class, please make sure to remove them.
  • BREAKING: Refactored the abstract LoadingStep class to a trait and removed the LoadingStepInterface. Loading steps should now extend the Step class and use the trait. As multiple loaders are no longer supported, the addLoader method was renamed to setLoader. Similarly, the methods useLoader() and usesLoader() for selecting loaders by key have been removed. Instead, you can provide a different loader directly to a single step using the trait's new withLoader() method (e.g., Http::get()->withLoader($loader)). The trait now also uses phpdoc template tags for a generic loader type. You can define the loader type by putting /** @use LoadingStep<MyLoader> */ above use LoadingStep; in your step class. Your IDE and static analysis tools (if they support it) will then know what type of loader the trait methods return and accept.
  • BREAKING: Removed the PaginatorInterface to allow for better extensibility. The old Crwlr\Crawler\Steps\Loading\Http\Paginators\AbstractPaginator class has also been removed. Please use the newer, improved version Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator. This newer version has also changed: the first argument UriInterface $url is removed from the processLoaded() method, because the URL is also part of the request (Psr\Http\Message\RequestInterface), which is now the first argument. Additionally, the default implementation of the getNextRequest() method is removed, so child implementations must define this method themselves. If your custom paginator still has a getNextUrl() method, note that it is no longer needed by the library and will not be called. The getNextRequest() method now fulfills its original purpose.
  • BREAKING: Removed methods from HttpLoader:
    • $loader->setHeadlessBrowserOptions() => use $loader->browser()->setOptions() instead
    • $loader->addHeadlessBrowserOptions() => use $loader->browser()->addOptions() instead
    • $loader->setChromeExecutable() => use $loader->browser()->setExecutable() instead
    • $loader->browserHelper() => use $loader->browser() instead
  • BREAKING: Removed method RespondedRequest::cacheKeyFromRequest(). Use RequestKey::from() instead.
  • BREAKING: The HttpLoader::retryCachedErrorResponses() method now returns an instance of the new Crwlr\Crawler\Loader\Http\Cache\RetryManager class. This class provides the methods only() and except() to restrict retries to specific HTTP response status codes. Previously, this method returned the HttpLoader itself ($this), so if you're using it in a chain and calling other loader methods after it, you will need to refactor your code.
  • BREAKING: Removed the Microseconds class from this package. It has been moved to the crwlr/utils package, which you can use instead.
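
Several of the breaking changes above can be illustrated with a short migration sketch. It is not taken from the official upgrade guide. The bot name, URL, CSS selector, browser option and status codes are placeholders, and passing an array of status codes to only() is an assumption about that method's signature.

```php
<?php

require __DIR__ . '/vendor/autoload.php';

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\UserAgents\BotUserAgent;

// A separate loader instance that only one step should use.
// 'MyBot' is a placeholder bot name.
$customLoader = new HttpLoader(new BotUserAgent('MyBot'));

// v1.x: $customLoader->setHeadlessBrowserOptions([...]);
// v2.0: browser settings are reached via browser().
// The option shown here is only an example value.
$customLoader->browser()->setOptions(['windowSize' => [1024, 800]]);

// retryCachedErrorResponses() now returns a RetryManager rather than the
// loader, so it ends its own call chain. Passing an array of status codes
// is an assumption about the only() signature.
$customLoader->retryCachedErrorResponses()->only([500, 503]);

$crawler = HttpCrawler::make()->withBotUserAgent('MyBot');

$crawler
    ->input('https://www.example.com/articles')
    // v1.x selected loaders by key (useLoader()); v2.0 passes the loader
    // object directly to the step.
    ->addStep(Http::get()->withLoader($customLoader))
    // v1.x: ->addStep(Html::root()->extract([...])->addToResult())
    // v2.0: keep() marks the step's output to be kept for the final result.
    ->addStep(
        Html::root()
            ->extract(['title' => 'h1'])
            ->keep()
    );

// Running the crawler sends real HTTP requests, so it is left commented out:
// foreach ($crawler->run() as $result) {
//     var_dump($result->toArray());
// }
```

The sketch assumes the package is installed via Composer. Steps that produce scalar output would use keepAs() with a key instead of keep().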

Added

  • New methods FileCache::prolong() and FileCache::prolongAll() to allow prolonging the time to live for cached responses.
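
As a rough sketch of how the new methods might be called: the argument lists below are assumptions (a TTL in seconds, plus a cache key for prolong()), not signatures confirmed by these release notes. The cache directory and the key are placeholders.

```php
<?php

require __DIR__ . '/vendor/autoload.php';

use Crwlr\Crawler\Cache\FileCache;

// Placeholder path; assumed to already contain cached responses.
$cache = new FileCache(__DIR__ . '/cachedir');

// Assumed signature: prolongAll(int|DateInterval $ttl).
// Extends the time to live of every cached response to one day.
$cache->prolongAll(86400);

// Assumed signature: prolong(string $key, int|DateInterval $ttl).
// 'some-cache-key' is a hypothetical key of an existing cache item.
$cache->prolong('some-cache-key', 86400);
```

Check the method signatures in the FileCache class before relying on this.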

Fixed

  • The maxOutputs() method is now also available and working on Group steps.
  • Improved warning messages for step validations that happen before a crawler runs.
  • A PreRunValidationException, thrown when the crawler finds a problem with the setup before actually running, is now not only logged as an error via the logger but also rethrown to the user. This way, users who don't look at the log messages won't get the impression that the crawler ran successfully.

A detailed upgrade guide is available at https://www.crwlr.software/packages/crawler/v2.0/upgrade-guide