moldova: add from_date support #701

Merged
merged 9 commits into from
Apr 21, 2021
1 change: 1 addition & 0 deletions kingfisher_scrapy/base_spider.py
@@ -57,6 +57,7 @@ class attribute to the path to the OCDS data.
line_delimited = False
root_path = ''
dont_truncate = False
default_from_date = None
jpmckinney marked this conversation as resolved.

def __init__(self, sample=None, note=None, from_date=None, until_date=None, crawl_time=None,
keep_collection_open=None, package_pointer=None, release_pointer=None, truncate=None, *args,
10 changes: 9 additions & 1 deletion kingfisher_scrapy/spiders/moldova.py
@@ -8,12 +8,20 @@ class Moldova(SimpleSpider):
"""
Domain
MTender
Spider arguments
from_date
Download only data from this time onward (YYYY-MM-DDThh:mm:ss format).
"""
name = 'moldova'

# SimpleSpider
data_type = 'release_package'

# BaseSpider
date_format = 'datetime'
default_from_date = '2018-10-18T00:00:00'
Member

I suggest 2018-01-01 (we can even do 2010-01-01), since we don't need to have the exact from date for this API. We should only set a precise from date if the API will error otherwise. That way, if the API fills in historical data, we don't accidentally omit it.

I think the afghanistan spiders are also too precise.

There are some that have a precise month - I don't know if they can be relaxed as well: honduras_portal_bulk, malta, uruguay_base, zambia.

We can update the docstring about setting default_from_date.

Member Author

Actually, I didn't want to add a default_from_date for this one, as it is not required, but

self.check_date_spider_argument('from_date', spider_arguments, lambda cls: repr(cls.default_from_date),
fails otherwise.

Member

That check is there, because otherwise how will the f-string be formatted?

Member Author

The default_from_date is used in:

if self.cls.date_required:
    format_string += " Defaults to {default}."
elif spider_argument == 'from_date':
    format_string += "\n If ``until_date`` is provided, defaults to {default}."
elif spider_argument == 'until_date':
    format_string += "\n If ``from_date`` is provided, defaults to {default}."

expected = format_string.format(period=period, format=format_, default=default(self.cls))

but I think that if date_required is false and only from_date or only until_date is implemented, then we don't need a default_from_date or a default_until_date, right?
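
The rule proposed in this comment can be sketched as a small predicate. This is only an illustration: `needs_default` and its parameters are hypothetical names, not part of kingfisher_scrapy or its test suite.

```python
def needs_default(date_required, supports_from_date, supports_until_date):
    """Return True if a spider's docstring must mention a default date.

    Hypothetical helper for illustration; not the project's actual check.
    """
    # If a date is required, the spider always falls back to a default.
    if date_required:
        return True
    # If both arguments are implemented, providing one makes the other
    # fall back to a default, so the default must be documented.
    return supports_from_date and supports_until_date


# A spider that implements only from_date, with date_required false.
print(needs_default(False, True, False))
```

Under these assumptions, a spider that implements only one of the two arguments and does not require a date would pass the check without defining `default_from_date` or `default_until_date`.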

Member

Yes, it's not required in that specific scenario.

Member Author

So maybe I can update this check and remove default_from_date from Moldova?

Member

Sounds good, but I think the f-string also needs to be updated as right now it would format it with None, no?
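
To illustrate the concern, here is a minimal sketch of what formatting the template with a missing default produces, and one way to guard against it. The function `expected_line` and the wording of its template are assumptions for illustration, not the project's test code.

```python
def expected_line(default=None):
    """Build a docstring line, adding the default clause only if a default exists.

    Hypothetical helper; the template wording is illustrative.
    """
    line = "Download only data from this time onward."
    if default is not None:
        line += " Defaults to {default}.".format(default=repr(default))
    return line


# Unguarded formatting renders the missing value as the literal text "None".
naive = "Defaults to {default}.".format(default=repr(None))
print(naive)

# Guarded formatting omits the clause when there is no default.
print(expected_line())
print(expected_line('2018-01-01T00:00:00'))
```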

Member Author (@yolile, Apr 21, 2021)

> Sounds good, but I think the f-string also needs to be updated as right now it would format it with None, no?

Done. I also made the check ask whether there is a default_until_date, and I updated all the warnings that we had.

> I think the afghanistan spiders are also too precise.
> There are some that have a precise month - I don't know if they can be relaxed as well: honduras_portal_bulk, malta, uruguay_base, zambia.

Honduras and Uruguay are fine, as they have fixed dates that are not going to change. For Afghanistan, Zambia and Malta, the spiders ask for both dates to be set (from_date and until_date). I think that is not required, but I will update that as part of #600.

date_required = True

def start_requests(self):
# https://public.mtender.gov.md offers three endpoints: /tenders/, /tenders/plan/ and /budgets/. However, this
# service publishes contracting processes under multiple OCIDs.
@@ -25,7 +33,7 @@ def start_requests(self):
#
# Note: The OCIDs from the /budgets/ endpoint have no corresponding data in the second service. The OCIDs from
# the /tenders/plan/ endpoint are the same as from the /tenders/ endpoint.
- url = 'https://public.mtender.gov.md/tenders/'
+ url = f'https://public.mtender.gov.md/tenders/?offset={self.from_date.strftime(self.date_format)}'
yield scrapy.Request(url, meta={'file_name': 'list.json'}, callback=self.parse_list)

@handle_http_error
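
As a rough sketch of the changed `url` line, the snippet below builds the list URL from a `from_date` value outside of Scrapy. It assumes the date has already been parsed into a `datetime`, and that the `'datetime'` date format corresponds to the pattern `%Y-%m-%dT%H:%M:%S`, matching the `YYYY-MM-DDThh:mm:ss` format in the docstring. `build_list_url` and `DATETIME_FORMAT` are hypothetical names.

```python
from datetime import datetime

# Assumed strftime pattern for the YYYY-MM-DDThh:mm:ss format in the docstring.
DATETIME_FORMAT = '%Y-%m-%dT%H:%M:%S'


def build_list_url(from_date=None):
    """Return the /tenders/ URL, adding an offset only if a from_date is given.

    Hypothetical helper for illustration; not part of the spider.
    """
    url = 'https://public.mtender.gov.md/tenders/'
    if from_date is not None:
        url += f'?offset={from_date.strftime(DATETIME_FORMAT)}'
    return url


from_date = datetime.strptime('2018-10-18T00:00:00', DATETIME_FORMAT)
print(build_list_url(from_date))
```

The guard for a missing `from_date` is an addition in this sketch; the diff above formats the date unconditionally.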