If you have configured the Celery interval for checking for new jobs to be shorter than the time a spider job needs to finish, you will end up running the same spider job multiple times. The problem is here:
```python
def _pending_jobs(self, spider):
    # Omit scheduling new jobs if there are still pending jobs for the same spider
    resp = urllib.request.urlopen('http://localhost:6800/listjobs.json?project=default')
    data = json.loads(resp.read().decode('utf-8'))
    # BUG: only the 'pending' queue is checked; a job that is already
    # *running* is not seen, so the same spider gets scheduled again
    if 'pending' in data:
        for item in data['pending']:
            if item['spider'] == spider:
                return True
    return False
```
I fixed it with:
```python
def _pending_jobs(self, spider):
    # Omit scheduling new jobs if there are still pending *or running*
    # jobs for the same spider
    resp = urllib.request.urlopen('http://localhost:6800/listjobs.json?project=default')
    data = json.load(resp)
    # Check both queues; .get() avoids a KeyError if one key is missing
    for state in ('pending', 'running'):
        for item in data.get(state, []):
            if item['spider'] == spider:
                return True
    return False
```
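For reference, the parsed `listjobs.json` payload looks roughly like this (shape as documented by Scrapyd; the ids and timestamps are just illustrative), which is why both `pending` and `running` have to be checked:

```python
# Rough shape of the parsed listjobs.json response (values illustrative):
{
    'status': 'ok',
    'pending': [{'id': '78391cc0fcaf11e1b0090800272a6d06', 'spider': 'spider1'}],
    'running': [{'id': '422e608f9f28cef127b3d5ef93fe9399', 'spider': 'spider2',
                 'start_time': '2012-09-12 10:14:03.594664'}],
    'finished': [{'id': '2f16646cfcaf11e1b0090800272a6d06', 'spider': 'spider3',
                  'start_time': '2012-09-12 10:14:03.594664',
                  'end_time': '2012-09-12 10:24:03.594664'}],
}
```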
But a new problem arises. If you have free Scrapyd slots and many independent jobs for the same spider, and the spider job running in one slot takes too long, you cannot use the remaining free slots because your queue is blocked by the running spider.
I think we must rework the scheduling logic to take into account not just the spider name but its args/kwargs as well.
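One way that could look, as a minimal sketch: Scrapyd's `schedule.json` accepts an optional `jobid` parameter, and `listjobs.json` reports it back as `id`, so the spider's arguments can be folded into a deterministic job id and duplicates detected by id rather than by spider name alone. The `SCRAPYD`/`PROJECT` constants and the helper names below are illustrative, not from this project:

```python
import hashlib
import json
import urllib.parse
import urllib.request

SCRAPYD = 'http://localhost:6800'   # illustrative endpoint, as in the snippets above
PROJECT = 'default'

def job_identity(spider, kwargs):
    # Deterministic id from the spider name plus its arguments, so two
    # jobs for the same spider with different args get different ids.
    payload = json.dumps([spider, sorted(kwargs.items())])
    return hashlib.sha1(payload.encode('utf-8')).hexdigest()

def is_scheduled(spider, kwargs):
    # A job counts as a duplicate only if a pending/running job has the
    # same identity, i.e. the same spider *and* the same args.
    jobid = job_identity(spider, kwargs)
    resp = urllib.request.urlopen('%s/listjobs.json?project=%s' % (SCRAPYD, PROJECT))
    data = json.load(resp)
    return any(item['id'] == jobid
               for state in ('pending', 'running')
               for item in data.get(state, []))

def schedule(spider, **kwargs):
    if is_scheduled(spider, kwargs):
        return None  # an identical job is already queued or running
    params = dict(kwargs, project=PROJECT, spider=spider,
                  jobid=job_identity(spider, kwargs))
    body = urllib.parse.urlencode(params).encode('utf-8')
    # schedule.json requires a POST; passing data= makes urlopen POST
    resp = urllib.request.urlopen('%s/schedule.json' % SCRAPYD, data=body)
    return json.load(resp)
```

That way two jobs for the same spider with different arguments get different ids and can fill separate slots, while a true duplicate is still skipped.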