
docs clarification #23

Open

Ostapp opened this issue May 2, 2019 · 4 comments

Comments

Ostapp commented May 2, 2019

Does it assign a different UA to each request?
Does it assign a different UA to each request retry?

vortexkd commented Apr 4, 2020

For anyone who still wants the answer to this:
Yes, it assigns a new user agent to each request.
You can see exactly how here: https://pypi.org/project/fake-useragent/

tl;dr: you can use the RANDOM_UA_TYPE setting (which defaults to random), and the middleware will generate a new user-agent string for each request based on that setting.
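
As a rough illustration, here is a settings.py sketch along the lines of the project README (verify the exact middleware paths and priorities against your installed version):

# settings.py (sketch, following the scrapy-fake-useragent README)

# Disable the built-in UA and retry middlewares and enable the
# random-UA middlewares, so a fresh UA is picked for every request
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware': 401,
}

# 'random' is the default; a browser family such as 'chrome' or
# 'firefox' can be pinned instead
RANDOM_UA_TYPE = 'random'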

i-chaochen commented Sep 1, 2020

Thanks @alecxe for providing this great project.

For scrapy-proxies, I wonder what you mean by setting RANDOM_UA_PER_PROXY to True?

Usage with scrapy-proxies

To use this with random-proxy middlewares such as scrapy-proxies, you need to:

set RANDOM_UA_PER_PROXY to True to allow switching the UA per proxy
set the priority of RandomUserAgentMiddleware to be greater than that of scrapy-proxies, so that the proxy is set before the UA is handled

Do I need to first pip install scrapy_proxies and then add RANDOM_UA_PER_PROXY = True in my settings.py? Or is it already included, so I can add RANDOM_UA_PER_PROXY = True directly?

Also, for the scrapy_proxies priority, do I need to add another DOWNLOADER_MIDDLEWARES entry for scrapy-proxies? I mean, there would be two sets of entries in DOWNLOADER_MIDDLEWARES, and I would then set the priorities of the fake-useragent middlewares to be larger than those of scrapy-proxies, so I can have proxy + fake user agent together?

Because you mentioned that fake-useragent needs the built-in UserAgentMiddleware and RetryMiddleware turned off, while scrapy-proxies uses RetryMiddleware, I am confused about whether I should keep RetryMiddleware in DOWNLOADER_MIDDLEWARES or not. Thanks in advance!

# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
# ...
PROXY_LIST = '/path/to/proxy/list.txt'

# Proxy mode
# 0 = Every request has a different proxy
# 1 = Take only one proxy from the list and assign it to every request
# 2 = Use a custom proxy specified in the settings
PROXY_MODE = 0

# If proxy mode is 2, uncomment this line:
#CUSTOM_PROXY = "http://host1:port"

alecxe (Owner) commented Sep 6, 2020

@i-chaochen thank you for the kind words and the questions. Docs could definitely be better for this project, I agree.

Do I need to first pip install scrapy_proxies and then add RANDOM_UA_PER_PROXY = True in my settings.py? Or is it already included, so I can add RANDOM_UA_PER_PROXY = True directly?

Yeah, scrapy_proxies is not listed in the project requirements, so you would need to install it separately.

Also, for the scrapy_proxies priority, do I need to add another DOWNLOADER_MIDDLEWARES entry for scrapy-proxies? I mean, there would be two sets of entries in DOWNLOADER_MIDDLEWARES, and I would then set the priorities of the fake-useragent middlewares to be larger than those of scrapy-proxies, so I can have proxy + fake user agent together?

It seems so, though I have not used this combination of scrapy-fake-useragent and scrapy-proxies myself. I'd say do some experimentation with the middleware setup while logging proxies and headers.
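
Untested, but based on the README guidance above (the priorities only need to keep RandomUserAgentMiddleware's number greater than RandomProxy's, so the proxy is chosen before the UA is assigned), a starting point might look like:

# settings.py — untested sketch combining scrapy-proxies and
# scrapy-fake-useragent
RANDOM_UA_PER_PROXY = True

DOWNLOADER_MIDDLEWARES = {
    # built-in UA middleware off, as the fake-useragent docs require
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # the proxy is chosen first (lower numbers run first on requests)...
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    # ...then a UA is assigned for that proxy
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}

As for the retry question: whether you keep the built-in RetryMiddleware (as the scrapy-proxies snippet does) or swap in this project's RetryUserAgentMiddleware is exactly the kind of thing to verify by experiment.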

Hope that helps.

i-chaochen commented Sep 12, 2020

@alecxe Thanks. After reading your code and a couple of tries, I think I figured it out and tested it OK:

  1. Set RANDOM_UA_PER_PROXY = True in settings.py
  2. In the spider file, pass the proxy explicitly: scrapy.Request(url, meta={'proxy': 'your_proxy_address'})

But just remember: if we set RANDOM_UA_PER_PROXY = True, the UA is fixed per proxy and only varies across proxy addresses, not across individual requests.
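
For reference, roughly what this looks like in the spider (the spider name and proxy URL below are placeholders):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        # set the proxy explicitly via meta, per step 2 above
        yield scrapy.Request(
            'https://example.com',
            meta={'proxy': 'http://host1:8080'},
        )

    def parse(self, response):
        # log the UA that was actually sent, to check the
        # per-proxy behaviour described above
        self.logger.info(response.request.headers.get('User-Agent'))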
