Different behaviour of GET #3608
Dear all, I'm observing a weird behaviour of a (public) site when I use httpx to access it. I'm essentially emulating a browser accessing https://pubmed.ncbi.nlm.nih.gov/35642643/?format=pubmed. Curl works, as does requests:

    import requests

    result = requests.get('https://pubmed.ncbi.nlm.nih.gov/35642643/?format=pubmed')
    print(f"Status: {result.status_code}")
    print(result.text)

With httpx 0.28.1, I get a 403:

    import httpx

    with httpx.Client() as client:
        response = client.get('https://pubmed.ncbi.nlm.nih.gov/35642643/?format=pubmed')
        print(f"Status: {response.status_code}")
        print(response.text)

I have not been able to figure out the problem, even after dumping raw headers etc. in httpx. FWIW, the request also works with aiohttp. I'd be happy about any input/help. I'm not affiliated with the pubmed/ncbi/nlm/nih site, I'm just trying to access it with httpx.
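For anyone comparing header dumps between two clients, a small helper along these lines may make the differences easier to spot. This is only a sketch: `diff_headers` is a hypothetical name, and the header values below are placeholders rather than what httpx or requests actually send.

```python
def diff_headers(a, b):
    """Return {name: (value_in_a, value_in_b)} for headers that differ.

    Header names are compared case-insensitively; a header missing
    from one side is reported as None on that side.
    """
    la = {k.lower(): v for k, v in a.items()}
    lb = {k.lower(): v for k, v in b.items()}
    return {
        name: (la.get(name), lb.get(name))
        for name in sorted(la.keys() | lb.keys())
        if la.get(name) != lb.get(name)
    }


# Placeholder header sets, for illustration only.
client_a = {"Accept": "*/*", "User-Agent": "example-client-a/1.0"}
client_b = {"accept": "*/*", "user-agent": "example-client-b/2.0", "X-Extra": "1"}

diff = diff_headers(client_a, client_b)
for name, (left, right) in diff.items():
    print(f"{name}: {left!r} vs {right!r}")
```

In practice the inputs could be `dict(response.request.headers)` from httpx and `dict(result.request.headers)` from requests. Identical headers would not rule out a block based on something other than headers.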
Replies: 2 comments
Here is the answer from the PubMed Helpdesk. It does not directly answer the question.

Thank you for writing to the help desk.

Please do not use the PubMed web interface for programmatic retrieval of citation data. Our E-utilities API is designed for this purpose. There is documentation available at:

For example, the URL below will retrieve XML for the record that you are trying to access from the web interface:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=35642643&retmode=xml&rettype=text

This is one simple example of an E-utilities URL. Please see the documentation to learn more.

Kind regards,
PubMed Team
National Center for Biotechnology Information
National Library of Medicine
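The E-utilities URL quoted by the helpdesk can be assembled from its parts with the standard library. This is a minimal sketch: `build_efetch_url` is a hypothetical helper, and the parameter values are simply copied from the helpdesk example rather than checked against the E-utilities documentation.

```python
from urllib.parse import urlencode

# Endpoint taken from the helpdesk's example URL.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(pmid, retmode="xml", rettype="text"):
    """Build an efetch URL for a single PubMed ID.

    The defaults mirror the helpdesk example; other values are untested here.
    """
    query = urlencode(
        {"db": "pubmed", "id": str(pmid), "retmode": retmode, "rettype": rettype}
    )
    return f"{EFETCH_URL}?{query}"


url = build_efetch_url(35642643)
print(url)
```

The resulting URL could then be passed to `client.get(url)` in place of the web-interface URL. Whether the E-utilities endpoint treats httpx any differently is not something this thread establishes.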
Ah, that's a little bit frustrating. 😬

It's unclear why the server is differentiating `httpx` and returning a 403 response in this case. Either the server or the gateway has presumably been configured to disallow `httpx` clients here.

Perhaps they've been spammed in the past by crawlers using `httpx`, and have a block in place as a result? I did try setting the request headers, including `User-Agent`, here, though there's evidently some more complex client fingerprinting in place.

Incidentally, there might be some useful default behaviours that we could build into `httpx` in order to help ensure that it's generally used as a well behaved client. E.g. respecting `Retry-After` for default rate limiting, perhap…
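As a rough illustration of the `Retry-After` idea, the sketch below turns a header value into a number of seconds to wait. It is not existing httpx behaviour; `retry_after_seconds` is a hypothetical helper, and it assumes the header carries either a whole number of seconds or an HTTP date.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Convert a Retry-After header value into seconds to wait.

    Returns None if the value is neither an integer nor a parseable date.
    """
    value = value.strip()
    if value.isdigit():
        # Delay given directly in seconds.
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means there is no need to wait.
    return max(0.0, (when - now).total_seconds())
```

A caller could read `response.headers.get("Retry-After")` after a 429 or 503 response and sleep for the returned number of seconds before retrying.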