scrapy does not work properly on recent Mac OS/hardware #18

Closed
iannesbitt opened this issue Jan 3, 2023 · 1 comment
Assignees: iannesbitt
Labels: upstream (Dependency issue), wontfix (This will not be worked on)

Comments

@iannesbitt (Contributor)

scrapy relies on pyOpenSSL, which uses a cffi function that allocates write+execute memory, something macOS has blocked since circa 2020 for security reasons. The relevant issue is pyca/pyopenssl#873.

cffi's documentation discusses the relevant function, ffi.callback(), here: https://cffi.readthedocs.io/en/latest/using.html#callbacks. I thought this might be an M1 chipset problem at first, but the documentation implies that the allocation is intentionally blocked at the OS level instead.
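
For context, here is a minimal sketch (mine, not taken from the cffi docs or this project) of the old-style callback pattern in question; on a macOS system that enforces the write-xor-execute memory policy, the decorator line raises the same MemoryError seen in the log below:

# Minimal sketch of the old-style cffi callback pattern.
# ffi.callback() allocates a small write+execute trampoline at runtime,
# which recent macOS refuses; the call then raises:
#   MemoryError: Cannot allocate write+execute memory for ffi.callback()
import cffi

ffi = cffi.FFI()

@ffi.callback("int(int)")  # this is the allocation that macOS blocks
def add_one(x):
    return x + 1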

In the event that there is a mission-critical reason to solve this, the solution at present may be the Apple software notarization process (as mentioned in this issue). However, this would cost money (an Apple Developer ID), take time (the notarization process itself), and carry possible security risk. Otherwise, either a future version of scrapy would have to stop using pyOpenSSL, or pyOpenSSL would have to stop using ffi.callback().
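
For what it's worth, the cffi documentation recommends the newer extern "Python" style (ffi.def_extern()) over ffi.callback(), because the callback is compiled into the extension module as a static C function and needs no write+execute memory at runtime. A rough sketch, assuming an API-mode build (the module name _cb_demo is made up):

# Build step (API mode); run once to compile the extension module.
from cffi import FFI

ffibuilder = FFI()
ffibuilder.cdef('extern "Python" int my_cb(int);')
ffibuilder.set_source("_cb_demo", "")  # "_cb_demo" is a hypothetical name
ffibuilder.compile(verbose=True)

# Usage (normally in a separate module, after the build step):
from _cb_demo import ffi, lib

@ffi.def_extern()
def my_cb(x):
    return x + 1

# lib.my_cb is a compiled-in C function pointer; no W+X allocation occurs.

A fix along these lines would have to happen inside pyOpenSSL itself, though, not in scrapy or in this project.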

Relevant software versions:

python==3.9.12
cffi==1.15.1
cryptography==38.0.4
pyOpenSSL==22.1.0
scrapy==2.5.0

I also tried cryptography>=39 and pyOpenSSL>=23, as suggested by search results.
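
For anyone trying to reproduce this, a quick sketch to print the versions actually in play (note that the import names differ from the PyPI package names):

# Print the versions of the packages involved in this issue.
import cffi
import cryptography
import OpenSSL  # PyPI package: pyOpenSSL
import scrapy

print("cffi:        ", cffi.__version__)
print("cryptography:", cryptography.__version__)
print("pyOpenSSL:   ", OpenSSL.__version__)
print("scrapy:      ", scrapy.__version__)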

The error that occurs when running scrapy is reproduced below:

(mnlite) ian@ mnlite % scrapy crawl JsonldSpider -s STORE_PATH=/Users/ian/bin/mnlite/instance/nodes/mnTestBONARES > /var/log/mnlite/mnTestBONARES-crawl-2022-12.log
2022-12-30 10:29:58 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: soscan)
2022-12-30 10:29:58 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 21.2.0, Python 3.8.15 | packaged by conda-forge | (default, Nov 22 2022, 08:52:09) - [Clang 14.0.6 ], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021), cryptography 3.4.7, Platform macOS-12.3-arm64-arm-64bit
2022-12-30 10:29:58 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-12-30 10:29:58 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'soscan',
 'NEWSPIDER_MODULE': 'soscan.spiders',
 'REACTOR_THREADPOOL_MAXSIZE': 8,
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['soscan.spiders']}
2022-12-30 10:29:58 [scrapy.extensions.telnet] INFO: Telnet Password: 1dd517d2e3040985
2022-12-30 10:29:58 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-12-30 10:29:58 [JsonldSpider] DEBUG: ALT_RULES = None
2022-12-30 10:29:58 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'soscan.middlewares.SoscanDownloaderMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-12-30 10:29:58 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'soscan.middlewares.SoscanSpiderMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-12-30 10:29:58 [scrapy.middleware] INFO: Enabled item pipelines:
['soscan.sonormalizepipeline.SoscanNormalizePipeline',
 'soscan.opersistpipeline.OPersistPipeline']
2022-12-30 10:29:58 [scrapy.core.engine] INFO: Spider opened
2022-12-30 10:29:58 [OPersistPipeline] DEBUG: open_spider
2022-12-30 10:29:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-12-30 10:29:58 [JsonldSpider] INFO: Spider opened: JsonldSpider
2022-12-30 10:29:58 [JsonldSpider] INFO: Spider opened: JsonldSpider
2022-12-30 10:29:58 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-12-30 10:29:58 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET https://maps.bonares.de/robots.txt>: Cannot allocate write+execute memory for ffi.callback(). You might be running on a system that prevents this. For more information, see https://cffi.readthedocs.io/en/latest/using.html#callbacks
Traceback (most recent call last):
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/twisted/internet/defer.py", line 1443, in _inlineCallbacks
    result = current_context.run(result.throwExceptionIntoGenerator, g)
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/twisted/python/failure.py", line 500, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
    result = f(*args, **kw)
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/scrapy/core/downloader/handlers/__init__.py", line 75, in download_request
    return handler.download_request(request, spider)
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 65, in download_request
    return agent.download_request(request)
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 335, in download_request
    d = agent.request(method, to_bytes(url, encoding='ascii'), headers, bodyproducer)
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/twisted/web/client.py", line 1753, in request
    endpoint = self._getEndpoint(parsedURI)
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/twisted/web/client.py", line 1737, in _getEndpoint
    return self._endpointFactory.endpointForURI(uri)
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/twisted/web/client.py", line 1608, in endpointForURI
    connectionCreator = self._policyForHTTPS.creatorForNetloc(
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/scrapy/core/downloader/contextfactory.py", line 67, in creatorForNetloc
    return ScrapyClientTLSOptions(hostname.decode("ascii"), self.getContext(),
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/scrapy/core/downloader/contextfactory.py", line 64, in getContext
    return self.getCertificateOptions().getContext()
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/twisted/internet/_sslverify.py", line 1638, in getContext
    self._context = self._makeContext()
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/twisted/internet/_sslverify.py", line 1669, in _makeContext
    ctx.set_verify(verifyFlags, _verifyCallback)
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/OpenSSL/SSL.py", line 1028, in set_verify
    self._verify_helper = _VerifyHelper(callback)
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/OpenSSL/SSL.py", line 331, in __init__
    self.callback = _ffi.callback(
MemoryError: Cannot allocate write+execute memory for ffi.callback(). You might be running on a system that prevents this. For more information, see https://cffi.readthedocs.io/en/latest/using.html#callbacks
2022-12-30 10:29:58 [scrapy.core.scraper] ERROR: Error downloading <GET https://maps.bonares.de/finder/resources/googleds/sitemap.xml>
MemoryError: Cannot allocate write+execute memory for ffi.callback(). You might be running on a system that prevents this. For more information, see https://cffi.readthedocs.io/en/latest/using.html#callbacks

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/twisted/internet/defer.py", line 1443, in _inlineCallbacks
    result = current_context.run(result.throwExceptionIntoGenerator, g)
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/twisted/python/failure.py", line 500, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
    result = f(*args, **kw)
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/scrapy/core/downloader/handlers/__init__.py", line 75, in download_request
    return handler.download_request(request, spider)
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 65, in download_request
    return agent.download_request(request)
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/scrapy/core/downloader/handlers/http11.py", line 335, in download_request
    d = agent.request(method, to_bytes(url, encoding='ascii'), headers, bodyproducer)
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/twisted/web/client.py", line 1753, in request
    endpoint = self._getEndpoint(parsedURI)
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/twisted/web/client.py", line 1737, in _getEndpoint
    return self._endpointFactory.endpointForURI(uri)
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/twisted/web/client.py", line 1608, in endpointForURI
    connectionCreator = self._policyForHTTPS.creatorForNetloc(
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/scrapy/core/downloader/contextfactory.py", line 67, in creatorForNetloc
    return ScrapyClientTLSOptions(hostname.decode("ascii"), self.getContext(),
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/scrapy/core/downloader/contextfactory.py", line 64, in getContext
    return self.getCertificateOptions().getContext()
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/twisted/internet/_sslverify.py", line 1638, in getContext
    self._context = self._makeContext()
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/twisted/internet/_sslverify.py", line 1669, in _makeContext
    ctx.set_verify(verifyFlags, _verifyCallback)
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/OpenSSL/SSL.py", line 1028, in set_verify
    self._verify_helper = _VerifyHelper(callback)
  File "/Users/ian/miniconda3/envs/mnlite/lib/python3.8/site-packages/OpenSSL/SSL.py", line 331, in __init__
    self.callback = _ffi.callback(
MemoryError: Cannot allocate write+execute memory for ffi.callback(). You might be running on a system that prevents this. For more information, see https://cffi.readthedocs.io/en/latest/using.html#callbacks
2022-12-30 10:29:58 [scrapy.core.engine] INFO: Closing spider (finished)
2022-12-30 10:29:58 [OPersistPipeline] DEBUG: close_spider
2022-12-30 10:29:58 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 2,
 'downloader/exception_type_count/builtins.MemoryError': 2,
 'downloader/request_bytes': 485,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'elapsed_time_seconds': 0.428005,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 12, 30, 18, 29, 58, 568823),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 2,
 'log_count/INFO': 12,
 'memusage/max': 100007936,
 'memusage/startup': 100007936,
 "robotstxt/exception_count/<class 'MemoryError'>": 1,
 'robotstxt/request_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 12, 30, 18, 29, 58, 140818)}
2022-12-30 10:29:58 [scrapy.core.engine] INFO: Spider closed (finished)
@iannesbitt added the wontfix (This will not be worked on) label on Jan 3, 2023
@iannesbitt (Contributor, Author)

Closing as won't fix.

@iannesbitt closed this as not planned on May 23, 2023
@iannesbitt self-assigned this on Oct 4, 2023
@iannesbitt added the upstream (Dependency issue) label on Oct 4, 2023