Issue while running on Azure HDInsight cluster #131

Open
brijesh-6899 opened this issue Nov 13, 2020 · 2 comments

@brijesh-6899

  • A description of the bug

    I am trying out a simple Dask example on an Azure HDInsight cluster, but the application fails after submission with the stack trace below.

  • Steps to reproduce

    I run the code below with Python 3.5 from the master node (connected over SSH).
    I archived the virtual environment with venv-pack and uploaded it to the cluster's Azure distributed (WASB) storage.

    from dask_yarn import YarnCluster
    from dask.distributed import Client

    cluster = YarnCluster(
        environment='wasb:///user/sshuser/dask_dedup.tar.gz',
        worker_vcores=2,
        worker_memory="8GiB",
        n_workers=2,
    )

  • Relevant logs/tracebacks
    20/11/13 07:36:32 INFO skein.ApplicationMaster: Starting Skein version 0.8.0
    20/11/13 07:36:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    20/11/13 07:36:33 INFO skein.ApplicationMaster: Running as user sshuser
    20/11/13 07:36:33 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.1.2.7-1/0/resource-types.xml
    20/11/13 07:36:33 INFO skein.ApplicationMaster: Application specification successfully loaded
    20/11/13 07:36:33 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2-azure-file-system.properties
    20/11/13 07:36:33 INFO sink.WasbAzureIaasSink: Init starting.
    20/11/13 07:36:33 INFO sink.AzureIaasSink: Init starting. Initializing MdsLogger.
    20/11/13 07:36:33 INFO sink.AzureIaasSink: Init completed.
    20/11/13 07:36:33 INFO sink.WasbAzureIaasSink: Init completed.
    20/11/13 07:36:33 INFO impl.MetricsSinkAdapter: Sink azurefs2 started
    20/11/13 07:36:33 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 60 second(s).
    20/11/13 07:36:33 INFO impl.MetricsSystemImpl: azure-file-system metrics system started
    20/11/13 07:36:33 INFO client.RequestHedgingRMFailoverProxyProvider: Created wrapped proxy for [rm1, rm2]
    20/11/13 07:36:34 INFO skein.ApplicationMaster: gRPC server started at
    20/11/13 07:36:34 INFO skein.ApplicationMaster: WebUI server started at
    20/11/13 07:36:34 INFO skein.ApplicationMaster: Registering application with resource manager
    20/11/13 07:36:34 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
    20/11/13 07:36:34 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]
    20/11/13 07:36:34 INFO client.RequestHedgingRMFailoverProxyProvider: Created wrapped proxy for [rm1, rm2]
    20/11/13 07:36:34 INFO client.AHSProxy: Connecting to Application History server at headnodehost
    20/11/13 07:36:34 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
    20/11/13 07:36:34 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]
    20/11/13 07:36:34 INFO skein.ApplicationMaster: Initializing service 'dask.worker'.
    20/11/13 07:36:34 INFO skein.ApplicationMaster: WAITING: dask.worker_0
    20/11/13 07:36:34 INFO skein.ApplicationMaster: WAITING: dask.worker_1
    20/11/13 07:36:34 INFO skein.ApplicationMaster: Initializing service 'dask.scheduler'.
    20/11/13 07:36:34 INFO skein.ApplicationMaster: REQUESTED: dask.scheduler_0
    20/11/13 07:36:35 INFO skein.ApplicationMaster: Starting container_e10_1599254613788_0080_01_000002...
    20/11/13 07:36:35 INFO skein.ApplicationMaster: RUNNING: dask.scheduler_0 on container_e10_1599254613788_0080_01_000002
    20/11/13 07:36:35 INFO skein.ApplicationMaster: REQUESTED: dask.worker_0
    20/11/13 07:36:35 INFO skein.ApplicationMaster: REQUESTED: dask.worker_1
    20/11/13 07:36:37 INFO skein.ApplicationMaster: Starting container_e10_1599254613788_0080_01_000003...
    20/11/13 07:36:37 INFO skein.ApplicationMaster: RUNNING: dask.worker_0 on container_e10_1599254613788_0080_01_000003
    20/11/13 07:36:37 INFO skein.ApplicationMaster: Starting container_e10_1599254613788_0080_01_000004...
    20/11/13 07:36:37 INFO skein.ApplicationMaster: RUNNING: dask.worker_1 on container_e10_1599254613788_0080_01_000004
    20/11/13 07:36:42 WARN skein.ApplicationMaster: FAILED: dask.worker_1 - [2020-11-13 07:36:41.666]wasb://dslhdisparkdehdistorage1.blob.core.windows.net/user/sshuser/.skein/application_1599254613788_0080/.skein.pem: No such file or directory.
    java.io.FileNotFoundException: wasb://dslhdisparkdehdistorage1.blob.core.windows.net/user/sshuser/.skein/application_1599254613788_0080/.skein.pem: No such file or directory.
    at org.apache.hadoop.fs.azure.NativeAzureFileSystem.getFileStatusInternal(NativeAzureFileSystem.java:2715)
    at org.apache.hadoop.fs.azure.NativeAzureFileSystem.getFileStatus(NativeAzureFileSystem.java:2619)
    at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:269)
    at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:242)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:235)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:223)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

20/11/13 07:36:42 INFO skein.ApplicationMaster: RESTARTING: adding new container to replace dask.worker_1.
20/11/13 07:36:42 INFO skein.ApplicationMaster: REQUESTED: dask.worker_2
20/11/13 07:36:42 WARN skein.ApplicationMaster: FAILED: dask.worker_0 - [2020-11-13 07:36:37.992]wasb://dslhdisparkdehdistorage1.blob.core.windows.net/user/sshuser/.skein/application_1599254613788_0080/dask.worker.sh: No such file or directory.
java.io.FileNotFoundException: wasb://dslhdisparkdehdistorage1.blob.core.windows.net/user/sshuser/.skein/application_1599254613788_0080/dask.worker.sh: No such file or directory.
at org.apache.hadoop.fs.azure.NativeAzureFileSystem.getFileStatusInternal(NativeAzureFileSystem.java:2715)
at org.apache.hadoop.fs.azure.NativeAzureFileSystem.getFileStatus(NativeAzureFileSystem.java:2619)
at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:269)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:242)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:235)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:223)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

20/11/13 07:36:42 INFO skein.ApplicationMaster: RESTARTING: adding new container to replace dask.worker_0.
20/11/13 07:36:42 INFO skein.ApplicationMaster: REQUESTED: dask.worker_3
20/11/13 07:36:42 WARN skein.ApplicationMaster: FAILED: dask.scheduler_0 - [2020-11-13 07:36:40.532]wasb://dslhdisparkdehdistorage1.blob.core.windows.net/user/sshuser/.skein/application_1599254613788_0080/dask.scheduler.sh: No such file or directory.
java.io.FileNotFoundException: wasb://dslhdisparkdehdistorage1.blob.core.windows.net/user/sshuser/.skein/application_1599254613788_0080/dask.scheduler.sh: No such file or directory.
at org.apache.hadoop.fs.azure.NativeAzureFileSystem.getFileStatusInternal(NativeAzureFileSystem.java:2715)
at org.apache.hadoop.fs.azure.NativeAzureFileSystem.getFileStatus(NativeAzureFileSystem.java:2619)
at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:269)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:242)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:235)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:223)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

20/11/13 07:36:42 INFO skein.ApplicationMaster: Shutting down: Failure in service dask.scheduler, see logs for more information.
20/11/13 07:36:42 INFO skein.ApplicationMaster: Unregistering application with status FAILED
20/11/13 07:36:42 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
20/11/13 07:36:43 WARN azure.AzureFileSystemThreadPoolExecutor: Disabling threads for Delete operation as thread count 0 is <= 1
20/11/13 07:36:43 INFO azure.AzureFileSystemThreadPoolExecutor: Time taken for Delete operation is: 58 ms with threads: 0
20/11/13 07:36:43 INFO skein.ApplicationMaster: Deleted application directory wasb://tech-dsl-hdi-spark-dev-2020-06-15t09-47-36-041z@dslhdisparkdehdistorage1.blob.core.windows.net/user/sshuser/.skein/application_1599254613788_0080
20/11/13 07:36:43 INFO skein.ApplicationMaster: WebUI server shut down
20/11/13 07:36:43 INFO skein.ApplicationMaster: gRPC server shut down
20/11/13 07:36:43 INFO impl.MetricsSystemImpl: Stopping azure-file-system metrics system...
20/11/13 07:36:43 INFO impl.MetricsSinkAdapter: azurefs2 thread interrupted.
20/11/13 07:36:43 INFO impl.MetricsSystemImpl: azure-file-system metrics system stopped.
20/11/13 07:36:43 INFO impl.MetricsSystemImpl: azure-file-system metrics system shutdown complete.

  • Version information

    Please include version information for the following:

    • Python version: 3.5
    • Dask-Yarn version: 0.8.1
    • Hadoop version: Hadoop 3.1.1.3.1.2.7-1
@quasiben
Member

Can the sshuser create a directory and write files to HDFS?

07:36:41.666]wasb://dslhdisparkdehdistorage1.blob.core.windows.net/user/sshuser/.skein/application_1599254613788_0080/.skein.pem: No such file or directory.
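
One way to check this from the head node is to run a few `hdfs dfs` commands against the user's home directory. The following is only a minimal sketch: it assumes the `hdfs` CLI is on the PATH, and the helper names and the `_write_probe` directory are hypothetical, chosen for illustration. Paths without a scheme resolve against the cluster's default filesystem (`fs.defaultFS`), so on this cluster the check may exercise WASB rather than HDFS proper.

```python
import subprocess


def write_check_commands(base_dir="/user/sshuser", probe="_write_probe"):
    """Build hdfs CLI commands that create, verify, and remove a probe directory.

    The probe name is a hypothetical placeholder, not something dask-yarn or skein uses.
    """
    target = "{}/{}".format(base_dir.rstrip("/"), probe)
    return [
        ["hdfs", "dfs", "-mkdir", "-p", target],               # create a directory
        ["hdfs", "dfs", "-touchz", target + "/probe.txt"],     # write an empty file
        ["hdfs", "dfs", "-test", "-e", target + "/probe.txt"],  # confirm it exists
        ["hdfs", "dfs", "-rm", "-r", "-skipTrash", target],    # clean up
    ]


def can_write(base_dir="/user/sshuser", run=subprocess.call):
    """Return True only if every command exits with status 0.

    Stops at the first failing command, so a failed run may leave the probe behind.
    """
    return all(run(cmd) == 0 for cmd in write_check_commands(base_dir))
```

Calling `can_write()` on the master node as sshuser returns `True` only if the directory and file could be created, seen, and removed. The `run` parameter exists so the logic can be exercised without a cluster.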

@brijesh-6899
Author

Hi @quasiben, thanks for your response. I checked and sshuser is able to create a new directory on HDFS.
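
One detail visible in the logs above: the paths in the three `FileNotFoundException` messages have no `container@` part before the storage account host, while the application directory that skein deletes at shutdown does include one. Whether this difference explains the failure is not established in this thread; the sketch below only makes the comparison explicit using the two URIs copied from the logs. The `describe` helper is hypothetical and uses nothing beyond the standard library.

```python
from urllib.parse import urlsplit

# URI from the FAILED: dask.worker_1 log line
failing = (
    "wasb://dslhdisparkdehdistorage1.blob.core.windows.net"
    "/user/sshuser/.skein/application_1599254613788_0080/.skein.pem"
)
# URI from the "Deleted application directory" log line
deleted = (
    "wasb://tech-dsl-hdi-spark-dev-2020-06-15t09-47-36-041z"
    "@dslhdisparkdehdistorage1.blob.core.windows.net"
    "/user/sshuser/.skein/application_1599254613788_0080"
)


def describe(uri):
    """Split a wasb:// URI into container, storage account host, and path."""
    parts = urlsplit(uri)
    return {
        "container": parts.username,  # None when the URI has no 'container@' prefix
        "account_host": parts.hostname,
        "path": parts.path,
    }


for label, uri in (("failing", failing), ("deleted", deleted)):
    print(label, describe(uri))
```

Both URIs point at the same storage account host and the same application directory; only the second one names a container.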
