
Job status shows "successful" even when it actually "failed". #11

Open
shishirpy opened this issue Apr 27, 2020 · 0 comments
I am running a Harness server using Docker Compose. Here are the steps to reproduce the issue:

  1. Set up an engine with the following engine config:
{
  "engineId": "test_ur",
  "engineFactory": "com.actionml.engines.ur.UREngine",
  "sparkConf": {
    "master": "local",
    "spark.driver.memory": "3g",
    "spark.executor.memory": "1g",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking": "false",
    "spark.kryoserializer.buffer": "300m",
    "spark.es.index.auto.create": "true",
    "spark.es.nodes": "localhost",
    "es.nodes":"localhost",
    "spark.es.nodes.wan.only": "true",
    "es.nodes.wan.only":"true"
  },
  "algorithm": {
    "indicators": [
      {
        "name": "purchase"
      },
      {
        "name": "view"
      },
      {
        "name": "category-pref"
      }
    ],
    "num": 4
  }
}
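Before submitting a config like the one above, a quick client-side sanity check can rule out JSON syntax errors. A minimal sketch (my own addition, abbreviating the `sparkConf` from the config above; not part of the original report):

```python
import json

# Abbreviated copy of the engine config above, limited to the fields
# checked below. This only validates the JSON structure client-side.
engine_config = """
{
  "engineId": "test_ur",
  "engineFactory": "com.actionml.engines.ur.UREngine",
  "sparkConf": {
    "master": "local",
    "es.nodes": "localhost",
    "es.nodes.wan.only": "true"
  },
  "algorithm": {
    "indicators": [
      {"name": "purchase"},
      {"name": "view"},
      {"name": "category-pref"}
    ],
    "num": 4
  }
}
"""

cfg = json.loads(engine_config)
assert cfg["engineFactory"].endswith("UREngine")
assert cfg["sparkConf"]["es.nodes"] == "localhost"
```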
  2. Add some indicator events for testing.
  3. Run a training job using:
POST http://localhost:9090/engines/test_ur/jobs HTTP/1.1
Content-Type: application/json
  4. You will get a response similar to the following:
{
  "description": {
    "jobId": "a6029311-ebb0-4120-90c9-fb40b1934264",
    "status": {
      "name": "queued"
    },
    "comment": "Spark job",
    "createdAt": "2020-04-27T20:09:51.488Z"
  },
  "comment": "Started train Job on Spark"
}
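A client would typically extract the `jobId` from this response and use it for later status checks. A minimal sketch of that parsing, using the response body shown above:

```python
import json

# The job-submission response from above, verbatim.
response_body = """
{
  "description": {
    "jobId": "a6029311-ebb0-4120-90c9-fb40b1934264",
    "status": {"name": "queued"},
    "comment": "Spark job",
    "createdAt": "2020-04-27T20:09:51.488Z"
  },
  "comment": "Started train Job on Spark"
}
"""

description = json.loads(response_body)["description"]
job_id = description["jobId"]

# At submission time the job is correctly reported as queued.
assert description["status"]["name"] == "queued"
```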
  5. After some time, make the following request:
GET http://localhost:9090/engines/test_ur HTTP/1.1
Content-Type: application/json
  6. You will get a response similar to the following:
"jobStatuses": [
    {
      "jobId": "a6029311-ebb0-4120-90c9-fb40b1934264",
      "status": {
        "name": "successful"
      },
      "comment": "Spark job",
      "createdAt": "2020-04-27T20:09:51.488Z",
      "completedAt": "2020-04-27T20:10:08.992Z"
    }
  ]
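A client has no choice but to trust this status field, which is what makes the bug serious. A sketch of how a client would read it (assuming the `jobStatuses` fragment above is wrapped in braces; in the real response it is part of the larger engine-status object):

```python
import json

# The jobStatuses fragment from above, wrapped in braces to form valid
# JSON on its own (in the real response it sits inside the engine status).
status_fragment = """
{
  "jobStatuses": [
    {
      "jobId": "a6029311-ebb0-4120-90c9-fb40b1934264",
      "status": {"name": "successful"},
      "comment": "Spark job",
      "createdAt": "2020-04-27T20:09:51.488Z",
      "completedAt": "2020-04-27T20:10:08.992Z"
    }
  ]
}
"""

job = json.loads(status_fragment)["jobStatuses"][0]

# The API reports "successful" even though the log (below) shows the
# Spark job actually failed.
assert job["status"]["name"] == "successful"
```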
  7. Looking at the last 500 lines of the harness log, you will see the following messages:
harness          | 20:10:08.973 INFO  HttpMethodDirector - Retrying request
harness          | 20:10:08.974 ERROR NetworkClient     - Node [localhost:9200] failed (java.net.ConnectException: Connection refused (Connection refused)); no other nodes left - aborting...
harness          | 20:10:08.981 ERROR URAlgorithm       - Spark computation failed for engine test_ur with params {{"engineId":"test_ur","engineFactory":"com.actionml.engines.ur.UREngine","sparkConf":{"master":"local","spark.driver.memory":"3g","spark.executor.memory":"1g","spark.serializer":"org.apache.spark.serializer.KryoSerializer","spark.kryo.registrator":"org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator","spark.kryo.referenceTracking":"false","spark.kryoserializer.buffer":"300m","spark.es.index.auto.create":"true","spark.es.nodes":"localhost","es.nodes":"localhost","spark.es.nodes.wan.only":"true","es.nodes.wan.only":"true"},"algorithm":{"indicators":[{"name":"purchase"},{"name":"view"},{"name":"category-pref"}],"num":4}}}
harness          | org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
harness          |      at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:340)
harness          |      at org.elasticsearch.spark.rdd.EsSpark$.doSaveToEs(EsSpark.scala:104)
  8. In the logs, just below the error message, you will also notice the following:
harness          | 20:10:08.990 INFO  JobManager$       - Job a6029311-ebb0-4120-90c9-fb40b1934264 marked as failed
harness          | 20:10:08.992 INFO  SparkContextSupport$ - Job a6029311-ebb0-4120-90c9-fb40b1934264 completed in 1588018208990 ms [engine test_ur]
harness          | 20:10:08.995 INFO  JobManager$       - Job a6029311-ebb0-4120-90c9-fb40b1934264 completed successfully
harness          | 20:10:09.004 INFO  AbstractConnector - Stopped Spark@587618d3{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
harness          | 20:10:09.014 INFO  SparkUI           - Stopped Spark web UI at http://7b946919f4f5:4040
  9. These are conflicting messages for the same job ID: JobManager first marks the job as failed, then reports that it completed successfully, and the API surfaces only the "successful" state.
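The log ordering suggests one plausible explanation (an assumption on my part, not confirmed against the Harness source): the Spark-completion handler records success unconditionally, overwriting the "failed" state that URAlgorithm's error path had already set. A minimal Python sketch of that pattern, with hypothetical names:

```python
# Hypothetical sketch of the suspected bug: a completion callback that
# overwrites a terminal "failed" state with "successful". The class and
# method names are illustrative, not Harness's actual API.
class JobStatusStore:
    def __init__(self):
        self.status = {}

    def mark_failed(self, job_id):
        # Called from the algorithm's error path.
        self.status[job_id] = "failed"

    def on_context_stopped_buggy(self, job_id):
        # Suspected current behavior: success is recorded unconditionally
        # when the Spark context shuts down.
        self.status[job_id] = "successful"

    def on_context_stopped_fixed(self, job_id):
        # Proposed behavior: never overwrite a terminal failure.
        if self.status.get(job_id) != "failed":
            self.status[job_id] = "successful"


buggy = JobStatusStore()
buggy.mark_failed("a6029311")
buggy.on_context_stopped_buggy("a6029311")
# Reproduces the reported symptom: the failure is lost.
assert buggy.status["a6029311"] == "successful"

fixed = JobStatusStore()
fixed.mark_failed("a6029311")
fixed.on_context_stopped_fixed("a6029311")
assert fixed.status["a6029311"] == "failed"
```

If this guess is right, the fix would be a state check in JobManager before the final "completed successfully" transition is applied.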