
Job status shows "successful" even when it actually "failed". #11

Open
shishirpy opened this issue Apr 27, 2020 · 0 comments
I am running a Harness server using Docker Compose. Here are the steps to reproduce the issue:

  1. Set up an engine with the following engine config:
{
  "engineId": "test_ur",
  "engineFactory": "com.actionml.engines.ur.UREngine",
  "sparkConf": {
    "master": "local",
    "spark.driver.memory": "3g",
    "spark.executor.memory": "1g",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking": "false",
    "spark.kryoserializer.buffer": "300m",
    "spark.es.index.auto.create": "true",
    "spark.es.nodes": "localhost",
    "es.nodes":"localhost",
    "spark.es.nodes.wan.only": "true",
    "es.nodes.wan.only":"true"
  },
  "algorithm": {
    "indicators": [
      {
        "name": "purchase"
      },
      {
        "name": "view"
      },
      {
        "name": "category-pref"
      }
    ],
    "num": 4
  }
}
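Before submitting a config like the one above, a quick client-side sanity check can rule out JSON syntax errors. A minimal sketch (my own addition, abbreviating the `sparkConf` from the config above; not part of the original report):

```python
import json

# Abbreviated copy of the engine config above, limited to the fields
# checked below. This only validates the JSON structure client-side.
engine_config = """
{
  "engineId": "test_ur",
  "engineFactory": "com.actionml.engines.ur.UREngine",
  "sparkConf": {
    "master": "local",
    "es.nodes": "localhost",
    "es.nodes.wan.only": "true"
  },
  "algorithm": {
    "indicators": [
      {"name": "purchase"},
      {"name": "view"},
      {"name": "category-pref"}
    ],
    "num": 4
  }
}
"""

cfg = json.loads(engine_config)
assert cfg["engineFactory"].endswith("UREngine")
assert cfg["sparkConf"]["es.nodes"] == "localhost"
```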
  2. Add some indicator events for testing.
  3. Run a training job using:
POST http://localhost:9090/engines/test_ur/jobs HTTP/1.1
Content-Type: application/json
  4. You will get a response similar to the following:
{
  "description": {
    "jobId": "a6029311-ebb0-4120-90c9-fb40b1934264",
    "status": {
      "name": "queued"
    },
    "comment": "Spark job",
    "createdAt": "2020-04-27T20:09:51.488Z"
  },
  "comment": "Started train Job on Spark"
}
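A client would typically extract the `jobId` from this response and use it for later status checks. A minimal sketch of that parsing, using the response body shown above:

```python
import json

# The job-submission response from above, verbatim.
response_body = """
{
  "description": {
    "jobId": "a6029311-ebb0-4120-90c9-fb40b1934264",
    "status": {"name": "queued"},
    "comment": "Spark job",
    "createdAt": "2020-04-27T20:09:51.488Z"
  },
  "comment": "Started train Job on Spark"
}
"""

description = json.loads(response_body)["description"]
job_id = description["jobId"]

# At submission time the job is correctly reported as queued.
assert description["status"]["name"] == "queued"
```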
  5. After some time, make the following request:
GET http://localhost:9090/engines/test_ur HTTP/1.1
Content-Type: application/json
  6. You will get a response similar to the following:
"jobStatuses": [
    {
      "jobId": "a6029311-ebb0-4120-90c9-fb40b1934264",
      "status": {
        "name": "successful"
      },
      "comment": "Spark job",
      "createdAt": "2020-04-27T20:09:51.488Z",
      "completedAt": "2020-04-27T20:10:08.992Z"
    }
  ]
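A client has no choice but to trust this status field, which is what makes the bug serious. A sketch of how a client would read it (assuming the `jobStatuses` fragment above is wrapped in braces; in the real response it is part of the larger engine-status object):

```python
import json

# The jobStatuses fragment from above, wrapped in braces to form valid
# JSON on its own (in the real response it sits inside the engine status).
status_fragment = """
{
  "jobStatuses": [
    {
      "jobId": "a6029311-ebb0-4120-90c9-fb40b1934264",
      "status": {"name": "successful"},
      "comment": "Spark job",
      "createdAt": "2020-04-27T20:09:51.488Z",
      "completedAt": "2020-04-27T20:10:08.992Z"
    }
  ]
}
"""

job = json.loads(status_fragment)["jobStatuses"][0]

# The API reports "successful" even though the log (below) shows the
# Spark job actually failed.
assert job["status"]["name"] == "successful"
```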
  7. Looking at the last 500 lines of the harness log, you will see the following messages:
harness          | 20:10:08.973 INFO  HttpMethodDirector - Retrying request
harness          | 20:10:08.974 ERROR NetworkClient     - Node [localhost:9200] failed (java.net.ConnectException: Connection refused (Connection refused)); no other nodes left - aborting...
harness          | 20:10:08.981 ERROR URAlgorithm       - Spark computation failed for engine test_ur with params {{"engineId":"test_ur","engineFactory":"com.actionml.engines.ur.UREngine","sparkConf":{"master":"local","spark.driver.memory":"3g","spark.executor.memory":"1g","spark.serializer":"org.apache.spark.serializer.KryoSerializer","spark.kryo.registrator":"org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator","spark.kryo.referenceTracking":"false","spark.kryoserializer.buffer":"300m","spark.es.index.auto.create":"true","spark.es.nodes":"localhost","es.nodes":"localhost","spark.es.nodes.wan.only":"true","es.nodes.wan.only":"true"},"algorithm":{"indicators":[{"name":"purchase"},{"name":"view"},{"name":"category-pref"}],"num":4}}}
harness          | org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
harness          |      at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:340)
harness          |      at org.elasticsearch.spark.rdd.EsSpark$.doSaveToEs(EsSpark.scala:104)
  8. In the logs, just below the error message, you will also notice the following:
harness          | 20:10:08.990 INFO  JobManager$       - Job a6029311-ebb0-4120-90c9-fb40b1934264 marked as failed
harness          | 20:10:08.992 INFO  SparkContextSupport$ - Job a6029311-ebb0-4120-90c9-fb40b1934264 completed in 1588018208990 ms [engine test_ur]
harness          | 20:10:08.995 INFO  JobManager$       - Job a6029311-ebb0-4120-90c9-fb40b1934264 completed successfully
harness          | 20:10:09.004 INFO  AbstractConnector - Stopped Spark@587618d3{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
harness          | 20:10:09.014 INFO  SparkUI           - Stopped Spark web UI at http://7b946919f4f5:4040
  9. These are conflicting messages for the same job ID: JobManager first marks the job as failed, then reports that it completed successfully, and the API surfaces only the "successful" state.
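The log ordering suggests one plausible explanation (an assumption on my part, not confirmed against the Harness source): the Spark-completion handler records success unconditionally, overwriting the "failed" state that URAlgorithm's error path had already set. A minimal Python sketch of that pattern, with hypothetical names:

```python
# Hypothetical sketch of the suspected bug: a completion callback that
# overwrites a terminal "failed" state with "successful". The class and
# method names are illustrative, not Harness's actual API.
class JobStatusStore:
    def __init__(self):
        self.status = {}

    def mark_failed(self, job_id):
        # Called from the algorithm's error path.
        self.status[job_id] = "failed"

    def on_context_stopped_buggy(self, job_id):
        # Suspected current behavior: success is recorded unconditionally
        # when the Spark context shuts down.
        self.status[job_id] = "successful"

    def on_context_stopped_fixed(self, job_id):
        # Proposed behavior: never overwrite a terminal failure.
        if self.status.get(job_id) != "failed":
            self.status[job_id] = "successful"


buggy = JobStatusStore()
buggy.mark_failed("a6029311")
buggy.on_context_stopped_buggy("a6029311")
# Reproduces the reported symptom: the failure is lost.
assert buggy.status["a6029311"] == "successful"

fixed = JobStatusStore()
fixed.mark_failed("a6029311")
fixed.on_context_stopped_fixed("a6029311")
assert fixed.status["a6029311"] == "failed"
```

If this guess is right, the fix would be a state check in JobManager before the final "completed successfully" transition is applied.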