'Data Transformation' feed fails to run if the select statement returns an empty result

Description

Steps to reproduce:
1. Create a table 'foo' with the following schema in Hive:
firstname string,
lastname string

2. Insert two records: {'a', 'b'} and {'c', 'd'}

3. Use the 'Data Transformation' template, click the 'Edit SQL' button, then enter the following SQL statement:
select * from foo b where b.firstname="e"

4. After creating the feed, click 'Start now' to run the feed

Expected result:
The feed should run successfully even though the target table is empty

Actual result:
The job failed with the following error:
2018-10-10 11:03:30,054 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=c17fbdf9-dfaf-373f-c51d-885b5e847c61] 18/10/10 11:03:30 ERROR datavalidator.StandardDataValidator: Failed to insert validation stats
2018-10-10 11:03:30,055 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=c17fbdf9-dfaf-373f-c51d-885b5e847c61] java.lang.UnsupportedOperationException: empty collection
2018-10-10 11:03:30,055 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=c17fbdf9-dfaf-373f-c51d-885b5e847c61] at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1028)
2018-10-10 11:03:30,055 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=c17fbdf9-dfaf-373f-c51d-885b5e847c61] at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1028)
2018-10-10 11:03:30,055 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=c17fbdf9-dfaf-373f-c51d-885b5e847c61] at scala.Option.getOrElse(Option.scala:121)
2018-10-10 11:03:30,055 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=c17fbdf9-dfaf-373f-c51d-885b5e847c61] at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1028)
2018-10-10 11:03:30,055 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=c17fbdf9-dfaf-373f-c51d-885b5e847c61] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
2018-10-10 11:03:30,055 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=c17fbdf9-dfaf-373f-c51d-885b5e847c61] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
2018-10-10 11:03:30,055 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=c17fbdf9-dfaf-373f-c51d-885b5e847c61] at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
2018-10-10 11:03:30,055 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=c17fbdf9-dfaf-373f-c51d-885b5e847c61] at org.apache.spark.rdd.RDD.reduce(RDD.scala:1008)
2018-10-10 11:03:30,055 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=c17fbdf9-dfaf-373f-c51d-885b5e847c61] at org.apache.spark.api.java.JavaRDDLike$class.reduce(JavaRDDLike.scala:385)
2018-10-10 11:03:30,055 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=c17fbdf9-dfaf-373f-c51d-885b5e847c61] at org.apache.spark.api.java.AbstractJavaRDDLike.reduce(JavaRDDLike.scala:45)
2018-10-10 11:03:30,055 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=c17fbdf9-dfaf-373f-c51d-885b5e847c61] at com.thinkbiganalytics.spark.datavalidator.StandardDataValidator.cleansedRowResultsValidationCounts(StandardDataValidator.java:324)
2018-10-10 11:03:30,055 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=c17fbdf9-dfaf-373f-c51d-885b5e847c61] at com.thinkbiganalytics.spark.datavalidator.StandardDataValidator.getProfileStats(StandardDataValidator.java:104)
2018-10-10 11:03:30,055 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=c17fbdf9-dfaf-373f-c51d-885b5e847c61] at com.thinkbiganalytics.spark.datavalidator.StandardDataValidator.saveProfileToTable(StandardDataValidator.java:212)
2018-10-10 11:03:30,055 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=c17fbdf9-dfaf-373f-c51d-885b5e847c61] at com.thinkbiganalytics.spark.datavalidator.Validator.run(Validator.java:101)
2018-10-10 11:03:30,055 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=c17fbdf9-dfaf-373f-c51d-885b5e847c61] at com.thinkbiganalytics.spark.datavalidator.Validator.main(Validator.java:54)

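The root cause is that the query result is empty: the reduce call in StandardDataValidator.cleansedRowResultsValidationCounts (StandardDataValidator.java:324) runs on an empty RDD and throws "UnsupportedOperationException: empty collection", which is logged as "Failed to insert validation stats".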
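
The failure in the trace above comes from reducing an empty collection without a starting value. The following is a minimal, Spark-free sketch of that pattern and of one possible guard, using plain Java streams as a stand-in for the RDD. The class ValidationCounts and its methods are hypothetical and are not taken from the Kylo code base or from the fix itself; the counters are assumed to be long arrays of equal width.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not part of Kylo.
final class ValidationCounts {

    // Element-wise sum of two counter arrays (assumed to have equal length).
    static long[] add(long[] a, long[] b) {
        long[] out = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] + b[i];
        }
        return out;
    }

    // Reduce with no starting value: there is nothing to return for empty
    // input, so this throws, mimicking the "empty collection" error above.
    static long[] sumUnsafe(List<long[]> rows) {
        return rows.stream()
                .reduce(ValidationCounts::add)
                .orElseThrow(() -> new UnsupportedOperationException("empty collection"));
    }

    // Reduce with an all-zero starting value: empty input yields zero counts.
    static long[] sumSafe(List<long[]> rows, int width) {
        return rows.stream().reduce(new long[width], ValidationCounts::add);
    }

    public static void main(String[] args) {
        List<long[]> none = Collections.<long[]>emptyList();
        List<long[]> two = Arrays.asList(new long[] {1L, 0L}, new long[] {0L, 1L});

        System.out.println(Arrays.toString(sumSafe(none, 2))); // prints [0, 0]
        System.out.println(Arrays.toString(sumSafe(two, 2)));  // prints [1, 1]
    }
}
```

In Spark itself, comparable guards would be checking JavaRDD.isEmpty() before reducing, or using JavaRDD.fold with a zero value instead of reduce; which approach the merged fix takes is not stated in this issue.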

Environment

None

Activity

Boying Lu
October 10, 2018, 5:01 AM
Boying Lu
October 22, 2018, 2:16 AM

I submitted a new PR against Teradata/kylo: https://github.com/Teradata/kylo/pull/93

Jagrut Sharma
November 8, 2018, 4:08 PM

Merged PR.

Done

Assignee

Jagrut Sharma

Reporter

Boying Lu

Labels

None

Reviewer

None

Story point estimate

None

Components

Fix versions

Affects versions

Priority

Medium