Spark Profiler failing to profile some columns

Description

Environment: MapR 6.0.1
Spark Version: 2.2.1-mapr-1803

When running a data feed for userdata1.csv, I see the following stack trace 13 times in the NiFi logs:

2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] 18/08/15 19:46:46 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 3)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] java.lang.NumberFormatException: empty String
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1842)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at java.lang.Double.parseDouble(Double.java:538)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at com.thinkbiganalytics.spark.dataprofiler.histo.HistogramStatistics$1.call(HistogramStatistics.java:64)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at com.thinkbiganalytics.spark.dataprofiler.histo.HistogramStatistics$1.call(HistogramStatistics.java:61)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapToDouble$1.apply(JavaRDDLike.scala:109)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapToDouble$1.apply(JavaRDDLike.scala:109)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at scala.collection.Iterator$class.foreach(Iterator.scala:893)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at scala.collection.TraversableOnce$class.reversed(TraversableOnce.scala:101)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at scala.collection.AbstractIterator.reversed(Iterator.scala:1336)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at scala.collection.TraversableOnce$class.foldRight(TraversableOnce.scala:162)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at scala.collection.AbstractIterator.foldRight(Iterator.scala:1336)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$histogram$1$$anonfun$3.apply(DoubleRDDFunctions.scala:132)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$histogram$1$$anonfun$3.apply(DoubleRDDFunctions.scala:130)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at org.apache.spark.scheduler.Task.run(Task.scala:108)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2018-08-15 19:46:46,538 INFO [stream error] c.t.nifi.v2.spark.ExecuteSparkJob ExecuteSparkJob[id=5ba202c8-fea6-3908-a2fe-35a107ba14bb] at java.lang.Thread.run(Thread.java:748)
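
The trace shows java.lang.Double.parseDouble throwing "NumberFormatException: empty String" from inside the mapToDouble call at HistogramStatistics.java:64, which suggests that blank cells in the column reach the parser unguarded. The sketch below reproduces that exception in plain Java (no Spark dependency) and shows one possible guard. The class EmptyStringParseSketch and the helpers parseOrNull and parseable are hypothetical names made up for illustration. They are not taken from the profiler source, and skipping blank values is only an assumption about the desired behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EmptyStringParseSketch {

    // Hypothetical helper: returns null instead of throwing when the raw
    // value is null, blank, or otherwise not parseable as a double.
    public static Double parseOrNull(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return null;
        }
        try {
            return Double.parseDouble(raw);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Hypothetical helper: keeps only the values that parse, so that a
    // downstream histogram would never see a blank cell.
    public static List<Double> parseable(List<String> column) {
        List<Double> values = new ArrayList<>();
        for (String raw : column) {
            Double parsed = parseOrNull(raw);
            if (parsed != null) {
                values.add(parsed);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Unguarded call: same exception type as in the stack trace.
        try {
            Double.parseDouble("");
        } catch (NumberFormatException e) {
            System.out.println("unguarded parse failed: " + e.getMessage());
        }

        // Guarded call: blank cells are skipped rather than failing the task.
        List<String> column = Arrays.asList("1.5", "", "  ", "42");
        System.out.println("guarded parse kept: " + parseable(column));
    }
}
```

In a Spark job the same check could be applied as a filter before the values are converted to doubles. Whether blank cells should be skipped, counted as nulls, or reported separately is a design decision for the profiler and is not settled by this ticket.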

Environment

None

Assignee

Unassigned

Reporter

Jeremy Merrifield

Labels

None

Reviewer

None

Story point estimate

None

Affects versions

Priority

Medium