Tested versions: 0.9.1.1 and 0.10.0
Hi, team.
Error:
Transform Exception
An error occurred while executing the transformation.
I found that a visual query reads all of the data from the source table, so if the source data is large, Spark can run into memory issues.
Source tables with more than 100 million rows trigger this error in a visual query, in both local and YARN mode.
Driver and executor memory are sufficient (set in spark.properties). As far as I know, a memory problem would show up in the logs as an OOM error, but there are no errors other than the ones below.
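A plain Spark JDBC read with no partitioning options runs as a single task, so one executor has to pull the whole table. To check outside Kylo whether that is the bottleneck, the read could be split across partitions. This is a minimal sketch under assumptions: the helper jdbc_partition_options, the URL, the table name and the numeric column "id" are all hypothetical, and I do not know whether Kylo's own JdbcRelation honors these options. The option keys themselves are the standard Spark JDBC data source options.

```python
def jdbc_partition_options(url, table, column, lower, upper, partitions, fetch_size=10000):
    """Build an options dict for Spark's JDBC data source (hypothetical helper).

    `column` should be a numeric column; Spark splits [lower, upper) into
    `partitions` ranges and issues one query per range.
    """
    if partitions < 1:
        raise ValueError("partitions must be at least 1")
    if lower >= upper:
        raise ValueError("lower must be smaller than upper")
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(partitions),
        # Passed to the JDBC driver as a fetch-size hint.
        "fetchsize": str(fetch_size),
    }


# Hypothetical connection details, for illustration only.
opts = jdbc_partition_options(
    "jdbc:mariadb://db.example.com:3306/sourcedb", "source_table", "id",
    lower=0, upper=100_000_000, partitions=16,
)
print(opts["numPartitions"])

# With a SparkSession available (requires pyspark, not run here):
#   df = spark.read.format("jdbc").options(**opts).load()
```

If the partitioned read completes where the single-task read does not, that would point at the size of the single JDBC scan rather than at a timeout setting.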
kylo-spark-shell.log
2019-01-17 18:39:53 INFO launcher-proc-1:SparkShellApp:61 - 2019-01-17 18:39:53 ERROR StandardService:182 - Failed to initialize connector [Connector[HTTP/1.1-45000]]
2019-01-17 18:39:53 INFO launcher-proc-1:SparkShellApp:61 - 2019-01-17 18:39:53 ERROR main:StandardService:182 - Failed to initialize connector [Connector[HTTP/1.1-45000]]
2019-01-17 18:43:40 INFO launcher-proc-2:SparkShellApp:61 - 2019-01-17 18:43:40 ERROR StandardService:182 - Failed to initialize connector [Connector[HTTP/1.1-45002]]
2019-01-17 18:43:40 INFO launcher-proc-2:SparkShellApp:61 - 2019-01-17 18:43:40 ERROR main:StandardService:182 - Failed to initialize connector [Connector[HTTP/1.1-45002]]
2019-01-17 18:51:43 INFO launcher-proc-2:SparkShellApp:61 - 2019-01-17 18:51:43 ERROR YarnScheduler:70 - Lost executor 2 on sinbaram02: Executor heartbeat timed out after 150240 ms
2019-01-17 18:51:43 INFO launcher-proc-2:SparkShellApp:61 - 2019-01-17 18:51:43 ERROR dispatcher-event-loop-5:YarnScheduler:70 - Lost executor 2 on sinbaram02: Executor heartbeat timed out after 150240 ms
2019-01-17 18:51:46 INFO launcher-proc-2:SparkShellApp:61 - 2019-01-17 18:51:46 ERROR YarnScheduler:70 - Lost executor 2 on sinbaram02: Container container_1547610421793_0012_01_000003 exited from explicit termination request.
2019-01-17 18:51:46 INFO launcher-proc-2:SparkShellApp:61 - 2019-01-17 18:51:46 ERROR dispatcher-event-loop-1:YarnScheduler:70 - Lost executor 2 on sinbaram02: Container container_1547610421793_0012_01_000003 exited from explicit termination request.
2019-01-17 18:59:43 INFO launcher-proc-2:SparkShellApp:61 - 2019-01-17 18:59:43 ERROR YarnScheduler:70 - Lost executor 1 on sinbaram02: Executor heartbeat timed out after 173949 ms
2019-01-17 18:59:43 INFO launcher-proc-2:SparkShellApp:61 - 2019-01-17 18:59:43 ERROR dispatcher-event-loop-1:YarnScheduler:70 - Lost executor 1 on sinbaram02: Executor heartbeat timed out after 173949 ms
2019-01-17 18:59:44 INFO launcher-proc-2:SparkShellApp:61 - 2019-01-17 18:59:44 ERROR YarnScheduler:70 - Lost executor 1 on sinbaram02: Container container_1547610421793_0012_01_000002 exited from explicit termination request.
2019-01-17 18:59:44 INFO launcher-proc-2:SparkShellApp:61 - 2019-01-17 18:59:44 ERROR dispatcher-event-loop-5:YarnScheduler:70 - Lost executor 1 on sinbaram02: Container container_1547610421793_0012_01_000002 exited from explicit termination request.
2019-01-17 19:04:43 INFO launcher-proc-2:SparkShellApp:61 - 2019-01-17 19:04:43 ERROR YarnScheduler:70 - Lost executor 3 on sinbaram02: Executor heartbeat timed out after 158921 ms
2019-01-17 19:04:43 INFO launcher-proc-2:SparkShellApp:61 - 2019-01-17 19:04:43 ERROR dispatcher-event-loop-7:YarnScheduler:70 - Lost executor 3 on sinbaram02: Executor heartbeat timed out after 158921 ms
2019-01-17 19:04:46 INFO launcher-proc-2:SparkShellApp:61 - 2019-01-17 19:04:46 ERROR YarnScheduler:70 - Lost executor 3 on sinbaram02: Container container_1547610421793_0012_01_000005 exited from explicit termination request.
My question is: do you have any timeout logic in kylo-service or SparkShellApp?
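For context on this question: every "Executor heartbeat timed out after N ms" value in kylo-spark-shell.log is above 120000 ms, which is Spark's documented default for spark.network.timeout. The sketch below extracts those events so the comparison is explicit. It assumes the log format shown above and that the default has not been overridden in spark.properties (not verified). The names heartbeat_timeouts and ASSUMED_NETWORK_TIMEOUT_MS are hypothetical helpers, not part of Kylo or Spark.

```python
import re

# Matches the driver-side message format seen in kylo-spark-shell.log.
HEARTBEAT_RE = re.compile(
    r"Lost executor (\d+) on (\S+): Executor heartbeat timed out after (\d+) ms"
)

# Assumption: Spark's default spark.network.timeout of 120s is in effect.
ASSUMED_NETWORK_TIMEOUT_MS = 120_000


def heartbeat_timeouts(lines):
    """Return unique (executor_id, host, elapsed_ms) tuples in first-seen order."""
    seen = []
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            event = (int(match.group(1)), match.group(2), int(match.group(3)))
            if event not in seen:  # most messages appear twice in this log
                seen.append(event)
    return seen


sample = [
    "2019-01-17 18:51:43 ERROR YarnScheduler:70 - Lost executor 2 on sinbaram02: Executor heartbeat timed out after 150240 ms",
    "2019-01-17 18:51:43 ERROR dispatcher-event-loop-5:YarnScheduler:70 - Lost executor 2 on sinbaram02: Executor heartbeat timed out after 150240 ms",
    "2019-01-17 18:51:46 ERROR YarnScheduler:70 - Lost executor 2 on sinbaram02: Container container_1547610421793_0012_01_000003 exited from explicit termination request.",
    "2019-01-17 18:59:43 ERROR YarnScheduler:70 - Lost executor 1 on sinbaram02: Executor heartbeat timed out after 173949 ms",
]

for executor, host, elapsed in heartbeat_timeouts(sample):
    over = elapsed > ASSUMED_NETWORK_TIMEOUT_MS
    print(f"executor {executor} on {host}: {elapsed} ms, above assumed timeout: {over}")
```

If the default is in effect, raising spark.network.timeout (while keeping spark.executor.heartbeatInterval well below it) would be one thing to try, though it would not explain why the executor stops sending heartbeats in the first place.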
This is the container-side log:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/hadoop/yarn/local/filecache/10/spark2-hdp-yarn-archive.tar.gz/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.6.5.1050-37/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/01/17 19:26:27 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 18187@sinbaram02
19/01/17 19:26:27 INFO SignalUtils: Registered signal handler for TERM
19/01/17 19:26:27 INFO SignalUtils: Registered signal handler for HUP
19/01/17 19:26:27 INFO SignalUtils: Registered signal handler for INT
19/01/17 19:26:27 INFO SecurityManager: Changing view acls to: yarn,kylo
19/01/17 19:26:27 INFO SecurityManager: Changing modify acls to: yarn,kylo
19/01/17 19:26:27 INFO SecurityManager: Changing view acls groups to:
19/01/17 19:26:27 INFO SecurityManager: Changing modify acls groups to:
19/01/17 19:26:27 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, kylo); groups with view permissions: Set(); users with modify permissions: Set(yarn, kylo); groups with modify permissions: Set()
19/01/17 19:26:28 INFO TransportClientFactory: Successfully created connection to sinbaram02/70.70.190.197:33041 after 53 ms (0 ms spent in bootstraps)
19/01/17 19:26:28 INFO SecurityManager: Changing view acls to: yarn,kylo
19/01/17 19:26:28 INFO SecurityManager: Changing modify acls to: yarn,kylo
19/01/17 19:26:28 INFO SecurityManager: Changing view acls groups to:
19/01/17 19:26:28 INFO SecurityManager: Changing modify acls groups to:
19/01/17 19:26:28 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, kylo); groups with view permissions: Set(); users with modify permissions: Set(yarn, kylo); groups with modify permissions: Set()
19/01/17 19:26:28 INFO TransportClientFactory: Successfully created connection to sinbaram02/70.70.190.197:33041 after 0 ms (0 ms spent in bootstraps)
19/01/17 19:26:28 INFO DiskBlockManager: Created local directory at /data/hadoop/yarn/local/usercache/kylo/appcache/application_1547610421793_0013/blockmgr-02713cfd-75fb-4bfc-ab9b-f22f5bbb4088
19/01/17 19:26:28 INFO MemoryStore: MemoryStore started with capacity 21.2 GB
19/01/17 19:26:28 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@sinbaram02:33041
19/01/17 19:26:28 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
19/01/17 19:26:28 INFO Executor: Starting executor ID 1 on host sinbaram02
19/01/17 19:26:28 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 45009.
19/01/17 19:26:28 INFO NettyBlockTransferService: Server created on sinbaram02:45009
19/01/17 19:26:28 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/01/17 19:26:28 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(1, sinbaram02, 45009, None)
19/01/17 19:26:28 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(1, sinbaram02, 45009, None)
19/01/17 19:26:28 INFO BlockManager: Initialized BlockManager: BlockManagerId(1, sinbaram02, 45009, None)
19/01/17 19:26:55 INFO CoarseGrainedExecutorBackend: Got assigned task 0
19/01/17 19:26:55 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
19/01/17 19:26:55 INFO Executor: Fetching spark://sinbaram02:33041/jars/mariadb-java-client-1.5.7.jar with timestamp 1547720813921
19/01/17 19:26:55 INFO TransportClientFactory: Successfully created connection to sinbaram02/70.70.190.197:33041 after 2 ms (0 ms spent in bootstraps)
19/01/17 19:26:55 INFO Utils: Fetching spark://sinbaram02:33041/jars/mariadb-java-client-1.5.7.jar to /data/hadoop/yarn/local/usercache/kylo/appcache/application_1547610421793_0013/spark-7065dc2d-1eb9-459e-a93a-ee4d1f72a5dd/fetchFileTemp5401080981497225014.tmp
19/01/17 19:26:55 INFO Utils: Copying /data/hadoop/yarn/local/usercache/kylo/appcache/application_1547610421793_0013/spark-7065dc2d-1eb9-459e-a93a-ee4d1f72a5dd/-7935189791547720813921_cache to /data/hadoop/yarn/local/usercache/kylo/appcache/application_1547610421793_0013/container_1547610421793_0013_01_000002/./mariadb-java-client-1.5.7.jar
19/01/17 19:26:55 INFO Executor: Adding file:/data/hadoop/yarn/local/usercache/kylo/appcache/application_1547610421793_0013/container_1547610421793_0013_01_000002/./mariadb-java-client-1.5.7.jar to class loader
19/01/17 19:26:55 INFO Executor: Fetching spark://sinbaram02:33041/jars/kylo-spark-shell-client-v2-0.10.0.jar with timestamp 1547720780470
19/01/17 19:26:55 INFO Utils: Fetching spark://sinbaram02:33041/jars/kylo-spark-shell-client-v2-0.10.0.jar to /data/hadoop/yarn/local/usercache/kylo/appcache/application_1547610421793_0013/spark-7065dc2d-1eb9-459e-a93a-ee4d1f72a5dd/fetchFileTemp4637504695299463482.tmp
19/01/17 19:26:56 INFO Utils: Copying /data/hadoop/yarn/local/usercache/kylo/appcache/application_1547610421793_0013/spark-7065dc2d-1eb9-459e-a93a-ee4d1f72a5dd/-6295615651547720780470_cache to /data/hadoop/yarn/local/usercache/kylo/appcache/application_1547610421793_0013/container_1547610421793_0013_01_000002/./kylo-spark-shell-client-v2-0.10.0.jar
19/01/17 19:26:56 INFO Executor: Adding file:/data/hadoop/yarn/local/usercache/kylo/appcache/application_1547610421793_0013/container_1547610421793_0013_01_000002/./kylo-spark-shell-client-v2-0.10.0.jar to class loader
19/01/17 19:26:56 INFO TorrentBroadcast: Started reading broadcast variable 0
19/01/17 19:26:56 INFO TransportClientFactory: Successfully created connection to sinbaram02/70.70.190.197:43680 after 1 ms (0 ms spent in bootstraps)
19/01/17 19:26:56 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 10.2 KB, free 21.2 GB)
19/01/17 19:26:56 INFO TorrentBroadcast: Reading broadcast variable 0 took 66 ms
19/01/17 19:26:56 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 27.6 KB, free 21.2 GB)
Spark UI:
Details for Job 0
Status: FAILED
Job Group: cd79ec3973f941c4b13d016d36490f21
Skipped Stages: 1
Failed Stages: 1
DAG Visualization
Stage 0
WholeStageCodegen
CollectLimit
JDBCRDD [0]buildScan at JdbcRelation.java:76
MapPartitionsRDD [1]persist at DataSet20.java:108
MapPartitionsRDD [2]persist at DataSet20.java:108
MapPartitionsRDD [3]persist at DataSet20.java:108
Stage 1 (skipped)
CollectLimit
mapPartitionsInternal
InMemoryTableScan
mapPartitionsInternal
ShuffledRowRDD [4]persist at DataSet20.java:108
MapPartitionsRDD [5]persist at DataSet20.java:108
CollectLimit 1000
+- *(1) LocalLimit 1000
   +- *(1) Project [year#0 AS year#69, p_num#1 AS p_num#70, sex#2 AS sex#71, y_code#3 AS y_code#72, c_code#4 AS c_code#73, height#5 AS height#74, weight#6 AS weight#75, waist#7 AS waist#76, l_look#8 AS l_look#77, r_look#9 AS r_look#78, l_ear#10 AS l_ear#79, r_ear#11 AS r_ear#80, l_pre#12 AS l_pre#81, h_pre#13 AS h_pre#82, g#14 AS g#83, col#15 AS col#84, tri#16 AS tri#85, hdl#17 AS hdl#86, ldl#18 AS ldl#87, hy#19 AS hy#88, yo#20 AS yo#89, hul#21 AS hul#90, hul2#22 AS hul2#91, hul3#23 AS hul3#92, ... 10 more fields]
      +- *(1) Scan com.thinkbiganalytics.kylo.catalog.spark.sources.jdbc.JdbcRelation@160ba834 [year#0,hdl#17,che#31,hul#21,weight#6,um#26,dam#25,yo#20,p_num#1,hul2#22,hy#19,y_code#3,col#15,gyu#29,tri#16,r_ear#11,l_look#8,waist#7,chi2#30,h_pre#13,g#14,sex#2,gamma#24,l_ear#10,... 10 more fields] PushedFilters: [], ReadSchema: struct<year:string,hdl:string,che:string,hul:string,weight:string,um:string,dam:string,yo:string,...
[6] [Cached] persist at DataSet20.java:108
MapPartitionsRDD [8]collectAsList at DataSet20.java:93
MapPartitionsRDD [9]collectAsList at DataSet20.java:93
MapPartitionsRDD [10]collectAsList at DataSet20.java:93
Skipped Stages (1)
Stage Id: 1
Description: collectAsList at DataSet20.java:93
Submitted: Unknown
Duration: Unknown
Tasks (Succeeded/Total): 0/1
Input, Output, Shuffle Read, Shuffle Write: (none shown)
Failed Stages (1)
Stage Id: 0
Description: Transform Job, persist at DataSet20.java:108
Submitted: 2019/01/17 19:26:55
Duration: 57 min
Tasks (Succeeded/Total): 0/1 (4 failed)
Input, Output, Shuffle Read, Shuffle Write: (none shown)
Failure Reason: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, sinbaram02, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 151293 ms
Driver stacktrace: