Big data issue in Visual Query

Description

  • Tested versions: 0.9.1.1 and 0.10.0
    Hi, team.

  • Error:
    Transform Exception
    An error occurred while executing the transformation.

  • I found that Visual Query reads all of the data from the source table, so if the source data is large, Spark can run into memory issues.

  • Source data of more than 100 million rows triggers this error in Visual Query (in both local and YARN mode).

  • Driver and executor memory should be sufficient (set in spark.properties). As far as I know, a memory problem would show up in the log as an OOM error, but there are no errors other than the ones below.

  • kylo-spark-shell.log
    2019-01-17 18:39:53 INFO  launcher-proc-1:SparkShellApp:61 - 2019-01-17 18:39:53 ERROR StandardService:182 - Failed to initialize connector [Connector[HTTP/1.1-45000]]
    2019-01-17 18:39:53 INFO  launcher-proc-1:SparkShellApp:61 - 2019-01-17 18:39:53 ERROR main:StandardService:182 - Failed to initialize connector [Connector[HTTP/1.1-45000]]
    2019-01-17 18:43:40 INFO  launcher-proc-2:SparkShellApp:61 - 2019-01-17 18:43:40 ERROR StandardService:182 - Failed to initialize connector [Connector[HTTP/1.1-45002]]
    2019-01-17 18:43:40 INFO  launcher-proc-2:SparkShellApp:61 - 2019-01-17 18:43:40 ERROR main:StandardService:182 - Failed to initialize connector [Connector[HTTP/1.1-45002]]
    2019-01-17 18:51:43 INFO  launcher-proc-2:SparkShellApp:61 - 2019-01-17 18:51:43 ERROR YarnScheduler:70 - Lost executor 2 on sinbaram02: Executor heartbeat timed out after 150240 ms
    2019-01-17 18:51:43 INFO  launcher-proc-2:SparkShellApp:61 - 2019-01-17 18:51:43 ERROR dispatcher-event-loop-5:YarnScheduler:70 - Lost executor 2 on sinbaram02: Executor heartbeat timed out after 150240 ms
    2019-01-17 18:51:46 INFO  launcher-proc-2:SparkShellApp:61 - 2019-01-17 18:51:46 ERROR YarnScheduler:70 - Lost executor 2 on sinbaram02: Container container_1547610421793_0012_01_000003 exited from explicit termination request.
    2019-01-17 18:51:46 INFO  launcher-proc-2:SparkShellApp:61 - 2019-01-17 18:51:46 ERROR dispatcher-event-loop-1:YarnScheduler:70 - Lost executor 2 on sinbaram02: Container container_1547610421793_0012_01_000003 exited from explicit termination request.
    2019-01-17 18:59:43 INFO  launcher-proc-2:SparkShellApp:61 - 2019-01-17 18:59:43 ERROR YarnScheduler:70 - Lost executor 1 on sinbaram02: Executor heartbeat timed out after 173949 ms
    2019-01-17 18:59:43 INFO  launcher-proc-2:SparkShellApp:61 - 2019-01-17 18:59:43 ERROR dispatcher-event-loop-1:YarnScheduler:70 - Lost executor 1 on sinbaram02: Executor heartbeat timed out after 173949 ms
    2019-01-17 18:59:44 INFO  launcher-proc-2:SparkShellApp:61 - 2019-01-17 18:59:44 ERROR YarnScheduler:70 - Lost executor 1 on sinbaram02: Container container_1547610421793_0012_01_000002 exited from explicit termination request.
    2019-01-17 18:59:44 INFO  launcher-proc-2:SparkShellApp:61 - 2019-01-17 18:59:44 ERROR dispatcher-event-loop-5:YarnScheduler:70 - Lost executor 1 on sinbaram02: Container container_1547610421793_0012_01_000002 exited from explicit termination request.
    2019-01-17 19:04:43 INFO  launcher-proc-2:SparkShellApp:61 - 2019-01-17 19:04:43 ERROR YarnScheduler:70 - Lost executor 3 on sinbaram02: Executor heartbeat timed out after 158921 ms
    2019-01-17 19:04:43 INFO  launcher-proc-2:SparkShellApp:61 - 2019-01-17 19:04:43 ERROR dispatcher-event-loop-7:YarnScheduler:70 - Lost executor 3 on sinbaram02: Executor heartbeat timed out after 158921 ms
    2019-01-17 19:04:46 INFO  launcher-proc-2:SparkShellApp:61 - 2019-01-17 19:04:46 ERROR YarnScheduler:70 - Lost executor 3 on sinbaram02: Container container_1547610421793_0012_01_000005 exited from explicit termination request.
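
The description above says Visual Query reads the whole source table, and the Spark UI details further down show the JDBC scan running as a single task with `PushedFilters: []`. A minimal sketch of two ways to reduce that load with Spark's standard JDBC reader options (`dbtable`, `fetchsize`, `partitionColumn`, `lowerBound`, `upperBound`, `numPartitions`) follows. The helper names, the URL, the table name `health_checkup`, and the numeric column `id` are hypothetical, and whether Kylo's own `JdbcRelation` honors these options is an assumption that would need to be checked.

```python
def limited_jdbc_options(url, table, limit, fetch_size=1000):
    """Build Spark JDBC reader options that let the database apply a row limit."""
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    # Spark's JDBC source accepts a parenthesised, aliased subquery as
    # "dbtable", so the LIMIT is evaluated inside the database.
    subquery = f"(SELECT * FROM {table} LIMIT {int(limit)}) AS preview"
    return {"url": url, "dbtable": subquery, "fetchsize": str(fetch_size)}


def partitioned_jdbc_options(url, table, column, lower, upper, num_partitions):
    """Build options that split a full-table read into several parallel tasks."""
    if num_partitions < 1 or lower >= upper:
        raise ValueError("need num_partitions >= 1 and lower < upper")
    return {
        "url": url,
        "dbtable": table,
        # The partition column must be numeric, date, or timestamp.
        "partitionColumn": column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    }


if __name__ == "__main__":
    # Hypothetical connection details, for illustration only.
    url = "jdbc:mariadb://dbhost:3306/sample"
    print(limited_jdbc_options(url, "health_checkup", 1000))
    print(partitioned_jdbc_options(url, "health_checkup", "id", 1, 100_000_000, 16))
```

With a live SparkSession, either dictionary could be passed as `spark.read.format("jdbc").options(**opts).load()`; the driver class and credentials are left out here.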

Question: do you have any timeout logic in kylo-service or SparkShellApp?
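
For context on the timeout question: the "Executor heartbeat timed out" messages are emitted by Spark's driver, and in stock Spark that limit is governed by `spark.network.timeout`, which defaults to 120s. Whether Kylo adds a timeout of its own on top of that is still the open question. The sketch below pulls the heartbeat timeouts out of log lines like the ones above and shows by how much each one exceeded the Spark default; the function names are hypothetical.

```python
import re

# Matches the YarnScheduler message shown in kylo-spark-shell.log.
HEARTBEAT_RE = re.compile(
    r"Lost executor (\d+) on (\S+): Executor heartbeat timed out after (\d+) ms"
)

# Spark's default for spark.network.timeout is 120s.
DEFAULT_NETWORK_TIMEOUT_MS = 120_000


def heartbeat_timeouts(lines):
    """Return unique (executor_id, host, elapsed_ms) tuples in order of appearance."""
    seen = []
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            item = (int(match.group(1)), match.group(2), int(match.group(3)))
            # Each event is logged twice (with and without the thread name).
            if item not in seen:
                seen.append(item)
    return seen


def exceeded_by(lines, timeout_ms=DEFAULT_NETWORK_TIMEOUT_MS):
    """Return (executor_id, ms over the timeout) for each heartbeat timeout."""
    return [
        (executor, elapsed - timeout_ms)
        for executor, _host, elapsed in heartbeat_timeouts(lines)
    ]
```

If the executors are simply busy (for example, stuck in a long JDBC fetch or in garbage collection), raising `spark.network.timeout` only delays the failure, so this is a diagnostic aid rather than a fix.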

This is the container-side log:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/hadoop/yarn/local/filecache/10/spark2-hdp-yarn-archive.tar.gz/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.6.5.1050-37/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/01/17 19:26:27 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 18187@sinbaram02
19/01/17 19:26:27 INFO SignalUtils: Registered signal handler for TERM
19/01/17 19:26:27 INFO SignalUtils: Registered signal handler for HUP
19/01/17 19:26:27 INFO SignalUtils: Registered signal handler for INT
19/01/17 19:26:27 INFO SecurityManager: Changing view acls to: yarn,kylo
19/01/17 19:26:27 INFO SecurityManager: Changing modify acls to: yarn,kylo
19/01/17 19:26:27 INFO SecurityManager: Changing view acls groups to:
19/01/17 19:26:27 INFO SecurityManager: Changing modify acls groups to:
19/01/17 19:26:27 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, kylo); groups with view permissions: Set(); users with modify permissions: Set(yarn, kylo); groups with modify permissions: Set()
19/01/17 19:26:28 INFO TransportClientFactory: Successfully created connection to sinbaram02/70.70.190.197:33041 after 53 ms (0 ms spent in bootstraps)
19/01/17 19:26:28 INFO SecurityManager: Changing view acls to: yarn,kylo
19/01/17 19:26:28 INFO SecurityManager: Changing modify acls to: yarn,kylo
19/01/17 19:26:28 INFO SecurityManager: Changing view acls groups to:
19/01/17 19:26:28 INFO SecurityManager: Changing modify acls groups to:
19/01/17 19:26:28 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, kylo); groups with view permissions: Set(); users with modify permissions: Set(yarn, kylo); groups with modify permissions: Set()
19/01/17 19:26:28 INFO TransportClientFactory: Successfully created connection to sinbaram02/70.70.190.197:33041 after 0 ms (0 ms spent in bootstraps)
19/01/17 19:26:28 INFO DiskBlockManager: Created local directory at /data/hadoop/yarn/local/usercache/kylo/appcache/application_1547610421793_0013/blockmgr-02713cfd-75fb-4bfc-ab9b-f22f5bbb4088
19/01/17 19:26:28 INFO MemoryStore: MemoryStore started with capacity 21.2 GB
19/01/17 19:26:28 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@sinbaram02:33041
19/01/17 19:26:28 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
19/01/17 19:26:28 INFO Executor: Starting executor ID 1 on host sinbaram02
19/01/17 19:26:28 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 45009.
19/01/17 19:26:28 INFO NettyBlockTransferService: Server created on sinbaram02:45009
19/01/17 19:26:28 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/01/17 19:26:28 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(1, sinbaram02, 45009, None)
19/01/17 19:26:28 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(1, sinbaram02, 45009, None)
19/01/17 19:26:28 INFO BlockManager: Initialized BlockManager: BlockManagerId(1, sinbaram02, 45009, None)
19/01/17 19:26:55 INFO CoarseGrainedExecutorBackend: Got assigned task 0
19/01/17 19:26:55 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
19/01/17 19:26:55 INFO Executor: Fetching spark://sinbaram02:33041/jars/mariadb-java-client-1.5.7.jar with timestamp 1547720813921
19/01/17 19:26:55 INFO TransportClientFactory: Successfully created connection to sinbaram02/70.70.190.197:33041 after 2 ms (0 ms spent in bootstraps)
19/01/17 19:26:55 INFO Utils: Fetching spark://sinbaram02:33041/jars/mariadb-java-client-1.5.7.jar to /data/hadoop/yarn/local/usercache/kylo/appcache/application_1547610421793_0013/spark-7065dc2d-1eb9-459e-a93a-ee4d1f72a5dd/fetchFileTemp5401080981497225014.tmp
19/01/17 19:26:55 INFO Utils: Copying /data/hadoop/yarn/local/usercache/kylo/appcache/application_1547610421793_0013/spark-7065dc2d-1eb9-459e-a93a-ee4d1f72a5dd/-7935189791547720813921_cache to /data/hadoop/yarn/local/usercache/kylo/appcache/application_1547610421793_0013/container_1547610421793_0013_01_000002/./mariadb-java-client-1.5.7.jar
19/01/17 19:26:55 INFO Executor: Adding file:/data/hadoop/yarn/local/usercache/kylo/appcache/application_1547610421793_0013/container_1547610421793_0013_01_000002/./mariadb-java-client-1.5.7.jar to class loader
19/01/17 19:26:55 INFO Executor: Fetching spark://sinbaram02:33041/jars/kylo-spark-shell-client-v2-0.10.0.jar with timestamp 1547720780470
19/01/17 19:26:55 INFO Utils: Fetching spark://sinbaram02:33041/jars/kylo-spark-shell-client-v2-0.10.0.jar to /data/hadoop/yarn/local/usercache/kylo/appcache/application_1547610421793_0013/spark-7065dc2d-1eb9-459e-a93a-ee4d1f72a5dd/fetchFileTemp4637504695299463482.tmp
19/01/17 19:26:56 INFO Utils: Copying /data/hadoop/yarn/local/usercache/kylo/appcache/application_1547610421793_0013/spark-7065dc2d-1eb9-459e-a93a-ee4d1f72a5dd/-6295615651547720780470_cache to /data/hadoop/yarn/local/usercache/kylo/appcache/application_1547610421793_0013/container_1547610421793_0013_01_000002/./kylo-spark-shell-client-v2-0.10.0.jar
19/01/17 19:26:56 INFO Executor: Adding file:/data/hadoop/yarn/local/usercache/kylo/appcache/application_1547610421793_0013/container_1547610421793_0013_01_000002/./kylo-spark-shell-client-v2-0.10.0.jar to class loader
19/01/17 19:26:56 INFO TorrentBroadcast: Started reading broadcast variable 0
19/01/17 19:26:56 INFO TransportClientFactory: Successfully created connection to sinbaram02/70.70.190.197:43680 after 1 ms (0 ms spent in bootstraps)
19/01/17 19:26:56 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 10.2 KB, free 21.2 GB)
19/01/17 19:26:56 INFO TorrentBroadcast: Reading broadcast variable 0 took 66 ms
19/01/17 19:26:56 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 27.6 KB, free 21.2 GB)

Spark UI

Details for Job 0
Status: FAILED
Job Group: cd79ec3973f941c4b13d016d36490f21
Skipped Stages: 1
Failed Stages: 1

DAG Visualization

Stage 0
  WholeStageCodegen
  CollectLimit
  JDBCRDD [0] buildScan at JdbcRelation.java:76
  MapPartitionsRDD [1] persist at DataSet20.java:108
  MapPartitionsRDD [2] persist at DataSet20.java:108
  MapPartitionsRDD [3] persist at DataSet20.java:108

Stage 1 (skipped)
  CollectLimit
  mapPartitionsInternal
  InMemoryTableScan
  mapPartitionsInternal
  ShuffledRowRDD [4] persist at DataSet20.java:108
  MapPartitionsRDD [5] persist at DataSet20.java:108
  CollectLimit 1000 +- *(1) LocalLimit 1000 +- *(1) Project [year#0 AS year#69, p_num#1 AS p_num#70, sex#2 AS sex#71, y_code#3 AS y_code#72, c_code#4 AS c_code#73, height#5 AS height#74, weight#6 AS weight#75, waist#7 AS waist#76, l_look#8 AS l_look#77, r_look#9 AS r_look#78, l_ear#10 AS l_ear#79, r_ear#11 AS r_ear#80, l_pre#12 AS l_pre#81, h_pre#13 AS h_pre#82, g#14 AS g#83, col#15 AS col#84, tri#16 AS tri#85, hdl#17 AS hdl#86, ldl#18 AS ldl#87, hy#19 AS hy#88, yo#20 AS yo#89, hul#21 AS hul#90, hul2#22 AS hul2#91, hul3#23 AS hul3#92, ... 10 more fields] +- *(1) Scan com.thinkbiganalytics.kylo.catalog.spark.sources.jdbc.JdbcRelation@160ba834 [year#0,hdl#17,che#31,hul#21,weight#6,um#26,dam#25,yo#20,p_num#1,hul2#22,hy#19,y_code#3,col#15,gyu#29,tri#16,r_ear#11,l_look#8,waist#7,chi2#30,h_pre#13,g#14,sex#2,gamma#24,l_ear#10,... 10 more fields] PushedFilters: [], ReadSchema: struct<year:string,hdl:string,che:string,hul:string,weight:string,um:string,dam:string,yo:string,... [6] [Cached] persist at DataSet20.java:108
  MapPartitionsRDD [8] collectAsList at DataSet20.java:93
  MapPartitionsRDD [9] collectAsList at DataSet20.java:93
  MapPartitionsRDD [10] collectAsList at DataSet20.java:93

Skipped Stages (1)
  Stage Id: 1
  Description: collectAsList at DataSet20.java:93
  Submitted: Unknown
  Duration: Unknown
  Tasks: Succeeded/Total: 0/1
  Input, Output, Shuffle Read, Shuffle Write: (blank)

Failed Stages (1)
  Stage Id: 0
  Description: Transform Job, persist at DataSet20.java:108
  Submitted: 2019/01/17 19:26:55
  Duration: 57 min
  Tasks: Succeeded/Total: 0/1 (4 failed)
  Input, Output, Shuffle Read, Shuffle Write: (blank)
  Failure Reason: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, sinbaram02, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 151293 ms
  Driver stacktrace:

Environment: None

Status

Assignee: Unassigned
Reporter: Kay
Labels: None
Reviewer: None
Story point estimate: None
Priority: Medium