'Size exceeds Integer.MAX_VALUE' exception when running Validation spark job
The Validation Spark job fails with java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE (full stack trace in attachment) when processing some Parquet files. For other Parquet files the job finishes and validation is done.
The validation job runs on an HDInsight cluster with Spark 2.0.2 (Spark UI screenshot of the failed job attached).
Tested with version 0.7.0 of the validation spark2 job, as at the time the fix for wasn't available. The Kylo version is 0.8.0-SNAPSHOT (built on commit 62a6b2f945af41883cdcff557322247737fd458a).
Thanks for providing the actual file that throws the error. An additional parameter has been added to the validator (Refer ). The feed ran successfully on Spark 2 in the Kylo VM with a --numPartitions value of 10. Time taken was 15 minutes. Screenshot attached (success-041720167.png).
I tried to replicate the issue by generating a 34-column Parquet dataset with the same column types as in the attached log. Number of rows = 7,000,000, file size = 1.7 GB. I ran this via a feed on the Kylo VM and the validation job completed, using 1 executor with 512 MB memory.
Having access to the actual file throwing the error would probably help in reproducing the problem.
I also tried it on a newer version of Kylo (0.8.0-SNAPSHOT built with fixed ) with the same result (I attached 2 additional stack traces - one from spark logs and other from nifi-app.log). Attached logs are for running the validation spark job with 12g memory per executor and using only 1 executor.
More details about the source parquet file:
1) 127.4 MB parquet file (no compression applied)
2) 34 columns of various types (types can be seen in nifi-app.log)
3) 6776332 rows
Can you please provide the 1) size, 2) number of columns and 3) number of rows in the source parquet file that threw this error?
There are a few changes in the validator since version 0.7. Could you try to take the latest version and run it? It may still run into this issue, but it would be helpful to look at the complete logs.
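The root cause is that the partition being processed exceeds Spark's built-in limit on the size of a single block (Integer.MAX_VALUE bytes, roughly 2 GB). Splitting the data into more partitions keeps each one under that limit.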
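
For anyone choosing a partition count, here is a rough sketch of the arithmetic behind the partition limit. It is not the validator's code: `min_partitions` and `target_fraction` are hypothetical names used only for illustration, and the in-memory size of the data has to be estimated by the user (it can be much larger than the Parquet file on disk).

```python
# Illustrative only: not part of Kylo or Spark.
# A single Spark 2.0.x block cannot exceed Integer.MAX_VALUE bytes.
MAX_BLOCK_BYTES = 2**31 - 1  # Integer.MAX_VALUE


def min_partitions(estimated_bytes, target_fraction=0.25):
    """Return the smallest partition count that keeps the *average*
    partition at or below target_fraction of the block limit.

    estimated_bytes: user-supplied estimate of the in-memory data size.
    target_fraction: assumed safety margin, since real partitions are
                     rarely perfectly even.
    """
    if estimated_bytes < 0:
        raise ValueError("estimated_bytes must be non-negative")
    if not 0 < target_fraction <= 1:
        raise ValueError("target_fraction must be in (0, 1]")

    target_bytes = int(MAX_BLOCK_BYTES * target_fraction)
    # Integer ceiling division avoids floating-point rounding surprises.
    return max(1, (estimated_bytes + target_bytes - 1) // target_bytes)


if __name__ == "__main__":
    # Example: an assumed 6 GiB in-memory footprint.
    print(min_partitions(6 * 2**30))
```

The result is only a lower bound on an average; skewed data can still produce one oversized partition, so a larger value may be needed for --numPartitions in practice.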