HDFS processors create unique file system handles which may be leaking resources

Description

Currently, HDFS processors disable the file system cache when resetting AbstractHadoopProcessor.HDFSResources(). This may not be optimal: in theory we may be leaking HDFS resources across starts and stops of the processors. I noticed the relevant code in the class, with a comment by Matt Hutton:

Indeed, if you comment out the config.set() call, then you cannot change the umask by stopping and starting the processor. This may be a bug in the HDFS file system cache not picking up the changed config, and thus not yielding a distinct fs handle when the value changes. Or perhaps it has to do with how we store and use the handles.
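One plausible explanation (an assumption, not confirmed against Hadoop's source here) is that the file system cache is keyed only by connection identity, such as scheme, authority, and user, so a changed per-request setting like umask is not part of the key and never produces a new handle. A minimal, pure-Java sketch of such a cache illustrates the symptom:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Illustrative sketch only (NOT Hadoop's actual code): a FileSystem-style
// cache keyed by (scheme, authority, user). Settings such as umask are not
// part of the key, so changing them does not yield a distinct handle.
public class FsCacheSketch {
    static class Key {
        final String scheme, authority, user;
        Key(String scheme, String authority, String user) {
            this.scheme = scheme; this.authority = authority; this.user = user;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return scheme.equals(k.scheme) && authority.equals(k.authority)
                    && user.equals(k.user);
        }
        @Override public int hashCode() { return Objects.hash(scheme, authority, user); }
    }

    static class FsHandle {
        final String umask; // captured once, at creation time
        FsHandle(String umask) { this.umask = umask; }
    }

    private final Map<Key, FsHandle> cache = new HashMap<>();

    // Analogous to FileSystem.get(uri, conf): returns the cached handle when
    // the key matches, even if the umask in the "config" has since changed.
    FsHandle get(String scheme, String authority, String user, String umask) {
        return cache.computeIfAbsent(new Key(scheme, authority, user),
                k -> new FsHandle(umask));
    }

    public static void main(String[] args) {
        FsCacheSketch fsCache = new FsCacheSketch();
        FsHandle first  = fsCache.get("hdfs", "nn:8020", "nifi", "022");
        FsHandle second = fsCache.get("hdfs", "nn:8020", "nifi", "077"); // umask changed
        // Same handle comes back; the new umask is silently ignored.
        System.out.println(first == second); // true
        System.out.println(second.umask);    // 022
    }
}
```

Under this model, stop/start of the processor with a new umask would hand back the stale cached handle, matching the behavior described above.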

If the cache is enabled, it is possible to receive errors such as the one below when a well-timed fs.close() is performed on any cached HDFS file system handle (similar to the issue introduced by KYLO-1499, which attempted to fix the umask setting). The work done under this ticket should ensure that the file system cache is always consulted before using a file handle, so that any closed handle is replaced with a live connection when needed.
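The "always consult the cache" pattern can be sketched in isolation. This is a simplified model, not the Hadoop API: it assumes a get() that notices a closed handle and replaces it, so callers that always go through get() never operate on a handle that some other component closed out from under them.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed fix (assumed semantics, not Hadoop's real cache):
// every use re-consults the cache, and the cache evicts closed handles.
public class ReconnectSketch {
    static class FsHandle {
        boolean closed = false;
        void close() { closed = true; }
    }

    private final Map<String, FsHandle> cache = new HashMap<>();

    // get() replaces a closed handle with a fresh, open one, so a
    // "well-timed fs.close()" elsewhere cannot break later callers.
    FsHandle get(String uri) {
        FsHandle h = cache.get(uri);
        if (h == null || h.closed) {
            h = new FsHandle();
            cache.put(uri, h);
        }
        return h;
    }

    public static void main(String[] args) {
        ReconnectSketch fsCache = new ReconnectSketch();
        FsHandle h1 = fsCache.get("hdfs://nn:8020");
        h1.close();                                  // closed by another component
        FsHandle h2 = fsCache.get("hdfs://nn:8020"); // cache consulted again
        System.out.println(h1 == h2); // false: a fresh, open handle
        System.out.println(h2.closed); // false
    }
}
```

The contrast with the current behavior is that a processor holding h1 across start/stop would keep using the closed handle and fail, whereas re-fetching through get() on each use always yields a usable one.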

It may also be necessary to write a test driver that exercises the effect of changing the umask on cached file handles, to ensure the umask is correctly applied to mkdir() operations.
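The core assertion such a test driver would make is the standard POSIX-style umask arithmetic: the mode applied by mkdir() is the requested mode masked by the complement of the umask. A small self-contained sketch of that check:

```java
// Sketch of the permission arithmetic a umask test driver would assert:
// effective mode = requested & ~umask (POSIX-style masking).
public class UmaskSketch {
    static int applyUmask(int requested, int umask) {
        return requested & ~umask;
    }

    public static void main(String[] args) {
        // Directory requested with 0777 under umask 022 -> 0755
        System.out.println(Integer.toOctalString(applyUmask(0777, 022))); // 755
        // After a stop/start that changes the umask to 077 -> 0700
        System.out.println(Integer.toOctalString(applyUmask(0777, 077))); // 700
    }
}
```

If a stale cached handle keeps the old umask, the second assertion is exactly the one that would fail against a real HDFS mkdir().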

Environment

None

Assignee

Tim Harsch

Reporter

Tim Harsch

Labels

None

Reviewer

None

Story point estimate

None

Time tracking

1h

Sprint

None

Fix versions

Affects versions

Priority

Medium