Need a NiFi processor to identify the source data set for a feed


We need a processor that would be placed at the front of a flow and can be configured to load source data set metadata for use downstream in the flow. Some requirements of the processor might be to:

  • Contain designer-provided configuration settings that drive the feed input configuration stepper

  • Allow the template designer to limit the set of connectors/data sources supported by the feed

  • Retrieve and add data set metadata into the flow file at runtime

  • Allow branching on connector type to support alternate flow routes based on the chosen type of input data set (file, JDBC, etc.)

The branching requirement above will probably be needed because, even though a template might be designed to support a wide range of connector types, certain alternate behavior could be necessary based on the type. For instance, if a file-based data set were chosen as the input, a specific flow route may need to be followed in order to remove the source file as it is processed; something that would not be required if a JDBC data set were chosen.

This branching behavior could be implemented by a separate processor as well, which might well be needed anyway if the alternate routes occur farther down the flow. The processor for this story might extend that branching processor to inherit its behavior. Alternatively, since NiFi already has a RouteOnAttribute processor, this routing feature could be dropped as long as this processor adds the connector type as a flow file attribute and no other information is needed to make the routing decision.
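If routing is delegated to RouteOnAttribute, the essential job of this processor is to stamp a connector-type attribute on the flow file, and the routing decision reduces to a simple mapping from that attribute to a route. A minimal, self-contained sketch of that decision logic follows; the attribute name and route names are hypothetical, and a real implementation would express routes as NiFi relationships rather than strings:

```java
import java.util.Map;

/** Sketch of routing a flow file by its connector-type attribute. */
public class ConnectorTypeRouter {

    // Hypothetical attribute name; the real processor would define its own.
    static final String CONNECTOR_TYPE_ATTR = "catalog.connector.type";

    /** Returns the route a flow file should take based on its attributes. */
    public static String route(Map<String, String> flowFileAttributes) {
        String type = flowFileAttributes.getOrDefault(CONNECTOR_TYPE_ATTR, "unknown");
        switch (type) {
            case "file":
                return "file-cleanup"; // route that removes the source file after processing
            case "jdbc":
                return "default";      // no source cleanup needed for JDBC inputs
            default:
                return "unmatched";    // RouteOnAttribute's "unmatched" relationship
        }
    }
}
```

With this attribute in place, a stock RouteOnAttribute processor configured with expressions over `catalog.connector.type` would make the same decision without any custom routing code.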


Scott Reisdorf
June 28, 2018, 1:32 PM

I vote for simplicity

  • Have a KyloCatalogControllerService (or add to our existing MetadataService) to obtain connection information for the data sets supplied to the feed flow

  • Have a KyloCatalogReadProcessor that:

      • uses the KyloCatalogControllerService to obtain connection info

      • invokes the Spark job to connect to the catalog and read the data

      • has two modes:
        1. optionally write out the result and replace the NiFi flow file content
        2. return the dataframe and Spark session id back as flow file attributes so downstream processors can use them
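A minimal sketch of the two output modes described above, assuming hypothetical attribute names (`kylo.spark.session.id`, `kylo.spark.dataframe.id`); the real processor would operate on NiFi FlowFiles and a ProcessSession rather than plain maps and a StringBuilder:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Sketch of the two output modes a KyloCatalogReadProcessor might support. */
public class CatalogReadModes {

    enum Mode { REPLACE_CONTENT, RETURN_HANDLE }

    /**
     * Applies the chosen mode to a simulated flow file: either replace its
     * content with the read result, or attach handles to the live Spark
     * session/dataframe as attributes for downstream processors.
     */
    public static Map<String, String> applyMode(Mode mode, String sparkSessionId,
                                                String dataFrameId, byte[] readResult,
                                                Map<String, String> attributes,
                                                StringBuilder flowFileContent) {
        Map<String, String> out = new HashMap<>(attributes);
        if (mode == Mode.REPLACE_CONTENT) {
            // Mode 1: the data set itself becomes the flow file content.
            flowFileContent.setLength(0);
            flowFileContent.append(new String(readResult, StandardCharsets.UTF_8));
        } else {
            // Mode 2: only lightweight handles travel with the flow file;
            // attribute names here are assumptions, not an existing Kylo API.
            out.put("kylo.spark.session.id", sparkSessionId);
            out.put("kylo.spark.dataframe.id", dataFrameId);
        }
        return out;
    }
}
```

Mode 2 keeps large data sets out of the NiFi content repository entirely, at the cost of requiring the Spark session to stay alive until downstream processors have consumed the dataframe.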


Sean Felten


Sprint: KYLO 0.11.0 Sprint 3