Large numbers of flow files may share a feedts, which causes the validator to process too many records

Description

As observed with the S3 Data Ingest template, dropping many files at once into the S3 dropzone can send many flow files through the feed simultaneously. During feed initialization it is possible for each of these flow files to be assigned the same feedts, which is granular only to the millisecond. Later, when a flow file enters the Validator step, the validator queries all rows where processing_dttm=${feedts}, which may be many files' worth of records, and this validator step is run once for each flow file.
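A minimal sketch of the collision, assuming feedts is derived from a millisecond-resolution system clock; the class, variable, and table names below are hypothetical and are not taken from Kylo's actual initialization or validator code:

    // Hypothetical illustration only; not Kylo's actual feed-initialization code.
    public class FeedTsCollision {
        public static void main(String[] args) {
            // feedts is assumed to come from a millisecond-resolution clock,
            // assigned per flow file at initialization time.
            long feedTsA = System.currentTimeMillis();  // flow file 1
            long feedTsB = System.currentTimeMillis();  // flow file 2, same drop
            // When many files are queued at once, both calls often return the same
            // value, so both flow files land in the same processing_dttm partition.
            System.out.println("same feedts: " + (feedTsA == feedTsB));

            // The validator then selects every row with that processing_dttm value,
            // i.e. the records of ALL flow files sharing the feedts, and it does so
            // once per flow file. The table name here is hypothetical.
            String validatorQuery =
                "SELECT * FROM feed_table WHERE processing_dttm = '" + feedTsA + "'";
            System.out.println(validatorQuery);
        }
    }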

In our observed case using userdata1, 2, and 3, we saw 8994 records in the valid table; 15 rows were invalid, and the header was not stripped. This suggests that each flow file's validator run re-processed the combined rows of all three files.

In the Data Ingest template, the merge processor mitigates this, since it merges files arriving in quick succession into a single flow file. The merge processor is not used in the S3 Data Ingest template since its flow files do not contain data.
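A rough back-of-the-envelope sketch of why the missing merge step multiplies the work; the row counts are assumptions for illustration, not measurements from the template:

    // Hypothetical arithmetic only; illustrates the scaling, not Kylo internals.
    public class ValidatorWorkload {
        public static void main(String[] args) {
            int files = 3;          // flow files dropped at once (e.g. userdata1-3)
            int rowsPerFile = 1001; // assumed rows per file, including the header

            // All files share one feedts, so one processing_dttm partition holds:
            int rowsSharingFeedts = files * rowsPerFile;              // 3003 rows

            // Without merging, the validator runs once per flow file and scans the
            // whole partition each time; with merging there is a single run.
            int rowsWrittenWithoutMerge = files * rowsSharingFeedts;  // 9009 rows
            int rowsWrittenWithMerge = rowsSharingFeedts;             // 3003 rows

            System.out.println("without merge: " + rowsWrittenWithoutMerge);
            System.out.println("with merge:    " + rowsWrittenWithMerge);
        }
    }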

Environment

None

Assignee

Unassigned

Reporter

Tim Harsch

Labels

Reviewer

None

Story point estimate

None

Components

Affects versions

Priority

Medium