Data overwrite in MergeTable processor

Description

With the "Dedupe and Merge" strategy on a partitioned table, two flow files that carry new data for the same partition of the same feed can run in the MergeTable processor at the same time. When that happens, the flow file that is processed last overwrites the new data added by the one that finished first.

Scenario:

  • create feed with "Dedupe and Merge" strategy and at least one partition

  • run initial ingestion with some data (initData) which will go to partition X

  • prepare 2 files (data1 and data2) to be ingested separately for the same feed

    • should be big enough to make MergeTable processor run for a while

    • should contain non-duplicate data

    • should contain data for partition X

  • stop the MergeTable processor in the re-usable flow and keep it stopped until flow files for both input files are in its queue

  • start MergeTable processor

Once ingestion is done, the final row count in the master table is wrong. Instead of initData + data1 + data2, it is either initData + data1 or initData + data2, depending on which flow file finished last.
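
The row counts above can be reproduced with a small simulation. This is only a sketch, not MergeTable code: it assumes each run reads the current contents of partition X, merges its own rows into that snapshot, and then rewrites the whole partition. The class name, method names, and row values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation of the overwrite; none of these names come from Kylo.
class LostUpdateDemo {

    // Stand-in for the rows loaded into partition X by the initial ingestion.
    static List<String> initData() {
        return new ArrayList<>(Arrays.asList("init1", "init2"));
    }

    // Assumed merge step: keep the snapshot rows and add incoming rows not already present.
    static List<String> dedupeAndMerge(List<String> snapshot, List<String> incoming) {
        List<String> merged = new ArrayList<>(snapshot);
        for (String row : incoming) {
            if (!merged.contains(row)) {
                merged.add(row);
            }
        }
        return merged;
    }

    // Both runs take their snapshot before either one writes: the last write wins.
    static int runInterleaved(List<String> data1, List<String> data2) {
        List<String> partitionX = initData();
        List<String> snapshot1 = new ArrayList<>(partitionX);
        List<String> snapshot2 = new ArrayList<>(partitionX);
        partitionX = dedupeAndMerge(snapshot1, data1); // first flow file finishes
        partitionX = dedupeAndMerge(snapshot2, data2); // second flow file overwrites it
        return partitionX.size();
    }

    // One run at a time: the second run sees the rows written by the first.
    static int runSerialized(List<String> data1, List<String> data2) {
        List<String> partitionX = initData();
        partitionX = dedupeAndMerge(new ArrayList<>(partitionX), data1);
        partitionX = dedupeAndMerge(new ArrayList<>(partitionX), data2);
        return partitionX.size();
    }

    public static void main(String[] args) {
        List<String> data1 = Arrays.asList("a1", "a2");
        List<String> data2 = Arrays.asList("b1", "b2", "b3");
        System.out.println("interleaved: " + runInterleaved(data1, data2));
        System.out.println("serialized:  " + runSerialized(data1, data2));
    }
}
```

With 2 initial rows, 2 rows in data1 and 3 rows in data2, the interleaved run ends with 5 rows (initData + data2) while the serialized run ends with 7 (initData + data1 + data2).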

Possible solution:

  • Add an option to the MergeTable processor that prevents 2 flow files for the same feed from running at the same time (can be based on the merge strategy or a user setting)

Activity

Scott Reisdorf
February 23, 2018, 8:38 PM

The MergeTable processor will now have a 'Blocking id' attribute that can be used to block processing in the processor.
The Blocking Id is an arbitrary attribute/flow file expression (e.g. ${feed} would allow only 1 flow file for a given feed to be processed at a time).
If more than 1 flow file comes in for a given Blocking Id, the processor penalizes the additional flow file and transfers it to the blocked relationship, from where it can (in most cases) retry by going back to MergeTable (see screenshots).

Blocked flow files get a 'blocked.time' attribute that indicates how long they were blocked.
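
A minimal sketch of the gating idea described in this comment, assuming the processor tracks which Blocking Id values are currently in flight. BlockingIdGate, tryAcquire and release are hypothetical names, not the actual Kylo implementation. The penalize and transfer to the blocked relationship step and the 'blocked.time' attribute are left out.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical gate: at most one flow file per Blocking Id value is processed at a time.
class BlockingIdGate {

    // Blocking Id values (for example the evaluated ${feed}) that are currently being merged.
    private final ConcurrentMap<String, Boolean> inFlight = new ConcurrentHashMap<>();

    // Returns true if the caller may merge now, false if the same id is already in flight.
    // putIfAbsent is atomic, so two threads cannot both acquire the same id.
    boolean tryAcquire(String blockingId) {
        return inFlight.putIfAbsent(blockingId, Boolean.TRUE) == null;
    }

    // Must be called when the merge finishes, whether it succeeded or failed.
    void release(String blockingId) {
        inFlight.remove(blockingId);
    }
}
```

In this sketch, a flow file whose tryAcquire call returns false is the one that would be routed to the blocked relationship and retried later. Calling release in a finally block keeps a failed merge from blocking its feed indefinitely.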

Done

Assignee

Scott Reisdorf

Reporter

Robert Hencz

Priority

High