Duplicate data within the same file is mishandled in case of "Dedupe and merge"

Description

As reported by Fabian, if there is a file with an internal duplicate:

key,name,userid
0,paolo,1
1,vanessa,2
1,vanessa,2
2,carlo,3
3,kim,4

when the "Dedupe and merge" strategy is used, the row for "vanessa" never reaches the Hive target table.
This happens because the function "generateMergeNonPartitionQueryWithDedupe" in TableMergeSyncSupport.java applies the check "having count(processing_dttm) = 1".
This check prevents the correct insert of the data, because for "vanessa" the count is 2.
The bug can be fixed by adding a group by to the second part of the query, *union all select " + selectSQL.
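The failing check can be reproduced with a small Python model of the generated merge query (a sketch only, under the assumption that the query unions the incoming batch with the target table, groups by the row values, and filters with the HAVING clause; the function name `rows_to_insert` and the `dedupe_incoming` flag are illustrative, not names from TableMergeSyncSupport.java):

```python
def rows_to_insert(incoming, target, batch_dttm, dedupe_incoming):
    """Model the generated merge query: UNION ALL the incoming batch with
    the target table, group by the row values, and keep only groups that
    satisfy HAVING count(processing_dttm) = 1
    AND min(processing_dttm) = batch_dttm."""
    if dedupe_incoming:
        # The eventual fix: SELECT DISTINCT on the incoming batch first.
        incoming = list(dict.fromkeys(incoming))
    # UNION ALL of (incoming batch, existing target rows with their dttm).
    unioned = [(row, batch_dttm) for row in incoming] + target
    groups = {}
    for row, dttm in unioned:
        groups.setdefault(row, []).append(dttm)
    return [row for row, dttms in groups.items()
            if len(dttms) == 1 and min(dttms) == batch_dttm]

incoming = [("0", "paolo", "1"), ("1", "vanessa", "2"),
            ("1", "vanessa", "2"), ("2", "carlo", "3"), ("3", "kim", "4")]

# Without dedupe, vanessa's group count is 2, so her row is silently
# dropped; only 3 of the 4 distinct rows are inserted.
# With dedupe, all 4 distinct rows pass the HAVING filter.
```
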

Environment

None

Activity

Scott Reisdorf
March 2, 2018, 12:19 AM

This file generates this query:

insert into table `test`.`kylo1462`
select `key`,`name`,`userid`, min(processing_dttm) processing_dttm from (
select distinct `key`,`name`,`userid`,`processing_dttm`
from `test`.`kylo1462_valid` where processing_dttm = "1519949249860"
group by `key`,`name`,`userid`
union all
select `key`,`name`,`userid`,`processing_dttm`
from `test`.`kylo1462`) x
group by `key`,`name`,`userid`
having count(processing_dttm) = 1 and min(processing_dttm) = "1519949249860"

The result is just 4 rows, with Vanessa appearing only once.

Scott Reisdorf
March 2, 2018, 12:23 AM
Edited

It appears the fix resolves this issue: it adds the "distinct" clause to the first query, fixing the duplication.
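The effect of the added "distinct" can be sketched as follows (a simplified Python model, not Kylo's actual code; the row tuples mirror the duplicated "vanessa" record from the sample file):

```python
from collections import Counter

# The duplicated incoming rows for vanessa, as in the sample file.
batch = [("1", "vanessa", "2"), ("1", "vanessa", "2")]

# Without DISTINCT: the duplicate survives into the UNION ALL, the group
# count is 2, and HAVING count(processing_dttm) = 1 drops the row.
count_without_distinct = Counter(batch)[("1", "vanessa", "2")]

# With DISTINCT: duplicates collapse before the UNION ALL, the count is 1,
# and the row passes the HAVING filter.
deduped = list(dict.fromkeys(batch))
count_with_distinct = Counter(deduped)[("1", "vanessa", "2")]
```
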

Done

Assignee

Scott Reisdorf

Reporter

Davide Gazze

Labels

Reviewer

None

Story point estimate

None

Time tracking

0m

Time remaining

7h 30m

Components

Sprint

None

Fix versions

Affects versions

Priority

Medium