As reported by Fabian, if there is a file with an internal duplicate row:
key,name,userid
0,paolo,1
1,vanessa,2
1,vanessa,2
2,carlo,3
3,kim,4
then, when the feed runs with the "Dedupe and merge" strategy, the data for "vanessa" is missing from the target table in Hive.
This happens because the function "generateMergeNonPartitionQueryWithDedupe" in TableMergeSyncSupport.java generates a query containing the check "having count(processing_dttm) = 1".
This check prevents the row from being inserted, because for "vanessa" the count is 2.
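A minimal sketch of why that check drops the duplicated row, using Python's sqlite3 as a stand-in for Hive (an assumption made only so the logic is runnable here; table and column names mirror the example feed):

```python
import sqlite3

# Load the example feed, including the internal duplicate, into an
# in-memory table that plays the role of the _valid table.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE kylo1462_valid (key INT, name TEXT, userid INT, processing_dttm TEXT)"
)
feed = [
    (0, "paolo", 1, "1519949249860"),
    (1, "vanessa", 2, "1519949249860"),
    (1, "vanessa", 2, "1519949249860"),  # internal duplicate
    (2, "carlo", 3, "1519949249860"),
    (3, "kim", 4, "1519949249860"),
]
con.executemany("INSERT INTO kylo1462_valid VALUES (?, ?, ?, ?)", feed)

# Without a dedupe step before the grouping, vanessa's group has
# count(processing_dttm) = 2, so the HAVING clause filters her out.
kept = [
    name
    for (name,) in con.execute(
        """
        SELECT name FROM kylo1462_valid
        GROUP BY key, name, userid
        HAVING count(processing_dttm) = 1
        """
    )
]
print(sorted(kept))  # → ['carlo', 'kim', 'paolo'] — vanessa is missing
```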
This bug can be solved by adding a group by to the second part of the query ("union all select " + selectSQL).
With this change, the file generates this query:
insert into table `test`.`kylo1462`
select `key`,`name`,`userid`, min(processing_dttm) processing_dttm from (
select distinct `key`,`name`,`userid`,`processing_dttm`
from `test`.`kylo1462_valid` where processing_dttm = "1519949249860"
group by `key`,`name`,`userid`
union all
select `key`,`name`,`userid`,`processing_dttm`
from `test`.`kylo1462`) x
group by `key`,`name`,`userid`
having count(processing_dttm) = 1 and min(processing_dttm) = "1519949249860"
The result is just 4 rows, with "vanessa" appearing only once, which appears to fix this issue.
Adding the "distinct" clause to the first query fixes the duplication issue.
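A hedged sketch of the fixed merge query, again with sqlite3 standing in for Hive (an assumption; the sketch keeps the shape of the generated query but uses plain identifiers, and relies on DISTINCT alone in the first branch to collapse the internal duplicate before the UNION ALL):

```python
import sqlite3

# Create stand-ins for the _valid source table and the target table.
con = sqlite3.connect(":memory:")
for t in ("kylo1462_valid", "kylo1462"):
    con.execute(
        f"CREATE TABLE {t} (key INT, name TEXT, userid INT, processing_dttm TEXT)"
    )
feed = [
    (0, "paolo", 1, "1519949249860"),
    (1, "vanessa", 2, "1519949249860"),
    (1, "vanessa", 2, "1519949249860"),  # internal duplicate
    (2, "carlo", 3, "1519949249860"),
    (3, "kim", 4, "1519949249860"),
]
con.executemany("INSERT INTO kylo1462_valid VALUES (?, ?, ?, ?)", feed)

# DISTINCT dedupes the new batch before the union, so every surviving
# group has count(processing_dttm) = 1 and passes the HAVING filter.
merged = con.execute(
    """
    SELECT key, name, userid, min(processing_dttm) AS processing_dttm FROM (
        SELECT DISTINCT key, name, userid, processing_dttm
        FROM kylo1462_valid WHERE processing_dttm = '1519949249860'
        UNION ALL
        SELECT key, name, userid, processing_dttm FROM kylo1462
    ) x
    GROUP BY key, name, userid
    HAVING count(processing_dttm) = 1
       AND min(processing_dttm) = '1519949249860'
    """
).fetchall()
names = [row[1] for row in merged]
print(len(merged), names.count("vanessa"))  # → 4 1 (4 rows, vanessa once)
```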