When using the 'Sync' merge strategy for data ingestion, the final merged table has a different location than the one defined when the Hive tables were initialised.
feed name: simple
Master Root Path property for Hive table initialisation: /hive/master
so the final location for the master table should be /hive/master/testcat/simple, which is the case after the initialisation process has run.
But with the 'Sync' merge strategy the job performs the following steps:
1. creates and populates the synced table testcat.simple_1490699823029 with location /hive/master/testcat/simple_1490699823029
2. deletes the original master table
3. renames testcat.simple_1490699823029 to testcat.simple, which triggers moving the data from /hive/master/testcat/simple_1490699823029 to /apps/hive/warehouse/testcat.db/simple (with /apps/hive/warehouse being the default location for the Hive warehouse)
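To make the mismatch concrete, here is a minimal sketch that only rebuilds the two paths from the values in this report. The function names are hypothetical and not part of Kylo; the path layout is taken from the steps above.

```python
import posixpath


def configured_location(master_root: str, category: str, feed: str) -> str:
    """Location set when the Hive tables are initialised (hypothetical helper)."""
    return posixpath.join(master_root, category, feed)


def location_after_rename(warehouse_root: str, category: str, feed: str) -> str:
    """Location observed in this report after the rename (hypothetical helper)."""
    return posixpath.join(warehouse_root, f"{category}.db", feed)


expected = configured_location("/hive/master", "testcat", "simple")
actual = location_after_rename("/apps/hive/warehouse", "testcat", "simple")

print(expected)            # /hive/master/testcat/simple
print(actual)              # /apps/hive/warehouse/testcat.db/simple
print(expected == actual)  # False
```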
Not sure if it's related, but when using 'Dedupe and merge', MergeTable also requires read and write permissions on '/apps/hive/warehouse', which is not possible in most production environments.
The read requirement is caused by a buggy Hive describe, but I don't know why write access is needed.
I can create another JIRA ticket if it's not related.
This is fixed in Hive 2.2.0:
The workaround is to mark the table as external when it's created, although this will require a code change in Kylo.
If I understand correctly, this issue is related to a Hive bug. I think it is possible to resolve it by changing the order of the operations:
1. Extract the location
2. Rename the target table (this changes its location)
3. Create the new table with the correct name
4. Populate the new table
5. Drop the sync table. Since it is a managed table, this will also drop the old data
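The reordered steps above could be sketched roughly as follows. This is not Kylo's code: the function, the `_old` suffix and the source table name are hypothetical, the table is assumed to be unpartitioned for simplicity, and the location is assumed to have been read beforehand (for example from the output of DESCRIBE FORMATTED).

```python
def reordered_sync_statements(db, table, location, source, suffix="_old"):
    """Build HiveQL for the reordered sync (hypothetical helper).

    `location` is the path extracted in step 1, before any rename.
    `source` is whatever table the new data is read from (assumption).
    """
    target = f"{db}.{table}"
    old = f"{db}.{table}{suffix}"
    return [
        # Step 2: rename the current target table out of the way.
        f"ALTER TABLE {target} RENAME TO {old}",
        # Step 3: recreate the target under the correct name and location.
        f"CREATE TABLE {target} LIKE {old} LOCATION '{location}'",
        # Step 4: populate the new table.
        f"INSERT INTO TABLE {target} SELECT * FROM {source}",
        # Step 5: drop the renamed table together with its old data.
        f"DROP TABLE {old}",
    ]


for stmt in reordered_sync_statements(
    "testcat", "simple", "/hive/master/testcat/simple", "testcat.simple_valid"
):
    print(stmt + ";")
```

Because the target table is created with an explicit LOCATION and never renamed afterwards, it should not depend on the rename behaviour described in this ticket.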
Alternatively:
1. Optional: create a copy of the old table
2. Truncate the target table
3. Populate the new table
4. Optional: delete the copy of the old table
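A rough sketch of this truncate-based variant, again only building the statements. The helper, the `_backup` suffix and the source table name are hypothetical. Note that Hive only allows TRUNCATE TABLE on managed tables.

```python
def truncate_sync_statements(db, table, source, keep_backup=True):
    """Build HiveQL for the truncate-and-reload variant (hypothetical helper)."""
    target = f"{db}.{table}"
    backup = f"{db}.{table}_backup"
    stmts = []
    if keep_backup:
        # Optional step 1: keep a copy of the old data.
        stmts.append(f"CREATE TABLE {backup} AS SELECT * FROM {target}")
    # Step 2: empty the target table; its location is left untouched.
    stmts.append(f"TRUNCATE TABLE {target}")
    # Step 3: load the new data into the same table.
    stmts.append(f"INSERT INTO TABLE {target} SELECT * FROM {source}")
    if keep_backup:
        # Optional step 4: remove the copy again.
        stmts.append(f"DROP TABLE {backup}")
    return stmts


for stmt in truncate_sync_statements("testcat", "simple", "testcat.simple_valid"):
    print(stmt + ";")
```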
Both solutions increase the time during which the table is empty.
1. Mark the table as external
2. Follow the same steps
3. Unmark the table as external
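One possible reading of these three steps, sketched as generated HiveQL. The helper is hypothetical, and "the same steps" is interpreted here as the existing rename of the synced table. Whether the rename then leaves the data where it is depends on the Hive version, so this would need to be verified before relying on it.

```python
def external_rename_statements(db, sync_table, table):
    """Build HiveQL that toggles the EXTERNAL flag around the rename (hypothetical helper)."""
    src = f"{db}.{sync_table}"
    dst = f"{db}.{table}"
    return [
        # Step 1: mark the synced table as external.
        f"ALTER TABLE {src} SET TBLPROPERTIES('EXTERNAL'='TRUE')",
        # Step 2: the existing rename step of the Sync strategy.
        f"ALTER TABLE {src} RENAME TO {dst}",
        # Step 3: turn the renamed table back into a managed table.
        f"ALTER TABLE {dst} SET TBLPROPERTIES('EXTERNAL'='FALSE')",
    ]


for stmt in external_rename_statements("testcat", "simple_1490699823029", "simple"):
    print(stmt + ";")
```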