'Sync' merge strategy moves location for Hive's master table (PR #83)

Description

When the 'Sync' merge strategy is used for data ingestion, the final merged table ends up in a different location from the one defined when the Hive tables were initialised.

Example:

  • category: testcat

  • feed name: simple

  • Master Root Path property for Hive table initialisation: /hive/master
    The final location of the master table should therefore be /hive/master/testcat/simple, which is the case after the initialisation process has run.

With the 'Sync' merge strategy, however, the job performs the following steps:

  • creates and populates synced table testcat.simple_1490699823029 with location /hive/master/testcat/simple_1490699823029

  • deletes original master table

  • renames testcat.simple_1490699823029 to testcat.simple, which triggers moving the data from /hive/master/testcat/simple_1490699823029 to /apps/hive/warehouse/testcat.db/simple (/apps/hive/warehouse being the default location of the Hive warehouse)
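
The steps above can be sketched as the HiveQL the job effectively issues. This is only an illustration under stated assumptions, not Kylo's actual code: the helper names, the CREATE TABLE ... LIKE form, and the source table testcat.simple_valid are hypothetical; only the table names, the timestamp suffix, and the paths come from the example in this ticket.

```python
def sync_merge_statements(category, feed, master_root, source_table, timestamp):
    """Return (statements, sync_location) for a hypothetical 'Sync' merge."""
    master = f"{category}.{feed}"
    sync = f"{category}.{feed}_{timestamp}"
    sync_location = f"{master_root}/{category}/{feed}_{timestamp}"
    statements = [
        # Step 1: create and populate the timestamped sync table.
        f"CREATE TABLE {sync} LIKE {master} LOCATION '{sync_location}'",
        f"INSERT INTO TABLE {sync} SELECT * FROM {source_table}",
        # Step 2: delete the original master table.
        f"DROP TABLE {master}",
        # Step 3: rename the sync table; on affected Hive versions this
        # also moves the data of a managed table.
        f"ALTER TABLE {sync} RENAME TO {master}",
    ]
    return statements, sync_location


def location_after_rename(category, feed, warehouse_root="/apps/hive/warehouse"):
    """Location reported in this ticket for the renamed managed table."""
    return f"{warehouse_root}/{category}.db/{feed}"


if __name__ == "__main__":
    stmts, _ = sync_merge_statements(
        "testcat", "simple", "/hive/master", "testcat.simple_valid", 1490699823029
    )
    for stmt in stmts:
        print(stmt + ";")
    print("-- expected location: /hive/master/testcat/simple")
    print("-- observed location:", location_after_rename("testcat", "simple"))
```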

Environment

None

Activity

Claudiu Stanciu
February 5, 2018, 2:57 PM

Not sure if it's related, but when using 'Dedupe and merge', MergeTable also requires read and write permissions on '/apps/hive/warehouse', which is not possible in most production environments.
The read requirement is caused by a buggy Hive describe, but I don't know why write access is needed.
I can create another Jira ticket if it's not related.
Thanks

Greg Hart
February 16, 2018, 9:03 PM

This is fixed in Hive 2.2.0:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RenameTable

The workaround is to mark the table as external when it's created, although this will require a code change in Kylo.
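
A minimal sketch of that workaround, assuming the sync table is created with CREATE TABLE ... LIKE (the ticket does not show the DDL Kylo actually uses). The helper name is hypothetical; the table name and path are the ones from the description.

```python
def create_external_sync_table(sync_table, master_table, location):
    """Build DDL that creates the sync table as EXTERNAL at a fixed location.

    Hive does not move the data of an external table when it is renamed,
    so the files stay under the given location.
    """
    return (
        f"CREATE EXTERNAL TABLE {sync_table} LIKE {master_table} "
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    print(
        create_external_sync_table(
            "testcat.simple_1490699823029",
            "testcat.simple",
            "/hive/master/testcat/simple_1490699823029",
        )
    )
```

Note that dropping an external table removes only its metadata, so with this approach the files of the previous master table would have to be cleaned up separately.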

Davide Gazze
April 26, 2018, 4:06 PM
Edited

If I understand correctly, this issue is caused by a Hive bug. I think it can be resolved by changing the order of the operations:
WAY 1
1. Extract the location
2. Rename the target table (this changes its location)
3. Create the new table with the correct name
4. Populate the new table
5. Drop the sync table. Since it is a managed table, dropping it also deletes the old data

WAY 2
1. Optional: create a copy of the old table
2. Truncate the target table
3. Populate the new table
4. Optional: delete the copy of the old table
Both solutions increase the time during which the table is empty
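
WAY 2 could look roughly like the statements below. This is a sketch only: it assumes a non-partitioned managed table (TRUNCATE TABLE is rejected for external tables), and the helper name, the source table, and the backup table name are hypothetical.

```python
def truncate_and_reload(master_table, source_table, backup_table=None):
    """Build the HiveQL for WAY 2: truncate the master table and repopulate it."""
    statements = []
    if backup_table:
        # Optional step 1: keep a copy of the old data.
        statements.append(
            f"CREATE TABLE {backup_table} AS SELECT * FROM {master_table}"
        )
    # Step 2: empty the master table in place; its location is unchanged.
    statements.append(f"TRUNCATE TABLE {master_table}")
    # Step 3: load the new data.
    statements.append(
        f"INSERT INTO TABLE {master_table} SELECT * FROM {source_table}"
    )
    if backup_table:
        # Optional step 4: remove the copy once the load has succeeded.
        statements.append(f"DROP TABLE {backup_table}")
    return statements


if __name__ == "__main__":
    for stmt in truncate_and_reload(
        "testcat.simple", "testcat.simple_valid", "testcat.simple_backup"
    ):
        print(stmt + ";")
```

Because no rename takes place, the master table keeps the location it was given at initialisation; the cost is the window between the truncate and the end of the insert.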

WAY 3
1. Mark the table as external
2. Follow the same steps as the current process
3. Unmark the table as external
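
WAY 3 could be sketched as follows. The EXTERNAL table property is what Hive uses to switch a table between managed and external; the helper name is hypothetical, and the sketch covers only the rename, not the rest of the merge.

```python
def rename_without_moving_data(sync_table, master_table):
    """Build the HiveQL for WAY 3: rename while the table is marked external."""
    return [
        # 1. Mark the table as external so the rename leaves the files alone.
        f"ALTER TABLE {sync_table} SET TBLPROPERTIES('EXTERNAL'='TRUE')",
        # 2. Rename as the current process does.
        f"ALTER TABLE {sync_table} RENAME TO {master_table}",
        # 3. Turn the table back into a managed table.
        f"ALTER TABLE {master_table} SET TBLPROPERTIES('EXTERNAL'='FALSE')",
    ]


if __name__ == "__main__":
    for stmt in rename_without_moving_data(
        "testcat.simple_1490699823029", "testcat.simple"
    ):
        print(stmt + ";")
```

One caveat with this approach: since the rename no longer moves anything, the data would stay in the timestamped directory (for example /hive/master/testcat/simple_1490699823029) rather than in /hive/master/testcat/simple, so the location remains under the master root but still differs from the one set at initialisation.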

Done

Assignee: Greg Hart
Reporter: Robert Hencz
Labels:
Reviewer: None
Story point estimate: None
Time tracking: 0m
Time remaining: 16h
Components:
Sprint: None
Fix versions:
Affects versions:
Priority: Medium