Support running Scala scripts for transformations in Data Wrangler

Description

As a user with many transformations (30+) to apply to my data table, I would like to apply a list of transformations in one batch in the Visual Query / Data Transformation step, to save time.

Applying 20 transformations takes around 10-20 minutes before a result comes back. Creating several feeds with many transformations is very time-consuming, unless we modify the feed files and re-import the feed.

To facilitate this, we could:

  • write multiple rows in the transformation text field, where each row is interpreted as a separate transformation, or offer something similar to Edit SQL for a transformation/visual query

  • upload a file containing multiple rows with transformations

If one transformation fails, we could either fail the whole batch or simply alert the user to which one could not be applied.
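
A minimal sketch of the two failure-handling options, in plain Scala with no Spark or Kylo dependency. The names `Step`, `StepError`, `BatchResult`, `applyAllOrFail` and `applyAllAndReport`, and the modelling of a row as a `Map[String, String]`, are hypothetical and only for illustration; they are not Kylo APIs.

```scala
import scala.util.{Failure, Success, Try}

object BatchTransform {
  // Assumption for illustration: a row is a column-name -> value map.
  type Row = Map[String, String]
  type Table = Seq[Row]

  // One named transformation; the name is what gets reported back on failure.
  final case class Step(name: String, run: Table => Table)

  // Which step failed, and why.
  final case class StepError(step: String, message: String)

  // Result of the "alert" mode: the table after all successful steps, plus the failures.
  final case class BatchResult(table: Table, failures: Seq[StepError])

  private def describe(e: Throwable): String =
    Option(e.getMessage).getOrElse(e.toString)

  // "Fail all" mode: stop at the first failing step and return its error.
  def applyAllOrFail(table: Table, steps: Seq[Step]): Either[StepError, Table] =
    steps.foldLeft[Either[StepError, Table]](Right(table)) { (acc, step) =>
      acc.flatMap { current =>
        Try(step.run(current)) match {
          case Success(next) => Right(next)
          case Failure(e)    => Left(StepError(step.name, describe(e)))
        }
      }
    }

  // "Alert" mode: skip a failing step, continue with the rest, and collect what failed.
  def applyAllAndReport(table: Table, steps: Seq[Step]): BatchResult =
    steps.foldLeft(BatchResult(table, Vector.empty)) { (acc, step) =>
      Try(step.run(acc.table)) match {
        case Success(next) => acc.copy(table = next)
        case Failure(e)    => acc.copy(failures = acc.failures :+ StepError(step.name, describe(e)))
      }
    }

  def main(args: Array[String]): Unit = {
    val table: Table = List(Map("name" -> " alice ", "age" -> "30"))
    val steps = Seq(
      Step("trim name", _.map(r => r.updated("name", r("name").trim))),
      // Fails on purpose: the table has no "city" column.
      Step("upper city", _.map(r => r.updated("city", r("city").toUpperCase))),
      Step("rename age", _.map(r => (r - "age") + ("years" -> r("age"))))
    )
    println(applyAllOrFail(table, steps))
    println(applyAllAndReport(table, steps))
  }
}
```

In this sketch the table must be a strict collection (such as a `List`) so that an error surfaces inside the failing step rather than later. The "alert" mode keeps the output of the steps that did succeed, which only makes sense when later steps do not depend on the one that failed.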

An example of how this is handled in Google Refine:
https://youtu.be/cO8NVCs_Ba0?t=7m21s

Activity

Claudiu Stanciu
October 29, 2017, 2:46 PM
Edited

For an analyst, it could help to simply apply the transformation and get back the schema without the output rows. I guess this could be a toggle (with/without output rows), so the schema is rendered as soon as possible.

For a developer, the schema and how the result looks won't matter that much, since the transformation was most likely tested beforehand. What is needed is to apply several transformations as quickly as possible, without going back and forth between the list of transformations and the actions to do.
For me, an advanced edit textbox (similar to Edit SQL), where I can apply all my transformations, would be more beneficial for setting up my feeds. Speed is of the essence for the project setup.

Hope it makes sense

Greg Hart
October 30, 2017, 6:58 PM

It sounds like a different template may be better suited to your needs.

  1. Would it be easier to enter a Scala script into a textbox? You wouldn't get the Build Query or Transform Data steps, but it doesn't sound like you need them.

  2. What about creating your transformation once, exporting the feed JSON document, customizing it for the next feed, and then importing the new feed?

Claudiu Stanciu
October 30, 2017, 11:47 PM


1. A Scala textbox could be nice as a way to bypass the Build Query and Transform Data steps, but for me at least, the Build Query step is very powerful and fast (most of the time), while the Transform Data step is quite slow. The Scala box also requires more knowledge. For me, the best case would be a multiline textbox for the Transform Data step, which might also be more beneficial for a less experienced analyst user.
2. That sounds like something we can already do: modify the feed configuration after export. We can also disable the feed and then re-import it after modifying the transformations, but it's tedious, even for a developer.

Maybe we can bring some other opinions into the discussion via the Kylo Slack, to improve the user experience!

Claudiu Stanciu
November 3, 2017, 7:57 AM

I've been thinking that the Scala box might be the winner here, as you proposed first. If I think not only of a couple of transformations but of real Spark code that Data Scientists can use, then the Scala box might be better suited in the end, and it might be much easier to implement.

Tim Harsch
March 29, 2018, 9:40 PM

Changed title to reflect the outcome of the discussion. will represent batch transformations story.

Assignee

Greg Hart

Reporter

Claudiu Stanciu

Labels

Reviewer

None

Priority

Medium