Improving Transformation Performance by Launching Several Copies of a Step

June 13, 2011

There are many ways to improve the performance of a transformation; one of them is to launch several copies of a step. If you are running Pentaho Data Integration on a machine with multiple processors/cores, you can leverage all of them within a single transformation. Let’s take a look at a simple transformation that is made up of three steps:

 

  1. Generate Rows Step: This step generates 10,000,000 rows of data made up of two columns, A and B.
  2. Modified Java Script Value Step: This step creates a new column C by adding the values of columns A and B (a minimal sketch of the script follows this list).
  3. Dummy Step: This step receives the input from the previous step but does nothing; it is simply a placeholder for testing purposes.
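
For illustration, the script inside the Modified Java Script Value step can be a single line. A minimal sketch, assuming the field names A, B, and C from the description above (the original post does not show the exact script):

    // Modified Java Script Value step: runs once per incoming row.
    // A and B are the two fields produced by the Generate Rows step.
    // C must also be declared as an output field in the step's Fields grid.
    var C = A + B;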

 

Transformation Flow

 

As these 10 million rows flow through the transformation, Step #2 (the Modified Java Script Value step) has to perform that calculation on all 10 million rows by itself. Because a single step can only leverage a single processor, this step becomes the bottleneck of the transformation. When we run the transformation above, it completes in 21.5s.

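If you want to reproduce this timing outside of Spoon, the transformation can also be run from the command line with Pan, PDI’s transformation runner. A sketch, assuming the transformation is saved as add_columns.ktr (a hypothetical file name):

    # Run the transformation with Pan (use Pan.bat on Windows)
    sh pan.sh -file=add_columns.ktr -level=Minimal
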
Additionally, we can use Windows Resource Monitor to view the CPU utilization during the execution of the transformation. Notice in the graph below that the transformation is consuming cycles on CPU 0, while CPU 1 still has processing power to offer:

Now, let’s modify our transformation and add a duplicate of Step #2 (the Modified Java Script Value step). There are two options when creating a duplicate step and directing the flow of data to it (see the XML sketch after this list):

  1. Copy: All rows are sent to every destination step.
  2. Distribute: Rows are dealt out in turn, round-robin, to each destination step.
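
Both settings end up in the transformation’s .ktr XML, stored per step. A minimal sketch, assuming the usual Kettle file layout (treat the exact element names as assumptions):

    <step>
      <name>Modified Java Script Value</name>
      <type>ScriptValueMod</type>
      <!-- Y = distribute rows round-robin across outgoing hops; N = copy every row to every hop -->
      <distribute>Y</distribute>
      <!-- launch this many copies of the step in parallel -->
      <copies>2</copies>
      <!-- remaining step settings omitted -->
    </step>

In Spoon, the same copies setting is exposed by right-clicking a step and choosing “Change number of copies to start”, which achieves the effect described below without maintaining a hand-made duplicate.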

 

 

Now that we have created our duplicate step and the data flow is being distributed to both steps, we can connect that duplicate step to the Dummy step. The result is a new transformation with increased throughput at the former bottleneck, as seen below:

When we run this transformation, we will notice three changes:

  1. Data is distributed to the newly created duplicate step, so each step processes 5 million rows instead of one step processing all 10 million.
  2. Execution time drops from 21.5s to 16.4s, a roughly 24% reduction ((21.5 - 16.4) / 21.5 ≈ 0.24).
  3. Multiple cores are utilized, as shown in the graph below.

 
