6,438 total views, 2 views today
Why Incremental Load?
Is your BI Application hitting Database for large volumes of data? If so, is it happening frequently? As BI applications are expected to deal with larger data sets, there is a need for frequent refreshes to get the updated data. In both the situations, it is not efficient to load all the data historically every time, just to get the latest updated records on timely basis. This is where the ‘Increment Load’ concept plays a tremendous role to make BI application more efficient.
What exactly the Incremental Load do?
The answers for all the above questions is with ‘Incremental Load’. An Incremental load improves the performance of loading process by pulling out only new and updated records, instead of the entire data and appends it to the existing data set(QVD). Simply put, Incremental load increments the data on existing table/QVD with newly updated records at each refresh. This way, it improves the load process 100 times better than the traditional loads.
How exactly Incremental load works?
Let us go in detail by putting it into real work. Below are the workflow steps for the implementation of the same.
- Without the Incremental Load, you need to load the complete data. Whenever you need to update the new records, you need to load the complete data again, which takes lot of time to load and store it in the local drive (QVD). With Incremental Load, you can load only the new/updated records.
- From the QVW, find the last updated record date in a table.
- Based on the last modified date, connect to the data source and pull the newly added records, which are beyond the last modified date. This can be done by using ‘where’ clause of load script.
- Append the newly updated records to the existing table locally to get the live data.
- Supplement the BI Application with the incremented table.
Let us go through the same with an example:
Real Time Demonstration of Incremental Load
Assume that we have Sales data holding 10 Mn records, and assume that it gets adds at least 1,000 records on a day. Now, ensure that we are implementing Incremental load in such a way that, we connect the data source for pulling only 1000 records by eliminating the process of retrieving 10 Mn records each day..
There are 3 different methods of implementing the incremental load.
Insert and Update
Insert, Update and Delete.
Let us implement these 3 methods with a live example.
Note: Following implementation can be applied to both QlikView and QlikSense applications as both tools originated from the same technology.
1. Insert Only
Let us consider a Restaurant daily sales data where customers purchase food items on a daily basis. For each transaction, we have Product ID field, which is a unique value, and Item field shows the item that the customer is purchasing. Sales field for the transaction amount, Store Name field indicates the shop name and Date Modified field shows the day, the transaction happened.
Assume that, as of that day, we have the data till 10-Jan-2017. The same data we have in a QVD locally (consider above first table for the same). Imagine a situation that there are two new records added in the database after the date 10-Jan-2017. Look at the second table above, and assume that the yellow highlighted records are added on 15-Jan-2017. These two new records should be appended to our existing QVD. Before fetching the newly added records in the database we need to know, upto which day our existing QVD has data. Based on that date, we can target the Database to fetch the new records that should merge into the QVD. The next question is how to get a Recent Updated record date that our existing QVD has? Answer is by loading the existing QVD and using Peek() function we can fetch the same.
Below is the script shows the logic for fetching Recent Modified Date.
Peek() function points the Nth row value of a field from a particular table based on a load order. If we pass a number ‘0’ then it fetches the 1st loaded value of a provided field in load order, If we pass ‘1’ then it fetches 2nd loaded value and so on. Note that if our table has N number of records and wants to get a last loaded value, in this case the exact index number cannot be defined. In this case -1 is an index number to fetch last loaded record.
Syntax: Peek(‘[Field Name]’, Row Number,’[TableName]’)
As we are fetching the latest Modified Date using Peek() function, we must need to store the value in to a variable. Here vLastUpdatedDate is our variable.
With this, we get the latest modified date, we connect to database and fetch the records where the Date Modified field has greater the value of vLastUpdatedDate variable. It can be done by writing simple ‘where’ clause condition (where [Date Modified]> vLastUpdatedDate) at the end of table load script.
From the above demonstration, we got the last updated date into vLastUpdatedDate variable from the existing QVD. The immediate step is to fetch the incremented records using this variable. This can be done by simple ‘where’ clause condition.
To summarize the steps that we have followed in the above script.
- Identify and fetch the recent updated records using ‘where’ clause.
- Merge newly added records to existing QVD/Table.
- Replace the old QVD/Table with Incremental Load.
Below snapshot is the script to fetch the incremented records and concatenating these records to the original QVD.
From the above script, Our RestaurantSales.qvd, got recently updated records. Let us load the same QVD in a separate QVW file to validate (assume that we are loading the table into final BI Application which has all the records).
If we look at the green marked rows, these are newly added records showing recently updated date i.e Jan/15/2017. But with the ‘Insert Only’ approach, we got duplicate records as shown in red mark. It is because of not checking the condition that whether the existing records has same data as the incremental records has. This situation occurs when the existing record gets updated recently in the database. Finally, in some situations, above scripts bring the duplicate records in to the final table. Hence this approach need to be reconciled to have more optimal solution. To overcome the same, we need to move on to ‘Insert with Update’ records method.
2. Insert and Update the records
This approach eliminates the duplicate records from the incremental table. For the same, we need to have a unique Key field to identify each row distinctly. In the above example of data [Product ID] is the unique field. If we look at the output table we have two duplicate records with the same Product ID = ‘SDUSA131’, Sales=245, etc., which is indicating that this product has updated on Jan-15-2017, hence we no more need to maintain the record with Date Dec-14-2015 as a Modified Date. This can be done by checking a condition on Existing table before concatenating to the Incremental records table.
And, assume that the record with Product ID = SDUSA139 sales amount is updated with sales amount 100 today (Consider today’s date is 15-Jan-2017). And we had the same record with old sales amount i.e. 60. For this scenario, our incremental load has to eliminate the record with sales amount 60 from the existing table and need to update to 100 with the Date Modified as 15-Jan-2017.
- It eliminates the duplicate records (Redundant records)
- It updates the old records with the updated data.
Below is the modified script to overcome this problem…
Load the updated QVD again in separate QVW, then the proper output table would be as shown below.
Now, we eliminated duplicate records (outdated records), Updated the records with modified data in Sales, and got the newly added records properly into the final table.
The next question is, if any of the old records got deleted recently in the database? Still the above script doesn’t handle the deletion of old records. Now, our incremental script should be able to handle all areas at a time, by including the deletion of old records which no more exists in the database.
3. Insert, Update and Delete.
Assume that, Record with Product ID=SDUSA125 got deleted on 15-Jan-2017 in original data source table, but which was retained in our existing incremental table, and we need to delete the same from the existing table, before concatenating with the incremental records.
This can be solved by doing inner join between Incremental table and original data source table. Here, we need to load only Product ID field from the original data set and inner join this, with the existing table to not to load the deleted records from the existing table. In real world, deletion of records from the database will not happen. However, to have an effective incremental engine, we also need to check this property in some cases.
Below is the updated script solving all the combination of scenarios.
Below is the final output.
This is the real time demonstration of Incremental load in QlikView.
Incremental Load Benefits:
- Brings efficient load at any instance by eliminating full load of data.
- It reduces the waiting time by 100 times compared to the traditional model to fetch full data.
- Reduces the traffic burden on Database.
- Eliminates the burden on Data source drivers.
- Incremental load reduces the burden on RAM.
- It acts like a JIT (Just-In-Time) engine in Data Extraction layer which fetches the data instantly.
- It uses QVD file formatted tables, which compresses the data by large extent.
Data Localization: Means that the incremental load used recently added data and appends to the recently incremented table which leads to access of data, always local to the BI application.
Developer Qlik View, Qlik Sense and Tableau
Suman is working as a Developer (QlikView, QlikSense and Tableau) for the past four years, involved in development of Business Intelligence (BI) solutions. Responsible for all activities related to the development, implementation, administration and support of ETL processes for large scale data warehouses and Performance optimization of the BI applications. He has expertise in complex data modelling, designing the scorecard dashboards, Ad-hoc reporting and building KPI’s. In the past two years, he has worked in several domains viz., Financial, Sales, Transportation, Insurance, Media, Healthcare, Telecom, Hospitality, Retail and Crime Analytics.