How To Test Your Data Warehouse

To start with, we need a Test schedule. The same is created in the process of developing the Test plan. In this schedule, we have to estimate the time required for testing of the entire Data Warehouse system. There are different methodologies available to create a Test schedule. None of them are perfect because the data warehouse ecosystem is very complex and large, also constantly evolving in nature. The most important takeaway from this article is that DW testing is data centric, while software testing is code centric. The connections between the DW components are groups of transformations that take place over data. These transformation processes should be tested as well, to ensure data quality preservation. The DW testing and validation techniques I introduce here are broken into four well defined processes, namely:

1. Integration testing
2. System testing
3. Data validation
4. Acceptance testing

Thus these types of testing are again divided in the following sub-phases:

Understanding of Requirements
Development of Test Plan and Test Design
Preparation of Test Cases
Execution of Test Cases
Reporting of the Test results

As you can see, the first major phase is the Requirements Definition. Data used in a Data Warehouse originates from multiple source systems. Therefore, defining good requirements can be extremely challenging. Successful requirements are those structured closely to business rules and in the same time address functionality and performance. These business rules and requirements provide a solid foundation to the data architects. Requirements are one of the keys to success.

The whole team should share the same tools from the project toolbox. It’s important that everyone works from the same set of documents and requirements (i.e. version of files). We have to create a rich set of test data, avoiding combinational explosion of sets. The automation approach should be a combination of Task automation and Test automation. An alternative strategy is to use the source systems to create the artificial test data. That way a realistic but artificial source data is created. The testers should be skilled as well.

Working with the concept of the Prototype model, the testing activities can be summarized in the form of the following checklist:

1. Multidimensional Schema

Workload Test
Hierarchy Test
Conformity Test
Nomenclature check
Performance Test
Early loading Test
Security Test
Maintainability Test

2. ETL Procedures

Code Test
Integrity Test
Integration Test
Administrability Test
Performance/Stress Test
Recovery Test
Security Test
Maintainability Test

3. Physical Schema

Performance/Stress Test
Recovery Test
Security Test

4. Front-end

Balancing Test
Usability Test
Performance/Stress Test
Security Test

In order to achieve our testing goals, we could focus on the following two approaches:

1. Manual Sampling – The testing activities covered under this approach are:
End-to-End Testing – It checks that data is properly loaded into the systems from which the data warehouse will extract data to generate Reports.
Row count Testing – To avoid any loss of data, all rows of data are counted after the ETL process to ensure that all the data is properly loaded.
Field size Testing – It checks that the data warehouse field should be bigger than the data field for the data being loaded. If it is not checked it will lead to data truncation.
Sampling – The sample used for testing must be a good representation of whole data.

2. Reporting – The testing activities covered under this approach are:
Report Testing – The reports are checked to see that the data displayed in the reports are correct and can be used for decision-making purpose.
Next major part of our testing is to choose a Test fixture strategy. I prefer to use Fresh fixture – ensuring that tests do not depend on anything they did not set up themselves. Combined with Lazy (Initialization) setup and Shared fixture (partitioning the fixture required by tests into two logical parts). That is assuming that we are happy with the idea of creating the test fixture the first time any test needs it, we can use Lazy Setup in the setUp() method of the corresponding Testcase Class to create it as part of running the first test. Subsequent tests will then see that the fixture already exists and reuse it.

The first part is the stuff every test needs to have present but is never modified by any tests—that is, the Immutable Shared Fixture. The second part is the objects that any test needs to modify or
delete; these objects should be built by each test as Fresh Fixtures. Most commonly, the Immutable Shared Fixture consists of reference data that is needed by the actual per-test fixtures. The per-test fixtures can then be built as Fresh Fixtures on top of the Immutable Shared Fixture). We also have the option called Minimal Fixture (i.e. use of smallest and simplest fixture possible for each test). Aiming at full post I have to mention the last option I would use, although it’s considered as Anti-pattern, is called Chained Tests (i.e. let the other tests in a test suite to setup the test fixture).

It is important to consider the Pesticide Paradox – often test automation suffers from static test data. Without variation in the data or execution path, bugs are located with decreasing frequency. Automation should be designed so that a test-case writer can provide a common equivalence class or data type as a parameter, and the automation framework will use the data randomly within that class and apply it for each test run. The automation framework should record the seed values, or actual data that was used, to allow reruns and retesting with the same data for debugging purposes.
It can utilize a TestFixtureRegistry via dedicated table – it’ll be able to expose various parts of a fixture, needed for suites, via discrete fixture holding class variables or via Finder Methods. Finder Methods helps us avoid hard-coded values in DB lookups in order to access the fixture objects. Those methods are very similar to those of Creation Methods, but they return references to existing fixture objects rather than building brand new ones. We should make those immutable.

NOTE: usage of Intent-Revealing Names for the fixture objects should be enforced in order to support the framework’s lookup functionality and better readability. To keep the current design the following implementation can be used – check if such registry already exists and remove it;

NOTE: in consideration must be taken the Data Sensitivity and Context Sensitivity of the tests that’ll rely on this module.

NOTE: take into consideration the Data Sensitivity and Context Sensitivity of the tests that’ll rely on this module.

Test Approach Phases

The suggested test phases are based on the development schedule per project, along with the need to comply with data requirements that need to be in place when the new DWH goes live.

Phase 1: Business processes

Data Validation
Performance Test
Functional Test
Data Warehouse (internal testing within ETL validating data stage jobs)

Data validation should start early in the test process and be completed before Phase 2 testing begins. Some data validation testing should occur in the remaining test phases, but to a much lesser extent.

Important business processes where performance is important should be identified and tested (when available) in the Phase 1. Performance testing should be continued in the later test phases as the application will be continuously enhanced throughout the project. In addition to Phase 1 testing, there will also be unit and functional testing. As unit testing is completed for a program, the tester will perform functional tests on the program. While functional testing takes place with one program, the developer continues with redeveloping and unit testing the next program.

Towards the end of Phase 1, the data warehouse team will be testing the data stage jobs. Unit testing should be completed first, then functional testing finishing a couple weeks afterwards. A final formal test will cap the end of Phase 1 testing.

Phase 2: Performance tests

Cross-functional process
Security Test
Data Warehouse (Repository testing and validation)

In addition to the previous tests, Phase 2 should also cover remaining test items that may not been tested in Phase 1 such as:

Phase 2 testing will be important because it is the final testing opportunity that the functional area testers will have to make sure the DW load works as expected before moving to regression testing in Phase 3. Some performance tests and data validation should be included in this phase.

Phase 3: Regression tests

Phase 3 testing is comprised of regression test periods to test updates that are required as part of the Company gaming platform. The functional area testers should have sufficient time to test in each regression test period.

Phase 4: Acceptance tests

Phase 4 testing is limited. In addition to the functional area testers, end users will probably be involved in this final test before the new system goes live. In customer acceptance testing, no new tests should be introduced at this time. Customer acceptance tests should have already been tested in prior test phases.

Evgeni Kostadinov

Evgeni Kostadinov prefers the challenges associated with testing of various technologies. He has extensive experience with UI, API, DW, Performance and Mobile. He has worked mainly in projects with Java, .Net and NodeJS environments for Telecom, Financial, Marketing and Banking Institutions. He actively participates in the development of the company’s QA and CI processes and infrastructure. Evgeni currently works as a QA Manager, technical trainer, as well as a QA Challenge Accepted lecturer.