ETL: the testing niche you've never thought of
In the broad landscape of software development, ETL (Extract, Transform, Load) testing stands as a pivotal yet often overlooked area within quality assurance, presenting untapped opportunities for QA professionals.
This talk highlights the crucial role of ETL testing in supporting data-driven decision-making and enhancing data integrity. It will explore the nuances of ETL testing, from data source validation to business logic application and data output verification, providing insights through case studies.
You will learn to navigate common misconceptions and barriers, gaining practical skills and theoretical knowledge to excel in this niche domain, thus opening new career paths and advancing QA practices in the era of big data.
Transcripts
[Aaron Hodder] Our next speaker has a passion for all things automation.
Name a framework and he has probably not just used it, but also teaches a course on it. But today, he's here to introduce us to a testing specialisation that not many of us have been exposed to: ETL and data testing.
Here to help us navigate this novel frontier, we have Pato Miner.
[Applause and cheers]
[Patricio Miner] Thank you, everyone. Let me start with a story to put things in perspective, and give you an idea of how I came up with ETL testing, or how ETL testing came into my life.
I got a job offer. The job offer was fairly vague. It said something about Java, Spring Boot…okay…some data, some SQL.
I went there. I was like
“Okay, I know how to do unit testing. I know most of the languages. API, mocking, I don’t have any problem.”
Once I’m there, they were like
“How will you do ETL testing for a data warehouse?”
I was like “What? Goodbye.”
But the story ended with me getting the job, because the tester mindset and the ability to mentor teams were things they really liked. And they couldn't find anyone with experience in ETL testing.
That was my first thing when thinking about ETL testing.
So, why are we not hearing about ETL testing?
Why is ETL testing something that we don't speak about, in the testing community? Similar to game testing, like the talk that we just had.
Before getting into the introduction and all that, let me tell you that at the end of this talk, hopefully you will have a new niche: a new industry you can start thinking of when getting into new tools and new processes.
I will give you some homework, because the mic is giving me teacher vibes. So I will give you something to investigate.
[Laughter]
I was talking with Ryan, and he was super right.
We have a lot of data warehouse work, here in New Zealand. With Microsoft and Salesforce coming through, that will start happening more and more.
So ETL is something we should be paying attention to, as the testing community from New Zealand.
That was not the button…I was testing it.
[Laughter]
You can see that I had too much fun with AI image creation.
I’m Patricio Miner. A couple of things about me.
I'm a consultant test engineer, here in New Zealand. I'm from Argentina, as you can tell maybe from the accent.
I help companies and individuals through creative engineering. I mostly script, automate and help feed those loops of feedback that everyone loves in DevOps.
I make things easier for developers and testers alike. I also teach test engineering. I’m a mentor at Free Range Testers; that’s a community I run in Spanish. I’m also starting with DevAcademy Aotearoa, to give mentorships here in New Zealand as well.
I’m also the Dungeon Master of a Dungeons and Dragons table with two BAs and two developers; that’s crazy.
So, first of all, data, nowadays, like it or not, it's everywhere.
Every time you make an appointment, every time you call your bank, every time you buy something, you do something, you're generating data points somewhere.
But it’s the same thing as with oil, right? Data is like the new oil. You need to extract the true value from it if you want to do something useful with it. If not, it's like pouring crude petroleum into a Renault or a Ford Fiesta. You won't get much out of it.
Testing in ETL plays a crucial role in ensuring the quality of the data is there.
This happens way before any of the other testing that you might be aware of. It happens way before API testing, End-to-end testing, anything you can think of that uses that data to make things.
So why not bring attention to the data that we are using for these systems?
So, these are the points that we will be covering today.
We are going to see why ETL is important.
We are going to see specific cases that I have been working on, on how ETL is applied.
A notion of how ETL works, as a process, in data warehouses, and what is expected of a tester.
I will give you an introduction to Great Expectations, which is an open source Python library for ETL testing.
Again, because ETL testing is not super popular, there are not many tools to automate ETL testing.
I had to investigate a lot, and there were only two tools that could replace the automation framework we had.
Tosca was one, which is a paid tool. And Great Expectations was the other one.
The name comes from a great book, so extra points for them.
There are differences from other types of testing, which is both a good and a bad thing, because we need to put on our tester hats, our tester mindsets, and start thinking about how we can put quality into Extract, Transform, and Load; that's where the name ETL comes from. How can we make sure the quality is there?
So, why is it important, ETL testing?
The first of them, of course, is data quality assurance. We need to make sure the quality of the data we are using is there.
ETL…to give you a big picture of how it works. I was saying that anything you do, any of the interactions you do today with businesses, with companies, online, is generating data.
That data is put into different sources, different places, handling interactions from YouTube, interactions from Google, interactions from different places.
And then someone, a Data Scientist or Business Intelligence says
“Hey, I think we need a promo for people doing this thing, in this particular demographic, with this tool, between these ages.”
That's when ETL comes into place.
We have requirements saying, okay, we have these data points in these different sources; we need to apply this logic to the data, and load it into a table or view, for those people to make the right decisions.
As you can see, the second point, decision making, is one of the main points.
If you have a business, a startup, anything trying to make decisions based on data, you need to make sure that the data you are using is reliable, accurate, and there in terms of quality.
Then you have regulations. Sometimes it's not something that a company wants, but maybe the government says
“Okay, we have COVID happening. We need to make sure we are giving the right assistance to people who are having a hard time financially with this.”
So, we will need all the financial entities to produce this view/data. Then we have requirements saying
“Okay, we have these data sources, these points here. We need to transform this data because they need it a little bit different.”
Apply logic, put it into a table, and make that report happen.
That's one of the big things here, especially in Wellington. We have a lot of government ministries. ETL is a big part of how decisions are made.
System migration is another one.
When we are moving data from A to B, we need to make sure that it’s happening according to the requirements, that we're transforming and applying everything, and that we're not missing, duplicating…or screwing up the data, basically.
Performance optimisation is another one.
We need to make sure the queries we're using to make the logic happen are there in terms of speed; that they are not running forever, that they are not taking too much time.
Proactive bug resolution is another one. Because, again, data comes before API and End-to-end testing. So we will find those issues happening way before all those other types of testing are happening.
ETL testing is one type of functional testing. There’s functional and non-functional, and ETL is one of them. That's something to keep in mind, because when we talk about functional testing, ETL doesn't appear much on the list.
So, the first thing I thought when I was showing the team was
“Okay, I know how to test. We have a pyramid of automation. Let's say we have unit tests, integration tests, End-to-end test…but this is data. There is no end to end testing.
We have ways to take that data or SQL server… How exactly does unit testing work with this?”
So the conceptions we have from the normal, or more widespread, notion of testing change a little bit when we talk about ETL testing. Even the reporting is different, and we will see that later, when we look at how we report on ETL testing.
So, one of the key things here is to forget everything you know about testing; in terms of how you do unit, API, End-to-end. Because nothing will be exactly the same for ETL testing.
For example, unit testing will just check those blocks of queries. With a unit in a normal object-oriented system, you're testing the logic of a function, isolated from all the other systems.
In terms of data, you don't have functions exactly in a class. You have tables. You have views. So you have to start thinking in a different way.
Unit testing still happens, but it's not quite the same thing.
We mock data, and analyse what data to feed in, because we are dealing with huge amounts of data that we need to test from production, from real places.
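To make that concrete, here is a minimal sketch of what a “unit test” over mocked data can look like when the unit is a piece of data logic rather than a function; the columns and the business rule here are made up for illustration.

```python
# A sketch of "unit testing" a data transformation on mocked data.
# The column names and the business rule are hypothetical.
import pandas as pd

def apply_promo_rule(customers: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transformation: flag customers aged 18-25 for a promo."""
    out = customers.copy()
    out["promo_eligible"] = out["age"].between(18, 25)
    return out

def test_promo_rule_handles_boundaries_and_special_characters():
    # Mocked source data, including the awkward cases production will throw at us
    mocked = pd.DataFrame({
        "name": ["Ana", "José", "O'Brien"],   # special characters on purpose
        "age": [17, 18, 26],
    })
    result = apply_promo_rule(mocked)
    assert list(result["promo_eligible"]) == [False, True, False]
    assert len(result) == len(mocked)         # nothing dropped, nothing duplicated
```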
As testers, we need to sit down, talk with BAs about the requirements, and say
“Okay, we have these requirements here, why are we not considering this type of data, or what is going to happen when I do this?”
I remember one case, where a special character was exploding pretty much everything, when we were trying to transfer that data from the source to the target.
The tester mindset/tester hat is really useful. You need to have that curiosity. You need to have the initiative to ask questions.
“What if?”
“What if we do this?”
“What if we have this kind of data?”
Especially because, again, when you pick up data for testing, when you mock data for ETL testing, it's really difficult to come up with all the scenarios we have in production. We may have, I don’t know, tens of thousands of entries, and it's really difficult to separate them into groups.
There is something called Data Profiling that’s really useful. That is the tool that I will give you for investigation later. It helps a lot.
The tool in question here is Great Expectations. It's an open source library. It uses Python. It doesn't require a lot of knowledge in terms of programming, and it simplifies the complex task of ETL testing.
When I joined this team, we had a huge framework with Java, with Spring Boot. We had libraries to connect to the database, and it was like a Frankenstein of things happening there.
The results were there, but they were not super there, because we were using TestNG. If you're unfamiliar, this is about to get a little bit technical.
TestNG is a tool to say this test case is passing, this test case is failing.
And in terms of ETL, it's not exactly the same, because we need to say: we have this percentage of quality in the data, we have this percentage of expected values in this column. So it's not about saying this test case is passing or failing.
We need to communicate differently.
As I said, it's open source. You can download it after the conference. I invite you. I will put the links on the Discord server. It does all the heavy lifting for you.
When working with ETL, my case was maintaining this huge framework with Spring Boot, TestNG, other libraries to connect to the database, connecting to CSVs to mock the data…it required a lot of maintenance. That was not really sustainable, so we had to find a solution.
When trying to find a solution, in my investigation, I was like
“Okay, there must be a lot of ETL tools, because data warehouses and data are pretty much everywhere. So there must be a lot of things.”
No, there are not many. Actually, I found just two tools. One was Tosca, and this one was free. So of course, I went with the free option.
And this is a big picture of what Great Expectations does for you.
So, once you have the tool up and running, which means initialising it, you have all the methods and everything in place to connect to your data; be it a CSV, a DataFrame, a database, different connections for SQL Server, DataData, different options you can use.
And then Great Expectations will handle pretty much everything for you. It will create expectations. It will create the reports. You will have the live documentation of what you are expecting from your data, and how that is running against the checkpoints. We will see, in more detail, what each of these things is.
But the important thing from here, is the tool provides you with the context of doing ETL automation properly.
I say ETL automation, not testing, because you're dealing with a huge amount of data. You cannot go row by row to make sure it's all happening.
It is true that, as a tester, the true value of your work is analysing what you are expecting from your data, and making sure that by the time you automate, you know exactly which requirements you're covering, which things were not covered, and what the true value is for the Data Scientists and Business Intelligence.
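To give a rough idea of how little code that takes, here is a minimal sketch using the classic Great Expectations API against a CSV. The file and column names are placeholders, and the exact entry points have changed across releases.

```python
# pip install great_expectations
# Minimal sketch with the classic API; file and column names are placeholders,
# and the exact API differs between Great Expectations versions.
import great_expectations as gx

# Wrap a CSV in a dataset object that exposes expect_* methods
df = gx.read_csv("appointments.csv")

# Declare expectations about the data instead of asserting row by row
df.expect_column_values_to_not_be_null("customer_id")
df.expect_column_values_to_be_between("age", min_value=0, max_value=120)

# Validate everything at once; the result carries success plus unexpected-value statistics
results = df.validate()
print(results.success)
```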
So, this is the report. It generates HTML reports.
And, there are a couple of funny things here that you will see.
So, it shows how many expectations there are. Expectations are, roughly, the equivalent of test cases from the normal testing world.
There is a column here called Observed Value. It has 0% of unexpected values there.
We can see that the expectation was that values must belong to a set; we have a set of numbers here. One of the key things when you're doing ETL testing, or ETL programming, is that you need to report on the quality of the data.
Same thing with the oil. I don't know, for some reason, I like to compare data with oil.
Yes. It's the same thing, because sometimes, when doing prototypes, depending on what you're working on, especially with regulations, things where you need to move fast and fail fast, you will say
“Okay, what percentage of unexpected values are we willing to accept?”
Maybe the business says
“Okay, we need this to happen fast so we don't really care, at this point, if we have 30% of unexpected values for this expectation. We're good to go.”
Which is different from presenting a test from TestNG, or an automated report saying
“This test case is failing.”
How many of those are failing? Maybe it's 0.01 of the data failing, and that's good for the business.
Which, as a tester, sounds weird right? If you see something failing, it’s like
“This is not ready to go.”
But again, testing is about communicating the state of things, to help the people making the decisions make them based on the right data.
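For what it's worth, Great Expectations lets you encode that agreed tolerance directly: most built-in expectations accept a `mostly` argument. A hedged sketch, with a placeholder file, column, and value set:

```python
# "mostly" encodes the tolerance the business agreed to; names are placeholders.
import great_expectations as gx

df = gx.read_csv("promos.csv")
df.expect_column_values_to_be_in_set(
    "region_code",
    value_set=[1, 2, 3, 4],
    mostly=0.70,   # pass as long as at least 70% of rows fall in the set
)
```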
So, this is one of the core differences that we find between regular testing and ETL testing, on top of the pyramid of automation and all the processes we may have as testers being completely different.
Unit testing is not quite there; it's not the same. We don't have API testing, because there is no API here. And End-to-end testing doesn't work that well, because we don't have user flows here. It's just data being transformed from A, B, C to C, D, E in another place. So we need to start thinking in another way.
Here is an experiment I did with Great Expectations: I downloaded it and ran it against a CSV, and I invite you to do the same. You can find a lot of data/CSVs from New Zealand that you can download, and you can create expectations against them.
All this is generated automatically. You have the checkpoints, which are the executions you have. Again, you can connect this against CSV files, you can connect this against your DataFrames. You can mock data as well, which is a huge part of how you do unit testing. And I will come back to that in a second.
So, how many of you know Python? Just out of curiosity? In the testing community here?
Okay, that's good. That's good.
You don't need a lot of Python, so you can try it out. Again, you can download it. I will leave the link. But, let's go through how this works exactly.
Once you set up this package, you can create your Data Context. The Data Context, in this context, is basically the skeleton of your project.
You will create the skeleton with everything you need to connect to and create the data sources, run scripts, and write test cases; in this case, expectations. Then you create the runners for those.
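In the classic versions I was using, scaffolding that skeleton is one CLI command and picking it up in code is one call. A rough sketch; newer releases have changed these entry points.

```python
# Scaffold the project skeleton once from the terminal (classic CLI):
#   great_expectations init
#
# Then load the Data Context in your scripts; assumes a recent 0.x release.
import great_expectations as gx

context = gx.get_context()                      # finds the great_expectations/ folder
print(context.list_expectation_suite_names())   # suites of expectations defined so far
```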
One of the interesting bits is you can complement this with Airflow. I don’t know if you’ve heard of Airflow, it’s a scheduler. In data warehouses, it's really, really useful to make things happen in terms of ETL, and data logic, and extracting, and loading into different tables and views.
So, it has really good synergy there. Not so much with things like Jenkins or Sugar Lopes pipelines, but it works very well.
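If you do use Airflow, there is a community provider package, airflow-provider-great-expectations, that wraps a checkpoint run as a task. A sketch only: the DAG, paths, and checkpoint name are placeholders, and the operator's parameters vary between provider versions.

```python
# Hedged sketch of running a Great Expectations checkpoint from an Airflow DAG.
# Assumes the airflow-provider-great-expectations package; names are placeholders.
from datetime import datetime

from airflow import DAG
from great_expectations_provider.operators.great_expectations import (
    GreatExpectationsOperator,
)

with DAG("warehouse_quality", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False) as dag:
    validate_load = GreatExpectationsOperator(
        task_id="validate_customer_promos",
        data_context_root_dir="great_expectations",   # where `init` created the project
        checkpoint_name="nightly_warehouse_checks",   # placeholder checkpoint
    )
```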
So, another of the differences we have with this and regular testing, is how we create Expectations.
We have data from different places. We have the Business Intelligence team saying
“Okay, we need this set of data. I don't know, and I don't care where you get it from. I need this data to be presented in this particular way.”
So that's what the requirement usually looks like.
Then, you have the Developers, Business Analysts, and the testers saying
“Okay, we know that this data comes from here. But it's not quite in the state that Business Intelligence wants. So we need to change this with logic, and do these changes before we load it in the target table.”
So, in that sense, you need to know a lot about the domain and how things work in your data warehouse. Data Profiling is one of the main techniques you will use to come up with this. Running expectations, or test cases, for ETL will only be as good as the data you choose to run them against.
What I mean is that if you grab 10% of production data, it might not be that relevant, because it doesn't have the diversity that you want; it's not the set of data you want. In that case, mocking data is the best approach before doing the actual test on the final target table.
So, having, again, that tester mindset/the tester hat and saying
“Okay, we have these expectations. We have this logic. What happens when we have this data in the database? What happens if we don't have this, or it changes slightly?”
All those cases are stuff you're coming up with, as a tester, thinking critically
“How to test this?”
And one of the good things you have with this tool, in particular, is it provides you a data profiler.
A data profiler, in terms of ETL, is something that will help you say: this data is relevant, and I'm expecting this column, for example, to be this way; to be unique, to not have any duplicates, to be in a date format.
So, data profiling basically works by feeding in, let’s say, 10% of data from production. You can feed that into the profiler. It will iterate and generate a draft of test cases, or expectations, that you will analyse as a tester. And say
“Okay, this looks like what we want to test. This is not quite there. This is not correct. We can have this other type of value for this attribute.”
And this serves as a really good kick start, because we’re dealing with a lot of data, a lot of possible different scenarios. The data profiler is a really good baseline for thinking about how to test that.
These are a couple of different ways you can create expectations. One of them is using domain knowledge.
Maybe you’re an expert in the data warehouse. You know exactly where this is coming from, where it's going, and why it's going there. So you can create your own test cases in a similar fashion. If you're aware of Karate DSL, it's the same idea: you have pre-made expectations you can use.
Or you can do this introspection and have the profiler help you. You can run the profiler against 50,000-60,000 rows. It will say
“Okay, these are the expectations based on the data that you fed in. This is what we need to test.”
Again, that's a baseline. Don't take that as the final thing, but it's a really good approach.
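For reference, running the profiler in the classic API looks roughly like this: hand it a sample of data, let it draft a suite, then review the draft with your tester hat on. The file name is a placeholder, and the profiler classes have moved around between versions.

```python
import great_expectations as gx
from great_expectations.profile.user_configurable_profiler import (
    UserConfigurableProfiler,
)

# A sample of data to profile, e.g. ~10% of production rows (placeholder file name)
sample = gx.read_csv("production_sample.csv")

profiler = UserConfigurableProfiler(profile_dataset=sample)
draft_suite = profiler.build_suite()   # a draft of expectations, not the final word

# Review each drafted expectation: keep it, adjust it, or drop it
for exp in draft_suite.expectations:
    print(exp.expectation_type, exp.kwargs)
```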
The Checkpoints are the runners. So, similar to what we have when doing automation, we have runners. We can batch the different suites of expectations we have, and run those through the validation in the checkpoints. It will produce, first of all, an HTML report. You can hook it up to Slack, to email, and send notifications.
You can use different sources of data. Again, CSV files, SQL databases, Spark DataFrames, DataData; and in the case of New Zealand, there is a lot of DataData work, and SQL Server as well.
You can use local file systems, CSV files, S3 buckets in memory…all of that is something you can use.
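Pulling the checkpoint piece together in code, it looks roughly like this in the classic API. It assumes a checkpoint has already been defined (via the CLI or YAML), with Slack or email notifications added as actions in that checkpoint's configuration; the checkpoint name is a placeholder.

```python
# Hedged sketch: run an existing checkpoint and refresh the HTML report (Data Docs).
import great_expectations as gx

context = gx.get_context()
result = context.run_checkpoint(checkpoint_name="nightly_warehouse_checks")

context.build_data_docs()   # regenerate the live HTML documentation
print(result.success)       # overall pass/fail; per-expectation detail is inside
```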
I suggest you download a CSV from Stats NZ, or from wherever you want. I tried with a NASA one; pretty cool. Come up with all the expectations that you can think of, and critically assess whether those expectations are correct or not. Usually it's a good baseline, but it's not quite there.
This is what I was saying previously, live documentation. Similar to Cucumber/Gherkin, but more useful?
You have exactly what you are expecting from the data that you're producing, and how much of the values are unexpected or not.
You have the schema validation for the tables, and it's a really good document to share. It provides a level of detail that you might not have with TestNG, or the other tools that are not suitable for this kind of testing.
Here, we can see it's checking the different columns that we have, and how much data:
Did we move everything correctly?
Did we move all the values from one table to another?
Is the logic there?
And we can see we have 97% success, which may be enough for some businesses; for others, it's not.
So the conclusion for this.
I wanted to talk about a niche that I feel, as testers, we don't know about.
As I was talking with recruiters, I learned that ETL testers are in demand in the market. And usually we don't have the knowledge or the skills to get those jobs. And sometimes we end up in those jobs anyway, like me.
If you're prepared, it's even better.
So I invite you to grab your laptops after the conference, download the package, and try it out with any set of data you might find.
Share on Discord how it goes. We can have a conversation there. Thank you.
[Applause]