Quality Bits

Cloud, Infrastructure as Code and Testing IaC with Daniel Mohacsi

December 27, 2022 Lina Zubyte Season 1 Episode 9

Infrastructure as code - is it possible to test it? What opportunities are there in cloud? How can code standards support us in building high-quality products?

Hear about all this and more in this episode of Quality Bits with Daniel Mohacsi, DevOps Lead at Lufthansa Systems Hungary.

Find Daniel on:
- LinkedIn: https://www.linkedin.com/in/dani-mohácsi-0604a652/

Follow Quality Bits host Lina Zubyte on:
- Twitter: https://twitter.com/buggylina
- LinkedIn: https://www.linkedin.com/in/linazubyte/
- Website: https://qualitybits.tech/

Follow Quality Bits on your favorite listening platform and Twitter: https://twitter.com/qualitybitstech to stay updated with future content.

Thank you for listening! 

Lina Zubyte (00:05):
Hi, welcome to Quality Bits, a podcast about building high-quality products and teams. I'm your host, Lina Zubyte. Did you miss me? Are you ready for a new episode today? Today I'm talking to Dani Mohacsi. I worked with Dani a while back and he always struck me as an extremely curious person, and you'll hear more about that with some funny examples in this episode. We talk about the benefits of the cloud, infrastructure as code and some pitfalls that we have when we work with it, as well as how do we actually test infrastructure as code? Dani does give really useful tips in this episode, so let's check it out.

(01:03):
Welcome Dani to Quality Bits. It's very nice to have you here. Would you like to say hello?

Daniel Mohacsi (01:09):
Yeah, hello everybody, and thank you very much Lina for inviting me over. It's a pleasure for me to be here. I'm feeling really respected that you thought of me.

Lina Zubyte (01:23):
You are very respected. So we worked together in the past and I remember that you were one of two developers in that company. I was like, if you leave, I will cry, and I wanted to cry, and maybe I did cry when you left. I respect you so much. So could you introduce yourself briefly?

Daniel Mohacsi (01:45):
Yes, I'm Daniel Mohacsi. I have about 14 years of experience now in various roles in the IT industry, and for the past few years I have been working at a medium-sized IT company in Hungary. It's called Lufthansa Systems Hungary. We are in the aviation IT industry, and here at this company I'm working in the team of the CTO as a DevOps expert. So whenever there's a question regarding IT automation, software builds, whatever, then I'm the go-to person.

Lina Zubyte (02:25):
Who are you apart from your profession?

Daniel Mohacsi (02:29):
I'm a father of a son. He is two years old already, and I'm living in the countryside, having a sort of nice life.

Lina Zubyte (02:41):
And there's one thing about you that I really remember. Of course you're very humble and that's what we see now, but you're a very curious and passionate person. That's what I remember about you. I remember how once we lost the internet in the office and you hacked the dinosaur game, so that the dinosaur game was just playing by itself, and that for me is such a nice memory of you and your curiosity and just the spark in your eyes about all things tech, which I really appreciate. So what have you been fascinated about recently in the tech world?

Daniel Mohacsi (03:22):
Well, I'm working for an enterprise company which has multiple locations, with people from all over Europe and also outside of Europe working together, and we are working in the airline IT industry. I'm fascinated by how much potential there still is in cloud computing. This is a concept that was invented more than a decade ago by Amazon, and I still see that not all companies are leveraging its full potential. With cloud, we can not only save a lot of money and effort, but also improve our everyday work. And that is something companies need help in leveraging. That's what I also started to learn. And there are so many things to learn there, and so many things to improve when it comes to the daily business of an industry. This sort of gives me the motivation to change the world of aviation with the usage of cloud.

Lina Zubyte (04:29):
That's really interesting. Like aviation as one of the most strict domains, right? And then using the new opportunities. So how do you think these opportunities could be seized and used? What opportunities do you see in cloud that were not there before?

Daniel Mohacsi (04:47):
Formerly, companies, not only in the aviation domain but most of the bigger enterprises, went down the path of outsourcing IT and thinking of infrastructure as a cost center. And then there was this yearly purchasing of hardware infrastructure, then the maintenance of this infrastructure. When you were an IT enterprise, you also had to employ air conditioning experts. When there was a problem with the AC in your data center, you needed to employ experts who were there to fix the issues, and that shifted the focus. So it required a lot of overhead for these enterprises. Let's assume you want to deliver software as a company; to do that without cloud, you need to have a data center, you need to have your own servers. Of course you need to make sure that these servers are cooled, and there are a lot of other aspects.

(05:45):
This is just one, and it brought a lot of accidental complexity in my opinion. And with cloud, this could be reduced. With cloud you don't have to provision your infrastructure a year in advance and you don't have to employ air conditioning experts, as you just use the expertise of other companies who are on the market and competing against each other to give you the best price, which you are also doing as a software delivery company. So I think cloud lets you think about new opportunities, because earlier, when you, for example, had an idea that you wanted to test on the market, you first had to look at your infrastructure, whether you had the required capacity for this year or for next year to run additional processes, not even containers, but to run your processes on an existing Linux machine, and to provision the disks required for your storage.

(06:47):
And all these became available on demand in the cloud. And I think this is an opportunity that not many businesses, at least in my domain, are leveraging correctly. So there's much more in the cloud than what is being used in my industry. That's what I can see here. Of course it's different in e-commerce and it's different for smaller companies, and we have a lot of regulations on where we can store our data and how we can process our data. But after all, these cloud providers excel in terms of security and they can employ the best security experts, and then we maybe only need to do our part. So we only need to implement the strategies that are laid down in the guidelines of cloud security.

Lina Zubyte (07:40):
Guidelines is quite a big topic. What are some of the strategies that you can think of when it comes to cloud and infrastructure that help to make the products we build more secure and reliable?

Daniel Mohacsi (07:55):
When it comes to security and when it comes to cloud and when it comes to reliability, what I can also think of is automation. When you can automate your processes and improve your processes during automation, you can reduce a lot of risks, and you can reduce the accidental complexity as well if you do it right. So when you can automate your infrastructure provisioning to an extent that you can recreate your infrastructure on demand, on multiple cloud services for example, or even in an on-premise environment, then you can also achieve a greater degree of security without the manual effort required to configure and deploy your applications. You can also achieve better reliability. So that's what I can see now, looking at our software, looking at our industry, looking at how we deliver, and I think there's great potential in automation with the cloud and with the standards that cloud computing brings, because what I also saw, and what also fascinates me, is that even competing vendors like Google and Amazon and Microsoft have aligned on the infrastructure standards.

(09:13):
So, for example, Google, Amazon, Microsoft and all the other smaller cloud providers are all supporting Kubernetes, which is sort of the de facto standard now. And something similar is happening in the observability area, where all the competing vendors, Datadog, Splunk, New Relic, are aligning on the OpenTelemetry standards. So this is leading us in a nice direction: if you build your applications in a way that follows these standards, so you containerize them, you use Kubernetes orchestration and you use, for example, OpenTelemetry for instrumenting your applications, then all of a sudden your application becomes cloud agnostic. And this is, I think, what software development has to understand: you no longer have to focus on the infrastructure part, you just use standardized infrastructure. You only have to focus on your business domain, on your business logic. That should be the one thing that differentiates you from your competitors, because your competitors will also be using these standards and they will not invent their own standard for container orchestration.

Lina Zubyte (10:28):
You said additional or accidental complexity? Yeah. Yeah, I think that term is really nicely summarizing lots of issues that companies have that we overcomplicate our work when we could have it more efficient and easier. And the concepts you mentioned I feel like also help us to have it more efficient and build better products in the end. What are some of the common bottlenecks that you see that many companies have in their processes that affect their efficiency?

Daniel Mohacsi (11:03):
So there are a few. First of all, I think one of them is the lift and shift approach. When you want to move to the cloud and you don't know how to do it, the first idea would be, okay, let's take what we have on premise and let's recreate it in the cloud. That's okay for a time, but after some time it'll hurt you quite a lot because you are not really leveraging the capabilities of cloud computing. And if you treat the cloud as a data center, then what will happen is that your cloud bills will be much higher, because in the cloud you pay for certain items that you're not paying for when you run your infrastructure in a data center. So I think that's a very important aspect: if you're doing lift and shift, then you still have to modify at least the processes.

(11:56):
So you can of course create virtual machines and replicate your existing virtual machines from your data center in the cloud, but make sure that you are doing the right automation and you're not just trying to automate the existing processes that you have. What I'm talking about, rather, is that you have to have an automated way of provisioning the infrastructure and deploying your application. And by automation I mean you write it and treat it as code. And even if you're doing lift and shift, you can still containerize. I think you can still use the orchestration facilities provided by Kubernetes, for example, or even a container service, with minimal effort. What I've seen is that many people are really overestimating the effort of containerizing applications. After seeing and doing some containerization for legacy applications, it's really not a big deal.

(13:00):
So actually we are doing this, or we were doing this, in the lift and shift work, but it was done manually. If you look at, for example, a Dockerfile, there you have everything declaratively, or maybe not fully declaratively, but that's a different topic. There you have basically a set of instructions for what you have to do in order for your stuff to run. And that is basically what we were already doing in the lift and shift and data center work as well: we created installation documents and operation guidelines, and that is something which can be done in a machine-interpretable way using containers, but let's not get that far ahead. So when it comes to lift and shift, I have seen a common problem in automating the old legacy way of operations, the old way of deployment. I have seen many, many teams implementing some really, really obscure and complex bash scripts to deploy their applications.

(14:04):
And a lot of things are basically because they don't really understand what a modern orchestration facility would bring to them. So for example, when you have a web server process that is not containerized, you have to take care of which port it uses. I have seen that we had some scripts which were basically setting up some offset on those ports. So if you have to run multiple instances of that web server process, you have to use an offset and this logic, and then there's service discovery: how does a consumer service find this newly spun-up process on your machine, on the new port? So this offset logic had to be implemented both in the producers and the consumers. And this is a lot of accidental complexity again, which is only understandable for the people who invented this concept. And this is, I think, a risk, because this will make it proprietary.

(15:02):
So when you have an infrastructure automation script which is proprietary, it means you have a dependency on a lot of important people who could, on one hand, do other things which are more valuable for your company. On the other hand, they could simply leave if they feel that they are burning out in this framework that they have invented. And then you are there with this thing that nobody understands, and this has happened to us already. So when we are onboarding people, we have to ramp them up to understand our in-house implementation of infrastructure code. After all, I think this is where all the open standards come into play. When it comes to infrastructure as code, I think we agree that, for example, Terraform is the de facto standard, or it's becoming the de facto standard, for IaC. Of course there are other newcomers on this market; for example, I like Crossplane very much, which is a way of managing your infrastructure as Kubernetes custom resources.

(16:03):
I really, really, really like this approach, but I think it still has to mature. When you are using these open source and sort of de facto, or at least highly popular, systems like Terraform, then you have an easier way to, for example, get new people onboarded. When you need a Terraform developer, you just hire one, basically. And the same applies to Kubernetes, but when you have your own shell scripts, you cannot hire anyone who understands your logic. So yeah, that's the most important learning from my side, at least in the past few years. If our industry can leverage containers, leverage orchestration, leverage CI/CD and observability, then I think it's a bright future for the aviation industry's digitalization efforts. I think this has to be opened up in terms of infrastructure.
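The port-offset logic Dani describes can be made concrete with a small Python sketch. Everything in it is hypothetical: the `BASE_PORTS` table, the `PORT_OFFSET` value and the service names are invented for illustration and are not taken from any real deployment script. It only aims to show why the same arithmetic ends up duplicated on the producer side and the consumer side.

```python
# Hypothetical illustration of hand-rolled "port offset" service discovery.
# BASE_PORTS, PORT_OFFSET and the service names are assumptions made up
# for this sketch; they do not come from the episode.

BASE_PORTS = {"web": 8080, "api": 9090}  # assumed base port per service
PORT_OFFSET = 100                        # assumed gap between instances


def instance_port(service: str, instance: int) -> int:
    """Return the port of the n-th (0-based) instance of a service."""
    if instance < 0:
        raise ValueError("instance index must be non-negative")
    return BASE_PORTS[service] + instance * PORT_OFFSET


def start_command(service: str, instance: int) -> str:
    """Producer side: the deploy script computes the port before starting."""
    return f"./{service}-server --port {instance_port(service, instance)}"


def consumer_url(service: str, instance: int, host: str = "localhost") -> str:
    """Consumer side: every caller has to repeat the same arithmetic."""
    return f"http://{host}:{instance_port(service, instance)}"


if __name__ == "__main__":
    print(start_command("web", 2))
    print(consumer_url("web", 2))
```

If the base port or the offset ever changes, every producer and every consumer has to change with it. With an orchestrator such as Kubernetes, a consumer would typically address a stable Service name instead and leave port allocation to the platform.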

Lina Zubyte (17:00):
It definitely looks like infrastructure as code can help us out to be more efficient and as well that companies face quite a lot of transformations. So just to step a little bit back there were a lot of terms that you used as well talking about infrastructure as code. How would you describe it to someone who has just heard of this term and is confused about it?

Daniel Mohacsi (17:26):
I think here there's another common pitfall, but what I like to say is that this is also code that you write and you maintain. This means you have to apply the same processes to this code that you're applying to your business code. So this infrastructure code must be well documented, must be linted, you must have a CI/CD pipeline for this code, you must be able to test it automatically, and you must of course apply the PR principle, pull requests, the audit parts that we are familiar with in the area of software development. You have to apply all these principles to this discipline.

Lina Zubyte (18:11):
I really like it because I think the same also happens with test code in general. Very often when there's a pull request, we tend to review the code logic, the code code, but we don't review test code as much, and I feel like infrastructure code is very similar. Even though it does need review, and if you configure it wrongly, our whole environment could go down and our product could be inaccessible, we don't review it as much. We see it more as config; as it is, we push it straight to master and we don't have this big review process.

Daniel Mohacsi (18:49):
And now that you mentioned test code, I like your approach. So I also think that we have to talk about code in general. So you don't write this code first of all for the machine because then everyone would be coding in assembly instead you write this code for others. And that's something which is very important to understand for everyone who writes code. So you don't write it for the machine to understand it because that's just not reusable. Only you will understand it or maybe only the machine will understand it after a few months. I rather think about my code as a message for the next developer who wants to improve it. First of all, this is a discipline that everyone has to follow. I've read a nice blog post on GitHub as well that you should also consider your commit messages. And I really like this, I could really align with this, that you have to consider your commit message as a story.

(19:44):
When someone looks at your commit logs, it has to tell a story of how this code base has evolved. So I think this is where it starts: when a developer understands that it doesn't matter if he or she writes test code or infrastructure code or business code, it's all code after all. And this code has to be understood by others. Especially when you're working in a large industry, there are chances that this code will be used after 10 years, or that this code has to be touched after 10 years. So it starts with following the discipline that this code has to be read by others. I have seen many, many test frameworks, for example, which were proprietary. Someone invented something which, I don't know, tests the functionality or the non-functional requirements of an application. And then if a test failed, no one really knew what he or she had to do.

(20:39):
So that's not good test code. And I think the same applies to infrastructure code. If you're leveraging infrastructure as code, for example when you write these Ansible scripts, and I'm not against Ansible at all, don't get me wrong, it's very easy to do it wrong because it lets you do a lot of things. You can install packages, you can do whatever. And then when you, for example, install those packages with Ansible, no one really forces you to use a certain version. Or when you just invoke a script from your Ansible playbook, no one really forces you to follow any principles in the shell script. It can be some very complex logic which does some very, very important calls on the machine which it wants to configure. But then how do you test it if you want to change that configuration? How would you maintain that code? How would you make changes? There are no real good practices when it comes to these infrastructure manipulation scripts when you're doing the Kubernetes containerization way.

Lina Zubyte (21:44):
Yeah, I feel we can really over-complicate everything and over-engineer in a sense. And having some standards always really helps. I wonder though, you mentioned testing infrastructure as code. Once I was in a project where there was just infrastructure, and as a QA I was like, oh, what do I do here? And I kept on reading about testing infrastructure as code and I couldn't find a way. But it seems like you have some tips and tricks. So how would you test infrastructure code?

Daniel Mohacsi (22:16):
Yeah, good topic. So for infrastructure code, it depends on your environment. For example, when you're using Ansible, there are tools like Molecule which can run certain assertions. With Ansible I think the most important problem is that it doesn't enforce immutability. So it might work in a test environment, but when you want to apply those changes in your real infrastructure, it might be different. No one really forces you to avoid mutability. And I think that's the key when it comes to automatically testing infrastructure code. If your infrastructure is immutable, you have a better set of tooling and a higher reliability of your tests, because in the production environment nobody made any changes that are not in your actual code base either. At least that's the promise of immutability and GitOps. And when it comes to testing, I think it's quite simple.

(23:21):
You can also think about the testing pyramid here for infrastructure code. So you can, for example, have unit tests on your container level. If you're leveraging immutability and using an orchestration system like Kubernetes, then you have a very happy path, because what you can do is you can obviously lint your manifests, and we have started to use some policy frameworks like the Open Policy Agent or Kyverno which are capable of validating your Kubernetes manifests against a set of policies. If you do this, you can on one hand shift left the policy evaluation and on the other hand do some sort of functional or sanity check on your infrastructure, so on your containers and how you orchestrate them, what images you are using. Another very important part is that you can also do vulnerability scanning on your container images in a standard way, so you can put it in your CI pipeline when you are creating your containers and also run the same vulnerability checks in your production infrastructure.

(24:34):
So when new vulnerabilities are discovered, you can be alerted. Also, I think it's very important that when you are leveraging technology like Docker for containerization, you can first of all lint your Dockerfile. That's a very easy task. It'll enforce some important best practices, like the aforementioned package versioning on your operating system. So linting tools like Hadolint are, I think, a very, very important part of your tool chain. Also, there are tools like Dive, which is capable of detecting duplicates in your container. So when you are not leveraging the layering of container (Open Container Initiative) images properly, Dive is able to flag this in your CI pipeline. So this is all about shifting left functional and non-functional testing. And with containers, this can be a broader set of testing than just testing your business code, because you can also have vulnerability scans on your OS packages, with Trivy for example, and you can also have vulnerability scans with the same tool for your Java or JavaScript or TypeScript or whatever artifact you deliver in your container.
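To give a feel for the kind of rule a Dockerfile linter can enforce, here is a minimal Python sketch of a "pin your package versions" check. It is not Hadolint's implementation and makes no attempt to parse Dockerfiles fully: the function name `unpinned_apt_packages`, the sample Dockerfile and the version string in it are assumptions made up for illustration, and only simple `apt-get install` commands are recognized.

```python
import re

# Sample Dockerfile for the sketch; the curl version string is illustrative.
SAMPLE_DOCKERFILE = r"""FROM debian:12
RUN apt-get update && apt-get install -y \
    curl=7.88.1 \
    git
"""


def unpinned_apt_packages(dockerfile_text: str) -> list[str]:
    """Return apt packages that are installed without a `=version` pin."""
    # Join lines that end with a backslash continuation into one line.
    text = re.sub(r"\\[ \t]*\n", " ", dockerfile_text)
    findings = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped.upper().startswith("RUN "):
            continue
        # A single RUN instruction may chain several shell commands.
        for command in re.split(r"&&|;", stripped[4:]):
            words = command.split()
            if words[:2] != ["apt-get", "install"]:
                continue
            for word in words[2:]:
                if word.startswith("-"):
                    continue  # skip flags such as -y
                if "=" not in word:
                    findings.append(word)
    return findings


if __name__ == "__main__":
    for package in unpinned_apt_packages(SAMPLE_DOCKERFILE):
        print(f"unpinned package: {package}")
```

In a CI pipeline, a check like this would fail the build when the returned list is non-empty. A real linter covers far more cases, which is the argument for using an established tool rather than maintaining a script like this one.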

(25:55):
And then when it comes to operating it or running it, you have tools in Kubernetes that are capable of verifying your resource manifests, like Open Policy Agent or Kyverno. These are capable of continuously validating your infrastructure, your deployments, whatever resources you're using in Kubernetes. You can write policies for them, even for custom resources. And the nice part is that these policies that you're writing and maintaining, you can treat them as code. So now I want to refer back to our discussion earlier about code. These policies will also be treated as code, and you can have a CI pipeline for your policies as well. So I think with this you can have full coverage of your code base: you are testing the important parts in an automated fashion, and you're able to change it. That's also an important aspect of having automated tests: a new person who joins your team will be able to understand the important features of that code base just by looking at the tests.
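The idea of policies as code can be illustrated with a small Python sketch. It does not use the Open Policy Agent or Kyverno APIs or their policy languages; it is a plain-Python stand-in that treats a Deployment-style manifest as a dictionary and runs two made-up policies against it. The policy functions, the `evaluate` helper and the sample manifest are all assumptions for illustration.

```python
# Plain-Python stand-in for "policy as code"; not OPA or Kyverno syntax.
# The policies and the sample manifest below are invented for this sketch.


def containers(manifest: dict) -> list:
    """Return the containers of a Deployment-style pod template."""
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    return pod_spec.get("containers", [])


def no_latest_tag(manifest: dict) -> list:
    """Policy: every image must carry an explicit tag other than 'latest'."""
    violations = []
    for container in containers(manifest):
        image = container.get("image", "")
        name_part = image.rsplit("/", 1)[-1]  # drop registry and path
        # A missing tag is treated the same as ':latest'.
        tag = name_part.split(":", 1)[1] if ":" in name_part else "latest"
        if tag == "latest":
            violations.append(
                f"{container.get('name')}: image '{image}' must use a pinned tag"
            )
    return violations


def require_resource_limits(manifest: dict) -> list:
    """Policy: every container must declare resources.limits."""
    return [
        f"{container.get('name')}: missing resources.limits"
        for container in containers(manifest)
        if not container.get("resources", {}).get("limits")
    ]


POLICIES = [no_latest_tag, require_resource_limits]


def evaluate(manifest: dict) -> list:
    """Run all policies and collect their violation messages."""
    return [msg for policy in POLICIES for msg in policy(manifest)]


SAMPLE_DEPLOYMENT = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [
        {"name": "web", "image": "nginx:latest"},
        {"name": "api", "image": "registry.example.com/api:1.4.2",
         "resources": {"limits": {"memory": "256Mi"}}},
    ]}}},
}

if __name__ == "__main__":
    for message in evaluate(SAMPLE_DEPLOYMENT):
        print(message)
```

Because the policies are ordinary code, they can live in a repository, go through pull requests and have their own tests, which is the point made above. Dedicated tools add what this sketch lacks, such as evaluating the same rules continuously inside the cluster.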

Lina Zubyte (27:10):
Yeah, I really like this aspect because I feel like by going towards infrastructure as code, we are reducing silos. We are heading towards standard practices and, in general, treating code as code, whatever code this is, and having certain rules for the way we approach things. I think a lot of it is also a mindset of this big collaboration with each other, sharing knowledge, that I'm not a knowledge island: even if I leave a company, I still leave beautiful commit messages, I leave nice code and things like that. So now that we have these efficient processes and we're striving for that, there's one thing that happens naturally: we have more frequent changes. So what do you think about frequent changes in software development? Especially as someone who also knows a bit about aviation.

Daniel Mohacsi (28:14):
Especially aviation. So yeah, just referring back to the testing part and the infrastructure as code: we didn't really talk about the benefits, or maybe we did, but not in terms of changing. With this approach we treat everything as code and everything is automated, and when we want to change something, we do a pull request in the repository, apply our own pipeline, and then it's out. So with this we can reduce the risks of changing an existing system. I have heard many times, not only at this company, that if something works, don't touch it, but that's not how you improve. If you are not touching a system in the IT world, it will rot; there's no nicer word for it. Because there will be, for example, a new vulnerability discovered, like Log4Shell. And I really liked a talk at a conference earlier this year from, I'm not sure if I pronounce his name correctly, apologies if I don't: Llewellyn Falco.

(29:21):
It was about changes in software, and he gave a really good metaphor comparing software to aircraft. I could really relate to this metaphor, and I think I understand our challenges in the IT industry better after this talk. When you are thinking about relying on an aircraft, what would you look for? You would look for an aircraft which has a reliable history of maintenance records and so on. And yeah, of course it has been refueled multiple times and it has been working reliably for the last 10 years. That's a reliable type of aircraft. When you have to change it, when you have to change the engine, change the type of propellers or whatever, change the screws in the fuel tank, then you won't rely on it; it'll be grounded for a long time and it won't be safe to fly this aircraft.

(30:25):
And I can understand this, it's how you rely on these. I think the same applies to cars as well. You would rely on a car which requires no maintenance at all. But for software it's the other way around. And his metaphor was about this Log4Shell vulnerability discovered last December, or maybe it was even November, I don't know. This Log4Shell vulnerability was fixed a few times in December, and he had basically three different types, or maybe more than three different types, of software components. One of them was the happiest, where everything was automated, and they didn't even have to make a change themselves because they received the pull request from Dependabot from GitHub. And then when they merged it, soon after that the fix was rolled out. And then there were different ones. The other end of the scale was a system where no one knew where the code was; it was obfuscated, so they couldn't even decompile it.

(31:29):
And then when they found the code itself and successfully built it, they had problems deploying it, because nobody was really able to deploy changes in the production environment. So that's the other end of the scale. This other end of the scale was the aircraft. Not in practice, but in the metaphor it's like an aircraft: you don't change it, it works, let's not touch it, it serves its needs. But that's not how software systems work. If it were like this, we would still be using Windows 3.11 or something like that, and that's it. Changes drive the evolution, I think, of the IT industry. And this is the message that has to be conveyed to all of the other industries, that IT will only become a driver and you can only leverage the full potential of cloud. And we haven't even thought about AI and machine learning and all the nice stuff which is available in the cloud as well. So you will only be able to leverage this if you're able to take advantage in your systems.

Lina Zubyte (32:40):
Nice message. I feel like it's a good point for us to come full circle and to wrap up the conversations and topics we've touched on here. So what is the one piece of advice you would give for building high quality products and teams?

Daniel Mohacsi (33:02):
There's only one good piece of advice: don't think that you will get it right the first time. You will make a lot of mistakes and you will learn a lot from those mistakes. That's very important. After a few, maybe 10 to 100 mistakes, you will get used to it and you will know how to avoid the same mistakes in the future, like by writing tests for them, whatever. And then you will have a culture where making mistakes and learning from them is a benefit and not something that should be avoided.

Lina Zubyte (33:41):
Wonderful. Thank you so much, Dani. It was a pleasure talking to you.

Daniel Mohacsi (33:46):
Yeah, it was a pleasure for me too. And yeah, thank you very much again for inviting me and reminding me of the dinosaur game.

Lina Zubyte (33:56):
Yeah, such a cool game. It still exists, right?

Daniel Mohacsi (34:00):
I completely forgot about it, but yeah.

Lina Zubyte (34:03):
All right, thank you so much and bye!

Daniel Mohacsi (34:07):
Bye. Have a nice day.

Lina Zubyte (34:11):
That's it for this conversation. Thank you so much for listening and sticking with it to the very end. Kudos to you. I've compiled the notes, and the resources mentioned are there, so check them out if you want to learn more about the tools, contact Dani, or read more on the topics we spoke about. And until next time, first of all, subscribe and tell your friends about this podcast. Let me know what you think, and keep on caring and building and working on those high-quality products and teams. See you next time.