Observability, Monitoring, and Platform Engineering with Abigail Bangser

Quality Bits

High-quality products and teams: what are those? In this podcast, Lina Zubyte takes you on a journey to understand better how to create more efficient and successful tech products with excellent quality.

May 01, 2023 • Lina Zubyte • Season 1 • Episode 18

What's the difference between observability and monitoring? What is telemetry? Why do they matter, and how could we use them more beneficially?

Abby Bangser shares her journey from a QA to a platform engineer in this episode. We discuss the similarities between testing and infrastructure-related areas like site reliability engineering. In addition, Abby explains observability and shares some sound advice regarding implementing it.

Find Abby on:
- LinkedIn: https://www.linkedin.com/in/abbybangser/
- Mastodon: https://hachyderm.io/@abangser
- Twitter: https://twitter.com/a_bangser

Mentions and resources:

Abby's talk on measuring quality in production at Ministry of Testing (free sign up is enough to access it): https://www.ministryoftesting.com/talks/567372c2?s_id=14698896
Abby's course on Test Automation University - Introduction to Observability for Test Automation: https://testautomationu.applitools.com/observability-for-test-automation/
ThoughtWorks graph showing different roles and overlaps by Laura Paterson
Some more fun readings suggested by Abby:

If you liked this episode, you may enjoy this one:
Testing in Production with Rouan Wilsenach https://www.buzzsprout.com/2037134/11623978

Follow Quality Bits host Lina Zubyte on:
- Twitter: https://twitter.com/buggylina
- LinkedIn: https://www.linkedin.com/in/linazubyte/
- Website: https://qualitybits.tech/

Follow Quality Bits on your favorite listening platform and Twitter: https://twitter.com/qualitybitstech to stay updated with future content.

If you like this podcast and would like to support its making, feel free to buy me a coffee: https://www.buymeacoffee.com/linazubyte

Thank you for listening!

Lina Zubyte 00:00:07
Hi, everyone. Welcome to Quality Bits - a podcast about building high-quality products and teams. I'm your host Lina Zubyte. In this episode I'm talking to Abigail Bangser. I've seen so many talks by Abigail, and I always loved learning from her when it comes to observability, understanding better the platform side of things and how it affects quality. So in this episode we're talking about platform engineering, SRE, observability, and how they all tie back to quality and testing. And you're going to learn more about telemetry and other useful concepts. Enjoy this conversation.

00:01:09
Hi, Abby!

Abigail Bangser 00:01:11
Hello!

Lina Zubyte 00:01:12
Thank you so much for your time and for agreeing to be a guest here at Quality Bits. To introduce yourself briefly: how would you describe yourself?

Abigail Bangser 00:01:25
Well, thank you for having me, first of all. And how would I describe myself? So, work wise, I am and have always been a testing/quality related professional in technology. But for the first time, I actually have a job title that is not quality or testing related. I am a principal engineer at a small startup called Syntasso, which is trying to build an open-source framework for helping platform teams build custom platforms to support their internal developers and internal features. That comes from a long history of building internal tooling and being on platform teams and really enjoying working with other engineers as the customer for the work that I do. But obviously I'm a lot more than just work. I am also an avid sports person. I love playing sports more than watching, to be fair, and played a lot of lacrosse and basketball and soccer growing up. And today I play tag rugby around the public pitches and fields in London. Mostly these days followed up by a pint at the pub, but it's still fun to take it out and run around on my creaky bones here and there.

Lina Zubyte 00:02:43
Your career progression is really inspiring to me. So I'm thinking of this graph. I guess we can link it somewhere. So when I worked at ThoughtWorks, I saw it created by someone (who was not a QA) and there were different roles that we have in engineering teams and QA ended up overlapping with many aspects from different roles and they were in the center as a role. So always when I work as a QA, I feel like QA can grow into many different areas: so you can go more into product, you can go more into infrastructure and basically a lot of things can be testing related. And one of my friends keeps saying "Everything is testing!", which I do not always necessarily agree with. However, how do you see platform engineering, SRE and testing relating to each other?

Abigail Bangser 00:03:40
Yeah, it's funny, I actually started my career at ThoughtWorks and was originally thinking I'd join as a developer, learn how to become an engineer, and I actually got hired in as a QA in the junior role. And found myself from that point forward over my next seven years, avoiding the roles that would have pushed me closer to software engineering. So I actually thought it was quite boring to have a single role. I found that as a QA, I had this expectation and this opportunity to, I used to say, stick my hand in all the cookie jars, right? I could get involved in user testing and get involved in implementation of features like a software engineer. But also I was deeply involved with the release plans and infrastructure and obviously testing and quality and stakeholder management and all these other aspects. While it was always encouraged that software engineers would get involved with those things as well, that was absolutely auxiliary to the rest of their work, whereas it was core to my work as a QA. So from ThoughtWorks with a kind of quality mindset, I became a QA on a platform engineering team at a company called moo where we produce business cards. And that was a really interesting role for me because I was given the opportunity to work side by side with the platform engineers. You really couldn't tell the difference if you showed up in our team. Most of the organization didn't realize I had a QA title. It was quite funny, but yeah, I had what I would say is massive impact on the quality of releases there, including moving from a fortnightly release train to continuous deployment and things like that, which I think you don't think of as a tester's job.
But at the same time, look at the difference in quality of the release: we went from having these massive piles of commits, sometimes as many as 25 or 30 different commits from as many as eight different teams piling into a single big bang release, to having enabled and empowered the software developers to release their own code straight to production. It really changed the ownership model of quality and the opportunity to test, and I think it was a huge win for quality. So to me that's where I see kind of DevOps activities and platform engineering overlapping with quality. And from moo, I moved on and held an SRE title for a few years at a small startup. And with that you're talking about some of the "-ilities" that become hard for entry level QAs to sometimes advocate for, but they get spoken about a lot. Things like reliability, things like resilience to failure, things like performance. And as an SRE, your goal is to look at the quality of the system as a whole and think about what happens if any one part of the system degrades. So again, it's just looking at overall quality, overall impact to users and finding different tools in your toolbox to manage risk. And so yeah, I definitely see a lot of overlap in intention between all those roles, even if the execution might differ a bit.

Lina Zubyte 00:07:05
For me, it also was sort of an evolution because often we start in a very specifically defined role. For example, you're a manual tester, you do your tasks and then you grow into learning that there's much more and you actually could prevent issues. And when you're thinking about preventing, that's where you get into reliability, performance - how is the system defined as well? And I feel like from all this understanding, then the concepts like testing in DevOps got created or continuous testing which say you should test everywhere.

Abigail Bangser 00:07:45
Yeah, absolutely. I think when you talk about the evolution, I often describe my career in that way. It all seems quite natural in that I started with learning about automation because that's where everyone thinks everything's about code. And then I realized that I was automating a bunch of things that were maybe not built to expectation. So I started evolving into requirement gathering and analysis and moving the conversation of quality and testing further up to the left, right? We talk about moving it to the left. From there, I started realizing that it didn't matter what we built, what we talked about. If it doesn't get to our users in a productive and usable way, and quickly, then we're not getting the benefit we hoped. So I started working on things like deployment pipelines and consistency across environments and test environment creation and alignment with production and things like that. And then I started realizing that it's cool if it's in production, but if we don't know how our users are using it and the impact it's having and any risks it's running into, then what are we doing? We need that feedback loop and that's what got me deeply into observability and incident management and production area things. Every time, there was another question to ask, another technique or area of software delivery to explore.

Lina Zubyte 00:09:09
So how would you describe platform engineering? What is that?

Abigail Bangser 00:09:14
So platform engineering, in my opinion, is trying to apply DevOps principles. The principles of saying the same people who build the software should operate and manage that software's full lifecycle in production, in use by the users, managing incidents and those kinds of things. If we say that's DevOps principles - we're trying to apply those to high leverage tooling internal to the organization. And so often, when talking about platform engineering, I look at the fact that we went from this idea of throwing code over the wall, from dev to ops, realizing that that was not the most productive way of working, and introduced this concept of DevOps where the team has ownership of the code and the operations. But then we felt a lot of pain about the breadth of experience and skills the teams needed to manage that level of ownership. And so now we're trying to think about how do we go back to that specialist mentality: how do we give teams the ability to focus on the things that they are most skilled and most focused on? Without going back to that bad world of pointing fingers between ops and dev where there's slowing down of releases and those kinds of things. Then platform engineering is an evolution that aspires to take shared capabilities like databases or queues or other kinds of infrastructure, and capabilities like continuous delivery pipelines, and tries to deliver those as a service from a centralized team in the organization. And there's a lot of questions of like, isn't this just going back to dev and ops? And that's where it's really key to treat the platform with that DevOps mindset. So building and operating a piece of software and particularly adding in also that product mindset, that idea that this piece of software is not just infrastructure, it's not just code. It's actually a capability that users depend on and users have aspirations for and expectations of.
And how do you build that so that it's a joy to use and it's trusted and that the users, the users being the application developers, are keen customers of the platform engineering team?

Lina Zubyte 00:11:43
Yes, software is so many different parts and building a good product for me always associates with good practices of monitoring, for example. And I do know that you have had talks about it, and I would love to dig deeper into this. So when we talk about terms: telemetry, monitoring, observability. What are they and how do they relate to each other? Is monitoring different from observability? And what is telemetry versus a metric?

Abigail Bangser 00:12:18
Oh, you're just opening a can of worms asking about monitoring versus observability. But yes, I think there is a difference, and it is worth picking apart to a certain extent. So telemetry is often used to describe the data that you collect or that you generate. So you would think of telemetry as the log lines that are generated, the metric data that's generated. And that telemetry can be in different shapes, it can be at different scales, and then it can be used for different activities. And so this is where there can be a little bit of turf wars between monitoring and observability, though I think that's settled down a bit in recent years because some people consider observability just a marketing term to change the conversation from monitoring. But actually, I think there's a really clear difference there where it speaks to what you expect from your data and what you expect from asking questions of your data.

00:13:17
Monitoring is particularly well positioned to answer questions you know you want to ask. So let's say, you know, you want to track CPU consumption of your application. That's something that you can track with monitoring tools extremely efficiently, extremely cost effectively, and over time. But CPU is more like a symptom than a cause and it doesn't really tell you the impact, right? In an ideal world, you use up all your CPU at all times, but there are impacts to your users if you do that. So what you want to be asking is questions about new behaviors that come up unexpectedly. So in a world where users start calling up your support lines and saying that there are problems, it's not uncommon to look at your monitoring dashboard and think everything actually looks a bit average, you know, maybe a little bit higher than usual consumption, but pretty much what you'd expect. And now the question becomes, well, what's happening? Your users aren't calling for no reason.

00:14:20
And with observability, the data you're collecting, that telemetry is shaped in such a way that you can explore it to ask new questions in a really easy and approachable way. So you're using events which are basically structured logs with a little bit of extra metadata to be able to ask questions of high cardinality data. And that's data that has a lot of unique possible values. So for example, user IDs are very, very high cardinality, the most high cardinality because each will be a unique number and monitoring systems don't handle that high cardinality well because they handle that smoothing across the whole system really well. That ongoing monitoring of expected data. Observability systems are built around that high cardinality data, that unique data to help you slice and dice questions which users are being impacted, which regions of the world, which devices. These are all things that you can use to triangulate where the pain is being felt and then hopefully what that translates to in your own system. So which services are creating that pain? And then observability helps you find where the problem is. And in my experience, once an engineer knows where the problem is, it's only a matter of time until they're able to fix it. The hardest part is figuring out where to look in such a big system, that is our production systems today.
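
To make the "slice and dice" idea concrete, here is a minimal Python sketch. It is not from the episode: the event fields (user_id, region, device, service, status) and the sample data are assumptions chosen for illustration. The point is only that one set of structured events can answer several different questions, including ones keyed on a high cardinality field such as user ID.

```python
from collections import Counter

# Hypothetical structured events, one per request. user_id is high cardinality
# (many unique values); region, device and service are lower cardinality.
EVENTS = [
    {"user_id": "u-1001", "region": "eu-west", "device": "ios", "service": "checkout", "status": 500},
    {"user_id": "u-1002", "region": "eu-west", "device": "ios", "service": "checkout", "status": 500},
    {"user_id": "u-1003", "region": "us-east", "device": "android", "service": "checkout", "status": 200},
    {"user_id": "u-1004", "region": "eu-west", "device": "android", "service": "search", "status": 200},
    {"user_id": "u-1001", "region": "eu-west", "device": "ios", "service": "checkout", "status": 500},
    {"user_id": "u-1005", "region": "us-east", "device": "ios", "service": "search", "status": 200},
]


def failures_by(events, field):
    """Count failed requests (status >= 500), grouped by any event field."""
    return Counter(event[field] for event in events if event["status"] >= 500)


if __name__ == "__main__":
    # The same events answer several questions without new instrumentation.
    for field in ("user_id", "region", "device", "service"):
        print(field, dict(failures_by(EVENTS, field)))
```

In this made-up data, every failure lands on the same region, device type and service, which is the kind of triangulation described above. A real observability backend would run these group-by queries over far more events.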

Lina Zubyte 00:15:48
Would you say that once we find an issue with help of observability practices and we know about it, that we could turn it into a monitoring practice, meaning that it could become an alert, for example, which tells us next time if this happens?

Abigail Bangser 00:16:06
Yeah, I think that's quite common. There's a saying that dashboards are the scar tissue of a team and of an organization. The idea being that when you find a problem and you dashboard it because you want to be able to find it quicker the next time and you end up with often a high number of these dashboards. And you can think of this a bit like automated testing. In some ways, when you find a problem and you fix it, you sort of want to cement in the fix. And sometimes that means adding an automated test. So you have that validation on every change that nothing is going to have that same experience again. You're never going to see that same bug again. And in the observability/monitoring world, it wouldn't be uncommon to go from saying, Wow, we realized that this piece of telemetry was really useful, this piece of information was really useful. Can we start tracking that in our monitoring systems and keeping a consistent eye on it? And that's just a judgment call. And that's where back to our first conversation that... everything's a bit of testing. We need to think through the cost of maintenance of that code, the cost of maintenance of another automated test, the cost of maintenance of another dashboard or of another metric or of another alert. And depending on the likelihood of this occurring again, the impact, if it occurred again, which you can only evaluate after you've put in the fix, right, to know if that's realistic to re-occur. Then you may choose to go ahead and make an alert from that or make a dashboard. But it may be such a novel issue that you solve in such a complete way that it's okay to not result in any long term monitoring of it.

Lina Zubyte 00:17:44
Exactly. I always have a rule that I say you fix a bug, add the check. Talking now, I was like, Oh, should I also think of like add an alert? But as you say, maybe it doesn't make sense in most cases. If we fixed it, we fixed it.

Abigail Bangser 00:17:58
To be fair, one of the first things I try and do when I find issues is if it was difficult to track it down, to add telemetry, that would have made it easier to track it down. Right? So if maybe there was a gap in our logging that made it hard to pinpoint where the problem was. Going in and adding a log line or adding a trace or adding a metric maybe would help clarify the issue in the heat of that moment of identifying what's happening. And then also it would make it so that whether or not you cement it with an alert or with a test, automated test, you would be able to track it down a lot faster the next time around because you'll have easier visibility into what's going on.

Lina Zubyte 00:18:42
That's a very common problem, actually, that our logs are not specific, they're not helpful often. So the first step is to improve them so we could get some more information from it.

Abigail Bangser 00:18:55
Absolutely.

Lina Zubyte 00:18:57
What are the benefits of having good observability practices in place?

Abigail Bangser 00:19:03
Yeah, I think with observability, one of the concepts that gets talked about is that the more observable your system is, the more democratic the ability to maintain it becomes. So when things are kind of cryptic and hard to track down and maybe require a lot of permissions to jump around different systems and get visibility to things, only the most senior engineers, both in tenure at the organization and seniority in the industry, really have a chance at identifying a problem. But with observability, you're giving those tools and chances to the whole team. And so one of the conversations that I have a lot when trying to talk about bringing observability to a product or to an organization is about making sure that whatever tools we choose to introduce, we choose to introduce those in all environments, because a more junior engineer may not spend much time debugging production, but if they're using the exact same tools to debug their local or their development environments, then if and when they are called in to help with debugging production, they will at least be given a fighting chance to have an impact and to be able to identify a problem. So yes, I think empowering your engineering team and spreading out the ability to support software is one of the really big benefits of observability.

Lina Zubyte 00:20:37
Very often improving our systems is a job that is not very tangible in the sense. It's hard to see it as a feature which you deliver and you realize immediately. How could we sell this idea to stakeholders at the company? And maybe not always we have to sell it either. Is every company ready for observability? Maybe there's something they should use first before they approach that step?

Abigail Bangser 00:21:05
Yeah. So I think every organization is ready for a certain level of investment in observability. As you say, each organization will have different needs and different, you know, abilities to spend time on this kind of thing. But I used to, when I was more heavily involved in kind of test automation and that side of things, I used to fight quite hard to get an exemplar test at each of the kind of levels that I thought were important for the product. So if I thought that the product would benefit from API level testing or from UI level testing or integration type mock testing, I would try to make sure we had exemplar tests and a framework in place for that type of test, even if it wasn't widely used. And the reason I did that is because I found that often engineers had the right idea on what should be tested when and where and how, but they also had a lot of pressure around delivering a feature. So if the framework didn't exist to support that type of test, they'd sort of budget and figure out some other way to do that, even if it wasn't the best way. And I think the same thing goes for observability.

00:22:13
If your tooling only produces telemetry that supports monitoring, that has that low cardinality data, things that are more equivalent to true/false or small subsets of possible data points, then it's going to be hard to ask people to think about what are the more detailed pieces of information that they may need to debug in production or to debug their software in any environment. So getting in the ability to introduce trace data or high cardinality data, even if you don't then go through and scatter it throughout your whole product and your whole application suite, means it will start to naturally grow and be available to you as and when you find opportunities. So that's where I would say getting started with that: adding in OpenTelemetry libraries, adding in structured logging instead of a more traditional kind of string based logging, these things could make a big difference.
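
As an illustration of structured logging versus string based logging, here is a minimal Python sketch. It uses only the standard library logging module rather than OpenTelemetry, and the logger name, the "fields" key and the sample values are all assumptions made up for this example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object."""

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        # "fields" is a name chosen for this sketch; values passed through
        # `extra=` are attached to the record as attributes.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream):
    """Return a logger that writes one JSON line per event to `stream`."""
    logger = logging.getLogger("structured-example")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def emit_example():
    """Log one event and return the emitted line."""
    stream = io.StringIO()
    logger = build_logger(stream)
    # String based equivalent: logger.info("payment failed for u-1001 in eu-west")
    logger.info(
        "payment failed",
        extra={"fields": {"user_id": "u-1001", "region": "eu-west", "status": 500}},
    )
    return stream.getvalue().strip()


if __name__ == "__main__":
    print(emit_example())
```

The difference is that user_id and region arrive as separate, queryable fields instead of being buried in a sentence that someone has to parse later.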

00:23:15
But as for selling it, I think one of the SRE topics that became popular after the SRE book from Google is the concept of Service Level Objectives, also called SLOs. They're sort of an internal view in comparison to what a lot of us will have already heard of as SLAs or Service Level Agreements. And the only difference between an agreement and an objective is that an agreement has a punishment if it breaks. So you might say to your users that you promise them 99% uptime, 95% uptime, whatever it is that you promise, and if you don't deliver it, you'll give them back a portion of their fees. With a service level agreement like that, you don't want to find out after it's been breached because now you have a penalty to pay back to your users. You as an organization want to know before that happens. So you might set an objective, which would be an internal facing service level tracker, that is a few percentage points stricter. So let's say you promise 95% uptime. You may set your objective to 98% uptime. And that way you have a little bit of a buffer from when that objective gets breached to when you would have to be paying out on the agreement. These objectives are really powerful because they help to showcase to an organization when you are tempting fate with the quality of your software. And they're very helpful also for the team to have confidence to take some risks. If you're trending at 99 or 100% uptime and your agreement is only 95%, it's okay to take a few risks to push the product forward, introduce new features. But if you're sitting at 96, 97, you've broken your objective, you're trending very close to having to pay out customers. Now, this is when you need to be paying down any costs associated with reliability, resilience, observability, things that will help you maintain the system at the agreed levels of service.
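
The buffer between the objective and the agreement can be worked through with a little arithmetic. The Python sketch below takes the 95% agreement and 98% objective from the example above; the 30-day window, the function names and the status labels are assumptions for illustration only.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes in the assumed window


def allowed_downtime_minutes(target_percent, window_minutes=MINUTES_PER_30_DAYS):
    """Downtime the window can absorb before the uptime target is missed."""
    return window_minutes * (100.0 - target_percent) / 100.0


def uptime_percent(downtime_minutes, window_minutes=MINUTES_PER_30_DAYS):
    """Uptime over the window, as a percentage."""
    return 100.0 * (window_minutes - downtime_minutes) / window_minutes


def service_level_status(uptime, slo=98.0, sla=95.0):
    """Classify uptime against the internal objective and external agreement."""
    if uptime < sla:
        return "sla_breached"   # penalties are now owed to customers
    if uptime < slo:
        return "slo_breached"   # early warning: invest in reliability
    return "within_slo"         # room to take some delivery risk


if __name__ == "__main__":
    for downtime in (600, 1000, 2500):
        uptime = uptime_percent(downtime)
        print(downtime, round(uptime, 2), service_level_status(uptime))
```

Under these assumptions the objective allows 864 minutes of downtime in the window and the agreement allows 2,160, so breaching the objective still leaves roughly 1,300 minutes of warning before penalties apply.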

Lina Zubyte 00:25:26
Yeah. So it's basically using data to help us prove that it's worth it.

Abigail Bangser 00:25:34
Yes. As always, you've managed to say what I've said so much clearer and so much more effectively.

Lina Zubyte 00:25:43
So you've mentioned that we could include examples of how to do things. For example, if we want a certain level of a test, let's have it as a base so that people could build on top of that. What about observability and monitoring practices? I assume also we could start by building some kind of skeleton so that people could build on that. Are there any other tips that you would have for people to start understanding how their systems work better and improve monitoring or observability?

Abigail Bangser 00:26:19
Yeah. As you summarized, I think getting some of the auto instrumentation tools that are becoming quite mature and popular these days, like OpenTelemetry, in place helps a lot. And if you don't have access to a user interface to explore that data locally and in your development and test environments, getting that in place would be highly beneficial. But let's say you're already in that place, you've got all the tools, you've got all the data, then what I would say is start asking questions of your data. Start doing exploratory sessions where instead of asking questions like what would be the experience of an onboarding engineer, or an onboarding user, in our application, start asking questions like: What would be the experience of trying to find the most popular user journey in our application using only the telemetry data? And when you start asking these questions, you start to gain insight into where your data is strongest, where it's weakest, what kinds of questions you can ask, what limitations you might find, and you may find some quick wins of places where you can add more fidelity, more data. You also may just gain a lot of experience and confidence to start jumping in to things like incident management, where you want to try and identify the impact of an incident and you will have that experience of going in, looking at the data, tracking by customer and things like that.
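
One way to try the "most popular user journey" question is sketched below in Python. The event shape (session ID, sequence number, step name) and the sample data are assumptions for illustration; real telemetry would need whatever session or trace identifier your own events carry.

```python
from collections import Counter, defaultdict

# Hypothetical telemetry: (session_id, sequence_number, step_name).
EVENTS = [
    ("s1", 1, "home"), ("s1", 2, "search"), ("s1", 3, "product"),
    ("s2", 1, "home"), ("s2", 2, "search"), ("s2", 3, "product"),
    ("s3", 1, "home"), ("s3", 2, "cart"),
]


def most_common_journey(events):
    """Return (journey, count) for the most frequent ordered step sequence."""
    steps_by_session = defaultdict(list)
    # Sort so each session's steps are appended in the order they happened.
    for session_id, _, step in sorted(events, key=lambda e: (e[0], e[1])):
        steps_by_session[session_id].append(step)
    journeys = Counter(tuple(steps) for steps in steps_by_session.values())
    return journeys.most_common(1)[0]


if __name__ == "__main__":
    journey, count = most_common_journey(EVENTS)
    print(" > ".join(journey), count)
```

If the events lack a session identifier or an ordering field, this question cannot be answered at all, which is exactly the kind of gap such an exploratory session is meant to surface.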

Lina Zubyte 00:27:54
That sounds wonderful. Just that often the reality I see in the companies, maybe that's my bad luck, is that monitoring is there.... So, for example, there's lots of data. We do have telemetry. We may be able to use this data. However, we're drowning in noise. The dashboards likely are red. How can we improve from this situation and somehow maybe put more importance to the topic?

Abigail Bangser 00:28:25
Yeah, so that's definitely a problem with just quantity of data. So that's some of the stuff, first of all, that you can ask yourself in those exploratory sessions, like what data in here is just not useful and can we remove it? So I've identified lots of places where a log line was added into a loop as a part of local debugging and never removed again. And so in production we have all these log lines coming in, sometimes gigabytes a day, that are just absolute noise once you get past that early development phase. So definitely having a look and asking those questions can help. But you raise a really good question as well about alert fatigue. And I think that is something where, again, a lot of these experiences as an SRE and in production support overlap and correlate with my experiences in QA and testing. I've definitely been a part of organizations and teams where the CI pipeline has tests failing all the time. People just say, run it again, run it again, even building retries into the testing framework itself so as to try and reduce that noise of failure.

00:29:39
And one technique I've used in both test automation and alerting is to just quarantine noisy alerts or tests. Take anything that's just often going red and often being noisy and push it to the side so that you create, even if it's only a very small group of automated things, alerts and tests that you can trust implicitly. You don't have to sit there and say, Oh, it's probably flaky. No, if something goes red in that group, you know there's a problem. Now that may not be a representative group of the information you care about for your product. So once you have that group of stable tests and alerts, then you need to do some evaluation on whether that's actually providing you value and traceability of the impact to your users, and start prioritizing the flaky ones to bring them back based on their value to your users. So yeah, I would say getting yourself to a point where you're no longer fatigued is key. Then worry about the quality of coverage, because at that point you at least have some sanity regained and hopefully can apply that new breathing space to making sure you have the right things covered.
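
The quarantine idea could be sketched as below. This is only one possible heuristic, not something described in the episode: it measures how often a check flips between passing and failing across recent runs, and the 0.2 threshold, the names and the run histories are all assumptions.

```python
# Hypothetical recent results per test or alert: True = green, False = red.
HISTORIES = {
    "test_login": [True] * 10,
    "test_checkout": [True, False, True, True, False, True, False, True, True, False],
    "alert_disk_full": [True] * 9 + [False],
}


def flip_rate(history):
    """Fraction of consecutive runs where the result changed."""
    if len(history) < 2:
        return 0.0
    flips = sum(1 for a, b in zip(history, history[1:]) if a != b)
    return flips / (len(history) - 1)


def split_trusted_and_quarantined(histories, max_flip_rate=0.2):
    """Keep stable checks in the trusted group; set noisy ones aside."""
    trusted, quarantined = [], []
    for name, history in sorted(histories.items()):
        if flip_rate(history) > max_flip_rate:
            quarantined.append(name)
        else:
            trusted.append(name)
    return trusted, quarantined


if __name__ == "__main__":
    trusted, quarantined = split_trusted_and_quarantined(HISTORIES)
    print("trusted:", trusted)
    print("quarantined:", quarantined)
```

Note that alert_disk_full stays in the trusted group even though its latest run is red: it changed state once rather than flapping, so a red result there is worth acting on.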

Lina Zubyte 00:31:02
Yeah. Step by step. So let's imagine we're in an ideal world. We have all the practices of observability in place. An issue happens in production. What would you look at first? How would you debug it and find the root cause of it? Just a hypothetical scenario.

Abigail Bangser 00:31:25
Ah, just hypothetical. Thankfully my pulse hasn't quite risen to those heights, since it's just hypothetical. But I guess, first of all, I should say that I would lean away from using the term root cause. This is a particularly hot topic in the resilience engineering space: the fact that there's often not just one cause, there's often a bunch of things that go wrong at just the wrong time, and they create that kind of perfect storm of things, and you can't really blame any one of them independently. But the question you ask still very much stands: what do you do when something goes wrong and you need to take action in production? And I think that it's key to keep track of the phases of an incident. So identification is one phase. How did you know something was going wrong? Was it a manual alert in the sense that customers started calling in or an engineer spotted something? Or was it an automated alert? Once you've identified that there's been an issue, you're then looking at trying to understand and triage the impact of that problem. How broadly is it impacting your system and your users? How deeply? Is there a workaround available? You're looking for these kinds of impacts to understand the severity of the problem and who you need to be in touch with about the problem. Is this a global outage or can you get on the phone to one customer in particular about the problem at hand? Once triaged, what you're looking for is trying to get people back online as quickly as possible. So you're looking to mitigate the problem. This is often not an ideal situation. For example, I've been in situations where the mitigation is to restart the service on a regular cadence. Like every hour we restart the service because there's some sort of a memory leak and as long as the service gets restarted, the memory gets reset and everything is healthy until we can figure out where and what that memory leak is to have a long term fix.
Once stable and mitigated, that's when you start worrying about that long term fix and that real cementing of quality back in. So things like adding in those automated tests, adding in more telemetry data to make it easier to identify, and obviously fixing the problem in the software or the infrastructure that caused the impact to begin with. And it's that sort of cycle of going from identification to triage to mitigation to solution that you then can use as a basis for any kind of retrospective on the incident. Can you identify it faster and easier? Can you triage it faster and easier, mitigate it faster and easier and then solve it? So that would be sort of the high-level process of it.
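
For a retrospective, the cycle from identification to triage to mitigation to solution can be turned into simple per-phase timings. The Python sketch below is illustrative only: the milestone names and the timestamps are invented, and "started" is an assumed extra milestone for when the impact began.

```python
from datetime import datetime

# Ordered milestones of an incident; names are assumptions for this sketch.
MILESTONES = ["started", "identified", "triaged", "mitigated", "solved"]

# Hypothetical incident timeline.
TIMELINE = {
    "started": datetime(2023, 5, 1, 9, 0),
    "identified": datetime(2023, 5, 1, 9, 12),
    "triaged": datetime(2023, 5, 1, 9, 30),
    "mitigated": datetime(2023, 5, 1, 10, 15),
    "solved": datetime(2023, 5, 2, 14, 0),
}


def phase_durations(timeline):
    """Minutes spent between each pair of consecutive milestones."""
    durations = {}
    for earlier, later in zip(MILESTONES, MILESTONES[1:]):
        delta = timeline[later] - timeline[earlier]
        durations[f"{earlier}_to_{later}"] = delta.total_seconds() / 60
    return durations


if __name__ == "__main__":
    for phase, minutes in phase_durations(TIMELINE).items():
        print(f"{phase}: {minutes:.0f} min")
```

Comparing these numbers across incidents gives the retrospective something specific to ask: did identification, triage or mitigation get faster since last time?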

Lina Zubyte 00:34:18
I really like your comment on a root cause. I've never heard of that before, that it's such a heated term.

Abigail Bangser 00:34:25
To be fair, it's not so dissimilar to the idea of whether automated testing is testing, but in the resilience engineering community. So it can get quite deep, quite fast. I'll try and send you some good readings on the topic just to get it kind of into the mindset of people who are much deeper studied in that than I am.

Lina Zubyte 00:34:46
Yeah, it's always, you know, you start learning more and more and it becomes more complicated because you're like, oh, maybe that's not that, actually.

Abigail Bangser 00:34:56
Absolutely.

Lina Zubyte 00:34:57
I really liked that in one of your talks, you said that we test in production or we look at observability as well because most of the customers won't reach out to us. Right? So the support reports are already just a glimpse of what our users are experiencing. So when it comes to observability practices and discovering issues that maybe we did not know about, what is the most, maybe, fun or best issue that you can remember in your career that you found using observability?

Abigail Bangser 00:35:35
Well, that's a very good question. So possibly a good example might come from my time when I was working at MOO, where we did like a physical product that required shipping, and one of the teams that invested the most into bringing observability into their tech stack was the team that owned shipping. So the team that actually managed once the printing finished with the business cards and other products, and they were all boxed up and they were ready to get from our warehouse to the customers: how do we choose where to ship things or how do we choose how to ship things? So how do we choose which company to pay to ship that product? And we need to obviously balance cost and efficiency for that. So always using DHL or FedEx would obviously be a high cost, but often a very fast pace of delivery, versus using some of the more public offerings like USPS in the U.S. or Royal Mail in the UK or elsewhere, which might be a slower rate, but also often a lot cheaper. And so the team used the data associated with observability to identify not only the costs and the impacts of shipping dates from the different companies in order to try and optimize that, but they were also actually able to track the consistency or the uptime of those companies. So a lot of times those companies' websites or APIs would go down for periods of time and they were able to identify which ones were more likely to be available, how long they would think that those third party systems might be down for based on what we call yesterday's weather, like the experience they've had with them in the past, and whether or not they could wait to ship those products until that API came back up, or they might need to spend more money to ship it with a faster, more expensive company while waiting for the target company to be ready to be used.
So it's that sort of data where we're able to break down things by high cardinality data, like which shipping companies we're using, that gives us even product insights and not just technology insights into our running systems.
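
To make the idea of breaking telemetry down by a field such as the shipping company concrete, here is a small Python sketch. It assumes each API call to a carrier was recorded as an event with a `carrier` name and an `ok` flag, computes availability per carrier from that history (the "yesterday's weather" idea), and picks the cheapest carrier that has been reliable enough. The event shape, the carrier names, the costs, and the threshold are all hypothetical; the conversation does not describe how the team at MOO actually implemented this.

```python
from collections import defaultdict


def availability_by_carrier(events):
    """Return the fraction of successful API calls for each carrier."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for event in events:
        totals[event["carrier"]] += 1
        if event["ok"]:
            successes[event["carrier"]] += 1
    return {carrier: successes[carrier] / totals[carrier] for carrier in totals}


def cheapest_reliable_carrier(availability, cost, min_availability):
    """Pick the lowest-cost carrier whose past availability meets the threshold."""
    candidates = [c for c, a in availability.items() if a >= min_availability]
    if not candidates:
        return None
    return min(candidates, key=lambda c: cost[c])


# Invented sample events and prices, for illustration only.
events = [
    {"carrier": "carrier_a", "ok": True},
    {"carrier": "carrier_a", "ok": True},
    {"carrier": "carrier_a", "ok": True},
    {"carrier": "carrier_a", "ok": False},
    {"carrier": "carrier_b", "ok": True},
    {"carrier": "carrier_b", "ok": True},
]
cost = {"carrier_a": 3.0, "carrier_b": 9.0}

availability = availability_by_carrier(events)
choice = cheapest_reliable_carrier(availability, cost, min_availability=0.9)
print(availability, choice)
```

Lowering `min_availability` expresses the trade-off from the example: accept a cheaper carrier that is sometimes down, or pay more for one that has been consistently available.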

Lina Zubyte 00:38:09
That's a great example. So to wrap up this conversation, what is the one piece of advice you would give for building high quality products and teams?

Abigail Bangser 00:38:22
Oh, there's always so many things and it's always so contextual. But I think one of the things I spoke about recently with some people is about how having a high trust, like high psychological safety, environment can be a really important underpinning to any successful organization and product and team. And where this often comes up is around actually internal dynamics, the ability to work together. And I think that shouldn't be underestimated. It's extremely important. But where this actually came up in a recent conversation was around how having that high psychological safety as a product engineer or a test engineer or anyone in the product team allows you to be confident going into sometimes difficult conversations with customers. It allows you to go into those conversations and be vulnerable and potentially encourage and hear some tough feedback without being worried about your position on the team, your job, and basically your value to the company. And unless you're willing to sit in those hard conversations and hear that hard feedback, you'll never really be able to grow your product, because people will learn that, you know, you just try and brush away any hard feedback or explain away any issues that they're feeling, and they just stop bringing them to you. So I think that kind of confidence, psychological safety within a team, can have rippling effects on the way in which you manage your product over time.

Lina Zubyte 00:39:58
Thank you, Abby. I truly enjoyed our conversation.

Abigail Bangser 00:40:01
Absolutely. Thank you so much for having me today.

Lina Zubyte 00:40:05
Thank you so much for listening. If you like this episode, please rate it and share it with your friends. You're going to find all the useful information and where to find Abby in the notes. And until the next time, do not forget to continue caring about and building those high quality products and teams. Bye.
