<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Prezi Engineering - Medium]]></title>
        <description><![CDATA[The things we learn as we build our products - Medium]]></description>
        <link>https://engineering.prezi.com?source=rss----911e72786e31---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>Prezi Engineering - Medium</title>
            <link>https://engineering.prezi.com?source=rss----911e72786e31---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Fri, 13 Mar 2026 04:27:03 GMT</lastBuildDate>
        <atom:link href="https://engineering.prezi.com/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[We Tried Spec-Driven Development So You Don’t Have To]]></title>
            <link>https://engineering.prezi.com/we-tried-spec-driven-development-so-you-dont-have-to-56d52231c19e?source=rss----911e72786e31---4</link>
            <guid isPermaLink="false">https://medium.com/p/56d52231c19e</guid>
            <category><![CDATA[coding]]></category>
            <category><![CDATA[productivity]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[software-engineering]]></category>
            <dc:creator><![CDATA[Attila Vágó]]></dc:creator>
            <pubDate>Mon, 16 Feb 2026 09:07:53 GMT</pubDate>
            <atom:updated>2026-02-16T12:04:53.762Z</atom:updated>
            <content:encoded><![CDATA[<h4>We threw spec-driven development at four teams, and the results are both terrifying and exciting…</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5yu8c9w2_dNQvYS1Bt70lw.jpeg" /><figcaption>Istanbul, just like software development, has gone through a lot of change over the years. Photo by author.</figcaption></figure><p>Istanbul is a city of nearly 16 million people. Just switching metro lines can take as long as 20 minutes through a maze of tunnels and escalators. I got my 10,000 daily steps just walking through the airport and Istanbul metro tunnels on my way to the hotel. Taking a photo from the 31st floor, the sheer size of the city becomes evident. Also daunting. Akin to a 15-year-old legacy codebase that has been touched by all the tech hypes and lows of the last two decades of software development. It’s called evolution. Istanbul throughout history has changed and evolved many times, and interestingly enough, code and software development practices aren’t any different. They also evolve, and I wanted to give us the chance to do just that at our latest off-site event in Turkey. This time by introducing a paradigm-shifting new approach — spec-driven development.</p><h4>What is spec-driven development, and why do we need it?</h4><p>Need is probably a strong word for most things in software development. Apart from maybe security. Even something like SSL on a static brochure site is, you could argue, overkill. We never needed JavaScript frameworks or libraries, but we have them, and they often help us get there faster and deliver more complex applications. We never needed CDNs, but they do help load our apps and sites faster. We never truly needed Agile, but in far too many cases to ignore, it helped us build software differently and act on user feedback faster. 
I believe the same applies to spec-driven development.</p><p>Spec-driven development itself is an emerging term, and as such its definition tends to be a bit fuzzy. The bottom line, however, isn’t, and folks at Kiro (Amazon), GitHub, and other industry leaders agree.</p><blockquote>With spec-driven development, your specification becomes the source of truth rather than the code.</blockquote><p>That is a massive paradigm shift not only for engineers, but also for anyone adjacent to engineering teams. But it’s not as exotic an approach as one might think. We’ve — sort of — done this before. To cope with the initial shock of <em>“code is not the primary artefact anymore”</em> you can, to an extent, find similarities with TDD (test-driven development), BDD (behaviour-driven development) and even an oldie but goodie — MDD (model-driven development). I might even go as far as to state that spec-driven development is the culmination of TDD, BDD and MDD in an AI-fuelled engineering org.</p><p>What you call a spec is really just documentation that’s useful to both machine and human. Call it a good prompt if you like. How many times have you sat through endless hours of planning and refinement sessions? How many times have you opened a Jira or Linear ticket only to find practically no useful information, prompting you to ping everyone but your mum about what actually needs to be done? How many companies call themselves Agile and haven’t written a single user story in the <em>”I as a user… when… then…”</em> format in 10 years? Exactly.</p><h4>Quite the paradigm shift when applied in practice</h4><p>Spec-driven development changes that. Radically. It might seem at first like “change for change’s sake”, but once you give it a try you’ll — funnily enough — start understanding the aspects that made TDD, BDD, and yes, even MDD interesting and popular, at least for a while. That was the whole point of hosting the workshop. Make ourselves think differently. That’s not just an Apple tagline. 
You really do sometimes need to put yourself in a position that triggers a different way of thinking.</p><blockquote>I find that letting people build things they’re truly passionate about allows them to explore new ideas better, adopt new processes more organically, and be more productive.</blockquote><p>The last thing I wanted was to make it all seem like a forced mandate. I’ve lived through a few “from today we’re an Agile company” workshops. As fun as they <em>can</em> be — especially if you’re doing them with <a href="https://www.linkedin.com/in/danolsen98/">Dan Olsen</a> — the mere directive that it’s now the way forward for everyone, no discussion allowed, not only dampens the excitement but breeds suspicion and skepticism.</p><p>While we had a dedicated room for the workshop, I encouraged people to try spec-driven development with spec-kit anywhere they felt like they could be their most creative selves. We are a fully remote company anyway. One of my colleagues felt it was best to spend some quiet time in his room. Four hours later, he emerged with a Workout Boss app! Others were done in just an hour or so, which brings me to how this all works.</p><p>For the purposes of this article and some 1:1 workshop sessions I’ve had with non-technical folks over the last week, I slimmed it all down to a 6-step process:</p><ol><li>Define the constitution — essentially a prompt that fills in the constitution.md template with mostly technical constants/constraints of your application. Ideally, you don’t touch this often.</li><li>Specify is where you need a lot of help from your product manager, or you need to act like one. The more detailed the information, the better. Ideally, this is not just a few sentences, but many long paragraphs. The output will be in the spec.md file in the form of “given &gt; when &gt; then” stories, all prioritised. This step alone could save tons of time for product managers. 
Whether you want those stories to be automatically delivered to Jira or Linear is of course up to you. I think in the future, just like code, stories will be far too ephemeral to be worth saving in another tool, and they’re in your source code anyway—because part of your source code now is also the spec with all the stories!</li><li>Self-solve anything that requires clarification. For brevity’s sake, you can use this step, especially if you’re just prototyping.</li><li>Plan. This sort of rounds out your strategy for the app, and some engineering input is again useful/necessary. Here’s an example prompt:<em> I am going to use plain React.js with no databases, data is embedded in the content for the mock content. Site is responsive and fully ready for mobile. Also, available in multiple languages like English, Hungarian, and Finnish.</em> It will trigger some changes in the spec, and that’s OK.</li><li>Break the effort down into tasks. At this point, it’s all about implementation. You’ll see several tasks as you would in a nicely broken-down story. It’s kind of funny to see AI do a much better job than us at this stuff.</li><li>Implement. Sit back and maybe hit approve a few times when Cursor wants to run some commands.</li></ol><p>It was very intriguing to see some of my colleagues get practically first-time exposure to just how much AI can do in software development. I was quite surprised to see just how many people hadn’t integrated AI into their day-to-day development work as much as I had. Some tried changing things after the initial app was built. Another team-mate of mine migrated her entire app to Material UI in two minutes, and while you could argue it wasn’t a large application, it did replicate solving a real-life engineering challenge. In a different company, seven frontend teams did that over nearly 12 months!</p><blockquote>With spec-driven development, your code merely becomes the output of your work. It’s like the rendered MP4 file in a video project. 
You want a change? You edit the project, not the pixels in the rendered video.</blockquote><p>And that’s quite tough to swallow at first. What do you mean, code isn’t the centre of the universe anymore? Review markdown files instead of code? What sort of nonsense is that, right? Well, it’s not, and let me bring TDD back into the mix. As a concept, I always liked it. The problem was never TDD; the problem was that Product often doesn’t provide us with enough information, in a structured enough manner, to easily write the tests upfront. But writing a spec that results in acceptance criteria, which you can then verify with tests, is a form of TDD. The difference is that now all those requirements are met with machine-generated code and verified by machine-generated tests.</p><h4>A mechanism for change</h4><p>I am of the opinion that if you throw the wildest idea out there, you’ll always find someone who will attempt to make sense of it. Visionaries, and people who act rather than overthink, frequently do this. As <a href="https://medium.com/u/26e121e22f50">Manuela Olivero</a> puts it: <em>“</em><a href="https://medium.com/@manuelaolivero/why-smart-people-often-dont-succeed-e2212e2e36b4"><em>They don’t start with answers. They start with the assumption that answers are findable.”</em></a> And I strongly believe that throwing spec-driven development at your team or even your entire engineering org will produce change and initiate the right kind of conversations. Let me show you a few interesting ones that came up in our teams.</p><ul><li><strong><em>A potential spec-scaling problem over time and project size.</em></strong> Yes. Creating a mess in Markdown is just as easy as it is in any other language. The good news is, you can apply spec-driven development at all levels.</li><li><strong><em>Nice development stages, good experience overall.</em></strong> It turns out that engineers do like processes when they make sense. 
Who would have thought?</li><li><strong><em>Works great with Cursor</em></strong> (we had people in the room for whom this was their first time using Cursor), but folks did run into issues using other tools. Running out of Claude Code tokens was the most common issue. On Cursor, we didn’t have that problem, as it switches between models based on what’s more appropriate.</li><li><strong><em>It may be frustrating for non-technical people.</em></strong> Indeed, you might find that some targeted onboarding is useful for less technical folks. Using spec-driven development itself isn’t technically challenging; it’s more the tooling they’re not familiar with (think Git, Xcode, Node, etc.). Maybe we should make it a rule in tech companies that everyone gets these tools installed by default on their machines. Even the CEO.</li><li><strong><em>For some work, it feels like overkill going through all the stages, reading all the markdown files.</em></strong> This is not untrue. I would argue, however, that this feels overwhelming because we’ve all gotten so used to long meetings and back-and-forth in Slack to get all the information for a story or a task that we have forgotten what a properly written epic should look like, and at the initial stage your spec might be an entire epic. All subsequent changes will be much smaller, and one would review them as they review <em>sensibly sized</em> code PRs.</li><li><strong><em>A good framework for decision-making.</em> </strong>And I must agree. It feels a heck of a lot less chaotic: it establishes stages and a shared lingo, and removes friction.</li><li><strong><em>It generates a lot of things very quickly; it can be overwhelming.</em></strong> And that’s AI-driven development for you, especially on a greenfield project that you’re just starting. This is something we’ll have to learn to manage.</li><li><strong><em>Interesting, it felt like being a PM!</em> </strong>This, coming from a developer. 
And I love that because it proves that being T-shaped can go both ways. It empowers an engineer to act like a product manager, and it also empowers a PM or a designer to act like an engineer.</li><li><strong><em>Very curious where it goes beyond the workshop and in general.</em></strong> And this is actually quite important. Spec-driven development is, for all intents and purposes, in its infancy, and I myself am very curious to see what this will look like a year or two from now. Kiro has it built into their IDE. Spec-kit is essentially two folders in a project. The possibilities are endless, though, and we’re already looking at how this could be used in our CI — at least in an experimental form.</li><li><strong><em>How do I protect code changes?</em></strong> You don’t. At least that’s the intent. While you can touch the code, it is an antipattern, and if you’re planning to measure adoption, this is something you should pay attention to. If developers default back to code-first development, spec-driven development becomes largely pointless. Just like writing features first in TDD.</li><li><strong><em>Love the </em></strong><strong><em>constitution.md file. Great concept.</em></strong> I can only agree. It gives a sense of security and a stable baseline. It also becomes part of the application’s “memory”, so it’s something it will continually refer back to when you change the spec.</li><li><strong><em>Many thousands of tokens hurt.</em> </strong>There’s an investment cost. No doubt about it. I would say, though, that this also depends on how you use tokens. As stated before, those of us on Cursor tended not to run out of tokens, while those using Claude Code did. Not every step needs the same model. That said, the cost of tokens is also coming down. 
Spec-driven development may burn through tokens, but if a $200 monthly subscription doubles productivity, that’s still infinitely cheaper than hiring another you.</li><li><strong><em>Concerns around it being fit for something beyond a prototype.</em></strong> Spec-driven development can be used on existing projects. <a href="https://youtu.be/SGHIQTsPzuY?si=7FbywyvZGw1Z42sm">Watch Den Delimarsky’s video</a>. That’s actually the most important benefit of it. Otherwise, I myself would call it just a smarter scaffolding tool.</li><li><strong><em>Some commands can be redundant. The first two stages basically got everything done.</em> </strong>This is something I noticed as well on small projects. The more elaborate my spec was, the more it stuck to the defined stages. I also found that following the steps regardless did sometimes produce changes that would later be useful.</li><li><strong><em>Better than vibe coding! Great at iteration! Great for exploring! Forces you to think about the actual product you build. Works really well with TDD. </em></strong>Most definitely, and I think spec-driven development is precisely the kind of guardrails vibe coding needs to produce production-ready output.</li><li><strong><em>The </em></strong><strong><em>/clarify step felt very useful.</em></strong> Not a step I, personally, needed, but it is there, and some people made good use of it.</li><li><strong><em>Managed to get it stuck.</em></strong> No software without bugs, right? But it is easy (low cost) to reinitialise the project.</li><li><strong><em>Not convinced how well it works at scale, modularising might be needed.</em></strong> Not a negligible remark at all. It remains to be seen how it works in different kinds of projects. Do you have a spec for the whole app? Do you have one per feature? One for frontend? One for backend? 
Best practices have to be developed over time, and the only way to develop them is to try different approaches.</li><li><strong><em>Spec-kit itself feels like jumping in at the deep end of spec-driven development.</em></strong> Also a valid statement. Some will argue that Kiro is a smoother avenue, at least as a first look at spec-driven development. The reason I still vote for spec-kit is that it has potential beyond the user’s machine, so it’s relatively easily translatable to the CI; one of my colleagues has already given that a go, and I know some other companies use it in their CI as well.</li></ul><p>So, as with every tool, there are many pros and cons. But it started conversations we had never had at this scale before. It got people exploring and discovering the power of AI in their development tasks. Some people got introduced to Cursor, others started getting ideas on how to take this further — and some already did just a few days later. Spec-driven development isn’t necessarily about adopting spec-kit or a very specific set of tools, but about inspiring teams to adopt automation where and when it makes sense, to push AI capabilities to the extreme. See what works and why, see what doesn’t and why not. Iterate, measure, iterate again.</p><h4>The future is now</h4><p>A year ago, I would have told you that hitting enter, then going to meet my food delivery guy, only to come back to a fully developed feature 5 minutes later was borderline insanity. But trying spec-driven development out has real potential to change how you look at software development. It’s both terrifying and exciting.</p><p>That’s what software development in 2026 looks like. It’s very different than it was just a year or two ago, and if you’re not feeling the differences, I must warn you, you might be falling behind. We’re way past code completion. This isn’t about “tab-tab-done”. We’re living in a reality where tools like spec-kit are getting adopted, getting integrated into CIs. 
A world where developers have stopped fighting about how to write the best CSS; now we’re all thinking about how to deliver the best application for the user in the most efficient way. How to try ten things instead of two or three in a quarter. You cannot keep up with that demand doing things in a chaotic, disjointed way.</p><blockquote>Spec-driven development enabled the daredevil developer in me; every day I wake up excited to see how far I can push it without creating chaos around me.</blockquote><h4>Resources you might find useful</h4><ul><li><a href="https://www.youtube.com/watch?v=a9eR1xsfvHg">Spec-kit for new projects</a>: YouTube video.</li><li><a href="https://www.youtube.com/watch?v=SGHIQTsPzuY">Spec-kit for existing projects</a>: YouTube video.</li><li><a href="https://developer.microsoft.com/blog/spec-driven-development-spec-kit">Spec-driven development by Microsoft</a>: blog post.</li><li><a href="https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/">Spec-driven development by GitHub</a>: blog post.</li></ul><p><em>Attila Vago — Software Engineer, improving the world one line of code at a time. Cool nerd since forever, writer of codes, blogs and books. </em><a href="https://www.goodreads.com/book/show/205716390-it-s-cold-ma-it-s-really-cold"><strong><em>Author</em></strong></a><em>. Web accessibility advocate, LEGO fan, vinyl record collector. Loves craft beer! 
</em><a href="https://attilavago.medium.com/my-200th-article-hello-its-time-we-met-3f201ad1303"><strong><em>Read my Hello story here!</em></strong></a><strong><em> </em></strong><a href="https://attilavago.medium.com/subscribe"><strong><em>Subscribe</em></strong></a><strong><em> </em></strong><em>for more stories about </em><a href="https://medium.com/@attilavago/list/lego-all-the-things-083f80bd3c51"><strong><em>LEGO</em></strong></a><strong><em>, </em></strong><a href="https://medium.com/@attilavago/list/technology-tech-news-a2d2d509b856"><strong><em>tech</em></strong></a><strong><em>, </em></strong><a href="https://medium.com/@attilavago/list/coding-software-development-d123369e3636"><strong><em>coding</em></strong></a><strong><em> and </em></strong><a href="https://medium.com/@attilavago/list/accessibility-4b67c1d08ef3"><strong><em>accessibility</em></strong></a><em>! For my less regular readers, I also write about </em><a href="https://medium.com/@attilavago/list/the-random-stuff-96bfc5a222e5"><strong><em>random bits</em></strong></a><em> and </em><a href="https://medium.com/@attilavago/list/writing-writing-tips-f83ef5e79de5"><strong><em>writing</em></strong></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=56d52231c19e" width="1" height="1" alt=""><hr><p><a href="https://engineering.prezi.com/we-tried-spec-driven-development-so-you-dont-have-to-56d52231c19e">We Tried Spec-Driven Development So You Don’t Have To</a> was originally published in <a href="https://engineering.prezi.com">Prezi Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Should Software Engineers Have Good Presentation Skills?]]></title>
            <link>https://engineering.prezi.com/should-software-engineers-have-good-presentation-skills-2e1aec3240de?source=rss----911e72786e31---4</link>
            <guid isPermaLink="false">https://medium.com/p/2e1aec3240de</guid>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[communication-skills]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[passion]]></category>
            <category><![CDATA[presentations]]></category>
            <dc:creator><![CDATA[Attila Vágó]]></dc:creator>
            <pubDate>Mon, 08 Dec 2025 09:54:56 GMT</pubDate>
            <atom:updated>2025-12-08T20:44:54.038Z</atom:updated>
            <content:encoded><![CDATA[<h4>Spoiler alert: yes. But not for the reasons you might think…</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*91w7HrmspLhfb5nMWfVCIw.jpeg" /><figcaption>Sherlock Holmes doing a presentation. I mean it’s a murder-board, but when you have to explain it all, it’s like an engineer presenting on architecture. Photo by author.</figcaption></figure><p>Why did we become software engineers? Naturally, to have a great career, make lots of money, get free pizza and beer all day every day and retire early. Right? Well, that’s what the cynics would say. In reality, software engineering is one of those professions many of us got into less for the love of money — there are plenty of jobs that pay better — and a lot more for the love of coding and problem-solving.</p><p>I remember when I first caught the bug. I was just 21, and I somehow got it into my head that I wanted to understand this thing called HTML, and wanted my own website out on the internet. Thought was followed by action, and as I was literally immigrating to the UK, I read a 300-page book on how to write HTML. In an airport.</p><blockquote>My first thought was: “people get paid for knowing this?!? This is pretty simple stuff.”</blockquote><p>And sure enough, as I settled in my new adoptive home, within the week I built a website the good ol’ fashioned way with just HTML. Using tables, no less. If you think debugging nested div tags today is annoying, trust me, dealing with tables two decades ago was far more fiddly. But it didn’t matter because I genuinely enjoyed it. HTML was followed by CSS, JavaScript, PHP, C, Python, Ruby, and the rest is history.</p><h4>Making sense of it all, gaining expertise</h4><p>I don’t know how everyone else learnt software engineering, but my journey was a hot mess of self-taught courses piled on top of each other. Not because I’m disorganised, but rather because I wanted to learn everything. 
Not because I wanted to know everything, but rather because I wanted to understand what part of software engineering was interesting to me.</p><p>C felt hard-core, but having to think about memory management all day every day wasn’t something I felt overly attracted to. Pure backend development felt dry and joyless, mobile development was far from where it’s at today, and game development, for someone who was never a real gamer, didn’t make much sense. So full-stack development it was. A little bit of PHP, some HTML, some CSS, and some JS to round it off with. Oh, and some SQL because… databases are a thing when you build a website.</p><p>However, that was just the baseline. It took another couple of years to truly understand where my true passion lay. The more time I spent on the front-end, the more obvious it became — I loved working in the browser. I didn’t mind the browser wars; they were but a fun challenge. I didn’t mind the pixel-perfect styling requirements; I looked at them as useful guide wires. I didn’t mind having to make sure the page worked just as well with keyboards and screen readers; accessibility felt like the right thing to do. So I grew in those areas exponentially, and I became the engineer other teams sought out. I finally had expertise to share.</p><h4>Sharing passion</h4><p>But as I quickly found out, I wasn’t just sharing expertise. In fact, half the time, I passed on more than knowledge: I managed to get others excited about the things I was passionate about, like frontend development, accessibility, microfrontends, AI tools for developers and, increasingly, software architecture and documentation.</p><p>The vast majority of engineers shy away from the opportunity to present. And honestly, very often they have every right to, because far too often presenting is framed as a career-climbing strategy. 
Few things in software engineering convey visibility more than hosting a Zoom meeting for 100 people, walking onto a stage at a company all-hands or at a tech conference. And many engineers aren’t ladder-climbers. They want to solve problems. That’s why most of us got into engineering. Need a good example? Have a chat with Steve Wozniak or Linus Torvalds. But they did present on various occasions, and every single time they inspired people in the audience. Why? Because they shared the things they’re passionate about.</p><blockquote>Good presentations aren’t an information transfer mechanism. Their goal is to express passion, to inspire, to trigger conversations. They must have a multiplier effect. Otherwise, it’s a boring monologue.</blockquote><p>Go to any tech meetup, and you’ll find that everyone is passionate about something. Sure, there’s the odd showoff who really is just there to grow their network without having much of a clue about anything in engineering, but chances are you’ll find many who will inspire you with their passion. Whatever they just inspired you with could have been a presentation.</p><p>I see this often on LinkedIn as well. Engineers write lengthy, passionate posts and comments on all sorts of engineering topics they genuinely care about. My thoughts are always: <em>“this could have been a Medium article or a presentation.”</em> Why is it not? God only knows. Fantastic engineers spend hours and days every month creating valuable, inspiring content that gets lost in a Reddit or StackOverflow thread. Such a waste.</p><h4>Presentations aren’t boring</h4><p>Contrary to general belief, presentations aren’t really boring. I blame PowerPoints for making everyone think they are. But you do have options. While tools don’t make a pro, good tools can help you get there faster and make a bigger impact when delivering your presentation. 
Creating <a href="https://prezi.com/gallery/">a Prezi presentation</a> is one way to achieve that, but I have also seen engineers create dedicated, jaw-dropping websites — I mean, we always like building new things, right?</p><p>I have sat through incredibly boring presentations, though, regardless of what tool has been used. If I pick up on the speaker’s lack of passion, they’ve lost me within 2 minutes. If you’re not passionate about the topic, do not present, do not write about it. Nothing good will come of it.</p><p>In the Prezi Engineering organisation, we do these events called Pragma. It’s usually a 1-hour affair where an engineer presents on a topic they care about. It’s entirely voluntary and they have full control over it; another colleague and I just help organise it and provide pointers if they need them. This year, however, I decided it was time to host the mother of all Pragma sessions, and invited each of the four tech stacks to find someone who had something to present. The topic was simply: “Aha!” — sharing an “aha moment” of 2025.</p><p>To my surprise, we ended up not with four speakers, but 11, as essentially every team had someone with something inspiring to share. That was the main requirement, while nudging those who hadn’t done a presentation this year to contribute. It’s a 7-minute talk at most. Lightning talks is what some would call these. But you often don’t need more to make an impact, to inspire. It also helps speakers get to the point faster.</p><blockquote>A presentation of just a few minutes is long enough to start a conversation and light the spark.</blockquote><p>You can deliver a lot of value in just a few minutes. 
One of my most-read articles is a 3-minute read I wrote while being incredibly frustrated with CocoaPods on Apple Silicon CPUs, and once I sorted the problem for myself, I decided to share it <a href="https://medium.com/p/6abe3736c221">in the form of an article</a> rather than an obscure comment or post on social media. It wasn’t about showing off; it was about sharing a small Eureka moment. And judging by the stats, 71,000 engineers needed me to do that.</p><h4>Presentation skills aren’t the point</h4><p>The point is ultimately to find your passion as a software engineer. Once you’ve found it, you’ll become better and better at it, and you’ll start wanting to talk about it. I’ll talk about accessibility, automated testing, frontend architecture, and of course LEGO as well to anyone who’ll listen. And every so often I’ll pour that passion-led expertise into an article or a presentation. And I’m not even going to pretend I am a great presenter, because that was never the goal, or at the very least it was always secondary, and it’s something I keep refining over time, organically.</p><p>So next time someone asks you to deliver a presentation on something, don’t tell them to shove it; tell them what you’re passionate about, tell them what you want to share, the thing you would like to inspire with, and trust me — and yourself — it will be a killer presentation. Not because you have presentation skills, but because you’re sharing something you’re deeply passionate about, and that makes all the difference.</p><blockquote><em>P.S. This could have (also) been a presentation…</em></blockquote><p><em>Attila Vago — Software Engineer improving the world one line of code at a time. Cool nerd since forever, writer of codes, blogs and books. </em><a href="https://www.goodreads.com/book/show/205716390-it-s-cold-ma-it-s-really-cold"><strong><em>Author</em></strong></a><em>. Web accessibility advocate, LEGO fan, vinyl record collector. Loves craft beer! 
</em><a href="https://attilavago.medium.com/my-200th-article-hello-its-time-we-met-3f201ad1303"><strong><em>Read my Hello story here!</em></strong></a><strong><em> </em></strong><a href="https://attilavago.medium.com/subscribe"><strong><em>Subscribe</em></strong></a><strong><em> </em></strong><em>for more stories about </em><a href="https://medium.com/@attilavago/list/lego-all-the-things-083f80bd3c51"><strong><em>LEGO</em></strong></a><strong><em>, </em></strong><a href="https://medium.com/@attilavago/list/technology-tech-news-a2d2d509b856"><strong><em>tech</em></strong></a><strong><em>, </em></strong><a href="https://medium.com/@attilavago/list/coding-software-development-d123369e3636"><strong><em>coding</em></strong></a><strong><em> and </em></strong><a href="https://medium.com/@attilavago/list/accessibility-4b67c1d08ef3"><strong><em>accessibility</em></strong></a><em>! For my less regular readers, I also write about </em><a href="https://medium.com/@attilavago/list/the-random-stuff-96bfc5a222e5"><strong><em>random bits</em></strong></a><em> and </em><a href="https://medium.com/@attilavago/list/writing-writing-tips-f83ef5e79de5"><strong><em>writing</em></strong></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2e1aec3240de" width="1" height="1" alt=""><hr><p><a href="https://engineering.prezi.com/should-software-engineers-have-good-presentation-skills-2e1aec3240de">Should Software Engineers Have Good Presentation Skills?</a> was originally published in <a href="https://engineering.prezi.com">Prezi Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[There’s Two Sides To Every AI Tool Adoption Story]]></title>
            <link>https://engineering.prezi.com/theres-two-sides-to-every-ai-tool-adoption-story-ddb2118686d1?source=rss----911e72786e31---4</link>
            <guid isPermaLink="false">https://medium.com/p/ddb2118686d1</guid>
            <category><![CDATA[productivity]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[coding]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[software-engineering]]></category>
            <dc:creator><![CDATA[Attila Vágó]]></dc:creator>
            <pubDate>Tue, 28 Oct 2025 08:49:44 GMT</pubDate>
            <atom:updated>2025-10-28T10:04:10.409Z</atom:updated>
            <content:encoded><![CDATA[<h4>How AI helps engineers get more excited about software engineering again…</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SVsWn2BPKEO8dFsUfjQm_Q.jpeg" /><figcaption>Wall-E and his two new friends, Superman and Gwen Stacy. Robots can help superheroes too. Photo by author.</figcaption></figure><p>You submit your PR and then you wait. You either get that mostly useless but non-blocking “LGTM”, a very confident change request, or the dreaded comment that’s neither approving nor asking you to change anything. More often than not, it’s the beginning of yet another philosophical debate, a nitpick thread, or some question making you roll your eyes and wonder why you even became a software developer. Let’s not pretend we haven’t run into all of these at some point or another. But that’s yesteryear’s way of doing things. Now, in 2025, we have AI that can review our code, and that… changes things.</p><p>Full disclosure, I’m an AI skeptic. That does not mean I have a problem with AI or automation. Fun fact, I actually studied automation for four years. It’s fun, if you like that sort of thing. I did. So, AI to me is but another piece of automation. But like all automation, I strongly believe that it needs to make sense, it needs to make our lives easier and perform a task that has a measurable impact.</p><blockquote>If we ask humans to have a measurable impact in the workplace, we must treat our tools with the same scrutiny and expectations. That includes AI.</blockquote><p>And frankly, AI only really became useful in 2025. At least from a software engineering perspective. That usefulness changes the landscape quite a bit. Suddenly, it goes from a silly fad that gets things right sometimes, to a tool that works most of the time, or at least often enough that its impact is measurable and becomes a net positive. 
AI is finally exciting, and with that, so is software engineering.</p><h4>But selling AI is still not easy</h4><p>Across the industry, the first and biggest hurdle you’ll run into is “developer pride”. If you’re a bottom-line person, and you care about numbers and nothing more, it’s easy to dismiss, but just like in many other creative and intellectually intense professions, engineers tend to care not just about the work getting done, but how it’s done, the value they bring as humans into the work they do, and anything that threatens that can cause pushback.</p><blockquote>AI-generated code is an unmaintainable heap of mess. — every 2nd software engineer out there</blockquote><p>Translation? We believe we can do it better, and our code will outlive AI-generated code. Except that requirement is less and less the case, as I explained in “<a href="https://medium.com/gitconnected/abandonware-is-the-new-software-9e088bcf5bb2?sk=6057ed1cfc8536bd9fd63a39aca606ba">Abandonware Is The New Software</a>”. That doesn’t mean, however, that this behaviour, passed on from generation to generation, is easy to recalibrate, so no wonder many — even the curious engineers — will treat AI tools like Cursor or Copilot with skepticism, and will underutilise their capabilities. Many will refuse to go beyond code completion, and using agentic AI for <a href="https://medium.com/gitconnected/why-ai-doesnt-change-the-fundamental-truth-behind-coding-f3bd37e67b9b?sk=5cc5b92d1bf835f92d8721192744740a">conversational programming remains out of the question</a>.</p><p>Introducing tools like CodeRabbit for code reviews might even ruffle some feathers — surely a “wobot” can’t do a better job at reviewing our PRs than a human. Well, I’d argue the opposite. It often can. Partly because it has access to more information, and partly because it’s always available to review your code within minutes of submitting the PR. 
The resulting code might not be 100% there, but it makes for a much cleaner PR by the time a colleague finds a moment in their busy schedule to review it. Call me “another bean counter engineer” if you like, but saved time is saved money, and saved money can mean a healthier engineering organisation, and a growing business.</p><blockquote>Transparency, well-defined standards and expectations are all key to getting engineering organisations onboard with AI.</blockquote><p>But even if you’re OK with all the above, you’ll find the odd engineer who is afraid to use these tools — because they can feel like cheating. What will the other engineers say if they find out 90% of the code was AI-generated? Well, chances are some of them are doing exactly the same thing already. Chances are, some would like to use these tools, but are worried their peers will tell them off for doing so. Well, that’s the moment we ought to talk about AI, and educate our teams on what healthy and innovative use of AI in software engineering looks like.</p><h4>Beyond the hump of fear and disbelief</h4><p>The other half of software engineers — some quietly, some less so — will have started using AI already, and the moment you open the conversation up, it becomes really obvious how far many engineers have already gotten in their journey of AI-assisted development. They don’t just use some tools, they have compared a host of them, have tips and tricks ready, and will often even be able to present you with an ad-hoc cost-benefit analysis. For instance, if you ask me, Cursor is better than VSCode with Copilot, but Kiro is a tool worth keeping an eye on, as it might just become a favourite among product development teams. The bottom line is, many engineers are already excited about software engineering with AI, and no matter how you put it…</p><blockquote>A capable software engineer using AI will outperform one who does not. 
That is a fact.</blockquote><p>That excitement — surprisingly enough — for many of us doesn’t come from programming changing in any meaningful way. <a href="https://medium.com/gitconnected/why-ai-doesnt-change-the-fundamental-truth-behind-coding-f3bd37e67b9b?sk=5cc5b92d1bf835f92d8721192744740a">AI does not change the fundamental truth behind coding</a>, but it does change the level of effort we need to put into achieving the desired result. Someone told me once, <em>“the best programmers are the lazy ones”</em>. Just a few years ago, that would have translated into robust code that can live unchanged for years to come. Today, it gains a new meaning: achieving the best result with the least amount of effort. It’s not merely about lines of code per minute. Line-counting is silly, don’t let LinkedIn tell you otherwise. Getting there faster isn’t about typing more code, it’s about finding out where the code needs to go, what the most efficient solution is, seeing the connections and dependencies in a system without having to spend hours or days doing so.</p><p>The other day, I had to add a new property to an iFrame-based dialog in a home-grown framework of ours. I am only semi-familiar with the codebase; on average, I touch it maybe six times a year. I could have gone the old-school way: find the component, try to understand how it was developed 5 years ago and what its dependencies are, and type some code until I got no compilation errors and the dialog worked the way I wanted it to. But instead, I asked Cursor where the component was, explained what I needed to achieve with the new property and let it do the work. Five minutes later, I had a working dialog with the property I needed. I reviewed the changes, made a couple of manual edits to make the linter happy, and after all the regression tests confirmed nothing broke, I submitted the PR. A few hours later, the team owning the component library approved it, and case closed.</p><p>Was this a complex task? 
Not really. And if we’re honest with ourselves as software engineers, we don’t solve complex tasks all the time. Typically, we actually don’t. And even when we do, it’s often the messy code — and it wasn’t AI that made it messy, it was us — that makes it complex. All that to say that…</p><blockquote>The argument that AI cannot generate complex applications holds very little water, if any at all. That was never the requirement. Not before AI, not after AI.</blockquote><p>MVPs by definition are meant to be simple, and if you’ve done Agile development for more than a day, you already know: everything else gets bolted on as an epic, a story, a task — and that last one is what you tell AI to help you with. Even if the likes of Kiro stand by their promise and deliver apps from a set of requirements, the stages and steps of building a piece of software up from nothing into something you’re proud enough to put in front of a customer still stand.</p><p>Building “one-shot complex apps” is both delusional and impractical. You’d still end up writing a novel’s worth of prompts, which would then be broken down by AI into tasks, tests, reviews and the like. It’s naive to think that “build me the next Facebook” will result in AI building you everything that is the Facebook website, apps, and everything else tied to it, but it’s a lot less naive to ask it to do all the things that make your day a drag.</p><h4>AI all the things?</h4><p>Categorically no, and I fear this is partly to blame for a considerable number of software engineers still being standoffish about AI. When something is presented as the Swiss Army knife of everything, it never is. We can already see that with AI slop everywhere. AI can do countless things, but just because it can doesn’t mean it should, doesn’t mean it’s useful, and it certainly doesn’t mean it’s pragmatically the tool that makes the most sense. But when it does…</p><blockquote>Look at your day, and identify all the “ugh” moments. 
Now check: for how many of those moments can you say “there’s an AI model for that”?</blockquote><p>You can take this exercise even further. Map out your entire workflow as a team. Identify what’s annoying or time-consuming. Check if AI could help. That may not always be the case, but just the other week I asked Jira’s AI to generate subtasks based on a story description, and it did. Given that I was building up an epic, it saved me tens of minutes, while only getting it wrong twice — which cost me just two clicks.</p><p>The fact of the matter is, a lot of our so-called skills are more about knowing how to use certain tools, how to deal with obscure frameworks and libraries, and how to find stuff in legacy code, when in fact, as engineers, we just want to solve problems.</p><p>That is what I see in the eyes of excited engineers using AI today: the empowerment of finally finding the time to develop their problem-solving muscles while delivering features and products that help users solve their problems. If that’s not exciting, I don’t know what is.</p><p><em>Attila Vago — Software Engineer improving the world one line of code at a time. Cool nerd since forever, writer of codes, blogs and books. </em><a href="https://www.goodreads.com/book/show/205716390-it-s-cold-ma-it-s-really-cold"><strong><em>Author</em></strong></a><em>. Web accessibility advocate, LEGO fan, vinyl record collector. Loves craft beer! 
</em><a href="https://attilavago.medium.com/my-200th-article-hello-its-time-we-met-3f201ad1303"><strong><em>Read my Hello story here!</em></strong></a><strong><em> </em></strong><a href="https://attilavago.medium.com/subscribe"><strong><em>Subscribe</em></strong></a><strong><em> </em></strong><em>for more stories about </em><a href="https://medium.com/@attilavago/list/lego-all-the-things-083f80bd3c51"><strong><em>LEGO</em></strong></a><strong><em>, </em></strong><a href="https://medium.com/@attilavago/list/technology-tech-news-a2d2d509b856"><strong><em>tech</em></strong></a><strong><em>, </em></strong><a href="https://medium.com/@attilavago/list/coding-software-development-d123369e3636"><strong><em>coding</em></strong></a><strong><em> and </em></strong><a href="https://medium.com/@attilavago/list/accessibility-4b67c1d08ef3"><strong><em>accessibility</em></strong></a><em>! For my less regular readers, I also write about </em><a href="https://medium.com/@attilavago/list/the-random-stuff-96bfc5a222e5"><strong><em>random bits</em></strong></a><em> and </em><a href="https://medium.com/@attilavago/list/writing-writing-tips-f83ef5e79de5"><strong><em>writing</em></strong></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ddb2118686d1" width="1" height="1" alt=""><hr><p><a href="https://engineering.prezi.com/theres-two-sides-to-every-ai-tool-adoption-story-ddb2118686d1">There’s Two Sides To Every AI Tool Adoption Story</a> was originally published in <a href="https://engineering.prezi.com">Prezi Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Somebody is in the room — did we just interview ChatGPT?]]></title>
            <link>https://engineering.prezi.com/somebody-is-in-the-room-did-we-just-interview-chatgpt-ab4e8dd5db28?source=rss----911e72786e31---4</link>
            <guid isPermaLink="false">https://medium.com/p/ab4e8dd5db28</guid>
            <category><![CDATA[hiring]]></category>
            <category><![CDATA[interview]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[prezi]]></category>
            <dc:creator><![CDATA[Máté Börcsök]]></dc:creator>
            <pubDate>Wed, 09 Jul 2025 12:24:04 GMT</pubDate>
            <atom:updated>2025-07-09T12:39:14.089Z</atom:updated>
            <content:encoded><![CDATA[<h3>Somebody is in the room — did we just interview ChatGPT?</h3><p>My team just had an interview this week. It was weird.</p><p>He did well! In fact, a little too well. Whatever we asked, he could answer in detail. Too many details. But there were clues that someone was actively listening in the room.</p><p>One example: he introduced a system they built. If certain conditions are met, the users of the platform can claim rewards. He said they obviously chose a microservice architecture with a DDD approach. One service for authentication, one for users, one for rewards, etc.</p><p>We asked clarifying questions: was having that many microservices a good choice, did every service have its own database, and how did they ensure transactions?</p><p>The answer felt smart and pragmatic. Of course the microservices have separate databases! And to make sure that transactions work, they used the SAGA pattern.</p><p>Well, none of us had heard of the SAGA pattern. I admitted that I wasn’t familiar with it, but instead of explaining the core idea or walking us through it, he just moved on. It felt like a missed opportunity, and overall, the answer was unexpected.</p><p>If you ask me, I’d say we didn’t need transactions across the services, or only needed a transaction when saving the reward; maybe we’d set up a queue and process the rewards that way, ensuring consistency at that point.</p><figure><img alt="A screenshot of a Zoom call with a blurred video background and an overlay window at the center showing a glowing assistant icon. The assistant message reads “anything else I can do for you, just let me know. I’m here to help!”" src="https://cdn-images-1.medium.com/max/1024/1*b3B3Jb8PtY-jm7_C5NcUng.png" /><figcaption>Approximate recreation of the candidate’s Zoom setup</figcaption></figure><p>So I replayed this part of the interview with ChatGPT, and guess what? 
It also suggested the SAGA pattern.</p><p>This made me do the same exercise for other questions. Sometimes I got the same hallucination, the same words from ChatGPT that the candidate had used. I couldn’t understand on the spot how they were relevant in that context.</p><p>The candidate seemed to have a professional setup, yet we still heard an echo. He was on speaker.</p><p>His eye movement was weird, and some of his answers felt like he was reading off a screen.</p><p>Sometimes his answers were very generic, yet whatever we asked, he could go into the tiniest details, oftentimes contradicting his previous answers.</p><p>After the interview, this was the first message on Slack:</p><blockquote>is it me or i think this guy is using AI to answer our questions?</blockquote><p>In general, I don’t expect anything disingenuous from people. We didn’t call him out during the call.</p><p>I wish I could share the Slack thread: everyone on the team kept adding a new clue, raising the suspicion more and more.</p><p>We are in new territory, and this interview left us wondering: how do we evaluate authenticity in the age of AI?</p><p>Personally, I don’t mind that we didn’t call out his suspected AI usage during the interview. And in fact, it doesn’t matter. The answers were about as great as unedited AI output without human oversight. The kind that sounds smart at first, until you realize it’s just a mashup of architecture buzzwords with no real insight.</p><h3><strong>Detecting AI Answers: Practical Tips for Interviewers</strong></h3><p>My article sparked some discussions about the topic internally. I reached out to our Senior Tech Recruiter, Monika Fourie, to share some of her experience.</p><blockquote>I had some people using AI during calls and it was done in an excellent way, so I believe sometimes they add their experience first and wait for the answers on the speaker. 
With the smart use of AI, it’s difficult to detect that they are using it.</blockquote><blockquote>I think the most important thing is to look at all of these signs as one — like diagnosing a disease — one symptom alone is sometimes not enough.</blockquote><h4>Signs to look for</h4><ul><li>Slight delays before responding</li><li>Their eyes going in the same direction before each response. We naturally look in certain directions when thinking creatively, solving problems and remembering, but consistently looking at the same spot on the screen usually hints that the answers appear there</li><li>Overly polished language, sophisticated words or oddly phrased responses</li><li>Inability to go “off script” or rephrase ideas</li><li>Lack of deeper elaboration or personal experience</li><li>Slow, precise repetition of the question, spoken as if dictating it into a voice‑to‑text prompt</li></ul><h4>Questions to ask</h4><ul><li>Human answers are usually less confident and more nuanced; ask, for example, “is there anything on this topic you feel less confident about” or “if you had more time, what would you look into further and why”</li><li>“You used the term ‘modular monolith’: can you explain what it does and what it means here?”</li><li>Another idea for interviews is having collaborative discussions whilst sharing a screen — this is more technical: “Let’s solve this problem together — can you share your screen and walk me through your approach?”, or “Can you sketch a high-level architecture for that idea?”</li></ul><h4>When you feel suspicious</h4><p>Calling out AI usage is difficult if not done right. It can get us into difficult situations. I suggest listening carefully, collecting clues, and using the techniques mentioned above. Only raise the issue if you’re absolutely certain, because doing so will likely end the interview right there.</p><p>This was the first time our team encountered a situation like this. I’m sure it won’t be the last. 
As AI assistants continue to evolve, detecting their presence will only get harder.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ab4e8dd5db28" width="1" height="1" alt=""><hr><p><a href="https://engineering.prezi.com/somebody-is-in-the-room-did-we-just-interview-chatgpt-ab4e8dd5db28">Somebody is in the room — did we just interview ChatGPT?</a> was originally published in <a href="https://engineering.prezi.com">Prezi Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[You’ll Rebuild Everything Every Four Years Anyway]]></title>
            <link>https://engineering.prezi.com/youll-rebuild-everything-every-four-years-anyway-b31ab0dcc17e?source=rss----911e72786e31---4</link>
            <guid isPermaLink="false">https://medium.com/p/b31ab0dcc17e</guid>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[prezi]]></category>
            <category><![CDATA[software-architecture]]></category>
            <category><![CDATA[web-development]]></category>
            <dc:creator><![CDATA[Attila Vágó]]></dc:creator>
            <pubDate>Fri, 04 Apr 2025 03:12:34 GMT</pubDate>
            <atom:updated>2025-04-04T03:12:34.299Z</atom:updated>
            <content:encoded><![CDATA[<h4>To refactor, or to rebuild? That is the question…</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4gf5UDs9E97yw0c2E44Cag.jpeg" /><figcaption>Photos and edits by author, speech bubble asset by <a href="https://commons.wikimedia.org/wiki/User:Kaldari">Kaldari</a>.</figcaption></figure><p>The headline is a direct quote from a colleague of mine many years ago. We were on a Tiger Team rebuilding the frontend architecture of the company’s main product. Being a contractor, he had very little skin in the game. He came in, helped us for six months — one of the best engineers I have ever worked with — then moved on. But his casual remark stayed with me. Do we really rebuild that often? If so, why? And when we don’t, why don’t we? Now, facing the prospect of yet another major architectural migration, I find these questions especially poignant.</p><p>It’s probably no surprise to anyone, at this point, that software tends to evolve over time. This evolution is more often than not driven by product teams who — naturally — want the software to cater to the users’ needs, to entice more users and keep them for a long time. That comes at a cost. Some of it avoidable, some of it not so much. Some of it depends on engineers, other aspects much less so. Long story short, it gets complicated, and it does so quickly.</p><blockquote>Clean code and successful products don’t always go hand in hand. This is a software engineering inevitability.</blockquote><p>When jQuery became the hottest new kid on the block, we all jumped at the opportunities it presented us with. Even though technically, it was still JavaScript, just enough of the complexities and tediousness of the language was abstracted away that everyone started building JS-heavy apps. 
That quickly ballooned into jQuery plugins — really just more JavaScript files added to the head of your pages — and you found yourself with a frontend monolith, unless you called it what it really was — a gigantic pile of spaghetti code.</p><p>Angular with its MVC architecture was supposed to solve that — and other things — but then it didn’t. Nor did React. Nor did Vue. Or Svelte. Or whatever you can think of. Given enough time, you’d keep finding yourself dealing with the same unintelligible mess, grinding yourself to a halt, wishing for yet another “rebuild”.</p><h4>The problem with rebuilds</h4><p>It’s a surprisingly common engineering request. If the app was built in Angular, you’ll surely find a group of passionate engineers who will want to rebuild it all in React. If it’s a React app, you’ll surely find some hard-core Svelte fans who’ll jump at the opportunity to migrate everything to Svelte. No matter what library or framework an “old” codebase is built with, there will be a group of engineers ready to kill it, and start from scratch.</p><blockquote>There is a false sense of security in rebuilding an existing product in a new architecture, language or framework. It’s meant to solve everything, while often it fails to solve much, if anything.</blockquote><p>Of course, before you even get the opportunity to rebuild, you already have a massive blocker to overcome — <a href="https://levelup.gitconnected.com/how-to-sell-engineering-needs-to-product-managers-2a4f379103b6?sk=60f7bf95b768bc5dbdcd463bddf56e84"><strong>selling it</strong> to the product teams and the business</a>. 
I have yet to meet a product owner or manager who gets excited about <strong><em>not</em></strong> <a href="https://engineering.prezi.com/a-rare-insight-into-the-daily-challenges-of-an-experiments-team-349a94960b4f">delivering features or experiments</a> for 6–12 months, or a business that proactively wants to invest in getting the exact same thing a year later, for the cost of an entire year’s development time. Selling a major refactor or rebuild is perhaps one of the most difficult challenges an engineering team will face, as for it to make any sense it has to be tied to performance, security and/or scalability, and that isn’t always an easy case to make, especially if we’re talking about an application built 3–4 years ago.</p><p>Another trap that I often see engineering organisations fall for is what I call “<strong>inherited fallacies</strong>”. During its lifetime, all software tends to attract dead weight in the form of abandoned features, unresolved A/B tests, and business complexities due to decisions made at a certain point, potentially for legitimate reasons. Add to that spaghetti code that possibly ties all of it together, and when rebuilding, you’ll soon find yourself recreating the same monster you were hoping to get rid of in the first place. I strongly believe that rebuilds more often than not require product input, and very pragmatic conversations as to what is kept and what isn’t. That said, watch out not to shed too many of those “inherited fallacies”, as <a href="https://forums.macrumors.com/threads/sonos-ceo-steps-down-following-disastrous-app-redesign.2447308/">you’ll end up in Sonos’ shoes, and heads will roll</a>.</p><p>The final aspect worth keeping in mind when rebuilding is <strong>new technical debt</strong>. The — and I might add, wrong — assumption is that a rebuild is a clean slate, and thus technical debt gets reset to zero. In my experience, that’s far from reality, and an overly naive and quite dangerous assumption. 
All rebuilds come with their own set of technical challenges, some of which will end up in the backlog.</p><p>Documentation is also something I often see being left for last, alongside less important feature enablements. You might also find that certain nice-to-haves developers were used to in their day-to-day are missing. I remember the first time we handed over microfrontends to the teams in a previous company I worked at, half the DX (developer experience) features were missing. It took another year for a colleague of mine and me to develop a robust CLI tool, which to this day is praised as a tremendous help for developers.</p><h4>The problem with cleanups</h4><p>It’s difficult to bring up the conversation of rebuilds without cleanups and refactors being brought up as well. And for a good reason. Few businesses have an infinite number of resources for constant rebuilds, especially when in the same breath engineering teams keep harping on about clean code, software engineering best-practices, and various programming paradigms being enforced. It’s a conversation that gets contentious very quickly. How does code even get to such a state if engineers care so much about code quality, right? Regardless of what the answer might be, cleanups and refactors come with their own risks.</p><p>As messy and tangled as legacy code may often be, it’s working code, and that needs to be remembered. With that, anytime you refactor, you put that working state at risk, and when tests are inadequate or nonexistent, any sort of cleanup could end up in disaster. Sure, there’s always the revert button to save the day, but what’s most important to take away from this is that effective refactors and cleanups require robust testing to be in place.</p><blockquote>Effective refactors require robust testing, and that, unfortunately, isn’t commonplace.</blockquote><p>Another unfortunate reality is just how low developer interest is for refactors and cleanups. 
The vast majority of engineers are far more interested in building new things, greenfield stuff, rebuilds. Refactoring existing code is also not for the faint of heart, and teams tend not to want to use their senior engineering resources for cleanups. And if that wasn’t enough, you also need to contend with the fact that engineers don’t always see eye-to-eye on what an effective refactor looks like, and you’ve got yourself <a href="https://medium.com/gitconnected/code-review-etiquette-for-the-modern-developer-3fb5e1ad62d0?sk=ccda3532a86e90d3576ef3ce7a705f32">endless conversations in code reviews</a>.</p><p>Finally, it has to be said out loud that a lot of code complexity is also due to certain product and business decisions that have been made over the years. Some were likely made for business survival, or as a reaction to the market. Years later those may make very little sense, but for a meaningful cleanup, product teams need to be involved in the decision-making, which makes many refactors more than just an engineering exercise. And we all know how this works — the more stakeholders, the <a href="https://medium.com/@jchyip/guiding-principle-consent-over-consensus-8aee08540d62">more difficult agreeing on something gets</a>.</p><h4>A more pragmatic approach</h4><p>So what do you do when neither of the above avenues seems particularly ideal? You pivot! Nah, I’m only joking. 😄 I think anyone who has worked in a tech company for more than a year has PTSD from all the pivots they’ve experienced. You need something that you can sell to product as an enabler, that excites engineers, and that has business potential. 
In our case, that’s a new architecture that allows for a staggered departure from a highly interconnected setup to a much more modular one, where microservices are loosely coupled with microfrontends.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lxbNBf6uFNQ0Vr6yEIQIFg.jpeg" /><figcaption>A high-level architecture diagram of the direction we’re taking.</figcaption></figure><p>Take for instance a Django-based site where over time you may have combined your templates and views with some modular React applications. If you closely examine the historical context in which these decisions were made, they’ll all make sense. Unfortunately, that also means you’re dealing with an overly complex setup where even developing locally becomes a pain and delivery grinds to a halt over time. One option is to throw all the Django out and start fresh — aka a complete rebuild. Or, instead, you can return JSON instead of server-rendered views, remove the need for routing on the backend, and apply something like Single-SPA on the frontend. On the server side the refactoring is far less risky, while on the frontend the rebuild is straightforward, yet staggered, as you’ll only rebuild what you need, when you need it.</p><p>More importantly, this answers the question I started with. Do we rebuild every four years? There is certainly an industry tendency to do so, and the reasons vary from technical to business and anything in between. But if you ask me whether we need to, I think not. 
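To make the Django-to-JSON idea above a little more concrete, here is a minimal, framework-free Python sketch of the before and after. All names are hypothetical; in a real Django app the second function would return a JsonResponse, and a React microfrontend mounted via Single-SPA would consume it:

```python
import json

# Hypothetical sketch of the staggered migration described above.
# Before: the backend owns routing and rendering, returning HTML.
def presentation_page_legacy(presentation):
    return "<html><body><h1>%s</h1><p>by %s</p></body></html>" % (
        presentation["title"], presentation["author"]
    )

# After: the backend returns plain JSON; rendering and routing move to
# the frontend, so each page can be rebuilt independently, when needed.
def presentation_api(presentation):
    return json.dumps(
        {"title": presentation["title"], "author": presentation["author"]}
    )

demo = {"title": "Quarterly Review", "author": "Ada"}
print(presentation_api(demo))
```

The point isn’t the code itself but the shape of the change: the server-side refactor is small and low-risk, while the frontend can migrate route by route.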
Not if the architecture we set up for ourselves allows for staggered migrations and an organic evolution, where you even have the option to build throwaway applications that satisfy a business goal for a limited amount of time without having long-term detrimental effects on the overall state of the code and your architecture.</p><blockquote>The best software architecture is the one that allows for change, where rebuilds are rare, and cleaning up just means throwing stuff out.</blockquote><p><em>Attila Vago — Software Engineer improving the world one line of code at a time. Cool nerd since forever, writer of codes, blogs and books. </em><a href="https://www.goodreads.com/book/show/205716390-it-s-cold-ma-it-s-really-cold"><strong><em>Author</em></strong></a><em>. Web accessibility advocate, LEGO fan, vinyl record collector. Loves craft beer! </em><a href="https://attilavago.medium.com/my-200th-article-hello-its-time-we-met-3f201ad1303"><strong><em>Read my Hello story here!</em></strong></a><strong><em> </em></strong><a href="https://attilavago.medium.com/subscribe"><strong><em>Subscribe</em></strong></a><strong><em> </em></strong><em>for more stories about </em><a href="https://medium.com/@attilavago/list/lego-all-the-things-083f80bd3c51"><strong><em>LEGO</em></strong></a><strong><em>, </em></strong><a href="https://medium.com/@attilavago/list/technology-tech-news-a2d2d509b856"><strong><em>tech</em></strong></a><strong><em>, </em></strong><a href="https://medium.com/@attilavago/list/coding-software-development-d123369e3636"><strong><em>coding</em></strong></a><strong><em> and </em></strong><a href="https://medium.com/@attilavago/list/accessibility-4b67c1d08ef3"><strong><em>accessibility</em></strong></a><em>! 
For my less regular readers, I also write about </em><a href="https://medium.com/@attilavago/list/the-random-stuff-96bfc5a222e5"><strong><em>random bits</em></strong></a><em> and </em><a href="https://medium.com/@attilavago/list/writing-writing-tips-f83ef5e79de5"><strong><em>writing</em></strong></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b31ab0dcc17e" width="1" height="1" alt=""><hr><p><a href="https://engineering.prezi.com/youll-rebuild-everything-every-four-years-anyway-b31ab0dcc17e">You’ll Rebuild Everything Every Four Years Anyway</a> was originally published in <a href="https://engineering.prezi.com">Prezi Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How using Availability Zones can eat up your budget — our journey from Prometheus to…]]></title>
            <link>https://engineering.prezi.com/how-using-availability-zones-can-eat-up-your-budget-our-journey-from-prometheus-to-be8a816f7efe?source=rss----911e72786e31---4</link>
            <guid isPermaLink="false">https://medium.com/p/be8a816f7efe</guid>
            <category><![CDATA[monitoring]]></category>
            <category><![CDATA[prometheus]]></category>
            <category><![CDATA[victoriametrics]]></category>
            <category><![CDATA[grafana]]></category>
            <category><![CDATA[kubernetes]]></category>
            <dc:creator><![CDATA[Grzegorz Skołyszewski]]></dc:creator>
            <pubDate>Mon, 09 Dec 2024 16:31:05 GMT</pubDate>
            <atom:updated>2024-12-09T16:31:05.611Z</atom:updated>
<content:encoded><![CDATA[<h3>How using Availability Zones can eat up your budget — our journey from Prometheus to VictoriaMetrics</h3><h3>Intro</h3><p>By 2024, Prezi’s monitoring system, built around Prometheus, was becoming outdated. It was already 5+ years old, running on a deprecated internal platform and accumulating significant costs every month.</p><p>At the beginning of the year, we decided to deal with the “future problem” and modernize our metrics collection and storage system. Our goals were to run the monitoring system in our Kubernetes-based platform and reduce the overall complexity and costs of the system.</p><p>We achieved these using VictoriaMetrics. This post describes our journey, the challenges we faced, and the results we achieved from the migration.</p><h3>Previous state</h3><p>Our Prometheus-based system wasn’t <strong>that </strong>problematic by itself — we ran a pair of instances for each of our Kubernetes clusters to achieve high availability. We also had one extra pair for non-Kubernetes resources, and one for storing a subset of metrics with longer retention. You can see the high-level architecture of the system in the diagram below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tGT2rOPOAoGAiRKAbgc1rg.png" /><figcaption>Our Prometheus-based system architecture</figcaption></figure><p>Just before the migration, we had 5 million active series at any given point in time. It’s also worth noting that our microservices ecosystem was already instrumented for producing metrics in Prometheus format, and that was something we didn’t want to change — at this stage it’s the de facto standard (although it is slowly being superseded by OpenTelemetry).</p><p>There are some challenges when operating such a system:</p><ul><li>Exploring metrics or configuring rules must target specific installations. 
This made dashboarding and alerting more difficult — and it’s already difficult for most non-SRE folks in general.</li><li>The instances Prometheus ran on had to be <strong>really</strong> <strong>beefy</strong> to handle our load.</li><li>As mentioned in the introduction, the instances were running on the previous version of the Prezi platform that was already deprecated. We really wanted to move off.</li></ul><h3>The options</h3><p>Now that you know what we were dealing with, let’s look at what we could have done with it. We set out to explore our options, considering both managed and self-hosted solutions. We quickly realized that we couldn’t afford to ship our metrics to any of the vendors out there. We would have had to spend at least 2x the current cost, and the prospect of modern self-hosted solutions being even cheaper led us to drop that path.</p><p>On the self-hosted end of the spectrum, we had:</p><ul><li>Thanos</li><li>Mimir/Cortex</li><li>VictoriaMetrics</li></ul><p>Some members of the team were already familiar with Thanos and Cortex, so these were the biased first-choice tools that we tried to understand first. But we didn’t stop there and made a complete comparison of the concerns that we cared about. You can see the table from one of our exploration documents below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*E58dUytQXpOhzGjf" /><figcaption>Differences between Mimir, Thanos and VictoriaMetrics, taken from our exploration documentation.</figcaption></figure><p>We initially thought that using <em>block storage </em>may be a downside of VictoriaMetrics. We couldn’t have been more wrong — while it’s tempting to use the infinitely-scalable object storage (like S3), good old block storage is just cheaper and more performant. Given that cost control was one of the priorities, we saw an opportunity to run the system cheaper, and quite possibly with a less complex architecture. 
For example, thanks to using block storage, VictoriaMetrics needs no external cache subsystem, unlike the other two.</p><p>In the process of exploring what VictoriaMetrics has to offer, we also took a small detour and talked with the good folks at VictoriaMetrics to see if buying an Enterprise license for self-hosting, which enables some features we might have wanted, was within our budget. It turned out we didn’t really need those features, but buying the license wouldn’t have broken the bank for us either. And there’s nothing wrong with asking for a quote!</p><p>VictoriaMetrics stood out thanks to its simplicity and cost-efficiency, which we tested in a Proof of Concept.</p><h3>VictoriaMetrics Proof of Concept with some challenges</h3><p>We jumped into the implementation of a small proof-of-concept system based on VictoriaMetrics, to see how easy it is to work with (what good is the most cost-effective system if you can only get there after 3 months of tuning it back and forth?), how it performs, and to extrapolate the cost of the full system later on.</p><p>VictoriaMetrics allows you to install VictoriaMetrics Single — an all-in-one, single executable that acts almost exactly like Prometheus. It can scrape targets, store the metrics, and serve them for further processing or analysis. We knew from the start that we wanted to use VictoriaMetrics Agents to scrape targets, as that allowed us to host a central aggregation layer installation and distribute the agents — all of them contained, collecting metrics only within their environments (be that a Kubernetes cluster, or an AWS VPC).</p><h4>The initial idea</h4><p>We wanted to host the tool on Kubernetes in the end, so it made sense to rely on the distributed version of the system — for high availability and scalability, it just sounded good. 
We took the off-the-shelf helm chart for the clustered version — one where VMInsert, VMStorage, and VMSelect are each separate components.</p><p>The concept is fairly simple — VMInsert is the write proxy, VMSelect is the read proxy, and VMStorage is the component that persists the data to underlying disks. On top of that, we also installed VMAlert — the component used for evaluating rules (Recording and Alerting).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*mlx0uiMeMCx-YIuk" /><figcaption>High level overview of VictoriaMetrics Cluster architecture, taken from our exploration documentation.</figcaption></figure><h4>We didn’t want to test agent options yet</h4><p>We initially used Prometheus servers with <em>remote_write</em> for testing but quickly found that VictoriaMetrics Agents were far more performant for our needs. Even though we had a lot of headroom on the instances, Prometheus was just too slow to write to VictoriaMetrics.</p><p>Installing VictoriaMetrics Agent was easy with the already existing scraping configuration. We simply replicated the configuration — that was enough to make the Agent work.</p><h4>The cost and the performance</h4><p>We managed to create a representative small version of the system. That allowed us to test the performance of reads and writes, and see how many resources (CPU time, memory, and storage) the system used. We were absolutely delighted. We found queries that were timing out after 30 seconds in Prometheus returning data in 3–7 seconds in VictoriaMetrics. We didn’t find any queries that were performing significantly worse.</p><p>We also found that the resource usage footprint was minimal. The data is efficiently stored on the disk, and compressed, and the application uses very little CPU time and memory. Our estimations at the time showed: 70% less storage, 60% less memory, and 30% less CPU time used. 
This, together with bin-packing in Kubernetes, made us excited about saving a significant amount of money on the system.</p><p>Well done, VictoriaMetrics!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/724/1*Zn5eXdybRj1JGByXeE7QYg.jpeg" /><figcaption>skynesher/E+ via Getty Images.</figcaption></figure><h4>Too good to be true, or how using Availability Zones can empty your wallet</h4><p>So it was working, and it was working well. We were scraping metrics and using <em>remote_write</em> to store them. We could query the metrics in Grafana (added as a Prometheus data source, because VictoriaMetrics’ <em>MetricsQL</em>, the query language, is a superset of <em>PromQL</em> — which is fantastic!), we even added some alert rules and saw them trigger. That was so smooth. Too smooth.</p><p>A couple of days later, we found that we had racked up a significant bill, attributed to the network traffic in our environment. It turns out that running a distributed metrics system, where each query or write of a metric incurs an extra hop (VMSelect or VMInsert to VMStorage), can be costly when you put that in the context of inter-zone traffic in your hyperscaler (AWS for us). Not only were typical metric writes and reads subject to that, but evaluating rules (and we have some really heavy recording rules) also used the same route. That was concerning and made us stop and rethink our approach.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/724/1*Hlq1Yf1XG-sB65WF8VR72Q.jpeg" /><figcaption>DjelicS/E+ via Getty Images.</figcaption></figure><p>We needed to figure out something else.</p><h4>Back to the roots</h4><p>If you scroll up to the previous state diagram, where I showed how we used Prometheus, you might see that we used a pair of instances for HA. We decided to keep that approach for our new system. 
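Before describing the fix, it is worth putting rough numbers on the problem with a back-of-envelope Python sketch. The traffic volume below is purely illustrative (not our real number); the rate reflects AWS’s usual ~$0.01/GB charge in each direction for inter-AZ traffic within a region:

```python
# Illustrative inter-AZ cost estimate: traffic crossing an Availability
# Zone is billed on both sides (~$0.01/GB in + $0.01/GB out on AWS).
COST_PER_GB_EACH_WAY = 0.01  # USD, typical AWS intra-region inter-AZ rate

def monthly_cross_az_cost(gb_per_day):
    """Monthly cost of gb_per_day crossing AZs, billed both ways, over 30 days."""
    return gb_per_day * 2 * COST_PER_GB_EACH_WAY * 30

# A hypothetical 500 GB/day of writes, reads, and rule evaluations
# taking the extra hop between zones:
print(monthly_cross_az_cost(500))  # → 300.0
```

The exact figures don’t matter; what matters is that every extra hop is billed twice and compounds with volume, so keeping the hops inside one zone attacks the cost directly.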
Instead of using the clustered version of VictoriaMetrics per Availability Zone, we tested an installation based on two separate VictoriaMetrics Single instances, each in a different AZ. We went into “save as much as possible” mode at that time and traded local redundancy for global redundancy — if a single cluster with distributed components would have been enough for us reliability-wise, two instances in a <em>hot-hot</em> setup would do just as well!</p><p>Installing two single-replica Deployments of VictoriaMetrics Single worked flawlessly for us (spoiler — it still works flawlessly more than half a year later 🚀). We no longer cross Availability Zones with our extra-hop traffic.</p><p>We added a pair of VictoriaMetrics Alert instances next to each VictoriaMetrics Single instance, operating in the same Availability Zone.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/362/1*wYvudevkjcPkwSmiLt-L-g.png" /><figcaption>Aggregation Layer overview based on VictoriaMetrics Single instances.</figcaption></figure><p>We set up a load balancer in front of the instances for reading the metrics, mainly used by Grafana. Occasionally, one of the VMSingle instances goes down — then the traffic is sent to the other one. When an instance is unavailable, we don’t lose data — agents buffer it, and while we may skip a couple of recording rule evaluations, <a href="https://victoriametrics.com/blog/rules-replay/">VictoriaMetrics provides a neat way to backfill rules using vmreplay</a>.</p><p>The only time traffic goes across AZs now is when an agent is not hosted in the same zone as the target VictoriaMetrics Single instance. 
This is something that cannot be worked around, as long as we want two agents to write the data (which is then deduplicated smartly by VictoriaMetrics).</p><h3>The final architecture and other notable mentions</h3><p>Finally, our architecture looked like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LRgN6cOsTi0Mg9T-4B1uXQ.png" /><figcaption>VictoriaMetrics-based system architecture</figcaption></figure><p>(Yes, the diagram looks a bit more convoluted than the diagram for the previous system. This is the price you pay for having a more performant and cost-effective system with a better user experience 🙃)</p><p>There are also other use cases, which I haven’t touched on above — the long-term storage, and using VictoriaMetrics Operator to scrape non-Kubernetes targets and improve the system’s configuration capabilities. I want to expand a bit on these and one extra special thing below.</p><h4>Long-term storage</h4><p>We also wanted to migrate our long-term storage installation of Prometheus. When exploring VictoriaMetrics, using an enterprise license to have different retention configurations for series was tempting, but we checked and it wasn’t the most cost-effective way to do it.</p><p>We also had a brief episode of sending these metrics to Grafana Cloud, where we have 13 months of retention. That cost us pennies, but at the time of adding it, we had two Grafana installations — a self-hosted one, and a Cloud instance.</p><p>Having both short-term and long-term metrics in one Grafana would require us to add the Grafana Cloud Prometheus data source in our self-hosted instance. That would have been simple enough, but we found something better — we just set up yet another VMSingle instance with a different retention setting. 
We not only pay even less, but also keep 100% of our metrics in our own infrastructure.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/684/1*TNEQZneLoIx7X29NwzrZjg.jpeg" /><figcaption><a href="https://www.gettyimages.com/search/photographer?photographer=Michael%20Blann">Michael Blann</a>/DigitalVision at Getty Images.</figcaption></figure><h4>VictoriaMetrics Operator</h4><p>Our scraping and rules configuration for the previous system was overly complicated, carrying a baggage of tech debt — sometimes neither we nor our users understood how to configure the system. We wanted to change that.</p><p>We chose to install and configure VictoriaMetrics using the Kubernetes Operator. All of the components are managed by the Operator, as well as the configuration of the system. That allowed us to distribute the configuration concerns to our users — our product teams can now configure alerting for their services from their repositories. If you want to know how we pulled that off, let me know — that would definitely be material for another post.</p><h4>Scraping non-Kubernetes resources with VictoriaMetrics Operator</h4><p>When we were setting up the system in production, VictoriaMetrics Operator was still in its early days. There was no support for Service Discovery of non-Kubernetes targets (now there is), and there was no way to install VMAgent (an Operator-managed Custom Resource) that wouldn’t be injected with the same configuration as the other VMAgents in the cluster (at least not in an easy, maintainable way).</p><p>To overcome these limitations and still collect metrics from our other workloads, we chose to install an additional VictoriaMetrics Agent using the helm chart and configure it statically. 
This works for us because the targets don’t change that much and are mostly infrastructure-related, so the people configuring the scraping are more familiar with Prometheus/VictoriaMetrics than, say, a Python-focused Software Engineer.</p><h4>Single pane of glass in Grafana Cloud with self-hosted metrics</h4><p>Lastly, the very recent change that is worth mentioning — consolidating our Grafana instances. We now have only one instance of Grafana, thanks to a smart solution offered by Grafana Labs — Grafana Private Data Connect. We install the agent next to our VictoriaMetrics, which sets up a SOCKS5 tunnel between our and Grafana Labs’ infrastructure. That allowed us to add a self-hosted VictoriaMetrics as a data source in Grafana Cloud. What’s more — it’s free (except for the network traffic)! Neat! Well done, Grafana Labs! 💪</p><p>Note: We are a happy customer of Grafana Labs and their Cloud offering, as you may know from <a href="https://engineering.prezi.com/how-prezi-replaced-a-homegrown-log-management-system-with-grafana-loki-15111174ff91">How Prezi replaced a homegrown Log Management System at Medium</a> or <a href="https://bigtent.fm/s2/2">Grafana’s Big Tent Podcast S2E2</a>, where Alex first explained how we landed on Grafana Loki for our Log Management, and then explained how we use Grafana IRM for our Incident Management. Check these out!</p><h3>What have we gained from migrating our system?</h3><p>The benefits can be summarized as follows:</p><ul><li><strong>Cost Efficiency</strong>: Saved ~30% on system costs.</li><li><strong>Performance</strong>: Query speeds improved significantly, with heavy queries completing in 3–7 seconds (vs. 
30+ seconds).</li><li><strong>User Experience</strong>: Streamlined metrics access and configuration via Kubernetes-native tools.</li><li><strong>Scalability</strong>: The system is now future-proof for growing workloads.</li></ul><p>Lastly, working on the migration allowed us to learn a ton, and work on something interesting and challenging.</p><p>Migrating from Prometheus to VictoriaMetrics transformed our monitoring system, offering cost savings, performance gains, and an improved developer experience. If you’re considering a similar move, we strongly recommend evaluating VictoriaMetrics for its simplicity and efficiency.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=be8a816f7efe" width="1" height="1" alt=""><hr><p><a href="https://engineering.prezi.com/how-using-availability-zones-can-eat-up-your-budget-our-journey-from-prometheus-to-be8a816f7efe">How using Availability Zones can eat up your budget — our journey from Prometheus to…</a> was originally published in <a href="https://engineering.prezi.com">Prezi Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How To Turn Red Energy Into Strategy And Migrate All Your Tests While You’re At It]]></title>
            <link>https://engineering.prezi.com/how-to-turn-red-energy-into-strategy-and-migrate-all-your-tests-while-youre-at-it-12b29c665ec5?source=rss----911e72786e31---4</link>
            <guid isPermaLink="false">https://medium.com/p/12b29c665ec5</guid>
            <category><![CDATA[quality-assurance]]></category>
            <category><![CDATA[coding]]></category>
            <category><![CDATA[software-testing]]></category>
            <category><![CDATA[engineering-mangement]]></category>
            <category><![CDATA[software-development]]></category>
            <dc:creator><![CDATA[Attila Vágó]]></dc:creator>
            <pubDate>Tue, 26 Nov 2024 04:13:06 GMT</pubDate>
            <atom:updated>2024-11-26T13:12:25.106Z</atom:updated>
<content:encoded><![CDATA[<h4>An in-depth look at migrating over 140 Ruby-based Cucumber tests to a Java-based test automation framework…</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hrjJ3SqpTiGLQxHHzsaBUQ.jpeg" /><figcaption>Photo edits by author. Ruby logo Copyright © 2006, Yukihiro Matsumoto, Java logo by <a href="https://logoeps.com/java-eps-vector-logo/40925/">LogoEps</a>. All assets used with permission.</figcaption></figure><p>One of the major challenges a software engineering organisation tends to face at one point or another in its lifetime is technical debt that simply cannot be “paid back”. Even with the best of intentions, it does happen, and it can happen for a myriad of reasons, one of them being a stack change over time or a certain language or framework’s fall from grace over the space of a decade or two. Add to that some inevitable brain drain, and you have yourself a migration trifecta.</p><p>Over the last few years, an ongoing conversation between engineering teams was our hefty suite of Cucumber regression (E2E) tests written in Ruby. As the years have gone by, Ruby has slowly become the abandoned child of our stack. There was a lot of appetite for it initially, and a fairly widespread skillset across the teams. The language was popular, Cucumber was popular, so writing tests in Ruby was also popular. Until it wasn’t. By 2021, whenever our Cucumber tests came up in conversation, you could feel the dread setting in. Everyone wanted to get rid of them, but nobody had the time, will, or energy to do it. After all, we were talking about roughly 200 tests.</p><blockquote>By the end of 2023 we had virtually no Ruby skills left in the company. Be that on the infrastructure or the development side.</blockquote><p>It’s important to remember that regression tests don’t just run in a vacuum or on local machines. Writing them, updating them, is only half the equation. 
The other half is an entire infrastructure that enables those tests to run as part of your CI pipelines. At this point, it wasn’t just developers who wanted to — and I quote word for word —<em> “kill it with fire”</em>; our developer experience team (DX), who were tasked with maintaining the Ruby infrastructure, were also getting exhausted by its costly and unsustainable maintenance, never mind the risk of ending up in a situation where some dependencies would simply not be supported at all anymore, blocking the pipelines and thus critical releases to production of our products. I mean, just look at these gems, and I say that both literally and figuratively:</p><pre>ruby 2.5: release date: 2017-12-25, EOL: 2021-04-05 (latest version: 3.3.6)<br>google chrome 75: release date: 2019-06-04 (latest version: 131)<br>bundler gem v1.17.3: release date: 2018-12-27 (latest version: 2.5.23)<br>cucumber 3.1: release date: 2017-11-28</pre><p>As one of my DX team-mates aptly put it, it was a time-bomb ready to blow at any moment. The last time I heard that, I had to migrate an entire frontend from Angular 1 to React and do so while also <a href="https://medium.com/p/8373a6e67ac8">moving a monolith to microfrontends</a>.</p><p>But I’ll be honest, I also tend to be intrigued by challenges that keep not getting solved for a long time. Perhaps it’s a form of self-validation, or just “red energy”, as one of my therapist friends calls it.</p><blockquote>If you ever used anger to fuel positive change, you used red energy.</blockquote><p>By spring of 2024 it was decided. I was going to make it my personal goal for the year to once and for all migrate all the Ruby Cucumber tests to our Java-based E2E framework. I was hell-bent on doing whatever was necessary to get it done. Unbeknownst to me, Turu, a colleague of mine from the QA team, had a very similar energy fueling a very similar goal. 
I know that 9 times out of 10 the word “synergy” is used completely unnecessarily in conversations, and we’re all tired of hearing it, but this time the synergy was real. I was going to need the QA team’s support to some extent anyway, but seeing our goals intersect — love the boardroom lingo, aye? 🙂 — was a massive relief, as it meant we were going to be able to share the load somewhat more evenly and accomplish — now our collective goal — faster. Believe it or not, sometimes throwing more people at the problem does help. As much as I love Fred Brooks’ timeless software engineering classic, it doesn’t always apply.</p><h4>A few words on strategy</h4><p>In short? Let’s call it the “80 days around the world” strategy. I could say we time-boxed it, but that sounds boring, and tying our success somehow to Jules Verne sounds more fun. Regardless of what you call it, that aspect — especially in hindsight, and hindsight is always 20/20 — was crucial to getting this migration done.</p><p>I have learnt this doing a lot of proof-of-concept projects and hackathons. Creating an unmovable constraint — designers know this first-hand — inspires people. Creative ideas surface, people suddenly become more dynamic, adaptable, and start focusing on what truly matters — the outcome by a certain date. In this case, we really did give ourselves around 80 days with a singular goal: migrate everything.</p><blockquote>Migrate everything in 80 days. How? Doesn’t matter. Get creative. Stay pragmatic. Get. It. Done.</blockquote><p>Anyone who works in software development knows that prioritisation is a tricky business. A lot hinges on it. In this case, everything did. I ran all the Cucumber tests locally, and quickly realised we would have to be smart about what we migrated, when, and why. So, to make sure we stayed efficient:</p><ul><li>I reached out to teams to find out if they had any redundant or deprecated tests. 
Some did, so I marked them for deletion.</li><li>I looked at the currently passing tests, and created the first batch to migrate. These got priority because all of these tests were running on live software, used by millions of customers. If, for whatever reason, we suddenly ran out of time, we’d at least have the most important tests migrated.</li><li>Then I created a second batch, while my colleagues from QA already began lending a hand in migrating them to our own test automation framework (TAF). This second batch comprised all the flaky tests, the ones failing for whatever reason, and the disabled ones.</li><li>Finally, there was a last set of tests that covered some of our A/B tests. Initially, I almost made the mistake of starting with these, but then I realised that by the time we were done with the migration, most of these A/B tests would have already been concluded. That turned out to be true, and out of 20 or so, we only had to write tests for 3.</li></ul><p>Once prioritisation was ready, the QA team (partially) and I (full-time) got working on the implementation part. Test after test, one by one, day after day, we could see the progress. We used a traffic-light system. Tests that we had migrated, we marked with green 🟢, tests we were working on we marked with amber 🟠, and tests we found did not need migrating, we marked with red ❌. At all times, everyone involved knew who was working on what test. I decided to waste as little time on Jira tickets as possible, so we did most of the tracking in a Confluence doc.</p><blockquote>Were we ruthless with our time-saving measures? Perhaps. But did we deliver the work on time? You bet!</blockquote><p>Once all the tests were migrated, QA did a final review to make sure we had tagged everything correctly and no important test cases were missed, and as an output, we created a log table that showed which Cucumber test ended up in which TAF test. 
Literally within days of migrating, we already had engineers making use of this log, as they now had to find the old Cucumber test cases in their new home.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*r_PxScMLKn3kRXQfMUASdw.png" /><figcaption>A diagram of the entire process created by the author in Freeform.</figcaption></figure><p>The final step in the strategy was setting up the CI appropriately. We wanted to make sure these tests were parallelised, but in doing so, we had to keep infrastructure cost in mind. Our Ruby tests, while a pain in the neck in every other way, used a fairly low amount of resources, while the Java tests were a tad more resource-hungry, but DX figured out a good resource-to-test ratio to keep costs in check. With that in place, I had the honour of pressing the archive button on the repository and announcing to the entire company that we had finally killed all our Cucumber tests.</p><h4>What ultimately enabled a successful migration</h4><p>Looking back, trying to run a retrospective in my head of what went well, and why we finally managed to pull this migration off, there are a few things that come to mind, and some of these I have come to consider key to any successful project going forward.</p><p><strong>We had a common goal.</strong> It cannot be overstated just how important it is for everyone to row in the same direction. It empowers those doing the work to focus on it and do it well. So, the support of both my team and the QA team was crucial. Turu, our senior automated QA specialist, had this migration as a personal 3rd quarter goal just as much as I did, so we were both heavily invested in getting the work done successfully.</p><p><strong>Zero wasted time.</strong> Apart from a few initial meetings my team and I had with QA around what we wanted to achieve and some historical context, the only meetings we had were a weekly 1-hour sync between Turu and myself. 
That’s roughly a day’s worth of meetings over a 10-week project. That’s not to say that meetings are bad, but they do cost the project time, and we couldn’t afford that.</p><p><strong>Keeping the goal in mind.</strong> And the goal was clear: migrate all the tests as effectively as possible within the time we had. At times, that meant merging multiple test scenarios into one, or moving a test into another existing test as a scenario rather than keeping it as a standalone test.</p><blockquote>For each test, we did whatever made most sense instead of sticking to a 1:1 carbon-copy approach.</blockquote><h4>Translated to tangible business outcomes</h4><p>But that’s the engineering (including QA) success story, and as I mentioned in <a href="https://medium.com/gitconnected/how-to-sell-engineering-needs-to-product-managers-2a4f379103b6?sk=60f7bf95b768bc5dbdcd463bddf56e84">“How to Sell Engineering Needs To Product Managers”</a>, as engineers we owe it to ourselves and the business to translate engineering needs to business needs. I’d be the first to shy away from work that makes no business sense. While I’m no CFO, nor do I intend to ever become one, any effort that doesn’t make business sense doesn’t sit well with me. That said, no project will ever be done <em>“because it sits well with Attila”</em>, so let me translate this particular engineering need to a business need.</p><p>When you have tests written in a language that nobody knows or cares to learn, those tests will be either poorly written or not written at all. This increases the chance of customer-blocking bugs that could go unnoticed until customer support is alerted, at which point it’s already too late and costly. So, a more robust product results in fewer customer support calls, aka money saved.</p><p>The other downside of a severely outdated test infrastructure is maintenance. Ideally, a software company wants to spend as little money as possible on maintenance.
Features or A/B tests are more interesting, and they make more money. Maintenance that costs 10 times more than it should is a waste of money, brings down morale, and might even be the reason you can’t hire new engineers. There’s only so much money in an engineering pot, and we much preferred spending it on new tools or perhaps even additional headcount to maintaining a severely aged infrastructure.</p><blockquote>Reducing complexity increases velocity. It really does come down to that.</blockquote><p>As our DX team repeatedly highlighted, we were sitting on a time-bomb. Waking up every day to the very real possibility that one of our Ruby-Cucumber dependencies gets nixed because of its age is not a great place to be in when the core functionality of your product — such as signup, payments, and analytics — depends on it. Such a situation would have caused severe disruption for Product, wasted A/B testing runtime, increased manual QA and customer support costs for weeks if not months, and potential loss of customers and revenue. This is unacceptable, especially when you are on a growth trajectory.</p><p>Finally, this migration was also a massive enabler. Within weeks of completion, having all of our tests in one place, we were already able to identify areas where we could make our tests more efficient, spend less time in the CI, and be more confident in what is being tested — aka have a real and meaningful understanding of our coverage. This can only mean one thing: better velocity in 2025 and beyond, and if there is one thing that Product Managers love hearing, it’s higher throughput.
😉</p><h4>Closing thoughts on migrations, AI, and machine learning</h4><p>As QA and I were wrapping the migration up, I couldn’t help but reach certain tangential conclusions that, I feel, will be food for thought for many of us software engineers and quality engineers in the coming year(s).</p><p>While completing a migration like this is an exciting opportunity for some of us — myself included — it’s not something most engineers would volunteer for, and for good reason. Migrations can be a can of worms: you’re touching a lot of legacy code you’ve never seen before and have no historical context on. You’re likely going in a little blind.</p><p>Then there’s also the monotonous aspect of the job. Especially when it comes to writing E2E tests, once you have everything in the framework available to you, writing the tests themselves can feel like more of the same, which brings me to my next point and an interesting realisation.</p><p>At one point, by pure luck, I downloaded the latest version of <a href="https://www.jetbrains.com/help/idea/full-line-code-completion.html">IntelliJ that features Full Line code completion</a>. Within minutes, I started seeing the IDE suggest my next line of code, be that a new page object or an assertion, and what do you know? It was often right! Often enough that I saved 2–3 days’ worth of time over the course of the migration. This was machine learning in action, under human supervision, which made me think…</p><blockquote>If there is one job that I’d like generative AI to do in the future, it’s maintenance and migrations.</blockquote><p>It would have been great to feed a model our Cucumber and TAF tests, let it figure out what was missing, migrate those tests, run them and even deploy them with minimal human supervision. Now that’s something I could really get behind, and who knows, with another healthy dose of red energy it might soon become reality.
😉</p><p><em>Attila Vago — Software Engineer improving the world one line of code at a time. Cool nerd since forever, writer of codes, blogs and books. </em><a href="https://www.goodreads.com/book/show/205716390-it-s-cold-ma-it-s-really-cold"><strong><em>Author</em></strong></a><em>. Web accessibility advocate, LEGO fan, vinyl record collector. Loves craft beer! </em><a href="https://attilavago.medium.com/my-200th-article-hello-its-time-we-met-3f201ad1303"><strong><em>Read my Hello story here!</em></strong></a><strong><em> </em></strong><a href="https://attilavago.medium.com/subscribe"><strong><em>Subscribe</em></strong></a><strong><em> </em></strong><em>for more stories about </em><a href="https://medium.com/@attilavago/list/lego-all-the-things-083f80bd3c51"><strong><em>LEGO</em></strong></a><strong><em>, </em></strong><a href="https://medium.com/@attilavago/list/technology-tech-news-a2d2d509b856"><strong><em>tech</em></strong></a><strong><em>, </em></strong><a href="https://medium.com/@attilavago/list/coding-software-development-d123369e3636"><strong><em>coding</em></strong></a><strong><em> and </em></strong><a href="https://medium.com/@attilavago/list/accessibility-4b67c1d08ef3"><strong><em>accessibility</em></strong></a><em>! 
For my less regular readers, I also write about </em><a href="https://medium.com/@attilavago/list/the-random-stuff-96bfc5a222e5"><strong><em>random bits</em></strong></a><em> and </em><a href="https://medium.com/@attilavago/list/writing-writing-tips-f83ef5e79de5"><strong><em>writing</em></strong></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=12b29c665ec5" width="1" height="1" alt=""><hr><p><a href="https://engineering.prezi.com/how-to-turn-red-energy-into-strategy-and-migrate-all-your-tests-while-youre-at-it-12b29c665ec5">How To Turn Red Energy Into Strategy And Migrate All Your Tests While You’re At It</a> was originally published in <a href="https://engineering.prezi.com">Prezi Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A Rare Insight Into The Daily Challenges Of An Experiments Team]]></title>
            <link>https://engineering.prezi.com/a-rare-insight-into-the-daily-challenges-of-an-experiments-team-349a94960b4f?source=rss----911e72786e31---4</link>
            <guid isPermaLink="false">https://medium.com/p/349a94960b4f</guid>
            <category><![CDATA[a-b-testing]]></category>
            <category><![CDATA[prezi]]></category>
            <category><![CDATA[product-development]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[engineering-culture]]></category>
            <dc:creator><![CDATA[Attila Vágó]]></dc:creator>
            <pubDate>Tue, 09 Jul 2024 13:21:02 GMT</pubDate>
            <atom:updated>2024-07-09T13:21:02.170Z</atom:updated>
            <content:encoded><![CDATA[<h4>If you thought feature development was tough, try developing A/B tests all day, every day… 😉</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*n3VDMgH-u5OU5VAH" /><figcaption>Photo by <a href="https://unsplash.com/@jasongoodman_youxventures?utm_source=medium&amp;utm_medium=referral">Jason Goodman</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>I think I’d need four hands to count all the types of projects I have touched in over a decade. From small tech start-up to mid-size web agency, and from mid-size established tech company to large tech corporations, I’ve seen them all and contributed to more codebases than I can or care to count. And yet, in Prezi — two years after joining — I found myself in an entirely new context: an experiments team, and my mind is blown. Every. Single. Day.</p><h4>What does an experiments team do?</h4><p>If you’re thinking greenfield projects, I’m going to stop you right there. This is the team that has little to no chance of such luxuries. It’s quite the opposite. In Prezi, the experiments team’s main purpose is to think of A/B tests, plan them, implement them, watch them perform and then, based on the outcome, release one of the variants, or scrap the entire test. That’s the gist of it, anyway. Why would we do that? One word: success. The success of Prezi as a whole and, implicitly, our customers’.</p><p>Our experiments team is actually called “Growth &amp; Monetisation” or GM — though that always makes me think of General Motors, which we have nothing to do with. As far as I know, the only cars we ever built in Prezi were all made of LEGO. 😄 We’re in the business of inspiring, unforgettable, impactful presentations, and our team makes sure to drive that message home to more happy customers than ever.
Naturally, you can have the best product out there, but if the path to it isn’t frictionless, churn will be high. So that’s what we do: we try to understand, based on data generated by A/B tests, where potential customers drop off, why, and how we <em>might</em> be able to change that for the better.</p><p>I very deliberately used the word <em>might</em> there. Predicting success in software development is, in my experience, about as accurate as predicting the weather in Ireland. While there is not much we can do about Irish weather, we can test hypotheses in a software product <em>relatively</em> easily. You see what works, and armed with that knowledge, you make further decisions. But you’ll also quickly realise success isn’t a given in an A/B test. In fact, the industry-standard success rate — which we’re also tracking — is 30%. That’s 7 out of 10 A/B tests failing or being inconclusive.</p><blockquote>Running an A/B test implies a chance of failure. You have to accept that to ultimately succeed.</blockquote><p>But, with a healthy dose of pragmatism and informed optimism, you might also see the failed tests as a great learning opportunity. If nothing else, these will either stop you from investing in the wrong features, prevent you from entering a technical rabbit-hole, or even inform and shape future A/B tests and get them that much closer to a successful outcome. And when you think like that, you realise that no time spent on A/B testing is lost time, as one way or another, it feeds directly into <a href="https://productcoalition.com/product-strategy-lessons-from-dr-house-f55872182164">your product strategy</a>.</p><p>Allow me to inspire you with a few examples. The first two will be successful experiments that ended up increasing either revenue or the number of registered users.
The latter two will focus on failed experiments we learned a lot from.</p><p><strong>A privacy control call-to-action.</strong> A Prezi created by a free account is always public, and we made that abundantly clear as soon as the user entered the editor, before they even got a chance to use the product and get excited by the prospect of creating unique, engaging, multidimensional presentations or generating one with Prezi AI. Sure, the immediate — in your face — privacy notification was well-intentioned. We wanted both to make users aware that their presentations were public and to give them the chance to upgrade. This experience, however, was one of high friction.</p><p>Our theory was that we might be able to improve on this and not lose subscriptions in the process, maybe even win some more, so instead of the notification, we just added a visible call-to-action button signalling the document was public. On click, the user — as before — had the option to just acknowledge and keep it public, or upgrade to a paid membership and make the Prezi private. And guess what? Not only did we not lose subscribers with the new approach, but we even gained some.</p><p><strong>Confusing SSL Callout.</strong> In our paywalls we wanted to give our customers peace of mind by calling out that transactions are SSL-encrypted and 100% safe. One would think this is a great example of caring about your users. Except just like parenting can go very wrong when you become a helicopter parent, caring for users can also go sideways. In this particular instance, instead of gaining subscribers, we lost a considerable number of them because:</p><ul><li>It was confusing to see that message on a free trial start page.</li><li>It was placed at the beginning of the form, rather than the end, below the payment button.</li><li>There is a chance it’s quite a redundant message these days when it’s assumed and expected that all transactions are safe.
Heck, most browsers won’t even load unsecured sites anymore, so you expect to see something like this more on scam sites than legit products.</li></ul><p><strong>One small change in the right place.</strong> A fantastic example of just how small yet incredibly effective an A/B test can be is adding a single line of text to our product selector. We expected that mentioning the new AI capability for creating presentations would perhaps get us a small increase in bookings. Turns out, we were just thinking small, and the real increase was significantly higher. Granted, we couldn’t have added that line without the teams having built all the AI features, but it shows how important it can be to bring that to users’ attention where it makes the most impact in the flow.</p><p><strong>One big change in the wrong place.</strong> How many times have you seen complete redesigns being done on websites and apps in hopes of recapturing users, only to find they had no effect? In fact, they might have even made things worse. We did the same with our business users, but as an A/B test — why commit if you don’t have to, right? We expected a modest increase in bookings, and how wrong we were. Turns out, our business users were put off by the redesign; the features we thought were interesting to them made them pause enough that we ended up with an unexpected negative impact on bookings! 😱 The good news is, we quickly learned how not to do a redesign, and that we might want to highlight more relevant features to them in future iterations.</p><blockquote>A/B tests are the financially responsible way of developing software. Agile development on steroids, if you will.</blockquote><p>So, A/B tests to the rescue, right?
But in the history of software development there hasn’t yet been a solution that didn’t bring its own set of challenges, and that’s what I really want to focus on, so that anyone wanting to truly invest in experiments and improve their product does so with eyes wide open. The rewards may be undeniable, but the challenges aren’t negligible either.</p><h4>And what does that mean for engineers?</h4><p>On the web, opinions are split on whether having specialised skills as an engineer is better than being T-shaped. As a staff engineer, I have come to the conclusion that just like having the right tools for the right job, having the right engineers within the team is also crucial to success. On our team’s page, we have a short but sweet table of the skills the ideal engineer needs to feel comfortable on the team. In contrast, a native apps team’s table would probably look very different. See? Right people for the right job.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3oiVf1g4ssIHAQSxzHAH4A.png" /><figcaption>Our team mindset table — <em>Hire</em> the right people, with the right skills and the right mindset — screenshot by author</figcaption></figure><p>I won’t, however, just leave you with a table up for interpretation, as I think all of those traits are worth a paragraph or two of clarification.</p><ul><li>Impact-driven as opposed to creativity- or technology-driven. In over a decade, I have met more engineers passionate about technologies, coding paradigms and abstractions than those who just want to get stuff out there for the sake of learning and iterating. To be perfectly frank, you need both in an engineering organisation if you don’t want your product to become unusable and unmaintainable.
Heck, even within our team, some of us care more about the architecture, software integrity and efficiency of the product than others, but <strong>we all share the conviction that whatever we do must have a tangible benefit to us as a team, Prezi and, ultimately, the user</strong>. This is not the team where you randomly get to try a new frontend library and rewrite one of your services in Rust — as exciting as that may sound.</li><li>It’s important not to confuse being versatile with being a jack-of-all-trades. That being said, the ideal engineer on our team, <strong>while not necessarily a hard-core full-stack engineer, won’t shy away from jumping into either side of the codebase</strong>. In our case, that means a wide range of frontend and backend libraries and frameworks. It sounds intimidating perhaps, but in reality it’s a lot less about the expectation of knowing everything and a lot more about the openness to discover it all over time.</li><li>Being an efficient engineer deserves an article — if not a book — of its own, but let me condense it into a couple of thoughts. We engineers have a tendency to polish code, to refactor to the point of giving the impression we’re not writing software but sculpting the Venus de Milo. <strong>In an experiments team, we’re more focused on creating meaningful stick figures. As long as we’re able to gauge from the experiment the data we need, the goal is achieved.</strong> The code doesn’t have to be optimised (unless it’s getting in the way of being able to run the test), and keeping the implementation as simple as possible is a prime objective. As long as it’s testable and revertible, you have yourself a candidate for release.</li><li>Having a data-driven attitude is key, and I think it drives a lot of the other traits. How often have we engineers developed useless features over weeks, months, maybe even years? It’s not uncommon. In an experiments team, however, you don’t have the luxury to do that.
<strong>Unless there is data to support a code change, a new feature, a variant of a feature, it simply won’t happen.</strong></li><li><strong>Being an avid learner goes hand-in-hand with being data-driven.</strong> The focus in an experiments team is on understanding what happened but, more importantly, why, as the answer will drive the next experiments and possibly a considerable part of the product strategy.</li><li>A competitive engineer, comfortable with bold ideas, isn’t necessarily a reckless one. It also doesn’t mean a lot of “hacking stuff together”. <strong>It’s rather a fine-tuned skill of seeing through the technical challenges in such a way that they’re able to propose the shortest technically viable path to success</strong>, and that path doesn’t have to follow the status quo.</li></ul><p>On a personal note, I would argue that many of the above skills are worth picking up over time for any engineer. As one moves from company to company, from team to team, being able to adapt to different mindsets can very positively impact one’s career.</p><blockquote>If you find yourself having the opportunity to join an experiments team, go for it, learn from it, make the most of it. You’ll thank yourself later.</blockquote><h4>All fingers in all pies</h4><p>Before joining the GM team in Prezi, I was lead engineer on Prezi Video for Zoom, and later, on the first two waves of Prezi AI. Both, especially in the case of the former, meant that development time was mostly spent in a couple of repositories, in very distinct areas of the product. Prezi Video for Zoom was a web app of its own, and Prezi Present — where Prezi AI was released — is mostly a self-contained entity as well, unless you start veering into service territory, but we have dedicated teams for that. In contrast, the very first day I joined the GM team, I found myself checking out not one, not two, but a whole list of repositories, and as time passed, a few more.
I have eight running at the moment in my development environment, and that still doesn’t cover all the possible flows a user could take on the Prezi website. Add to that Prezi Present, which we still contribute to with experiments, and you have yourself a context in which certain complexities are unavoidable.</p><p>You may wonder, why unavoidable? Can’t other teams run their own growth and monetisation experiments in their respective areas of expertise and ownership? I have no doubt that in certain organisations that is possible. And even in Prezi, for instance, we were able to do that with Prezi Video for Zoom. Our Infogram team can also operate similarly, as it’s a distinct product. However, when it comes to the rest of what Prezi essentially is — the Prezi website, Prezi Present and Prezi Video — one has to approach it holistically, and we must be able to own the experiment end-to-end, which conveniently brings me to what an experiment lifecycle looks like.</p><h4>Experiment lifecycle</h4><p>A picture’s worth a thousand words, and because this article is vertiginously approaching 4000, I’ll rely on a diagram to tell most of the experiment lifecycle story.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xWHtX8RDPaAfFzw_HSlaVw.png" /><figcaption>Experiment lifecycle diagram created by author in Apple Freeform</figcaption></figure><p>Releasing an A/B test is — quite literally — only half the work and half the story, but let’s see briefly what these 13 steps in the experiment lifecycle are:</p><ol><li>Ideation is a somewhat nebulous step, and it involves a lot of product manager/product owner (PM/PO) sorcery outside the scope of this article, but generally speaking, ideas will be based on market research, data, previous findings, user feedback, etc.</li><li>Ideas there may be many, but it’s important to keep focus on what moves the company goals forward in a viable context.
Sometimes ideas can be really good, but other things need to happen before they become feasible.</li><li>Having a low-fidelity design — a rough sketch — of what the experiment and the user flow would look like can further validate the idea or uncover logical fallacies. At this point, you might already find engineers to be a great asset in the conversation.</li><li>Getting to the planning stage means this is now going ahead full-steam and gets into the upcoming sprint. In our case, we tend to work kanban style, so whoever is next willing and well-suited enough to pick the work up gets to do so. Every so often you’ll find that the experiment is not just a story, but an entire epic, in which case several engineers might allocate their time to it, led by a project lead.</li><li>Development is as self-explanatory as it can be. It’s the coding stage, including writing automated unit, integration and regression tests, adding the feature switches and getting everything into one (or more) <a href="https://medium.com/p/3fb5e1ad62d0">pull requests for code review</a>.</li><li>Our manual QA team member(s) ensure everything has been done to spec and execute some regression testing as well. Given the number of experiments we run, it’s much-needed peace of mind to know at least one set of objective eyes checks everything.</li><li>Releasing deserves a section of its own, so keep reading. For now, let’s just say it involves setting the feature switch configuration up for the desired cohorts and enabling them. Once it’s released, a cleanup task is automatically generated for a later date (see step 12).</li><li>Spot checking ensures we’re on the right track with the experiment: nothing blew up, and we’re not seeing any majorly negative results or collateral damage in signups or upgrades.</li><li>After a few weeks, the experiment is stopped, so no more new users are getting exposed to the test.
At times, we might allow the users who have been getting the test variant to keep access to it so we can further observe user behaviour. This usually lasts no more than another 2–3 weeks.</li><li>Evaluation is all about interpreting the data and understanding the learnings. This is the moment we may decide to release a variant (success) to all users or stick to the control (fail).</li><li>Rollout is essentially the outcome of the evaluation — all users get one variant going forward, which from that point on becomes the control.</li><li>Cleanup is another phase I deemed important enough to highlight in its own section, so do keep reading, but the short of it is, we ensure that all redundant code, tests, and feature switches are done away with. This triggers steps 4, 5 and 6, all culminating in the final step…</li><li>Everything is done. The variant is rolled out, the code is cleaned up, and we have either learned something (failed experiment) or achieved something (successful experiment).</li></ol><p>That’s the gist of the experiment lifecycle, but as I mentioned, there are a couple of stages there that are really worth digging into more to truly understand some of the complexities and challenges a team like ours can face on a daily basis.</p><h4>Dealing with feature switches</h4><p>Some will call them the best human invention since fire, while others, a necessary evil. I, for one, think they’re a very useful tool, but like every tool, they can be overused or misused. In our case, it’s invaluable to have the option of setting up a new feature switch for every experiment and variant. The more challenging part is keeping track of them all.</p><p>For context, we have 9 engineers on the team, and generally speaking, we aim for just as many experiments per sprint. Some quick maths suggests 160 experiments per year, but let’s go with a more conservative 100 experiments instead. Assuming just two variants per experiment, plus a switch to control the bucketing of each, already means 300 feature switches.
100 of those control the bucketing of the variants. If not handled correctly, things can quickly get out of hand, so we have devised some ways to avoid that:</p><ul><li>Adding a special prefix for feature switches that control the variants.</li><li>Using team-based feature switch prefixes.</li><li>Making sure each feature switch has clear ownership marked — we use a unique team email address.</li><li>Giving each feature switch a link to the experiment note or the Jira ticket it refers to.</li></ul><p>This varies from organisation to organisation, but in Prezi, it’s mostly the software engineers who add, configure and clean up feature switches. We opted for this approach as it keeps the control of software integrity in engineering’s hands. We don’t have to worry about product owners inadvertently breaking regression tests by turning switches on and off at the wrong time.</p><h4>Releasing an experiment</h4><p>While releasing an experiment will ultimately come down to just flipping a switch — a feature switch, that is — there’s a lot more to it, and how much exactly can vary from experiment to experiment. Some are a lot more involved than others. As I am writing this, I am working on an A/B test that involves three frontend bundles (think apps) and four different services. Even if you’re experienced and QA did a fantastic job making sure we haven’t broken anything, there are still a myriad of things that can fall through the cracks.</p><blockquote>To make sure releases go as smoothly as possible, we adopted an already standard practice from aviation and medicine — a checklist.</blockquote><p>Surgeons use Surgical Safety Checklists, and pilots rely on Pre-flight Checklists to ensure the best outcomes.
We call it a release document, but it’s really a checklist, as clearly stated in the head of each document:</p><blockquote>This document is meant to be used as a <strong>checklist</strong> for the person who’s driving the release to be able to do it in a calm, collected, professional way. Also meant to act as a document for others, so when troubleshooting is needed, all the information about what was happening during a release is recorded. — Prezi internal release document</blockquote><p>All such documents are signed off by at least one — but ideally two — senior or lead engineers on the team.</p><p>To some, this might seem excessive, and at times it really is, but in weighing the costs and benefits, as a team we concluded this approach gives us enough value and confidence to stick to it. Just to illustrate some of the items on the checklist, here’s what we look for:</p><ul><li>Have the relevant senior/lead engineers signed off on the plan?</li><li>What components are meant to be deployed and have they deployed successfully?</li><li>Have all relevant teams been notified about our intent to release the experiment?</li><li>What’s the feature switch configuration?</li><li>Is the testing scenario working on production as expected?</li><li>Is the A/B test distribution as expected on OpenSearch?</li><li>Any unexpected spikes in Grafana?</li><li>Are there any new relevant Sentry error logs?</li><li>What action(s) to take in case of needing to revert?</li><li>If all of the above is OK, notify internal stakeholders of the successful release.</li></ul><p>It’s a cross your “T”s and dot your “I”s kind of exercise, but out of it we get a log we can reference later and the assurance that anything that could have been prevented has been prevented because, you know… Murphy’s Law. 😉</p><p>Having released, however, doesn’t mean we’re done. Far from it.
There’s cleanup, and it’s such an important part of what our team does that I felt it deserved its own section, so without further ado…</p><h4>Cleaning up</h4><p>I hate doing the dishes, so by week’s end there’s a literal pile of them waiting to be washed. Now, remember those 300 feature switches? That’s precisely the pile we desperately want to avoid. Because feature switches, as useful as they are, quickly pollute the code to the point it becomes unmaintainable, which would result in us losing more and more velocity over time. As a team, you can easily grind to a screeching halt if code is not maintained, and as an experiments team, we’re particularly prone to having this happen if we’re not vigilant.</p><p>One way we’re working on preventing such a situation is by automatically creating cleanup tickets for each experiment. Jira isn’t so bad after all, aye? 😄 You see, once an experiment goes live, it will stay live for at least a couple of weeks. Gathering useful enough data to make pragmatic product decisions doesn’t happen instantly, so usually a few weeks after the A/B test release a decision gets made. Either we stick to what we had before — aka we keep the control variants — or we keep one of the other variants. Often it’s just one, but there are times when an A/B test has a total of as many as four variants. Let me pseudocode an example:</p><pre>if(isActive(&#39;amazing-feature-variant-a&#39;)){<br>   &lt;ABTestComponentVariantA&gt;...&lt;/ABTestComponentVariantA&gt;<br>} else if(isActive(&#39;amazing-feature-variant-b&#39;)){<br>   &lt;ABTestComponentVariantB&gt;...&lt;/ABTestComponentVariantB&gt;<br>} else if(isActive(&#39;amazing-feature-variant-c&#39;)){<br>   &lt;ABTestComponentVariantC&gt;...&lt;/ABTestComponentVariantC&gt;<br>} else {<br>   &lt;ControlVariant&gt;...&lt;/ControlVariant&gt;<br>}</pre><p>Regardless of which one we keep, three of those have to go.
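</p><p>To make that concrete, here is a hedged, runnable JavaScript sketch of the same branching with a stubbed-out feature switch lookup, and what the code collapses to once a variant wins (the isActive stub and all names are illustrative, not Prezi’s actual API):</p>

```javascript
// Illustrative stub of a feature-switch lookup; in reality this would
// query a feature switch service. All names here are hypothetical.
const activeSwitches = new Set(['amazing-feature-variant-b']);
const isActive = (name) => activeSwitches.has(name);

// The branching from the pseudocode above, returning a component name
// instead of rendering one, so the sketch stays runnable.
function renderFeature() {
  if (isActive('amazing-feature-variant-a')) return 'ABTestComponentVariantA';
  if (isActive('amazing-feature-variant-b')) return 'ABTestComponentVariantB';
  if (isActive('amazing-feature-variant-c')) return 'ABTestComponentVariantC';
  return 'ControlVariant';
}

// After cleanup (say variant B won), the three losing branches and the
// switches are deleted, leaving only:
function renderFeatureAfterCleanup() {
  return 'ABTestComponentVariantB';
}

console.log(renderFeature()); // 'ABTestComponentVariantB'
```

<p>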
You can imagine, of course, that oftentimes an A/B test is far more involved than just showing a component or not, so cleanup can become quite an undertaking: you want to make sure you understand the variants that have been added, their relationship with the rest of the codebase, and the overall user flows, so that cleaning up doesn’t result in collateral damage. This typically means editing tests as well.</p><p>You might wonder: in the case of a lost A/B test, where we end up sticking to control — to what we had before — can’t we just revert the original PR? The answer is: maybe, perhaps partially, or not at all, for the following reasons:</p><ul><li>You might be able to revert if the initial change was very clean and no other changes have been made to those files since. In a high-traffic codebase, that’s quite unlikely, though.</li><li>You might only be able to do a partial revert if some of the changes happened in a low-traffic codebase, while others happened in higher-traffic codebases. The A/B test I am working on right now touches several repositories. I could imagine one or two of those repositories seeing light enough traffic that I could just revert, but the rest would require a more involved approach.</li><li>If you’re only dealing with high-traffic codebases, you simply don’t have this option. You will also find cases where you added code for one of the variants that’s actually going to be useful for future work. Maybe you wrote a nice utility function, or refactored some code as part of the A/B test to make your life easier. You surely don’t want to revert that.</li></ul><h4>When all is clean and done</h4><p>I won’t gaslight you into thinking we don’t deal with technical debt, awkward tech stacks, or breaking pipelines like every other team and engineering organisation out there. We do, and some of our challenges aren’t even new to many developers. 
It’s more like a unique flavour of what other teams deal with daily, and it’s unique enough that we found ourselves having to fine-tune how we do things, improve our processes, and continuously refine and shape ourselves as engineers into individuals reflecting the skills (and mindset) table illustrated earlier.</p><p>This is what has worked for us. This is what gets things done. For now. Just like we experiment with features, we experiment with ourselves as individuals and as a team. Sometimes that means we succeed, other times it means we learn and move on, or we learn <em>to</em> move on. It’s a journey, and it requires stamina, but ultimately, it’s well worth the effort. So, yes, A/B tests for the win! 🎉</p><p><em>Attila Vago — Software Engineer improving the world one line of code at a time. Cool nerd since forever, writer of codes, blogs and books. </em><a href="https://www.goodreads.com/book/show/205716390-it-s-cold-ma-it-s-really-cold"><strong><em>Author</em></strong></a><em>. Web accessibility advocate, LEGO fan, vinyl record collector. Loves craft beer! </em><a href="https://attilavago.medium.com/my-200th-article-hello-its-time-we-met-3f201ad1303"><strong><em>Read my Hello story here!</em></strong></a><strong><em> </em></strong><a href="https://attilavago.medium.com/subscribe"><strong><em>Subscribe</em></strong></a><strong><em> </em></strong><em>for more stories about </em><a href="https://medium.com/@attilavago/list/lego-all-the-things-083f80bd3c51"><strong><em>LEGO</em></strong></a><strong><em>, </em></strong><a href="https://medium.com/@attilavago/list/technology-tech-news-a2d2d509b856"><strong><em>tech</em></strong></a><strong><em>, </em></strong><a href="https://medium.com/@attilavago/list/coding-software-development-d123369e3636"><strong><em>coding</em></strong></a><strong><em> and </em></strong><a href="https://medium.com/@attilavago/list/accessibility-4b67c1d08ef3"><strong><em>accessibility</em></strong></a><em>! 
For my less regular readers, I also write about </em><a href="https://medium.com/@attilavago/list/the-random-stuff-96bfc5a222e5"><strong><em>random bits</em></strong></a><em> and </em><a href="https://medium.com/@attilavago/list/writing-writing-tips-f83ef5e79de5"><strong><em>writing</em></strong></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=349a94960b4f" width="1" height="1" alt=""><hr><p><a href="https://engineering.prezi.com/a-rare-insight-into-the-daily-challenges-of-an-experiments-team-349a94960b4f">A Rare Insight Into The Daily Challenges Of An Experiments Team</a> was originally published in <a href="https://engineering.prezi.com">Prezi Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How Prezi replaced a homegrown Log Management System with Grafana Loki]]></title>
            <link>https://engineering.prezi.com/how-prezi-replaced-a-homegrown-log-management-system-with-grafana-loki-15111174ff91?source=rss----911e72786e31---4</link>
            <guid isPermaLink="false">https://medium.com/p/15111174ff91</guid>
            <category><![CDATA[logging]]></category>
            <category><![CDATA[sre]]></category>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[prezi]]></category>
            <category><![CDATA[software-development]]></category>
            <dc:creator><![CDATA[Alex]]></dc:creator>
            <pubDate>Thu, 08 Feb 2024 15:26:37 GMT</pubDate>
            <atom:updated>2024-02-08T15:26:37.121Z</atom:updated>
<content:encoded><![CDATA[<p>Prezi has a sophisticated engineering culture in which solutions are built to do the job. Some of the solutions built in the past stood out and aged well. In other areas, solutions have lost traction compared to industry standards.</p><p>In the second half of 2023, we modernized one of those areas that was no longer market-standard: Prezi’s log management system. This is our testimonial.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*pmZClgRKPe7CZ0EI" /><figcaption>Photo by <a href="https://unsplash.com/@alvaroserrano?utm_source=medium&amp;utm_medium=referral">Álvaro Serrano</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><h3>How it was before</h3><p>We traced the beginning of the existing solution back to 2014, so it is safe to say that it was a stable solution.</p><p>The following diagram depicts the solution. Every workload Prezi ran was instrumented with a special sidecar that took care of handling all log messages. 
That sidecar was built on top of two open-source solutions. The first, scribe (<a href="https://github.com/facebookarchive/scribe">https://github.com/facebookarchive/scribe</a>), a tool built by Facebook and archived on GitHub in 2022, took care of receiving log events, aggregating them, and sending them downstream.</p><p>The second component, stunnel (<a href="https://www.stunnel.org/">https://www.stunnel.org/</a>), took care of encrypting communication from the workload systems to the central system.</p><p>Prezi collected log events from all environments in one central place and made them accessible to engineers.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*6embu5Y5a8PlAbE6" /><figcaption>legacy log management system</figcaption></figure><p>Yes, the picture is telling the truth: for a good part of 2023, the consumption of collected log events happened over SSH, not through any UI.</p><p>That alone was reason enough to reimplement the whole solution and rebuild it with current best practices in mind. Our goal was to make the user experience more accessible and the query results easier to share.</p><h4>Log shipping</h4><p>With that in mind, we started the project’s first iteration. Our first take was to provide a central system that could aggregate and display log events in a more user-friendly way. We also wanted to get rid of the sidecar to ease operational load: while a sidecar per se is not a bad thing, and a very battle-proven design pattern, it comes with certain costs when running thousands of pods.</p><p>The sidecar solution was born when Prezi’s workload ran on Elastic Beanstalk, where it meant just an additional container on a probably oversized EC2 instance.</p><p>With the shift to <a href="https://kubernetes.io/">Kubernetes</a> as the workload engine, the oversized EC2 instance vanished, but the sidecar remained. 
Kubernetes also offers a very standardized way to consume logs from containers: stdout and stderr of each container are written to files on the Kubernetes worker hosts by the container runtime, and files can be consumed easily.</p><p>We did exactly that and used one of the established tools in that domain — filebeat — which is capable of reading the mentioned files and enriching the resulting events with metadata from Kubernetes, e.g. pod name, container name, and namespace.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*kdRkH0zJRvSJmhWJ" /><figcaption>Details on the log shipping process</figcaption></figure><p>This was the first optimization. The second was how events would be sent to downstream systems.</p><p>Operating in cloud environments requires shipping events away from nodes quickly, as those nodes can vanish at any time.</p><p>A common design pattern for this is to use a message queue as the first persistence layer. This can protect downstream systems in case of event bursts. It also decouples the individual parts from each other, which can be helpful for maintenance or even the replacement of tools.</p><p>Most of the time, the message queue used for this is an <a href="https://kafka.apache.org/">Apache Kafka</a> installation, which is capable of storing events at scale. As we already used a Kafka setup to store business events from multiple sources, we went that route without digging further into alternative persistence layers.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*8ObkuZnvyiNT9IKs" /><figcaption>Sending events to a message queue</figcaption></figure><p>Once the events are in the queue, they can be parsed and ingested into a central system.</p><h3>Parsing and Storing</h3><p>In our first take on this, we planned to set up the central log management system inside our cloud environment. 
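</p><p>Conceptually, whatever ingests from the queue is just a Kafka consumer that parses each raw line and attaches the workload metadata. Here is a minimal Python sketch of that parsing step (illustrative only, not Prezi’s actual code; field names are made up):</p>

```python
import json

def parse_event(raw: bytes, metadata: dict) -> dict:
    """Parse one raw log line from the queue and attach workload metadata."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        # Not every application logs JSON; keep the raw line instead of dropping it.
        event = {"message": raw.decode("utf-8", errors="replace")}
    event.update(metadata)  # e.g. pod, container, namespace added by filebeat
    return event

evt = parse_event(b'{"level": "info", "message": "user logged in"}',
                  {"namespace": "auth", "pod": "auth-7d4f"})
print(evt["namespace"])  # -> auth
```

<p>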
For such a system, there are two major options: build something with <a href="https://www.elastic.co/elasticsearch">Elasticsearch</a>, or use <a href="https://grafana.com/oss/loki/">Grafana Loki</a> as the backend.</p><h3>The very first take</h3><p>We started with the <a href="https://aws.amazon.com/opensearch-service/">AWS OpenSearch</a> service as a backend and <a href="https://www.elastic.co/logstash">Logstash</a> to feed events from Kafka into our OpenSearch cluster.</p><p>As we run most of our software on Kubernetes, we also set up Logstash on Kubernetes and soon discovered all the joy of running a JVM inside containers. We suffered frequent out-of-memory kills of that component.</p><p>Storing and indexing a massive amount of data in OpenSearch led to massive indexes that soon were no longer manageable. This was caused by the vast number of non-standardized fields in the application logging. The heterogeneity of the fields and their contents led to a lot of parsing errors. The most prominent example is the time and date format: some applications logged Unix timestamps, whereas others used a string representation.</p><p>We discovered that if we don’t control the sources, a solution based on OpenSearch would not serve us well. Controlling the sources by evangelizing a common log schema across all applications would have been the only way to make this work.</p><h3>The overhaul</h3><p>We started to look for an alternative to Logstash to get rid of the memory issues and began replacing it with vector.dev, which has a smaller footprint, a more flexible configuration, and support for more backends. Logstash, without modification, is tied closely to the OpenSearch ecosystem. 
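</p><p>For illustration, a Vector pipeline for this path is declared as sources and sinks in TOML. A minimal sketch might look like the following (all names and endpoints are placeholders, not our configuration, and option names can vary between Vector versions):</p>

```toml
# Sketch of a Vector pipeline: a Kafka source feeding the search backend.
# Placeholder values throughout; consult the Vector docs for your release.
[sources.app_logs]
type = "kafka"
bootstrap_servers = "kafka:9092"
group_id = "vector-log-parser"
topics = ["app-logs"]

[sinks.search]
type = "elasticsearch"
inputs = ["app_logs"]
endpoints = ["https://opensearch.example.com:9200"]
```

<p>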
But as hinted above, there is another major option for storing log events: Grafana Loki.</p><p>With the replacement of Logstash, we got rid of the constant restarts, but not of the constant indexing errors.</p><p>Soon we started to look into Loki as an alternative. We also considered a hosted option, as running and maintaining a log management system is not one of our core tasks. Running that system is more or less a commodity and takes away precious time that could be spent otherwise.</p><p>Focusing on our core tasks as the SRE team is also beneficial for Prezi’s customers.</p><h3>Optimizing the central systems</h3><p>Looking at log management systems is in most cases also a make-or-buy (host) decision: does one want to self-host the whole aggregation system, or can it be offloaded to some 3rd-party vendor?</p><p>Security and compliance concerns aside, this mostly boils down to the question of “How much can we spend?”.</p><p>With the security clearance to send logs to a 3rd-party vendor and the budget to do so, we started to look at the hosted version of Loki. It turned out to be within our cost range and to serve us well: they had no issues with our ingestion rate. The way Loki stores log events as streams was perfect, as it moved the problem OpenSearch had with the variety of field contents from indexing time to query time. With Loki, those differences surface at query time and can be tackled by predefined dashboards. This way, we don’t lose any events to parsing errors.</p><p>Events stored in Loki are consumed through a very common user interface: Grafana, a well-known dashboarding solution that was already in use. With that, engineers can rely on existing tool knowledge.</p><p>Offloading logs to an external vendor also removes them from your direct control. To avoid any issues with retention periods, we also started writing logs to S3 as an archive. 
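</p><p>To illustrate that query-time handling: a LogQL query in Grafana that filters a stream and parses fields on the fly might look like this (label and field names are made up, not our actual labels):</p>

```logql
{namespace="auth", container="api"} |= "error" | json | line_format "{{.message}}"
```

<p>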
With that S3 copy, we retain control over the logs and can use them in case we need them later.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*PbXorBPK4zfXl1l1" /><figcaption>Details on parsing and storing</figcaption></figure><p>With that last piece in place, we were able to shut down the above-mentioned original log management system at the end of 2023.</p><h3>The result</h3><p>Looking at the completely new log management system, we went from a very homegrown solution to a modern stack:</p><ul><li>We consume logs via a standard API of the container runtime.</li><li>Sending the events to Kafka enables us to consume them decoupled from their creation time. Kafka also stores events for a certain period, so any downtime of downstream systems does not cause data loss.</li><li>Vector enables us to feed events into multiple sinks. Even though not outlined above, it also lets us make certain parsing and routing decisions on events. But that is part of another story.</li><li>Loki enables us to consume event streams via the well-known Grafana UI and query a vast amount of data in real time.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*nioxAbvHaC3AJLta" /><figcaption>The whole process</figcaption></figure><h3>Lessons learned</h3><p>The whole project took us the better part of a year until we shut down the old solution. We took this amount of time to verify that everything was set up well, that all engineers were onboarded and familiar with the solution, and that it could handle all the different peak situations.</p><ul><li>Keeping the old system running was a good decision. It allowed us to optimize the new system until it could handle the load and satisfy our needs.</li><li>Advertising a common logging schema throughout a company is beneficial. That schema makes collecting and analyzing events simpler. 
It gives a better user experience, too, because, for example, a timestamp is always in the same format.</li><li>Controlling log levels and a shared understanding of the various levels are also crucial: what one engineer logs as debug, another emits as info. Creating a common understanding is helpful.</li><li>Decoupling the different components from one another enables us to change them if we have other requirements or find better solutions. E.g. if we become unhappy with Vector, we can replace it without any hassle, as the interface between the log source and Vector is Kafka.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/894/1*FBFUJS8_X6pBw2BPL60RkA.png" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=15111174ff91" width="1" height="1" alt=""><hr><p><a href="https://engineering.prezi.com/how-prezi-replaced-a-homegrown-log-management-system-with-grafana-loki-15111174ff91">How Prezi replaced a homegrown Log Management System with Grafana Loki</a> was originally published in <a href="https://engineering.prezi.com">Prezi Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How Prezi Serves Customer Traffic]]></title>
            <link>https://engineering.prezi.com/how-prezi-serves-customer-traffic-60fc9711702b?source=rss----911e72786e31---4</link>
            <guid isPermaLink="false">https://medium.com/p/60fc9711702b</guid>
            <dc:creator><![CDATA[Alex]]></dc:creator>
            <pubDate>Tue, 09 Jan 2024 09:56:32 GMT</pubDate>
            <atom:updated>2024-01-09T09:56:32.688Z</atom:updated>
<content:encoded><![CDATA[<p>Prezi has a global audience that depends on the fast and reliable accessibility of its content. In this article, we look into the way Prezi serves content from a network perspective.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*wAbH9ve3pibIrm0F" /><figcaption>Photo by <a href="https://unsplash.com/@dead____artist?utm_source=medium&amp;utm_medium=referral">Z</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>See this article as a general overview of how content can be served on a global scale. It is not the only solution, and probably not the ultimate one, but it is one way to do it.</p><p>The overall flow is depicted in the following image. Prezi runs on AWS and uses AWS services to offer customer-facing internet endpoints.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/778/1*eddbE2b-YnTWyh9oNkoIeg.png" /><figcaption>Network diagram</figcaption></figure><p>The DNS zones and records are managed in Route53. Customer traffic goes through AWS Global Accelerator to decrease latency before it is filtered by AWS Web Application Firewall (WAF). The traffic is then terminated at the Application Load Balancer (ALB) and forwarded into the cloud environment, where most of the workload runs inside Elastic Kubernetes Service (EKS).</p><p>Some of the customer traffic goes to AWS CloudFront, which is used to deliver media assets that benefit from being cached closer to the customer.</p><p>The rest of this article goes over these components, what they do, and the benefits they offer.</p><h3>Find the best path (AWS Route53 and Global Accelerator)</h3><p>Having customers worldwide and offering services over the Internet poses multiple challenges. One of them is to reduce latency. Cloudflare defines latency as the “amount of time it takes for a data packet to go from one place to another. 
Lowering latency is an important part of building a good user experience.” (source <a href="https://www.cloudflare.com/en-gb/learning/performance/glossary/what-is-latency/">https://www.cloudflare.com/en-gb/learning/performance/glossary/what-is-latency/</a>)</p><p>That said, the challenge in having customers worldwide is the heterogeneity of the network most people simply call “the internet”.</p><p>When we look at the lower network layers of the internet topology, we can see many different networks peered together.</p><p>The following image shows parts of the peering connections in Latin America that form the internet’s backbone. For a data packet, going from South America to Miami means traversing multiple networks, and every network adds a little to the total travel time.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*5spQvhNu27Ddsual" /><figcaption>taken from <a href="https://global-internet-map-2022.telegeography.com/">https://global-internet-map-2022.telegeography.com/</a></figcaption></figure><p>Going back to the challenge of controlling latency for customers, there are, generally speaking, two options:</p><ul><li>Offering services close to the customer to avoid long network journeys</li><li>Offering a fast path from the customer to the place where services are offered.</li></ul><h4>The best path for most of the world</h4><p>Prezi uses the second option by offering a fast path to services via AWS Global Accelerator. This service enables customer traffic to be routed most of the time via the global AWS network instead of the public internet.</p><p>This routing reduces latency. In experiments from my local machine, optimized requests traveled 200ms faster than the non-optimized ones. The total time until I got an answer went down from 800ms to 600ms. <br>Loading the Prezi dashboard when logged in currently requires roughly 150 individual requests, all of which benefit from the 25% decrease in latency. 
<br>Please keep in mind that the real percentage of acceleration depends on multiple factors, like location and the current routing situation.</p><p>Whenever a customer sends requests to prezi.com, those requests are routed to the closest AWS network endpoint and then transferred inside this global network.</p><h4>And the best path for inhabitants of Virginia</h4><p>As stated in the headline of the previous chapter, most Prezi customers go through Global Accelerator, except those who reside in Virginia. Those customers are already close enough to the service endpoint and are routed directly to the following components.</p><p><em>Note</em>: the network diagram above does not show this route to avoid being too complex.</p><h3>Implementation</h3><p>To achieve this, Prezi leverages geo-balanced DNS queries in Route53 so that different IP addresses are returned depending on the location.</p><p>The following screenshot shows a practical example. The first lookup is executed from a local machine in Europe, and the second with a VPN endpoint in Virginia.</p><p>The first DNS query returns the endpoints for the Global Accelerator, and the second query from Virginia returns the endpoints of an AWS load balancer (see the following chapter).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*er-snPt_ZVhdB3Xf" /><figcaption>Terminal showing different DNS lookup results depending on location.</figcaption></figure><h4>Alternatives</h4><p>The alternative to this network-based approach is to move offered services closer to the customer. This can be achieved, for example, by deploying instances into selected cloud regions. 
To achieve this, the whole application stack would need to be deployed in each region, and some backend synchronization would be needed, as the Prezi suite enables collaboration between multiple users.</p><p>Serving from a single region reduces the complexity and streamlines deployment.</p><h3>Protection (AWS WAF and Shield)</h3><p>While the internet is a wonderful place to connect, collaborate, be creative, and a lot more, it is at the same time also a place that attracts bad actors. Public and well-known endpoints are a widespread target of distributed denial-of-service (DDoS) attacks. Prezi leverages the combination of AWS Web Application Firewall (WAF) and Shield to protect the downstream infrastructure from these threat vectors.</p><p>Every request that needs to reach Prezi infrastructure is evaluated through these components. Certain endpoints are protected via a specific rate limit to make sure they are not hammered.</p><p>For example, it does not make sense to send multiple requests to the login endpoint within a small amount of time. To protect sensitive endpoints, the AWS WAF can respond with HTTP/429 (<a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429">https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429</a>). <br>See the following screenshot of how a triggered rate limit looks in the browser console:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*wRfL6hOiUq5NUAKw" /><figcaption>Chrome Developer tools show one HTTP response</figcaption></figure><p>On a bigger scale, the traffic flow is monitored by AWS Shield; when Shield detects a DDoS attack from multiple traffic sources, those sources get blocked.</p><h4>Alternatives</h4><p>Offering services over the Internet without any protection is a bad idea. Any public-facing IP attracts traffic, and once a company reaches a certain scale, it attracts bad actors. 
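</p><p>As an aside, the rate-limit behaviour described above can be sketched as a toy fixed-window limiter answering with HTTP 429 (illustrative only; AWS WAF uses its own rate-based rules, and none of these numbers are Prezi’s):</p>

```python
class FixedWindowRateLimiter:
    """Toy fixed-window rate limiter; NOT how AWS WAF is implemented."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # (client, window index) -> requests seen

    def check(self, client: str, now: float) -> int:
        """Return the HTTP status a gateway could answer with: 200 or 429."""
        key = (client, int(now // self.window))
        self.counts[key] = self.counts.get(key, 0) + 1
        return 200 if self.counts[key] <= self.limit else 429

limiter = FixedWindowRateLimiter(limit=3, window_seconds=60.0)
print([limiter.check("10.0.0.1", now=1.0) for _ in range(4)])  # -> [200, 200, 200, 429]
print(limiter.check("10.0.0.1", now=61.0))  # new window -> 200
```

<p>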
There are alternative solutions and vendors, like Cloudflare or Akamai, that can offer the same protection service. As we run our workload on AWS, the natural choice is AWS WAF, as the integration is easy.</p><h3>Access (ALB, EKS, API Gateway)</h3><p>Requests allowed to enter reach the AWS-managed load balancer fleet that routes those requests into the VPC environment hosting the actual workload. The load balancer uses our public TLS certificate to offload HTTPS connections from customers.</p><p>The application load balancer (ALB) is used for routing based on the HTTP Host header. This means that, based on the domain used, ALB can forward traffic into the isolated workload environment.</p><p>Running inside the Kubernetes fleet is a self-written API gateway. The purpose of this component is to build more detailed routes based on request paths or other identifiers. Most of the backends are based on Python and Scala. Those pods run inside the Kubernetes offering of AWS: Elastic Kubernetes Service. <br>Traffic is routed into these pods either by a WSGI-conformant application server in Python land or directly by the JVM for Scala services.</p><p>As the mentioned API gateway also runs inside Kubernetes, it can forward traffic to the target backend services based on different routing guidelines within the cluster network. The API gateway offers the flexibility to do advanced routing to the microservices based on configuration by the developers.</p><p>Thinking back to the scope of our AWS WAF usage, there was no check for malicious content in requests. We use a different web application firewall to check for bad requests and to protect against cross-site scripting, injections, and other things that might harm Prezi — or our customers.</p><h3>Content delivery (CloudFront)</h3><p>Prezi’s main purpose is to deliver amazing presentations that most of the time contain visuals like images and GIFs. 
They can be served via a content delivery network (CDN) that caches content closer to the customer.</p><p>Loading resources from a CDN decreases the time the user waits for resources to appear.</p><p>On the cost side, it is also cheaper to serve content from CloudFront than to serve it from the backend every time. This applies especially to assets like images that don’t change often.</p><p>Due to the deep integration into the ecosystem, in our setup there is no other choice than CloudFront. Technically, it should also be doable with Cloudflare or any other CDN vendor.</p><h3>Wrap up</h3><p>The article above describes the architecture Prezi uses to serve content to a global audience.</p><p>There are multiple different ways to serve traffic — even if running on AWS.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=60fc9711702b" width="1" height="1" alt=""><hr><p><a href="https://engineering.prezi.com/how-prezi-serves-customer-traffic-60fc9711702b">How Prezi Serves Customer Traffic</a> was originally published in <a href="https://engineering.prezi.com">Prezi Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>