<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Prezi Engineering - Medium]]></title>
        <description><![CDATA[The things we learn as we build our products - Medium]]></description>
        <link>https://engineering.prezi.com?source=rss----911e72786e31---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>Prezi Engineering - Medium</title>
            <link>https://engineering.prezi.com?source=rss----911e72786e31---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Fri, 13 Mar 2026 04:27:03 GMT</lastBuildDate>
        <atom:link href="https://engineering.prezi.com/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[We Tried Spec-Driven Development So You Don’t Have To]]></title>
            <link>https://engineering.prezi.com/we-tried-spec-driven-development-so-you-dont-have-to-56d52231c19e?source=rss----911e72786e31---4</link>
            <guid isPermaLink="false">https://medium.com/p/56d52231c19e</guid>
            <category><![CDATA[coding]]></category>
            <category><![CDATA[productivity]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[software-engineering]]></category>
            <dc:creator><![CDATA[Attila Vágó]]></dc:creator>
            <pubDate>Mon, 16 Feb 2026 09:07:53 GMT</pubDate>
            <atom:updated>2026-02-16T12:04:53.762Z</atom:updated>
            <content:encoded><![CDATA[<h4>We threw spec-driven development at four teams, and the results are both terrifying and exciting…</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5yu8c9w2_dNQvYS1Bt70lw.jpeg" /><figcaption>Istanbul, just like software development, has gone through a lot of change over the years. Photo by author.</figcaption></figure><p>Istanbul is a city of nearly 16 million people. Just switching metro lines can take as long as 20 minutes through a maze of tunnels and escalators. I got my 10,000 daily steps just walking through the airport and Istanbul metro tunnels on my way to the hotel. Taking a photo from the 31st floor, the sheer size of the city becomes evident. Also daunting. Akin to a 15-year-old legacy codebase that has been touched by all the tech hypes and lows of the last two decades of software development. It’s called evolution. Istanbul throughout history has changed and evolved many times, and interestingly enough, code and software development practices aren’t any different. They also evolve, and I wanted to give us the chance to do just that at our latest off-site event in Turkey. This time by introducing a paradigm-shifting new approach — spec-driven development.</p><h4>What is spec-driven development, and why do we need it?</h4><p>Need is probably a strong word for most things in software development. Apart from maybe security. Even something like SSL on a static brochure site is, you could argue, overkill. We never needed JavaScript frameworks or libraries, but we have them, and they often help us get there faster and deliver more complex applications. We never needed CDNs, but they do help load our apps and sites faster. We never truly needed Agile, but in far too many cases to ignore, it helped us build software differently and act on user feedback faster. 
I believe the same applies to spec-driven development.</p><p>Spec-driven development itself is an emerging term, and as such its definition tends to be a bit fuzzy. The bottom line, however, isn’t, and folks at Kiro (Amazon), GitHub, and other industry leaders agree.</p><blockquote>With spec-driven development, your specification becomes the source of truth rather than the code.</blockquote><p>That is a massive paradigm shift not only for engineers, but also for anyone adjacent to engineering teams. But it’s not as exotic an approach as one might think. We’ve — sort of — done this before. To cope with the initial shock of <em>“code is not the primary artefact anymore”</em> you can, to an extent, find similarities with TDD (test-driven development), BDD (behaviour-driven development) and even an oldie but goodie — MDD (model-driven development). I might even go as far as to state that spec-driven development is the culmination of TDD, BDD and MDD in an AI-fuelled engineering org.</p><p>What you call a spec is really just documentation that’s useful to both machine and human. Call it a good prompt if you like. How many times have you sat through endless hours of planning and refinement sessions? How many times have you opened a Jira or Linear ticket only to find practically no useful information, prompting you to ping everyone but your mum about what actually needs to be done? How many companies call themselves Agile and haven’t written a single user story in the <em>”I as a user… when… then…”</em> format in 10 years? Exactly.</p><h4>Quite the paradigm shift when applied in practice</h4><p>Spec-driven development changes that. Radically. It might seem at first like “change for change’s sake”, but once you give it a try you’ll — funnily enough — start understanding the aspects that made TDD, BDD, and yes, even MDD interesting and popular, at least for a while. That was the whole point of hosting the workshop. Make ourselves think differently. That’s not just an Apple tagline. 
You really do sometimes need to put yourself in a position that triggers a different way of thinking.</p><blockquote>I find that letting people build things they’re truly passionate about allows them to explore new ideas better, adopt new processes more organically, and be more productive.</blockquote><p>The last thing I wanted was to make it all seem like a forced mandate. I’ve lived through a few “from today we’re an Agile company” workshops. As fun as they <em>can</em> be — especially if you’re doing them with <a href="https://www.linkedin.com/in/danolsen98/">Dan Olsen</a> — the mere directive that it’s now the way forward for everyone, no discussion allowed, not only dampens the excitement but breeds suspicion and skepticism.</p><p>While we had a dedicated room for the workshop, I encouraged people to try spec-driven development with spec-kit anywhere they felt like they could be their most creative selves. We are a fully remote company anyway. One of my colleagues felt it was best to spend some quiet time in his room. Four hours later, he emerged with a Workout Boss app! Others were done in just an hour or so, which brings me to how this all works.</p><p>For the purposes of this article and some 1:1 workshop sessions I’ve had with non-technical folks over the last week, I slimmed it all down to a 6-step process:</p><ol><li>Define the constitution — essentially a prompt that fills in the constitution.md template with mostly technical constants/constraints of your application. Ideally, you don’t touch this often.</li><li>Specify is where you need a lot of help from your product manager, or you need to act like one. The more detailed the information, the better. Ideally, this is not just a few sentences, but many long paragraphs. The output will be in the spec.md file in the form of “given &gt; when &gt; then” stories, all prioritised. This step alone could save tons of time for product managers. 
Whether you want those stories to be automatically delivered to Jira or Linear is of course up to you. I think in the future, just like code, stories will be far too ephemeral to be worth saving in another tool, and they’re in your source code anyway—because part of your source code now is also the spec with all the stories!</li><li>Self-solve anything that requires clarification. For brevity’s sake, you can use this step, especially if you’re just prototyping.</li><li>Plan. This sort of rounds out your strategy for the app, and some engineering input is again useful/necessary. Here’s an example prompt:<em> I am going to use plain React.js with no databases, data is embedded in the content for the mock content. Site is responsive and fully ready for mobile. Also, available in multiple languages like English, Hungarian, and Finnish.</em> It will trigger some changes in the spec, and that’s OK.</li><li>Break the effort down into tasks. At this point, it’s all about implementation. You’ll see several tasks as you would in a nicely broken-down story. It’s kind of funny to see AI do a much better job than us at this stuff.</li><li>Implement. Sit back and maybe hit approve a few times when Cursor wants to run some commands.</li></ol><p>It was very intriguing to see some of my colleagues get practically first-time exposure to just how much AI can do in software development. I was quite surprised to see just how many people hadn’t integrated AI into their day-to-day development work as much as I had. Some tried changing things after the initial app was built. Another team-mate of mine migrated her entire app to Material UI in two minutes, and while you could argue it wasn’t a large application, it did replicate solving a real-life engineering challenge. In a different company, seven frontend teams did that over nearly 12 months!</p><blockquote>With spec-driven development, your code merely becomes the output of your work. It’s like the rendered MP4 file in a video project. 
You want a change? You edit the project, not the pixels in the rendered video.</blockquote><p>And that’s quite tough to swallow at first. What do you mean, code isn’t the centre of the universe anymore? Review markdown files instead of code? What sort of nonsense is that, right? Well, it’s not, and let me bring TDD back into the mix. As a concept, I always liked it. The problem was never TDD; the problem was that Product often doesn’t provide us with enough information, in a structured enough manner, to easily write the tests upfront. But writing a spec that results in acceptance criteria, which you can then verify with tests, is a form of TDD. The difference is that now all those requirements are met with machine-generated code and verified by machine-generated tests.</p><h4>A mechanism for change</h4><p>I am of the opinion that if you throw the wildest idea out there, you’ll always find someone who will attempt to make sense of it. Visionaries, and people who act rather than overthink, frequently do this. As <a href="https://medium.com/u/26e121e22f50">Manuela Olivero</a> puts it: <em>“</em><a href="https://medium.com/@manuelaolivero/why-smart-people-often-dont-succeed-e2212e2e36b4"><em>They don’t start with answers. They start with the assumption that answers are findable.”</em></a> And I strongly believe that throwing spec-driven development at your team or even your entire engineering org will produce change and initiate the right kind of conversations. Let me show you a few interesting ones that came up in our teams.</p><ul><li><strong><em>A potential spec-scaling problem over time and project size.</em></strong> Yes. Creating a mess in Markdown is just as easy as it is in any other language. The good news is, you can apply spec-driven development at all levels.</li><li><strong><em>Nice development stages, good experience overall.</em></strong> It turns out that engineers do like processes when they make sense. 
Who would have thought?</li><li><strong><em>Works great with Cursor</em></strong> (we had people in the room for whom this was their first time using Cursor), but folks did run into issues using other tools. Running out of Claude Code tokens was the most common issue. On Cursor, we didn’t have that problem, as it switches between models based on what’s more appropriate.</li><li><strong><em>It may be frustrating for non-technical people.</em></strong> Indeed, you might find that some targeted onboarding is useful for less technical folks. Using spec-driven development itself isn’t technically challenging; it’s more the tooling they’re not familiar with (think Git, Xcode, Node, etc.). Maybe we should make it a rule in tech companies that everyone gets these tools installed by default on their machines. Even the CEO.</li><li><strong><em>For some work, it feels like overkill going through all the stages, reading all the markdown files.</em></strong> This is not untrue. I would argue, however, that this feels overwhelming because we’ve all gotten so used to long meetings and back-and-forth in Slack to get all the information for a story or a task that we have forgotten what a properly written epic should look like, and at the initial stage your spec might be an entire epic. All subsequent changes will be much smaller, and one would review them as they review <em>sensibly sized</em> code PRs.</li><li><strong><em>A good framework for decision-making.</em> </strong>And I must agree. It feels a heck of a lot less chaotic: it establishes stages and a shared lingo, and removes friction.</li><li><strong><em>It generates a lot of things very quickly; it can be overwhelming.</em></strong> And that’s AI-driven development for you, especially on a greenfield project that you’re just starting. This is something we’ll have to learn to manage.</li><li><strong><em>Interesting, it felt like being a PM!</em> </strong>This, coming from a developer. 
And I love that because it proves that being T-shaped can go both ways. It empowers an engineer to act like a product manager, and it also empowers a PM or a designer to act like an engineer.</li><li><strong><em>Very curious where it goes beyond the workshop and in general.</em></strong> And this is actually quite important. Spec-driven development is, for all intents and purposes, in its infancy, and I myself am very curious to see what this will look like a year or two from now. Kiro has it built into their IDE. Spec-kit is essentially two folders in a project. The possibilities are endless, though, and we’re already looking at how this could be used in our CI — at least in an experimental form.</li><li><strong><em>How do I protect code changes?</em></strong> You don’t. At least that’s the intent. While you can touch the code, it is an antipattern, and if you’re planning to measure adoption, this is something you should pay attention to. If developers default back to code-first development, spec-driven development becomes largely pointless. Just like writing features first in TDD.</li><li><strong><em>Love the </em></strong><strong><em>constitution.md file. Great concept.</em></strong> I can only agree. It gives a sense of security and a stable baseline. It also becomes part of the application’s “memory”, so it’s something it will continually refer back to when you change the spec.</li><li><strong><em>Many thousands of tokens hurt.</em> </strong>There’s an investment cost. No doubt about it. I would say, though, that this also depends on how you use tokens. As stated before, those of us on Cursor tended not to run out of tokens, while those using Claude Code did. Not every step needs the same model. That said, the cost of tokens is also coming down. 
Spec-driven development may burn through tokens, but if a $200 monthly subscription doubles productivity, that’s still infinitely cheaper than hiring another you.</li><li><strong><em>Concerns around it being fit for something beyond a prototype.</em></strong> Spec-driven development can be used on existing projects. <a href="https://youtu.be/SGHIQTsPzuY?si=7FbywyvZGw1Z42sm">Watch Den Delimarsky’s video</a>. That’s actually the most important benefit of it. Otherwise, I myself would call it just a smarter scaffolding tool.</li><li><strong><em>Some commands can be redundant. The first two stages basically got everything done.</em> </strong>This is something I noticed as well on small projects. The more elaborate my spec was, the more it stuck to the defined stages. I also found that following the steps regardless did sometimes produce changes that would later be useful.</li><li><strong><em>Better than vibe coding! Great at iteration! Great for exploring! Forces you to think about the actual product you build. Works really well with TDD. </em></strong>Most definitely, and I think spec-driven development is precisely the kind of guardrails vibe coding needs to produce production-ready output.</li><li><strong><em>The </em></strong><strong><em>/clarify step felt very useful.</em></strong> Not a step I, personally, needed, but it is there, and some people made good use of it.</li><li><strong><em>Managed to get it stuck.</em></strong> No software without bugs, right? But it is easy (low cost) to reinitialise the project.</li><li><strong><em>Not convinced how well it works at scale, modularising might be needed.</em></strong> Not a negligible remark at all. It remains to be seen how it works in different kinds of projects. Do you have a spec for the whole app? Do you have one per feature? One for frontend? One for backend? 
Best practices have to be developed over time, and the only way to develop them is to try different approaches.</li><li><strong><em>Spec-kit itself feels like jumping in at the deep end of spec-driven development.</em></strong> Also a valid statement. Some will argue that Kiro is a smoother avenue, at least as a first look at spec-driven development. The reason I still vote for spec-kit is that it has potential beyond the user’s machine, so it’s relatively easily translatable to the CI; one of my colleagues has already given that a go, and I know some other companies use it in their CI as well.</li></ul><p>So, as with every tool, there are many pros and cons. But it started conversations we had never had at this scale before. It got people exploring and discovering the power of AI in their development tasks. Some people got introduced to Cursor, others started getting ideas on how to take this further — and some already did just a few days later. Spec-driven development isn’t necessarily about adopting spec-kit or a very specific set of tools, but about inspiring teams to adopt automation where and when it makes sense, to push AI capabilities to the extreme. See what works and why, see what doesn’t and why not. Iterate, measure, iterate again.</p><h4>The future is now</h4><p>A year ago, I would have told you that hitting enter, then going to meet my food delivery guy, only to come back to a fully developed feature 5 minutes later was borderline insanity. But trying spec-driven development out has real potential to change how you look at software development. It’s both terrifying and exciting.</p><p>That’s what software development in 2026 looks like. It’s very different than it was just a year or two ago, and if you’re not feeling the differences, I must warn you, you might be falling behind. We’re way past code completion. This isn’t about “tab-tab-done”. We’re living in a reality where tools like spec-kit are getting adopted, getting integrated into CIs. 
A world where developers have stopped fighting about how to write the best CSS; now we’re all thinking about how to deliver the best application for the user in the most efficient way. How to try ten things instead of two or three in a quarter. You cannot keep up with that demand doing things in a chaotic, disjointed way.</p><blockquote>Spec-driven development enabled the daredevil developer in me; every day I wake up excited to see how far I can push it without creating chaos around me.</blockquote><h4>Resources you might find useful</h4><ul><li><a href="https://www.youtube.com/watch?v=a9eR1xsfvHg">Spec-kit for new projects</a>: YouTube video.</li><li><a href="https://www.youtube.com/watch?v=SGHIQTsPzuY">Spec-kit for existing projects</a>: YouTube video.</li><li><a href="https://developer.microsoft.com/blog/spec-driven-development-spec-kit">Spec-driven development by Microsoft</a>: blog post.</li><li><a href="https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/">Spec-driven development by GitHub</a>: blog post.</li></ul><p><em>Attila Vago — Software Engineer, improving the world one line of code at a time. Cool nerd since forever, writer of codes, blogs and books. </em><a href="https://www.goodreads.com/book/show/205716390-it-s-cold-ma-it-s-really-cold"><strong><em>Author</em></strong></a><em>. Web accessibility advocate, LEGO fan, vinyl record collector. Loves craft beer! 
</em><a href="https://attilavago.medium.com/my-200th-article-hello-its-time-we-met-3f201ad1303"><strong><em>Read my Hello story here!</em></strong></a><strong><em> </em></strong><a href="https://attilavago.medium.com/subscribe"><strong><em>Subscribe</em></strong></a><strong><em> </em></strong><em>for more stories about </em><a href="https://medium.com/@attilavago/list/lego-all-the-things-083f80bd3c51"><strong><em>LEGO</em></strong></a><strong><em>, </em></strong><a href="https://medium.com/@attilavago/list/technology-tech-news-a2d2d509b856"><strong><em>tech</em></strong></a><strong><em>, </em></strong><a href="https://medium.com/@attilavago/list/coding-software-development-d123369e3636"><strong><em>coding</em></strong></a><strong><em> and </em></strong><a href="https://medium.com/@attilavago/list/accessibility-4b67c1d08ef3"><strong><em>accessibility</em></strong></a><em>! For my less regular readers, I also write about </em><a href="https://medium.com/@attilavago/list/the-random-stuff-96bfc5a222e5"><strong><em>random bits</em></strong></a><em> and </em><a href="https://medium.com/@attilavago/list/writing-writing-tips-f83ef5e79de5"><strong><em>writing</em></strong></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=56d52231c19e" width="1" height="1" alt=""><hr><p><a href="https://engineering.prezi.com/we-tried-spec-driven-development-so-you-dont-have-to-56d52231c19e">We Tried Spec-Driven Development So You Don’t Have To</a> was originally published in <a href="https://engineering.prezi.com">Prezi Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Should Software Engineers Have Good Presentation Skills?]]></title>
            <link>https://engineering.prezi.com/should-software-engineers-have-good-presentation-skills-2e1aec3240de?source=rss----911e72786e31---4</link>
            <guid isPermaLink="false">https://medium.com/p/2e1aec3240de</guid>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[communication-skills]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[passion]]></category>
            <category><![CDATA[presentations]]></category>
            <dc:creator><![CDATA[Attila Vágó]]></dc:creator>
            <pubDate>Mon, 08 Dec 2025 09:54:56 GMT</pubDate>
            <atom:updated>2025-12-08T20:44:54.038Z</atom:updated>
            <content:encoded><![CDATA[<h4>Spoiler alert: yes. But not for the reasons you might think…</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*91w7HrmspLhfb5nMWfVCIw.jpeg" /><figcaption>Sherlock Holmes doing a presentation. I mean it’s a murder-board, but when you have to explain it all, it’s like an engineer presenting on architecture. Photo by author.</figcaption></figure><p>Why did we become software engineers? Naturally, to have a great career, make lots of money, get free pizza and beer all day every day and retire early. Right? Well, that’s what the cynics would say. In reality, software engineering is one of those professions many of us got into less for the love of money — there are plenty of jobs that pay better — and a lot more for the love of coding and problem-solving.</p><p>I remember when I first caught the bug. I was just 21, and I somehow got it into my head that I wanted to understand this thing called HTML, and wanted my own website out on the internet. Thought was followed by action, and as I was literally immigrating to the UK, I read a 300-page book on how to write HTML. In an airport.</p><blockquote>My first thought was: “people get paid for knowing this?!? This is pretty simple stuff.”</blockquote><p>And sure enough, as I settled in my new adoptive home, within the week I built a website the good ol’ fashioned way with just HTML. Using tables, no less. If you think debugging nested div tags today is annoying, trust me, dealing with tables two decades ago was far more fiddly. But it didn’t matter because I genuinely enjoyed it. HTML was followed by CSS, JavaScript, PHP, C, Python, Ruby, and the rest is history.</p><h4>Making sense of it all, gaining expertise</h4><p>I don’t know how everyone else learnt software engineering, but my journey was a hot mess of self-taught courses piled on top of each other. Not because I’m disorganised, but rather because I wanted to learn everything. 
Not because I wanted to know everything, but rather because I wanted to understand what part of software engineering was interesting to me.</p><p>C felt hard-core, but having to think about memory management all day every day wasn’t something I felt overly attracted to. Pure backend development felt dry and joyless, mobile development was far from where it’s at today, and game development, for someone who was never a real gamer, didn’t make much sense. So full-stack development it was. A little bit of PHP, some HTML, some CSS, and some JS to round it off with. Oh, and some SQL because… databases are a thing when you build a website.</p><p>However, that was just the baseline. It took another couple of years to truly understand where my true passion lay. The more time I spent on the front-end, the more obvious it became — I loved working in the browser. I didn’t mind the browser wars; they were but a fun challenge. I didn’t mind the pixel-perfect styling requirements; I looked at them as useful guide wires. I didn’t mind having to make sure the page worked just as well with keyboards and screen readers; accessibility felt like the right thing to do. So I grew in those areas exponentially, and I became the engineer other teams sought out. I finally had expertise to share.</p><h4>Sharing passion</h4><p>But as I quickly found out, I wasn’t just sharing expertise. In fact, half the time, I passed on more than knowledge: I managed to get others excited about the things I was passionate about, like frontend development, accessibility, microfrontends, AI tools for developers and, increasingly, software architecture and documentation.</p><p>The vast majority of engineers shy away from the opportunity to present. And honestly, very often they have every right to, because far too often presenting is framed as a career-climbing strategy. 
Few things in software engineering convey visibility more than hosting a Zoom meeting for 100 people, walking onto a stage at a company all-hands or at a tech conference. And many engineers aren’t ladder-climbers. They want to solve problems. That’s why most of us got into engineering. Need a good example? Have a chat with Steve Wozniak or Linus Torvalds. But they did present on various occasions, and every single time they inspired people in the audience. Why? Because they shared the things they’re passionate about.</p><blockquote>Good presentations aren’t an information transfer mechanism. Their goal is to express passion, to inspire, to trigger conversations. They must have a multiplier effect. Otherwise, it’s a boring monologue.</blockquote><p>Go to any tech meetup, and you’ll find that everyone is passionate about something. Sure, there’s the odd showoff who really is just there to grow their network without having much of a clue about anything in engineering, but chances are you’ll find many who will inspire you with their passion. Whatever they just inspired you with could have been a presentation.</p><p>I see this often on LinkedIn as well. Engineers write lengthy, passionate posts and comments on all sorts of engineering topics they genuinely care about. My thoughts are always: <em>“this could have been a Medium article or a presentation.”</em> Why is it not? God only knows. Fantastic engineers spend hours and days every month creating valuable, inspiring content that gets lost in a Reddit or StackOverflow thread. Such a waste.</p><h4>Presentations aren’t boring</h4><p>Contrary to general belief, presentations aren’t really boring. I blame PowerPoints for making everyone think they are. But you do have options. While tools don’t make a pro, good tools can help you get there faster and make a bigger impact when delivering your presentation. 
Creating <a href="https://prezi.com/gallery/">a Prezi presentation</a> is one way to achieve that, but I have also seen engineers create dedicated, jaw-dropping websites — I mean, we always like building new things, right?</p><p>I have sat through incredibly boring presentations, though, regardless of what tool has been used. If I pick up on the speaker’s lack of passion, they’ve lost me within 2 minutes. If you’re not passionate about the topic, do not present, do not write about it. Nothing good will come of it.</p><p>In the Prezi Engineering organisation, we do these events called Pragma. It’s usually a 1-hour affair where an engineer presents on a topic they care about. It’s entirely voluntary and they have full control over it; another colleague and I just help organise it and provide pointers if they need them. This year, however, I decided it was time to host the mother of all Pragma sessions, and invited each of the four tech stacks to find someone who had something to present. The topic was simply: “Aha!” — sharing an “aha moment” of 2025.</p><p>To my surprise, we ended up not with four speakers, but 11, as essentially every team had someone with something inspiring to share. That was the main requirement, while nudging those who hadn’t done a presentation this year to contribute. It’s a 7-minute talk at most. Lightning talks is what some would call these. But you often don’t need more to make an impact, to inspire. It also helps speakers get to the point faster.</p><blockquote>A presentation of just a few minutes is long enough to start a conversation and light the spark.</blockquote><p>You can deliver a lot of value in just a few minutes. 
One of my most-read articles is a 3-minute read I wrote while being incredibly frustrated with CocoaPods on Apple Silicon CPUs, and once I sorted the problem for myself, I decided to share it <a href="https://medium.com/p/6abe3736c221">in the form of an article</a> rather than an obscure comment or post on social media. It wasn’t about showing off; it was about sharing a small Eureka moment. And judging by the stats, 71,000 engineers needed me to do that.</p><h4>Presentation skills aren’t the point</h4><p>The point is ultimately to find your passion as a software engineer. Once you’ve found it, you’ll become better and better at it, and you’ll start wanting to talk about it. I’ll talk about accessibility, automated testing, frontend architecture, and of course LEGO as well to anyone who’ll listen. And every so often I’ll pour that passion-led expertise into an article or a presentation. And I’m not even going to pretend I am a great presenter, because that was never the goal, or at the very least it was always secondary, and it’s something I keep refining over time, organically.</p><p>So next time someone asks you to deliver a presentation on something, don’t tell them to shove it; tell them what you’re passionate about, tell them what you want to share, the thing you would like to inspire with, and trust me — and yourself — it will be a killer presentation. Not because you have presentation skills, but because you’re sharing something you’re deeply passionate about, and that makes all the difference.</p><blockquote><em>P.S. This could have (also) been a presentation…</em></blockquote><p><em>Attila Vago — Software Engineer improving the world one line of code at a time. Cool nerd since forever, writer of codes, blogs and books. </em><a href="https://www.goodreads.com/book/show/205716390-it-s-cold-ma-it-s-really-cold"><strong><em>Author</em></strong></a><em>. Web accessibility advocate, LEGO fan, vinyl record collector. Loves craft beer! 
</em><a href="https://attilavago.medium.com/my-200th-article-hello-its-time-we-met-3f201ad1303"><strong><em>Read my Hello story here!</em></strong></a><strong><em> </em></strong><a href="https://attilavago.medium.com/subscribe"><strong><em>Subscribe</em></strong></a><strong><em> </em></strong><em>for more stories about </em><a href="https://medium.com/@attilavago/list/lego-all-the-things-083f80bd3c51"><strong><em>LEGO</em></strong></a><strong><em>, </em></strong><a href="https://medium.com/@attilavago/list/technology-tech-news-a2d2d509b856"><strong><em>tech</em></strong></a><strong><em>, </em></strong><a href="https://medium.com/@attilavago/list/coding-software-development-d123369e3636"><strong><em>coding</em></strong></a><strong><em> and </em></strong><a href="https://medium.com/@attilavago/list/accessibility-4b67c1d08ef3"><strong><em>accessibility</em></strong></a><em>! For my less regular readers, I also write about </em><a href="https://medium.com/@attilavago/list/the-random-stuff-96bfc5a222e5"><strong><em>random bits</em></strong></a><em> and </em><a href="https://medium.com/@attilavago/list/writing-writing-tips-f83ef5e79de5"><strong><em>writing</em></strong></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2e1aec3240de" width="1" height="1" alt=""><hr><p><a href="https://engineering.prezi.com/should-software-engineers-have-good-presentation-skills-2e1aec3240de">Should Software Engineers Have Good Presentation Skills?</a> was originally published in <a href="https://engineering.prezi.com">Prezi Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[There’s Two Sides To Every AI Tool Adoption Story]]></title>
            <link>https://engineering.prezi.com/theres-two-sides-to-every-ai-tool-adoption-story-ddb2118686d1?source=rss----911e72786e31---4</link>
            <guid isPermaLink="false">https://medium.com/p/ddb2118686d1</guid>
            <category><![CDATA[productivity]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[coding]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[software-engineering]]></category>
            <dc:creator><![CDATA[Attila Vágó]]></dc:creator>
            <pubDate>Tue, 28 Oct 2025 08:49:44 GMT</pubDate>
            <atom:updated>2025-10-28T10:04:10.409Z</atom:updated>
            <content:encoded><![CDATA[<h4>How AI helps engineers get more excited about software engineering again…</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SVsWn2BPKEO8dFsUfjQm_Q.jpeg" /><figcaption>Wall-E and his two new friends, Superman and Gwen Stacy. Robots can help superheroes too. Photo by author.</figcaption></figure><p>You submit your PR and then you wait. You either get that mostly useless but non-blocking “LGTM”, a very confident change request, or the dreaded comment that’s neither approving nor asking you to change anything. More often than not, it’s the beginning of yet another philosophical debate, a nitpick thread, or some question making you roll your eyes and wonder why you even became a software developer. Let’s not pretend we haven’t run into all of these at some point or another. But that’s yesteryear’s way of doing things. Now, in 2025, we have AI that can review our code, and that… changes things.</p><p>Full disclosure, I’m an AI skeptic. That does not mean I have a problem with AI or automation. Fun fact, I actually studied automation for four years. It’s fun, if you like that sort of thing. I did. So, AI to me is but another piece of automation. But like all automation, I strongly believe that it needs to make sense, it needs to make our lives easier and perform a task that has a measurable impact.</p><blockquote>If we ask humans to have a measurable impact in the workplace, we must treat our tools with the same scrutiny and expectations. That includes AI.</blockquote><p>And frankly, AI only really became useful in 2025. At least from a software engineering perspective. That usefulness changes the landscape quite a bit. Suddenly, it goes from a silly fad that gets things right sometimes, to a tool that works most of the time, or at least often enough that its impact is measurable and becomes a net positive. 
AI is finally exciting, and with that, so is software engineering.</p><h4>But selling AI is still not easy</h4><p>Across the industry, the first and biggest hurdle you’ll run into is “developer pride”. If you’re a bottom-line person, and you care about numbers and nothing more, it’s easy to dismiss, but just like in many other creative and intellectually intense professions, engineers tend to care not just about the work getting done, but how it’s done, the value they bring as humans into the work they do, and anything that threatens that can cause pushback.</p><blockquote>AI-generated code is an unmaintainable heap of mess. — every 2nd software engineer out there</blockquote><p>Translation? We believe we can do it better, and our code will outlive AI-generated code. Except that requirement is less and less the case, as I explained in “<a href="https://medium.com/gitconnected/abandonware-is-the-new-software-9e088bcf5bb2?sk=6057ed1cfc8536bd9fd63a39aca606ba">Abandonware Is The New Software</a>”. That doesn’t mean, however, that this behaviour, passed on from generation to generation, is easy to recalibrate, so no wonder many — even the curious engineers — will treat AI tools like Cursor or Copilot with skepticism, and will underutilise their capabilities. Many will refuse to go beyond code completion, and using agentic AI for <a href="https://medium.com/gitconnected/why-ai-doesnt-change-the-fundamental-truth-behind-coding-f3bd37e67b9b?sk=5cc5b92d1bf835f92d8721192744740a">conversational programming remains out of the question</a>.</p><p>Introducing tools like CodeRabbit for code reviews might even ruffle some feathers — surely a “wobot” can’t do a better job at reviewing our PRs than a human. Well, I’d argue the opposite. It often can. Partly because it has access to more information, and partly because it’s always available to review your code within minutes of submitting the PR. 
The resulting code might not be 100% there, but it makes for a much cleaner PR by the time a colleague finds a moment in their busy schedule to review it. Call me “another bean counter engineer” if you like, but saved time is saved money, and saved money can mean a healthier engineering organisation, and a growing business.</p><blockquote>Transparency, well-defined standards and expectations are all key to getting engineering organisations onboard with AI.</blockquote><p>But even if you’re OK with all the above, you’ll find the odd engineer who is afraid to use these tools — because they can feel like cheating. What will the other engineers say if they find out 90% of the code was AI-generated? Well, chances are some of them are doing exactly the same thing already. Chances are, some would like to use these tools, but are worried their peers will tell them off for doing so. Well, that’s the moment we ought to talk about AI, and educate our teams on what healthy and innovative use of AI in software engineering looks like.</p><h4>Beyond the hump of fear and disbelief</h4><p>The other half of software engineers — some quietly, some less so — will have started using AI already, and the moment you open the conversation up, it becomes really obvious how far many engineers have already gotten in their journey of AI-assisted development. They don’t just use some tools, they have compared a host of them, have tips and tricks ready, and will often even be able to present you with an ad-hoc cost-benefit analysis. For instance, if you ask me, Cursor is better than VSCode with Copilot, but Kiro is a tool worth keeping an eye on, as it might just become a favourite among product development teams. The bottom line is, many engineers are already excited about software engineering with AI, and no matter how you put it…</p><blockquote>A capable software engineer using AI will outperform one who does not. 
That is a fact.</blockquote><p>That excitement — surprisingly enough — for many of us doesn’t come from programming changing in any meaningful way. <a href="https://medium.com/gitconnected/why-ai-doesnt-change-the-fundamental-truth-behind-coding-f3bd37e67b9b?sk=5cc5b92d1bf835f92d8721192744740a">AI does not change the fundamental truth behind coding</a>, but it does change the level of effort we need to put into achieving the desired result. Someone told me once, <em>“the best programmers are the lazy ones”</em>. Just a few years ago, that would have translated into robust code that can live unchanged for years to come. Today, it gains a new meaning: achieving the best result with the least amount of effort. It’s not merely about lines of code per minute. Line-counting is silly, don’t let LinkedIn tell you otherwise. Getting there faster isn’t about typing more code, it’s about finding out where the code needs to go, what the most efficient solution is, seeing the connections and dependencies in a system without having to spend hours or days doing so.</p><p>The other day, I had to add a new property to an iFrame-based dialog in a home-grown framework of ours. I am only semi-familiar with the codebase; on average, I touch it maybe six times a year. I could have gone the old-school way: find the component, try to understand how it was developed 5 years ago and what its dependencies are, and type some code until I got no compilation errors and the dialog worked the way I wanted it to. But instead, I asked Cursor where the component was, explained what I needed to achieve with the new property and let it do the work. Five minutes later, I had a working dialog with the property I needed. I reviewed the changes, made a couple of manual edits to make the linter happy, and after all the regression tests confirmed nothing broke, I submitted the PR. A few hours later, the team owning the component library approved it, and case closed.</p><p>Was this a complex task? 
Not really. And if we’re honest with ourselves as software engineers, we don’t solve complex tasks all the time. Typically, we actually don’t. And even when we do, it’s often the messy code — and it wasn’t AI that made it messy, it was us — that makes it complex. All that to say that…</p><blockquote>The argument that AI cannot generate complex applications holds very little water, if any at all. That was never the requirement. Not before AI, not after AI.</blockquote><p>MVPs by definition are meant to be simple, and if you’ve done Agile development for more than a day, you already know: everything else gets bolted on as an epic, a story, a task — and that last one is what you tell AI to help you with. Even if the likes of Kiro stand by their promise and deliver apps from a set of requirements, the stages and steps of building a piece of software up from nothing into something you’re proud enough to put in front of a customer still stand.</p><p>Building “one-shot complex apps” is both delusional and impractical. You’d still end up writing a novel’s worth of prompts, which would then be broken down by AI into tasks, tests, reviews and the like. It’s naive to think that “build me the next Facebook” will result in AI building you everything that is the Facebook website, apps, and everything else tied to it, but it’s a lot less naive to ask it to do all the things that make your day a drag.</p><h4>AI all the things?</h4><p>Categorically no, and I fear this is partly to blame for a considerable number of software engineers still being standoffish about AI. When something is presented as the Swiss Army knife of everything, it never is. We can already see that with AI slop everywhere. AI can do countless things, but just because it can doesn’t mean it should, doesn’t mean it’s useful, and it certainly doesn’t mean it’s pragmatically the tool that makes the most sense. But when it does…</p><blockquote>Look at your day, and identify all the “ugh” moments. 
Now check: for how many of those moments can you say “there’s an AI model for that”?</blockquote><p>You can take this exercise even further. Map out your entire workflow as a team. Identify what’s annoying or time-consuming. Check if AI could help. That may not always be the case, but just the other week I asked Jira’s AI to generate subtasks based on a story description, and it did. Given that I was building up an epic, it saved me tens of minutes, while only getting it wrong twice — which cost me just two clicks.</p><p>The fact of the matter is, a lot of our so-called skills are more about knowing how to use certain tools, how to deal with obscure frameworks and libraries, and how to find stuff in legacy code, when in fact, as engineers, we just want to solve problems.</p><p>That is what I see in the eyes of excited engineers using AI today: the empowerment of finally finding the time to develop their problem-solving muscles while delivering features and products that help users solve their problems. If that’s not exciting, I don’t know what is.</p><p><em>Attila Vago — Software Engineer improving the world one line of code at a time. Cool nerd since forever, writer of codes, blogs and books. </em><a href="https://www.goodreads.com/book/show/205716390-it-s-cold-ma-it-s-really-cold"><strong><em>Author</em></strong></a><em>. Web accessibility advocate, LEGO fan, vinyl record collector. Loves craft beer! 
</em><a href="https://attilavago.medium.com/my-200th-article-hello-its-time-we-met-3f201ad1303"><strong><em>Read my Hello story here!</em></strong></a><strong><em> </em></strong><a href="https://attilavago.medium.com/subscribe"><strong><em>Subscribe</em></strong></a><strong><em> </em></strong><em>for more stories about </em><a href="https://medium.com/@attilavago/list/lego-all-the-things-083f80bd3c51"><strong><em>LEGO</em></strong></a><strong><em>, </em></strong><a href="https://medium.com/@attilavago/list/technology-tech-news-a2d2d509b856"><strong><em>tech</em></strong></a><strong><em>, </em></strong><a href="https://medium.com/@attilavago/list/coding-software-development-d123369e3636"><strong><em>coding</em></strong></a><strong><em> and </em></strong><a href="https://medium.com/@attilavago/list/accessibility-4b67c1d08ef3"><strong><em>accessibility</em></strong></a><em>! For my less regular readers, I also write about </em><a href="https://medium.com/@attilavago/list/the-random-stuff-96bfc5a222e5"><strong><em>random bits</em></strong></a><em> and </em><a href="https://medium.com/@attilavago/list/writing-writing-tips-f83ef5e79de5"><strong><em>writing</em></strong></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ddb2118686d1" width="1" height="1" alt=""><hr><p><a href="https://engineering.prezi.com/theres-two-sides-to-every-ai-tool-adoption-story-ddb2118686d1">There’s Two Sides To Every AI Tool Adoption Story</a> was originally published in <a href="https://engineering.prezi.com">Prezi Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Somebody is in the room — did we just interview ChatGPT?]]></title>
            <link>https://engineering.prezi.com/somebody-is-in-the-room-did-we-just-interview-chatgpt-ab4e8dd5db28?source=rss----911e72786e31---4</link>
            <guid isPermaLink="false">https://medium.com/p/ab4e8dd5db28</guid>
            <category><![CDATA[hiring]]></category>
            <category><![CDATA[interview]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[prezi]]></category>
            <dc:creator><![CDATA[Máté Börcsök]]></dc:creator>
            <pubDate>Wed, 09 Jul 2025 12:24:04 GMT</pubDate>
            <atom:updated>2025-07-09T12:39:14.089Z</atom:updated>
            <content:encoded><![CDATA[<h3>Somebody is in the room — did we just interview ChatGPT?</h3><p>My team just had an interview this week. It was weird.</p><p>He did well! In fact, a little too well. Whatever we asked, he could answer in detail. Too many details. But there were clues that someone was actively listening in the room.</p><p>One example: he introduced a system they built. If certain conditions are met, the users of the platform can claim rewards. He said they obviously chose a microservice architecture with a DDD approach. One service for authentication, one for users, one for rewards, etc.</p><p>We asked clarifying questions: was having that many microservices a good choice, did every service have its own database, and how did they ensure transactions?</p><p>The answer felt smart and pragmatic. Of course the microservices have separate databases! And to make sure that transactions work, they used the SAGA pattern.</p><p>Well, none of us had heard of the SAGA pattern. I admitted that I wasn’t familiar with it, but instead of explaining the core idea or walking us through it, he just moved on. It felt like a missed opportunity, and overall, the answer was unexpected.</p><p>If you ask me, I’d say we didn’t need transactions across the services, or only needed a transaction when saving the reward; maybe we’d set up a queue and process the rewards that way, ensuring consistency at that point.</p><figure><img alt="A screenshot of a Zoom call with a blurred video background and an overlay window at the center showing a glowing assistant icon. The assistant message reads “anything else I can do for you, just let me know. I’m here to help!”" src="https://cdn-images-1.medium.com/max/1024/1*b3B3Jb8PtY-jm7_C5NcUng.png" /><figcaption>Approximate recreation of the candidate’s Zoom setup</figcaption></figure><p>So I replayed this part of the interview with ChatGPT, and guess what? 
It also suggested the SAGA pattern.</p><p>This made me do the same exercise for other questions. Sometimes I got the same hallucination, the same words from ChatGPT that the candidate had used. I couldn’t understand on the spot how they were relevant in that context.</p><p>The candidate seemed to have a professional setup, yet we still heard an echo. He was on speaker.</p><p>His eye movement was weird, and some of his answers felt like he was reading off a screen.</p><p>Sometimes his answers were very generic, yet whatever we asked, he could go into the tiniest details, oftentimes contradicting his previous answers.</p><p>After the interview, this was the first message on Slack:</p><blockquote>is it me or i think this guy is using AI to answer our questions?</blockquote><p>In general, I don’t expect anything disingenuous from people. We didn’t call him out during the call.</p><p>I wish I could share the Slack thread: everyone on the team kept adding a new clue, raising the suspicion more and more.</p><p>We are in new territory, and this interview left us wondering: how do we evaluate authenticity in the age of AI?</p><p>Personally, I don’t mind that we didn’t call out his suspected AI usage during the interview. And in fact, it doesn’t matter. The answers were about as great as unedited AI output without human oversight. The kind that sounds smart at first, until you realize it’s just a mashup of architecture buzzwords with no real insight.</p><h3><strong>Detecting AI Answers: Practical Tips for Interviewers</strong></h3><p>My article sparked some discussions about the topic internally. I reached out to our Senior Tech Recruiter, Monika Fourie, to share some of her experience.</p><blockquote>I had some people using AI during calls and it was done in an excellent way, so I believe sometimes they add their experience first and wait for the answers on the speaker. 
With the smart use of AI, it’s difficult to detect that they are using it.</blockquote><blockquote>I think the most important thing is to look at all of these signs as one — like diagnosing a disease — one symptom alone is sometimes not enough.</blockquote><h4>Signs to look for</h4><ul><li>Slight delays before responding</li><li>Their eyes going in the same direction before each response. We naturally look in certain directions when thinking creatively, solving problems and remembering, but consistently looking at the same spot on the screen usually hints that the answers appear there</li><li>Overly polished language, sophisticated words or oddly phrased responses</li><li>Inability to go “off script” or rephrase ideas</li><li>Lack of deeper elaboration or personal experience</li><li>Slow, precise repetition of the question, spoken as if dictating it into a voice‑to‑text prompt</li></ul><h4>Questions to ask</h4><ul><li>Human answers are usually less confident and more nuanced; ask, for example, “is there anything on this topic you feel less confident about” or “if you had more time, what would you look into further and why”</li><li>“You used the term ‘modular monolith’: can you explain what it does and what it means here?”</li><li>Another idea for interviews is having collaborative discussions whilst sharing a screen — this is more technical: “Let’s solve this problem together — can you share your screen and walk me through your approach?”, or “Can you sketch a high-level architecture for that idea?”</li></ul><h4>When you feel suspicious</h4><p>Calling out AI usage is difficult if not done right. It can get us into difficult situations. I suggest listening carefully, collecting clues, and using the techniques mentioned above. Only raise the issue if you’re absolutely certain, because doing so will likely end the interview right there.</p><p>This was the first time our team encountered a situation like this. I’m sure it won’t be the last. 
As AI assistants continue to evolve, detecting their presence will only get harder.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ab4e8dd5db28" width="1" height="1" alt=""><hr><p><a href="https://engineering.prezi.com/somebody-is-in-the-room-did-we-just-interview-chatgpt-ab4e8dd5db28">Somebody is in the room — did we just interview ChatGPT?</a> was originally published in <a href="https://engineering.prezi.com">Prezi Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[You’ll Rebuild Everything Every Four Years Anyway]]></title>
            <link>https://engineering.prezi.com/youll-rebuild-everything-every-four-years-anyway-b31ab0dcc17e?source=rss----911e72786e31---4</link>
            <guid isPermaLink="false">https://medium.com/p/b31ab0dcc17e</guid>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[prezi]]></category>
            <category><![CDATA[software-architecture]]></category>
            <category><![CDATA[web-development]]></category>
            <dc:creator><![CDATA[Attila Vágó]]></dc:creator>
            <pubDate>Fri, 04 Apr 2025 03:12:34 GMT</pubDate>
            <atom:updated>2025-04-04T03:12:34.299Z</atom:updated>
            <content:encoded><![CDATA[<h4>To refactor, or to rebuild? That is the question…</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4gf5UDs9E97yw0c2E44Cag.jpeg" /><figcaption>Photos and edits by author, speech bubble asset by <a href="https://commons.wikimedia.org/wiki/User:Kaldari">Kaldari</a>.</figcaption></figure><p>The headline is a direct quote from a colleague of mine many years ago. We were on a Tiger Team rebuilding the frontend architecture of the company’s main product. Being a contractor, he had very little skin in the game. He came in, helped us for six months — one of the best engineers I have ever worked with — then moved on. But his casual remark stayed with me. Do we really rebuild that often? If so, why? And when we don’t, why don’t we? Now, facing the prospect of yet another major architectural migration, I find these questions especially poignant.</p><p>It’s probably no surprise to anyone, at this point, that software tends to evolve over time. This evolution is more often than not driven by product teams who — naturally — want the software to cater to the users’ needs, to entice more users and keep them for a long time. That comes at a cost. Some of it avoidable, some of it not so much. Some of it depends on engineers, other aspects much less so. Long story short, it gets complicated, and it does so quickly.</p><blockquote>Clean code and successful products don’t always go hand in hand. This is a software engineering inevitability.</blockquote><p>When jQuery became the hottest new kid on the block, we all jumped at the opportunities it presented us with. Even though technically, it was still JavaScript, just enough of the complexities and tediousness of the language was abstracted away that everyone started building JS-heavy apps. 
That quickly ballooned into jQuery plugins — really just more JavaScript files added to the head of your pages — and you found yourself with a frontend monolith, unless you called it what it really was — a gigantic pile of spaghetti code.</p><p>Angular with its MVC architecture was supposed to solve that — and other things — but then it didn’t. Nor did React. Nor did Vue. Or Svelte. Or whatever you can think of. Given enough time, you’d keep finding yourself dealing with the same unintelligible mess, grinding yourself to a halt, wishing for yet another “rebuild”.</p><h4>The problem with rebuilds</h4><p>It’s a surprisingly common engineering request. If the app was built in Angular, you’ll surely find a group of passionate engineers who will want to rebuild it all in React. If it’s a React app, you’ll surely find some hard-core Svelte fans who’ll jump at the opportunity to migrate everything to Svelte. No matter what library or framework an “old” codebase is built with, there will be a group of engineers ready to kill it, and start from scratch.</p><blockquote>There is a false sense of security in rebuilding an existing product in a new architecture, language or framework. It’s meant to solve everything, while often it fails to solve much, if anything.</blockquote><p>Of course, before you even get the opportunity to rebuild, you already have a massive blocker to overcome — <a href="https://levelup.gitconnected.com/how-to-sell-engineering-needs-to-product-managers-2a4f379103b6?sk=60f7bf95b768bc5dbdcd463bddf56e84"><strong>selling it</strong> to the product teams and the business</a>. 
I have yet to meet a product owner or manager who gets excited about <strong><em>not</em></strong> <a href="https://engineering.prezi.com/a-rare-insight-into-the-daily-challenges-of-an-experiments-team-349a94960b4f">delivering features or experiments</a> for 6–12 months, or a business that proactively wants to invest in getting the exact same thing a year later, for the cost of an entire year’s development time. Selling a major refactor or rebuild is perhaps one of the most difficult challenges an engineering team will face, as for it to make any sense it has to be tied to performance, security and/or scalability, and that isn’t always an easy case to make, especially if we’re talking about an application built 3–4 years ago.</p><p>Another trap that I often see engineering organisations fall for is what I call “<strong>inherited fallacies</strong>”. During its lifetime, all software tends to attract dead weight in the form of abandoned features, unresolved A/B tests, and business complexities due to decisions made at a certain point, potentially for legitimate reasons. Add to that spaghetti code that possibly ties all of it together, and when rebuilding, you’ll soon find yourself recreating the same monster you were hoping to get rid of in the first place. I strongly believe that rebuilds more often than not require product input, and very pragmatic conversations as to what is kept and what isn’t. That said, watch out not to shed too many of those “inherited fallacies”, as <a href="https://forums.macrumors.com/threads/sonos-ceo-steps-down-following-disastrous-app-redesign.2447308/">you’ll end up in Sonos’ shoes, and heads will roll</a>.</p><p>The final aspect worth keeping in mind when rebuilding is <strong>new technical debt</strong>. The — and I might add, wrong — assumption is that a rebuild is a clean slate, and thus technical debt gets reset to zero. In my experience, that’s far from reality, and an overly naive and quite dangerous assumption. 
All rebuilds come with their own set of technical challenges, some of which will end up in the backlog.</p><p>Documentation is also something I often see being left for last, alongside less important feature enablements. You might also find that certain nice-to-haves developers were used to in their day-to-day are missing. I remember the first time we handed over microfrontends to the teams in a previous company I worked at, half the DX (developer experience) features were missing. It took another year for a colleague of mine and me to develop a robust CLI tool, which to this day is praised as a tremendous help for developers.</p><h4>The problem with cleanups</h4><p>It’s difficult to bring up the conversation of rebuilds without cleanups and refactors being brought up as well. And for a good reason. Few businesses have an infinite number of resources for constant rebuilds, especially when in the same breath engineering teams keep harping on about clean code, software engineering best-practices, and various programming paradigms being enforced. It’s a conversation that gets contentious very quickly. How does code even get to such a state if engineers care so much about code quality, right? Regardless of what the answer might be, cleanups and refactors come with their own risks.</p><p>As messy and tangled as legacy code may often be, it’s working code, and that needs to be remembered. With that, anytime you refactor, you put that working state at risk, and when tests are inadequate or nonexistent, any sort of cleanup could end up in disaster. Sure, there’s always the revert button to save the day, but what’s most important to take away from this is that effective refactors and cleanups require robust testing to be in place.</p><blockquote>Effective refactors require robust testing, and that, unfortunately, isn’t commonplace.</blockquote><p>Another unfortunate reality is just how low developer interest is for refactors and cleanups. 
The vast majority of engineers are far more interested in building new things, greenfield stuff, rebuilds. Refactoring existing code is also not for the faint of heart, and teams tend not to want to use their senior engineering resources for cleanups. And if that wasn’t enough, you also need to contend with the fact that engineers don’t always see eye-to-eye on what an effective refactor looks like, and you’ve got yourself <a href="https://medium.com/gitconnected/code-review-etiquette-for-the-modern-developer-3fb5e1ad62d0?sk=ccda3532a86e90d3576ef3ce7a705f32">endless conversations in code reviews</a>.</p><p>Finally, it has to be said out loud that a lot of code complexity is also due to certain product and business decisions that have been made over the years. Some were likely made for business survival, or as a reaction to the market. Years later those may make very little sense, but for a meaningful cleanup, product teams need to be involved in the decision-making, which makes many refactors more than just an engineering exercise. And we all know how this works — the more stakeholders, the <a href="https://medium.com/@jchyip/guiding-principle-consent-over-consensus-8aee08540d62">more difficult agreeing on something gets</a>.</p><h4>A more pragmatic approach</h4><p>So what do you do when neither of the above avenues seems particularly ideal? You pivot! Nah, I’m only joking. 😄 I think anyone who has worked in a tech company for more than a year has PTSD from all the pivots they’ve experienced. You need something that you can sell to product as an enabler, that excites engineers, and that has business potential. 
In our case, that’s a new architecture that allows for a staggered departure from a highly interconnected setup to a much more modular one, where microservices are loosely coupled with microfrontends.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lxbNBf6uFNQ0Vr6yEIQIFg.jpeg" /><figcaption>A high-level architecture diagram of the direction we’re taking.</figcaption></figure><p>Take for instance a Django-based site where over time you may have combined your templates and views with some modular React applications. If you closely examine the historical context in which these decisions were made, they’ll all make sense. Unfortunately, that also means you’re dealing with an overly complex setup where even developing locally becomes a pain and delivery grinds to a halt over time. One option is to throw all the Django out and start fresh — aka a complete rebuild. Or, instead, you can return JSON instead of server-rendered views, remove the need for routing on the backend, and apply something like Single-SPA on the frontend. On the server side the refactoring is far less risky, while on the frontend the rebuild is straightforward, yet staggered, as you’ll only rebuild what you need, when you need it.</p><p>More importantly, this answers the question I started with. Do we rebuild every four years? There is certainly an industry tendency to do so, and the reasons vary from technical to business and anything in between. But if you ask me whether we need to, I think not. 
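To make the Django-to-JSON idea above a little more concrete, here is a minimal, framework-free Python sketch of the before and after. All names are hypothetical; in a real Django app the second function would return a JsonResponse, and a React microfrontend mounted via Single-SPA would consume it:

```python
import json

# Hypothetical sketch of the staggered migration described above.
# Before: the backend owns routing and rendering, returning HTML.
def presentation_page_legacy(presentation):
    return "<html><body><h1>%s</h1><p>by %s</p></body></html>" % (
        presentation["title"], presentation["author"]
    )

# After: the backend returns plain JSON; rendering and routing move to
# the frontend, so each page can be rebuilt independently, when needed.
def presentation_api(presentation):
    return json.dumps(
        {"title": presentation["title"], "author": presentation["author"]}
    )

demo = {"title": "Quarterly Review", "author": "Ada"}
print(presentation_api(demo))
```

The point isn’t the code itself but the shape of the change: the server-side refactor is small and low-risk, while the frontend can migrate route by route.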
Not if the architecture we set up for ourselves allows for staggered migrations and an organic evolution, where you even have the option to build throwaway applications that satisfy a business goal for a limited amount of time without having long-term detrimental effects on the overall state of the code and your architecture.</p><blockquote>The best software architecture is the one that allows for change, where rebuilds are rare, and cleaning up just means throwing stuff out.</blockquote><p><em>Attila Vago — Software Engineer improving the world one line of code at a time. Cool nerd since forever, writer of codes, blogs and books. </em><a href="https://www.goodreads.com/book/show/205716390-it-s-cold-ma-it-s-really-cold"><strong><em>Author</em></strong></a><em>. Web accessibility advocate, LEGO fan, vinyl record collector. Loves craft beer! </em><a href="https://attilavago.medium.com/my-200th-article-hello-its-time-we-met-3f201ad1303"><strong><em>Read my Hello story here!</em></strong></a><strong><em> </em></strong><a href="https://attilavago.medium.com/subscribe"><strong><em>Subscribe</em></strong></a><strong><em> </em></strong><em>for more stories about </em><a href="https://medium.com/@attilavago/list/lego-all-the-things-083f80bd3c51"><strong><em>LEGO</em></strong></a><strong><em>, </em></strong><a href="https://medium.com/@attilavago/list/technology-tech-news-a2d2d509b856"><strong><em>tech</em></strong></a><strong><em>, </em></strong><a href="https://medium.com/@attilavago/list/coding-software-development-d123369e3636"><strong><em>coding</em></strong></a><strong><em> and </em></strong><a href="https://medium.com/@attilavago/list/accessibility-4b67c1d08ef3"><strong><em>accessibility</em></strong></a><em>! 
For my less regular readers, I also write about </em><a href="https://medium.com/@attilavago/list/the-random-stuff-96bfc5a222e5"><strong><em>random bits</em></strong></a><em> and </em><a href="https://medium.com/@attilavago/list/writing-writing-tips-f83ef5e79de5"><strong><em>writing</em></strong></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b31ab0dcc17e" width="1" height="1" alt=""><hr><p><a href="https://engineering.prezi.com/youll-rebuild-everything-every-four-years-anyway-b31ab0dcc17e">You’ll Rebuild Everything Every Four Years Anyway</a> was originally published in <a href="https://engineering.prezi.com">Prezi Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How using Availability Zones can eat up your budget — our journey from Prometheus to…]]></title>
            <link>https://engineering.prezi.com/how-using-availability-zones-can-eat-up-your-budget-our-journey-from-prometheus-to-be8a816f7efe?source=rss----911e72786e31---4</link>
            <guid isPermaLink="false">https://medium.com/p/be8a816f7efe</guid>
            <category><![CDATA[monitoring]]></category>
            <category><![CDATA[prometheus]]></category>
            <category><![CDATA[victoriametrics]]></category>
            <category><![CDATA[grafana]]></category>
            <category><![CDATA[kubernetes]]></category>
            <dc:creator><![CDATA[Grzegorz Skołyszewski]]></dc:creator>
            <pubDate>Mon, 09 Dec 2024 16:31:05 GMT</pubDate>
            <atom:updated>2024-12-09T16:31:05.611Z</atom:updated>
<content:encoded><![CDATA[<h3>How using Availability Zones can eat up your budget — our journey from Prometheus to VictoriaMetrics</h3><h3>Intro</h3><p>By 2024, Prezi’s monitoring system, built around Prometheus, was becoming outdated. It was already 5+ years old, running on a deprecated internal platform and accumulating significant costs every month.</p><p>At the beginning of the year, we decided to deal with the “future problem” and modernize our metrics collection and storage system. Our goals were to run the monitoring system in our Kubernetes-based platform and reduce the overall complexity and costs of the system.</p><p>We achieved these using VictoriaMetrics. This post describes our journey, the challenges we faced, and the results we achieved from the migration.</p><h3>Previous state</h3><p>Our Prometheus-based system wasn’t <strong>that </strong>problematic by itself — we ran a pair of instances for each of our Kubernetes clusters to achieve high availability. We also had one extra pair for non-Kubernetes resources, and one for storing a subset of metrics with longer retention. You can see the high-level architecture of the system in the diagram below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tGT2rOPOAoGAiRKAbgc1rg.png" /><figcaption>Our Prometheus-based system architecture</figcaption></figure><p>Just before the migration, we had 5 million active series at any given point in time. It’s also worth noting that our microservices ecosystem was already instrumented for producing metrics in Prometheus format, and that was something we didn’t want to change — at this stage it’s the de facto standard (although it is slowly being superseded by OpenTelemetry).</p><p>There are some challenges when operating such a system:</p><ul><li>Exploring metrics or configuring rules must target specific installations. 
This made dashboarding and alerting more difficult — and it’s already difficult for most non-SRE folks in general.</li><li>The instances Prometheus ran on had to be <strong>really</strong> <strong>beefy</strong> to handle our load.</li><li>As mentioned in the introduction, the instances were running on the previous version of the Prezi platform that was already deprecated. We really wanted to move off.</li></ul><h3>The options</h3><p>Now that you know what we were dealing with, let’s look at what we could have done with it. We set out to explore our options, considering both managed and self-hosted solutions. We quickly realized that we couldn’t afford to ship our metrics to any of the vendors out there. We would have had to spend at least 2x the current cost, and the prospect of modern self-hosted solutions being even cheaper led us to drop that path.</p><p>On the self-hosted end of the spectrum, we had:</p><ul><li>Thanos</li><li>Mimir/Cortex</li><li>VictoriaMetrics</li></ul><p>Some members of the team were already familiar with Thanos and Cortex, so these were the biased first-choice tools that we tried to understand first. But we didn’t stop there and made a complete comparison of the concerns that we cared about. You can see the table from one of our exploration documents below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*E58dUytQXpOhzGjf" /><figcaption>Differences between Mimir, Thanos and VictoriaMetrics, taken from our exploration documentation.</figcaption></figure><p>We initially thought that using <em>block storage </em>may be a downside of VictoriaMetrics. We couldn’t have been more wrong — while it’s tempting to use the infinitely-scalable object storage (like S3), good old block storage is just cheaper and more performant. Given that cost control was one of the priorities, we saw an opportunity to run the system cheaper, and quite possibly with a less complex architecture. 
For example, thanks to using block storage, VictoriaMetrics needs no external cache subsystem, unlike the other two.</p><p>In the process of exploring what VictoriaMetrics has to offer, we also took a small detour and talked with the good folks at VictoriaMetrics to see if buying an Enterprise license for self-hosting, which enables some features we might have wanted, was within our budget. It turned out we didn’t really need those features, but buying the license wouldn’t have broken the bank for us either. And there’s nothing wrong with asking for a quote!</p><p>VictoriaMetrics stood out thanks to its simplicity and cost-efficiency, which we tested in a Proof of Concept.</p><h3>VictoriaMetrics Proof of Concept with some challenges</h3><p>We jumped into the implementation of a small proof-of-concept system based on VictoriaMetrics, to see how easy it is to work with (what good is the most cost-effective system if you can only get there after 3 months of tuning it back and forth?), how it performs, and to extrapolate the cost of the full system later on.</p><p>VictoriaMetrics allows you to install VictoriaMetrics Single — an all-in-one, single executable that acts almost exactly like Prometheus. It can scrape targets, store the metrics, and serve them for further processing or analysis. We knew from the start that we wanted to use VictoriaMetrics Agents to scrape targets, as that allowed us to host a central aggregation layer installation and distribute the agents — all of them contained, collecting metrics only within their environments (be that a Kubernetes cluster, or an AWS VPC).</p><h4>The initial idea</h4><p>We wanted to host the tool on Kubernetes in the end, so it made sense to rely on the distributed version of the system — for high availability and scalability, it just sounded good. 
We took the off-the-shelf helm chart for the clustered version — one where VMInsert, VMStorage, and VMSelect are each separate components.</p><p>The concept is fairly simple — VMInsert is the write proxy, VMSelect is the read proxy, and VMStorage is the component that persists the data to underlying disks. On top of that, we also installed VMAlert — the component used for evaluating rules (Recording and Alerting).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*mlx0uiMeMCx-YIuk" /><figcaption>High level overview of VictoriaMetrics Cluster architecture, taken from our exploration documentation.</figcaption></figure><h4>We didn’t want to test agent options yet</h4><p>We initially used Prometheus servers with <em>remote_write</em> for testing but quickly found that VictoriaMetrics Agents were far more performant for our needs. Even though we had a lot of headroom on the instances, Prometheus was just too slow to write to VictoriaMetrics.</p><p>Installing VictoriaMetrics Agent was easy with the already existing scraping configuration. We simply replicated the configuration — that was enough to make the Agent work.</p><h4>The cost and the performance</h4><p>We managed to create a representative small version of the system. That allowed us to test the performance of reads and writes, and see how many resources (CPU time, memory, and storage) the system used. We were absolutely delighted. We found queries that were timing out after 30 seconds in Prometheus returning data in 3–7 seconds in VictoriaMetrics. We didn’t find any queries that were performing significantly worse.</p><p>We also found that the resource usage footprint was minimal. The data is efficiently stored on the disk, and compressed, and the application uses very little CPU time and memory. Our estimations at the time showed: 70% less storage, 60% less memory, and 30% less CPU time used. 
This, together with bin-packing in Kubernetes, made us excited about saving a significant amount of money on the system.</p><p>Well done, VictoriaMetrics!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/724/1*Zn5eXdybRj1JGByXeE7QYg.jpeg" /><figcaption>skynesher/E+ via Getty Images.</figcaption></figure><h4>Too good to be true, or how using Availability Zones can empty your wallet</h4><p>So it was working, and it was working well. We were scraping metrics and using <em>remote_write</em> to store them. We could query the metrics in Grafana (added as a Prometheus data source, because VictoriaMetrics’ <em>MetricsQL</em>, the query language, is a superset of <em>PromQL</em> — which is fantastic!), we even added some alert rules and saw them trigger. That was so smooth. Too smooth.</p><p>A couple of days later, we found that we had racked up a significant bill, attributed to the network traffic in our environment. It turns out that running a distributed metrics system, where each query or write of a metric incurs an extra hop (VMSelect or VMInsert to VMStorage), can be costly when you put that in the context of inter-zone traffic in your hyperscaler (AWS for us). Not only were typical metric writes and reads subject to that, but evaluating rules (and we have some really heavy recording rules) also used the same route. That was concerning and made us stop and rethink our approach.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/724/1*Hlq1Yf1XG-sB65WF8VR72Q.jpeg" /><figcaption>DjelicS/E+ via Getty Images.</figcaption></figure><p>We needed to figure out something else.</p><h4>Back to the roots</h4><p>If you scroll up to the previous state diagram, where I showed how we used Prometheus, you might see that we used a pair of instances for HA. We decided to keep that approach for our new system. 
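Before describing the fix, it is worth putting rough numbers on the problem with a back-of-envelope Python sketch. The traffic volume below is purely illustrative (not our real number); the rate reflects AWS’s usual ~$0.01/GB charge in each direction for inter-AZ traffic within a region:

```python
# Illustrative inter-AZ cost estimate: traffic crossing an Availability
# Zone is billed on both sides (~$0.01/GB in + $0.01/GB out on AWS).
COST_PER_GB_EACH_WAY = 0.01  # USD, typical AWS intra-region inter-AZ rate

def monthly_cross_az_cost(gb_per_day):
    """Monthly cost of gb_per_day crossing AZs, billed both ways, over 30 days."""
    return gb_per_day * 2 * COST_PER_GB_EACH_WAY * 30

# A hypothetical 500 GB/day of writes, reads, and rule evaluations
# taking the extra hop between zones:
print(monthly_cross_az_cost(500))  # → 300.0
```

The exact figures don’t matter; what matters is that every extra hop is billed twice and compounds with volume, so keeping the hops inside one zone attacks the cost directly.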
Instead of using the clustered version of VictoriaMetrics per Availability Zone, we tested an installation based on two separate VictoriaMetrics Single instances, each in a different AZ. We went into “save as much as possible” mode at that time and traded local redundancy for global redundancy — if a single cluster with distributed components would have been enough for us reliability-wise, two instances in a <em>hot-hot</em> setup would do just as well!</p><p>Installing two single-replica Deployments of VictoriaMetrics Single worked flawlessly for us (spoiler — it still works flawlessly more than half a year later 🚀). We no longer cross Availability Zones with our extra-hop traffic.</p><p>We added a pair of VictoriaMetrics Alert instances next to each VictoriaMetrics Single instance, operating in the same Availability Zone.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/362/1*wYvudevkjcPkwSmiLt-L-g.png" /><figcaption>Aggregation Layer overview based on VictoriaMetrics Single instances.</figcaption></figure><p>We set up a load balancer in front of the instances for reading the metrics, mainly used by Grafana. Occasionally, one of the VMSingle instances goes down — then the traffic is sent to the other one. When an instance is unavailable, we don’t lose data — agents buffer it, and while we may skip a couple of recording rule evaluations, <a href="https://victoriametrics.com/blog/rules-replay/">VictoriaMetrics provides a neat way to backfill rules using vmreplay</a>.</p><p>The only time traffic goes across AZs now is when an agent is not hosted in the same zone as the target VictoriaMetrics Single instance. 
This is something that cannot be worked around, as long as we want two agents to write the data (which is then deduplicated smartly by VictoriaMetrics).</p><h3>The final architecture and other notable mentions</h3><p>Finally, our architecture looked like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LRgN6cOsTi0Mg9T-4B1uXQ.png" /><figcaption>VictoriaMetrics-based system architecture</figcaption></figure><p>(Yes, the diagram looks a bit more convoluted than the diagram for the previous system. This is the price you pay for having a more performant and cost-effective system with a better user experience 🙃)</p><p>There are also other use cases, which I haven’t touched on above — the long-term storage, and using VictoriaMetrics Operator to scrape non-Kubernetes targets and improve the system’s configuration capabilities. I want to expand a bit on these and one extra special thing below.</p><h4>Long-term storage</h4><p>We also wanted to migrate our long-term storage installation of Prometheus. When exploring VictoriaMetrics, using an enterprise license to have different retention configurations for series was tempting, but we checked and it wasn’t the most cost-effective way to do it.</p><p>We also had a brief episode of sending these metrics to Grafana Cloud, where we have 13 months of retention. That cost us pennies, but at the time of adding it, we had two Grafana installations — a self-hosted one, and a Cloud instance.</p><p>Having both short-term and long-term metrics in one Grafana would require us to add the Grafana Cloud Prometheus data source in our self-hosted instance. That would have been simple enough, but we found something better — we just set up yet another VMSingle instance with a different retention setting. 
We not only pay even less, but also keep 100% of our metrics in our own infrastructure.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/684/1*TNEQZneLoIx7X29NwzrZjg.jpeg" /><figcaption><a href="https://www.gettyimages.com/search/photographer?photographer=Michael%20Blann">Michael Blann</a>/DigitalVision at Getty Images.</figcaption></figure><h4>VictoriaMetrics Operator</h4><p>Our scraping and rules configuration for the previous system was overly complicated, carrying a baggage of tech debt — sometimes neither we nor our users understood how to configure the system. We wanted to change that.</p><p>We chose to install and configure VictoriaMetrics using the Kubernetes Operator. All of the components are managed by the Operator, as well as the configuration of the system. That allowed us to distribute the configuration concerns to our users — our product teams can now configure alerting for their services from their repositories. If you want to know how we pulled that off, let me know — that would definitely be material for another post.</p><h4>Scraping non-Kubernetes resources with VictoriaMetrics Operator</h4><p>When we were setting up the system in production, VictoriaMetrics Operator was still in its early days. There was no support for Service Discovery of non-Kubernetes targets (now there is), and there was no way to install VMAgent (an Operator-managed Custom Resource) that wouldn’t be injected with the same configuration as the other VMAgents in the cluster (at least not in an easy, maintainable way).</p><p>To overcome these limitations and still collect metrics from our other workloads, we chose to install an additional VictoriaMetrics Agent using the helm chart and configure it statically. 
This works for us because the targets don’t change that much and are mostly infrastructure-related, so the people configuring the scraping are more familiar with Prometheus/VictoriaMetrics than, say, a Python-focused Software Engineer.</p><h4>Single pane of glass in Grafana Cloud with self-hosted metrics</h4><p>Lastly, the very recent change that is worth mentioning — consolidating our Grafana instances. We now have only one instance of Grafana, thanks to a smart solution offered by Grafana Labs — Grafana Private Data Connect. We install the agent next to our VictoriaMetrics, which sets up a SOCKS5 tunnel between our and Grafana Labs’ infrastructure. That allowed us to add a self-hosted VictoriaMetrics as a data source in Grafana Cloud. What’s more — it’s free (except for the network traffic)! Neat! Well done, Grafana Labs! 💪</p><p>Note: We are a happy customer of Grafana Labs and their Cloud offering, as you may know from <a href="https://engineering.prezi.com/how-prezi-replaced-a-homegrown-log-management-system-with-grafana-loki-15111174ff91">How Prezi replaced a homegrown Log Management System at Medium</a> or <a href="https://bigtent.fm/s2/2">Grafana’s Big Tent Podcast S2E2</a>, where Alex first explained how we landed on Grafana Loki for our Log Management, and then explained how we use Grafana IRM for our Incident Management. Check these out!</p><h3>What have we gained from migrating our system?</h3><p>The benefits can be summarized as follows:</p><ul><li><strong>Cost Efficiency</strong>: Saved ~30% on system costs.</li><li><strong>Performance</strong>: Query speeds improved significantly, with heavy queries completing in 3–7 seconds (vs. 
30+ seconds).</li><li><strong>User Experience</strong>: Streamlined metrics access and configuration via Kubernetes-native tools.</li><li><strong>Scalability</strong>: The system is now future-proof for growing workloads.</li></ul><p>Lastly, working on the migration allowed us to learn a ton, and work on something interesting and challenging.</p><p>Migrating from Prometheus to VictoriaMetrics transformed our monitoring system, offering cost savings, performance gains, and an improved developer experience. If you’re considering a similar move, we strongly recommend evaluating VictoriaMetrics for its simplicity and efficiency.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=be8a816f7efe" width="1" height="1" alt=""><hr><p><a href="https://engineering.prezi.com/how-using-availability-zones-can-eat-up-your-budget-our-journey-from-prometheus-to-be8a816f7efe">How using Availability Zones can eat up your budget — our journey from Prometheus to…</a> was originally published in <a href="https://engineering.prezi.com">Prezi Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How To Turn Red Energy Into Strategy And Migrate All Your Tests While You’re At It]]></title>
            <link>https://engineering.prezi.com/how-to-turn-red-energy-into-strategy-and-migrate-all-your-tests-while-youre-at-it-12b29c665ec5?source=rss----911e72786e31---4</link>
            <guid isPermaLink="false">https://medium.com/p/12b29c665ec5</guid>
            <category><![CDATA[quality-assurance]]></category>
            <category><![CDATA[coding]]></category>
            <category><![CDATA[software-testing]]></category>
            <category><![CDATA[engineering-mangement]]></category>
            <category><![CDATA[software-development]]></category>
            <dc:creator><![CDATA[Attila Vágó]]></dc:creator>
            <pubDate>Tue, 26 Nov 2024 04:13:06 GMT</pubDate>
            <atom:updated>2024-11-26T13:12:25.106Z</atom:updated>
<content:encoded><![CDATA[<h4>An in-depth look at migrating over 140 Ruby-based Cucumber tests to a Java-based test automation framework…</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hrjJ3SqpTiGLQxHHzsaBUQ.jpeg" /><figcaption>Photo edits by author. Ruby logo Copyright © 2006, Yukihiro Matsumoto, Java logo by <a href="https://logoeps.com/java-eps-vector-logo/40925/">LogoEps</a>. All assets used with permission.</figcaption></figure><p>One of the major challenges a software engineering organisation tends to face at one point or another in its lifetime is technical debt that simply cannot be “paid back”. Even with the best of intentions, it does happen, and it can happen for a myriad of reasons, one of them being a stack change over time or a certain language or framework’s fall from grace over the space of a decade or two. Add to that some inevitable brain drain, and you have yourself a migration trifecta.</p><p>Over the last few years, an ongoing conversation between engineering teams was our hefty suite of Cucumber regression (E2E) tests written in Ruby. As the years have gone by, Ruby has slowly become the abandoned child of our stack. There was a lot of appetite for it initially, and a fairly widespread skillset across the teams. The language was popular, Cucumber was popular, so writing tests in Ruby was also popular. Until it wasn’t. By 2021, whenever our Cucumber tests came up in conversation, you could feel the dread setting in. Everyone wanted to get rid of them, but nobody had the time, will, or energy to do it. After all, we were talking about roughly 200 tests.</p><blockquote>By the end of 2023 we had virtually no Ruby skills left in the company. Be that on the infrastructure or the development side.</blockquote><p>It’s important to remember that regression tests don’t just run in a vacuum or on local machines. Writing them, updating them, is only half the equation. 
The other half is an entire infrastructure that enables those tests to run as part of your CI pipelines. At this point, it wasn’t just developers who wanted to — and I quote word for word —<em> “kill it with fire”</em>; our developer experience team (DX), who were tasked with maintaining the Ruby infrastructure, were also getting exhausted by its costly and unsustainable maintenance, never mind the risk of ending up in a situation where some dependencies would simply not be supported at all anymore, blocking the pipelines and thus critical releases to production of our products. I mean, just look at these gems, and I say that both literally and figuratively:</p><pre>ruby 2.5: release date: 2017-12-25, EOL: 2021-04-05 (latest version: 3.3.6)<br>google chrome 75: release date: 2019-06-04 (latest version: 131)<br>bundler gem v1.17.3: release date: 2018-12-27 (latest version: 2.5.23)<br>cucumber 3.1: release date: 2017-11-28</pre><p>As one of my DX team-mates aptly put it, it was a time-bomb ready to blow at any moment. The last time I heard that, I had to migrate an entire frontend from Angular 1 to React and do so while also <a href="https://medium.com/p/8373a6e67ac8">moving a monolith to microfrontends</a>.</p><p>But I’ll be honest, I also tend to be intrigued by challenges that keep not getting solved for a long time. Perhaps it’s a form of self-validation, or just “red energy”, as one of my therapist friends calls it.</p><blockquote>If you ever used anger to fuel positive change, you used red energy.</blockquote><p>By spring of 2024 it was decided. I was going to make it my personal goal for the year to once and for all migrate all the Ruby Cucumber tests to our Java-based E2E framework. I was hell-bent on doing whatever was necessary to get it done. Unbeknownst to me, Turu, a colleague of mine from the QA team, had a very similar energy fueling a very similar goal. 
I know that 9 times out of 10 the word “synergy” is used completely unnecessarily in conversations, and we’re all tired of hearing it, but this time the synergy was real. I was going to need the QA team’s support to some extent anyway, but seeing our goals intersect — love the boardroom lingo, aye? 🙂 — was a massive relief, as it meant we were going to be able to share the load somewhat more evenly and accomplish — now our collective goal — faster. Believe it or not, sometimes throwing more people at the problem does help. As much as I love Fred Brooks’ timeless software engineering classic, it doesn’t always apply.</p><h4>A few words on strategy</h4><p>In short? Let’s call it the “80 days around the world” strategy. I could say we time-boxed it, but that sounds boring, and tying our success somehow to Jules Verne sounds more fun. Regardless of what you call it, that aspect — especially in hindsight, and hindsight is always 20/20 — was crucial to getting this migration done.</p><p>I have learnt this doing a lot of proof-of-concept projects and hackathons. Creating an unmovable constraint — designers know this first-hand — inspires people. Creative ideas surface, people suddenly become more dynamic, adaptable, and start focusing on what truly matters — the outcome by a certain date. In this case, we really did give ourselves around 80 days with a singular goal: migrate everything.</p><blockquote>Migrate everything in 80 days. How? Doesn’t matter. Get creative. Stay pragmatic. Get. It. Done.</blockquote><p>Anyone who works in software development knows that prioritisation is a tricky business. A lot hinges on it. In this case, everything did. I ran all the Cucumber tests locally, and quickly realised we would have to be smart about what we migrated, when, and why. So, to make sure we stayed efficient:</p><ul><li>I reached out to teams to find out if they had any redundant or deprecated tests. 
Some did, so I marked them for deletion.</li><li>I looked at the currently passing tests, and created the first batch to migrate. These got priority because all of these tests were running on live software, used by millions of customers. If, for whatever reason, we suddenly ran out of time, we’d at least have the most important tests migrated.</li><li>Then I created a second batch, while my colleagues from QA already began lending a hand in migrating them to our own test automation framework (TAF). This second batch comprised all the flaky tests, the ones failing for whatever reason, and the disabled ones.</li><li>Finally, there was a last set of tests that covered some of our A/B tests. Initially, I almost made the mistake of starting with these, but then I realised that by the time we were done with the migration, most of these A/B tests would have already been concluded. That turned out to be true, and out of 20 or so, we only had to write tests for 3.</li></ul><p>Once prioritisation was ready, the QA team (partially) and I (full-time) got working on the implementation part. Test after test, one by one, day after day, we could see the progress. We used a traffic-light system. Tests that we had migrated, we marked with green 🟢, tests we were working on we marked with amber 🟠, and tests we found did not need migrating, we marked with red ❌. At all times, everyone involved knew who was working on what test. I decided to waste as little time on Jira tickets as possible, so we did most of the tracking in a Confluence doc.</p><blockquote>Were we ruthless with our time-saving measures? Perhaps. But did we deliver the work on time? You bet!</blockquote><p>Once all the tests were migrated, QA did a final review to make sure we had tagged everything correctly and no important test cases were missed, and as an output, we created a log table that showed which Cucumber test ended up in which TAF test. 
Literally within days of migrating, we already had engineers making use of this log, as they now had to find the old Cucumber test cases in their new home.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*r_PxScMLKn3kRXQfMUASdw.png" /><figcaption>A diagram of the entire process created by the author in Freeform.</figcaption></figure><p>The final step in the strategy was setting up the CI appropriately. We wanted to make sure these tests were parallelised, but in doing so, we had to keep infrastructure cost in mind. Our Ruby tests, while a pain in the neck in every other way, used a fairly low amount of resources, while the Java tests were a tad more resource-hungry, but DX figured out a good resource-to-test ratio to keep costs in check. With that in place, I had the honour of pressing the archive button on the repository and announcing to the entire company that we had finally killed all our Cucumber tests.</p><h4>What ultimately enabled a successful migration</h4><p>Looking back, trying to run a retrospective in my head of what went well, and why we finally managed to pull this migration off, there are a few things that come to mind, and some of these I have come to consider key to any successful project going forward.</p><p><strong>We had a common goal.</strong> It cannot be overstated just how important it is for everyone to row in the same direction. It empowers those doing the work to focus on it and do it well. So, the support of both my team and the QA team was crucial. Turu, our senior automated QA specialist, had this migration as a personal 3rd quarter goal just as much as I did, so we were both heavily invested in getting the work done successfully.</p><p><strong>Zero wasted time.</strong> Apart from a few initial meetings my team and I had with QA around what we wanted to achieve and some historical context, the only meetings we had were a weekly 1-hour sync between Turu and myself. 
That’s roughly a day’s worth of meetings over a 10-week project. That’s not to say that meetings are bad, but they do cost the project time, and we couldn’t afford that.</p><p><strong>Keeping the goal in mind.</strong> And the goal was clear: migrate all the tests as effectively as possible within the time we had. At times, that meant merging multiple test scenarios into one, or moving a test into another existing test as a scenario rather than keeping it as a standalone test.</p><blockquote>For each test, we did whatever made most sense instead of sticking to a 1:1 carbon-copy approach.</blockquote><h4>Translated to tangible business outcomes</h4><p>But that’s the engineering (including QA) success story, and as I mentioned in <a href="https://medium.com/gitconnected/how-to-sell-engineering-needs-to-product-managers-2a4f379103b6?sk=60f7bf95b768bc5dbdcd463bddf56e84">“How to Sell Engineering Needs To Product Managers”</a>, as engineers we owe it to ourselves and the business to translate engineering needs to business needs. I’d be the first to shy away from work that makes no business sense. While I’m no CFO, nor do I intend to ever become one, any effort that doesn’t make business sense doesn’t sit well with me. That said, no project will ever be done <em>“because it sits well with Attila”</em>, so let me translate this particular engineering need to a business need.</p><p>When you have tests written in a language that nobody knows or cares to learn, those tests will be either poorly written or not written at all. This increases the chance of customer-blocking bugs that could go unnoticed until customer support is alerted, at which point it’s already too late and costly. So, a more robust product results in fewer customer support calls, aka money saved.</p><p>The other downside of a severely outdated test infrastructure is maintenance. Ideally, a software company wants to spend as little money as possible on maintenance.
Features or A/B tests are more interesting, and they make more money. Maintenance that costs 10 times more than it should is a waste of money, brings down morale, and might even be the reason you can’t hire new engineers. There’s only so much money in an engineering pot, and we much preferred spending it on new tools or perhaps even additional headcount to maintaining a severely aged infrastructure.</p><blockquote>Reducing complexity increases velocity. It really does come down to that.</blockquote><p>As our DX team repeatedly highlighted, we were sitting on a time-bomb. Waking up every day to the very real possibility that one of our Ruby-Cucumber dependencies gets nixed because of its age is not a great place to be in when the core functionality of your product — such as signup, payments, and analytics — depends on it. Such a situation would have caused severe disruption for Product, wasted A/B testing runtime, increased manual QA and customer support costs for weeks if not months, and potential loss of customers and revenue. This is unacceptable, especially when you are on a growth trajectory.</p><p>Finally, this migration was also a massive enabler. Within weeks of completion, having all of our tests in one place, we were already able to identify areas where we could make our tests more efficient, spend less time in the CI, and be more confident in what is being tested — aka have a real and meaningful understanding of our coverage. This can only mean one thing: better velocity in 2025 and beyond, and if there is one thing that Product Managers love hearing, it’s higher throughput.
😉</p><h4>Closing thoughts on migrations, AI, and machine learning</h4><p>As QA and I were wrapping the migration up, I couldn’t help but reach certain tangential conclusions that, I feel, will be food for thought for many of us software engineers and quality engineers in the coming year(s).</p><p>While completing a migration like this is an exciting opportunity for some of us — myself included — it’s not something most engineers would volunteer for, and for good reason. Migrations can be a can of worms: you’re touching a lot of legacy code you’ve never seen before and have no historical context on. You’re likely going in a little blind.</p><p>Then there’s also the monotonous aspect of the job. Especially when it comes to writing E2E tests, once you have everything in the framework available to you, writing the tests themselves can feel like more of the same, which brings me to my next point and an interesting realisation.</p><p>At one point, by pure luck, I downloaded the latest version of <a href="https://www.jetbrains.com/help/idea/full-line-code-completion.html">IntelliJ that features Full Line code completion</a>. Within minutes, I started seeing the IDE suggest my next line of code, be that a new page object or an assertion, and what do you know? It was often right! Often enough that I saved 2–3 days’ worth of time over the course of the migration. This was machine learning in action, under human supervision, which made me think…</p><blockquote>If there is one job that I’d like generative AI to do in the future, it’s maintenance and migrations.</blockquote><p>It would have been great to feed a model our Cucumber and TAF tests, let it figure out what was missing, migrate those tests, run them and even deploy them with minimal human supervision. Now that’s something I could really get behind, and who knows, with another healthy dose of red energy it might soon become reality.
😉</p><p><em>Attila Vago — Software Engineer improving the world one line of code at a time. Cool nerd since forever, writer of codes, blogs and books. </em><a href="https://www.goodreads.com/book/show/205716390-it-s-cold-ma-it-s-really-cold"><strong><em>Author</em></strong></a><em>. Web accessibility advocate, LEGO fan, vinyl record collector. Loves craft beer! </em><a href="https://attilavago.medium.com/my-200th-article-hello-its-time-we-met-3f201ad1303"><strong><em>Read my Hello story here!</em></strong></a><strong><em> </em></strong><a href="https://attilavago.medium.com/subscribe"><strong><em>Subscribe</em></strong></a><strong><em> </em></strong><em>for more stories about </em><a href="https://medium.com/@attilavago/list/lego-all-the-things-083f80bd3c51"><strong><em>LEGO</em></strong></a><strong><em>, </em></strong><a href="https://medium.com/@attilavago/list/technology-tech-news-a2d2d509b856"><strong><em>tech</em></strong></a><strong><em>, </em></strong><a href="https://medium.com/@attilavago/list/coding-software-development-d123369e3636"><strong><em>coding</em></strong></a><strong><em> and </em></strong><a href="https://medium.com/@attilavago/list/accessibility-4b67c1d08ef3"><strong><em>accessibility</em></strong></a><em>! 
For my less regular readers, I also write about </em><a href="https://medium.com/@attilavago/list/the-random-stuff-96bfc5a222e5"><strong><em>random bits</em></strong></a><em> and </em><a href="https://medium.com/@attilavago/list/writing-writing-tips-f83ef5e79de5"><strong><em>writing</em></strong></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=12b29c665ec5" width="1" height="1" alt=""><hr><p><a href="https://engineering.prezi.com/how-to-turn-red-energy-into-strategy-and-migrate-all-your-tests-while-youre-at-it-12b29c665ec5">How To Turn Red Energy Into Strategy And Migrate All Your Tests While You’re At It</a> was originally published in <a href="https://engineering.prezi.com">Prezi Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A Rare Insight Into The Daily Challenges Of An Experiments Team]]></title>
            <link>https://engineering.prezi.com/a-rare-insight-into-the-daily-challenges-of-an-experiments-team-349a94960b4f?source=rss----911e72786e31---4</link>
            <guid isPermaLink="false">https://medium.com/p/349a94960b4f</guid>
            <category><![CDATA[a-b-testing]]></category>
            <category><![CDATA[prezi]]></category>
            <category><![CDATA[product-development]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[engineering-culture]]></category>
            <dc:creator><![CDATA[Attila Vágó]]></dc:creator>
            <pubDate>Tue, 09 Jul 2024 13:21:02 GMT</pubDate>
            <atom:updated>2024-07-09T13:21:02.170Z</atom:updated>
            <content:encoded><![CDATA[<h4>If you thought feature development was tough, try developing A/B tests all day, every day… 😉</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*n3VDMgH-u5OU5VAH" /><figcaption>Photo by <a href="https://unsplash.com/@jasongoodman_youxventures?utm_source=medium&amp;utm_medium=referral">Jason Goodman</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>I think I’d need four hands to count all the types of projects I have touched in over a decade. From small tech start-up to mid-size web agency, and from mid-size established tech company to large tech corporations, I’ve seen them all and contributed to more codebases than I can or care to count. And yet, in Prezi — two years after joining — I found myself in an entirely new context: an experiments team, and my mind is blown. Every. Single. Day.</p><h4>What does an experiments team do?</h4><p>If you’re thinking greenfield projects, I’m going to stop you right there. This is the team that has little to no chance of such luxuries. It’s quite the opposite. In Prezi, the experiments team’s main purpose is to think of A/B tests, plan them, implement them, watch them perform and then, based on the outcome, release one of the variants, or scrap the entire test. That’s the gist of it, anyway. Why would we do that? One word: success. The success of Prezi as a whole and, implicitly, our customers’.</p><p>Our experiments team is actually called “Growth &amp; Monetisation” or GM — though that always makes me think of General Motors, which we have nothing to do with. As far as I know, the only cars we ever built in Prezi were all made of LEGO. 😄 We’re in the business of inspiring, unforgettable, impactful presentations, and our team makes sure to drive that message home to more happy customers than ever.
Naturally, you can have the best product out there, but if the path to it isn’t frictionless, churn will be high. So that’s what we do: we try to understand, based on data generated by A/B tests, where potential customers drop off, why, and how we <em>might</em> be able to change that for the better.</p><p>I very deliberately used the word <em>might</em> there. Predicting success in software development is, in my experience, about as accurate as predicting the weather in Ireland. While there is not much we can do about Irish weather, we can test hypotheses in a software product <em>relatively</em> easily. You see what works, and armed with that knowledge, you make further decisions. But you’ll also quickly realise success isn’t a given in an A/B test. In fact, the industry-standard success rate — which we’re also tracking — is 30%. That’s 7 out of 10 A/B tests failing or being inconclusive.</p><blockquote>Running an A/B test implies a chance of failure. You have to accept that to ultimately succeed.</blockquote><p>But, with a healthy dose of pragmatism and informed optimism, you might also see the failed tests as a great learning opportunity. If nothing else, these will either stop you from investing in the wrong features, prevent you from entering a technical rabbit-hole, or even inform and shape future A/B tests and get them that much closer to a successful outcome. And when you think like that, you realise that no time spent on A/B testing is lost time, as one way or another, it feeds directly into <a href="https://productcoalition.com/product-strategy-lessons-from-dr-house-f55872182164">your product strategy</a>.</p><p>Allow me to inspire you with a few examples. The first two will be successful experiments that ended up increasing either revenue or the number of registered users.
The latter two will focus on failed experiments we learned a lot from.</p><p><strong>A privacy control call-to-action.</strong> A Prezi created by a free account is always public, and we made that abundantly clear as soon as the user entered the editor, before they even got a chance to use the product and get excited by the prospect of creating unique, engaging, multidimensional presentations or generating one with Prezi AI. Sure, the immediate — in your face — privacy notification was well-intentioned. We wanted both to make users aware that their presentations were public and to give them the chance to upgrade. This experience, however, was one of high friction.</p><p>Our theory was that we might be able to improve on this and not lose subscriptions in the process, maybe even win some more, so instead of the notification, we just added a visible call-to-action button signalling the document was public. On click, the user — as before — had the option to just acknowledge and keep it public, or upgrade to a paid membership and make the Prezi private. And guess what? Not only did we not lose subscribers with the new approach, but we even gained some.</p><p><strong>Confusing SSL Callout.</strong> In our paywalls we wanted to give our customers peace of mind by calling out that transactions are SSL-encrypted and 100% safe. One would think this is a great example of caring about your users. Except just like parenting can go very wrong when you become a helicopter parent, caring for users can also go sideways. In this particular instance, instead of gaining subscribers, we lost a considerable number of them because:</p><ul><li>It was confusing to see that message on a free trial start page.</li><li>It was placed at the beginning of the form, rather than the end, below the payment button.</li><li>There is a chance it’s quite a redundant message these days when it’s assumed and expected that all transactions are safe.
Heck, most browsers won’t even load unsecured sites anymore, so you expect to see something like this more on scam sites than legit products.</li></ul><p><strong>One small change in the right place.</strong> A fantastic example of just how small yet incredibly effective an A/B test can be is adding a single line of text to our product selector. We expected that mentioning the new AI capability for creating presentations would perhaps get us a small increase in bookings. Turns out, we were just thinking small, and the real increase was significantly higher. Granted, we couldn’t have added that line without the teams having built all the AI features, but it shows how important it can be to bring that to users’ attention where it makes the most impact in the flow.</p><p><strong>One big change in the wrong place.</strong> How many times have you seen complete redesigns being done on websites and apps in hopes of recapturing users, only to find they had no effect? In fact, they might have even made things worse. We did the same with our business users, but as an A/B test — why commit if you don’t have to, right? We expected a modest increase in bookings, and how wrong we were. Turns out, our business users were put off by the redesign; the features we thought were interesting to them made them pause enough that we ended up with an unexpected negative impact on bookings! 😱 The good news is, we quickly learned how not to do a redesign, and that we might want to highlight more relevant features to them in future iterations.</p><blockquote>A/B tests are the financially responsible way of developing software. Agile development on steroids, if you will.</blockquote><p>So, A/B tests to the rescue, right?
But in the history of software development there hasn’t yet been a solution that didn’t bring its own set of challenges, and that’s what I really want to focus on, so that anyone wanting to truly invest in experiments and improve their product does so with eyes wide open. The rewards may be undeniable, but the challenges aren’t negligible either.</p><h4>And what does that mean for engineers?</h4><p>On the web, opinions are split on whether having specialised skills as an engineer is better than being T-shaped. As a staff engineer, I have come to the conclusion that just like having the right tools for the right job, having the right engineers within the team is also crucial to success. On our team’s page, we have a short but sweet table of the skills the ideal engineer needs to feel comfortable on the team. In contrast, a native apps team’s table would probably look very different. See? Right people for the right job.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3oiVf1g4ssIHAQSxzHAH4A.png" /><figcaption>Our team mindset table — <em>Hire</em> the right people, with the right skills and the right mindset — screenshot by author</figcaption></figure><p>I won’t, however, just leave you with a table up for interpretation, as I think all of those traits are worth a paragraph or two of clarification.</p><ul><li>Impact-driven as opposed to creativity- or technology-driven. In over a decade, I have met more engineers passionate about technologies, coding paradigms and abstractions than those who just want to get stuff out there for the sake of learning and iterating. To be perfectly frank, you need both in an engineering organisation if you don’t want your product to become unusable and unmaintainable.
Heck, even within our team, some of us care more about the architecture, software integrity and efficiency of the product than others, but <strong>we all share the conviction that whatever we do must have a tangible benefit to us as a team, Prezi and, ultimately, the user</strong>. This is not the team where you randomly get to try a new frontend library and rewrite one of your services in Rust — as exciting as that may sound.</li><li>It’s important not to confuse being versatile with being a jack-of-all-trades. That being said, the ideal engineer on our team, <strong>while not necessarily a hard-core full-stack engineer, won’t shy away from jumping into either side of the codebase</strong>. In our case, that means a wide range of frontend and backend libraries and frameworks. It sounds intimidating perhaps, but in reality it’s a lot less about the expectation of knowing everything and a lot more about the openness to discover it all over time.</li><li>Being an efficient engineer deserves an article — if not a book — of its own, but let me condense it into a couple of thoughts. We engineers have a tendency to polish code, to refactor to the point of giving the impression we’re not writing software but sculpting the Venus de Milo. <strong>In an experiments team, we’re more focused on creating meaningful stick figures. As long as we’re able to gauge from the experiment the data we need, the goal is achieved.</strong> The code doesn’t have to be optimised (unless it’s getting in the way of being able to run the test), and keeping the implementation as simple as possible is a prime objective. As long as it’s testable and revertible, you have yourself a candidate for release.</li><li>Having a data-driven attitude is key, and I think it drives a lot of the other traits. How often have we engineers developed useless features over weeks, months, maybe even years? It’s not uncommon. In an experiments team, however, you don’t have the luxury to do that.
<strong>Unless there is data to support a code change, a new feature, a variant of a feature, it simply won’t happen.</strong></li><li><strong>Being an avid learner goes hand-in-hand with being data-driven.</strong> The focus in an experiments team is on understanding what happened but, more importantly, why, as the answer will drive the next experiments and possibly a considerable part of the product strategy.</li><li>A competitive engineer, comfortable with bold ideas, isn’t necessarily a reckless one. It also doesn’t mean a lot of “hacking stuff together”. <strong>It’s rather a fine-tuned skill of seeing through the technical challenges in such a way that they’re able to propose the shortest technically viable path to success</strong>, and that path doesn’t have to follow the status quo.</li></ul><p>On a personal note, I would argue that many of the above skills are worth picking up over time for any engineer. As one moves from company to company, from team to team, being able to adapt to different mindsets can very positively impact one’s career.</p><blockquote>If you find yourself having the opportunity to join an experiments team, go for it, learn from it, make the most of it. You’ll thank yourself later.</blockquote><h4>All fingers in all pies</h4><p>Before joining the GM team in Prezi, I was lead engineer on Prezi Video for Zoom, and later, on the first two waves of Prezi AI. Both, especially in the case of the former, meant that development time was mostly spent in a couple of repositories, in very distinct areas of the product. Prezi Video for Zoom was a web app of its own, and Prezi Present — where Prezi AI was released — is mostly a self-contained entity as well, unless you start veering into service territory, but we have dedicated teams for that. In contrast, the very first day I joined the GM team, I found myself checking out not one, not two, but a whole list of repositories, and as time passed, a few more.
I have eight running at the moment in my development environment, and that still doesn’t cover all the possible flows a user could take on the Prezi website. Add to that Prezi Present, which we still contribute to with experiments, and you have yourself a context in which certain complexities are unavoidable.</p><p>You may wonder, why unavoidable? Can’t other teams run their own growth and monetisation experiments in their respective areas of expertise and ownership? I have no doubt that in certain organisations that is possible. And even in Prezi, for instance, we were able to do that with Prezi Video for Zoom. Our Infogram team can also operate similarly, as it’s a distinct product. However, when it comes to the rest of what Prezi essentially is — the Prezi website, Prezi Present and Prezi Video — one has to approach it holistically, and we must be able to own the experiment end-to-end, which conveniently brings me to what an experiment lifecycle looks like.</p><h4>Experiment lifecycle</h4><p>A picture’s worth a thousand words, and because this article is vertiginously approaching 4000, I’ll rely on a diagram to tell most of the experiment lifecycle story.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xWHtX8RDPaAfFzw_HSlaVw.png" /><figcaption>Experiment lifecycle diagram created by author in Apple Freeform</figcaption></figure><p>Releasing an A/B test is — quite literally — only half the work and half the story, but let’s see briefly what these 13 steps in the experiment lifecycle are:</p><ol><li>Ideation is a somewhat nebulous step, and it involves a lot of product manager/product owner (PM/PO) sorcery outside the scope of this article, but generally speaking, ideas will be based on market research, data, previous findings, user feedback, etc.</li><li>Ideas there may be many, but it’s important to keep focus on what moves the company goals forward in a viable context.
Sometimes ideas can be really good, but other things need to happen before they become feasible.</li><li>Having a low-fidelity design — a rough sketch — of what the experiment and the user flow would look like can further validate the idea or uncover logical fallacies. At this point, you might already find engineers to be a great asset in the conversation.</li><li>Getting to the planning stage means this is now going ahead full-steam and gets into the upcoming sprint. In our case, we tend to work kanban style, so whoever is next willing and well-suited enough to pick the work up gets to do so. Every so often you’ll find that the experiment is not just a story, but an entire epic, in which case several engineers might allocate their time to it, led by a project lead.</li><li>Development is as self-explanatory as it can be. It’s the coding stage, including writing automated unit, integration and regression tests, adding the feature switches and getting everything into one (or more) <a href="https://medium.com/p/3fb5e1ad62d0">pull requests for code review</a>.</li><li>Our manual QA team member(s) ensure everything has been done to spec and execute some regression testing as well. Given the number of experiments we run, it’s much-needed peace of mind to know at least one set of objective eyes checks everything.</li><li>Releasing deserves a section of its own, so keep reading. For now, let’s just say it involves setting the feature switch configuration up for the desired cohorts and enabling them. Once it’s released, a cleanup task is automatically generated for a later date (see step 12).</li><li>Spot checking ensures we’re on the right track with the experiment: nothing blew up, and we’re not seeing any majorly negative results or collateral damage in signups or upgrades.</li><li>After a few weeks, the experiment is stopped, so no more new users are getting exposed to the test.
At times, we might allow the users who have been getting the test variant to keep access to it so we can further observe user behaviour. This usually lasts no more than another 2–3 weeks.</li><li>Evaluation is all about interpreting the data and understanding the learnings. This is the moment we may decide to release a variant (success) to all users or stick to the control (fail).</li><li>Rollout is essentially the outcome of the evaluation — all users get one variant going forward, which from that point on becomes the control.</li><li>Cleanup is another phase I deemed important enough to highlight in its own section, so do keep reading, but the short of it is, we ensure that all redundant code, tests, and feature switches are done away with. This triggers steps 4, 5 and 6, all culminating in the final step…</li><li>Everything is done. The variant is rolled out, the code is cleaned up, and we have either learned something (failed experiment) or achieved something (successful experiment).</li></ol><p>That’s the gist of the experiment lifecycle, but as I mentioned, there are a couple of stages there that are really worth digging into more to truly understand some of the complexities and challenges a team like ours can face on a daily basis.</p><h4>Dealing with feature switches</h4><p>Some will call them the best human invention since fire, while others, a necessary evil. I, for one, think they’re a very useful tool, but like every tool, they can be overused or misused. In our case, it’s invaluable to have the option of setting up a new feature switch for every experiment and variant. The more challenging part is keeping track of them all.</p><p>For context, we have 9 engineers on the team, and generally speaking, we aim for just as many experiments per sprint. Some quick maths suggests 160 experiments per year, but let’s go with a more conservative 100 experiments instead. Assuming just two variants per experiment, plus a switch to control the bucketing of each, already means 300 feature switches.
100 of those control the bucketing of the variants. If not handled correctly, things can quickly get out of hand, so we have devised some ways to avoid that:</p><ul><li>Adding a special prefix for feature switches that control the variants.</li><li>Using team-based feature switch prefixes.</li><li>Making sure each feature switch has clear ownership marked — we use a unique team email address.</li><li>Giving each feature switch a link to the experiment note or the Jira ticket it refers to.</li></ul><p>This varies from organisation to organisation, but in Prezi, it’s mostly the software engineers who add, configure and clean up feature switches. We opted for this approach as it keeps the control of software integrity in engineering’s hands. We don’t have to worry about product owners inadvertently breaking regression tests by turning switches on and off at the wrong time.</p><h4>Releasing an experiment</h4><p>While releasing an experiment will ultimately come down to just flipping a switch — a feature switch, that is — there’s a lot more to it, and how much exactly can vary from experiment to experiment. Some are a lot more involved than others. As I am writing this, I am working on an A/B test that involves three frontend bundles (think apps) and four different services. Even if you’re experienced and QA did a fantastic job making sure we haven’t broken anything, there are still a myriad of things that can fall through the cracks.</p><blockquote>To make sure releases go as smoothly as possible, we adopted an already standard practice from aviation and medicine — a checklist.</blockquote><p>Surgeons use Surgical Safety Checklists, and pilots rely on Pre-flight Checklists to ensure the best outcomes.
We call it a release document, but it’s really a checklist, as clearly stated in the head of each document:</p><blockquote>This document is meant to be used as a <strong>checklist</strong> for the person who’s driving the release to be able to do it in a calm, collected, professional way. Also meant to act as a document for others, so when troubleshooting is needed, all the information about what was happening during a release is recorded. — Prezi internal release document</blockquote><p>All such documents are signed off by at least one — but ideally two — senior or lead engineers on the team.</p><p>To some, this might seem excessive, and at times it really is, but in weighing the costs and benefits, as a team we concluded this approach gives us enough value and confidence to stick to it. Just to illustrate some of the items on the checklist, here’s what we look for:</p><ul><li>Have the relevant senior/lead engineers signed off on the plan?</li><li>What components are meant to be deployed and have they deployed successfully?</li><li>Have all relevant teams been notified about our intent to release the experiment?</li><li>What’s the feature switch configuration?</li><li>Is the testing scenario working on production as expected?</li><li>Is the A/B test distribution as expected on OpenSearch?</li><li>Any unexpected spikes in Grafana?</li><li>Are there any new relevant Sentry error logs?</li><li>What action(s) to take in case of needing to revert?</li><li>If all of the above is OK, notify internal stakeholders of the successful release.</li></ul><p>It’s a cross your “T”s and dot your “I”s kind of exercise, but out of it we get a log we can reference later and the assurance that anything that could have been prevented has been prevented because, you know… Murphy’s Law. 😉</p><p>Having released, however, doesn’t mean we’re done. Far from it.
There’s cleanup, and it’s such an important part of what our team does that I felt it deserved its own section, so without further ado…</p><h4>Cleaning up</h4><p>I hate doing the dishes, so by week’s end there’s a literal pile of them waiting to be washed. Now, remember those 300 feature switches? That’s precisely the pile we desperately want to avoid. Because feature switches, as useful as they are, quickly pollute the code to the point it becomes unmaintainable, which would result in us losing more and more velocity over time. As a team, you can easily grind to a screeching halt if code is not maintained, and as an experiments team, we’re particularly prone to having this happen if we’re not vigilant.</p><p>One way we’re working on preventing such a situation is by automatically creating cleanup tickets for each experiment. Jira isn’t so bad after all, aye? 😄 You see, once an experiment goes live, it will stay live for at least a couple of weeks. Gathering useful enough data to make pragmatic product decisions doesn’t happen instantly, so usually a few weeks after the A/B test release a decision gets made. Either we stick to what we had before — aka we keep the control variants — or we keep one of the other variants. Often it’s just one, but there are times when an A/B test has a total of as many as four variants. Let me pseudocode an example:</p><pre>if(isActive(&#39;amazing-feature-variant-a&#39;)){<br>   &lt;ABTestComponentVariantA&gt;...&lt;/ABTestComponentVariantA&gt;<br>} else if(isActive(&#39;amazing-feature-variant-b&#39;)){<br>   &lt;ABTestComponentVariantB&gt;...&lt;/ABTestComponentVariantB&gt;<br>} else if(isActive(&#39;amazing-feature-variant-c&#39;)){<br>   &lt;ABTestComponentVariantC&gt;...&lt;/ABTestComponentVariantC&gt;<br>} else {<br>   &lt;ControlVariant&gt;...&lt;/ControlVariant&gt;<br>}</pre><p>Regardless of which one we keep, three of those have to go.
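</p><p>To make that concrete, here is a hedged, runnable JavaScript sketch of the same branching with a stubbed-out feature switch lookup, and what the code collapses to once a variant wins (the isActive stub and all names are illustrative, not Prezi’s actual API):</p>

```javascript
// Illustrative stub of a feature-switch lookup; in reality this would
// query a feature switch service. All names here are hypothetical.
const activeSwitches = new Set(['amazing-feature-variant-b']);
const isActive = (name) => activeSwitches.has(name);

// The branching from the pseudocode above, returning a component name
// instead of rendering one, so the sketch stays runnable.
function renderFeature() {
  if (isActive('amazing-feature-variant-a')) return 'ABTestComponentVariantA';
  if (isActive('amazing-feature-variant-b')) return 'ABTestComponentVariantB';
  if (isActive('amazing-feature-variant-c')) return 'ABTestComponentVariantC';
  return 'ControlVariant';
}

// After cleanup (say variant B won), the three losing branches and the
// switches are deleted, leaving only:
function renderFeatureAfterCleanup() {
  return 'ABTestComponentVariantB';
}

console.log(renderFeature()); // 'ABTestComponentVariantB'
```

<p>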
You can imagine, of course, that oftentimes an A/B test is far more involved than just showing a component or not, so cleanup can become quite an undertaking: you want to make sure you understand the variants that have been added, their relationship with the rest of the codebase, and the overall user flows, so that cleaning up doesn’t result in collateral damage. This typically means editing tests as well.</p><p>You might wonder: in the case of a lost A/B test, where we end up sticking to control — to what we had before — can’t we just revert the original PR? The answer is: maybe, perhaps partially, or not at all, for the following reasons:</p><ul><li>You might be able to revert if the initial change was very clean and no other changes have been made to those files since. In a high-traffic codebase, that’s quite unlikely, though.</li><li>You might only be able to do a partial revert if some of the changes happened in a low-traffic codebase, while others happened in higher-traffic codebases. The A/B test I am working on right now touches several repositories. I could imagine one or two of those repositories seeing light enough traffic that I could just revert, but the rest would require a more involved approach.</li><li>If you’re only dealing with high-traffic codebases, you simply don’t have this option. You will also find cases where you added code for one of the variants that’s actually going to be useful for future work. Maybe you wrote a nice utility function, or refactored some code as part of the A/B test to make your life easier. You surely don’t want to revert that.</li></ul><h4>When all is clean and done</h4><p>I won’t gaslight you into thinking we don’t deal with technical debt, awkward tech stacks, or breaking pipelines like every other team and engineering organisation out there. We do, and some of our challenges aren’t even new to many developers. 
It’s more like a unique flavour of what other teams deal with daily, and it’s unique enough that we found ourselves having to fine-tune how we do things, improve our processes, and continuously refine and shape ourselves as engineers into individuals reflecting the skills (and mindset) table illustrated earlier.</p><p>This is what has worked for us. This is what gets things done. For now. Just like we experiment with features, we experiment with ourselves as individuals and as a team. Sometimes that means we succeed, other times it means we learn and move on, or we learn <em>to</em> move on. It’s a journey, and it requires stamina, but ultimately, it’s well worth the effort. So, yes, A/B tests for the win! 🎉</p><p><em>Attila Vago — Software Engineer improving the world one line of code at a time. Cool nerd since forever, writer of codes, blogs and books. </em><a href="https://www.goodreads.com/book/show/205716390-it-s-cold-ma-it-s-really-cold"><strong><em>Author</em></strong></a><em>. Web accessibility advocate, LEGO fan, vinyl record collector. Loves craft beer! </em><a href="https://attilavago.medium.com/my-200th-article-hello-its-time-we-met-3f201ad1303"><strong><em>Read my Hello story here!</em></strong></a><strong><em> </em></strong><a href="https://attilavago.medium.com/subscribe"><strong><em>Subscribe</em></strong></a><strong><em> </em></strong><em>for more stories about </em><a href="https://medium.com/@attilavago/list/lego-all-the-things-083f80bd3c51"><strong><em>LEGO</em></strong></a><strong><em>, </em></strong><a href="https://medium.com/@attilavago/list/technology-tech-news-a2d2d509b856"><strong><em>tech</em></strong></a><strong><em>, </em></strong><a href="https://medium.com/@attilavago/list/coding-software-development-d123369e3636"><strong><em>coding</em></strong></a><strong><em> and </em></strong><a href="https://medium.com/@attilavago/list/accessibility-4b67c1d08ef3"><strong><em>accessibility</em></strong></a><em>! 
For my less regular readers, I also write about </em><a href="https://medium.com/@attilavago/list/the-random-stuff-96bfc5a222e5"><strong><em>random bits</em></strong></a><em> and </em><a href="https://medium.com/@attilavago/list/writing-writing-tips-f83ef5e79de5"><strong><em>writing</em></strong></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=349a94960b4f" width="1" height="1" alt=""><hr><p><a href="https://engineering.prezi.com/a-rare-insight-into-the-daily-challenges-of-an-experiments-team-349a94960b4f">A Rare Insight Into The Daily Challenges Of An Experiments Team</a> was originally published in <a href="https://engineering.prezi.com">Prezi Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How Prezi replaced a homegrown Log Management System with Grafana Loki]]></title>
            <link>https://engineering.prezi.com/how-prezi-replaced-a-homegrown-log-management-system-with-grafana-loki-15111174ff91?source=rss----911e72786e31---4</link>
            <guid isPermaLink="false">https://medium.com/p/15111174ff91</guid>
            <category><![CDATA[logging]]></category>
            <category><![CDATA[sre]]></category>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[prezi]]></category>
            <category><![CDATA[software-development]]></category>
            <dc:creator><![CDATA[Alex]]></dc:creator>
            <pubDate>Thu, 08 Feb 2024 15:26:37 GMT</pubDate>
            <atom:updated>2024-02-08T15:26:37.121Z</atom:updated>
<content:encoded><![CDATA[<p>Prezi has a sophisticated engineering culture in which solutions are built to do the job. Some of the solutions built in the past stood out and aged well. In other areas, solutions have lost traction compared to industry standards.</p><p>In the second half of 2023, we modernized one of those areas that was no longer market-standard: Prezi’s log management system. This is our testimonial.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*pmZClgRKPe7CZ0EI" /><figcaption>Photo by <a href="https://unsplash.com/@alvaroserrano?utm_source=medium&amp;utm_medium=referral">Álvaro Serrano</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><h3>How it was before</h3><p>We traced the beginning of the existing solution back to 2014, so it is safe to say that it was a stable solution.</p><p>The following diagram depicts the solution. Every workload Prezi ran was instrumented with a special sidecar that took care of handling all log messages. 
That sidecar was built on top of two open-source solutions. The first, scribe (<a href="https://github.com/facebookarchive/scribe">https://github.com/facebookarchive/scribe</a>), a tool built by Facebook and archived on GitHub in 2022, took care of receiving log events, aggregating them, and sending them downstream.</p><p>The second component, stunnel (<a href="https://www.stunnel.org/">https://www.stunnel.org/</a>), took care of encrypting communication from the workload systems to the central system.</p><p>Prezi collected log events from all environments in one central place and made them accessible to engineers.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*6embu5Y5a8PlAbE6" /><figcaption>legacy log management system</figcaption></figure><p>Yes, the picture is telling the truth: for a good part of 2023, the consumption of collected log events happened over SSH, not through any UI.</p><p>That alone was reason enough to reimplement the whole solution and rebuild it with current best practices in mind. Our goal was to make the user experience more accessible and the query results easier to share.</p><h4>Log shipping</h4><p>With that in mind, we started the project’s first iteration. Our first take was to provide a central system that could aggregate and display log events in a more user-friendly way. We also wanted to get rid of the sidecar to ease operational load: while a sidecar per se is not a bad thing, and a very battle-proven design pattern, it comes with certain costs when running thousands of pods.</p><p>The sidecar solution was born when Prezi’s workload ran on Elastic Beanstalk, where it meant just an additional container on a probably oversized EC2 instance.</p><p>With the shift to <a href="https://kubernetes.io/">Kubernetes</a> as the workload engine, the oversized EC2 instance vanished, but the sidecar remained. 
Kubernetes also offers a very standardized way to consume logs from containers: stdout and stderr of each container are written to files on the Kubernetes worker hosts by the container runtime, and files can be consumed easily.</p><p>We did exactly that and used one of the established tools in that domain — filebeat — which is capable of reading the mentioned files and enriching the resulting events with metadata from Kubernetes, e.g. pod name, container name, and namespace.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*kdRkH0zJRvSJmhWJ" /><figcaption>Details on the log shipping process</figcaption></figure><p>This was the first optimization. The second was how events would be sent to downstream systems.</p><p>Operating in cloud environments requires shipping events away from nodes quickly, as those nodes can vanish at any time.</p><p>A common design pattern for this is to use a message queue as the first persistence layer. This can protect downstream systems in case of event bursts. It also decouples the individual parts from each other, which can be helpful for maintenance or even the replacement of tools.</p><p>Most of the time, the message queue used for this is an <a href="https://kafka.apache.org/">Apache Kafka</a> installation, which is capable of storing events at scale. As we already used a Kafka setup to store business events from multiple sources, we went that route without digging further into alternative persistence layers.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*8ObkuZnvyiNT9IKs" /><figcaption>Sending events to a message queue</figcaption></figure><p>Once the events are in the queue, they can be parsed and ingested into a central system.</p><h3>Parsing and Storing</h3><p>In our first take on this, we planned to set up the central log management system inside our cloud environment. 
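</p><p>Conceptually, whatever ingests from the queue is just a Kafka consumer that parses each raw line and attaches the workload metadata. Here is a minimal Python sketch of that parsing step (illustrative only, not Prezi’s actual code; field names are made up):</p>

```python
import json

def parse_event(raw: bytes, metadata: dict) -> dict:
    """Parse one raw log line from the queue and attach workload metadata."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        # Not every application logs JSON; keep the raw line instead of dropping it.
        event = {"message": raw.decode("utf-8", errors="replace")}
    event.update(metadata)  # e.g. pod, container, namespace added by filebeat
    return event

evt = parse_event(b'{"level": "info", "message": "user logged in"}',
                  {"namespace": "auth", "pod": "auth-7d4f"})
print(evt["namespace"])  # -> auth
```

<p>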
For such a system, there are two major options: build something with <a href="https://www.elastic.co/elasticsearch">Elasticsearch</a>, or use <a href="https://grafana.com/oss/loki/">Grafana Loki</a> as the backend.</p><h3>The very first take</h3><p>We started with the <a href="https://aws.amazon.com/opensearch-service/">AWS OpenSearch</a> service as a backend and <a href="https://www.elastic.co/logstash">Logstash</a> to feed events from Kafka into our OpenSearch cluster.</p><p>As we run most of our software on Kubernetes, we also set up Logstash on Kubernetes and soon discovered all the joy of running a JVM inside containers. We suffered frequent out-of-memory kills of that component.</p><p>Storing and indexing a massive amount of data in OpenSearch led to massive indexes that soon were no longer manageable. This was caused by the vast number of non-standardized fields in the application logging. The heterogeneity of the fields and their contents led to a lot of parsing errors. The most prominent example is the time and date format: some applications logged Unix timestamps, whereas others used a string representation.</p><p>We discovered that if we don’t control the sources, a solution based on OpenSearch would not serve us well. Controlling the sources by evangelizing a common log schema across all applications would have been the only way to make this work.</p><h3>The overhaul</h3><p>We started to look for an alternative to Logstash to get rid of the memory issues and began replacing it with vector.dev, which has a smaller footprint, a more flexible configuration, and support for more backends. Logstash, without modification, is tied closely to the OpenSearch ecosystem. 
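</p><p>For illustration, a Vector pipeline for this path is declared as sources and sinks in TOML. A minimal sketch might look like the following (all names and endpoints are placeholders, not our configuration, and option names can vary between Vector versions):</p>

```toml
# Sketch of a Vector pipeline: a Kafka source feeding the search backend.
# Placeholder values throughout; consult the Vector docs for your release.
[sources.app_logs]
type = "kafka"
bootstrap_servers = "kafka:9092"
group_id = "vector-log-parser"
topics = ["app-logs"]

[sinks.search]
type = "elasticsearch"
inputs = ["app_logs"]
endpoints = ["https://opensearch.example.com:9200"]
```

<p>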
But as hinted above, there is another major option for storing log events: Grafana Loki.</p><p>With the replacement of Logstash, we got rid of the constant restarts, but not of the constant indexing errors.</p><p>Soon we started to look into Loki as an alternative. We also considered a hosted option, as running and maintaining a log management system is not one of our core tasks. Running that system is more or less a commodity and takes away precious time that could be spent otherwise.</p><p>Focusing on our core tasks as the SRE team is also beneficial for Prezi’s customers.</p><h3>Optimizing the central systems</h3><p>Looking at log management systems is in most cases also a make-or-buy (host) decision: does one want to self-host the whole aggregation system, or can it be offloaded to some 3rd-party vendor?</p><p>Security and compliance concerns aside, this mostly boils down to the question of “How much can we spend?”.</p><p>With the security clearance to send logs to a 3rd-party vendor and the budget to do so, we started to look at the hosted version of Loki. It turned out to be within our cost range and to serve us well: they had no issues with our ingestion rate. The way Loki stores log events as streams was perfect, as it moved the problem OpenSearch had with the variety of field contents from indexing time to query time. With Loki, those differences surface at query time and can be tackled by predefined dashboards. This way, we don’t lose any events to parsing errors.</p><p>Events stored in Loki are consumed through a very common user interface: Grafana, a well-known dashboarding solution that was already in use. With that, engineers can rely on existing tool knowledge.</p><p>Offloading logs to an external vendor also removes them from your direct control. To avoid any issues with retention periods, we also started writing logs to S3 as an archive. 
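</p><p>To illustrate that query-time handling: a LogQL query in Grafana that filters a stream and parses fields on the fly might look like this (label and field names are made up, not our actual labels):</p>

```logql
{namespace="auth", container="api"} |= "error" | json | line_format "{{.message}}"
```

<p>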
With that S3 copy, we retain control over the logs and can use them in case we need them later.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*PbXorBPK4zfXl1l1" /><figcaption>Details on parsing and storing</figcaption></figure><p>With that last piece in place, we were able to shut down the above-mentioned original log management system at the end of 2023.</p><h3>The result</h3><p>Looking at the completely new log management system, we went from a very homegrown solution to a modern stack:</p><ul><li>We consume logs via a standard API of the container runtime.</li><li>Sending the events to Kafka enables us to consume them decoupled from their creation time. Kafka also stores events for a certain period, so any downtime of downstream systems does not cause data loss.</li><li>Vector enables us to feed events into multiple sinks. Even though not outlined above, it also lets us make certain parsing and routing decisions on events. But that is part of another story.</li><li>Loki enables us to consume event streams via the well-known Grafana UI and query a vast amount of data in real time.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*nioxAbvHaC3AJLta" /><figcaption>The whole process</figcaption></figure><h3>Lessons learned</h3><p>The whole project took us the better part of a year until we shut down the old solution. We took this amount of time to verify that everything was set up well, that all engineers were onboarded and familiar with the solution, and that it could handle all the different peak situations.</p><ul><li>Keeping the old system running was a good decision. It allowed us to optimize the new system until it could handle the load and satisfy our needs.</li><li>Advertising a common logging schema throughout a company is beneficial. That schema makes collecting and analyzing events simpler. 
It gives a better user experience, too, because, for example, a timestamp is always in the same format.</li><li>Controlling log levels and a shared understanding of the various levels are also crucial: what one engineer logs as debug, another emits as info. Creating a common understanding is helpful.</li><li>Decoupling the different components from one another enables us to change them if we have other requirements or find better solutions. E.g. if we become unhappy with Vector, we can replace it without any hassle, as the interface between the log source and Vector is Kafka.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/894/1*FBFUJS8_X6pBw2BPL60RkA.png" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=15111174ff91" width="1" height="1" alt=""><hr><p><a href="https://engineering.prezi.com/how-prezi-replaced-a-homegrown-log-management-system-with-grafana-loki-15111174ff91">How Prezi replaced a homegrown Log Management System with Grafana Loki</a> was originally published in <a href="https://engineering.prezi.com">Prezi Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How Prezi Serves Customer Traffic]]></title>
            <link>https://engineering.prezi.com/how-prezi-serves-customer-traffic-60fc9711702b?source=rss----911e72786e31---4</link>
            <guid isPermaLink="false">https://medium.com/p/60fc9711702b</guid>
            <dc:creator><![CDATA[Alex]]></dc:creator>
            <pubDate>Tue, 09 Jan 2024 09:56:32 GMT</pubDate>
            <atom:updated>2024-01-09T09:56:32.688Z</atom:updated>
<content:encoded><![CDATA[<p>Prezi has a global audience that depends on the fast and reliable accessibility of its content. In this article, we look into the way Prezi serves content from a network perspective.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*wAbH9ve3pibIrm0F" /><figcaption>Photo by <a href="https://unsplash.com/@dead____artist?utm_source=medium&amp;utm_medium=referral">Z</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>See this article as a general overview of how content can be served on a global scale. It is not the only solution, and probably not the ultimate one, but it is one way to do it.</p><p>The overall flow is depicted in the following image. Prezi runs on AWS and uses AWS services to offer customer-facing internet endpoints.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/778/1*eddbE2b-YnTWyh9oNkoIeg.png" /><figcaption>Network diagram</figcaption></figure><p>The DNS zones and records are managed in Route53. Customer traffic goes through AWS Global Accelerator to decrease latency before it is filtered by AWS Web Application Firewall (WAF). The traffic is then terminated at the Application Load Balancer (ALB) and forwarded into the cloud environment, where most of the workload runs inside Elastic Kubernetes Service (EKS).</p><p>Some of the customer traffic goes to AWS CloudFront, which is used to deliver media assets that benefit from being cached closer to the customer.</p><p>The rest of this article goes over these components, what they do, and the benefits they offer.</p><h3>Find the best path (AWS Route53 and Global Accelerator)</h3><p>Having customers worldwide and offering services over the Internet poses multiple challenges. One of them is to reduce latency. Cloudflare defines latency as the “amount of time it takes for a data packet to go from one place to another. 
Lowering latency is an important part of building a good user experience.” (source <a href="https://www.cloudflare.com/en-gb/learning/performance/glossary/what-is-latency/">https://www.cloudflare.com/en-gb/learning/performance/glossary/what-is-latency/</a>)</p><p>That said, the challenge in having customers worldwide is the heterogeneity of the network most people simply call “the internet”.</p><p>When we look at the lower network layers of the internet topology, we can see many different networks peered together.</p><p>The following image shows parts of the peering connections in Latin America that form the internet’s backbone. For a data packet, going from South America to Miami means traversing multiple networks, and every network adds a little to the total travel time.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*5spQvhNu27Ddsual" /><figcaption>taken from <a href="https://global-internet-map-2022.telegeography.com/">https://global-internet-map-2022.telegeography.com/</a></figcaption></figure><p>Going back to the challenge of controlling latency for customers, there are, generally speaking, two options:</p><ul><li>Offering services close to the customer to avoid long network journeys</li><li>Offering a fast path from the customer to the place where services are offered.</li></ul><h4>The best path for most of the world</h4><p>Prezi uses the second option by offering a fast path to services via AWS Global Accelerator. This service enables customer traffic to be routed most of the time via the global AWS network instead of the public internet.</p><p>This routing reduces latency. In experiments from my local machine, optimized requests traveled 200ms faster than the non-optimized ones. The total time until I got an answer went down from 800ms to 600ms. <br>Loading the Prezi dashboard when logged in currently requires roughly 150 individual requests, all of which benefit from the 25% decrease in latency. 
<br>Please keep in mind that the real percentage of acceleration depends on multiple factors, like location and the current routing situation.</p><p>Whenever a customer sends requests to prezi.com, those requests are routed to the closest AWS network endpoint and then transferred inside this global network.</p><h4>And the best path for inhabitants of Virginia</h4><p>As stated in the headline of the previous chapter, most Prezi customers go through Global Accelerator, except those who reside in Virginia. Those customers are already close enough to the service endpoint and are routed directly to the following components.</p><p><em>Note</em>: the network diagram above does not show this route to avoid being too complex.</p><h3>Implementation</h3><p>To achieve this, Prezi leverages geo-balanced DNS queries in Route53 so that different IP addresses are returned depending on the location.</p><p>The following screenshot shows a practical example. The first lookup is executed from a local machine in Europe, and the second with a VPN endpoint in Virginia.</p><p>The first DNS query returns the endpoints for the Global Accelerator, and the second query from Virginia returns the endpoints of an AWS load balancer (see the following chapter).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*er-snPt_ZVhdB3Xf" /><figcaption>Terminal showing different DNS lookup results depending on location.</figcaption></figure><h4>Alternatives</h4><p>The alternative to this network-based approach is to move offered services closer to the customer. This can be achieved, for example, by deploying instances into selected cloud regions. 
To achieve this, the whole application stack would need to be deployed in each region, and some backend synchronization would be needed, as the Prezi suite enables collaboration between multiple users.</p><p>Serving from a single region reduces the complexity and streamlines deployment.</p><h3>Protection (AWS WAF and Shield)</h3><p>While the internet is a wonderful place to connect, collaborate, be creative, and a lot more, it is at the same time also a place that attracts bad actors. Public and well-known endpoints are a widespread target of distributed denial-of-service (DDoS) attacks. Prezi leverages the combination of AWS Web Application Firewall (WAF) and Shield to protect the downstream infrastructure from these threat vectors.</p><p>Every request that needs to reach Prezi infrastructure is evaluated through these components. Certain endpoints are protected via a specific rate limit to make sure they are not hammered.</p><p>For example, it does not make sense to send multiple requests to the login endpoint within a small amount of time. To protect sensitive endpoints, the AWS WAF can respond with HTTP/429 (<a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429">https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429</a>). <br>See the following screenshot of how a triggered rate limit looks in the browser console:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*wRfL6hOiUq5NUAKw" /><figcaption>Chrome Developer tools show one HTTP response</figcaption></figure><p>On a bigger scale, the traffic flow is monitored by AWS Shield; when Shield detects a DDoS attack from multiple traffic sources, those sources get blocked.</p><h4>Alternatives</h4><p>Offering services over the Internet without any protection is a bad idea. Any public-facing IP attracts traffic, and once a company reaches a certain scale, it attracts bad actors. 
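</p><p>As an aside, the rate-limit behaviour described above can be sketched as a toy fixed-window limiter answering with HTTP 429 (illustrative only; AWS WAF uses its own rate-based rules, and none of these numbers are Prezi’s):</p>

```python
class FixedWindowRateLimiter:
    """Toy fixed-window rate limiter; NOT how AWS WAF is implemented."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # (client, window index) -> requests seen

    def check(self, client: str, now: float) -> int:
        """Return the HTTP status a gateway could answer with: 200 or 429."""
        key = (client, int(now // self.window))
        self.counts[key] = self.counts.get(key, 0) + 1
        return 200 if self.counts[key] <= self.limit else 429

limiter = FixedWindowRateLimiter(limit=3, window_seconds=60.0)
print([limiter.check("10.0.0.1", now=1.0) for _ in range(4)])  # -> [200, 200, 200, 429]
print(limiter.check("10.0.0.1", now=61.0))  # new window -> 200
```

<p>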
There are alternative solutions and vendors, like Cloudflare or Akamai, that can offer the same protection service. As we run our workload on AWS, the natural choice is AWS WAF, as the integration is easy.</p><h3>Access (ALB, EKS, API Gateway)</h3><p>Requests allowed to enter reach the AWS-managed load balancer fleet that routes those requests into the VPC environment hosting the actual workload. The load balancer uses our public TLS certificate to offload HTTPS connections from customers.</p><p>The application load balancer (ALB) is used for routing based on the HTTP Host header. This means that, based on the domain used, ALB can forward traffic into the isolated workload environment.</p><p>Running inside the Kubernetes fleet is a self-written API gateway. The purpose of this component is to build more detailed routes based on request paths or other identifiers. Most of the backends are based on Python and Scala. Those pods run inside the Kubernetes offering of AWS: Elastic Kubernetes Service. <br>Traffic is routed into these pods either by a WSGI-conformant application server in Python land or directly by the JVM for Scala services.</p><p>As the mentioned API gateway also runs inside Kubernetes, it can forward traffic to the target backend services based on different routing guidelines within the cluster network. The API gateway offers the flexibility to do advanced routing to the microservices based on configuration by the developers.</p><p>Thinking back to the scope of our AWS WAF usage, there was no check for malicious content in requests. We use a different web application firewall to check for bad requests and to protect against cross-site scripting, injections, and other things that might harm Prezi — or our customers.</p><h3>Content delivery (CloudFront)</h3><p>Prezi’s main purpose is to deliver amazing presentations that most of the time contain visuals like images and GIFs. 
They can be served via a content delivery network (CDN) that caches content closer to the customer.</p><p>Loading resources from a CDN decreases the time the user waits for resources to appear.</p><p>On the cost side, it is also cheaper to serve content from CloudFront than to serve it from the backend every time. This applies especially to assets like images that don’t change often.</p><p>Due to the deep integration into the ecosystem, in our setup there is no other choice than CloudFront. Technically, it should also be doable with Cloudflare or any other CDN vendor.</p><h3>Wrap up</h3><p>The article above describes the architecture Prezi uses to serve content to a global audience.</p><p>There are multiple different ways to serve traffic — even if running on AWS.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=60fc9711702b" width="1" height="1" alt=""><hr><p><a href="https://engineering.prezi.com/how-prezi-serves-customer-traffic-60fc9711702b">How Prezi Serves Customer Traffic</a> was originally published in <a href="https://engineering.prezi.com">Prezi Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>