Welcome to the first episode of the Mammoth Growth podcast! In this episode, our CTO Drew Beaupre chats with some of our analysts about the challenges and opportunities companies encounter when they migrate from a legacy data warehouse platform to more modern solutions. Drew describes three of the most common reasons why companies decide to change their data platform:
- They might have a platform that was cutting edge 5 or 10 years ago, but now it’s starting to show its age in terms of higher upkeep
- Originally, they might have gone with a “quick and dirty” solution just to get some early answers, but now they’re ready for a more mature tech stack
- It’s also possible that they thought it was a good solution at first, but they were misguided and they quickly hit a wall in terms of what they were able to accomplish
Ben Wilson, one of our Analytics Consultants, mentions that many companies often run into cost and/or performance barriers with Amazon Redshift, because this data platform doesn’t separate storage from compute functions. Dylan Cruise, another one of our Analytics Consultants, points out that Redshift is a great example of how quickly cutting edge technology can slide into legacy status. As recently as 2020, Redshift was the go-to data platform, but it’s now being overtaken by suites of tools that might include an ingestion solution like FiveTran or Stitch, a cloud data warehouse like Google Cloud Platform's BigQuery or Snowflake, dbt for data transformation, and any one of several BI tools.
John Morrison, another one of our Analytics Consultants, gives more context to the Redshift question within the larger context of AWS and data platforms in general, before pivoting to share some of the key questions Mammoth Growth asks when we discuss platform migration with a new client. Ben emphasizes the importance of planning a “reset” before any migration project, a point which is echoed by Drew, because if you don’t take a step back and look at your entire tech stack before a data platform migration, you might wind up carrying the same baggage over into your new system.
The group then shares a few anecdotes about particularly thorny migration projects and how they addressed clients’ challenges. Finally, they talk about how to approach a data platform migration in general, and some red flags that indicate you might want to consider a migration.
Transcript
Drew Beaupre (00:05):
Welcome everyone. My name is Drew Beaupre. I am CTO of Mammoth Growth. Today we're going to be talking about changing platforms, specifically data platforms and all the challenges that go along with that. With me today, sort of round table format we've got for all from Mammoth Growth as well, Ben Wilson, John Morrison, and Dylan Cruz. You guys want to introduce Ben? You want to go ahead? Hey
Ben Wilson (00:28):
Drew. Yeah, thanks for having me on. I'm Ben Wilson and I've been wearing the hat of a tech lead and senior analytics engineer for the last two years with Mammoth Growth.
John Morrison (00:38):
Hey, I'm John Morrison. I've been at Mammoth for a few years now doing a mix of consulting work on MarTech stuff as well as getting more into the data engineering and consulting on that side. I'm currently doing a mix of tech leader tech architecture work depending on what the client needs.
Dylan Cruise (00:52):
Hey, I'm Dylan. I just joined Mammoth about six months ago, but I am an architect here. I have about a decade of consulting experience in the data engineering and data architecture space and looking forward to chatting today.
DB (01:04):
So ultimately we're all in the consulting space, but for the most part it ends up being more like in the problem solving space and in today's problem that we're going to chat about is when we have to change some kind of fundamental building block in our data stack. There's all sorts of reasons why someone might want to change their platform. Sometimes it's just they're on something that was the right platform to be on maybe 5, 10 years ago, but really now it's really starting to show its age, or sometimes it was just kind of a quick and dirty sort of thing in order to do really early answers. But now they're ready to mature a little bit in terms of their tech stack, and sometimes it was just misguided. They thought it would be a good solution and it hit a wall pretty quickly. And ultimately what are the sort of things we're talking about. Could be your data warehouse that's actually storing and running your queries.
(02:21):
Sometimes it's how you're transforming the data. Sometimes it's even the BI tool itself, any of those things and all of those things can be swapped out at any time and there's different challenges and sometimes the challenges are way bigger than anyone ever thought when we're placing those things. First is why don't we talk about some of the legacy platforms that we have come across in our years when we sort of get the request from a client when they want to make a change. One of the platforms I find really challenging is when someone has a pure data lake and they want to move away from something like Presto or Athena. And a lot of the times it's really because they didn't realize how expensive it was really going to be and all the problems that go along with data lake. But Dylan, what are some of the legacy tools and platforms that you've come across?
DC (03:28):
Hey, absolutely think I don't see it as much anymore, but definitely just scattered on-prem SQL servers. They have 50 different SQL servers throughout their organization and they each house a different data set. They each had their specific use case at one point and now there's just too much to manage. And so a lot of that we start to think about how do we consolidate all these 50 on-prem or even if they are in a cloud provider, how do we consolidate these multitude of different servers into one central platform? And then I think as you mentioned, the data lake came about and we started seeing a lot of that leveraging Athena and as you mentioned, that gets very expensive very quick when you have a bunch of people writing not cost optimized queries against that you run up a bill pretty quick.
DB (04:17):
Yeah, we got to forget not long ago people were running on-prem databases
DC (04:24):
Not that long ago at all,
DB (04:26):
Not that long, but it just seems so far away.
DC (04:30):
It really does.
BW (04:31):
Yeah, Redshift gets a lot of credit for kind of spearheading this massive parallel processing databases so they separate but it begins to break down because there's no separation between the storage and the compute, so it's one cluster. You keep having to upsize it, upsize it, and again, you start hitting that cost barrier or performance barrier, so eventually your pipelines break down and you need something more powerful.
DB (04:58):
Yeah, that's another area where we've really got a little bit spoiled in the last few years is being able to think about scaling data storage and data compute independently from each other.
DC (05:14):
I mean it's fascinating because it wasn't long ago at all where Redshift was the go-to platform. I mean this was like three, four years ago where everybody was moving to Redshift and then very quickly became legacy technology.
JM (05:29):
Yeah, they're having attempts to update it I think, and there are things like serverless ships and things like that now, but it's a bit behind and I think one of the main problems you've run into is just the difficulty of configuration. You really need to be an expert in Redshift to take advantage of these features that are put more out of the box with all solutions.
DB (05:46):
John, in your experience, why is that? Why is there more attention to detail than necessary on Redshift?
JM (05:56):
I think it's really just a design choice that was made early on or maybe a choice that wasn't so much explicitly made as the fact that register is extension of AWS is a wider environment. It was a tool that you could build as part of your AWS three and everything else. The people setting it up were more likely to be hardcore software engineers rather than the main stakeholders for other solutions, which are probably business and analyst users that aren't as technical. And so what seems like an okay trade off or a software engineer may be different than what's a good trade off for an analyst or someone that's trying to handle these things their own way, which is in something like BigQuery that's very much abstracted away and you don't have to worry about scaling up compute at all and something that's Snowflake, it's very easy to configure compute separate blocks and it helps you make the decisions and presents a friendly UI on top of it to make that very straightforward.
DB (06:45):
One thing about Redshift that's always confounded me is the settings for setting your ability to run concurrent queries and how buried and difficult it is for you to change that.
JM (07:01):
It results in very strange behaviors as well, building up a concurrency budget. So getting the ability to run things more in parallel over time means that there's almost a race to use that concurrency if you're coming in first thing in the morning depending on how it's built up, whereas can run faster or slower depending on when you're approaching it. And so there's also some very strange behavior, especially if you're trying to figure out how to optimize.
DB (07:24):
Thankfully I haven't come across Postgres analytics databases lately, but that used to be pretty common a couple years back where a company would, they would just sort of do all of their analysis in a Postgres database and it got to a point where some clients, their ETL jobs were pushing 24 hours and so they just literally couldn't keep up at all. And yeah, thankfully we haven't seen very many of those these days.
DC (08:06):
I've seen, not recently, but they use it as a transactional database at first and then they're like, oh, we have it stood up, so let's build some analytics on the side as well. And then all of a sudden, like you said, their warehouse has taken 24 hours to try to handle both use cases and yeah, it gets ugly pretty quick.
DB (08:25):
Yeah, we won't get into the details of OLTP versus all of that, but it's definitely a reason to choose the right tool for the right job. What about on the transform layer? So in the past early on when I was trying new things early on I was trying using a AWS Glue and under the hood Spark SQL. What are some of the legacy transform layers that we're seeing that clients are trying to get off of?
BW (09:03):
Dylan and brought up some bad memories talking about SQL server, we just had loads of stored procedures that were stuck into a shared drive folder and no version control, no scheduling of it. It was just really hard to maintain and keep track of what all the dependencies were. Some of the more modern data tooling really helps build up those dependencies so that you can get a complete picture and know what's referencing each other.
DB (09:33):
SSIS is still actively used.
DC (09:35):
That's what I was just going to say. Had a project a couple years back of just massive SSIS packages that had to migrate to Azure and it's luckily Azure has grown in some instances where you can kind leverage the same packages in the cloud, but it still is not a seamless transition by any means.
DB (09:55):
What about Airflow have you guys come across? Any Airflow migrations?
JM (09:59):
I used extensively but not directly migrating it. I think it's a useful tool. Again, it sort of requires specific expertise, but yeah, the downside is that it does make the transformations with scripts. You're running more opaque and again, it's sort of harder for non-engineers to access that layer.
DB (10:19):
Okay. Why do we find clients are wanting to move from one platform to the next? What is your experience in terms of trying to understand who's the primary stakeholder of wanting to make this change and what are their motivations?
JM (10:41):
Yeah, I think it's difficult. It is one of my favorite questions to ask when working with a new client is why are we here? What was the incident or the thing that couldn't happen, a result in us coming in to try and make a change. It's often not that there is any one thing, I think it's more that there's been that noticeable decline in the ability to keep generating either new features or new understanding to implement some kind of new way of targeting for marketing or implementing some kind of tool. And the reality of that is that there's no silver bullet solution to a lot of these things. It involves going back and looking at how have you set things up? Why has it become this way over time? And having you changed all of it to make it fit your needs. One of the big issues I think with platform change is that things may be just sort of generally bad. There may be a lot of different kind of problems that exist, but without a clear focus on what you're actually trying to achieve with the change, you may end up in kind of the same place. It's a good chance to reset a bit and you can take advantage of that. You have to plan effectively ahead.
DB (11:44):
Yeah, I like the idea of we have to understand the original context, otherwise why would we think that we're not going to fall into the same patterns?
DC (11:53):
I think that's why our approach of tackling an MVP first works very well because you're taking one thread all the way through. You're proving out the tech stack, you're using best practices, you're laying that foundation and you really prove that value of that new platform and then you can go on to layer on those other legacy pieces or the other portions and grow that out. But starting, I don't want to say small, but starting very refined in how you're building out the platform really helps that success.
DB (12:26):
And it also helps the client themselves to really start to learn the platform itself. We want them to understand why you don't usually a new platform has a new way of thinking about the problem and solving the problem and by starting off slowly, it gives the client time to really adjust to kind of the new paradigm because a lot of what makes these new platforms powerful is that they do bring a new paradigm.
DC (12:55):
Well, the new tech skillset too, right, is their engineers have been building out SSIS packages for the last two years and now they have to learn dbt or whatever that new layer is. And so yeah, it's not easy. It's not a shift that happens overnight.
BW (13:09):
John hit on a key phrase for me, it's a good chance to reset and it's not just the tech that changes over time, but the business changes over time and I went through with a client a merger, two businesses were coming together and it was a good chance to reset what tech they used to bring the understanding of the two businesses as one complete entity.
DB (13:34):
Yeah, that's right. If we just take what they have and migrated over, you're also taking all that historical baggage along with you that might really be acting almost like as an anchor. It's preventing you. It was potentially one of the root causes of the platform problem to begin with and now you're bringing it over into the new one.
DB (14:02):
So we've talked about where they're coming from talking about why do they want to change, so where do we go? What are some common, the data loading, there's an opportunity there. I've had clients who were really struggling with Amazon database migration service and the EC2 instance that it was running on running out of memory every few days and then having to do a full re-catch up, which caused a PagerDuty call at like three in the morning because reports were late. So there was a really strong motivation to find a more robust system. So there's lots of different solutions for moving data from A to B, but there's a lot of solutions but maybe not so many tested and trusted solutions. The data warehouse space, again, there's a lot of different solutions out there and on the transform side, less options, but where are we moving to these days when clients are giving us a call?
BW (15:27):
I think we have a pretty standard modern data stack. You use an ingestion service, a managed ingestion service like FiveTran or Stitch, put it into a new cloud data warehouse, Google Cloud platform, BigQuery or Snowflake, and then we set dbt and all those developer best practices that come along with Git and then we have a buffet of BI tools to park the data in.
DB (15:57):
Almost like, yeah, there's not that many. When you put 'em all together you've got maybe three, two or three options at each level and you can mix and match any of those to bring it all together, but ultimately it's a similar pattern loading the data in, having a robust scalable warehouse to execute on new queries, having this well-managed source-driven analytics engineering platform, and then ultimately a high quality BI tool on top of that.
DC (16:34):
Interesting today that there's a thousand different tools out there. I mean they come out of the woodwork every day now, but we have continuously seen, through hours and hours of engineering and thousands of projects. That tool set continuously works over and over for every client that we come across. So it's pretty cool.
DB (16:55):
And I see data warehouses come out that are really focusing on the benchmarks, trying to be faster than all the rest, but at the end of the day it's usually plenty fast the tools that we have and it's more about how easy is it to use, how many hours do you need in order to manage this warehouse? Is it mostly self-optimizing? When you do have a problem, how quickly can you find out what the problem is? And I think that really tells the maturity of some of these tools. Even trying to go back to Redshift and trying to debug why it may not be working. It's challenging. You got to troll through XML outputs of the explain plan, and it's not the easiest thing.
DC (17:53):
And just initial configuration as well. We don't have a lot of time with our clients to start showing value and so the fact that we can get a full environment set up within a day or two is priceless really.
DB (18:08):
So I'd love to hear some anecdotes on some of the platform changes that we've run into. John, you've got one of my favorites. Walk us through, to sort tee it up. Often when we're going into a new platform migration project, we just really have limited information and we're coming in with as much information as we possibly can, but there's always something that gets us every single time and it's always, no matter how many different scenarios we see, there's always something that surprises us.
JM (18:52):
Yeah, so I worked at a business once, they were attempting to migrate. Really the main thing they were hoping to change was their logic, their transformations. They're operating in a venture environment and they had some generic scripting tools that were combining SQL based scripts with sort-of procedural scripting to create a combined version of they hack together a version of dbt that also did logic as well. The thing that made things much more complicated was that they were operating at the absolute limit of their Redshift cluster, so they could not afford to scale that any further. It was running into queuing issues a lot of the time and the size of the data they were operating with was very large.
(19:45):
Now when you're working with that kind of a problem, the first thing we're try to do is sort understand, okay, well what is the purpose of all these transformations that are happening in the first place? What's the kind of output that we're aiming for and how might we do that in a tool like dbt and the way that's clear where you can invite more people to make changes and get buy-in from the stakeholders. The issue that we ran into pretty early on was that there really was nobody there who knew the whole of the thing. Many people have left and so you were left trying to reverse engineer from these scripts how the transformations works in the first place and then at each stage there even interrogating that was difficult because of the constraints of being able to even query data of that size without creating disruption to the users of the Redshift cluster.
(20:32):
And aside, this is maybe one of the most basic configuration errors that you see with something like Redshift where if you don't create good queuing to separate out the BI tool that's trying to hit it from the data loading from the ad hoc queries and transformations that are happening, that kind of a clash can become a big problem for productive use of the warehouse. Ultimately, the problem boiled down to the methods that they were using to load the data and to try and transform it involved doing a lot of edits to this data at scale. So even whilst loading it, they were having to make changes, which meant that they couldn't take advantage of anything like incremental models without losing track of state. And so it ended up being an awful lot of complexity to manage each of these complicating factors without a good clear end goal in mind.
(21:23):
And so trying to diagnose the next steps, our recommendation in this space would've been something along the lines of fix the way that you're collecting data in the first place. Be very thoughtful and considerate how that is happening and make sure it's focused on your end goals and then probably at some point try and move to a new data warehouse environment that will enable you to take advantage of that ease of configuration, ease of splitting up the warehouse load so you can start to get a handle on these complex transformations that largely exist only because you have so many issues with the data loading and reuse in the first place. So it was a case for a lot of different issues at once. I think we're able to make a good dent in starting to understand the core of what the business was aiming towards.
(22:15):
In particular, trying to find areas where work was repeated. So one thing that is another kind of common problem, not necessarily with configuration, but as business grows, they may not run into a case where every part of the business is running their own transformations separately on top of the broad data and when for any business, most parts of the business care about the same kinds of things they care about. Maybe it's orders, maybe it's customer session type information, but if everyone is trying to operate on that at once and you never get to a stage where you've pre-computed or you've agreed on how you should handle these things, you're ending up with 10 different versions all tackling these very large data sets and it's just fundamentally inhibition. So we tried to move them towards a state where they could all agree on the core transformations so that work was only done once and then everyone can operate on top of that kind of well understood core to get their specific needs out at the end. That's really a great use case for something like dbt where you can encode that kind of structure and also make those changes visible to everybody so you can adopt and software various practices including floor requests, including documentation, you can make those changes all explicit and get buy-in at every stage and here everybody's needs for that common core assets whilst tackling something like your out-of-control compute costs because you are avoiding repeating a lot of that over and over.
DB (23:46):
Ben, you've had a situation where there's a choice that the client had to make, stay on Redshift, but do the work in Redshift or migrate to Snowflake and almost start anew. What was your experience in the pros and cons of those two options?
BW (24:09):
Yeah, we talk about what are these migration strategies and so if we had to trace back the existing queries, decouple what logic exists and then try to lift and shift that over to the new platform, but there's different query syntaxes between Redshift and Snowflake and then the data that comes over, you end up comparing apples and oranges even if you're trying to stay as close to their original code as possible. It never quite turns out the same.
DB (24:45):
That's always the biggest challenge is you are really almost never going to achieve a perfect migration because usually the source is flawed to begin with. So how can you ever truly demonstrate that what you've actually done is more correct unless you A, have access to secondary sources of truth and potentially B, come to the agreement that it's time to maybe rethink the whole thing and actually start with a small MVP and rebuild it from the ground up. Maybe not necessarily complete scorched Earth and start completely from scratch, but at least reimagine what those core components are so that you're reimagining what those common objects and metrics are that you want to model and then start building out the new platform from that. I had one client years ago where they knew that this executive report, no one trusted it and we made the attempt to actually redo it and fix it in place at the same time, but then the executives were complaining that it didn't match the old report, but the old report was wrong. And they're like, well, how do we know that this one's not wrong as well? And so it's always a big challenge when you're dealing with that, setting expectations and then ultimately proving you have to prove that the data you're producing is trustworthy even in the face of everyone knowing that the old was not.
JM (26:42):
I think one thing to think about as well is that it is sort of scary for businesses to think about replacing something as fundamental as like a data warehouse or some kind of large scale, all of the reporting, right? But it's interesting when you start digging to actually figure out what is being used, you may find that when you start to look at the core reports, things that have the most eyeballs on them, that is a lot less in the surface area of what is actually produced in the warehouse. And that can give people, I think confidence sometimes that okay, if we try and work on tackling these main core parts of something, you can get a lot of the way quite quickly in a new solution and the stuff that doesn't get migrated over may actually not really be necessary in the first place. It's not about having a one-to-one shift of everything over. It's about taking the most valuable assets that are being referred to a lot and making sure that those things are moved over in a really clean way where they have a lot of trust associated with them by how they've been very carefully rebuilt in a new environment.
(27:40):
I think there are ways of doing this. One of the main ones is to go and interview people and actually get their first person perspective on what they find the most valuable, especially in an environment where there may be the lack of trust in the data, you can figure out what people do find valuable and what they could have maybe considered to be a bit suspect. But also just objectively looking at if you have something like a BI tool where you can see stats for reports, you'll very often find that it's sort of parallel distributed where there's a small number of dashboards, reports that have a lot of eyeballs on them and a long tail where maybe not so much and it is something that you can go and refer to the individuals that maybe have a better view in that.
DC (28:23):
I think that's interesting that you mentioned in a BI tool. I think that's where I've historically seen a lot of redundancy as well, as everybody kind of builds their own flavor of the same rapport over and over and over, and now there's 30 different dashboards that could have just been solved with a single filter or something. So like you said, interviewing those stakeholders, understanding what's being used goes a long way.
DB (28:46):
Dylan, you want to talk about, so when should people know when is time to be thinking of a platform migration and what do they do?
DC (28:58):
Sure. I think there's going to be some red flags that start surfacing. I think how long does it take to maybe add an enhancement or add a new data set or make any sort of change to this data warehouse is it's starting to get longer and longer and longer in my time to value because of the complexity that's already in my current existing platform. I think you'll start to see that time to value just evaporate. And so that may be a time where starting to think about a new platform, I think costs because of the complexities and the inherent technical debt that just seems to snowball in these legacy platforms, your costs start spiking. You mentioned that Postgres instance running 24 hours, you're just starting to push the limits of the current platform. So I think there's a number of cases there. Like I said, cost, performance and time to change. Another one, I think John you called is a new business case. The business has changed, the business has grown. Ben, you mentioned that you had to consolidate, they had a merger. There's a couple of different businesses coming together. So I think something pivotal like that is also a great time to kind of reset your best practices, rethink your tool stack, and probably migrate to a net new platform.
DB (30:20):
Great, well thank you everyone for your time and that was a great chat. That
DC (30:23):
Was fun. Thank you.
DB (30:25):
Thanks everyone.