The Perfect Data Pull Request

Finding the sweet spot between agility and quality.

Over the past five years there has been a growing conversation around “analysts as engineers”.  That conversation comes packaged with software engineering concepts like DevOps, pull requests, and source control.  If you are an engineer moving into the analytics space, these ideas will sound familiar.  Perhaps you are an analyst looking to make your data more robust and you believe applying these practices will help you do that.  Regardless of your background, these ideas - and the reasons they matter - are becoming the cornerstone of any modern analytics workflow.

In this post, I am going to provide a bit of background on these concepts, focusing mostly on the humble pull request.  In my opinion, the pull request is what ties the whole thing together.  The pull request (PR) brings all the complexity of analytical work to a head - business requirements, regression testing, code quality - and is the last stop before your code reaches production.

We will cover what goes into a great pull request, and strategies for automating the process.

What is DevOps?

Looking at the big picture, DevOps is the whole process of how your code makes it into production.  Without a DevOps process, you will be making live changes directly in your data warehouse - for example, altering a view directly, or updating a hard-coded SQL cron job.  Making changes in this environment is extremely risky, and things break easily.

On the other extreme, you may have a process so complicated that you lose your ability to ship changes in a timely manner.  For example, a monolithic release process where your changes are grouped with others into a single release that involves several checkpoints.  A delay on one change could block the deployment of all changes.

Whatever your process is, we refer to the steps, from task to production, as the development lifecycle.  DevOps is the workflow and tools you use to implement this lifecycle.

Typically this will involve some sort of source control such as git, hosted by a service such as GitHub, GitLab, etc.  There is a main branch which is considered to be your production code.  Everyone creates their own development branches so they can make changes without tripping over each other's code.

At some point you are ready to submit your code to be merged into the production codebase and deployed for the organization to use.  This is the moment to have a checkpoint: a framework in which your changes are tested along with the rest of the code, reviewed for correctness, and run through other automated checks.  This checkpoint is called a pull request.  A great pull request process has a mix of automation and human intervention.

What is a Pull Request?

Whether you are using GitHub, GitLab, or Bitbucket, before you merge your change into the production code, you will open a pull request (or merge request).  This is usually the trigger for the automated tests and checks.  This is also the point at which you assign a team member to review your code.  The reviewer can request changes, or approve them.  Once approved, the engineer can merge their code and deploy it into production.  Simple, right?

Initially, when you open a pull request, the description is blank.  Without any standards, it’s up to you to decide what information you want to provide to the reviewer.  Someone who is in a rush will not take the time to add meaningful detail.  The reviewer is busy with their own work and will likely not take the time to ask questions; they look at the code with no context for what it is intended to do.  The best case result is that the code is reviewed for obvious flaws in code or style.  Worst case, the code is rubber-stamped and passed without question.  The ‘quick fix’ often leads to compounding fixes in the future.

We need to add intentional friction to the process.  Once code reaches this stage, it should be considered a release candidate.  Every change to production carries a risk that the data changes in an unintended way or that something breaks.  There should be a cost to requesting that code be released.

What should be reviewed?  How do we make sure the change satisfies the original need?  At Mammoth Growth we believe that the pull request is the moment in which we ensure everything is in alignment:

  • The code does what it was intended to do
  • The code has been tested
  • The code meets our standards
  • There is a deployment plan

This is a heavy burden to put on the reviewer.  A code reviewer is working on their own changes; they have their own work to do.  You cannot expect a reasonable review turnaround time if the reviewer has to seek out the information needed to build the correct context and make meaningful decisions.  If this process is as vital as we believe, then this information is required.  The burden must be on the engineer to do the work to convince the reviewer that the change is good.

This doesn’t mean you spend the whole day on your pull request. The process should be as streamlined as possible.  Most of the detail you need to include should already be available.  You are simply bringing it all together into one document.  

Elements of a great data PR

Here is how we structure our pull requests at Mammoth Growth for data warehouse projects.  We have been doing this for several years and have found that the benefits vastly outweigh any additional time taken.  The care and attention it forces saves time by preventing future mistakes.  If you rush through a change, you are not only wasting your own time, but everyone else's as well.  Every successful process has a form of checklist: pilots before their flights, or SpaceX before they launch a rocket.  The PR is your last chance for a “go / no go” check before you launch your change into production!

Detailed summary of requirements

The business reason for the change should already be documented - how else would you know what the change needs to be?  Even a tech-debt task will typically have a JIRA or Asana ticket.  In the odd case in which an unprompted change is made, you should make the case for the change yourself.  I would not want to be responsible for the random change that breaks your CEO’s dashboard.

In the past we didn’t include this information.  Our stakeholders’ needs and expectations were not written down alongside the change, so there was nothing to compare the result to.  The reviewer had no context for what the change was meant to do, so the only thing they could do was review the code for best practices.  Code was approved, deployed, and the task was marked complete.  However, the code did not actually deliver the result the stakeholder was expecting.  Everyone down the line could point their finger elsewhere because everyone ‘did their job’, but the result was a failure to deliver the original ask.  We believe that you cannot be confident in your change until you’ve proven that it satisfies both business and technical needs.

With this summary in hand, everything can be compared against the original stakeholder need.  

Example:

Request Details

Stakeholder launched a new “How did you hear about us” survey in December 2021 that customers are prompted to complete after placing an order. They want this data to be surfaced in a report so they can start to analyze it.

Exceptions/Edge Cases: the creation_date will be added to the source data on approx. 02/08/22 but there is no ability to backfill this information. For the data prior to creation_date being added, we will need to implement fallback logic to select the best-guess response date associated with the checkout survey.
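For illustration, the fallback logic described in that edge case might end up looking something like this in the model (the table and column names below are hypothetical):

  -- Hypothetical sketch: prefer the real creation_date where the source
  -- provides it, otherwise fall back to the order timestamp tied to the
  -- checkout survey as the best-guess response date.
  select
      sr.survey_response_id,
      coalesce(sr.creation_date, o.order_completed_at) as response_date
  from survey_responses as sr
  left join orders as o
      on o.order_id = sr.order_id

Capturing the edge case in the requirements is what makes a decision like this reviewable later.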

Links

Add links to your Asana/JIRA tickets or business requirements document.  If the reviewer needs to dive deeper, they don’t have to waste time waiting for someone to get back to them.  This also saves you from having to over-share in your summary.

Stakeholders

@mention your stakeholders so they get updates and can see progress for themselves.

Description of change

Describe the engineering work done - make it easy for the reviewer to follow the decisions you made.  Before we made this a requirement, the description had, at best, an inconsistent level of detail; at worst, it was completely blank.  Without this context, the reviewer could not provide an honest review.  Instead, we would focus on our personal pet peeves.  This isn’t a good use of anyone’s time.

Once engineers started providing their intent and the reasoning behind their changes, the quality of review increased dramatically.  Reviewers started focusing on intent, and any change requests became teaching opportunities.  The reviewer can make a judgment call on whether a choice was justified or not.

Example:

  • The new model_name_here is responsible for a_thing that relates to business_requirements. 
  • It is doing some complex_thing (which is commented) that is required because of data_quality_issue or business_rule that is outlined in the requirements document.  
  • I made some_choice when doing complex_thing for these_reasons.

Complexity

Qualify the level of complexity of your change for the reviewer.  When the engineer qualifies the complexity of their change, they are sending a signal to the reviewer: maybe the change is 40 lines of boilerplate code, or maybe it is a complicated window function calculation.  If you need a careful set of eyes, or you need this fast-tracked, you can ask for help!

Example:

  • Simple 1-liner, please rubber stamp
  • Simple model update. I don't expect any issues
  • Business rules. Need another set of eyes.
  • Massive re-write. What have I done?

Risk

Are there any special considerations you want to convey to the stakeholders?  Is this change expected to break something, and does it need to be coordinated with a dashboard change?  Is there any cleanup required?  Can this be deployed automatically with the merge?

In the past, before we started requiring this information, an engineer would often merge their changes (triggering a production build) while forgetting that a Tableau workbook needed to be updated.  Requiring this information pushed the need to think about the bigger deployment picture up into the planning phase: how can this change go wrong?

Example:

This change swaps the source table for model_name which is used for the executive dashboard.  Before merging into production, the report needs to be tested against the staging model.  We do not expect there to be any variance in the numbers.

Supporting Data / Audit Result

Arguably the most important piece of the pull request.  You should provide sufficient evidence that your code satisfies the requirement.  You should also consider how the change could break, and supply evidence that it doesn’t.  This sounds like a heavy burden, but the engineer should have tested their code in development anyway - this requirement simply ensures that the work happens.  As well, the act of articulating your audit often uncovers previously missed issues.

Everyone has been there: making a “quick fix” that could not possibly go wrong - until it does.  When there is no burden of proof, the time needed for due diligence feels like a loss, and loss aversion kicks in.  When we make due diligence part of the process, we start to estimate our time more accurately.

For best results, include:

  • The SQL you used to validate your changes (see the sketch below)
  • A summary table that demonstrates what has changed, and what has not
  • An explanation of how the results support the expected change described in the requirements
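As a sketch of what that evidence can look like for the survey example above (the schema, table, and column names here are hypothetical), a simple summary query often does the job:

  -- Hypothetical audit: every survey response should now have a response_date,
  -- including rows from before creation_date existed in the source.
  select
      date_trunc('month', response_date)  as response_month,
      count(*)                            as total_responses,
      count(creation_date)                as responses_with_source_date,
      count(response_date)                as responses_with_any_date
  from analytics_dev.survey_responses
  group by 1
  order by 1

If responses_with_any_date equals total_responses for every month, and only the months before the creation_date column existed rely on the fallback, the reviewer can see at a glance that the requirement is met.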

Diligence

In order to encourage accountability, we require the engineer to ‘sign off’ on the steps they performed.  At the point you request a review, you should view your code as production ready.  I am guilty of not fully building and testing my code in development - just the parts that changed - and having my changes break the staging build.  Ever since we added these check boxes, I force myself to go back and do a full run and test, then fix my “quick fix” before I create the pull request.

  •  I have checked affected models for appropriate tests and metadata
  •  Project has been run locally
  •  Project has been tested locally
  •  The code change conforms to standards and practices
  •  Code is production ready and is ready to be reviewed

Make your own template

Creating a pull request template is easy, and we recommend one for every project.  It sets the tone for what is expected when you want to make changes to your production code.  It should be self-evident and not require much explanation.

Creating a pull request template for your repository - GitHub Docs

Create a merge request template - GitLab

Save time with default pull request descriptions - Bitbucket

GitHub and GitLab use Markdown for their formatting.
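As a starting point, a minimal Markdown template covering the sections above might look something like this (trim or extend it to fit your own process); on GitHub, for example, saving it as .github/pull_request_template.md pre-fills every new pull request:

  ## Request Details
  <!-- What did the stakeholder ask for? Include exceptions and edge cases. -->

  ## Links
  <!-- Asana/JIRA ticket, business requirements document -->

  ## Stakeholders
  <!-- @mention the people who asked for this change -->

  ## Description of Change
  <!-- What did you build, and why did you build it that way? -->

  ## Complexity
  <!-- One-liner? Massive re-write? Set expectations for the reviewer. -->

  ## Risk
  <!-- What needs to be coordinated before or after the merge? Can this deploy automatically? -->

  ## Supporting Data / Audit Result
  <!-- Validation SQL, before/after summary, and how it proves the requirement is met -->

  ## Diligence
  - [ ] I have checked affected models for appropriate tests and metadata
  - [ ] Project has been run locally
  - [ ] Project has been tested locally
  - [ ] The code change conforms to standards and practices
  - [ ] Code is production ready and is ready to be reviewed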

Automate where possible

Most routine things can (and should) be automated.  Certain steps and checks involved in the DevOps process should just happen automatically, triggered by a key event such as creating a Pull Request.

Automation makes it easier to make the right choices.  If you have to load up your query tool, write the boilerplate SQL to run the test, and then load the results into some sort of spreadsheet to do an analysis, is that going to happen?  Is it going to happen every time you make a code change?  Automation makes the choice to say yes as easy as the choice to say no.

When we create a new pull request, various automated checks are triggered.  Your code is compiled and deployed to a staging environment, ensuring that your change hasn’t broken anything.  Checks such as tests, code quality analysis, or data audits run automatically and report back success or failure.

Testing that your change does not cause any errors is the first core way to automate your process.  This happens when we build our staging sandboxes.  Because we can now build and deploy the entire data warehouse, a change to a column name that accidentally breaks a dependent model causes the staging build to fail - the mistake surfaces immediately instead of in production.

What if you could automate the comparison between the output of the models you modified and your production data?  You should be able to assert what is expected to change.  If something changes outside of your expectations, it is far better to catch it at this stage than a week later when the CEO notices.
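Even without dedicated tooling, a first pass at this comparison can be a query you run against the staging build (the schema and model names below are hypothetical):

  -- Hypothetical comparison: the production model vs. the version built
  -- from this pull request.  If the change was not supposed to move any
  -- numbers, the two rows should be identical.
  select 'production'   as source, count(*) as row_count, sum(order_total) as order_total
  from analytics.fct_orders
  union all
  select 'pull_request' as source, count(*) as row_count, sum(order_total) as order_total
  from analytics_staging.fct_orders

Anything outside your stated expectations shows up as a mismatch between the two rows.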

Datafold has a feature called Data Diff which makes it easy to compare two datasets.  It can even be triggered by your staging build to automatically compare your change to production and produce a report.  From the perspective of the code reviewer, this instantly conveys the scope of the change, and it is easily matched against what was expected to change.  From the engineer's perspective, it makes comparing your development version with production painless, giving you an extreme degree of confidence in the accuracy of your change!

Conclusion

At Mammoth Growth, our goal is to find the sweet spot between agility and quality.  Even though we covered a lot, there is so much more that can be accomplished with a solid DevOps process.  We see it as the intersection of strong business requirements, explicit expectations and assertions about what should change, accountability and thoughtful engineering, and leadership that provides a safety net.  We automate what can be automated without losing sight of our goals.  Automation is the key to making this a streamlined process in which everyone has confidence in the output.  Otherwise, why bother - just deploy straight to production!
