Bayesian models are based on the idea of iteratively updating knowledge about an underlying relationship in the data. They are one of the most potent methods when dealing with limited data or uncertain scenarios.

PyMC and ArviZ are an excellent pairing of open-source Python libraries for modeling and visualizing Bayesian models. PyMC provides an intuitive API for describing observables, priors, and likelihoods, while ArviZ produces standard plots with just a few lines of code.

The sheer amount of artifacts the iterative Bayesian modeling process generates can be challenging to keep organized. Experiment trackers like neptune.ai help data scientists systematically record, catalog, and analyze modeling artifacts and experiment metadata.

Even though neptune.ai does not have built-in integration for PyMC and ArviZ, it’s straightforward to track artifacts produced by these libraries through the powerful *run* interface. Using the experiment tracker’s run comparison feature, we can analyze the progress in performance between iterations of the Bayesian modeling process.

When dealing with limited data or uncertain scenarios, one of the most potent methods is Bayesian inference. At its core, it is a formulation of statistics that enables one to incorporate prior knowledge and update beliefs systematically and coherently. Its power lies in the flexibility in model-building, especially its ability to take into account insights about the process being studied.

The typical Bayesian inference workflow is an iterative process: we start by building simple models and fit their parameters to some data. Then, we check the models’ predictions and evaluate their performance. If we find a model with satisfactory performance, great! If not, we should try to understand what’s holding us back and, from there, possibly reassess our assumptions and build better models.

PyMC is a powerful and well-maintained Python library that we can use for Bayesian inference. However, it does not provide any visualization features. For this, we will use ArviZ, a backend-agnostic tool for diagnosing and visualizing Bayesian inference results. You’ll see that both libraries work very well together.

Iterative modeling with PyMC and ArviZ creates a lot of artifacts in the form of plots, data, metrics, and so on. Keeping track of all of them is very important for many reasons! For instance, to have an overview of which approaches in modeling were fruitful and which ones were not. Another reason is that over time, we might be accumulating more and more data, and a model that had acceptable performance in the past can become unacceptable in the future. Access to old artifacts can help us diagnose such problems and discover ways to fix them.

But as you can imagine, storing all this information in a reliable, accessible, and intuitive way can be difficult and tedious.

Luckily, there are tools that can help us with this! In this post, I will show you how to use neptune.ai to store, view, and compare artifacts from different experiments. Neptune itself is not integrated out-of-the-box with PyMC and ArviZ, but thanks to its extensibility, it is easy enough to use it in combination with both.

For the following code examples, I assume you have a Python >=3.10 environment with neptune, pymc, and arviz installed. Here is my requirements.txt file, which defines the package versions I used when developing the code in this post.

neptune.ai is an experiment tracker for ML teams that struggle with debugging and reproducing experiments, sharing results, and messy model handover.

It offers a single place to track, compare, store, and collaborate on experiments so that Data Scientists can develop production-ready models faster and ML Engineers can access model artifacts instantly in order to deploy them to production.

## How to log PyMC and Arviz artifacts on Neptune

We will start by visiting app.neptune.ai, registering there, and following their instructions for creating a new project. (If you’d like more detailed instructions, see the Create a Neptune project page in the documentation.)

Once we’ve created a project, we will be greeted by a helpful tooltip showing the basic boilerplate code needed to integrate Neptune with our code. In particular, we are shown how to initialize a run object that encapsulates a single run of our experiment pipeline. In this tutorial, a run consists of defining a model, inferring its parameters, and comparing it with some data. You can think of a run as an “experiment” that you want to rerun some variations of – and for which it is interesting to compare some metric or plot across those variations.

Now, let’s generate some synthetic data:

(You can find all of the code in this post in this Jupytext notebook.)

Here is what the data we generated looks like: y(x) is a random variable that depends on x, and for every x, its average <y(x)> lays on the line <y(x)> = 2x + 3, and its standard deviation sigma(x) increases as we get further away from x=0. (It can be proven that it follows the relation sigma(x)^2=sqrt(0.1^2 + x^2)).)

Let’s imagine that this is a dataset that we got from real-world observations. Ignoring the fact that the variance changes with x, it looks like the observations more or less lay on a line – so we might start trying to get an estimate of the coefficients of this line: that means that the first model we will fit is a linear regression.

But first, to be able to log the artifacts we are about to generate with Neptune, we’ll now initialize a new run:

Then, we define the model by instantiating three parameters, for each of which we have some prior belief, and combine them into the likelihood for linear regression:

In this example, we’ll use some arbitrary prior distributions for the parameters. In a real-world scenario, the choice of priors would be informed by what we know about the modeled process.

The first thing we’ll now do is to validate that the priors we chose make the model output sensible values. In other words, we’ll check the model’s output **before we fit it to the data**. This is called a prior predictive check. We can use Arviz’s plot_ppc function for this purpose:

This code produces the following output:

This plot shows how the distribution of observed values for y compares with the predictions we generate from our prior beliefs. Each blue line corresponds to sampling one value for each of the parameters sigma, beta, and intercept, inserting them in the formula for y (together with observed values for x), and computing a KDE on the sampled values for y. We can see, for instance, that our priors on the parameters allow for a wide variety in the distributions we predict for y. We can also see that the observed values lie in a range similar to at least some of the distributions we sampled.

It is an interesting plot to keep track of, so we will log it on Neptune by running the following expression:

The string we use in the accessor (”plots/prior_predictive”) indicates that this artifact will be stored as a file named “prior predictive” in a folder “plots” associated with this run on Neptune – and indeed, if we visit Neptune and click on the run we just initialized, we find the plot we just created under the “All metadata” tab:

The next step in a typical Bayesian workflow is to try to infer the parameters of our model from the data. In PyMC, we can do it like this:

The sequence of samples from the posterior distribution we obtain by calling pm.sample is called a trace. A trace is a central object in Bayesian inference, and it can be of great help when we need to diagnose if something went wrong during sampling.

To log the idata object, which includes both trace and the prior predictive samples, to Neptune, we call

I decided to upload it both as a pickled object and also as a table. With the pickled version, it will be easier to re-instantiate it later as an object in Python. By logging it as a table, we can inspect it directly on Neptune.

(Aside from logging the trace after we’re done with sampling, it is possible to upload it on a rolling basis by defining a callback as described here.)

We should also have a look at trace plots (plots of the sequence of values we sampled, as well as their frequencies) and upload them to Neptune:

The output is the following plot.

(Notice that Arviz returns an array of plots, which cannot be directly passed to the run[…].upload method, but the code above circumvents that.)

We then plot the posterior distribution of the model’s parameters and run posterior predictive checks as well, logging everything into Neptune as we previously did for similar plots:

We can consider what we did so far a first pass through the Bayesian workflow: with this, the first iteration of our model is complete, and we tell Neptune we are done with the run by calling

Since we have logged all our plots to Neptune, we can compare any subsequent iteration to this baseline.

Our model is far from perfect. If we use the mean value for each parameter and plug it into the formula for y, we get the following plot:

We see that the model captures the mean correctly. But notice that the standard deviation of y(x) (represented with the blue shaded region in the plot above) is overestimated for values of x close to 0 and underestimated values of x far from 0.

Our model assumes a constant variance for all values of x. To improve our model, we need to change how we parametrize the variance of y and allow it to change with the magnitude of x.

Equipped with this insight, we can return to the code block where we defined the model and change it to:

After making this change, we started a new run (i.e., executed our code again), and now the work we did to integrate Neptune will pay off!

In Neptune’s “Compare runs” panel, we can compare all the plots of the new run side-by-side to the results of the first run:

From a look at the “Compare Runs” tab, it is clear that our change improved the model! Our posterior predictive checks now suggest a much better agreement between our model and observations. And not only that. If we pick the mean values for the parameters we just estimated and plug them into the formula for <y(x)> and sigma(x), we obtain a much better agreement with observations than before:

## Summary

In conclusion, we were able to integrate Neptune as a valuable tool into a typical Bayesian workflow. Even a simple implementation like the one we just saw, where we are just logging some plots/metrics and looking back at them over time, can significantly help us develop increasingly better hypotheses and better understand the data!

As a next step, if you want more info about Neptune, a great starting point is their docs. If you want to gain in-depth knowledge about Bayesian inference, I wholeheartedly recommend the book “Probability Theory: The Logic of Science” by E. T. Jaynes.

I am sincerely grateful for the contributions of Kilian Kluge and Patricia Jenkner, whose meticulous editing enhanced the quality and clarity of this post.