TL;DR: We can transform the score distributions of new models to match those of old models, while preserving the new model’s ROC curve. This allows model changes to improve accuracy without disrupting automated systems triggered by fixed score thresholds.
range from [0, 1], where higher scores are more suspicious. We encourage customers to use these scores in automating key actions, by comparing scores to fixed thresholds. These actions can be things like blocking orders, hiding content, or banning users. Because these actions are so important to our customers’ businesses, they’re naturally sensitive to even relatively small changes in the number or share of users or orders flowing through these automated systems.
We provide tools for building these automated actions (through our Workflows feature), but for a range of reasons, some customers need to write and run their own automation systems. As a result, we may have little or no knowledge of the specific thresholds that a customer uses to make decisions. This opens up a concerning possibility: because we don’t know which thresholds are critical choice points for our customers, we could easily change score distributions in a way which radically changes the number of users flowing through a given automated system.
This issue is exacerbated by some common practices for evaluating ML performance, which view score distributions very differently than our customers do. When we’re considering changes to our ML systems, the first metric we look at is the area under the ROC curve (AUC ROC), which is a good measure of how well a classifier separates good and bad (e.g. fraud) examples. However, ROC curves are not sensitive to specific score values, or to the thresholds that our customers use.
This post describes how we can reconcile these two views of a classifier’s behavior, allowing us to choose a model change based on its AUC ROC, and ship those changes without disrupting our customers’ automated systems.
Solutions we rejected
But before we get to the solution that actually works for us, let’s refine our goals by considering some potential approaches which we discarded and what we learned from exploring each one.
Don’t change models suddenly
Historically, customers might notice a rapid change in the volume of traffic flowing through an automated action. When they raised the issue with us, they would highlight the suddenness of the change as a concern. A naive but reasonable person might suggest that maybe we should just shift their model slowly rather than quickly. We could define a mixture model between the old and new models, and slowly dial up the contribution from the new model from 0 to 1 over hours or days.
However, this approach could actually make things worse. Customers still ultimately need to update their thresholds; they just have to do it at a different time. If the period over which we transition the models is long, they may need to update those thresholds more than once, because no single threshold may work well over the whole period. This option is unfriendly to the customer, so we didn’t consider it further.
Treat model changes like API changes
The opposite end of the spectrum is to suggest that since model changes can have a flavor of non-backwards-compatibility, we should treat them like API changes. Customers should be able to request scores from a specific versioned model, and after we release a new model, we permit customers to upgrade to it at their own pace. They’re responsible for resetting any thresholds before they use scores from a new model.
From a certain software engineering purist perspective, this might appear to be the “right” thing to do, but from a product experience view, it’s awful. It’s critical to our product to release new models frequently, and for customers to use scores that are informed by recent data and insights. But no one wants to be obligated to constantly do these small migrations. Customers want a system that just works, not an ongoing demand that they spend resources and attention to update their integration to consume from the most recent model.
Define scores to be percentiles
If the problem is a change in the share of users or orders falling above (or below) a fixed threshold, one straightforward suggestion is to define scores to be percentiles relative to some recent distribution. In this view, what it means for a score to be 90 is that the associated user or order was more suspicious than 90% of data points we saw around the same time. By definition, then, the share of orders that gets blocked by scoring above a fixed threshold is going to be preserved over time.
Unfortunately, this approach totally breaks down when the actual fraud rate changes. Suppose your business launches a new product that fraudsters love — it’s in high demand, holds value well, and is reshippable. Your fraud rate increases from 0.95% before the launch to 1.5% after. If previously you were pretty well served by having a fixed threshold at 99 (which blocked 1% of orders), suddenly 0.5% of orders which you would like to be blocked are not blocked. By definition, you can’t have 1.5% of volume be above the 99th percentile.
Additionally, the percentiles-as-scores approach comes with some inconveniences when dealing with highly imbalanced prediction tasks like fraud. If typically fraud is a very small minority, then the interesting range of ambiguous scores where you might want to place a threshold might be all crammed into a narrow score range.
And existing customers still need to update their fixed thresholds to adapt to this new definition of scores anyway.
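The fraud-rate example above can be checked with a small simulation. This is a minimal sketch, not our production scoring: the Gaussian score model, the sample size, and the helper names are all assumptions made for illustration. It shows that under percentile scores, a fixed threshold of 99 blocks the same 1% of orders whether the fraud rate is 0.95% or 1.5%.

```python
import random

def percentile_scores(raw_scores):
    """Replace each raw score with its percentile rank (0 to 100) within the batch."""
    n = len(raw_scores)
    order = sorted(range(n), key=lambda i: raw_scores[i])
    pct = [0.0] * n
    for rank, i in enumerate(order):
        pct[i] = 100.0 * rank / n
    return pct

def blocked_share(scores, threshold):
    """Fraction of orders at or above a fixed threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

def simulate_raw_scores(fraud_rate, n=20000, seed=0):
    """Hypothetical raw model: fraud scores cluster high, legitimate ones low."""
    rng = random.Random(seed)
    raw = []
    for _ in range(n):
        is_fraud = rng.random() < fraud_rate
        raw.append(rng.gauss(3.0, 1.0) if is_fraud else rng.gauss(0.0, 1.0))
    return raw

before = percentile_scores(simulate_raw_scores(0.0095, seed=0))
after = percentile_scores(simulate_raw_scores(0.015, seed=1))

share_before = blocked_share(before, 99)
share_after = blocked_share(after, 99)
print(share_before, share_after)  # both are pinned at 1% of volume
```

However much fraud arrives, the blocked share cannot rise above 1%, which is exactly the failure described above.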
A related alternative is to apply techniques for calibration like Platt scaling or isotonic regression. These aim for scores to be well-calibrated probabilities. In this definition, if you collect a lot of data points which got scored as 90, you should observe that 90% of them turn out to actually be fraud.
This retains the drawback that existing customers will still have to update their thresholds to adapt to it. It introduces the new cost that we have to calibrate using a labeled holdout dataset. Because our labels are so imbalanced (fraud is a small minority), and because often the score range which needs to be most carefully calibrated is in a low-density region (i.e. plausibly fraud is also a small minority), this holdout set might need to be very large, possibly prohibitively so.
The options we’ve rejected have revealed some relevant non-goals:
- We’re not aiming to remove all score distribution shifts. When the incoming data changes, as with a change to the true fraud rate, then we want the score distribution to change.
- We’re not trying to avoid suddenness of score distribution changes; again if the incoming data changes suddenly, the score distribution should change suddenly.
- We’re not aiming to introduce a new definition for what scores “mean.” The available options would be bad for customers, or bad for us, and still require that existing customers revisit their automated actions.
The goal we do end up with is simply this: model changes should not be the cause of score distribution changes. Those changes can happen, and at any rate, but they need to be a function of changing incoming data.
A quick review of ROC curves
For a more comprehensive view, see Wikipedia, Google, or a decades-long literature. But the single-paragraph summary is that for a given classifier, the ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) as the threshold for that classifier varies. The resulting curve is a good indication of how well separated the score distributions are for positive and negative examples. If there is very little overlap, the classifier has achieved high separation, and the area under this curve will be high (near 1).
For our purposes, the key observation is that the definitions of TPR and FPR (and thus the ROC curve) don’t depend on specific score values. Let’s hone this observation a bit:
- The ROC curve for a given classifier is determined by the ordering over examples created by a score function.
- Said slightly differently, two distinct score functions (which produce different scores for the same data points) will produce the same ROC curve if they produce the same ordering.
- Said slightly differently again, we can compose a score function with an order-preserving function (a monotonic increasing function), and the ROC curve will be preserved.
This gives us an enormous degree of freedom to change score functions. We can move scores up or down, or push them together or stretch them apart as long as we don’t reorder them. And there’s an infinite number of monotonic increasing functions we can choose from. (Technically there’s not a whole vector space of them, but there is a whole positive orthant of a vector space of them.)
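The invariance described above is easy to confirm numerically. The sketch below is illustrative only: the synthetic labels and scores, and the choice of a sigmoid as the order-preserving function, are assumptions. It computes AUC ROC as the probability that a random positive outranks a random negative, then checks that a strictly increasing transform leaves it unchanged.

```python
import math
import random

def auc(scores, labels):
    """AUC ROC as P(random positive scores above random negative), ties count half.

    This pairwise form is O(n_pos * n_neg), which is fine for a small illustration.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

rng = random.Random(42)
labels = [1 if rng.random() < 0.2 else 0 for _ in range(500)]
# Hypothetical raw scores: positives are shifted up by one standard deviation.
scores = [rng.gauss(1.0 if y else 0.0, 1.0) for y in labels]

# A strictly increasing transform: moves and squeezes scores, never reorders them.
squashed = [1.0 / (1.0 + math.exp(-3.0 * s)) for s in scores]

# A non-monotonic transform, for contrast: it reorders examples.
scrambled = [math.sin(5.0 * s) for s in scores]

print(auc(scores, labels), auc(squashed, labels), auc(scrambled, labels))
```

The first two values are identical because only the ordering matters; the third generally differs because the ordering was destroyed.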
So with this great freedom, how do we know which order-preserving / monotonic-increasing function we should use? Here are some more leading observations:
- The CDF of a probability distribution is a monotonic increasing function. This is a basic consequence of how CDFs are defined.
- For a univariate distribution, if its CDF is continuous and strictly increasing, then the inverse CDF is defined and is also monotonic increasing.
- The composition of two monotonic increasing functions is also monotonic increasing.
Putting the three observations above together, we can take a classifier whose ROC curve we like and disguise it, such that its score distribution matches any other distribution we choose, while retaining the original ROC curve. If a customer has already defined fixed score thresholds against model A, and we believe they’ll get better accuracy under model B_raw, all we have to do is define a new classifier whose score function is
score_B_public = InvCDF_A(CDF_B(score_B_raw(x)))
This approach has some strong advantages:
- We can force our existing model to reproduce the score distribution from a previous model (on comparable data sets) extremely closely. For this reason, even if we don’t know a customer’s specific thresholds, we’re assured that customers won’t need to update them.
- We can make this remapping function, InvCDF_A(CDF_B(_)), a static record, which we produce at the time of switchover and never update. This means that the score distribution can still change in response to changing incoming data going forward. An increase in the proportion of raw scores above some value creates an equivalent increase in the proportion of public scores above the corresponding remapped value.
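The remapping and its two advantages can be sketched with empirical CDFs. Everything specific here is an assumption for illustration: the Beta-distributed samples stand in for real score histories, the threshold of 0.3 is arbitrary, and the helper names are hypothetical rather than part of our system.

```python
import bisect
import math
import random

def make_cdf(sample):
    """Empirical CDF: fraction of observations <= x."""
    xs = sorted(sample)
    n = len(xs)
    return lambda x: bisect.bisect_right(xs, x) / n

def make_inv_cdf(sample):
    """Empirical inverse CDF: smallest observation whose CDF is >= q."""
    xs = sorted(sample)
    n = len(xs)
    def inv_cdf(q):
        # The small epsilon guards against float round-off in q * n.
        idx = math.ceil(q * n - 1e-9) - 1
        return xs[min(n - 1, max(0, idx))]
    return inv_cdf

def make_remap(sample_a, sample_b_raw):
    """Build InvCDF_A(CDF_B(_)) from two recorded score samples."""
    cdf_b = make_cdf(sample_b_raw)
    inv_cdf_a = make_inv_cdf(sample_a)
    return lambda score_b_raw: inv_cdf_a(cdf_b(score_b_raw))

def share_above(scores, threshold):
    return sum(s > threshold for s in scores) / len(scores)

rng = random.Random(7)
n = 20000
# Hypothetical histories on [0, 1]: model A skews low, raw model B is spread wider.
sample_a = [rng.betavariate(1, 8) for _ in range(n)]
sample_b_raw = [rng.betavariate(2, 3) for _ in range(n)]

remap = make_remap(sample_a, sample_b_raw)  # built once at switchover, then frozen
threshold = 0.3  # stands in for a customer threshold we never need to know

b_public = [remap(s) for s in sample_b_raw]
print(share_above(sample_a, threshold), share_above(b_public, threshold))

# Later the incoming data shifts toward more suspicious traffic.
# The remap is NOT rebuilt, so the public distribution shifts too.
shifted_b_raw = [rng.betavariate(3, 3) for _ in range(n)]
shifted_public = [remap(s) for s in shifted_b_raw]
print(share_above(shifted_public, threshold))
```

On comparable data the share above the threshold matches model A, while the later shift in incoming data still shows up in the public scores because the remapping is never refit.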
Logistics: recording score distributions
The order-preserving remapping described above depends first and foremost on capturing score distributions from a given model. Of course, we never observe a distribution, but only values from that distribution. How can we know whether an estimate of the distribution based on some finite number of observations is good enough to build these remapping functions?
A very general (and conservative) finding that will help us is the Dvoretzky-Kiefer-Wolfowitz inequality:
P(sup_x |F_n(x) - F(x)| > c) <= exp(-2nc^2)
where:
- F_n is the empirical CDF from n observations,
- F is the true CDF, and
- c is an error budget, for the largest discrepancy between F and F_n.
This says, for a given budget c and number of samples n, the probability that the budget is violated is bounded by the expression on the right hand side. Flipping this around, if we fix our budget, and the probability that we overrun, we can solve for a required sample size n.
In our use, because customers are sensitive to small absolute changes in low-volume buckets (e.g. orders with scores >90), c needs to be quite small. We picked c = 0.25%, and P = 2.5%, which gives us n ~= 300,000.
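Solving the bound for n is a one-line calculation. The sketch below uses a hypothetical helper, `dkw_sample_size`, and the budget values quoted above. One caveat worth stating: the commonly cited two-sided form of the inequality carries a leading factor of 2 on the right-hand side, so the helper exposes both variants. The bound as written above reproduces the roughly 300,000 figure; the two-sided constant asks for somewhat more data.

```python
import math

def dkw_sample_size(c, p, two_sided=False):
    """Smallest n such that k * exp(-2 * n * c**2) <= p.

    k = 1 matches the bound as written in the text;
    k = 2 is the commonly cited two-sided constant.
    """
    k = 2.0 if two_sided else 1.0
    return math.ceil(math.log(k / p) / (2.0 * c * c))

c = 0.0025  # error budget of 0.25%
p = 0.025   # 2.5% chance of exceeding the budget

n_as_written = dkw_sample_size(c, p)
n_two_sided = dkw_sample_size(c, p, two_sided=True)
print(n_as_written, n_two_sided)  # roughly 295,000 and 350,000
```

Because n scales with 1/c^2, halving the error budget quadruples the number of scores we need to collect.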
Note that it’s slow to collect this many scores, even for customers with high volumes. This means that we can’t see small changes with high confidence quickly. But this insight applies to our customers as well! In a very specific way, we can keep score distribution shifts small enough that customers won’t be able to notice them.
Logistics: Data Flow
The above discussion sorted out the math of gathering data and trusting our empirical CDFs. Now let’s turn our attention to some of the mechanics of how data actually moves through this system.
The above diagram shows a timeline for shipping model changes in this system:
- At t0, model A is in place. The customer is used to that model, and builds their automations against the distribution it produces. We bucket scores, and keep running counts of those buckets in our DB. This represents an in-progress view of a score distribution.
- At t1, the number of counts passes our n = 300,000 requirement, and we write that whole distribution as a static record to a new table. It’s timestamped, and never updated. This represents the score distribution as observed over a specific time window.
- At t2, we deploy a new model B. This has two configured versions:
  - b_raw is configured to never produce public scores. Instead, we run it in shadow against live data, so that we can observe its score distribution.
  - b_remapped is configured to be publicly visible, but it only activates when a remapping function for it is available.
- At t3, we collect scores for b_raw, and eventually record its completed distribution.
- At t4, we now have two static distributions. We can now write a remapping function, which also gets stored in the DB as a static, write-once, update-never record.
- At t5, the predicate which checks for the existence of that remapping function finally returns true, which means we can start using it to produce consumable scores for b_remapped.
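The timeline above can be sketched in a few functions. This is a simplified illustration under stated assumptions, not our actual implementation: the bucket count, the class and function names, the in-memory storage, and the choice to return the midpoint of the matched bucket are all hypothetical. Only the bucketing of scores, the n = 300,000 requirement, and the write-once records come from the description above.

```python
from bisect import bisect_left
from itertools import accumulate

N_REQUIRED = 300_000  # sample-size requirement from the text
NUM_BUCKETS = 1000    # hypothetical bucket resolution for scores in [0, 1]

class RunningDistribution:
    """In-progress bucket counts for one model's scores (t0 and t3)."""

    def __init__(self, num_buckets=NUM_BUCKETS):
        self.counts = [0] * num_buckets
        self.total = 0

    def observe(self, score):
        idx = min(len(self.counts) - 1, int(score * len(self.counts)))
        self.counts[idx] += 1
        self.total += 1

    def freeze(self, n_required=N_REQUIRED):
        """Return an immutable CDF snapshot once enough scores are seen (t1), else None."""
        if self.total < n_required:
            return None
        return tuple(c / self.total for c in accumulate(self.counts))

def build_remap_table(cdf_a, cdf_b_raw):
    """For each raw-B bucket, find the A bucket at the same quantile (t4).

    This is a bucketed approximation of InvCDF_A(CDF_B(_)).
    """
    last = len(cdf_a) - 1
    return tuple(min(last, bisect_left(cdf_a, q)) for q in cdf_b_raw)

def public_score(raw_score, remap_table):
    """b_remapped produces a score only once its remap table exists (t5)."""
    if remap_table is None:
        return None  # predicate is false: nothing public yet
    n = len(remap_table)
    idx = min(n - 1, int(raw_score * n))
    return (remap_table[idx] + 0.5) / n  # midpoint of the matched A bucket
```

A short usage example with a tiny sample-size requirement so it runs instantly:

```python
a, b = RunningDistribution(10), RunningDistribution(10)
for i in range(100):
    a.observe((i / 100) ** 2)  # model A skews low
    b.observe(i / 100)         # raw model B is uniform
table = build_remap_table(a.freeze(n_required=100), b.freeze(n_required=100))
print(public_score(0.5, None), public_score(0.5, table))
```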
And it works!
This spring, we undertook a project to migrate some customers to custom-trained ensembles of multiple types of models. These customers were both strategically important to us and internally quite sophisticated in how they used our scores in combination with other signals. It was critical that these customers not be disrupted by the transition. We used the above technique, checked score distributions for the old models, the new raw models, and the new remapped models, and confirmed that the new remapped models very closely matched the score distribution of the original models.
This is visible in the below figure, where for each model we have both the CDF and the PDF. Note also that these distributions were not all captured over exactly the same time period, so some discrepancy between the old model and the remapped new model is to be expected.
Will customers benefit from updating their fixed thresholds after one of these model updates?
Potentially. By preserving the score distribution while improving the AUC ROC, we can make sure that customers are better off after the model change than before. They’ll block the same fraction of users/orders, but a greater proportion of those decisions will turn out to be correct.
However, this mechanism cannot ensure that the existing threshold will be optimal for any given preference between precision and recall (i.e. for your favored F-measure). A model deploy which improves AUC ROC and is mapped to match the preexisting score distribution may allow customers to raise their auto-block threshold, lower their auto-accept threshold, or both.
So every model release will be unnoticeably smooth going forward?
Unfortunately, that’s not yet true. In particular, the stringent requirements on the number of data points in order to capture the CDF of the new model make this process quite slow. For this reason, we use this technique proactively for large model changes which we expect would change the score distribution meaningfully (such as migrating a customer to a new model type). However, for our regularly scheduled model releases (where almost all customers stay on the same model type, but with refreshed data and features), we use a different set of tools to detect which customers will have significant shifts, rather than preemptively avoid such shifts occurring at all.
When you use this technique, should a customer be unable to notice a model change?
Not quite. This technique should avoid changes in the top-level score distribution. However, it doesn’t preserve score distributions of subsets of users or orders — and this ability to shift is necessary for us to actually improve accuracy. For example, if our new model finds that users with emails from a specific domain are more likely to be fraudulent, the distribution of scores for those users should change. A natural extension of this observation is that the scores of individuals may change dramatically when we change models.