This week, we hosted the first session of our new summer speaking series (Turn Up The Bayes). I gave a talk on how we leverage a distributed database, HBase, to power an infrastructure that enables performant, distributed online learning. The following is a brief summary…but first, a quick introduction.
Fraudsters always search for new ways to exploit opportunities at the expense of companies that provide legitimate goods and services. At Sift Science, we use real-time supervised machine learning to sabotage fraudster plots. As it turns out, the “real-time” portion of our product brings significant infrastructure challenges.
Traditional supervised machine learning models are trained on a batch of historical data and then shipped to production servers, which can classify incoming data based on those trained models. This process generally works well, except that if a new type of fraud emerges after the model was trained, the model will not fully adapt until the next batch training. In order to provide a truly real-time machine learning solution, we use online learning so that the models continue to be updated up to the point that the next batch-trained model is released.
However, a major complicating factor in supporting online learning is functioning at scale. We needed to build a system that would be able to maintain model parameters in a distributed way, but still be able to classify users with a reasonable latency.
Now, on to the talk itself.
A full house on Monday June 29.
We aim to support online-updatability by transforming a sparse feature value space into a dense one and by updating the importance of features in the actual statistical model. Our strategy, broadly, is to do the following:
- Decompose all updates into HBase atomic operations like increment, check-and-put, and check-and-delete
- Batch and coalesce updates to reduce write traffic and merge multiple increments to the same key into a single increment
- Issue all reads in batches, e.g. ask for an entire feature vector rather than a single feature at a time
- Add multiple levels of caching to reduce the frequency at which we need to make network trips or disk accesses to obtain model parameters
For sparse feature value transformation, we coalesce uncommon feature values based on the number of users who have each particular feature value. We built the tables below to maintain arbitrary sets, where each set key could be an email address, and each corresponding set is the collection of users we have seen use that email address.
The second table collapses sets into set sizes for lower-latency accesses.
You can read our previous blog post about ML infrastructure on HBase for more on densification, the process of transforming sparse features into dense ones.
The remainder of our model parameters can re-use much of the same infrastructure built for sparse feature densification for counting set sizes. However, we also need to maintain weights and other arbitrary numeric values that need to be updated via increments efficiently. Therefore, we introduce a third table, “NumericParameterTable,” which maps keys to numeric values that supports increments on those values.
In practice, and with the help of local and memcached-based caches, we find that we are able to keep up with our traffic. We issue hundreds of batch requests per second, with each batch consisting of hundreds to thousands of data points; we store over 200 million sets and numeric parameters in our database. Because both our local cache and distributed cache hit rates are over 90%, we are able to get the storage benefits of a distributed database without sacrificing most of the latency reduction from local accesses.
If you missed the talk, my slides are embedded below! We’ll post video from the night next week. Please feel free to reach out to me or anyone else here at Sift if you have any questions or want to learn more.