My last week at LiveRamp also happened to be a hackweek. These weeks happen at the end of every 9-week development cycle, and the goal is to do something relevant to LiveRamp, but not your day-to-day work.
Some examples of things people have done:
My final week at LiveRamp was a hackweek, which means I could work on a cool project! Woohoo!
As explained in my earlier post, LiveRamp receives a number of redirects from other websites (Web-Match Partners). What we’re trying to do essentially is maximize our reach on the internet; however receiving these redirects from Web-Match Partners is quite expensive. Ultimately, LiveRamp hopes to cover the most number of cookies online with the least costly Web-Match Partners.
The first step of this would be to emulate something similar to what a company called Aggregate Knowlege (AK) posted:
Essentially we want to find the amount of overlap between every pair of partners that LiveRamp sees. If we find that some partners have a high-percentage of overlap with others, we could make a decision and cut some of our partners.
So I set about my task.
Step 1: Learning about HyperLogLogPlus
Because LiveRamp receives roughly 2 billion requests a day, we can’t simply just track all of those requests in a large hashset or a list as the run-time would simply be too slow. As a result, we generate stats each day with a library (also created by AK) called HyperLogLogPlus. While the math that goes into this is pretty complicated, a rough heuristic for it can be explained as follows.
Let’s say you’re a server and want to count the number of unique IP addresses that you see each day.
1. When an IP address hits your server, use a random hash function to hash the IP address to a range of [0,1].
2. Keep track of the smallest hash that you’ve seen all day.
3. Repeat steps 1 & 2 for the whole day.
Now let’s say at the end of your day, you’ve seen 10 unique IP addresses. You would expect your distribution of hashes to be roughly 0.1, 0.2, 0.3, …, 0.9 (actually this is off by one, but no need to worry about that). Which means if at the end of the day you’re smallest hash was 0.1, you can roughly say that you saw 10 unique IP addresses during the day. Similarly if your smallest number was .01, you saw roughly 100; if it was .005, you saw roughly 200, etc.,
While this variance is really high; you could choose like 16 different hash functions and then for each IP address hash it into 16 different buckets and keep track of the smallest one in each bucket. Then at the end of the day; you can find the average over all of the buckets. This variance is much lower.
While this is not how HyperLogLog is implemented, it does something that is conceptually similar (uses 1 hash function and splits it into 16 buckets based on the first x digits of the hash etc.,). Essentially you get an O(1) estimate of the number of unique items you’ve seen in a stream. To learn more about HyperLogLogPlus, go here.
This is what LiveRamp uses to track the number of unique visits per day. And to create a partner heat-map, we want the percentage of intersection between each pair of partners. This is also quite simple. If each partner’s counter is split into buckets the same way, you can simply merge two counters by finding the smallest number of each bucket – then an average over the buckets is the union. From there an intersection is A + B – (A U B). 🙂 The variance goes up by a good amount, but that’s not a huge deal when we’re simply looking for large overlaps.
Step 2: Hadoop + Cascading
Because LiveRamp has roughly 2000 partners. That leads to 2000 x 2000 = 4,000,000 pairs we need to merge. Boom. MapReduce.
Step 3: But is this that helpful…?
While the partner overlaps are sort of interesting; do they really answer the question we want to ask? Let’s say we have one partner that has a 10% overlap with every other partner. This doesn’t give us any information about the nature of these 10%’s. Are those 10% different 10%’s or are those 10%’s the same 10% of people? This is vastly important because theoretically, it could tell us whether the publisher is 0% helpful or 90% helpful.
As I pondered this problem in the shower, an idea clicked in my head: Set Cover! 🙂
Essentially what LiveRamp is trying to find is how can we cover the most number of browsers with the least number of publishers. This is almost exactly a minimum-set-cover problem. While finding the optimal set cover is not easy – there is a greedy algorithm which takes O(n^2) time and computes a set cover within a factor of log(n) of the optimal solution. Although O(n^2) seems large, creating the partner heat-map also takes O(n^2) (because we literally look at every pair of partners).
The greedy algorithm also allows us to say one more thing: rather than finding an optimal set cover to cover all of the browsers, we can choose a percentage at which we can stop. Presumably every partner is sending us at least ONE unique cookie. This means the optimal set cover is just all of the partners. On the other hand, if we say: “How many publishers do we need to cover 95% of our cookie reach,” that’s a really interesting question. When I came up with this, I was incredibly excited as I just took an algorithms course in the spring and found an application immediately!
Here’s the alg:
1. For each partner not chosen yet, find the one that covers the most cookies that the other partners don’t cover.
2. Add that partner to your cover.
3. Repeat steps 1 & 2 until some percentage of your total coverage is met.
Step 4: Results
I’m not actually going to go into details of all the results as it’s private to LiveRamp, but there’s some things we learned.
1. We generate stats each night for a ton of partners. With 10% of them, we cover up to around 85% of our cookie reach. Whoa. (Also, not all partners are created equal…)
2. When LiveRamp was acquired by Acxiom, we received all the traffic from Acxiom’s partners (they tried to make a competing product). Because Acxiom paid some of the same partners as we did there were a few partners that had an 80% overlap in the heatmap and got chosen to be a part of the set cover much later than their individual cardinalities would have suggested. Also whoa.
I do want to note that after a certain point (pretty early), the error rate is so high this is just not that meaningful anymore…You can read about intersection error rates if you google for “hyperloglog intersection error”.
So ultimately, I had a ton of fun doing this as my last week’s project at LiveRamp. I got to work with Map-Reduce, play around with some big-data, dabble in statistics (e.g. I tried to calculate the variance of the intersections produced by HyperLogLog etc.,), and ultimately produced some meaningful results.
Again, I had a ton of fun and learned a lot at LiveRamp this summer. If anyone would like to check it out, here’s the link again 🙂