I recently finished my Ph.D in Applied Mathematics here in Boulder and have returned to work with the Spot Influence team. Maybe I should have taken more than a week off in between to travel Europe or follow Phish for a summer, but there is too much exciting stuff happening at Spot that I couldn’t wait to get back in it. Before I let the memories of my dissertation fade and get replaced by the constant flow of new and exciting technological knowledge we use here, let me tell you a bit about it.
My thesis work focused on randomized techniques in numerical linear algebra (nla). The goal was to develop methods that are fast and accurate, that work well on distributed systems, and that can operate on the scale needed to process tons of large noisy data. Traditionally, there has always been an amount of randomness in nla; iterative methods might use a random starting vector or roundoff error produces slight random perturbations that steer methods away from singularity, but this randomness was not fully exploited. The basic observation of our randomized methods is to realize that if you draw random vectors from a high dimensional space they not only are linearly independent, but they also maintain a certain quantifiable distance from each other. In other words, if you stack the random vectors into a matrix the smallest singular value is bounded from below.
Building on these properties of random matrices, we were able to construct factorization methods that only require access the input matrix a constant number of times. In other words, if you want 50 eigenvectors or even 500, you only need to access the matrix a few times, sometimes as few as one. This property is extremely useful when dealing with large matrices that cannot fit in RAM since it minimizes the bottleneck of data transfer from disk. For distributed computing, where the cost of data i/o is cut down by using many machines, the randomized methods provide an additional benefit of blocked operations and bulk computation.
These methods and their suitability for massive scale distributed matrix factorization are no secret either! The MapReduce machine learning library Mahout incorporated a variant of these randomized algorithms, the Stochastic Singular Value Decomposition, into their 0.5 release and are continuing to add functionality as they roll out 0.6 that includes a Principal Component Analysis option. Deploying the algorithm on Amazon’s EMR I was able to compute 100′s of singular triplets of a matrix whose dimensions exceeded 35,000,000. This is by no means a limit to the size of data that can be processed, but simply a limit to my graduate student budget that I was working on $$.
If you want to know more you can find my thesis here. Also check out Mahout if you need to svd some impossibly large data.









