Tuesday, 2 July 2013

Algorithms every data scientist should know: Reservoir Sampling

Data scientists, that peculiar mix of software engineer and statistician, are notoriously difficult to interview. One approach that is used over the years is to pose a problem that requires some mixture of algorithm design and probability theory in order to come up with an answer. Here’s an example of this type of question that has been popular in Silicon Valley for a number of years:

Say you have a stream of items of large and unknown length that we can only iterate over once. Create an algorithm that randomly chooses an item from this stream such that each item is equally likely to be selected.

http://blog.cloudera.com/blog/2013/04/hadoop-stratified-randosampling-algorithm/
Data science
The first thing to do when you find yourself confronted with such a question is to stay calm. The data scientist who is interviewing you isn’t trying to trick you by asking you to do something that is impossible. In fact, this data scientist is desperate to hire you. She is buried under a pile of analysis requests, her ETL pipeline is broken, and her machine learning model is failing to converge. Her only hope is to hire smart people such as yourself to come in and help. She wants you to succeed.