
Calculating unique visitors in Hadoop and Hive

September 6, 2011 by Radek Maciaszek

Unique visitors

One of the most important website metrics is the number of unique visitors. However, it is also one of the most difficult to calculate. In this post, I will review a sampling strategy which produces a very good estimate of unique users, yet is computationally cheap.

Non-additive data

It is relatively easy to calculate small numbers of unique visitors: all you need to do is perform a single SQL query.

To calculate the number of unique records in Hive, run the following:

[gist id=1159244]
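In its simplest form this is a plain distinct count; a minimal sketch, assuming the page_views table with an illustrative visitor_id column:

    SELECT COUNT(DISTINCT visitor_id)
    FROM page_views;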

However, once the number of records in the table “page_views” becomes very large, this query may fail with out-of-memory (OOM) errors, because Hive computes an ungrouped COUNT(DISTINCT ...) in a single reducer that must hold every distinct value. If this happens, there are other ways to calculate the exact number of unique visitors, as sketched below. Alternatively, it is possible to generate useful figures by using a sample.
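One such approach (again using the illustrative visitor_id column) is to rewrite the distinct count as a GROUP BY subquery, which Hive can spread across many reducers instead of one:

    SELECT COUNT(*)
    FROM (
        SELECT visitor_id
        FROM page_views
        GROUP BY visitor_id
    ) unique_visitors;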

Sampling

In practice, estimating the unique-visitors metric from a sample gives results very close to the exact count. In our tests on tens of millions of records, the estimates came within 0.1% of the real values. One thing to remember is to ensure you sample visitors and not page views; sampling page views would over-represent heavy users, since a visitor's chance of appearing in the sample would grow with their number of views. The method presented here is simple Bernoulli sampling.

Having a sample can sometimes be even more useful than calculating the exact number. You can build a data warehouse around the sample and slice and dice on unique visitors — something which cannot be done on pre-calculated non-additive data. I will show at the end of this post how to create a cube that can be used to visualise unique visitors data.

Hashing

In order to sample users, we need to select every n-th user at random from the population of records. One way to do this is to compute a hash of the visitor ID for every record using a uniform hashing function (such as MD5). MD5 maps each visitor ID to an effectively uniformly distributed hexadecimal string, so we can keep only those users whose hash ends with an arbitrary suffix, such as ’00’. Because the hashing function is uniform, the probability that a hash ends with ‘0’ is 1/16, and so the probability that it ends with ’00’ is (1/16)^2 = 1/256.

Note that Hive (version 0.7 at the time of writing) does not implement an MD5 function, so feel free to use the following code to add an MD5 hash function to Hive:

[gist id=1050002]
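Once that UDF is compiled into a jar, registering it in Hive looks roughly like this; the jar path and class name below are placeholders, not necessarily those used in the gist:

    -- Register a custom MD5 UDF (hypothetical jar path and class name)
    ADD JAR /path/to/md5-udf.jar;
    CREATE TEMPORARY FUNCTION md5 AS 'com.example.hive.udf.UDFMd5';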

Alternatively, you may patch your Hive distribution with the code from ticket HIVE-1262.

HiveQL

The following query will generate a unique visitors sample:

[gist id=1049918]
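A minimal sketch of such a query, assuming illustrative view_time, browser and visitor_id columns in page_views (an MD5 hash is 32 hex characters, so characters 31–32 are the final two):

    CREATE TABLE visitor_sample AS
    SELECT to_date(view_time)  AS date_key,
           browser             AS browser_key,
           md5(visitor_id)     AS user_hash
    FROM page_views
    -- keep only visitors whose hash ends in '00': roughly 1 in 256
    WHERE substr(md5(visitor_id), 31, 2) = '00';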

Pentaho

There are many other issues around unique visitors, such as how to present non-additive results to the end user. BI tools (such as Pentaho Mondrian) handle this with a distinct-count aggregator:

[gist id=1159282]
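At query time, a distinct-count measure over the sample boils down to SQL along these lines (table and column names as in the sampling sketch above); with a 1-in-256 sample, multiplying the distinct count by 256 estimates the full population:

    SELECT date_key,
           COUNT(DISTINCT user_hash) * 256 AS estimated_unique_visitors
    FROM visitor_sample
    GROUP BY date_key;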

After loading the sample into your aggregate table, OLAP tools will let you report on it much as you would on standard additive data. See below:

[Image: Unique Visitors Cube]

5 Responses to Calculating unique visitors in Hadoop and Hive

  1. Rlharmon says:

    Okay, I understand the hashing and the idea. Just for clarity's sake, are you attaching Pentaho directly to Hive? So the function created is now part of the HQL that is sourcing the Unique Visitors Cube?

    -thanks
    robert

    • Radek Maciaszek says:

      When exposing the data in Pentaho, I always first import the results into a relational database. That is usually still quite a big dataset, so a columnar database, such as Infobright, is a good choice. Pentaho talks directly to Infobright, and Hive is used only to create the sample.
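
      For illustration, a minimal sketch of that hand-off, assuming the visitor_sample table sketched in the post and hypothetical file paths (Hive writes Ctrl-A-delimited text files by default; Infobright accepts MySQL's LOAD DATA syntax):

          -- In Hive: dump the sample to a local directory
          INSERT OVERWRITE LOCAL DIRECTORY '/tmp/visitor_sample'
          SELECT date_key, browser_key, user_hash FROM visitor_sample;

          -- In Infobright: load the exported file (file name is Hive's default)
          LOAD DATA LOCAL INFILE '/tmp/visitor_sample/000000_0'
          INTO TABLE visitor_sample
          FIELDS TERMINATED BY 0x01;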

      Best,
      Radek

      • Rlharmon says:

        Okay, so you create the table in Infobright, Vertica, or whatever. How are you getting the sample information?
        I guess I see the hash function creating a sample data set. So now, how are you adding the browser and date information to that sample data set? Are you doing this as part of a Hive query and then taking the result into your columnar database of choice?

        Is this the actual structure that you are creating in Hive?
        table hive_sample:
        date_key,
        browser_key,
        user_hash

        So, if you wanted uniques over more columns, would you need to extend the above table? What the sampling gives you at that point is simply the same width of table, but thinner:
        date_key,
        browser_key,
        store_key,
        user_hash

        thanks,
        Robert

        • Radek says:

          The sampled data is at the smallest granularity level (individual users). You can easily add any more columns here as required. If you want to count uniques over a specific dimension, you can select distinct records and group by that dimension. So I think you are right in your assumptions above (if I understand the question correctly).
          -Radek
