The latest technology and data news, analysis and ideas from the DataMine Lab blog

Blog

YCSB run against HBase 0.92 on Amazon Elastic MapReduce
September 16, 2012 by Krystian Nowak

In this post we show, in a few simple steps, how to use the Yahoo! Cloud Serving Benchmark (YCSB, https://github.com/dataminelab/YCSB) to benchmark an HBase 0.92 cluster deployed automatically by Amazon Elastic MapReduce (EMR), and what measurements and comparisons you can obtain when choosing among the available instance types.

We will create EMR HBase clusters using the tooling provided by Amazon:
http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip

Note: As you can see in commands.rb, default_hadoop_version is set to 0.20(.x), but our tests found that Hadoop 1.0.3 gives a significant performance gain. Therefore, when creating the EMR cluster, we set this version explicitly.

Let’s create one:

elastic-mapreduce --create \
--hbase \
--name "EMR HBase YCSB" \
--num-instances 2 \
--instance-type m1.large \
--hadoop-version 1.0.3
Created job flow j-1PP3JU6UJ0HQ1

elastic-mapreduce --list --active
j-1PP3JU6UJ0HQ1     WAITING
ec2-23-22-19-48.compute-1.amazonaws.com          EMR HBase YCSB
 COMPLETED      Start HBase
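
If you script the steps that follow, it helps to pull the master's public DNS name out of the listing instead of copying it by hand. A minimal sketch, assuming the listing contains a hostname of the form shown above; `master_host` is our own hypothetical helper, not part of the EMR tooling:

```shell
# Hypothetical helper: print the first EC2 public DNS name found on stdin.
# Assumes hostnames look like ec2-<ip-with-dashes>.<zone>.amazonaws.com,
# as in the listing above.
master_host() {
  grep -Eo 'ec2-[0-9-]+\.[a-z0-9-]+\.amazonaws\.com' | head -n 1
}

# Usage (not run here):
#   MASTER=$(elastic-mapreduce --list --active | master_host)
```

The result can then be substituted for the hostname in the scp and ssh commands.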

Build the project (the HBase master server settings should now default to localhost, 127.0.0.1).

git clone git@github.com:dataminelab/YCSB.git
cd YCSB
export MAVEN_OPTS="-Xmx512m -Xms128m -Xss2m"

(See http://jira.codehaus.org/browse/MASSEMBLY-549 for why these options are needed.)

mvn clean install -Dcheckstyle.skip=true
cd distribution/target
scp -i ~/.ssh/dataminelab-ec2.pem ycsb-0.1.5-SNAPSHOT.tar.gz \
hadoop@ec2-23-22-19-48.compute-1.amazonaws.com:/home/hadoop/ycsb.tar.gz 
ssh -i ~/.ssh/dataminelab-ec2.pem \
hadoop@ec2-23-22-19-48.compute-1.amazonaws.com
tar xvzf ycsb.tar.gz
ln -s ycsb-0.1.5-SNAPSHOT ycsb
cd ycsb

Create the working table in HBase (already pre-split):

hbase org.apache.hadoop.hbase.util.RegionSplitter usertable -c 200 -f family

The split cannot be perfect, because https://issues.apache.org/jira/browse/HBASE-4163 is still not in place. Please vote for it! :)
But it still seems to be better than no split at all!

You might spot an error like this in the output:

12/08/25 13:39:16 ERROR metrics.MetricsSaver:
Failed SaveRecords hdfs:/mnt/var/lib/hadoop/metrics/raw/i-694c4712_04272_raw.bin
Shutdown in progress

as described in https://forums.aws.amazon.com/thread.jspa?threadID=100643, but it doesn’t seem to hurt us.

You can list the regions that were created from the HBase shell:

hbase shell
scan '.META.', {COLUMNS => 'info:regioninfo'}
exit

Load the initial data into HBase:

./bin/ycsb load hbase -p columnfamily=family -P workloads/workloada | tee load.log

Check with your own eyes that the data has been loaded into HBase:

hbase shell

hbase(main):001:0> count 'usertable'
Current count: 1000, row: user995698996184959679
1000 row(s) in 2.3210 seconds
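
The same check can be scripted. A minimal sketch, assuming the shell prints a summary line of the form shown above; `row_count` is a hypothetical helper name of our own:

```shell
# Hypothetical helper: read HBase shell output on stdin and print the
# number from the last "N row(s) in ..." summary line.
row_count() {
  awk '/^[0-9]+ row\(s\) in / { n = $1 } END { if (n != "") print n }'
}

# Usage (not run here):
#   echo "count 'usertable'" | hbase shell | row_count
```

Comparing the printed number with the record count you expected to load is a quick sanity check before starting the benchmark runs.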

And run the tests – only as a warm-up:

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=10000 \
-s \
-threads 10 | tee warm-up-tests.log

And now the real tests with 10 threads:

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 10 | tee real-tests-workload-a.log

cat real-tests-workload-a.log

[OVERALL], RunTime(ms), 47132.0
[OVERALL], Throughput(ops/sec), 2121.700755325469
[UPDATE], Operations, 50209
[UPDATE], AverageLatency(us), 186.93305980999423
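
Since every run ends with summary lines like these, a small helper can pull a single figure out of a log. This is only a sketch, assuming the "[SECTION], Metric, value" layout shown above; `ycsb_metric` is a hypothetical name, not something shipped with YCSB:

```shell
# Hypothetical helper: print the value of one YCSB summary line.
# Usage: ycsb_metric SECTION METRIC LOGFILE
# Assumes lines of the form "[SECTION], Metric, value".
ycsb_metric() {
  awk -F', ' -v section="[$1]" -v metric="$2" \
    '$1 == section && $2 == metric { print $3 }' "$3"
}

# Usage (not run here):
#   ycsb_metric OVERALL 'Throughput(ops/sec)' real-tests-workload-a.log
#   ycsb_metric UPDATE 'AverageLatency(us)' real-tests-workload-a.log
```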

And again with 10 threads, but for a different workload type (workload f):

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s -threads 10 | tee real-tests-workload-f.log
cat real-tests-workload-f.log

[OVERALL], RunTime(ms), 52748.0
[OVERALL], Throughput(ops/sec), 1895.8064760749223
[UPDATE], Operations, 50018
[UPDATE], AverageLatency(us), 11.925006997480907

Now we can check how these workload scenarios behave as the number of threads increases, starting with 100 threads.

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 100 | tee real-tests-workload-a-100t.log
cat real-tests-workload-a-100t.log

[OVERALL], RunTime(ms), 24234.0
[OVERALL], Throughput(ops/sec), 4126.433935792688
[UPDATE], Operations, 50063
[UPDATE], AverageLatency(us), 1076.5547010766434

500 threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 500 | tee real-tests-workload-a-500t.log
cat real-tests-workload-a-500t.log

[OVERALL], RunTime(ms), 20706.0
[OVERALL], Throughput(ops/sec), 4829.518014102193
[UPDATE], Operations, 50099
[UPDATE], AverageLatency(us), 6167.192359128925

1000 threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 1000 | tee real-tests-workload-a-1kt.log
cat real-tests-workload-a-1kt.log

[OVERALL], RunTime(ms), 21484.0
[OVERALL], Throughput(ops/sec), 4654.626698938745
[UPDATE], Operations, 49988
[UPDATE], AverageLatency(us), 9423.208390013604

2000 threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 2000 | tee real-tests-workload-a-2kt.log
cat real-tests-workload-a-2kt.log

[OVERALL], RunTime(ms), 24358.0
[OVERALL], Throughput(ops/sec), 4105.427374989737
[UPDATE], Operations, 49957
[UPDATE], AverageLatency(us), 7786.985767760274
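
The runs above differ only in the thread count, so they can also be driven by a loop. A minimal sketch under our own assumptions: `sweep_threads` and the `YCSB` override variable are hypothetical, and the log file names follow a slightly different pattern from the ones used in this post:

```shell
# Hypothetical wrapper: run one workload at several thread counts.
# YCSB defaults to ./bin/ycsb; set YCSB=echo for a dry run.
# Log names (real-tests-workload-<letter>-<threads>t.log) are our own choice.
# Usage: sweep_threads WORKLOAD_LETTER THREADS...
sweep_threads() {
  workload="$1"
  shift
  for threads in "$@"; do
    "${YCSB:-./bin/ycsb}" run hbase \
      -P "workloads/workload${workload}" \
      -p columnfamily=family \
      -p operationcount=100000 \
      -s \
      -threads "$threads" | tee "real-tests-workload-${workload}-${threads}t.log"
  done
}

# Usage (not run here):
#   sweep_threads a 100 500 1000 2000
```

Running the sweep sequentially, as here, keeps the runs from competing with each other for the cluster.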

And the same for the other workload scenario now:
100 threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 100 | tee real-tests-workload-f-100t.log
cat real-tests-workload-f-100t.log

[OVERALL], RunTime(ms), 33924.0
[OVERALL], Throughput(ops/sec), 2947.7655936799906
[UPDATE], Operations, 50136
[UPDATE], AverageLatency(us), 17.44125977341631

1000 threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 1000 | tee real-tests-workload-f-1kt.log
cat real-tests-workload-f-1kt.log

[OVERALL], RunTime(ms), 29309.0
[OVERALL], Throughput(ops/sec), 3411.921252857484
[UPDATE], Operations, 50127
[UPDATE], AverageLatency(us), 16.611586570111914

2000 threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 2000 | tee real-tests-workload-f-2kt.log
cat real-tests-workload-f-2kt.log

[OVERALL], RunTime(ms), 29311.0
[OVERALL], Throughput(ops/sec), 3411.688444611238
[UPDATE], Operations, 49951
[UPDATE], AverageLatency(us), 59.80148545574663

3000 threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 3000 | tee real-tests-workload-f-3kt.log
cat real-tests-workload-f-3kt.log

[OVERALL], RunTime(ms), 32314.0
[OVERALL], Throughput(ops/sec), 3063.6875657609703
[UPDATE], Operations, 49492
[UPDATE], AverageLatency(us), 20.00127293299927

4000 threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 4000 | tee real-tests-workload-f-4kt.log
cat real-tests-workload-f-4kt.log

[OVERALL], RunTime(ms), 35051.0
[OVERALL], Throughput(ops/sec), 2852.985649482183
[UPDATE], Operations, 50095
[UPDATE], AverageLatency(us), 38.50611837508733
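
With this many logs it is easy to lose track of which run did best. The sketch below picks the log with the highest overall throughput; it assumes the summary line format shown above, and `best_throughput` is a hypothetical helper of our own:

```shell
# Hypothetical helper: among the given YCSB logs, print the file name and
# value of the highest "[OVERALL], Throughput(ops/sec)" line, tab-separated.
best_throughput() {
  awk -F', ' '
    $1 == "[OVERALL]" && $2 == "Throughput(ops/sec)" && ($3 + 0) > best {
      best = $3 + 0; file = FILENAME; raw = $3
    }
    END { if (file != "") printf "%s\t%s\n", file, raw }
  ' "$@"
}

# Usage (not run here):
#   best_throughput real-tests-workload-f*.log
```

For the workload f logs collected on this cluster, the highest figure is the 1000-thread run at about 3412 ops/sec.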

Let’s now try more instances: 4 slaves instead of just one, of the same type as before.

elastic-mapreduce --create \
--hbase \
--name "EMR HBase YCSB" \
--num-instances 5 \
--instance-type m1.large \
--hadoop-version 1.0.3
Created job flow j-OE7G6YUHMD2I

elastic-mapreduce --list --active
j-OE7G6YUHMD2I      WAITING
ec2-50-17-100-242.compute-1.amazonaws.com         EMR HBase YCSB
COMPLETED      Start HBase

Now just copy the already built test suite:

scp -i ~/.ssh/dataminelab-ec2.pem ycsb-0.1.5-SNAPSHOT.tar.gz \
hadoop@ec2-50-17-100-242.compute-1.amazonaws.com:/home/hadoop/ycsb.tar.gz
ssh -i ~/.ssh/dataminelab-ec2.pem \
hadoop@ec2-50-17-100-242.compute-1.amazonaws.com

tar xvzf ycsb.tar.gz
ln -s ycsb-0.1.5-SNAPSHOT ycsb
cd ycsb

Initialize table:

hbase org.apache.hadoop.hbase.util.RegionSplitter usertable -c 200 -f family

Load initial data:

./bin/ycsb load hbase \
-p columnfamily=family \
-P workloads/workloada | tee load.log

And run tests:
warm-up

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=10000 \
-s \
-threads 10 | tee warm-up-tests.log

10 threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 10 | tee real-tests-workload-a.log
cat real-tests-workload-a.log

[OVERALL], RunTime(ms), 42609.0
[OVERALL], Throughput(ops/sec), 2346.9220117815485
[UPDATE], Operations, 50073
[UPDATE], AverageLatency(us), 117.53685618996265

100 threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 100 | tee real-tests-workload-a-100t.log
cat real-tests-workload-a-100t.log

[OVERALL], RunTime(ms), 23500.0
[OVERALL], Throughput(ops/sec), 4255.31914893617
[UPDATE], Operations, 49837
[UPDATE], AverageLatency(us), 1089.7759295302687

500 threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 500 | tee real-tests-workload-a-500t.log
cat real-tests-workload-a-500t.log

[OVERALL], RunTime(ms), 19763.0
[OVERALL], Throughput(ops/sec), 5059.960532307848
[UPDATE], Operations, 50196
[UPDATE], AverageLatency(us), 4854.259104311101

1000 threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 1000 | tee real-tests-workload-a-1kt.log
cat real-tests-workload-a-1kt.log

[OVERALL], RunTime(ms), 20028.0
[OVERALL], Throughput(ops/sec), 4993.0097862991815
[UPDATE], Operations, 49904
[UPDATE], AverageLatency(us), 9582.977617024688

2000 threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 2000 | tee real-tests-workload-a-2kt.log
cat real-tests-workload-a-2kt.log

[OVERALL], RunTime(ms), 22608.0
[OVERALL], Throughput(ops/sec), 4423.2130219391365
[UPDATE], Operations, 49988
[UPDATE], AverageLatency(us), 6244.29357045691

5000 threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 5000 | tee real-tests-workload-a-5kt.log
cat real-tests-workload-a-5kt.log

[OVERALL], RunTime(ms), 24861.0
[OVERALL], Throughput(ops/sec), 4022.3643457624394
[UPDATE], Operations, 50100
[UPDATE], AverageLatency(us), 8150.377125748503

10k threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 10000 | tee real-tests-workload-a-10kt.log
cat real-tests-workload-a-10kt.log

[OVERALL], RunTime(ms), 25336.0
[OVERALL], Throughput(ops/sec), 3946.9529523208084
[UPDATE], Operations, 50176
[UPDATE], AverageLatency(us), 8851.578204719388

workload f, 10 threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 10 | tee real-tests-workload-f.log
cat real-tests-workload-f.log

[OVERALL], RunTime(ms), 53310.0
[OVERALL], Throughput(ops/sec), 1875.8206715438005
[UPDATE], Operations, 49867
[UPDATE], AverageLatency(us), 12.18058034371428

100 threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 100 | tee real-tests-workload-f-100t.log
cat real-tests-workload-f-100t.log

[OVERALL], RunTime(ms), 30991.0
[OVERALL], Throughput(ops/sec), 3226.7432480397533
[UPDATE], Operations, 50145
[UPDATE], AverageLatency(us), 13.73040183467943

1k threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 1000 | tee real-tests-workload-f-1kt.log
cat real-tests-workload-f-1kt.log

[OVERALL], RunTime(ms), 29185.0
[OVERALL], Throughput(ops/sec), 3426.4176803152304
[UPDATE], Operations, 50047
[UPDATE], AverageLatency(us), 29.82979998801127

2k threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 2000 | tee real-tests-workload-f-2kt.log
cat real-tests-workload-f-2kt.log

[OVERALL], RunTime(ms), 31906.0
[OVERALL], Throughput(ops/sec), 3134.206732276061
[UPDATE], Operations, 50111
[UPDATE], AverageLatency(us), 24.55253337590549

3k threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 3000 | tee real-tests-workload-f-3kt.log
cat real-tests-workload-f-3kt.log

[OVERALL], RunTime(ms), 34410.0
[OVERALL], Throughput(ops/sec), 2877.070619006103
[UPDATE], Operations, 49607
[UPDATE], AverageLatency(us), 23.37424153849255

Now let’s see how the even more serious instances offered by AWS behave in this scenario!
m1.xlarge (2 x the memory and 2 x the CPU of m1.large)

elastic-mapreduce --create \
--hbase \
--name "EMR HBase YCSB" \
--num-instances 5 \
--instance-type m1.xlarge \
--hadoop-version 1.0.3
Created job flow j-2ICBS9029MJAV

./elastic-mapreduce --list --active
j-2ICBS9029MJAV      WAITING
ec2-107-21-130-111.compute-1.amazonaws.com         EMR HBase YCSB
COMPLETED      Start HBase

scp -i ~/.ssh/dataminelab-ec2.pem ycsb-0.1.5-SNAPSHOT.tar.gz \
hadoop@ec2-107-21-130-111.compute-1.amazonaws.com:/home/hadoop/ycsb.tar.gz
ssh -i ~/.ssh/dataminelab-ec2.pem \
hadoop@ec2-107-21-130-111.compute-1.amazonaws.com

tar xvzf ycsb.tar.gz
ln -s ycsb-0.1.5-SNAPSHOT ycsb
cd ycsb

hbase org.apache.hadoop.hbase.util.RegionSplitter usertable -c 200 -f family

./bin/ycsb load hbase \
-p columnfamily=family \
-P workloads/workloada | tee load.log

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=10000 \
-s \
-threads 10 | tee warm-up-tests.log

10 threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 10 | tee real-tests-workload-a.log
cat real-tests-workload-a.log

[OVERALL], RunTime(ms), 39481.0
[OVERALL], Throughput(ops/sec), 2532.8639092221574
[UPDATE], Operations, 49981
[UPDATE], AverageLatency(us), 62.85440467377604

100 threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 100 | tee real-tests-workload-a-100t.log
cat real-tests-workload-a-100t.log

[OVERALL], RunTime(ms), 17877.0
[OVERALL], Throughput(ops/sec), 5593.779716954747
[UPDATE], Operations, 50100
[UPDATE], AverageLatency(us), 640.4568662674651

1k threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s -threads 1000 | tee real-tests-workload-a-1kt.log
cat real-tests-workload-a-1kt.log

[OVERALL], RunTime(ms), 13986.0
[OVERALL], Throughput(ops/sec), 7150.00715000715
[UPDATE], Operations, 49750
[UPDATE], AverageLatency(us), 8759.566291457286

2k threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 2000 | tee real-tests-workload-a-2kt.log
cat real-tests-workload-a-2kt.log

[OVERALL], RunTime(ms), 14783.0
[OVERALL], Throughput(ops/sec), 6764.526821348847
[UPDATE], Operations, 50118
[UPDATE], AverageLatency(us), 26718.534857735744

3k threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 3000 | tee real-tests-workload-a-3kt.log
cat real-tests-workload-a-3kt.log

[OVERALL], RunTime(ms), 15477.0
[OVERALL], Throughput(ops/sec), 6396.588486140725
[UPDATE], Operations, 49465
[UPDATE], AverageLatency(us), 12066.01403012231

4k threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 4000 | tee real-tests-workload-a-4kt.log
cat real-tests-workload-a-4kt.log

[OVERALL], RunTime(ms), 15261.0
[OVERALL], Throughput(ops/sec), 6552.650547146321
[UPDATE], Operations, 49883
[UPDATE], AverageLatency(us), 22551.664294449012

another workload, 10 threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 10 | tee real-tests-workload-f.log
cat real-tests-workload-f.log

[OVERALL], RunTime(ms), 45751.0
[OVERALL], Throughput(ops/sec), 2185.744573889095
[UPDATE], Operations, 49950
[UPDATE], AverageLatency(us), 9.801721721721721

500 threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 500 | tee real-tests-workload-f-500t.log
cat real-tests-workload-f-500t.log

[OVERALL], RunTime(ms), 21870.0
[OVERALL], Throughput(ops/sec), 4572.473708276178
[UPDATE], Operations, 49678
[UPDATE], AverageLatency(us), 11.18187125085551

1k threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 1000 | tee real-tests-workload-f-1kt.log
cat real-tests-workload-f-1kt.log

[OVERALL], RunTime(ms), 19207.0
[OVERALL], Throughput(ops/sec), 5206.435153850159
[UPDATE], Operations, 49879
[UPDATE], AverageLatency(us), 11.812406022574631

2k threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 2000 | tee real-tests-workload-f-2kt.log
cat real-tests-workload-f-2kt.log

[OVERALL], RunTime(ms), 20493.0
[OVERALL], Throughput(ops/sec), 4879.715024642561
[UPDATE], Operations, 50114
[UPDATE], AverageLatency(us), 12.770423434569182

And now, more CPU power!
c1.xlarge (same memory, 5 x more CPU than m1.large)

elastic-mapreduce --create \
--hbase \
--name "EMR HBase YCSB" \
--num-instances 5 \
--instance-type c1.xlarge \
--hadoop-version 1.0.3
Created job flow j-3KZHQRG2D74AY

./elastic-mapreduce --list --active
j-3KZHQRG2D74AY     WAITING
ec2-75-101-255-226.compute-1.amazonaws.com          EMR HBase YCSB
COMPLETED      Start HBase

scp -i ~/.ssh/dataminelab-ec2.pem ycsb-0.1.5-SNAPSHOT.tar.gz \
hadoop@ec2-75-101-255-226.compute-1.amazonaws.com:/home/hadoop/ycsb.tar.gz
ssh -i ~/.ssh/dataminelab-ec2.pem \
hadoop@ec2-75-101-255-226.compute-1.amazonaws.com

tar xvzf ycsb.tar.gz
ln -s ycsb-0.1.5-SNAPSHOT ycsb
cd ycsb

hbase org.apache.hadoop.hbase.util.RegionSplitter usertable -c 200 -f family

./bin/ycsb load hbase \
-p columnfamily=family \
-P workloads/workloada | tee load.log

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=10000 \
-s \
-threads 10 | tee warm-up-tests.log

10 threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 10 | tee real-tests-workload-a.log
cat real-tests-workload-a.log

[OVERALL], RunTime(ms), 32121.0
[OVERALL], Throughput(ops/sec), 3113.228106223343
[UPDATE], Operations, 49973
[UPDATE], AverageLatency(us), 71.10029415884577

100 threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 100 | tee real-tests-workload-a-100t.log
cat real-tests-workload-a-100t.log

[OVERALL], RunTime(ms), 15076.0
[OVERALL], Throughput(ops/sec), 6633.059166887769
[UPDATE], Operations, 50167
[UPDATE], AverageLatency(us), 644.8327187194769

1k threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 1000 | tee real-tests-workload-a-1kt.log
cat real-tests-workload-a-1kt.log

[OVERALL], RunTime(ms), 12864.0
[OVERALL], Throughput(ops/sec), 7773.63184079602
[UPDATE], Operations, 50240
[UPDATE], AverageLatency(us), 9889.390306528663

2k threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 2000 | tee real-tests-workload-a-2kt.log
cat real-tests-workload-a-2kt.log

[OVERALL], RunTime(ms), 14889.0
[OVERALL], Throughput(ops/sec), 6716.367788300087
[UPDATE], Operations, 50216
[UPDATE], AverageLatency(us), 41222.41986617811

3k threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 3000 | tee real-tests-workload-a-3kt.log
cat real-tests-workload-a-3kt.log

[OVERALL], RunTime(ms), 14461.0
[OVERALL], Throughput(ops/sec), 6845.9995850909345
[UPDATE], Operations, 49451
[UPDATE], AverageLatency(us), 51852.53568178601

5k threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 5000 | tee real-tests-workload-a-5kt.log
cat real-tests-workload-a-5kt.log

[OVERALL], RunTime(ms), 17072.0
[OVERALL], Throughput(ops/sec), 5857.544517338331
[UPDATE], Operations, 49835
[UPDATE], AverageLatency(us), 82378.54861041436

10k threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloada \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 10000 | tee real-tests-workload-a-10kt.log
cat real-tests-workload-a-10kt.log

[OVERALL], RunTime(ms), 20226.0
[OVERALL], Throughput(ops/sec), 4944.131316127757
[UPDATE], Operations, 50113
[UPDATE], AverageLatency(us), 49147.25219005049

another workload, 10 threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 10 | tee real-tests-workload-f.log
cat real-tests-workload-f.log

[OVERALL], RunTime(ms), 40801.0
[OVERALL], Throughput(ops/sec), 2450.920320580378
[UPDATE], Operations, 49966
[UPDATE], AverageLatency(us), 12.13715326421967

400 threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 400 | tee real-tests-workload-f-400t.log
cat real-tests-workload-f-400t.log

[OVERALL], RunTime(ms), 17856.0
[OVERALL], Throughput(ops/sec), 5600.358422939068
[UPDATE], Operations, 50071
[UPDATE], AverageLatency(us), 14.301591739729584

500 threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 500 | tee real-tests-workload-f-500t.log
cat real-tests-workload-f-500t.log

[OVERALL], RunTime(ms), 17909.0
[OVERALL], Throughput(ops/sec), 5583.784689262382
[UPDATE], Operations, 50210
[UPDATE], AverageLatency(us), 16.105915156343357

1k threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 1000 | tee real-tests-workload-f-1kt.log
cat real-tests-workload-f-1kt.log

[OVERALL], RunTime(ms), 16982.0
[OVERALL], Throughput(ops/sec), 5888.5879166175955
[UPDATE], Operations, 50088
[UPDATE], AverageLatency(us), 15.313268647180962

2k threads

./bin/ycsb run hbase \
-p columnfamily=family \
-P workloads/workloadf \
-p columnfamily=family \
-p operationcount=100000 \
-s \
-threads 2000 | tee real-tests-workload-f-2kt.log
cat real-tests-workload-f-2kt.log

[OVERALL], RunTime(ms), 17219.0
[OVERALL], Throughput(ops/sec), 5807.538184563564
[UPDATE], Operations, 49989
[UPDATE], AverageLatency(us), 17.61469523295125

Even with these simple scenarios, we can see how, for a given configuration, the number of threads influences the throughput for each workload type:

  • workload a: [chart: throughput vs. number of threads]
  • workload f: [chart: throughput vs. number of threads]
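
To put a number on the difference between instance types, the throughput figures from the logs can be divided directly. A minimal sketch using the workload a results at 1000 threads on the 5-instance clusters reported above; `speedup` is a hypothetical helper name:

```shell
# Hypothetical helper: ratio of two throughput figures, two decimal places.
# Usage: speedup FASTER BASELINE
speedup() {
  awk -v faster="$1" -v baseline="$2" 'BEGIN { printf "%.2f\n", faster / baseline }'
}

# Workload a, 1000 threads, 5 instances (figures taken from the logs above):
speedup 7150.00715000715 4993.0097862991815   # m1.xlarge vs m1.large, prints 1.43
speedup 7773.63184079602 4993.0097862991815   # c1.xlarge vs m1.large, prints 1.56
```

These are single runs, so the ratios are indicative only.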

You can now play with other instance types and cluster sizes. You can also run the YCSB benchmark code from multiple nodes at once and watch for possible saturation, whether in the master’s CPU or in the network layer.

We also invite you to play with the code or even contribute features and improvements, so that others can benefit from them too – have fun!

BigData events
May 4, 2012 by Radek Maciaszek

We are observing an explosion of BigData events. While half a year ago London hosted maybe one interesting meetup a month, nowadays there is rarely a week without a few of them. Supply is keeping up with demand.

There is an increasing number of monthly meetups: BigData London, HUG UK, Data Science London, London R, Cassandra London, Neo4j London, London MongoDB User Group, Oracle BigData, Data Visualisation London, Big Data Debate, DeNormalised London, LonData, CloudComputing.

Upcoming conferences that are worth mentioning:

We just had a London BigData week that was full of meetings and hackathons dedicated to Hadoop, Visualisations and NoSQL. In case you missed the last Big Data week, you are in for a treat: simply like us on Facebook for a chance to win one ticket (worth £495) for 3 days of SkillsMatter NoSQL tutorials.

There are also a few online places where every data scientist can improve or challenge their skills:

If you know of anything interesting coming up in London, let us know in the comments.

R Analytics in the Cloud
November 21, 2011 by Radek Maciaszek

Last week I was invited to Big Data London to talk about “R Analytics in the Cloud”. As a case study, I presented the ageing project I’ve been working on as part of my Masters studies at Birkbeck, University of London. Ageing is one of the fundamental mysteries in biology and many scientists are already studying this process. I am excited to be part of the research group led by Eugene Schuster at UCL Institute of Healthy Ageing. This project has also given me the chance to use some of my Hadoop experience in the academic field.

Bioinformatics is the science of applying information technology to biology in order to understand the latter. There are numerous ways in which computers can aid biologists. In this particular project, we have been using microarrays to find the connection between different genes. The use of microarray technologies has enabled us to detect changes to gene expression across the genome in thousands of experiments with hundreds of species. However, interpreting the changes identified in these experiments has been hampered by a lack of knowledge of the gene function. Even in highly studied genomes, approximately 50-60% of genes will be assigned functions, yet less than 30% will be annotated with a highly specific function. Little of the annotation will have been observed in experiments conducted with the species of interest, as most gene function annotation is based on annotations assigned to orthologous genes taken from experiments done with other species, such as yeast and mammalian cell culture.

We are interested in building a better understanding of gene function in the worm C. elegans by harnessing the large quantity of experimental microarray data in the public database. Currently, we have a database of over fifty curated experiments. With this, we attempt to assign putative functions to genes based on the expression profile across experiments in the public repositories. My role in this project is to help expand the number of curated experiments in the database and study the functions of approximately 1000 genes known to be regulated in long-lived worms, to try to understand the functions of these genes, e.g. by showing experimental evidence of a role in nutrient sensing, innate immunity or stress response.

Here are the slides from the presentation. Refer to slides 10 and 11 to see how to migrate your R application to the cloud in just 3 lines of code:

Oh, and did I mention how cool our lab is? Have a look at the following ad, which was made at UCL just a couple of metres from my desk.

Full disclosure: DataMine Lab is in no way affiliated with Birkbeck or UCL and the above project is part of my individual bioinformatics studies.