Jun 27, 2016

Neo4j Cluster Performance Tuning

Neo4j is one of the leading graph databases these days, and it is very popular in recommendation systems, fraud detection and social network scenarios.

While a single instance (the setup included in the community edition) performs very well (usually with response times under 10ms), you may face challenges in cluster mode.

Why Should You Expect Performance Degradation in a Neo4j Cluster?
Two simple reasons:
  • A Neo4j cluster is a master-slave cluster with automatic failover (much like MongoDB). However, unlike MongoDB, the primary node is not detected by the client's driver but by a server-side load balancer.
  • Cluster replication is synchronous by default, unlike MongoDB's asynchronous default behavior.
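To make the master detection point concrete, here is a minimal sketch of what a client could do on its own instead of relying on a load balancer. It probes the same /db/manage/server/ha/master endpoint used in the measurements at the bottom of this post, assuming that endpoint answers with the body true only on the master. The host names, the 1 second timeout and the function names are hypothetical, and authentication is ignored.

```shell
# Hypothetical helper: succeeds only if the given host reports itself as master.
# Assumes the HA status endpoint on port 7474 returns the body "true" on the master.
is_master() {
  # -f makes curl fail on HTTP errors, so a non-master yields an empty body
  [ "$(curl -s -f --max-time 1 "http://$1:7474/db/manage/server/ha/master")" = "true" ]
}

# Hypothetical helper: print the first host that reports itself as master.
# Returns non-zero if none of the given hosts does.
find_master() {
  for host in "$@"; do
    if is_master "$host"; then
      echo "$host"
      return 0
    fi
  done
  return 1
}
```

A call such as find_master neo4j-a neo4j-b neo4j-c (placeholder host names) prints the host to send writes to. A real driver would also cache the answer and probe again after a failover.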
How Much Will It Cost Us?
  • The cluster nodes should sit behind a load balancer (LB). If you select AWS ELB, it will cost you 7 to 30ms according to our measurements below. The ELB latency increases as requests and responses become larger (see details at the bottom). Note: implementing a MongoDB-like driver could be a great improvement that saves this latency and minimizes system cost. It is also a great idea for a side project!
  • The nodes behind the ELB replicate changes from the master to the slaves. The level of synchronization is controlled by the ha.tx_push_factor parameter, which has a default value of 1. This parameter sets the number of slaves that must receive a commit before the master answers the client. By setting it to 0, you avoid synchronous replication and get results similar to a single node. Changing the factor will save 70ms on average (and much more at peak time), and will leave us with an average of 40ms per query (including the ELB cost).
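As a rough sketch of the change described above, the helper below rewrites the ha.tx_push_factor line in a Neo4j configuration file. The function name is hypothetical, and the file you pass in is an assumption: the name and location of the HA configuration file depend on your Neo4j version and installation.

```shell
# Hypothetical helper: set ha.tx_push_factor in the given Neo4j config file.
# Usage: set_push_factor <config-file> <factor>
set_push_factor() {
  conf_file="$1"
  factor="$2"
  tmp_file="$(mktemp)"
  # Keep every line except an existing ha.tx_push_factor setting
  grep -v '^ha\.tx_push_factor=' "$conf_file" > "$tmp_file" || true
  # Append the new value
  echo "ha.tx_push_factor=$factor" >> "$tmp_file"
  # Write back in place so the original file's owner and permissions are kept
  cat "$tmp_file" > "$conf_file"
  rm -f "$tmp_file"
}
```

For example, set_push_factor /path/to/your/neo4j/config 0 switches to asynchronous replication. Expect to restart the instance for the change to take effect, and remember the trade-off: with a factor of 0 a commit is acknowledged before any slave has received it.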
You can find the differences below. In the tested environment, a community edition instance was replaced by a 3-node cluster behind an ELB:
  • In the left section you can see an average of 9ms in the initial state (single community edition instance).
  • In the middle you can see a fluctuating response time of 40ms (reads) to 300ms (writes) for a 3-node cluster behind an ELB with ha.tx_push_factor at its default value of 1.
  • In the right section you can see a steady 40ms for both reads and writes for a 3-node cluster behind an ELB with ha.tx_push_factor set to 0 (asynchronous replication).
Neo4j performance as measured by DataDog client side metrics
Bottom Line
HA has its cost. A better implementation of load balancing and the right selection of synchronization model can help you gain the needed performance.

Keep Performing,

Measure from Inside the Server:
> mytime="$(time ( curl http://localhost:7474/db/manage/server/ha/master ) 2>&1 1>/dev/null )"
> echo "$mytime"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100     4    0     4    0     0    581      0 --:--:-- --:--:-- --:--:--   666

real    0m0.006s
user    0m0.005s
sys     0m0.000s

Measure from the Application Server Direct to Master:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100     4    0     4    0     0    988      0 --:--:-- --:--:-- --:--:--  1333

real    0m0.006s
user    0m0.000s
sys     0m0.000s

Measure from the Application Server to Master through the ELB:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100     4    0     4    0     0    417      0 --:--:-- --:--:-- --:--:--   444

real    0m0.015s
user    0m0.000s
sys     0m0.000s
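
The three measurements above are single requests. A slightly more robust sketch repeats the request and averages the total time that curl reports through its -w '%{time_total}' option. The function names, the default of 20 samples and any URL you pass in are hypothetical choices, not part of the original test.

```shell
# Print the total time of one request in seconds.
measure_once() {
  curl -s -o /dev/null -w '%{time_total}\n' "$1"
}

# Read one time in seconds per line on stdin, print the average in milliseconds.
avg_ms() {
  awk '{ sum += $1; n++ } END { if (n > 0) printf "%.1f\n", sum * 1000 / n }'
}

# Run <count> requests (default 20) against <url>, print the average in milliseconds.
measure_avg() {
  url="$1"
  count="${2:-20}"
  i=0
  while [ "$i" -lt "$count" ]; do
    measure_once "$url"
    i=$((i + 1))
  done | avg_ms
}
```

Running measure_avg once against the master directly and once against the ELB address, with the same path as above, gives two averages whose difference approximates the ELB cost. Each sample opens a new connection, as in the single measurements.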

More Ways to Gather Data on Neo4j Performance:
  • Enable the slow query log and filter by server time
  • Install monitoring such as Datadog at the application level

