Hey all, happy to share our experience scaling VerneMQ connections to 1 million devices. Here's how we as a team achieved it.

To set the context: our need was to enable 1 million smart devices to communicate with our platform via a VerneMQ broker. The smart devices that connect to the broker publish and subscribe to events.

Setup details we had in hand:

The VerneMQ broker was running on Kubernetes as a StatefulSet in AWS EKS. We had a 3-node setup with one pod on each node, each node being its own instance. The instance type we chose for the node group was m5.xlarge.
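For reference, here is a minimal sketch of what that StatefulSet looked like conceptually. The names, labels and anti-affinity rule below are illustrative assumptions, not our exact manifest; the anti-affinity is what keeps one broker pod per node.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vernemq
spec:
  serviceName: vernemq
  replicas: 3
  selector:
    matchLabels:
      app: vernemq
  template:
    metadata:
      labels:
        app: vernemq
    spec:
      # spread one broker pod per m5.xlarge node
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: vernemq
              topologyKey: kubernetes.io/hostname
      containers:
        - name: vernemq
          image: vernemq/vernemq:1.11.0
          ports:
            - containerPort: 1883
              name: mqtt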

Oh nice, with all of that in hand, we started load testing directly against the VerneMQ cluster. We were eager to find the maximum number of open connections this cluster setup could support, but how could we track the maximum open connections reached and the error metrics in VerneMQ? To get exact metrics we integrated the VerneMQ broker with Datadog (a monitoring tool). With this integration we could see the metrics populated in dashboards. Hurray !!!
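VerneMQ exposes Prometheus-style metrics on its default HTTP listener (port 8888, path /metrics). One way to pull those into Datadog is the Agent's Autodiscovery annotations on the broker pods; the snippet below is only a rough sketch (the container name "vernemq" is assumed, and the exact instance keys depend on your Agent/check version):

metadata:
  annotations:
    ad.datadoghq.com/vernemq.check_names: '["openmetrics"]'
    ad.datadoghq.com/vernemq.init_configs: '[{}]'
    ad.datadoghq.com/vernemq.instances: |
      [{"prometheus_url": "http://%%host%%:8888/metrics",
        "namespace": "vernemq",
        "metrics": ["*"]}]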

After the load simulation we noted that the cluster was able to support a few hundred thousand connections, which we could now track. But how do we know the maximum number of connections a single-node broker can handle? For this we scaled the replicas down to 1 and simulated the load again. We were curious to see the metrics captured, and this time the node reached its maximum capacity of ~128K connections.
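Scaling down for the single-node test is a one-liner; checking the live sessions on that pod can be done with vmq-admin (a sketch, assuming the StatefulSet is named vernemq):

# run a single broker pod for the single-node capacity test
kubectl scale statefulset vernemq --replicas=1

# spot-check live MQTT sessions on that pod
kubectl exec vernemq-0 -- vmq-admin session show --limit=10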

Now it was time to see how we could increase the number of connections on this single-node setup. We spent a few days going through the logs and identifying the configs that needed to be tuned to raise VerneMQ's capacity limit.

Config changes made on the instances:

VerneMQ depends on Erlang, and by default it supports 256K Erlang processes. This limit can only be changed through the vm.args file that is part of the VerneMQ Docker image.

+P sets the maximum number of Erlang processes and +e sets the maximum number of ETS (Erlang Term Storage) tables.

vm.args

+P 1024000
+e 1024000
-env ERL_CRASH_DUMP ./log/erl_crash.dump
-env ERL_FULLSWEEP_AFTER 0
+Q 1024000
+A 64
-setcookie vmq
-name vernemq@127.0.0.1
+K true
+W w
-smp enable

Dockerfile

FROM vernemq/vernemq:1.11.0
RUN rm /vernemq/etc/vm.args
COPY vm.args /vernemq/etc/vm.args
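Building and pushing the custom image is standard Docker workflow (the registry path and tag below are placeholders):

docker build -t <your-registry>/vernemq:1.11.0-tuned .
docker push <your-registry>/vernemq:1.11.0-tuned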

- Updated externalTrafficPolicy: Local. This is a field on the Kubernetes Service resource that can be set to preserve the client source IP. When it is set, the actual IP address of the client machine is carried through to the broker pod instead of being replaced by the node's IP address. This helped us avoid the overlapping IP:port combinations we were running into.

Example:

Let's say the IP of LB service 1 is 10.40.0.8 and two clients try to establish connections with VerneMQ. Even though the client IPs are different, if externalTrafficPolicy is not set to Local, the service IP replaces the client IP:

client 1 43.20.0.5:32051 → LB service 1 → 10.40.0.8:32051 → VerneMQ

client 2 44.20.0.6:32051 → LB service 1 →10.40.0.8:32051 → VerneMQ

VerneMQ already has a connection established for that IP and port combination, so the second request is rejected with an error. This can be avoided by setting externalTrafficPolicy to Local.

service:
  type: LoadBalancer
  externalTrafficPolicy: Local
  mqtt:
    port: 8883
    targetPort: 1883
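If the Service already exists, the same change can be applied and verified in place (a sketch; the Service name vernemq is assumed):

kubectl patch svc vernemq -p '{"spec":{"externalTrafficPolicy":"Local"}}'
kubectl get svc vernemq -o jsonpath='{.spec.externalTrafficPolicy}'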

Update the sysctl setting `net.netfilter.nf_conntrack_max` to scale VerneMQ connections. By default on an m5.xlarge instance, net.netfilter.nf_conntrack_max is set to ~131K. It can be updated with the command below. This setting controls how many connections the kernel's connection-tracking table can hold for traffic to and from the machine.

sudo sysctl -w net.netfilter.nf_conntrack_max=262144
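It is worth checking how close the node is to this ceiling before and after a load run, using the standard sysctl keys:

# current number of tracked connections vs. the configured ceiling
sysctl net.netfilter.nf_conntrack_count
sysctl net.netfilter.nf_conntrack_max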

The TCP buffer sizes need to be increased to improve VerneMQ's performance under high traffic. The configuration can be updated with the commands below.

sudo sysctl -w net.ipv4.tcp_wmem="4096 16384 16384"
sudo sysctl -w net.ipv4.tcp_rmem="4096 16384 16384"
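Note that `sysctl -w` only changes the running kernel; to keep these values across reboots (or to bake them into the node group's launch template user data), they can be dropped into a sysctl.d file, roughly like this (file name is illustrative):

cat <<'EOF' | sudo tee /etc/sysctl.d/99-vernemq-tuning.conf
net.netfilter.nf_conntrack_max = 262144
net.ipv4.tcp_wmem = 4096 16384 16384
net.ipv4.tcp_rmem = 4096 16384 16384
EOF
sudo sysctl --system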

After applying this configuration we were able to reach the maximum connection limit configured in the OS on the AWS instance (this limit depends on the instance type you choose); we finally reached 200K connections per node. We then spun up 5 m5.xlarge nodes with 5 pods (one pod per node) to achieve 1 million connections. We noticed a few connection errors, which were resolved by adding a buffer node, so in the end we had a 6-node m5.xlarge setup serving 1 million connections.

Happy to know if this information has helped you in any way.

To know more about AWS instance types and their maximum connection limits, click here.
