Hunting Network Errors
Problem background
Recently, we've started witnessing some strange symptoms on the platform. For example, one of our crucial search components wasn’t loaded on the landing page. It appeared that Varnish couldn’t fetch part of the main page which was esi:included on this page. Some other back office sites were loading extremely slowly because they were reading some of the details from our other microservices and we found several timeouts on the server. One person even had one 503 error displayed in the browser when searching. The first thought I had was – maybe some AWS errors were to blame (network or load balancers), but I soon realized there were no recent big errors on the AWS, at least not in our region (eu-west-01) and someone else would probably have these problems as well. And still, it continued happening on several random occasions.
A 503 error roughly means "service is unavailable". Usually, we see this error when the load balancer cannot reach the backend server (either it is simply not running, it is busy and can't accept more connections, or the load balancer waited too long for a response). But because it happened several times on unknown occasions, I started to investigate the root cause of the problems.
Infrastructure
Without going into too much detail, I’d like to briefly introduce our infrastructure. All the infrastructure is hosted on AWS. The production environment lies in the private network with one NAT gateway. After the gateway, we have Elastic Load Balancers which point directly to 2 Frontend Proxies (load balancers, as EC2 machines). On the EC2 machines we have Varnish running which acts mostly as a load-balancer but also helps us in including parts of other pages with an esi:include directive. These load-balancers then redirect HTTP requests to the application servers. We have 4 different services, each backed up by one additional EC2 instance and 2 postgresql db instances. Each application is written in Java 8 and uses Dropwizard + Jersey. Every application server can talk to another server through 2 internal, Backend Proxies which also run Varnish. These proxies contain rules to route /internal/* requests which are unavailable outside our network. All instances are running on Ubuntu OS.
First try
First of all, I started reading Varnish logs on frontend (or backend) proxies, for example, by running the following command:
{{CODE}}
varnishlog -g request -N /var/lib/varnish/frontend_proxy/_.vsm -q "BerespStatus >= 500"
{{ENDCODE}}
I learned that, from time to time, our servers could respond with a 503 error.
I started thinking, why is it possible? There was no deploy at the time of error occurrence. I logged into the application server running the search service and there it was; the uptime of the machine was over 100 days and the app itself had been working since the last deployment and restart. So, the service itself couldn't be restarted
Then, a light bulb blinked in my mind - what if our server cannot accept additional connections that it should be accepting?
And so, I checked the max number of sockets the kernel can handle:
{{CODE}}
> sysctl net.core.somaxconn
net.core.somaxconn = 128
{{ENDCODE}}
… only 128 connections!
Then I needed to check the number of current socket connections:
{{CODE}}
> netstat -npActive Internet connections (w/o servers)Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name tcp 0 0 172.10.60.27:8080 172.10.10.219:46848 TIME_WAIT - tcp 0 0 172.10.60.27:8080 172.10.20.251:54028 TIME_WAIT - tcp 0 0 172.10.60.27:43018 172.10.30.16:5432 ESTABLISHED 19290/java tcp 0 0 172.10.60.27:8080 172.10.20.251:54138 TIME_WAIT - tcp 0 0 172.10.60.27:8080 172.10.20.250:51044 TIME_WAIT - tcp 1 0 172.10.60.27:46136 172.10.20.54:80 CLOSE_WAIT 19290/java
{{ENDCODE}}
In total, there were 112 open connections – pretty close to the limit. I checked other important values as well:
{{CODE}}
> ulimit
unlimited
{{ENDCODE}}
So, we can open an unlimited number of file descriptors (socket is fd as well) - good!
{{CODE}}
> ifconfigeth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001 inet 172.10.60.27 netmask 255.255.255.0 broadcast 172.10.60.255 inet6 fe80::40b:d2ff:fecc:ac9 prefixlen 64 scopeid 0x20<link> ether 06:0b:d2:cc:0a:c9 txqueuelen 1000 (Ethernet) RX packets 72631533 bytes 24987367100 (24.9 GB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 81748852 bytes 13452034916 (13.4 GB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
{{ENDCODE}}
txqueuelen is a queue length on the network interface which tells us how many pending connections can wait until they're processed.
{{CODE}}
> sysctl net.core.netdev_max_backlog
net.core.netdev_max_backlog = 1000
> sysctl net.ipv4.tcp_max_syn_backlog
net.ipv4.tcp_max_syn_backlog = 128
{{ENDCODE}}
The first value tells us about the max number of incoming connections (between net interface and kernel until kernel opens the socket). The latter tells us about the number of connections pending in SYN request (to sync with the server).
Then, I tried to increase some of these numbers:
{{CODE}}
ifconfig eth0 txqueuelen 5000
echo "/sbin/ifconfig eth0 txqueuelen 5000" >> /etc/rc.local
sysctl -w net.core.somaxconn=1024
sysctl -w net.ipv4.tcp_max_syn_backlog=2048
sysctl -w net.core.netdev_max_backlog=2000
{{ENDCODE}}
Second attempt
As it turned out, it's not related to the number of connections. At least not on the OS level. I then thought it could be because of insufficient connections in Jetty / Dropwizard. I re-configured the number of connections and max threads number (to something around 1024) and gave it a try
After restart it turned out it helped nothing. The issue persisted.
Third attempt
Therefore, I thought I could try to do some manual testing. I tried to curl the problematic resource, being on the search service instance:
{{CODE}}
curl -H "X-Esi-Level: 1" -i http://localhost:8080/search/problematic/path
{{ENDCODE}}
But again, no response. I was waiting for quite a few minutes, and nothing happened. OK, so I get the same results which our Varnish instances get. But … Varnish times out after 30 seconds and returns 503 (this is why one of our backoffice pages is displayed after ~30 seconds). That meant the search had to be continued, possibly to the code.
Fourth and last try
Staying on the search service instance, I curled Dropwizard’s admin context of search service to see some metrics:
I got the following:
{{CODE}}
"jvm.threads.blocked.count": {
"value": 26},"jvm.threads.count": {
"value": 81},"jvm.threads.daemon.count": {
"value": 11},"jvm.threads.deadlock.count": {
"value": 0},"jvm.threads.deadlocks": {
"value": []},"jvm.threads.new.count": {
"value": 0},"jvm.threads.runnable.count": {
"value": 10},"jvm.threads.terminated.count": {
"value": 0},"jvm.threads.timed_waiting.count": {
"value": 39},"jvm.threads.waiting.count": {
"value": 6},
{{ENDCODE}}
The most interesting was related to the number of blocked threads... why were so many threads blocked? I truly doubt that’s a normal situation (although I’ve never checked these values before, so again it’s only a guess)
That’s why I checked threads as well:
{{CODE}}
curl localhost:8081/admin/threads
{{ENDCODE}}
And I found something really interesting:
{{CODE}}
dw-53 - GET /search/problematic/path id=53 state=BLOCKED
- waiting to lock <0x6a892bdd> (a java.security.SecureRandom)
owned by dw-125 - GET /search/ id=125
at java.security.SecureRandom.nextBytes(SecureRandom.java:468)
at com.acme.search.web.security.SessionFilter.anonymousUser(SessionFilter.java:136)
at com.acme.search.web.security.SessionFilter.filter(SessionFilter.java:126)
{{ENDCODE}}
In the end, it turned out that our SessionFilter (which is invoked before / after the resource code is being executed; the problematic one was after resource was executed, when setting cookies to the response) blocked execution of threads. And I therefore realized that the thread was blocked in SecureRandom. It was apparent that this SecureRandom instance was reading random bytes from /dev/random and it then blocks the thread until we have enough randomness. So, more and more threads were coming into this trap and were being blocked… at least for more than a few minutes.
The code:
{{CODE}}
public static class SessionFilter implements ContainerResponseFilter {
private final SecureRandom randomGenerator;
@Override
public void filter(final ContainerRequestContext requestContext,
final ContainerResponseContext responseContext) throws IOException {
final Cookie oldCookie = requestContext.getCookies().get(SessionCookie.ANONYMOUS.getName());
if (oldCookie == null) {
final String userId = anonymousUser();
final NewCookie anonymousCookie = new NewCookie(SessionCookie.ANONYMOUS.getName(), userId, "/", null,
null, -1, requestContext.getSecurityContext().isSecure(), true);
responseContext.getHeaders().add("Set-Cookie", anonymousCookie);
}
}
private String anonymousUser() {
final byte[] bytes = new byte[32];
randomGenerator.nextBytes(bytes);
return DatatypeConverter.printHexBinary(bytes);
}
}
view raw
{{ENDCODE}}
As long as we don't need SecureRandom for such trivial things like generating cookies for anonymous users (which in fact is just a 64-character hex number), I have decided to use simple java.util.Random - it should be sufficient for our needs. Moreover, we can’t block threads when trying to find out sufficient random numbers.
After deployment, the problematic error hasn’t reappeared - mission successful!
In case of any questions, drop me a line @ma_moczulski
Reference:
http://serverfault.com/questions/291027/how-to-determine-number-of-open-sockets-vm-resource-issues
https://tweaked.io/guide/kernel/
http://docs.oracle.com/javase/8/docs/api/java/security/SecureRandom.html#nextBytes-byte:A-
https://tersesystems.com/2015/12/17/the-right-way-to-use-securerandom/
{{CTA}}
Interested in working at Voucherify?
{{ENDCTA}}