SSL handshake latency and HTTPS optimizations.

At work today, I started investigating the latency differences for similar requests between HTTP and HTTPS. Historically, I was running with the assumption that higher latency on HTTPS (SSL) traffic was to be expected since SSL handshakes are more CPU intensive. I didn't really think about the network consequences of SSL until today.
It's all in the handshake.
The TCP handshake is a 3-packet event. The client sends 2 packets, the server sends 1. Best case, you're looking at one round-trip to establish your connection. We can show this empirically by comparing ping and tcp connect times:
% fping -q -c 5 www.csh.rit.edu
www.csh.rit.edu : xmt/rcv/%loss = 5/5/0%, min/avg/max = 112/115/123
Average is 115ms for ping round-trip. How about TCP? Let's ask curl how long tcp connect takes:
% seq 5 | xargs -I@ -n1 curl -so /dev/null -w "%{time_connect}\n" http://www.csh.rit.edu
0.117
0.116
0.117
0.116
0.116
There's your best case. This is because when you (the client) receive the 2nd packet in the handshake (SYN+ACK), you reply with ACK and consider the connection open. Exactly 1 round-trip is required before you can send your http request.
What about when using SSL? Let's ask curl again:
% curl -kso /dev/null -w "tcp:%{time_connect}, ssldone:%{time_appconnect}\n" https://www.csh.rit.edu
tcp:0.117, ssldone:0.408

# How about to google?
% curl -kso /dev/null -w "tcp:%{time_connect}, ssldone:%{time_appconnect}\n" https://www.google.com
tcp:0.021, ssldone:0.068

That's a 3.5x jump in latency just for adding SSL to the mix, and this is before we've even sent the HTTP request.
The reason for this is easily shown with tcpdump. For this test, I'll use tcpdump to sniff https traffic and then use openssl s_client to simply connect to the http server over ssl and do nothing else. Start tcpdump first, then run openssl s_client.
terminal1 % sudo tcpdump -ttttt -i any 'port 443 and host www.csh.rit.edu'
...

terminal2 % openssl s_client -connect www.csh.rit.edu:443
...

Tcpdump output, trimmed for brevity:

# Start TCP Handshake
00:00:00.000000 IP snack.home.40855 > csh.rit.edu.https: Flags [S] ...
00:00:00.114298 IP csh.rit.edu.https > snack.home.40855: Flags [S.] ...
00:00:00.114341 IP snack.home.40855 > csh.rit.edu.https: Flags [.] ...
# TCP Handshake complete.

# Start SSL Handshake.
00:00:00.114769 IP snack.home.40855 > csh.rit.edu.https: Flags [P.] ...
00:00:00.226456 IP csh.rit.edu.https > snack.home.40855: Flags [.] ...
00:00:00.261945 IP csh.rit.edu.https > snack.home.40855: Flags [.] ...
00:00:00.261960 IP csh.rit.edu.https > snack.home.40855: Flags [P.] ...
00:00:00.261985 IP snack.home.40855 > csh.rit.edu.https: Flags [.] ...
00:00:00.261998 IP snack.home.40855 > csh.rit.edu.https: Flags [.] ...
00:00:00.273284 IP snack.home.40855 > csh.rit.edu.https: Flags [P.] ...
00:00:00.398473 IP csh.rit.edu.https > snack.home.40855: Flags [P.] ...
00:00:00.436372 IP snack.home.40855 > csh.rit.edu.https: Flags [.] ...

# SSL handshake complete, ready to send HTTP request. 
# At this point, openssl s_client is sitting waiting for you to type something
# into stdin.

Summarizing the above tcpdump data for this ssl handshake:
  • 12 packets for SSL, vs 3 for TCP alone
  • TCP handshake took 114ms
  • Total SSL handshake time was 436ms
  • Number of network round-trips was 3.
  • SSL portion took 322ms (network and crypto)
The server tested above has a 2048 bit ssl cert. Running 'openssl speed rsa' on the webserver shows it can do a signature in 22ms:
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.022382s 0.000542s     44.7   1845.4
Anyway. The point is, no matter how fast your SSL accelerators (hardware loadbalancer, etc.), if your SSL endpoints aren't near the user, then your first connect will be slow. As shown above, the crypto piece of the SSL handshake took about 22ms, which means roughly 300ms of the 322ms SSL portion was network latency and other overhead.
Once SSL is established, though, the connection switches to a symmetric cipher (AES, 3DES, etc.), which is much faster, and the resource (network, cpu) overhead is pretty tiny by comparison.
Summarizing from above: Using SSL incurs a 3.5x latency overhead for each handshake, but afterwards it's generally fast like plain TCP. If you accept this conclusion, let's examine how this can affect website performance.
Got Firebug? Open any website. Seriously. Watch the network activity. How many HTTP requests are made? Can you tell how many of the requests to the same domain use http pipelining (or keepalive)? How many open a new connection each time? You can track this with tcpdump by looking for 'syn' packets if you want (tcpdump 'tcp[tcpflags] == tcp-syn').
What about the street wisdom for high-performance web servers? HAProxy's site says:
 "If a site needs keep-alive, there is a real problem. Highly loaded sites often disable keep-alive to support the maximum number of simultaneous clients. The real downside of not having keep-alive is a slightly increased latency to fetch objects. Browsers double the number of concurrent connections on non-keepalive sites to compensate for this." 
Disabling keep-alive on SSL connections means every single http request is going to take 3 round-trips before even asking for data. If your server is 100ms away, and you have 10 resources to serve on a single page, that's 3 seconds of network latency before you include SSL crypto or resource transfer time. With keep-alive, you could eat that handshake cost only once instead of 10 times.
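To make that arithmetic concrete, here's a back-of-envelope sketch in C++; the 100ms round-trip, 3 handshake round-trips, and 10 resources are the assumed numbers from the paragraph above:

#include <cstdio>

int main() {
    double rtt = 0.100;      // assumed round-trip time to the server, in seconds
    int handshake_rtts = 3;  // TCP + SSL handshake round-trips, as measured earlier
    int resources = 10;      // resources fetched for the page

    // Without keep-alive, every request pays the full handshake cost.
    double without_keepalive = resources * handshake_rtts * rtt;
    // With keep-alive, you pay the handshake once and reuse the connection.
    double with_keepalive = 1 * handshake_rtts * rtt;

    std::printf("without keep-alive: %.1fs of handshake latency\n", without_keepalive); // 3.0s
    std::printf("with keep-alive:    %.1fs\n", with_keepalive);                         // 0.3s
}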
Many browsers will open multiple simultaneous connections to any given webserver when they need to fetch multiple resources; the idea is that parallelism gets you more tasty http resources in a shorter time. If the browser opens two connections in parallel, you'll still incur many sequential SSL handshakes that slow your resource fetching down. More SSL handshakes in parallel also means a higher CPU burden. Ultimately, memory (per open connection) scales more cheaply than CPU time: the handshake above cost 22ms, most of it spent on CPU, which is far more expensive than the memory a kept-alive connection holds, and memory is easier to grow than CPU.
For some data, Google and Facebook both permit keep-alive:
% URL=https://s-static.ak.facebook.com/rsrc.php/zPET4/hash/9e65hu86.js
% curl  -w "tcp: %{time_connect} ssl:%{time_appconnect}\n" -sk -o /dev/null $URL -o /dev/null $URL
tcp: 0.038 ssl:0.088
tcp: 0.000 ssl:0.000

% URL=https://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js
% curl  -w "tcp: %{time_connect} ssl:%{time_appconnect}\n" -sk -o /dev/null $URL -o /dev/null $URL
tcp: 0.054 ssl:0.132
tcp: 0.000 ssl:0.000
The 2nd line of output reports zero time spent in tcp and ssl handshaking. Further, if you tell curl to output response headers (curl -D -) you'll see "Connection: keep-alive". This is data showing that at least some of the big folks with massive qps are using keep-alive.
Remember that new handshakes are CPU-intensive, but established SSL connections generally aren't, as they use a cheaper symmetric cipher after the handshake. Disabling keep-alive ensures that every request incurs an SSL handshake, which can quickly overload a moderately-used server without SSL acceleration hardware if you have a large ssl key (2048- or 4096-bit).
Even if you offload SSL to special hardware, you're still incurring the higher network latency, which faster hardware can't compensate for. Frankly, in most cases it's more cost effective to buy a weaker SSL certificate (1024 bit) than it is to buy SSL hardware - see Google's Velocity 2010 talk on SSL.
By the way, on modern hardware you can do a decent number of SSL handshakes per second with 1024bit keys, but 2048bit and 4096bit keys are much harder:
# 'openssl speed rsa' done on an Intel X5550 (2.66gHz)
                  sign    verify    sign/s verify/s
rsa 1024 bits 0.000496s 0.000027s   2016.3  36713.2
rsa 2048 bits 0.003095s 0.000093s    323.1  10799.2
rsa 4096 bits 0.021688s 0.000345s     46.1   2901.5
Fixing SSL latency is not totally trivial. The CPU-intensive part can be handled by special hardware if you can afford it, but the only sure way to solve network round-trip latency is to be closer to your user and/or to work on minimizing the total number of round-trips. You can be further from your users if you don't force things like keep-alive to be off, which can save you money in the long run by letting you have better choices of datacenter locations.

Bloom Filters by Example

A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set.
The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set.
The base data structure of a Bloom filter is a Bit Vector. Here's a small one we'll use to demonstrate:

[ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ]
 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
Each empty cell in that table represents a bit, and the number below it is its index. To add an element to the Bloom filter, we simply hash it a few times and set the bits in the bit vector at the index of those hashes to 1.
It's easier to see what that means than explain it, so enter some strings and see how the bit vector changes. Fnv and Murmur are two simple hash functions:
[Interactive demo on the original page: type a string, see its fnv and murmur hash values, and watch the corresponding bits get set as it's added to your set.]
When you add a string, you can see that the bits at the index given by the hashes are set to 1. I've used the color green to show the newly added ones, but any colored cell is simply a 1.
To test for membership, you simply hash the string with the same hash functions, then see if those values are set in the bit vector. If they aren't, you know that the element isn't in the set. If they are, you only know that it might be, because another element or some combination of other elements could have set the same bits. Again, let's demonstrate:
[Interactive demo on the original page: test a string for membership; it shows the fnv and murmur hash values, whether the element may be in the set, and the current false-positive probability.]
And that's the basics of a bloom filter!
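To make the mechanics concrete, here's a minimal sketch in C++ of a filter like the 15-bit demo above. It fakes k=2 independent hashes by salting std::hash rather than using real fnv/murmur implementations; the class and names are made up for illustration:

#include <bitset>
#include <functional>
#include <iostream>
#include <string>

// A tiny Bloom filter mirroring the 15-bit demo, with k = 2 hashes.
class TinyBloom {
    static const std::size_t M = 15;  // number of bits in the filter
    std::bitset<M> bits;

    // Hash number i of element s, reduced into [0, M). Salting the input
    // stands in for two independent hash functions (like fnv and murmur).
    std::size_t h(const std::string& s, char i) const {
        return std::hash<std::string>{}(s + i) % M;
    }

public:
    void add(const std::string& s) {
        bits.set(h(s, 0));  // set the bit at each hash's index
        bits.set(h(s, 1));
    }

    // false -> definitely not in the set; true -> may be in the set.
    bool mayContain(const std::string& s) const {
        return bits.test(h(s, 0)) && bits.test(h(s, 1));
    }
};

int main() {
    TinyBloom filter;
    filter.add("hello");
    std::cout << filter.mayContain("hello") << "\n";  // 1: may be in the set
    std::cout << filter.mayContain("world") << "\n";  // likely 0: definitely not
}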

Advanced Topics

Before I write a bit more about Bloom filters, a disclaimer: I've never used them in production. Don't take my word for it. All I intend to do is give you general ideas and pointers to where you can find out more.
In the following text, we will refer to a Bloom filter with k hashes, m bits in the filter, and n elements that have been inserted.

Hash Functions

The hash functions used in a Bloom filter should be independent and uniformly distributed. They should also be as fast as possible (cryptographic hashes such as sha1, though widely used, are therefore not very good choices).
Examples of fast, simple hashes that are independent enough [3] include murmur, the fnv series of hashes (sketched after the survey below), and Jenkins hashes.
To see the difference that a faster-than-cryptographic hash function can make, check out this story of a ~800% speedup when switching a bloom filter implementation from md5 to murmur.
In a short survey of bloom filter implementations:
  • Cassandra uses Murmur hashes
  • Hadoop includes default implementations of Jenkins and Murmur hashes
  • python-bloomfilter uses cryptographic hashes
  • Plan9 uses a simple hash as proposed in Mitzenmacher 2005
  • Sdroege Bloom filter uses fnv1a (included just because I wanted to show one that uses fnv.)
  • Squid uses MD5
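For reference, fnv1a is simple enough to fit in a few lines. Here's a sketch of the 32-bit variant (the constants are the standard FNV offset basis and prime):

#include <cstdint>
#include <string>

// 32-bit fnv1a: xor each byte into the hash, then multiply by the FNV prime.
std::uint32_t fnv1a(const std::string& data) {
    std::uint32_t hash = 2166136261u;  // FNV offset basis
    for (unsigned char c : data) {
        hash ^= c;
        hash *= 16777619u;  // FNV prime
    }
    return hash;
}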

How big should I make my Bloom filter?

It's a nice property of Bloom filters that you can tune the false positive rate of your filter. A larger filter will have fewer false positives, and a smaller one more.
Your false positive rate will be approximately $(1 - e^{-kn/m})^k$, so you can plug in the number n of elements you expect to insert, and try various values of k and m to configure your filter for your application. [2]
This leads to an obvious question:

How many hash functions should I use?

The more hash functions you have, the slower your bloom filter, and the quicker it fills up. If you have too few, however, you may suffer too many false positives.
Since you have to pick k when you create the filter, you'll have to ballpark what range you expect n to be in. Once you have that, you still have to choose a potential m (the number of bits) and k (the number of hash functions).
It seems a difficult optimization problem, but fortunately, given an m and an n, we have a function to choose the optimal value of k: $k = (m/n)\ln 2$. [2][3]
So, to choose the size of a bloom filter, we (see the worked sketch after this list):
1. Choose a ballpark value for n
2. Choose a value for m
3. Calculate the optimal value of k
4. Calculate the error rate for our chosen values of n, m, and k. If it's unacceptable, return to step 2 and change m; otherwise we're done.
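Here's that recipe as a small sketch, with made-up example numbers: an n of one million elements and an m of eight million bits (roughly 1 MB):

#include <cmath>
#include <cstdio>

int main() {
    // Steps 1 and 2: ballpark n (expected elements) and choose m (bits).
    double n = 1000000;  // assumed: one million elements
    double m = 8000000;  // assumed: eight million bits, about 1 MB

    // Step 3: optimal number of hash functions, k = (m/n) ln 2.
    double k = (m / n) * std::log(2.0);

    // Step 4: expected false positive rate, (1 - e^(-kn/m))^k.
    double p = std::pow(1.0 - std::exp(-k * n / m), k);

    std::printf("k ~= %.2f hash functions (round to %d)\n", k, (int)(k + 0.5));
    std::printf("false positive rate ~= %.4f\n", p);  // about 0.02 for these numbers
}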

How fast and space efficient is a Bloom filter?

Given a Bloom filter with m bits and k hashing functions, both insertion and membership testing are O(k). That is, each time you want to add an element to the set or check set membership, you just need to run the element through the k hash functions and set or check those bits.
The space advantages are more difficult to sum up; again it depends on the error rate you're willing to tolerate. It also depends on the potential range of the elements to be inserted; if it is very limited, a deterministic bit vector can do better. If you can't even ballpark estimate the number of elements to be inserted, you may be better off with a hash table or a scalable Bloom filter. [4]

What can I use them for?

I'll link you to Wikipedia instead of copying what they say. C. Titus Brown also has an excellent talk on an application of Bloom filters to bioinformatics.

References

1: Network Applications of Bloom Filters: A Survey, Broder and Mitzenmacher. An excellent overview.
2: Wikipedia, which has an excellent and comprehensive page on Bloom filters.
3: Less Hashing, Same Performance, Kirsch and Mitzenmacher.
4: Scalable Bloom Filters, Almeida, Baquero, Preguiça, and Hutchison.
How can I handle a destructor that fails?

Write a message to a log-file. Or call Aunt Tilda. But do not throw an exception!
Here's why (buckle your seat-belts):
The C++ rule is that you must never throw an exception from a destructor that is being called during the "stack unwinding" process of another exception. For example, if someone says throw Foo(), the stack will be unwound so all the stack frames between the throw Foo() and the } catch (Foo e) { will get popped. This is called stack unwinding.
During stack unwinding, all the local objects in all those stack frames are destructed. If one of those destructors throws an exception (say it throws a Bar object), the C++ runtime system is in a no-win situation: should it ignore the Bar and end up in the } catch (Foo e) { where it was originally headed? Should it ignore the Foo and look for a } catch (Bar e) { handler? There is no good answer; either choice loses information.
So the C++ language guarantees that it will call terminate() at this point, and terminate() kills the process. Bang you're dead.
The easy way to prevent this is never throw an exception from a destructor. But if you really want to be clever, you can say never throw an exception from a destructor while processing another exception. But in this second case, you're in a difficult situation: the destructor itself needs code to handle both throwing an exception and doing "something else", and the caller has no guarantees as to what might happen when the destructor detects an error (it might throw an exception, it might do "something else"). So the whole solution is harder to write. So the easy thing to do is always do "something else". That is, never throw an exception from a destructor.
Of course the word never should be "in quotes" since there is always some situation somewhere where the rule won't hold. But certainly at least 99% of the time this is a good rule of thumb.
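As a minimal sketch of the "write a message to a log-file" advice (the Connection class, its close() method, and the log file name are all made up for illustration):

#include <fstream>
#include <stdexcept>

// Hypothetical example: a destructor whose cleanup can fail. We log the
// failure instead of letting the exception escape the destructor.
class Connection {
public:
    ~Connection() {
        try {
            close();  // cleanup that might throw
        } catch (...) {
            // Never let an exception propagate out of a destructor: if we're
            // already unwinding from another exception, escaping here means
            // terminate(). Log it and move on.
            std::ofstream("errors.log", std::ios::app)
                << "Connection::close() failed during destruction\n";
        }
    }
private:
    // Simulated cleanup that always fails, for demonstration.
    void close() { throw std::runtime_error("flush failed"); }
};

int main() {
    Connection c;  // destructor runs (and logs the failure) at end of scope
}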

Is there any difference between List x; and List x();?

Big difference!
Suppose that List is the name of some class. Then function f() declares a local List object called x:
void f()
{
  List x;     // Local object named x (of class List)
  ...
}
    
But function g() declares a function called x() that returns a List:
void g()
{
  List x();   // Function named x (that returns a List)
  ...
}
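If you wanted an object rather than a function, leave off the parentheses, or use braces in C++11 and later. A minimal sketch (the trivial List definition is just a stand-in so the example compiles):

struct List { };  // stand-in for some class List

void h()
{
  List x1;      // Object named x1, default-initialized
  List x2{};    // C++11: also an object, via brace initialization
  // List x3(); // Pitfall: declares a function returning a List!
}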