Friday, March 30, 2012

Small String Optimization and Move Operations

EDIT: A lot of readers seem more interested in my implementation of the String class than the purpose of this post, which is to show a somewhat remarkable fact about how small string optimization interacts with move operations. Please don't post more "This implementation is not the most efficient". You are missing the point. The example implementation of my string class here is neither complete, efficient, nor representative of STL implementations. It is only used to illustrate the idea behind small string optimization. I don't even know if the code actually compiles.

Typically, std::string is implemented using small string optimization, a technique by which small strings are stored directly inside the object, rather than dynamically allocated on the heap.

Before I start discussing the issue with small string optimization, let me give you a quick introduction to how it works. Let's say we want to implement a very simplified string class ourselves. In each String object, we simply keep a pointer to the beginning and to the terminating null character of the char array (std::string also needs to keep a pointer to an allocator, as you are allowed to specify your own memory management).


class String {
public:
  String(const char* p);
  String(const String& s);

  String(String&& s);
  ~String();
  unsigned length() const { return _size; }

  char* begin() { return _beg; }
  char* end() { return begin() + _size; }
  // ...
private:
  char* _beg; // first character in array
  unsigned _size; // excluding terminating null char
};


We can implement these constructors in a straightforward manner, using low level programming for efficiency:


inline String::String(const char* p) : _size(strlen(p))
{
    _beg(new char[_size+1]);
    memcpy(_beg,p,_size+1);
}

inline String::String(const String& s) : _size(s.length())
{
    _beg = new char[_size+1];
    memcpy(_beg,s._beg,len+1);
  }

inline String::String(String&& s) : _size(s.length())
{
   _beg = s._beg;
   s._beg = nullptr; // zero out
}

inline String::~String() { delete [] _beg; }



(The implementation will be slightly faster if you replace new/delete with malloc/free, as there is no need to call any constructors/destructors for char arrays. In theory, compilers can optimize this away; in practice, they don't.) See this post for a discussion on this issue (unrelated to small string optimization and move operations).

So far so good, but now you realize that you can do something clever: since we need to a char* pointers in every string, why not use that space to store small strings of length up to sizeof(char*)? That way, we won't need to call new/delete for small strings. In fact, since strings are in most applications quite short, why not store a little more while we're at it?

This is known as small string optimization. One way we can efficiently and elegantly implement this is by using tagged unions. Let's try it out:


class SString {
public:
  SString(const char* p);
  SString(const String& s);

  SString(String&& s);
  ~SString();
  unsigned length() const { return _size; }

  char* begin() { return _size < 16 ? str : _beg; }
  char* end() { return begin() + _size; }
  // ...
private:
  unsigned _size;
  union {
    char* _beg;
    char str[16];
  };
};


Note how SString::begin() was defined to either return _beg or str, depending on the size of the string.


The constructor for const char* is not that hard:


inline SString::SString(const char* p) : _size(strlen(p))
{
    if (_size < 16) {
      // Small string optimization: store in static str
      memcpy(str,p,_size+1);
    } else {
      // Dynamically allocate string (like before)
      _beg(new char[_size+1]);
      memcpy(_beg,p,_size+1);
    }
}


The copy constructor is pretty much the same:


inline SString::SString(const SString& s) : _size(s.length())
{
    if (_size < 16) {
      // Small string optimization: store in static str
      memcpy(str,s.str,_size+1);
    } else {
      // Dynamically allocate string (like before)
      _beg(new char[_size+1]);
      memcpy(_beg,s._beg,_size+1);
    }
}

Now comes the interesting part where you'll see the trade off when using small string optimization. Here is the move constructor:


inline SString::SString(SString&& s) : _size(s.length())
{
    if (_size < 16) {
       // ???
    } else {
       _beg = s._beg; // steal resource from s
       s._beg = nullptr; // prevent s from deleting on destruction
    }
}

So when we have a dynamically allocated array, we can steal the resource. When we've used the small string optimization, on the otherhand, there is no way to steal the resource, as it is statically allocated. We must hence copy it:


if (_size < 16) {
      memcpy(str,s.str,_size+1);
} else {

What may be surprising to a string user, is that the  performance degradation occurs only for small strings.
This effect can be seen on MSVC 2010, which uses a small string optimization of 16 characters for std::string. We take a command line string, then move the string between two string objects 10 million times.
Here's the result:
With the a string of 16 and 17 characters, including the null terminating char, the execution time in seconds is:

Length      Time
16             0.171
17             0.06

In other words, the small string optimization, which is now the small string deterioration, is about 3 times slower than copying pointers. This leads to the surprising conclusion that when moving strings, it is actually faster to move larger strings.

Note that most of the time, you don't just move around strings, but allocate them, deallocate them, copy them, etc. In all those cases, of course, small string optimization is very useful.

Tuesday, March 27, 2012

C versus C++ performance: printing and string conversions

A widely held belief is that C++ code is inherently slower than C code. Although this is true sometimes, it is usually largely a myth. In fact, there are situations in which the opposite holds true: C++ is faster than C.

Ironically, C++ is faster in one of the instances where people typically expect the opposite: printing with C++'s iostream library versus C's stdio library.

Using Microsoft Visual C++ 2010, I tested printing "Hello World!" a million times to a file using ofstream and fprintf. Here are the results:

C          C++        Comparison
0.381     0.24       C++ is 60% faster than C

To be fair, this myth problems stems from a very related fact: if we try out the same experiment printing random integers, we get the following:
C          C++        Comparison
0.307     0.929      C is 3 times faster than C++

If you've done conversions between strings and numeric types prior to C++ 11, you've probably suspected that these conversions are not very efficient:

string to_string(int i)
{
  stringstream ss;
  ss << i;
  return ss.str();
}

However, there's no reason why iostream can't make full use of C's conversion routines for better efficiency. To see that conversions are really the problem here, let's redo the experiment by first converting the integers into strings using std::to_string, and then printing out the strings with ofstream:

C          C++        Comparison
0.306    0.606       C is 2 times faster than C++

Yes, this actually means that:
cout << to_string(42);
is faster than:
cout << 42;

This is obviously a design flaw in the Microsoft's C++ implemention (which is otherwise an excellent C++ compiler), as we could get a better implementation by simply changing the library:

ostream& operator<<(ostream& os, int i)
{
  // Replace current code with:
  os << std::to_string(i); // use std::string overload; also note that to_string never throws here
}

In fact, converting it to a std::string and then printing has some overhead of its own (probably very little due to the small string optimization), but we can get closer to C conversion performance using sprintf to convert the integer into a C string buffer (its easy to calculate the maximal length of the array using log, sizeof and adding 2 more chars, one for the potential minus sign and one for null terminating the string):
With this, we get down from 0.606 to 0.5 seconds. There's still a gap of 64% in performance, which is unexplained for. What it does mean, however, is that Microsoft's C++ library did not prioritize efficient type conversions to strings, which does happen quite a lot when parsing.

Conclusion:
C++ prints faster than C.
C converts numeric types to strings faster than C++.
There's no reason why C++ should convert slower than C, as it has full access to C's underlying implementation of the stdio library (it's included in C++ too, after all).