## Wednesday, April 11, 2012

### new/delete vs malloc/free for POD types

C++ typically allocates memory using the new/delete pair, whereas C uses malloc/free.
New/delete is necessary when dealing with non-POD types, since the constructor/destructor needs to be invoked. However, it is not necessary for POD types, so the question is: how well does C++ live up to its philosophy of "you don't pay for what you don't use" here?

This of course depends on the compiler, and I'm testing it for MSVC 10.

The first benchmark simply allocates and deallocates 100 million ints using new/delete vs malloc/free.

We also check whether the default constructor is invoked (which would effectively zero the allocated value).
Here's the result on my machine (time in seconds):

malloc/free:

```
Execution time [s]: 6.49
zero: 0
nonzero: 100000000
```

new/delete:

```
Execution time [s]: 6.876
zero: 0
nonzero: 100000000
```

The difference is rather small: malloc is about 6% faster.
It turns out that my earlier assumption, that new/delete calls a default constructor for POD types (in this case int(), zeroing the value), is false: the elements are clearly non-zero after allocation. Thanks to Antti Tuppurainen for actually looking at the assembly output to confirm this.

However, the fact remains that new/delete is slower than malloc/free on PODs.

I can only guess that the overhead lies in the definition of operator new, which upon allocation failure must call the new handler (installed via std::set_new_handler), repeatedly, until either the allocation succeeds or the handler itself gives up, at which point std::bad_alloc is thrown. Due to its support for mixing C++ with managed code, MSVC 10 has a slight overhead when entering a try block (and, I believe, when a function contains a throw statement), which may explain the small penalty. (It would be informative if someone could run the same benchmark with other compilers, e.g. GCC, which has no overhead when entering a try block.)

On a related note, there's a subtle difference between C++98 and C++03: the latter introduced value-initialization, which lets you decide whether a POD should be zeroed out when calling new:

```cpp
struct pod {
    int i;
    double d;
};

pod* p = new pod;   // does not zero-initialize members
pod* q = new pod(); // zero-initializes members
```

I've tried the benchmarks with and without zero initialization:
malloc/free:

```
Execution time [s]: 6.483
zero: 0
nonzero: 100000000
```

new/delete:

```
Execution time [s]: 7.026
zero: 0
nonzero: 100000000
```

new/delete is about 8% slower here. As the counters show, this is not due to zero-initialization: no values were zeroed. With zero-initialization, things get a little worse:

new/delete with zero initialization (i.e. new pod() instead of new pod):

```
Execution time [s]: 7.543
zero: 100000000
nonzero: 0
```

I'll still use new/delete, though: most sensible objects have constructors, so that they can conveniently be initialized at the point of declaration.

It should be noted that in C++11, some structs with constructors can still be POD. It basically boils down to: if no C++ magic involving hidden data is inserted into the object, it can be made POD. It would be nice to have C++ perform as well as C in dynamic memory allocation for POD types; is there an implementation that manages this while still providing proper exception handling?

I hope someone will do this benchmark for GCC and Intel C++.

## Monday, April 2, 2012

### Limits of Named Return Value Optimization

I decided to test how well MSVC 10 can perform return value optimization.

Here's the code for my benchmark:

```cpp
#include <iostream>

struct A {
    A() { std::cout << "A() Constructor\n"; }
    A(int) { std::cout << "A(int) Constructor\n"; }
    A(const A&) { std::cout << "RVO Failed (copy)\n"; }
    A(A&&) { std::cout << "RVO Failed (move)\n"; }
    A& operator=(const A&) { std::cout << "RVO Failed (copy assignment)\n"; return *this; }
    A& operator=(A&&) { std::cout << "RVO Failed (move assignment)\n"; return *this; }
    A operator+(const A&) const { return A(0); }
};

A rvo(int i)
{
    if (i < 0) {
        return A();
    } else {
        return A(i);
    }
}

A rvo2(int i)
{
    if (i < 0) {
        return A();
    } else {
        return A() + A();
    }
}

A nrvo(int i)
{
    A a;
    if (i < 0) {
        return a;
    } else {
        return a;
    }
}

A nrvo2(int i)
{
    A a;
    if (i < 0) {
        return a;
    } else {
        A b(i);
        return b;
    }
}

A hybrid(int i)
{
    A a;
    if (i < 0) {
        return a;
    } else {
        return A(i);
    }
}

int main()
{
    std::cout << "rvo(-1):\n";
    rvo(-1);
    std::cout << "\n";

    std::cout << "rvo(1):\n";
    rvo(1);
    std::cout << "\n";

    std::cout << "rvo2(-1):\n";
    rvo2(-1);
    std::cout << "\n";

    std::cout << "rvo2(1):\n";
    rvo2(1);
    std::cout << "\n";

    std::cout << "nrvo(-1):\n";
    nrvo(-1);
    std::cout << "\n";

    std::cout << "nrvo(1):\n";
    nrvo(1);
    std::cout << "\n";

    std::cout << "nrvo2(-1):\n";
    nrvo2(-1);
    std::cout << "\n";

    std::cout << "nrvo2(1):\n";
    nrvo2(1);
    std::cout << "\n";

    std::cout << "hybrid(-1):\n";
    hybrid(-1);
    std::cout << "\n";

    std::cout << "hybrid(1):\n";
    hybrid(1);
    std::cout << "\n";
}
```

The function rvo tests simple application of return value optimization.
nrvo tests for named return value optimization in the simplest case.
nrvo2 returns two different objects, depending on execution path.
hybrid performs nrvo in one path, and rvo in the other.

If RVO/NRVO is performed, no move/copy constructor should be invoked. Since MSVC 10 supports move semantics, we don't expect to see any copy operations at all.

Here's the result:

```
rvo(-1):
A() Constructor

rvo(1):
A(int) Constructor

rvo2(-1):
A() Constructor

rvo2(1):
A() Constructor
A() Constructor
A(int) Constructor

nrvo(-1):
A() Constructor

nrvo(1):
A() Constructor

nrvo2(-1):
A() Constructor
RVO Failed (move)

nrvo2(1):
A() Constructor
A(int) Constructor
RVO Failed (move)

hybrid(-1):
A() Constructor
RVO Failed (move)

hybrid(1):
A() Constructor
A(int) Constructor
```

The conclusion is clear:

RVO is successfully performed: just create the unnamed object in the return statement. It can be constructed as part of a more complicated expression too; the only requirement seems to be that it is an rvalue.

NRVO succeeds when you declare the object to be returned and return that object in all paths.
If you return different objects in different paths, no NRVO is performed for any path (this is not too surprising if you know how NRVO is implemented).
The hybrid test shows how these two results combine: if a function performs RVO in one path and attempts NRVO in another, then, by the above, no NRVO is performed, but RVO still is. This is precisely what you see in the hybrid test.

It should be noted, however, that move operations are usually very fast (but not always; see my previous post about small string optimization and move operations), as they typically only copy and zero out a couple of pointer values.

Conclusion:

When you can, construct the rvalue object in the return statement (this works only for trivial cases). RVO always kicks in.

If you need to construct an object differently based on different paths, declare it first, let the different paths do whatever they need, and return that same named object. NRVO will kick in as long as you make sure to always return the same named object. In this case, do not mix in RVO by constructing an unnamed object in one of the return statements, as that disables NRVO for every path.

Put more compactly: use all RVO or all NRVO. Don't combine. Or don't bother too much, as move operations are usually fast anyway.