Thursday, April 19, 2012

SCARY iterators and multimap alternative

I don't find std::multimap particularly useful, mostly because I can't order the duplicates however I want. There has been a proposal to order duplicates according to insertion order (like push_back), but this wouldn't be enough for many needs I have. Instead, I find myself rewriting a particular intance of multimap using something like std::map<Key, std::vector<T>>. Here, I'll show how to write a multimap replacement container.

The class has its own iterators. Every class that can should have them, as the generic algorithms make code vastly more elegant. I'll also take the opportunity to discuss SCARY iterators with code, since Microsoft introduced them in MSVC 11 beta (GCC already has them in libstdc++).

First, we define the multimap class itself, called mmap (for lack of imagination):


template <
 class Key,
 class T,
 class Compare = std::less<Key>,
 class Vector = std::list<T>,
 class MapAllocator = typename std::map<Key,Vector,Compare>::allocator_type
>
class mmap {
 typedef Vector vec_type;
 typedef std::map<Key,Vector,Compare,MapAllocator> map_type;

The first three template arguments are the same as for a std::multimap.
The third is the underlying vector to be used to store all duplicates, which can be a vector, list, queue, and perhaps something more fancy (the restrictions will become obvious as we implement member functions). Perhaps a bit surprising, I have chosen list instead of vector as the default, since it guarantees that no iterator is invalidated when inserting elements, something that will carry over to mmap (erasing won't with equal_range, unfortunately).

We also need some rather mechanical work to get the proper typedefs:


 typedef Key key_type;
 typedef T mapped_type;
 typedef typename std::pair<const Key,T> value_type;
 typedef typename map_type::size_type size_type;
 typedef Compare key_compare;
 typedef MapAllocator allocator_type;
 typedef value_type& reference;
 typedef const value_type& const_reference;
 typedef typename map_type::pointer pointer;
 typedef typename map_type::const_pointer const_pointer;

Here's where things get interesting, we want iterators too. Now, the iterator can either be defined as a nested class, or a separate one. What does the iterator need to contain? It needs to contain two map_type::iterator: one for the current key, and one for the end of the map. Our iterators also need to reference a certain vector element, so we need one vec_type::iterator. We don't need any vec_type::iterator that points to the end of the vector, since this is easily retrieved by the current map_type::iterator.

Clearly, our iterator class depends on the template arguments Key and T, but also on the Vector (since we need a vec_type::iterator). So we have two options here:

Either we define it as a nested class:
class mmap_iterator {
But this would mean that mmap_iterator implicitly has the same template arguments as mmap, so that the template instantiations mmap<int,double> and mmap<int,double,custom_compare> would yield two difference mmap_iterator classes. As we saw, the iterator doesn't need the comparator to be implemented, so this would seem like unnecessary bloat (modern C++ compilers won't generate dead code for member functions of mmap_iterator that we never use, but this doesn't help us much since generic algorithms include pretty much everything our iterators will be able to do anyway).

A better way would be to write our mmap_iterator class separately from mmap, and conveniently type define it as an iterator:


template <class Key, class T, class Vector> class mmap_iterator; // before class mmap
 typedef mmap_iterator<Key,T,Vector> iterator; // inside class mmap

And then define it outside. This seems like a great idea, since our instances mmap<int,int> and mmap<int,int,custom_compare> can now use the same iterator class definition, namely mmap_iterator<int,double,std::list<double>>. This technique, of making iterators independent of template arguments that it doesn't need (in particular, STL comparators and allocators), is known as SCARY iterators. Our mmap class allows the flexibility of using different Vector types (std::list is the default), but this provides no overhead as long as we use the same Vector type everywhere.

Let us then define the mmap_iterator class after the mmap class definition:


template <class Key, class T, class Vector>
class mmap_iterator {
public:
 // Friend declaration
 template <class Key, class T, class Compare, class Vector, class MapAllocator> friend class mmap;
 typedef T value_type;
 typedef T& reference;
 typedef T* pointer;
 typedef const T& const_reference;
 typedef const T* const_pointer;
 typedef std::bidirectional_iterator_tag iterator_category;
 // Note: use default comparator and allocator:
 typedef typename mmap<Key,T,std::less<Key>,Vector>::map_type::iterator iterator;
 typedef typename mmap<Key,T,std::less<Key>,Vector>::range_iterator r_iterator;
 typedef typename mmap<Key,T,std::less<Key>,Vector>::map_type::difference_type difference_type;

This is mostly necessary typedefs for generic algorithms, and a friend declaration so that mmap can access the private members of mmap_iterator. The data member part of mmap_iterator looks like this:


protected:
 iterator cbeg,cend; // current key, end of map
 r_iterator ri; // current element
};

We can then construct a mmap_iterator by supplying the arguments:


// Constructors
 mmap_iterator(iterator i, iterator e, r_iterator j) : cbeg(i), cend(e), ri(j) {}

It would be convenient to have a shorthand for end iterators, so we provide such a constructor too:


mmap_iterator(iterator e) : cbeg(e), cend(e), ri(???) {}

Here, e is intended to be the end iterator, which will then be both the current pointer (cbeg), the end pointer (cend). But what will the "range iterator" j be, that is, the iterator that points into the elements of the Vector? After all, there is no such vector now, since cbeg == cend.

One way would be to simply not assign ri to anything, but this formally leads to undefined behavior when we perform assignments (MSVC 10 in debug mode will complain). There's also an efficiency reason not to do this, which I'll mention when relevant.

So instead we declare an empty dummy vector, which we will always point to when map is at the end:


// Data members of mmap_iterator
protected:
 iterator cbeg,cend;
 r_iterator ri;
 static Vector dummy_vec; // static, always empty. ri points to dummy_end() when cbeg == cend
};

template <class Key, class T, class Vector>
Vector mmap_iterator<Key,T,Vector>::dummy_vec = Vector(); // just an empty vector

And back to our mmap_iterator for end():


mmap_iterator(iterator e) : cbeg(e), cend(e), ri(dummy_vec.end()) {}

We are now ready to define begin() and end(). We've pretty much already covered end(), so let's define it:

template <class Key, class T, class Compare, class Vector, class MapAllocator>
auto mmap<Key,T,Compare,Vector,MapAllocator>::end() -> iterator
{
 return iterator(m.end()); // use one argument version 
}

And here's begin():


template <class Key, class T, class Compare, class Vector, class MapAllocator>
auto mmap<Key,T,Compare,Vector,MapAllocator>::begin() -> iterator
{
 auto e = m.begin();
 if (e != m.end()) {
  return iterator(m.begin(),m.end(),m.begin()->second.begin());
 } else {
  return end(); // empty map
 }
}

The iterator is created by initalizing the three data members we just discussed: the "pointer" to the first key m.begin(), the pointer to the end of keys m.end(), and the first element of the Vector (note that the vector itself is m.begin()->second): m.begin()->second.begin().

Back to the iterator class, we need to traverse it:


// Iteration
 mmap_iterator& operator++() {
  if (++ri == cbeg->second.end()) { // move to next list element
   if (++cbeg == cend) { // move to next group
    ri = dummy_vec.end();
   } else {
    ri = cbeg->second.begin();
   }
  }
  return *this;
 }
 mmap_iterator operator++(int) { auto it = *this; ++(*this); return it; }

The code for traversing backwards is even easier, because we never reach the end:


mmap_iterator& operator--() {
  if (cbeg == cend || ri == cbeg->second.begin()) { 
   --cbeg; // move to previous group
   ri = --(cbeg->second.end());
  } else {
   --ri; // move to previous list element
  }
  return *this;
 }
 mmap_iterator operator--(int) { auto it = *this; --(*this); return it; }

So far so good, but we also need to compare iterators. The correct solution should be obvious:

// Comparisons

bool operator==(const mmap_iterator& it) const { return cbeg == it.cbeg && ri == it.ri; }
bool operator!=(const mmap_iterator& it) const { return !(*this == it); }

Note that if cbeg == it.cbeg, ri and it.ri always reference the same vector, the entire comparison is well defined due to short circuiting (we never reach ri == it.ri if cbeg != it.cbeg, which is important).

Note that when cbeg == it.cbeg == cend (don't write that in real code), ri and it.ri both point to the end of the dummy vector. However, had I not defined the dummy vector, and assuming there's no problem in using uninitialized iterators/variables, these comparisons would have needed modification, since ri and it.ri are uninitialized and could hence point to different things. The correct version would then be:

// Comparisons (bad way, without dummy vector)
 bool operator==(const mmap_iterator& it) const 
       { 
               return cbeg == it.cbeg && (cbeg == cend || ri == it.ri); // Less efficient!
        
bool operator!=(const mmap_iterator& it) const { return !(*this == it); }

Obviously less efficient, and inelegant too.

Continuing, dereferencing is easy:


 // Dereferencing
 reference operator*() { return *ri; }
 pointer operator->() { return &(*ri); }

We can now define some useful members of mmap. For example, to push_front a Key/Value and get back the inserted element:


template <class Key, class T, class Compare, class Vector, class MapAllocator>
auto mmap<Key,T,Compare,Vector,MapAllocator>::push_front(const value_type& p) -> iterator
{
 auto at = m.find(p.first);
 if (at == m.end()) {
  // First entry of this kind
  at = m.insert(map_type::value_type(p.first,Vector(1,p.second))).first;
 } else {
  // Entries already exist, push to front
  at->second.push_front(p.second);
 }
 return iterator(at, m.end(), at->second.begin());
}

And rvalue version wouldn't hurt:


template <class Key, class T, class Compare, class Vector, class MapAllocator>
auto mmap<Key,T,Compare,Vector,MapAllocator>::push_front(value_type&& p) -> iterator
{
 auto at = m.find(p.first);
 if (at == m.end()) {
  // First entry of this kind
  Vector v;
  v.push_back(std::move(p.second));
  at = m.insert(map_type::value_type(std::move(p.first),std::move(v))).first;
 } else {
  // Entries already exist, push to front
  at->second.push_front(std::move(p.second));
 }
 return iterator(at, m.end(), at->second.begin());
}

push_back is similar:


template <class Key, class T, class Compare, class Vector, class MapAllocator>
auto mmap<Key,T,Compare,Vector,MapAllocator>::push_back(const value_type& p) -> iterator
{
 auto at = m.find(p.first);
 if (at == m.end()) {
  // First entry of this kind
  at = m.insert(map_type::value_type(p.first,Vector(1,p.second))).first;
  return iterator(at, m.end(), at->second.begin());
 } else {
  // Entries already exist, push back
  at->second.push_back(p.second);
  return iterator(at, m.end(), --at->second.end());
 }
}

equal_range is a bit trickier, since the "end" iterator will be the first element of the next key:


template <class Key, class T, class Compare, class Vector, class MapAllocator>
auto mmap<Key,T,Compare,Vector,MapAllocator>::equal_range(const Key& k) -> std::pair<iterator,iterator> 
{
 auto at = m.find(k);
 if (at != m.end()) {
  auto next = std::next(at);
  if (next != m.end()) { // "default" case: key found, next element exists
   return make_pair(
    iterator(at,m.end(),at->second.begin()),
    iterator(next,m.end(),next->second.begin()));
  } else { // key found, but there is no next element
   return make_pair(
    iterator(at,m.end(),at->second.begin()),
    end());
  }
 } else { // key not found
  return make_pair(
   iterator(end()),
   iterator(end()));
 }
}

Erase works as you think, but it also has to remove the Vector if it was a singleton:


template <class Key, class T, class Compare, class Vector, class MapAllocator>
auto mmap<Key,T,Compare,Vector,MapAllocator>::erase(iterator it) -> iterator
{
 auto next = std::next(it);
 auto& vref = it.cbeg->second;
 vref.erase(it.ri);
 if (vref.empty()) {
  m.erase(it.cbeg);
 }
 return next;
}

Calculating the sizes can be done by taking std::distance(begin(),end()), but a more efficient way would be to exploit the fact that Vector already knows its size (list::size is O(1) in C++11):


template <class Key, class T, class Compare, class Vector, class MapAllocator>
auto mmap<Key,T,Compare,Vector,MapAllocator>::size() const -> size_type
{
 return std::accumulate(m.begin(),m.end(),size_type(0),
  [](size_type a, const map_type::value_type& p){ return a + p.second.size(); });
}


We can now test the mmap implementation:


#include "mmap.h"
#include <iostream>
#include <string>
#include <algorithm>

int main()
{
 typedef mmap<std::string,std::string> mymap;
 mymap m;
 auto print = [&](mymap& mm){ // should be const mymap&, but we haven't defined const iterators 
  std::cout << "Content of map (size: " << mm.size() << "):\n";
  for (auto i = mm.begin(); i != mm.end(); ++i) {
   std::cout << *i << "\n";
  }
  std::cout << "\n";
 };

 m.push_back(mymap::value_type("J","Josef"));
 m.push_back(mymap::value_type("A","Anna"));
 m.push_back(mymap::value_type("J","Jens"));

 print(m);

 m.push_front(mymap::value_type("A","Anita"));
 m.push_front(mymap::value_type("J","John"));

 print(m);

 auto rng = m.equal_range("A");
 std::cout << "All entries in A:\n";
 std::for_each(rng.first,rng.second,[&](const mymap::mapped_type& s){ std::cout << s << "\n"; });
 std::cout << "\n";

 rng = m.equal_range("J");
 std::cout << "All entries in J:\n";
 std::for_each(rng.first,rng.second,[&](const mymap::mapped_type& s){ std::cout << s << "\n"; });
 std::cout << "\n";

 // Erase all names containing 'n'
 for (auto i = m.begin(); i != m.end(); ) {
  if (i->find('h') == std::string::npos) {
   i = m.erase(i);
  } else {
   ++i;
  }
 }

 std::cout << "After erasing all names without 'h':\n";
 print(m);

 m.clear();

 std::cout << "After clearing\n";
 print(m);
}

This prints:

Content of map (size: 3):
Anna
Josef
Jens

Content of map (size: 5):
Anita
Anna
John
Josef
Jens

All entries in A:
Anita
Anna

All entries in J:
John
Josef
Jens

After erasing all names without 'h':
Content of map (size: 1):
John

After clearing
Content of map (size: 0):

Now let us return to SCARY iterators. My implementation of mmap_iterator was independent of the comparator (template argument Compare) and map allocator (template argument MapAllocator). (The Vector allocator is implicitly defined in Vector, so there was no need to provide it separately.)

Now, in the test suite I use a mmap<std::string,std::string>, so it uses the default arguments for Compare and MapAllocator. But what if I use a mmap<std::string,std::string,my_compare>? My code will still use the same mmap_iterator instance: mmap_iterator<std::string,std::string,std::list<std::string>>.

Notably, my mmap_iterator has the following type definitions:
typedef typename mmap<Key,T,std::less<Key>,Vector>::map_type::iterator iterator; typedef typename mmap<Key,T,std::less<Key>,Vector>::range_iterator r_iterator;

Put differently, I'm relying on the default comparator and allocator for cbeg, cend, and ri.
If I now use mmap<std::string,std::string,my_compare>, things will not go smoothly, as cbeg is of type mmap<std::string,std::string,defaults here...>, but, since MSVC 10 doesn't support comparator independence in iterators (SCARY iterators), it generates code which looks like an attempt at illegal type conversion due to the perceived difference between using std::less<Key> and my_compare.

Let's try it:

// in main.cpp

// No trivial lambda to pointer conversion in MSVC 10, need to define functor outside
struct my_compare {
bool operator()(const std::string& s, const std::string& t) {
return std::lexicographical_compare(s.rbegin(),s.rend(),t.rbegin(),t.rend());
}
};

int main()
{
mmap<std::string,std::string,my_compare> rm;
std::for_each(rm.begin(),rm.end(),[&](const std::string& s){ std::cout << s << "\n"; });
}

Trying to compile, as expected we get:
error C2665: 'mmap_iterator<Key,T,Vector>::mmap_iterator' : none of the 3 overloads could convert all the argument types
(and a lot more of the barely comprehensible template output)

My point here is the following: there is a proposal to make SCARY iterators part of the standard, so that my code above is standard compliant. It is not now, because I'm assuming that the underlying implementation of iterators (in particular, std::map and Vector's) satisfy the "SCARY condition". MSVC 11 will implement SCARY iterators, so the code will be fine.

But look at what happens if SCARY iterators do NOT get standardized: users of C++, like me, who want to wrap some composition of STL containers into a nice class and provide iterators for access to generic algorithms, won't have the chance to provide an efficient implementation using SCARY iterators, unless the STL itself does so. Of course, one could always implement my mmap from scratch, but that's not really what C++ programming is about. The containers are meant to be used; they increase productivity.

Why not help the user increase performance and reduce bloat too, then?
I hope SCARY iterators will be standardized.

No comments:

Post a Comment