Monday, October 21, 2013

How to iterate over the values of a std::map

EDIT: Changed example to better reflect the generality of the problem = title of this post. END EDIT.

Now that C++ has lambdas, the C++ STL algorithms are practically useful, but how do you iterate over the values of a std::map?

Say you have a function template:

template <typename Iter> void f(Iter curr, Iter end);

f assumes the iterators point to std::strings:

std::vector<std::string> v;
f(v.begin(),v.end()); // ok

std::map<int,std::string> m;
f(m.begin(),m.end()); // Won't work: *m.begin() is a pair<Key,Value>

What we thus need is a way of converting the map::iterator (and map::const_iterator) to an iterator to always dereferences the Value type of the pair.

Solution using Boost

The solution is easy using boost:

#define BOOST_RESULT_OF_USE_DECLTYPE
#include <boost/iterator/transform_iterator.hpp>

// First, define a lambda to get the second element of a pair:
auto get_second = [](const std::pair<const int,std::string>& p){ return p.second; };

// Then, we can convert a map iterator into an iterator that automatically dereferences the second element
auto beg = boost::make_transform_iterator(m.begin(),get_second);
auto end = boost::make_transform_iterator(m.end(),get_second);

f(beg,end); // ok, works!

The line
#define BOOST_RESULT_OF_USE_DECLTYPE
is needed so inform the boost libraries that the result type (Value in this case) should be inferred using decltype(), rather than by requiring a result_of typedef (prior to C++11, decltype did not exist).

Solution without Boost

If you can't use boost, you'll need to implement the dereferencing by hand. Here's the code:

#include <map>
#include <iterator>

template <typename Iter>
class map_iterator : public std::iterator<std::bidirectional_iterator_tag,typename Iter::value_type::second_type> {
public:
 map_iterator() {}
 map_iterator(Iter j) : i(j) {}
 map_iterator& operator++() { ++i; return *this; }
 map_iterator operator++(int) { auto tmp = *this; ++(*this); return tmp; }
 map_iterator& operator--() { --i; return *this; }
 map_iterator operator--(int) { auto tmp = *this; --(*this); return tmp; }
 bool operator==(map_iterator j) const { return i == j.i; }
 bool operator!=(map_iterator j) const { return !(*this == j); }
 reference operator*() { return i->second; }
 pointer operator->() { return &i->second; }
protected:
 Iter i;
};

template <typename Iter>
inline map_iterator<Iter> make_map_iterator(Iter j) { return map_iterator<Iter>(j); }


// We can now do:

 std::map<int,std::string> m;
 // ...
 f(make_map_iterator(m.begin()),make_map_iterator(m.end()));


Sunday, October 13, 2013

STL Container Performance


STL Container Performance Table

There's already a good table at Stack Overflow listing time complexity (in Big O notation) of common operations with C++ STL containers for comparison, although it's structured in a more abstract way and a little hard to read because it's not a HTML table. Here's my version, which also includes priority_queue (technically an adaptor).

Persistent iterators means that the iterators are not invalidated by insertions and erase (except when erasing the element referred to by the iterator, which is necessarily invalidated).

ContainerInsertionAccessEraseFindPersistent Iterators
vector / stringBack: O(1) or O(n)
Other: O(n)
O(1)Back: O(1)
Other: O(n)
Sorted: O(log n)
Other: O(n)
No
dequeBack/Front: O(1)
Other: O(n)
O(1)Back/Front: O(1)
Other: O(n)
Sorted: O(log n)
Other: O(n)
Pointers only
list / forward_listBack/Front: O(1)
With iterator: O(1)
Index: O(n)
Back/Front: O(1)
With iterator: O(1)
Index: O(n)
Back/Front: O(1)
With iterator: O(1)
Index: O(n)
O(n)Yes
set / mapO(log n)-O(log n)O(log n)Yes
unordered_set / unordered_mapO(1) or O(n)O(1) or O(n)O(1) or O(n)O(1) or O(n)Pointers only
priority_queueO(log n)O(1)O(log n)--

Always O(1): begin(), end(), empty(), size(), push_back()

The following operations are always O(1) when they exist:
  1. begin(), end()
  2. empty()
  3. size() (note that list::size() was not necessarily O(1) prior to C++11)
  4. push_front() (note that std::vector does not have push_front(), as it would not be O(1))
  5. push_back()

Some Additional Notes

Adaptors: queue and stack

For std::queue and std::stack, complexity depends on the underlying container used (by default std::deque).

vector

std::vector has constant time (O(1)) back insertion provided no reallocation needs to take place (use reverse/capacity to allocate/check). When reallocation is necessary, all elements are copied (or moved, if possible) to a need memory location. It is guaranteed that back insertion is amortized constant, meaning: "if we perform a large amount of back insertions, the average time for back insertion is constant".

Insertion does not invalidate iterators as long as no reallocation is performed (when reallocating, all iterators become invalid). Deletion invalidates all iterators after the deleted element, iterators to elements before are still valid.

deque

Insertion and deletion of elements in a std::deque may invalidate all its iterators. Pointers are however persistent. In practice accessing / iterating over a std::vector is faster than std::deque.

All iterators may become invalid after an insertion or deletion, but pointers/references are always valid.

list

If you have an iterator to an element, you can insert right after (or before) that element in constant time (O(1)). Of course, you can also erase it or access it directly (O(1)) using the iterator (or any adjacent element, as ++iterator / --iterator are constant time operations).

If you only know the index, e.g. that you wish to insert/retrieve/erase the 4th element, you'll need to iterate the list until you reach that element. Put differently: std::list does not provide random access.

sorted vector and deque

To search for an element in a sorted std::vector or std::deque, use std::equal_range. If only the first element is needed, there is std::lower_bound. If you only want to know whether an element exists or not, there is std::binary_search.

set and map

Requires a less-than comparison function. Complexities also apply to multiset and multimap.

unordered_set and unordered_map (hash tables)

unordered_set and unordered_map has constant time performance on all operations provided no collisions occur. When collisions occur, traversal of a linked list containing all elements of the same bucket (those that hash to the same value) is necessary, and in the worst case, there is only one bucket; hence O(n).

Requires a hash function and equality comparison function. Complexities also apply to unordered_multiset and unordered_multimap.

Deletion does not invalidate any iterators (other than erased element). Insertion keeps all iterators valid as long as no rehashing is done. When rehashing is performed, all iterators become invalid. Pointers/references to elements always remain valid.

multiset, multimap, unordered_multiset, unordered_multimap

std::multiset and std::multimap follow the same rules as std::set and std::map.
std::unordered_multiset and std::unordered_multimap follow the same rules as std::unordered_set and std::unordered_map.

The only reason they are not listed in the table/throughout this document is to save space.

basic_string

Strictly speaking std::string and std::wstring are typedefs for basic_string, which is a container. What applies to string above applies more generally to basic_string (and hence, to std::wstring too).

Note that prior to C++11 basic_string was not required to store its elements (characters) contiguously. Now it acts as std::vector, except its only valid for POD types and some tricks that don't violate the constraints may be employed (in practice: small string optimization).

priority_queue

Requires a less-than comparison function. Always gives the greatest element (according to comparison function) when top() is called. top() is constant time, but pop() requires O(log n) as the queue needs to be rearranged (to ensure next top() correctly gives greatest element).


Sunday, October 6, 2013

How to Crack Door Codes and Suitcases Fast

Cracking Code Locks


When I was still a student in Linköping University, my friends were staying in student apartments where the main entrance used a code lock. The lock required a four digit passcode, but there was something unusual about it: it lacked the green "OK"/"Enter" button to confirm once choice. Instead, the code was automatically verified as the digits were inserted. This may not seem like such a big deal, but, as we will see, it is actually.

Consider that the code is 1234. Let us now say that you are returning home on a Saturday morning at 3.30 AM and wish to enter your building. Since you are not at your best, you happen to start by pressing a "2" instead of the "1". It's not such a big deal, you think, and you now press "1234", upon which the door opens. It works as you expected, but there is one crucial detail here: you actually inserted the string "21234", so the first four digits are in fact "2123", which is not the right code. When you then inserted the "4", the verification mechanism clearly knew that it should check the last three digits "123", and append your newly inserted "4".

Clearly then, you have just tried two code: "2123" and "1234", and you have in fact done so with only 5 key presses instead of the 4*2=8 you would normally expect. This is due to the missing "OK" button (and perhaps a "Cancel"/"Restart" button).

This convenience thus comes at a price: if someone wishes to crack the lock (by guessing the code that is, not physically breaking it), that cracker could reuse previously inserted digits as part of the new code. For instance, consider the following. First, we insert "0000", which tries precisely that code. Then, we insert a "1", upon which we are trying the code "0001". If we insert a "0" again now, we are trying "0010" (as the code lock is constantly remembering the last three digits inserted). We can clearly take shortcuts, but how many? If we could constantly insert new codes, without ever having to return to a previous code, we would effectively try all 10^4 = 10000 codes by pressing only 10003 digits (the 10000 codes + 3 digits we need to start the process).

At first glance, it's not even clear if it's possible to find such a sequence. Consider a code that does not deal with digits from 0 to 9, but only the binary 0 and 1. If the code is two digits long, then there are four different possible codes: 00, 01, 10, and 11 (not a very useful code lock for practical purposes, but it serves as an easy example for us to understand the problem). Normally, trying each code means we need to press 2*4=8 buttons (excluding the "OK" in between each). But if we don't have the "OK" button, so that the code lock uses a memory, we can in fact try all four codes by pressing "00110" (five presses instead of eight). To see why, consider each two consecutive digits in the sequence: 00, 01, 11, 10.

Now consider using binary digits again (only two buttons are available: 0 and 1), but with a code of length 3. We now have 8 possible codes: 000, 001, 010, 011, 100, 101, 110, 111. If we start with 000, we need to then append a 1 (otherwise we retry 000), so we begin with 0001. We could then start with 0 again, as the sequence 00010 would try 000, 001, 010 (just look at three consecutive digits in the sequence). Let's add another 0 and see what happens. The sequence is now 000100, which tries 000, 001, 010, 100. These are unique codes, which is great. However, we now run into problems, as the last 2 digits are 00, exactly what we began with. We have already tried both 000 and 001, so we are now forced into retrying a code!

The point is that some sequences will repeat previously tried codes, which is a waste of time. The sequences that do not repeat previous codes are known as De Bruijn sequences. So with binary digits and a length of 2, the sequence 00110 is a De Bruijn sequence, because it tries the codes: 00, 01, 11, 10, that is, all possible combinations exactly once each. The sequence 00011 is not a De Bruijn sequence, as it tries 00 two times (in the first two trials) and does not try the code 10.

Put differently, De Bruijn sequences are the shortest possible sequence of button presses needed to try all codes on doors with code memory (without "OK" button).

Do De Bruijn sequences exist for all number of digits and lengths (k and n)? We succeeded in finding such for binary digits with length 2 (k=2, n=2), but not for binary digits with length 3 (k=2,n=3). And what about the real world door codes, with has 10 digits and typically length 4 (k=10, n=4)?

It turns out that De Bruijn sequences exist for all possible digits k and length n. This means that a real world door code can be cracked in 10003 key presses instead of the expected 40000 (4 per code, 10000 codes). That is, it can be cracked four times faster!

To see that De Bruijn sequences always exist (and how to find them), we first make some observations:
  1. At any point, we have a state, which are the n-1 digits previously inserted (the "memory").
  2. We then make k different choices, each giving us a code. Put differently: at each point (after initializing with n-1 digits), we have n-1 digits in state and we choose one more digit (k choices) to obtain a new code of length n.
  3. To get a code, say 1234, there is in fact only one way to reach it: from the state 123, and adding the digit 4.
We can picture this process in a graph, where the nodes represent states, and the links represent codes. Thus, note that attempts to try a code are on the links (edges), not nodes (vertices). Below you can see such a graph for k=2 (only binary digits) and n=4 (all codes are composed for four binary digits). The correct code might thus be e.g. 1001 or 0111.




Clearly, the question is now: can we traverse this graph in a way that passes each link once and only once? Note that we may be in the same state multiple times. In fact this is always necessary: to try both 0000 and 0001, we need to be in the state 000 first. However, we don't want to try any code more than once, so each link should only be visited once. This is known as a Eulerian path (or cycle, if we finish at the same node as we started). Now, each node in the graph will have exactly k links going out and k links going in (k=2 in the depicted graph), since there are k digits to choose from in each state, and k ways to throw away a digit to obtain the state. For example, in the graph above, there are two ways out of the state 000: by adding another 0, or by adding a 1. This corresponds to trying code 0000 and 0001. There are also two ways to reach 000: from 0000, and from 1000, as we throw away the left most digit (oldest digit).

Since there are equally many links pointing in and out (namely, k), we can be sure there is an Eulerian cycle, so our problem is indeed always solvable! In fact, all solutions are given precisely by all Eulerian cycles, so any algorithm to find cycles will do (Hierholzer's algorithm is not only an efficient and intuitive way of finding such cycles, but also provides a proof of existence).

Cracking Suitcases


Cracking suitcases is similar (k=10, n=3), except that we can now rotate each wheel independently. Thus, in one "click" we can go from 000 to 001, 010, 100, 009, 090, and 900. These six "next codes" corresponding to rotate one of the three wheels one step "up" or "down" (to their adjacent values). If we were to depict the solutions as a graph, where each node again corresponds to state and links to solutions, we have an important difference: the state is now the previous solution (we don't throw away any digits). From each state there are 6 links going out, and in fact, all these links are also going in, since we can always turn the wheel back one step. Thus we may consider the undirected graph where each node has 6 links, and, since every node has an even degree, it is guaranteed that an Eulerian cycle exists.

Note that the "obvious" solution of enumerating does not work: 000, 001, 002, 003, ... 008, 009 is fine, but going from 009 to 010 requires two switches: turning the least significant 9 to 0, and the middle 0 to 1.

In this manner, we only need to rotate the "wheels" a total of 1000 times, as each rotation tries a new code (and there are precisely 1000 codes, namely 000 to 999).