This article has been updated with better information.
While implementing (or, rather, optimizing) xtd::string, I also analyzed MSVC’s implementation of std::string with the standard Microsoft allocators (VC2015 msvcrt), dlmalloc 2.8.6, nedmalloc 1.10-Beta 3, and Intel TBB 4.3 Update 2.
std::string appears to use a small buffer optimization (SBO) for sizes smaller than 16 bytes; that is, no heap allocation is performed until the string outgrows the in-object buffer. I wanted an equivalent optimization in xtd::string, but I did not want every operation to pay the branching overhead of checking whether the string points to an external buffer or into itself.
In xtd::string, an alternative approach is taken. There is a global, locked string pool which reserves a huge chunk of virtual address space (1 GiB by default) for pool chunks. Each string is allotted 64 byte chunks on their first usage (presuming that usage is less than or equal to 64 bytes, of course). This gives similar performance (though worse for the first allocation than pure SBO, since doing something is slower than doing nothing).
Another difference – std::string is designed around std::allocator, which does not have a reallocation routine. Therefore, all string allocations must allocate, copy, and free. xtd::allocator does have a reallocation function, so it’s often possible for the string’s heap buffer to simply be extended, which is far faster. This benchmark won’t show that, though, since these are immediate assigns. A later (and much slower) incremental benchmark will show that.
This benchmark creates a *::string object and assigns a string of the appropriate size to it. This is the best-case scenario for both std::string and xtd::string. I will later post benchmarks for incremental assignments as well.
for each thread:
    string str = presized_string
Now, the results!
1-Thread msvcrt results
1-Thread dlmalloc results
1-Thread nedmalloc results
1-Thread TBB results
With a single thread, xtd::string appears to always win; there is no contention for the string cache in this case. Interestingly, dlmalloc and msvcrt give equivalent results. xtd::string stutters a bit with small strings under nedmalloc, but is still faster. TBB behaves very similarly to msvcrt and dlmalloc.
The single-threaded results above are probably the only relevant ones for anything other than highly specialized applications. Unless your software is constantly creating small strings on many threads concurrently, you would never run into contention issues, and your workload would behave effectively the same as the single-threaded case. That being said…
4-Thread msvcrt results
4-Thread dlmalloc results
4-Thread nedmalloc results
4-Thread TBB results
At four threads, we get very interesting results. With the Microsoft allocators, std::string wins for small strings, but xtd::string wins for large strings that bypass the string cache. Results are similar with dlmalloc, though std::string performs much worse once it has to hit the heap; dlmalloc is simply not designed for multiple threads, so it must lock on every access. I presume the msvcrt allocators handle this case much better. With nedmalloc, xtd::string is worse overall: its small-string performance is terrible, and its large-string performance is merely equal. I am not sure why. TBB continues to behave very similarly to msvcrt.
32-Thread msvcrt results
32-Thread dlmalloc results
32-Thread nedmalloc results
32-Thread TBB results
At 32 threads, our results start getting weird. This is because of rather high contention, and also because my CPU only has 4 physical cores (and 8 logical cores), so there is likely a significant amount of overhead there. However, we still get different results depending on the allocator.
With msvcrt, contention in the string cache is very noticeable. However, xtd::string immediately becomes faster once it falls through to the heap, thanks to reallocation instead of allocate/copy/free. With dlmalloc, the contention issues within dlmalloc itself become apparent; intriguingly, xtd::string's curve is very flat, though still higher than std::string's. nedmalloc looks very similar to the msvcrt variant but is always slower, whereas with msvcrt, xtd::string becomes faster with large strings. nedmalloc's nedmalloc2 appears to perform very poorly in heavily concurrent situations. TBB is still extremely similar to msvcrt. Perhaps that is what VS uses underneath, or they are just very similar.
It is important to note that the string cache is only touched if the first allocation is less than or equal to 64 bytes. If you assigned a 65-byte literal to a new string, it would never touch the string cache and would immediately allocate from the heap. Zero-byte strings allocate from neither, requiring no storage beyond the 24 bytes of the xtd::string object itself.
Also, anything beyond the single-threaded use case is extremely rare outside of specialized software or strange benchmarks (cough). Even with multiple threads, unless you were constantly creating small strings on several threads simultaneously, you would never hit contention in the string cache. I'd expect the single-threaded results to be the ones to take home.
See Part II here.