
I am currently going over older PDFs and presentations about D3D12, Mantle, and Vulkan, as well as more recent documentation (mainly from GDC).

I hope to be able to compile what I’ve been able to discern so far here, as well as present an example of some of my work to create an API which can work on top of D3D11, D3D12, and Vulkan with near-optimal performance for each.

Stay tuned!

See Part I here.

This is a follow-up to my earlier post. Noting the issues with contention, I’ve made an adjustment to the string cache: it is now split across multiple subcaches, which are iterated over after every allocation. This way, the contention is better distributed – it’s still there, but this should remove the spikes that were noticeable in earlier tests. I’ve also dropped dlmalloc and nedmalloc, given their atrocious performance, and have added an iterative test as well.
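The subcache rotation can be sketched roughly like this – a hypothetical illustration, not xtd’s actual code (the `SubcachedPool` and `Chunk` names are mine):

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <new>
#include <vector>

// Sketch of a string cache split into several subcaches. Every allocation
// advances to the next subcache, so concurrent threads tend to contend on
// different locks instead of all fighting over a single one.
class SubcachedPool {
public:
    static constexpr std::size_t kSubcaches = 4;
    static constexpr std::size_t kChunkSize = 64;

    struct Chunk {
        void* ptr;
        std::size_t subcache; // which subcache the chunk belongs to
    };

    Chunk allocate() {
        // Round-robin: iterate to the next subcache on every allocation.
        const std::size_t idx =
            next_.fetch_add(1, std::memory_order_relaxed) % kSubcaches;
        Subcache& sc = subcaches_[idx];
        std::lock_guard<std::mutex> lock(sc.mutex);
        if (!sc.free_list.empty()) {
            void* p = sc.free_list.back();
            sc.free_list.pop_back();
            return {p, idx};
        }
        // Nothing cached here: fall back to the heap (never returned to it
        // in this sketch; freed chunks go back to the owning subcache).
        return {::operator new(kChunkSize), idx};
    }

    void deallocate(const Chunk& c) {
        Subcache& sc = subcaches_[c.subcache];
        std::lock_guard<std::mutex> lock(sc.mutex);
        sc.free_list.push_back(c.ptr);
    }

private:
    struct Subcache {
        std::mutex mutex;
        std::vector<void*> free_list;
    };
    std::array<Subcache, kSubcaches> subcaches_;
    std::atomic<std::size_t> next_{0};
};
```

The per-allocation rotation is the point: even a single thread cycles through the locks, which is exactly the “slower when there is no contention” cost mentioned in the conclusion below.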

Single-Allocation Tests


These are the same tests as in the previous post, but with the new string cache changes.

// Pseudocode
for each (thread)
	string str = presized_string
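Fleshed out, such a harness might look like the following sketch. It uses std::string only (xtd::string is not shown), and the timing scheme and function name are my assumptions, not the actual benchmark code:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Sketch of a single-allocation benchmark: every thread constructs a
// string of a fixed size in a loop, and the whole run is timed.
double bench_single_alloc(std::size_t num_threads, std::size_t str_size,
                          std::size_t iterations) {
    const std::string presized(str_size, 'a');
    auto worker = [&] {
        for (std::size_t i = 0; i < iterations; ++i) {
            std::string str = presized; // one allocation (or SBO) per iteration
            // Touch the string so the copy is not trivially optimized away.
            volatile char sink = str.empty() ? '\0' : str[0];
            (void)sink;
        }
    };
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < num_threads; ++t) threads.emplace_back(worker);
    for (auto& th : threads) th.join();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```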

Single Threaded

 

1-Thread single-alloc msvcrt

1-Thread single-alloc TBB

The results for these are very similar. TBB tends to perform a little better. std::string wins with very small allocations, while xtd::string wins with large allocations.

4 Threaded

 

4-Thread single-alloc msvcrt

4-Thread single-alloc TBB

Again, very similar results. std::string greatly wins with small allocations, and the string cache still has contention issues with multiple threads. However, the major spikes seen earlier are no longer there. xtd::string is still faster with large strings.

32 Threaded

 

32-Thread single-alloc msvcrt

32-Thread single-alloc TBB

Very similar results, analysis is mostly the same as the 4-threaded variant.

 

Iterative-Allocation Tests


This test is a bit different. Instead of allocating the string all at once, we append 2-byte chunks onto it. You will notice that xtd::string crushes std::string. This is because std::allocator has no realloc function, a consequence of C++’s focus on non-POD types. It would be reasonable for a system with a realloc_try function (like dlmalloc) to provide a custom std::allocator variant that exposes it and uses it for POD types, but Visual C++’s implementation does not appear to do this. So while std::string must alloc/copy/free every time it enlarges a string, xtd::string simply reallocs – sometimes that amounts to the same work as std::string, but often it doesn’t.

// Pseudocode
for each (thread)
	string str
	while (str.size != max_size)
		str += "aa"
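The realloc-based growth that gives xtd::string its edge can be illustrated with a tiny POD string – a hypothetical sketch (`PodString` and `append` are my names), not xtd’s implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Minimal growable POD buffer. Because char is trivially copyable, growth
// can go through realloc, which may extend the block in place instead of
// the allocate/copy/free cycle std::allocator forces on std::string.
struct PodString {
    char* data = nullptr;
    std::size_t size = 0;
    std::size_t cap = 0;
};

void append(PodString& s, const char* chunk, std::size_t n) {
    if (s.size + n > s.cap) {
        std::size_t new_cap = s.cap ? s.cap * 2 : 16; // doubling growth policy
        while (new_cap < s.size + n) new_cap *= 2;
        // realloc(nullptr, n) behaves like malloc, so the first append works too.
        s.data = static_cast<char*>(std::realloc(s.data, new_cap));
        s.cap = new_cap;
    }
    std::memcpy(s.data + s.size, chunk, n);
    s.size += n;
}
```

Whether realloc actually extends in place depends on the underlying allocator and heap layout – which is why the gap between the two strategies varies across msvcrt, dlmalloc, nedmalloc, and TBB.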

Single Threaded

 

1-Thread iterative-alloc msvcrt

1-Thread iterative-alloc TBB

The results here are rather telling. Being able to reallocate memory, rather than being forced to allocate, copy, and free, substantially changes performance. There may be other factors at play that I am unaware of.

4 Threaded

4-Thread iterative-alloc msvcrt

4-Thread iterative-alloc TBB

The results here are basically the same. I should note that at very small sizes, std::string still wins, but it quickly begins losing once it starts using the heap.

32 Threaded

32-Thread iterative-alloc msvcrt

32-Thread iterative-alloc TBB

Similar results. Uninteresting.

Conclusion


Yeah… if you have large strings, you want to use xtd::string. If you have small strings, you may prefer to use std::string, or you may prefer to disable the string cache for xtd::string. You may also want to disable the subcache support, as it slows things down if there isn’t any contention (probably cache effects).

 

This article has been updated with better information.

While implementing (or, rather, optimizing) xtd::string, I also analyzed MSVC’s implementation of std::string with the standard Microsoft allocators (VC2015 msvcrt), dlmalloc 2.8.6, nedmalloc 1.10-Beta 3, and Intel TBB 4.3 Update 2.

std::string appears to use small buffer optimizations on sizes smaller than 16 bytes. That is, no heap allocations are performed until the size of the string exceeds this. As such, I wanted an equivalent implementation in xtd::string, but I did not want the branching overhead in all operations to determine if we were pointing to a buffer or to ourselves.
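One way to observe this behavior from the outside is to count heap allocations with a replaced global operator new. This is a generic check, not MSVC-specific, and the exact small-buffer threshold varies by standard library:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <string>

// Count heap allocations by replacing the global operator new. With a
// small buffer optimization, constructing a short string should perform
// zero heap allocations; a long one must hit the heap.
static std::size_t g_allocs = 0;

void* operator new(std::size_t n) {
    ++g_allocs;
    if (void* p = std::malloc(n)) return p;
    throw std::bad_alloc();
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

std::size_t allocs_for_string_of(std::size_t len) {
    const std::size_t before = g_allocs;
    std::string s(len, 'x');
    return g_allocs - before;
}
```

A length of 8 fits inside the small buffer on every mainstream implementation (the cutoff is around 15–22 chars depending on the library), while 128 always spills to the heap.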

In xtd::string, an alternative approach is taken. There is a global, locked string pool which reserves a huge chunk of virtual address space (1 GiB by default) for pool chunks. Each string is allotted a 64-byte chunk on its first usage (presuming that usage is 64 bytes or less, of course). This gives similar performance, though the first allocation is worse than with pure SBO, since doing something is slower than doing nothing.
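A drastically simplified sketch of such a pool follows. The real implementation reserves virtual address space (e.g. via VirtualAlloc with MEM_RESERVE) and commits pages on demand; here a plain heap slab stands in for that, and `StringPool` is my own name:

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Simplified global string pool: one slab carved into fixed 64-byte
// chunks, handed out under a single lock (the source of the contention
// discussed in the multi-threaded results).
class StringPool {
public:
    static constexpr std::size_t kChunkSize = 64;

    explicit StringPool(std::size_t num_chunks)
        : slab_(num_chunks * kChunkSize), high_water_(0) {}

    char* allocate() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!free_list_.empty()) {
            char* p = free_list_.back();
            free_list_.pop_back();
            return p;
        }
        if (high_water_ + kChunkSize > slab_.size()) return nullptr; // exhausted
        char* p = slab_.data() + high_water_;
        high_water_ += kChunkSize;
        return p;
    }

    void deallocate(char* p) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_list_.push_back(p);
    }

    // A string only asks the pool if its first allocation fits in one chunk.
    bool eligible(std::size_t first_alloc_size) const {
        return first_alloc_size <= kChunkSize;
    }

private:
    std::mutex mutex_;
    std::vector<char> slab_;
    std::size_t high_water_;
    std::vector<char*> free_list_;
};
```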

Another difference – std::string is designed around std::allocator, which has no reallocation routine. Therefore, every time a string grows, it must allocate, copy, and free. xtd::allocator does have a reallocation function, so it’s often possible for the string’s heap buffer to simply be extended, which is far faster. This benchmark won’t show that, though, since these are immediate assigns. A later (and much slower) incremental benchmark will show it.
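An allocator with a reallocation routine might look like this sketch – the names are illustrative, not xtd::allocator’s actual interface, and std::realloc stands in for whatever the real allocator does internally:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <type_traits>

// Hypothetical allocator with a reallocation routine for trivially
// copyable types – the capability std::allocator lacks.
template <typename T>
struct realloc_allocator {
    static_assert(std::is_trivially_copyable<T>::value,
                  "realloc is only safe for trivially copyable types");

    T* allocate(std::size_t n) {
        return static_cast<T*>(std::malloc(n * sizeof(T)));
    }
    // May extend the existing block in place; realloc itself falls back
    // to allocate/copy/free only when it cannot.
    T* reallocate(T* p, std::size_t n) {
        return static_cast<T*>(std::realloc(p, n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) { std::free(p); }
};
```

The static_assert is the crux: realloc moves bytes without running constructors, which is exactly why a general-purpose std::allocator cannot offer it for arbitrary types.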

This benchmark creates a *::string object and assigns a string of the appropriate size to it. This is the best-case scenario for both std::string and xtd::string. I will later post benchmarks for incremental assignments as well.

// Pseudocode
for each (thread)
	string str = presized_string

Now, the results!


 

1 Threaded

1-Thread msvcrt results

1-Thread dlmalloc results

1-Thread nedmalloc results

1-Thread TBB results

With a single thread, xtd::string appears to always win. There is no contention for the string cache in this case. Interestingly, we get equivalent results with dlmalloc and msvcrt. xtd::string stutters a bit with small strings under nedmalloc, but is still faster. TBB operates very similarly to msvcrt and dlmalloc.
The single-threaded results above are probably the only relevant ones for anything but highly specialized applications. Unless your software is constantly creating small strings on many threads concurrently, you will never run into contention issues, and behavior will be logically the same as single-threaded. That being said…


 

4 Threaded

4-Thread msvcrt results

4-Thread dlmalloc results

4-Thread nedmalloc results

4-Thread TBB results

At four threads, we get very interesting results. With the Microsoft allocators, std::string wins for small strings, but xtd::string wins for large strings that don’t use the string cache. Results are similar with dlmalloc, though std::string fares much worse once it has to hit the heap. dlmalloc is simply not designed for multiple threads, so it must lock on every access; I presume the msvcrt allocators handle this case much better. With nedmalloc, xtd::string is overall worse: its small-string performance is terrible, and its large-string performance is the same. I am not sure why this is. TBB continues to operate very similarly to msvcrt.


 

32 Threaded

32-Thread msvcrt results

32-Thread dlmalloc results

32-Thread nedmalloc results

32-Thread TBB results

At 32 threads, our results start getting weird. This is because of rather high contention, and also because my CPU only has 4 physical cores (and 8 logical cores), so there is likely a significant amount of overhead there. However, we still get different results depending on the allocator.

With msvcrt, the contention in the string cache is very noticeable. However, we immediately become faster once we are in the heap due to using reallocation schemes instead of allocate/copy/free. With dlmalloc, the contention issues within dlmalloc itself become apparent. However, I am intrigued by the performance results here – xtd::string is very flat, but still higher than std::string. nedmalloc looks very similar to the msvcrt variant, except that xtd::string stays slower throughout, whereas with msvcrt it becomes faster with large strings. nedmalloc’s nedmalloc2 appears to perform very poorly in heavily concurrent situations. TBB is still extremely similar to msvcrt. Perhaps that is what VS uses underneath? That, or they are just very similar.


Conclusion

It is important to note that the string cache is only touched if the first allocation is less than or equal to 64 bytes. If you assigned a 65-byte literal to a new string, it would never touch the string cache and would immediately allocate from the heap. Zero-byte strings allocate from neither, requiring no space beyond the 24 bytes of the xtd::string object itself.
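The dispatch described here can be summarized as follows (thresholds from the text; the function and enum names are mine):

```cpp
#include <cassert>
#include <cstddef>

// Sketch of the first-allocation dispatch: empty strings need no storage,
// small first allocations go to the 64-byte string cache, and anything
// larger goes straight to the heap.
enum class Storage { None, Cache, Heap };

Storage storage_for_first_alloc(std::size_t size) {
    if (size == 0) return Storage::None;   // just the 24-byte string object
    if (size <= 64) return Storage::Cache; // one 64-byte cache chunk
    return Storage::Heap;                  // e.g. a 65-byte literal
}
```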

Also, anything other than the single-threaded use-case is extremely rare in anything other than specialized software or strange benchmarks (cough). Even when there are multiple threads, unless you were creating small strings on multiple threads simultaneously all the time, you would never hit contention in the string cache. I’d expect the single-thread results to be the ones to take home.

See Part II here.