One of the key factors in enabling the compiler and the operating system to squeeze better performance out of the applications that are create is to provide sufficient information about the intent of the code. By providing sufficient information about what the code is trying to do, it is possible to optimise the code for better parallel throughput at compile- and run-time, allowing the developer to focus more on the business-domain problem they are trying to address and leaving the heavy lifting of multi-core and multi-processor scheduling to infrastructure code in the compiler, run-time libraries and operating system.
Looping functions are one of the key areas where splitting parts of the loop across all available hardware resources will generally give an increase in application performance. Consider the trivial case of iteration through all elements in a collection to determine their total. The simplest and most straight forward implementation is:
std::vector<int> v; v.push_back(1); v.push_back(5); int total = 0; for (int ix = 0; ix < v.size(); ++ix){ total += v[ix]; }
This sample is easy for humans to read and write, and for developers used to reading C-language family syntax, the intent of the loop is simple to understand. However, for a compiler and runtime-library combination to come in and schedule the loop across multiple threads, something like the OpenMP pragma directives would be needed to explicitly tell the compiler more about the optimizations that could take place:
std::vector<int> v; v.push_back(1); v.push_back(5); int total = 0; #pragma omp for for (int ix = 0; ix < v.size(); ++ix){ #pragma omp atomic total += v[ix]; }
The first OpenMP directive requests that the for loop should be executed on multiple threads, and the omp atomic directive is used to prevent multiple simultaneous writes to the total variable. For more information on OpenMP, the MSDN Library reference is available for detailed documentation on all the available directives.
If a declarative looping technique had been used, applying parallelism to the vector addition could have been cleaner and simpler. The STL for_each function is an ideal replacement, and the code in the first sample can be re-written as:
class Adder{ private: int _total; public: Adder() : _total(0) {} void operator ( ) ( int& i ) { _total += i; } operator int ( ) { return _total; } }; void VectorAdd() { std::vector<int> v; v.push_back(1); v.push_back(5); int total = std::for_each(v.begin(), v.end(), Adder()); }
In this case, the explicit for loop has been removed, and the actual vector adding code is slightly cleaner, but the need to define a class with a number of operator overloads significantly adds to the complexity of the solution, and unless the code-base has a large number of similar collection addition statements, it is extremely unlikely that a developer would go to the extra effort to define a new class just for the benefit of using the STL for_each function.
Examining the Adder class, it is clear most of its contents exist just to satisfy the requirements to use an instance of it as a function object. The only real logic that is need out of the class is the simple line _total += i. Recognizing this fact, C++ 0x provides a dramatically simplified syntactical technique in the form of lambda functions. Lambda functions remove all the need for the scaffolding code, and allow a predicate function to be defined in-line in another statement. The VectorAdd function can be re-written as:
std::vector<int> v; v.push_back(1); v.push_back(5); int total = 0; std::for_each(v.begin(), v.end(), [&total](int x) {total += x;} );
The syntax for lambda function is reasonably straight-forward. The first element of the lambda in the square brackets tells the compiler that the local variable total is being captured by reference (in this case you are capturing by reference as you would want the results of the vector element addition available after the end of the for_each function), and the second part of the lambda is the parameter list. The final part of the lambda is the function body, which in this case simply adds the value of the x parameter to the total variable.
If no variables need to be captured for the lambda function, or if the lambda function needs to simply capture a copy of a variable, the square-bracket syntax at the start of the lambda expression can be left empty:
std::for_each(v.begin(), v.end(), [](int x) { std::cout << x << std::endl; });
It is also possible to use mixed capture behaviour:
int total = 0; bool displayInput = true; std::for_each(v.begin(), v.end(), [&total, displayInput](int x) { total += x; if (displayInput){ std::cout << x << std::endl; } });
In this case, the displayInput variable is captured by copy, and the Visual C++ compiler will actually raise error C3491: ‘displayInput’: a by-value capture cannot be modified in a non-mutable lambda if an attempt is made to modify its value within the lambda function body.
The final syntactic feature worth covering for lambda functions is the return type. Where possible (and required), the compiler will infer the return type of the lambda expression, but for complex, multi-line expressions, the return type may need to be explicitly declared. Return type declaration is achieved by adding the -> operator and return type between the lambda functions parameters and the method body:
std::for_each(v.begin(), v.end(), [&](int x)->void {total += x;}); }
With lambda functions present in the C++ language, using a more declarative programming style and taking advantage of STL algorithms is made much simpler and cleaner. Lambda functions allow the inline definition of a function body in the code section in which it is to be logically used. As well as providing strong hints to the compiler about potential in-lining optimizations, lambda functions promote a coding style that make discerning the intent about what a section of code is doing much easier. Visual C++ 2010 ships with significant advancements in the area of parallelization, and lambda functions are one of the key enabling technologies for making the use of these advancements feasible.
About the Author
Nick Wienholt is an independent Windows and .NET consultant based in Sydney. He is the author of Maximizing .NET Performance and co-author of A Programmers Introduction to C# 2.0 from Apress, and specializes in system-level software architecture and development, with a particular focus of performance, security, interoperability, and debugging.
Nick is a keen and active participant in the .NET community. He is the co-founder of the Sydney Deep .NET User group and writes technical article for Australian Developer Journal, ZDNet, Pinnacle Publishing, CodeGuru, MSDN Magazine (Australia and New Zealand Edition) and the Microsoft Developer Network. In recognition of his work in the .NET area, Nick was awarded the Microsoft Most Valued Professional Award from 2002 through 2007.