Applying Multithreading to Resilient Propagation and Backpropagation
This article shows how the Multi Propagation (MPROP) algorithm was implemented for Encog for Java. Though this article focuses on the Java implementation the C# version would be very similar. MPROP is based on resilient propagation, but is designed to work well with multicore computers and gain maximum performance.
Multicore computers are becoming more and more common. There seems to be only some many gigahertz that computer manufactures can squeeze out of CPU's. The real growth in computer performance will likely be in the number of cores contained in a computer's CPU. As of the writing of this document, October 25, 2009, Intel i7 computers can be had for around $1000 USD. An i7 computer makes use of a Quadcore Hyperthreading CPU. This results in 8 processes being available to programs running on this computer. It is virtually impossible to buy a single-core desktop computer.
Unfortunately programs will not take advantage of these new multicore machines unless the program is written to be multithread. A non-threaded application will simply consume nearly all of the processing power of one core and leave the remaining cores virtually idle. Writing programs to be multithread can be tricky. You must be able to break the task up into smaller packets that each thread can process. At some point the threads usually must communicate with each other and aggregate the job back together.
Neural network training is a very time consuming task. Computers can run for hours, if not days, on a single training task. Supervised neural networks are generally trained with resilient propagation (RPROP) or back propagation. RPROP is the more modern of the two and is almost always the preferred solution. I wanted to enhance the Encog Artificial Intelligence Framework to make use of multithreading to provide fast training on multicore machines. I began Googling for how others might have done it. Unfortunately I did not find much on the topic of multithreaded implementations of back propagation or resilient propagation. I found some solutions, but I had my doubts as to how effective they would with large numbers of processors. I wanted a solution that would work with a potentially large number of processors. I did not want a great deal of synchronization between the threads either, as I may want to run this from a grid of computers at some point.
At this point, as of Encog 2.2, I have a fairly efficient multithreaded training process in place. It only works on a single computer at this point, I will leave a grid implementation for a future version of Encog. This is implemented in the Encog class MultiPropagation. Multi Propagation, or MPROP, is a special training technique introduced in Encog that is based on Resilient Propagation.
A short example is provided that will train a neural network with both MPROP and RPROP. This allows me to compare the overall performance of MPROP on various computer hardware. The following is the main method performs this comparison.
Logging.stopConsoleLogging();
BasicNetwork network = generateNetwork();
NeuralDataSet data = generateTraining();
double rprop = evaluateRPROP(network,data);
network.reset();
double mprop = evaluateMPROP(network,data);
double factor = rprop/mprop;
System.out.println("Factor improvement:" + factor);Both MPROP and RPROP will be fed the same training data and neural network. However, the neural network will be reset for each training algorithm so that no learning from the previous training carries through. The training data is random.
The neural network is composed with the following parameters.
| Attribute | Value |
|---|---|
| Input Neurons | 40 |
| Output Neurons | 20 |
| Hidden Layer #1 Neurons | 60 |
The training data is composed of 20,000 input and ideal data pairs.
All of the Encog training algorithms implement the Train interface. This makes them fairly interchangeable. The implementation of the evaluateMPROP and evaluateRPROP is very similar. The implementation of evaluateMPROP is shown here.
MultiPropagation train = new MultiPropagation(network,data);
long start = System.currentTimeMillis();
System.out.println("Training 20 Iterations with MPROP");
for(int i=1;i<=20;i++)
{
train.iteration();
System.out.println("Iteration #" + i + " Error:" + train.getError());
}
train.finishTraining();
long stop = System.currentTimeMillis();
double diff = ((double)(stop - start))/1000.0;
System.out.println("MPROP Result:" + diff + " seconds." );
System.out.println("Final MPROP error: " + network.calculateError(data));
return diff;This is a typical Encog training routine. Here we loop through 20 training iterations. We track the number of seconds that this took. A similar process is done for RPROP.
Multi Propagation Implementation
All propagation training techniques work similarly. Whether it be back propagation, resilient propagation or the Manhattan update rule, the technique is similar. There are two three distinct steps:
1. Perform a Regular Feed Forward Pass
2. Process the levels backwards and determine the errors at each level
3. Apply the changes to the weights and thresholds
First, a regular feed forward pass is performed. The output from each level is kept so that the error for each level can be evaluated independent. Second, the errors are calculated at each level, and the derivatives of each of the activation functions are used to calculate gradient descents. These gradients will be used in the third step.
The third step is what varies among the different training algorithms. Backpropagation simply takes the gradient descents and scales them by a learning rate. The scaled gradient descents are then directly applied to the weights and thresholds. The Manhattan Update Rule only uses the sign of the gradient to decide in which direction to affect the weight. The weight is then changed in either the positive or negative direction by a fixed constant.
RPROP keeps an individual delta value for every weight and thresholds and only uses the sign of the gradient descent to increase or decrease the delta amounts. The delta amounts are then applied to the weights and thresholds.
The MPROP algorithm uses threads to perform steps 1 & 2. The training data is broken into packets that are distributed among the threads. At the beginning of each iteration threads are started to handle each of these packets. Once all threads have completed, a single thread aggregates all of the results from the threads and applies them to the neural network. There is a very brief amount of time where only one thread is executing, at the end of the iteration. This can be seen from the following monitor.

As you can see from the above image, the i7 is currently running at 100%. You can clearly see the end of each iteration, where each of the processors falls briefly. Fortunately, this is a very brief time and does not have a large impact on overall training efficiency. I did try implementations where I did not force the threads to wait at the end of the iteration for a resynchronization. However these did not provide as efficient of training because the RPROP algorithm, upon which MPROP is based, needs all changes applied before the next iteration begins.
The MPROP algorithms uses a number of threads equal to the number of processors reported by Java, plus 1. Unless there is a single processor, then the MPROP algorithms falls back to a regular single threaded RPROP algorithm.
Testing Results
I tried MPROP, using the above mentioned network and training data on three different computer platforms. First we will look at it on an i7 quadcore.
Dell Studio XPS 8000, 8gig RAM Intel i7 at 2.79GHz
Training 20 Iterations with RPROP Iteration #1 Error:1.0592062021803321 Iteration #2 Error:1.0112968157018771 Iteration #3 Error:0.9650583848127503 Iteration #4 Error:0.9269433225981621 Iteration #5 Error:0.8947162095367102 Iteration #6 Error:0.8714873694194031 Iteration #7 Error:0.8445288449926142 Iteration #8 Error:0.8186688302191717 Iteration #9 Error:0.7952278955734976 Iteration #10 Error:0.7717422560410586 Iteration #11 Error:0.7475048877257578 Iteration #12 Error:0.7235382011165326 Iteration #13 Error:0.7026047081990957 Iteration #14 Error:0.6843757761100023 Iteration #15 Error:0.6685206160475999 Iteration #16 Error:0.6539311876046258 Iteration #17 Error:0.6412660225209257 Iteration #18 Error:0.630790400329957 Iteration #19 Error:0.6211146795350724 Iteration #20 Error:0.6136882493691617 RPROP Result:128.562 seconds. Final RPROP error: 0.6075224766406004 Training 20 Iterations with MPROP Iteration #1 Error:0.6075212244066446 Iteration #2 Error:0.8665463281875874 Iteration #3 Error:0.8316846996192032 Iteration #4 Error:0.7451195340393163 Iteration #5 Error:0.7005024644028119 Iteration #6 Error:0.6691870245157884 Iteration #7 Error:0.649034289358449 Iteration #8 Error:0.6339114535879514 Iteration #9 Error:0.6208812103003265 Iteration #10 Error:0.6111566730037973 Iteration #11 Error:0.6056166450414902 Iteration #12 Error:0.6003765685919015 Iteration #13 Error:0.5964873091129251 Iteration #14 Error:0.5932816072550446 Iteration #15 Error:0.5905725872184455 Iteration #16 Error:0.5882703219084173 Iteration #17 Error:0.5863667500894574 Iteration #18 Error:0.5848003831853418 Iteration #19 Error:0.5835759158206529 Iteration #20 Error:0.5823759906353797 MPROP Result:31.88 seconds. Final MPROP error: 0.5814082684159508 Factor improvement:4.032685069008783
As you can see this machine had a factor improvement of about 4 times. This quadcore uses hyperthreading so 8 processors are reported to Java. Therefore MPROP used 9 threads. Despite the fact that hyperthreading reports twice the number of cores than are physically present, I do not find that it executes anywhere near as fast as additional cores. However, it does help. The fact that I got 4 times with 4 cores is really very good. Threading introduces overhead, without the hyperthreading it is very unlikely that the factor would have been 4 or higher. A factor of 4 implies that the thread switching was perfect, and not even the single threaded synchronization time at the end of each MPROP iteration affected it. Hyperthreading is to thank for this. Still, hyperthreading did not get us anywhere near 7 or 8 times.
Now we will look at a Dual Core computer, with no hyperthreading. Here you see the results from a Dual Core iMac.
Training 20 Iterations with RPROP Iteration #1 Error:1.0619945526007815 Iteration #2 Error:1.0173279127855563 Iteration #3 Error:0.9728967012381747 Iteration #4 Error:0.933266210736963 Iteration #5 Error:0.902990819036054 Iteration #6 Error:0.8785929993141319 Iteration #7 Error:0.852770802106324 Iteration #8 Error:0.8297385766666532 Iteration #9 Error:0.8038142080023881 Iteration #10 Error:0.7800412962894463 Iteration #11 Error:0.7587560796078602 Iteration #12 Error:0.7356506865399463 Iteration #13 Error:0.7151090444337569 Iteration #14 Error:0.695744709113637 Iteration #15 Error:0.6788368720802751 Iteration #16 Error:0.6642652711631868 Iteration #17 Error:0.6509332872251975 Iteration #18 Error:0.6397801404435584 Iteration #19 Error:0.630090330257044 Iteration #20 Error:0.6216381426133933 RPROP Result:183.834 seconds. Final RPROP error: 0.6146150864453944 Training 20 Iterations with MPROP Iteration #1 Error:0.6146143819685393 Iteration #2 Error:0.861988806667595 Iteration #3 Error:0.8245303438423693 Iteration #4 Error:0.7518132811207181 Iteration #5 Error:0.7081523967374347 Iteration #6 Error:0.6712984917380188 Iteration #7 Error:0.652201535422028 Iteration #8 Error:0.6406601654553405 Iteration #9 Error:0.629090967114433 Iteration #10 Error:0.6175595827587673 Iteration #11 Error:0.6122245715859175 Iteration #12 Error:0.6062311438183174 Iteration #13 Error:0.6013160315144382 Iteration #14 Error:0.5977755359770852 Iteration #15 Error:0.5946842580058522 Iteration #16 Error:0.5920646899231164 Iteration #17 Error:0.5896362050394485 Iteration #18 Error:0.5876920979572654 Iteration #19 Error:0.5859706453734917 Iteration #20 Error:0.5844223947199683 MPROP Result:97.25 seconds. Final MPROP error: 0.5831411341205304 Factor improvement:1.8903239074550129
As you can see the factor improvement over single threaded RPROP is 1.89. There is no hyperthreading, so it is just the two cores executing. Due to threading overhead and iteration synchronization times, we do not get a perfectly efficient factor of 2.0.
We can also see the results on a single core, hyperthreading computer.
Dimension 8100, Intel Pentium 4 CPU, 3.00 ghtz. 1GB ram
Training 20 Iterations with RPROP Iteration #1 Error:1.055330758837017 Iteration #2 Error:1.0086543528834082 Iteration #3 Error:0.9635498434678869 Iteration #4 Error:0.9234062461127825 Iteration #5 Error:0.893359620995546 Iteration #6 Error:0.8676510392528938 Iteration #7 Error:0.8426219917532086 Iteration #8 Error:0.8176701896241206 Iteration #9 Error:0.7900163103983693 Iteration #10 Error:0.7663833064550073 Iteration #11 Error:0.7423741422166754 Iteration #12 Error:0.7186589370627757 Iteration #13 Error:0.6973331339716679 Iteration #14 Error:0.6786518719968172 Iteration #15 Error:0.6632366051944965 Iteration #16 Error:0.647960280068216 Iteration #17 Error:0.637116419836954 Iteration #18 Error:0.6275179497202836 Iteration #19 Error:0.6193774112511847 Iteration #20 Error:0.6130185860535315 RPROP Result:501.408 seconds. Final RPROP error: 0.6072752146745306 Training 20 Iterations with MPROP Iteration #1 Error:0.607274926192379 Iteration #2 Error:0.8534056403923543 Iteration #3 Error:0.8121626914024609 Iteration #4 Error:0.7330129725685066 Iteration #5 Error:0.6952683973977223 Iteration #6 Error:0.6695707142291197 Iteration #7 Error:0.6540724010870805 Iteration #8 Error:0.6338389819166642 Iteration #9 Error:0.6234153712989258 Iteration #10 Error:0.6134344366833728 Iteration #11 Error:0.605518025749937 Iteration #12 Error:0.6015147936004235 Iteration #13 Error:0.5972975056266082 Iteration #14 Error:0.5941906318175534 Iteration #15 Error:0.5912954856379496 Iteration #16 Error:0.5890932691798955 Iteration #17 Error:0.587244744073592 Iteration #18 Error:0.585618048750272 Iteration #19 Error:0.5842085212827279 Iteration #20 Error:0.5829523593081855 MPROP Result:408.015 seconds. Final MPROP error: 0.5819121474538267 Factor improvement:1.228895996470718
This computer is single core, yet has hyperthreading, so Java reports processors as two. Even with only hyperthreading, a factor improvement is still present, and the MPROP threading algorithm is still beneficial. If MPROP were run on a true single core computer, with no hyperthreading, Java would report the processor count as one and the MPROP algorithms would fall back to single threading.
Conclusions
MPROP offers great performance improvements over the single threaded MPROP algorithms in cases where a reasonably large training set is present and multicore hardware is used. If neither of these two factors are present, MPROP will fall back to RPROP for training. Because of this, it is the best general purpose training algorithm for Encog.
MPROP was introduced with Encog 2.2, but will continue to be enhanced with future versions of Encog. Improvements will include further optimization of the end of iteration synchronization and moving as much of this synchronization code to the threads as possible. Also options will be added at some point to allow MPROP to operate on a grid of computers.
For more information on Encog visit http://www.heatonresearch.com/encog/