Table 2 shows the execution times in seconds for running a matrix multiplication application with two 200x200 input matrices. The application runs on an 8-way SMP machine with 200 MHz Pentium Pro processors and 2 GB of RAM, running Linux 2.4.18 and Python 2.2. Each Python thread executes 100 byte codes between context switches.
The last column shows the POSH results obtained when a weight of 100 is assigned to the execution time of a single-threaded calculation.
When using only one worker, POSH performs slightly worse than threads. This is caused by the general overhead of operations on shared objects. Each computed value in the result matrix must be copied once in order to share it, which involves allocating a new object and may require creating a new shared memory region. The application represents each matrix as a list of rows, where each row is itself a list. As noted, complications inherent in shared memory force POSH to reimplement shared lists using the abstraction of memory handles, which inevitably leads to a less efficient implementation. In addition, shared objects are accessed through proxy objects, which adds a level of indirection to every operation and thus degrades performance.
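The cost of this indirection can be illustrated with a minimal pure-Python sketch (a hypothetical stand-in, not POSH's actual proxy implementation): every operation on the proxy is forwarded to the underlying object, so each access pays for at least one extra attribute lookup and one extra method call.

```python
class ListProxy:
    """Hypothetical sketch of a POSH-style proxy: forwards every
    operation to a wrapped list, adding a level of indirection."""

    def __init__(self, target):
        self._target = target

    def __getitem__(self, index):
        # Each read pays for an extra attribute lookup and call.
        return self._target[index]

    def __setitem__(self, index, value):
        # Each write is likewise forwarded to the real object.
        self._target[index] = value

    def __len__(self):
        return len(self._target)


# A matrix row accessed through the proxy behaves like a plain list,
# but every element access goes through the forwarding methods above.
row = ListProxy([0.0] * 200)
row[0] = 3.14
print(row[0], len(row))
```

Benchmarking such a proxy against a bare list with `timeit` shows the per-access overhead that, multiplied over every element of a 200x200 result matrix, accounts for the single-worker slowdown.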
However, POSH achieves the expected scalability on multiple processors, since the computations are performed by separate processes. In this particular application, POSH clearly outperforms threads when more than one worker is employed.
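The process-based approach can be sketched with the standard multiprocessing module (an illustration of the same principle, not POSH itself): each worker process computes a subset of the result rows, so the computation is spread across processors rather than being serialized within a single interpreter.

```python
from multiprocessing import Pool


def multiply_row(args):
    """Compute one row of the product A x B."""
    row, b = args
    cols = len(b[0])
    return [sum(row[k] * b[k][j] for k in range(len(b)))
            for j in range(cols)]


def matmul(a, b, workers=2):
    """Distribute the rows of A across worker processes."""
    with Pool(processes=workers) as pool:
        return pool.map(multiply_row, [(row, b) for row in a])


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(matmul(a, b))  # [[19, 22], [43, 50]]
```

Unlike POSH, this sketch copies the operand matrix to every worker and gathers the results by pickling; POSH instead places the matrices in shared memory, avoiding that per-value serialization at the cost of the sharing overhead discussed above.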
The full source code of the matrix multiplication application is available in the POSH source distribution, as noted in Section 6. A point of interest is that the code executed by a single worker is identical, regardless of whether the worker is a thread or a process. This ensures a fair comparison of execution times, and also demonstrates the high degree of transparency offered by POSH.
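That symmetry can be sketched as follows (a simplified illustration, not the application's actual worker code): the same worker function is handed either to threading.Thread or multiprocessing.Process, and it never needs to know which.

```python
import multiprocessing
import threading


def worker(n, out_queue):
    """Identical code, whether run in a thread or in a process."""
    out_queue.put(sum(i * i for i in range(n)))


if __name__ == "__main__":
    # multiprocessing.Queue is safe for both threads and processes.
    q = multiprocessing.Queue()

    # Run the worker as a thread...
    t = threading.Thread(target=worker, args=(10, q))
    t.start()
    t.join()

    # ...and as a process, with the exact same worker code.
    p = multiprocessing.Process(target=worker, args=(10, q))
    p.start()
    p.join()

    print(q.get(), q.get())  # 285 285
```

The only difference lies in how the worker is launched; the computation itself is untouched, which is what makes a like-for-like timing comparison possible.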