Synchronization of parallel programs


Abstract: This paper studies the performance implications of architectural synchronization support for automatically parallelized numerical programs. Lastly, the authors ran experiments to quantify the impact of various architectural support on the performance of a bus-based shared-memory multiprocessor running automatically parallelized numerical programs. Both schemes achieve substantial performance improvement over the cases where only atomic test-and-set and exchange-byte operations are supported in shared memory.

Authors: Anik, S.


For regular topologies, however, X 2 does not often yield better results than X 1, as the topologies are usually not amenable to nested parallelization. Hence, X 1 is most realistic when assessing c and C for these problems. For irregular topologies, however, X 2 can yield better results under specific circumstances, as shown later on. Furthermore, application of X 1 is not so obvious due to the absence of a clearly layered structure.

Hence, it is realistic to assume that nested parallelism will be exploited when using SP programming, and that X 2 will be used.

Analytical model
In this section we develop an approximate, analytic model of c to obtain basic insight into the influence of DAG topology and workload distribution. For reasons of analytic tractability, we restrict ourselves to symmetric DAGs and to stochastic node workloads that are independent and identically distributed (iid).
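To make the quantity c concrete, the following sketch (our own illustration, not the authors' code; all names are ours) estimates c by Monte Carlo for a layered stencil DAG: the SP version inserts a barrier between layers, so its critical path is the sum of per-layer maxima, while the NSP version only honors the stencil dependencies.

```python
import random

def critical_paths(P, D, neighbor=1, seed=0):
    """Critical path lengths (T_nsp, T_sp) of one random layered DAG.

    Each of the D layers holds P nodes with iid Exp(1) workloads.  In the
    NSP version a node depends only on its (at most 2*neighbor+1) stencil
    neighbors in the previous layer; the SP version inserts a barrier
    between layers, so its critical path is the sum of per-layer maxima.
    """
    rng = random.Random(seed)
    w = [[rng.expovariate(1.0) for _ in range(P)] for _ in range(D)]
    finish = w[0][:]                 # NSP finish times, layer by layer
    t_sp = max(w[0])                 # SP critical path accumulator
    for d in range(1, D):
        prev = finish
        finish = []
        for i in range(P):
            lo, hi = max(0, i - neighbor), min(P, i + neighbor + 1)
            finish.append(w[d][i] + max(prev[lo:hi]))
        t_sp += max(w[d])
    return max(finish), t_sp

def gamma(P, D, samples=200):
    """Mean of c = T_sp / T_nsp over random workload samples (c >= 1)."""
    total = 0.0
    for s in range(samples):
        t_nsp, t_sp = critical_paths(P, D, seed=s)
        total += t_sp / t_nsp
    return total / samples
```

For P = 2 the stencil already spans the whole layer, so c = 1 exactly; for wide layers (e.g., gamma(32, 16)) c settles noticeably above 1 and grows only slowly with P, in line with the logarithmic trend derived below.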

The reason is that negative-exponential distributions, or distributions with comparable skewness and kurtosis, are rarely found in real workloads [21]: at the task level, workload distributions are primarily determined by the sum of many smaller terms, such as conditional (data-dependent) control flow and contention for resources such as locks, memory, communication links, and CPUs (in the case of multithreading).
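This summation argument is just the central limit effect, and it is easy to check numerically. The sketch below (illustrative only; all names are ours) compares the skewness of a workload that is a single Exp(1) term with one that is the sum of 64 such terms; the skewness collapses toward zero as terms accumulate, i.e., away from the exponential regime.

```python
import random

def skewness(xs):
    """Sample skewness  E[(x - m)^3] / sd^3."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def summed_workload_skew(terms, samples=20000, seed=2):
    """Skewness of a task workload modeled as the sum of `terms` iid
    Exp(1) contributions (branches taken, lock waits, ...).  A single
    Exp(1) term has skewness 2; a sum of k terms has 2/sqrt(k), so the
    distribution rapidly becomes near-symmetric as terms accumulate."""
    rng = random.Random(seed)
    data = [sum(rng.expovariate(1.0) for _ in range(terms))
            for _ in range(samples)]
    return skewness(data)
```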

Despite the choice of a symmetric topology, the model captures the most salient factors with respect to topology and workload, which are varied in the computation-intensive numerical experiments that follow in Section 5. Thus, workload variability (imbalance), which is common in practice, is an important factor.


Discussion
Despite the fact that the above model is an approximation that applies only to symmetric DAGs, it uncovers a number of basic properties of c which, as shown later on, are much more general. First of all, c grows logarithmically, similar to the macro pipeline example in the figure. The logarithmic trend follows from the order statistics that underlie the model: increasing the number of edges per barrier yields a logarithmic increase in critical path length for iid workloads with non-zero variance. Consequently, DAG parallelism P is a key parameter, which is indeed confirmed by the numerical experiments presented later on, its particular region of interest clearly being the high range.
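The order-statistics argument can be checked directly. For iid Exp(1) workloads the expected maximum of n values is exactly the harmonic number H_n ≈ ln n + 0.577, so a barrier over n tasks costs Θ(log n) on average. A quick numerical check (exponential chosen purely because E[max] has a closed form; per the text, real workloads are less skewed):

```python
import math
import random

def mean_max(n, samples=5000, seed=1):
    """Monte Carlo estimate of E[max of n iid Exp(1) workloads]."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(samples):
        acc += max(rng.expovariate(1.0) for _ in range(n))
    return acc / samples

def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n, the exact E[max] for n iid Exp(1);
    H_n ~ ln(n) + 0.577, hence the logarithmic barrier penalty."""
    return sum(1.0 / k for k in range(1, n + 1))
```

mean_max(64) agrees with harmonic(64) ≈ 4.74, and doubling n adds only about ln 2 ≈ 0.69 to the expected barrier completion time.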

An equally important parameter is the variance in s. The order-statistical model indicates that for large P (the region of interest) c scales with r. Again, this trend is confirmed by the numerical experiments presented later on. The model also suggests that a third important parameter is the synchronization density S.

Clearly, for large S the NSP DAG is already close to SP form, and little parallelism can be lost. For smaller S, the number of nodes that can still execute in parallel between synchronizations is larger, and more of that parallelism can be destroyed by SP-ization. Hence, c increases with decreasing S. Note that in this sense the model is overly pessimistic, since it is based on X 1. Hence, we expect the maximum to occur for small S, but larger than 1. Indeed, these phenomena are confirmed by the numerical experiments presented later on. As explained in Section 2, we consider computation DAGs (topologies) only, as they provide the salient information on c.

The reason for this is the following. First, the communication workload aspects of synchronization are asynchronous (as explained in Section 2), implying that including them will not change our findings. Furthermore, the typically logarithmic cost of implementing synchronization barriers (recursive doubling) is entirely subsumed by the logarithmic cost due to workload imbalance. Thus, explicitly modeling synchronization essentially does not affect the findings with respect to c, but would increase the complexity of the DAGs and therefore the complexity of the analysis of c. Hence, focusing on computation only does not affect the model.
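For reference, the logarithmic barrier cost mentioned above: both recursive doubling and the dissemination barrier complete in ⌈log2 P⌉ communication rounds. A sketch of the dissemination barrier's round/partner schedule (our own illustration; names are ours):

```python
import math

def dissemination_rounds(P):
    """Rounds needed by a dissemination (or recursive-doubling) barrier."""
    return math.ceil(math.log2(P)) if P > 1 else 0

def partner_schedule(rank, P):
    """(notify, wait-for) partners of `rank` in each round k: it signals
    (rank + 2**k) % P and waits for (rank - 2**k) % P.  After all
    ceil(log2 P) rounds every process has transitively heard from every
    other process, completing the barrier."""
    return [((rank + (1 << k)) % P, (rank - (1 << k)) % P)
            for k in range(dissemination_rounds(P))]
```

Note that unlike recursive doubling, the dissemination scheme needs no power-of-two process count, which is why it is a common barrier implementation choice.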

In summary, although the model has been derived for symmetric DAGs, the theory suggests three key parameters, i.e., P, r, and S. The remaining experiments are conducted in terms of these parameters to assess (1) the range of c, and (2) to what extent these parameters predict c.

Synthetic workloads
In this section we numerically investigate the behavior of c for a wide range of NSP DAG topologies and workload distributions, covering a vast range of application classes.

The purpose of these experiments is to assess (1) what the influence of NSP DAG topology and workload is on c, and (2) to what extent c adheres to the analytical three-parameter model. While most DAGs represent real applications, the random graphs are included in the study to represent highly unstructured applications and erratically designed parallel programs, which may not be covered by the classical application models. Note that the above selection represents a very wide spectrum of topologies, ranging from regular to completely irregular, whose shape is controlled by user setpoints.

The reason for selecting synthetic random workloads is that for the regular applications the task workloads are usually identical and highly deterministic. As in such cases c is invariably close to unity independent of topology, this would prohibit a sensible analysis of the influence of topology.

Hence, we intentionally randomize s (creating imbalance) to enable observing the effect of topology on c. A second reason for using a stochastic workload is that this is a standard performance-modeling approach to account for the effects of dynamic, data-dependent control flow and resource contention (scheduling).

Combined with the wide range of topologies and the stochastic workload distributions (4 coefficient-of-variation settings, 25 samples per setting), the application space, ranging from highly regular DAGs to purely random DAGs, comprises a very large number of sample DAGs. At this stage of our study we focus on c instead of C. The experiments are partially intended to numerically validate the applicability of the analytic model, a process that would be needlessly complicated if the influence of the mapping process would also have to be taken into account, as in the case of C.

Furthermore, as shown later on, applications which are well-programmed and properly mapped do not exhibit the workload variability that is one of the parameters in such a study, which would lead to much too optimistic results regarding the worst-case performance loss when using SP models. The effect of real workload on C is treated in Section 6. See also the triangle discussion in Section 4. In the following, all experiments are based on DAGs that have sufficient depth to expose worst-case c values.

Four sets of experiments are performed, one per setting of the parameter r. The figure clearly shows the logarithmic effect of P and the large influence of r. Initially, one might infer that using an SP model entails quite a performance penalty. However, the above values of P and r should be interpreted with great care, as discussed in Section 5.

(Figures: c vs. P for the 3-point stencil; c vs. S for the S-point stencil.) The effect of S in the denominator of the model equation is apparent. The above measurements are in agreement with the trends predicted by the model. The latter two effects are directly caused by the statistical iid assumption, which, in terms of c, is the most pessimistic assumption that can be made. Again, the model must therefore be seen as establishing a safe upper bound on the actual performance loss incurred in practice. The FFT implements a butterfly dependency structure with log P levels of depth [45], in which the distance to the neighboring node is doubled at each stage.

(Figure: c vs. D, where D equals the number of butterfly stages.) However, the limited c values are due to the fact that the depth D of the butterfly DAG is very small compared to the reference model. While in the reference model c is limited by P rather than by D, in the butterfly DAG the situation is reversed. This corresponds to the limited opportunities for parallelism in a P-wide pipeline (see Fig.). Apart from the low S value, there is another reason for the higher c values. Up to now the layered DAGs have been symmetric, and one parameter, S, has been shown to be sufficient to characterize these DAGs in relation to c.
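The butterfly dependency structure described above, where the partner distance doubles at each of the log2 P stages, can be generated as follows (a sketch; the function name is ours):

```python
def butterfly_edges(P):
    """Edge list of a butterfly DAG with P = 2**k nodes per stage.

    Node (stage, i) feeds both (stage+1, i) and (stage+1, i XOR 2**stage),
    so the partner distance doubles on each of the log2(P) stages.
    """
    assert P > 0 and P & (P - 1) == 0, "P must be a power of two"
    stages = P.bit_length() - 1      # log2(P) levels of depth
    edges = []
    for s in range(stages):
        for i in range(P):
            edges.append(((s, i), (s + 1, i)))
            edges.append(((s, i), (s + 1, i ^ (1 << s))))
    return edges
```

For P = 8 this yields log2(8) = 3 stages of 16 edges each, i.e., a DAG whose depth is indeed very small compared to its width once P grows.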

In the pipeline case, the additional opportunities for parallelism are caused by the directional bias of the edges (in the pipeline all edges point down or to the right). In the biased case, higher values of D increase c. (Figure: c vs. D for FFT.) Thus, although c is still mainly determined by P, there is an additional effect for large D. We accounted for this by proportionally decreasing l and r with increasing row index. Again, the c values are somewhat higher compared to our reference model due to the rightward bias in the direction of all edges.

Unlike the layered topologies, which are generated in terms of P and D, these DAGs are generated in terms of |V| and S, according to the techniques presented in [2]. The influence of r also essentially agrees with the analytical model.
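The paper relies on the generator of [2]; purely as an illustration, a random layered DAG can be synthesized from |V| and S if one reads S as the probability of an edge from each node of one layer to a node of the next (our simplifying assumption, which need not match the technique of [2]):

```python
import random

def random_layered_dag(n_nodes, S, width, seed=0):
    """Random layered DAG of about n_nodes nodes in layers of `width`.

    S is read here as the probability of an edge from each node of one
    layer to a node of the next (a simplified notion of synchronization
    density); every node keeps at least one predecessor so no node
    decouples from the DAG.
    """
    rng = random.Random(seed)
    layers = [list(range(i, min(i + width, n_nodes)))
              for i in range(0, n_nodes, width)]
    edges = []
    for above, below in zip(layers, layers[1:]):
        for v in below:
            preds = [u for u in above if rng.random() < S]
            if not preds:                       # keep the DAG connected
                preds = [rng.choice(above)]
            edges.extend((u, v) for u in preds)
    return layers, edges
```

With S = 1 every layer boundary becomes a full barrier (the DAG is already SP), while S near 0 degenerates to independent chains, matching the intuition that c grows as S shrinks.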


Due to the random synthesis, each point on a curve represents mean values of c for hundreds of DAGs with widely varying topologies, demonstrating that the analytical model is not restricted to regular topologies. Next, we study the impact of S, which is the other topology-synthesis parameter next to |V|. Unlike the regular case, however, P and S are not independent. The curves are based on averaging over hundreds of DAGs per data point.

The coinciding curves in the figure clearly show that R is a key parameter. (Figures: c vs. P for LU; c vs. P for macro pipeline.) The six datasets represent civil engineering structures. The DAG topologies are somewhat regular in the sense that they are layered, with edges restricted to go from one node layer to the next; as the DAG represents an iterative solver, the edge pattern repeats after each layer.

(Figures: c vs. P for random DAGs; c vs. R for random DAGs.) Note that the plots exhibit somewhat more sampling noise than the plots for the random DAGs, which were averaged over many more samples. The six application DAGs are generated from the Diana domain decomposition module, and are subsequently executed in macro dataflow style. Compared to the iterative solver DAGs studied earlier, these DAGs have a more irregular topology, comprising an initial, highly parallel factorization phase followed by a reduction phase, where the DAG ultimately converges into one end node yielding the solution.

Again, the figure shows that c essentially exhibits the same dependency on R as the irregular DAGs discussed earlier.

Summary
In the following we summarize our results. From the above measurements we conclude that for normally distributed, iid workloads the behavior of c is adequately captured by the analytical model. This implies that the model, although developed for regular stencils, applies to a vast range of regular and irregular NSP DAG topologies, and is therefore much more fundamental.

For negligible variance, c is invariably close to 1. Only for considerable variance does c become appreciable. Note, however, that this only applies when node workloads are independent. For highly correlated workloads, again, c will quickly reduce to 1, as is shown in the next section.

Real workloads
In the previous section we numerically validated our theory on c for a wide range of DAG topologies using synthetic workload distributions. To what extent the findings of the previous section apply in practice, in terms of C, depends highly on (1) the actual amount of parallelism at machine level, (2) the actual workload distribution variance, and (3) task workload independence.

As mentioned earlier, mapping the typically high levels of DAG parallelism to the lower levels of machine parallelism decreases P, but also r, as mapping aims to optimize workload balance. In this section we present empirical results based on real workloads from regular and irregular computations. We determine C for the same range of applications studied in the previous section.

In our experiments, we use MPI as the programming interface. First we manually program a classical, optimized NSP version of each application in the study; the SP version is then obtained by adding barriers. These barriers represent the extra synchronizations caused by a nested-parallel construct used in a hypothetical structured language, which would be systematically translated to the MPI interface in the same way.

In this way we are sure that the rest of the program structure is identical and only the effect of the NSP-to-SP transformation is observed, factoring out the peculiarities of a particular SP or NSP compilation (mapping) process as much as possible [27]. The DAGs are mapped onto P processors using a simple data partitioning in all cases (cyclic for LU, to maintain load balance). While earlier we intentionally introduced stochastic workloads in order to investigate the influence of topology on c, the real stencil, FFT, macro-pipeline, and LU computations inherently have very regular workloads where the variance between the tasks is virtually zero (tasks are identical; although for LU the workloads decrease with increasing iteration, the variance within one iteration is virtually zero).
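The cyclic partitioning used for LU can be sketched as follows (illustrative; names are ours). Because the active submatrix of LU shrinks as pivot steps complete, a block partition would progressively idle the processors owning the first rows, whereas a cyclic assignment keeps the remaining rows spread almost evenly:

```python
def cyclic_partition(n_rows, P):
    """Assign matrix row i to processor i % P (cyclic distribution)."""
    return [i % P for i in range(n_rows)]

def remaining_load(assignment, P, first_active_row):
    """Rows each processor still owns after LU has completed the first
    `first_active_row` pivot steps; with a cyclic assignment these
    counts stay within one of each other, hence the good load balance."""
    counts = [0] * P
    for row in range(first_active_row, len(assignment)):
        counts[assignment[row]] += 1
    return counts
```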

Hence, workload balance at the machine level is entirely determined by the load-balancing quality of the mapping process, which is near-perfect.


Hence, the issues related to C are essentially run-time effects, such as the added cost of the additional barrier synchronizations. The small variations in C are due to execution-time variance caused by link contention on the communication networks and stochastic operating-system overhead. The figures clearly show that C is quite small. As P and r are negligible, the performance loss is almost entirely due to the costs associated with the additional synchronization barriers. The target machines are two heterogeneous Beowulf systems, one of 12 nodes (University of La Laguna) and one of 40 nodes (University of Valladolid).

To map the DAGs onto the P machine nodes, a simple, static list scheduler is used that creates a very good load balance, as all applications are compute-bound (communication is negligible; as mentioned earlier, communication only reduces the performance loss, yielding overly optimistic results on C). In contrast to the FEM solvers, which have application-specific workloads, the workloads here are sampled from a Gaussian distribution with l sufficiently large to cause a CPU load that keeps the execution compute-bound.

The message passing between the processors only implements the node precedence relations (the arcs in the DAG); inter-node data communication is not taken into account, as this does not change our findings with respect to C, as explained earlier in the paper. (Figure: C for regular DAGs on the Origin.)


(Figure: C for regular DAGs on the Beowulf systems.) For higher P, actual parallelism no longer increases, while the higher degree of sequentialism in the mapped DAG actually reduces C. As mentioned earlier, all our experiments involve compute-bound DAG execution, as including inter-node communication would only decrease C and c.

Iterative solvers
We use the same 30 DAGs as described earlier in Section 5, but now with the real workload.

As the computational load is typically proportional to the number of nodes allocated per processor, we can estimate l and r in terms of the number of nodes in each processor partition. As the partitioning methods are designed to create a well-balanced partition, we observe very low workload variabilities. The only cases where the values are appreciable are found when the number of nodes per processor becomes small. The above measurements suggest that C will be quite low. A larger impact might be expected for larger P; however, although for large P workload variability is appreciable, the large correlation between tasks in the same DAG column (the vertical DAG axis is the iteration axis) prohibits any logarithmic growth of C.
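Estimating l and r from the partition, under the stated assumption that load is proportional to the node count per processor, is straightforward (a sketch; names are ours):

```python
def partition_stats(nodes_per_processor):
    """Estimate workload mean l and spread r from partition sizes,
    assuming task cost proportional to the node count per processor."""
    n = len(nodes_per_processor)
    l = sum(nodes_per_processor) / n
    r = (sum((s - l) ** 2 for s in nodes_per_processor) / n) ** 0.5
    return l, r
```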

(Table: per problem, |V| and results for P = 4, 8, 16, 32, 64; the problems include Tower.) A border set is the collection of nodes which have a neighbor assigned to another given processor. The border sets of a given processor often overlap slightly, as some vertices have connections to several parts of the DAG. The program executes a loop in two stages: (a) computing the required number of floating-point operations per assigned node; and (b) communicating new values from nodes in the border sets to the corresponding processors.
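The border-set construction described above can be sketched as follows (our own illustration; names are ours). The packing step mirrors the marshalling into a contiguous array noted in the text:

```python
def border_sets(edges, owner):
    """Map (p, q) -> nodes of processor p having a neighbor owned by q.

    `edges` are undirected (u, v) pairs; `owner[n]` is n's processor.
    A node adjacent to several remote parts lands in several border sets
    of its processor, giving the slight overlap noted in the text.
    """
    sets = {}
    for u, v in edges:
        if owner[u] != owner[v]:
            sets.setdefault((owner[u], owner[v]), set()).add(u)
            sets.setdefault((owner[v], owner[u]), set()).add(v)
    return sets

def marshal(border, values):
    """Pack one border set's values into a contiguous list for sending."""
    return [values[n] for n in sorted(border)]
```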

The values for each border set are marshalled into a contiguous array. (Figure: C vs. P for two of the direct solver DAGs.)


In the SP version, a barrier is added after the communication stage, before the next loop iteration. This barrier would be implicit when using a native SP programming model. We have tested executions over a range of iteration counts to ensure that the results are independent of the number of loop iterations.

Hence, a visual presentation is omitted. Table 2 illustrates the high variability of the actual workload per node. However, since the DAGs are scheduled on a smaller number of processors, the effect of this high variability will be limited. (Figure: C vs. P for two of the six DAGs.) The two DAGs shown were chosen because they yield the best and the worst case in terms of C, respectively. These results imply that C is quite limited, and in some cases even becomes less than unity.

This sharp contrast with the values that would be expected according to the variability numbers shown in Table 2 is due to the following reasons. First, this variability occurs mainly across task layers, and this stratification is not affected by the SP-ization algorithm. Second, the results with C less than 1 are produced because the simple scheduling technique used is not the most appropriate one for the NSP DAGs.

Conclusion
In this paper we studied the inherent impact of using structured parallel programming models on the amount of application parallelism that can be expressed, compared to programming without imposing any synchronization structure.

For a wide range of regular and irregular application DAGs with synthetic workloads we establish that the loss of parallelism, if any, is largely determined by DAG topology and workload distribution variability, and is characterized by a simple, analytic model. For stochastic, independent task workloads the loss of parallelism tends to grow logarithmically with DAG parallelism, the effect only being significant for small synchronization densities.

Whereas the merits in terms of, e.g., …

Acknowledgments
The authors extend their gratitude to Dr. …

References
[1] V. Adve, A. Carle, E. Granston, S. Hiranandani, K. Kennedy, C. Koebel, U. Kremer, J. Mellor-Crummey, S. Warren, …
C. Almeida, I. Vasconcelos, J. Rabe, D. …
Andrews, F. Schneider, Concepts and notations for concurrent programming, Comput. Surv., …
Bilardi, Observations on universality and portability in high-performance computing, in: Proceedings of the International Workshop on Innov. …
Bilardi, K. Herley, A. Pietracaprina, BSP vs. …
Blumofe, C. Joerg, B. Kuszmaul, C. Leiserson, K. Randall, Y. …
Bonorden, B. Juurlink, I. …
Chandra, R. Menon, L. Dagum, D. … Informatics, Univ. …
Darlington, Y. Guo, H. To, J. …
Duff, R. …