TECHNICAL REPORT SERIES

Computer and Information Science Research Center
Department of Computer and Information Science

The Ohio State University
2036 Neil Avenue,
Columbus, OH 43210-1277


**********************NOTE*********************

Abstracts for 1993. NOTE: some titles are available ONLY on paper (this is noted after the title/author section).

If you have trouble accessing a file, please send email to tfletch@cis.ohio-state.edu.


1993


TR46 Closure Properties and Witness Reduction by Sanjay Gupta

Witness reduction has played a crucial role in several recent results in complexity theory. These include Toda's result, which "collapses" PH into parity-P with high probability; Toda and Ogiwara's results, which "collapse" PH into various counting classes with high probability; and hard functions for various function classes studied by Ogiwara and Hemachandra. Ogiwara and Hemachandra's results establish a connection between functions being hard for #P and functions interacting with the class to effect witness reduction. In fact, we believe that the ability to achieve some form of witness reduction is what makes a function hard for a class of functions. To support our thesis we define new function classes and obtain results analogous to those of Ogiwara and Hemachandra. We also introduce the notion of randomly hard functions and obtain similar results.


TR45 Fault Tolerance in a Multisensor Environment by D.N. Jayasimha. UPDATED VERSION 5/94

Replicating sensors is desirable not only to tolerate sensor failures but also to increase the expected accuracy of the ensemble of replicated sensors beyond that obtainable with a single one. Such replication is used in a multisensor environment or in a distributed sensor network. We model a continuous-valued sensor as an interval of real numbers containing the physical value of interest. Given n sensors, of which at most f can suffer arbitrary failures, we present an efficient O(n log n) fault-tolerant algorithm whose output is reliable, that is, guaranteed to contain the correct value at all times, and is fairly accurate when f < n/2. The output of the algorithm could be a single interval or a set of intervals, depending on the nature of the multisensor environment. This algorithm can be used not only to detect all the possibly faulty sensors but also to detect all sets (combinations) of possibly faulty sensors. We prove the following results pertaining to the possibly faulty sensors pointed out by the algorithm: 1) The number of sets each containing f possibly faulty sensors is at most (f+1). 2) The number of possibly faulty sensors pointed to by the algorithm is at most 2f. 3) The number of sets each containing f or fewer faulty sensors is at most (2f+1). These results help narrow down the search to detect faulty sensors and to bound the number of intervals needed to construct an accurate and reliable "abstract sensor".

Key Words: Multiple sensors, distributed sensor networks, fault tolerance, fault detection, concrete, abstract, and reliable sensors, intervals, interval graphs.
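
The interval-fusion step this abstract describes is in the spirit of Marzullo-style sensor fusion: report the regions of the real line covered by at least n-f of the n sensor intervals. A minimal sketch (our own illustration with invented names, not the paper's algorithm; it assumes closed intervals given as (low, high) pairs):

```python
def fuse_intervals(intervals, f):
    """Return the regions covered by at least n-f of the n intervals.

    If at most f sensors are faulty, the union of the returned regions
    is guaranteed to contain the true physical value.
    """
    n = len(intervals)
    events = []
    for lo, hi in intervals:
        events.append((lo, +1))   # interval opens
        events.append((hi, -1))   # interval closes
    # Sort endpoints; open events come before close events at the same point.
    events.sort(key=lambda e: (e[0], -e[1]))
    count, regions, start = 0, [], None
    for x, delta in events:
        prev = count
        count += delta
        if prev < n - f <= count:      # coverage rises to the threshold
            start = x
        elif count < n - f <= prev:    # coverage drops below the threshold
            regions.append((start, x))
            start = None
    return regions
```

As the abstract notes, the output may be a single interval or several: three overlapping sensor readings with one tolerated fault yield one region, while widely separated sensor clusters yield a set of regions.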


TR44 An Anticipation-Based Neural Network Model for Temporal Pattern Generation by DeLiang Wang and Budi Yuwono Not yet avail. via e-mail.

A neural network model of complex temporal pattern generation is proposed and investigated analytically and by computer simulation. Temporal pattern generation is based on recognition of the contexts of individual components. Based on its acquired experience, the model actively yields system anticipations, which are then compared with the actual input flow. A mismatch triggers self-organization of context learning, which ultimately leads to resolving various ambiguities in producing complex temporal patterns. The architecture of the model incorporates a short-term memory for building associations between remote components and recurrent connections for self-organization and component generation in a temporal pattern. Synaptic modification is based on a one-shot normalized Hebbian rule, which is shown to exhibit temporal masking. The major conclusion, namely that the network model can learn to generate any complex temporal pattern, is established analytically. An estimate of the efficiency of the training algorithm is provided. Multiple patterns can be incrementally acquired by the system, which also manifests a form of retroactive interference. The neural and cognitive plausibility of the model is discussed at the end of the paper.


TR43 The Effects of Precedence and Priority Constraints on the Performance of Scan Scheduling for Hypercube Multiprocessors by Phillip Krueger and Davender Babbar

In the absence of scheduling constraints, Scan scheduling has been shown to considerably improve performance in hypercube multiprocessors relative to previously-studied job scheduling disciplines. In practice, jobs may have properties that constrain the choices a job scheduler can make, potentially limiting the scheduler's ability to affect performance. Two common types of job scheduling constraints are precedence and priority constraints. We examine the effects of these constraints on the ability of Scan to improve performance in hypercube systems. We find that, even under severe constraints, Scan scheduling retains its large performance advantage.


TR42 Minimal Global Snapshot and Failure Recovery using Infection by Ravi Prakash and Mukesh Singhal

In this paper, we introduce the concept of infection. Infection helps in keeping track of the causal dependencies in a distributed computation. Recovery algorithms with synchronous checkpointing require periodic collection of consistent snapshots of the system to advance the checkpoint. Previous snapshot collection algorithms force every node in the system to take a snapshot. We have used infection to develop two minimal snapshot collection algorithms. If a node initiates snapshot collection, then only those nodes that have directly or transitively affected the initiator since their last snapshots need to take local snapshots. Infection is also exploited in a minimal rollback/recovery algorithm. During rollback, only those nodes that are causally dependent on the crashed node need to be rolled back. All three algorithms have low overheads, comparable to or less than those of previously proposed algorithms.

Key Words: distributed computation, causal dependency, infection, consistent snapshot, checkpointing, recovery.


TR41 The ALPS Kernel for Processor Networks by Mana Mandal, Mahendra Ramachandran, and Prasad Vishnubhotla.

ALPS is an operating system kernel which provides support for developing portable and efficient parallel programs for MIMD machines. It achieves this by supporting a new paradigm for parallel programming, wherein the shared or distributed nature of memory and the topology of the processor interconnection are transparent to the programmer. Using the ALPS approach, a programmer of a processor network can write a portable parallel program which does not depend on the interconnection topology, and which can still be executed efficiently. This paper presents the ALPS kernel primitives and their implementation on a Transputer network and a Hypercube, and demonstrates how topology independence and efficient scheduling can be achieved in a single framework.

Key phrases: Operating Systems, Parallel Processing, Processor Networks, Topology Independence, Object-Oriented Programming.


TR40 Improving Reusability by Recasting Single Large-Effect Operations as Objects by Bruce W. Weide, William F. Ogden and Murali Sitaraman.

Extending the object-oriented paradigm by "recasting" single large-effect operations as objects markedly improves reusability by increasing flexibility in terms of both functionality and performance. Two examples in this report, a Sorting_Machine and a Spanning_Forest_Machine, illustrate the recasting technique - which might also be called "machine-oriented programming."


TR39 A Universal Framework for Data Transformation by John A. Gawkowski and S.A. Mamrak PAPER ONLY at present

The organization of information manipulated by applications varies widely in its representation. However, there does exist some commonality in the way these differences are realized. A better understanding of such commonality will allow for the development of tools that simplify the task of reconciling those differences, thereby making it easier for applications to share information.

In an effort to achieve this understanding, we have proposed a Universal Framework for Data Transformation. This framework dissects the task of transforming data into three individual components. These three components are steps common to all transformation efforts. Furthermore, individually each of these three steps can be generalized to some degree.

Keywords: code generation, data translation, schema, data model, encoding.


TR38 Visibility computation plays an important role in computer graphics, vision, and robotics. In this paper, visibility problems are considered for dynamic objects in the plane. Given an observer and a set of polygons that are time-varying (i.e., polygons are allowed to move, shrink, etc.), time intervals are determined for each vertex during which the vertex is visible from a given observer position. Two algorithms are presented to perform the task. The first one is simpler to implement, while the second one runs in near-optimal time in the worst case.


TR37 Multicasting using Multidestination-Worms Conforming to Base Routing Schemes by Dhabaleswar K. Panda and Pradeep Prabhakaran.

Multicast schemes proposed for wormhole-routed networks so far are either Hamiltonian-path-based or unicast-based. The former requires virtual channels to avoid deadlock with unicast messages in the system. The latter scheme, while avoiding intermingling of multicast messages with unicast ones, incurs more latency due to the absorb-and-retransmit strategy at intermediate nodes. In this paper, we propose a new multidestination-worm-based scheme where a single worm is used to transmit a message from a source node to a set of destination nodes that are reachable from it along a single path while conforming to the base routing scheme. Since the movement of these worms is controlled by the base routing scheme of the network, there is no need for any virtual channels beyond those required by the base routing scheme itself. Algorithms are presented to create such worms for any multicast destination set for popular routing schemes like e-cube, planar-adaptive, and maximally-fully-adaptive routing using the turn model. We show that as routing adaptivity increases, fewer worms with longer path lengths are sufficient to cover a given destination set. The approach provides a new solution to avoid the deadlock problem in combining unicast and multicast messages on a wormhole-routed network without added virtual channels.

Keywords: Wormhole-routing, multicasting, k-ary n-cube systems, adaptive routing, path-based schemes.


TR36 Broadcasting in k-ary n-cube Wormhole Routed Networks using Path-based Routing by Dhabaleswar K. Panda and Sanjay Singal.

In this paper, we propose a new framework for implementing broadcast on k-ary n-cube wormhole-routed systems. Instead of using the traditional Hamiltonian-path-based or the unicast-based solution, this scheme uses multidestination worms on a set of dimension-disjoint paths. These worms are capable of delivering a message to a set of destination nodes from a single source node on a given path. Using a phase-disjoint scheduling of these worms on the dimension-disjoint paths, we show that one-to-all and all-to-all broadcasting can be implemented on k-ary n-cube systems in at most n phases, incurring at most n communication start-ups. The number of start-ups can be reduced further by using multidimensional paths. Though better broadcasting solutions exist for hypercubes (k=2), our solution is the best for systems with larger k. As low-dimensional systems (2D/3D meshes with larger k) are becoming popular, the proposed solution is significant for implementing collective communication efficiently on these systems.

Keywords: Wormhole-routing, broadcasting, k-ary n-cubes, path-based schemes, communication port model, and collective communication.


TR35 Volume Rendering Polyhedral Grids by Incremental Slicing by Roni Yagel

Some important research results in science rely on the use of various simulation techniques that operate in a space tessellated by a set of polyhedral cells. These cells, which are part of an unstructured grid, introduce exceptional problems with respect to data visualization. Volume rendering techniques, which have been developed primarily to handle Cartesian grids, are painfully non-interactive when it comes to unstructured grids. We describe here an efficient method for rendering unstructured grids that is based on incremental slicing. For each view direction, the grid vertices are transformed to screen space using available graphics hardware. We then incrementally compute the 2D polygon meshes that result from letting a set of equidistant planes, parallel to the screen plane, slice through our grid. Finally, the graphics hardware renders and composites these polygon meshes in front-to-back order. This algorithm can handle any grid made of arbitrary, possibly disconnected, convex polyhedra. It can combine hardware rendering, incremental and adaptive rendering, and parallelized implementation to provide real-time interaction with non-rectilinear grids.


TR34 Processor Allocation and Job Scheduling for Two- and Three-Dimensional Mesh Connected Multiprocessors by Phil Krueger. PAPER ONLY


TR33 On the Embedding of a Class of Regular Graphs in a Faulty Hypercube by Yu-Chee Tseng and Ten-Hwang Lai

A wide range of graphs with regular structures are shown to be embeddable in an injured hypercube with faulty links. These include rings, linear paths, binomial trees, binary trees, meshes, tori, and many others. Unlike many existing algorithms, which are capable of embedding only one type of graph, our algorithm embeds the above graphs in a unified way, all centered around a notion called the edge matrix. In many cases, the degree of fault tolerance offered by the algorithm is optimal or near-optimal.

Key Words: graph embedding, processor allocation, hypercube, binary-reflected trees, fault tolerance.


TR32 Parallel Compacting Free Buddy Subcubes in a Hypercube by Yu-Chee Tseng, Ten-Hwang Lai, and Young Man Kim.

A hypercube can be partitioned into subcubes of various sizes to run independent jobs. As jobs arrive, grabbing subcubes, and leave, releasing subcubes, the system tends to become fragmented. When this happens, one solution that has been proposed is to relocate (or migrate) jobs so as to compact free processors into bigger subcubes. Assuming a circuit-switched or wormhole-routed n-cube using a buddy-system allocation strategy, this paper proposes two algorithms for subcube compaction that are structure-preserving, adjacency-preserving, path-disjoint, and interference-free, with high concurrency during the migration. The first algorithm uses simple dimension-ordered (e-cube) routing and needs at most d steps to free up a d-cube, d < n, provided that there are at least 2^d free processors in the cube. The second algorithm exploits more concurrency and cuts the number of steps down to no more than [d/2], but may not use dimension-ordered routing.

Keywords: hypercube, fragmentation problem, subcube compaction, job migration, buddy system, circuit-switching, wormhole-routing, e-cube routing, deadlock-free.
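
To make the fragmentation problem concrete: in a buddy-system n-cube, a d-subcube corresponds to an aligned block of 2^d consecutive node addresses, so a request can fail even when enough processors are free. A small sketch (our own simplified illustration with invented names, not the paper's compaction algorithms):

```python
def allocate_subcube(free, d, n):
    # Buddy-system allocation in an n-cube: a d-subcube is an aligned block
    # of 2**d consecutive node addresses.  Returns the base address of an
    # allocated subcube, or None if no aligned block is entirely free.
    size = 1 << d
    for base in range(0, 1 << n, size):
        block = set(range(base, base + size))
        if block <= free:
            free -= block
            return base
    return None

# Fragmentation in a 3-cube: four processors are free, yet no aligned
# 2-subcube is entirely free, so a 2-cube request fails ...
print(allocate_subcube({1, 2, 3, 4}, 2, 3))   # None
# ... until job migration compacts the free processors into {0, 1, 2, 3}.
print(allocate_subcube({0, 1, 2, 3}, 2, 3))   # 0
```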


TR31 An Analysis of Data Caching by Duane Buck and Mukesh Singhal. PAPER COPY ONLY

Cache performance is analyzed for the independent reference model (IRM) of data reference in conjunction with the least recently used (LRU) cache page replacement policy. The method has low computational complexity and high accuracy. In addition to the cache hit rate computed by previous analyses, the proposed method computes other important performance measures such as the probabilities of various numbers of references to a page while it is cache resident, the probability of replacing a page, the probability that a page being replaced was modified, and the average number of total references made to all pages since the last reference to a page being replaced.

Keywords: Data caching, independent reference model, hit rate, databases, computer architecture.
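
The model being analyzed is easy to state operationally: pages are referenced independently with fixed probabilities (the IRM) and cached under LRU replacement. A direct simulation of the hit rate, useful as a cross-check on any analytical method (a sketch with invented names; the paper computes its measures analytically, not by simulation):

```python
import random

def lru_irm_hit_rate(probs, cache_size, n_refs=50_000, seed=1):
    # IRM: every reference picks page i with fixed probability probs[i];
    # the cache holds cache_size pages and evicts the least recently used.
    rng = random.Random(seed)
    pages = list(range(len(probs)))
    cache = []          # most recently used page kept at the front
    hits = 0
    for _ in range(n_refs):
        page = rng.choices(pages, weights=probs)[0]
        if page in cache:
            hits += 1
            cache.remove(page)
        elif len(cache) >= cache_size:
            cache.pop()  # evict the LRU page from the back
        cache.insert(0, page)
    return hits / n_refs
```

Since LRU is a stack algorithm, a larger cache can only raise the hit rate; e.g. with reference probabilities (0.5, 0.3, 0.1, 0.1), a 3-page cache hits more often than a 2-page cache.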


TR30 Online Hard Real-Time Scheduling for Hypercube Multiprocessors by Davender Babbar and Phillip Krueger.

Scheduling sporadic tasks is a formidable problem in hard real-time systems, particularly multiprocessor systems. Until now, little research has addressed this problem for hypercube multiprocessors. In this paper we study two online scheduling algorithms for hard real-time environments, Buddy/RT and Stacking, which provide guarantees at the time of arrival of a task. Buddy/RT is a straightforward extension of the well-known Buddy strategy to the real-time environment, while Stacking is a more sophisticated algorithm based on the lessons learned from Buddy/RT. The underlying concept behind the Stacking algorithm is to reduce fragmentation by 'stacking' equal-sized jobs in the time dimension. The performance of the two algorithms is compared through simulation studies. We find that the Stacking algorithm performs significantly better than Buddy/RT over a wide range of workloads, with no increase in complexity.


TR29 Designing Scalable Systems with two-level k-ary n-cube Wormhole-routed Interconnections by Debashis Basak and Dhabaleswar K. Panda.

Recent advancements in VLSI and packaging technologies are making it cost-effective to integrate multiple processing units into a chip or a board. This makes it attractive to build scalable parallel systems using clustered configurations while exploiting communication locality. A variety of clustered architectures proposed in the past, using buses or MINs as the inter-cluster interconnection, do not satisfy both of the above objectives. This paper focuses on design issues in building two-level clustered systems with k-ary n-cube interconnections at both the intra-cluster and the inter-cluster levels. A new framework of composite-routing is proposed to route messages in such clustered systems in a deadlock-free manner. The interplay between various parameters such as cluster size, inter- and intra-cluster topologies, channel widths, routing schemes, message length, and locality of communication in determining system performance and cost has been analyzed under the constant-bisection-bandwidth constraint. Network latency in such a system is analytically modeled by taking into account demand multiplexing of multiple virtual channels on each physical channel. Predictions from this model, together with channel capacity constraints and simulation experiments, are used to determine optimal configurations for a given system size. Our analysis indicates that small clusters with a ring intra-cluster topology and a 2D/3D/4D inter-cluster network connecting these clusters offer the best system performance. This provides guidelines to system designers for building large-scale parallel systems.

Keywords: parallel architectures, interconnection networks, clustered architectures, hierarchical systems, multiprocessors, wormhole-routing.


TR28 Scalable Architectures with k-ary n-cube cluster-c organization by Debashis Basak and Dhabaleswar K. Panda.

Recent advancements in VLSI and packaging technologies are making it cost-effective to integrate multiple processing units into a chip or a board. This makes it attractive to build scalable parallel systems using clustered configurations while exploiting communication locality. A variety of clustered architectures proposed in the past, using buses or MINs as the inter-cluster interconnection, do not satisfy both of the above objectives. This paper proposes a new class of k-ary n-cube cluster-c scalable architectures by combining the scalability of k-ary n-cube wormhole-routed networks with the cost-effectiveness of processor cluster designs. Each cluster consists of c processors, and the possible intra-cluster interconnections can be a k1-ary n1-cube direct network or indirect networks like a MIN or a bus.

This paper focuses on direct cluster interconnection and analyzes the interplay between various parameters such as cluster size, inter- and intra-cluster topologies, channel widths, routing schemes, message length, and locality of communication. System throughput and average message latency are used to determine optimal configurations under the constant-bisection-bandwidth constraint. Our analysis indicates that small clusters with a ring intra-cluster topology and a 2D/3D/4D inter-cluster network connecting these clusters offer the best system performance. This provides guidelines to system designers for building large-scale parallel systems.

Keywords: parallel architectures, interconnection networks, clustered architectures, hierarchical systems, wormhole-routing.


TR27 Benchmarking of Computing Facilities for Analysis and Prediction of Performance Taking into Account Environmental Factors by Davender Babbar, Mervin E. Muller, and Yibin Yang. PAPER COPY ONLY

Factors which can influence the conclusions about the performance of a distributed computing system based upon benchmarking are identified. Attention is given to the identification, understanding, and control of these factors in order to ensure that conclusions based upon benchmarking are valid and can be reproduced by others. Attention is also given to understanding the extent of the robustness of the conclusions if there are changes in workload, hardware, network, software, or performance metrics. The functional components of a system for identification and control of benchmarking are outlined. The paper draws on the experience of the Performance Analysis Laboratory of the Department of Computer and Information Science at The Ohio State University in configuring and evaluating the performance of the Interactive Instructional Computing Facilities of the Department.


TR26 Resilient and Flexible Ring Embedding in an Injured Hypercube by Yu-Chee Tseng and Ten-Hwang Lai.

This paper presents an algorithm for embedding a ring in a damaged n-cube that improves over existing algorithms in both the degree of fault tolerance and the dilation incurred. Existing algorithms are able to tolerate only up to 2n - Theta(sqrt(n log n)) faulty nodes and may produce an embedding with an arbitrarily large dilation. The new algorithm can tolerate up to Theta(2^(n/2)) faulty nodes, and it guarantees a dilation of at most two.

Key Words: graph embedding, hypercube, ring, linear path, fault tolerance.
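
For background, embedding a ring in a fault-free n-cube is classical: the binary-reflected Gray code orders all 2^n node labels so that neighbors differ in exactly one bit, giving a dilation-1 ring. The paper's contribution is achieving nearly this when many nodes have failed. A sketch of the fault-free case only (our illustration, not the paper's fault-tolerant algorithm):

```python
def gray_code_ring(n):
    # Binary-reflected Gray code over n bits: consecutive labels differ in
    # exactly one bit, so the sequence (with wraparound) is a Hamiltonian
    # ring in a fault-free n-cube.
    return [i ^ (i >> 1) for i in range(1 << n)]

ring = gray_code_ring(3)
# Every hop, including the wraparound, crosses a single hypercube link.
assert all(bin(a ^ b).count("1") == 1
           for a, b in zip(ring, ring[1:] + ring[:1]))
```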


TR25 Measures of the Potential for Load Sharing in Distributed Computer Systems by M.G. Sriram and Mukesh Singhal. PAPER COPY ONLY

In this paper we are concerned with the problem of determining the potential for load balancing in a distributed computing system. We define a precise measure of this potential, called the number of sharable jobs, in terms of the number of jobs that can usefully be transferred across sites in the system. Properties of this measure are derived, including a general formula for its probability distribution, independent of any particular queuing discipline. A normalized version of the number of sharable jobs, called the job sharing coefficient, is defined. From the general formula, the probability distribution of the number of sharable jobs is computed for three important queuing models. For models in which an exact expression for this distribution is difficult to obtain, two methods are presented for its numerical computation. The job sharing coefficient is plotted against traffic intensity for various values of system parameters. Both of these measures are shown to be useful analysis tools for understanding the characteristics of load sharing in distributed systems. They can also aid in the design of such systems.

Key Words: Load Sharing, Queues, Sharable Jobs, Job Sharing Coefficient.


TR24 On Isolating an Odd Number of Elements and Its Applications in Complexity Theory by Sanjay Gupta.

PLEASE NOTE: In this abstract (TR24-1993), some of the symbols cannot be reproduced as this file is set up.

Given an arbitrary set S of n-bit vectors, we construct a random subset S' of S such that, with a constant probability, if S is not empty then S' has an odd number of elements. We improve bounds on several recent results in complexity theory by using our construction instead of Valiant and Vazirani's theorem [11]. In particular, we obtain better bounds on polynomials which approximate boolean functions; improve bounds on the running time of the parity-P machine in Toda's result; and improve bounds on the size of threshold circuits accepting languages accepted by AC^0 circuits.


TR23 Task Assignment on Distributed-Memory Systems with Adaptive Wormhole Routing by Vibha A. Dixit-Radiya and Dhabaleswar K. Panda

Assignment of the tasks of a parallel program onto the processors of a distributed-memory system is critical to obtaining minimal program completion time by minimizing communication overhead. The wormhole-routing switching technique, with various adaptive routing strategies, is increasingly becoming the trend for building scalable distributed-memory systems. This paper presents task assignment heuristics for such wormhole-routed systems and analyzes the effect of adaptive routing. A Temporal Communication Graph (TCG) is used to model task graphs and to identify communication steps that conflict both temporally and spatially. Heuristics are proposed to capture temporal link contention and derive an optimal assignment in an iterative manner by pairwise exchange of processors, associated with the critical communication edges, within d hops. The interplay between degree of routing adaptivity, topology, application characteristics, and optimal task assignment is studied through simulation experiments using random task graphs. The study indicates that even for systems supporting fully adaptive routing, optimal task assignment is necessary to reduce program completion time for communication-bound applications. The proposed heuristics are general and can be applied to programs with regular/irregular communication and to any distributed-memory host system supporting message communication over minimal paths.

Index Terms: task assignment, wormhole-routing, adaptive routing, link contention, temporal communication graph, critical path, message-ordering, distributed memory systems.


TR22 An Efficient and Fair Implementation of Flush Primitives by Ashwani Gahlot and Mukesh Singhal

Flush channels have been proposed as an alternative to the FIFO and non-FIFO models of communication and have been found useful in a variety of applications. Flush channels, while retaining the elegance and ease of program development of a FIFO environment, provide a higher level of communication concurrency. Flush channels support four message types, namely forward flush, backward flush, two-way flush, and ordinary.

In this paper we propose a counter-based implementation of flush primitives. The implementations proposed in the past penalize computations which use fewer flush messages because they increase the information that needs to be maintained by the receiver and they increase the size of information carried by each message. The implementations proposed in this paper alleviate these two problems.

Index terms: Flush primitives, communication channels, distributed computation.


TR21 Concurrency Measures for Distributed Computations by Ashwani Gahlot and Mukesh Singhal

The performance of distributed computations has traditionally been characterized by measures based on the number of messages exchanged, the total information exchanged by messages, or the total execution time of the computation. Though important, these measures do not completely characterize a distributed computation. Recently, various measures characterizing the synchronization delays and concurrency in the computation have been proposed. These measures assume that a process is a sequence of events. Thus, irrespective of the number of processors available, no two events on a process can execute simultaneously.

In this paper, we model a process as a partially ordered set of events. Thus, if more than one processor is available, then some events on a process may execute simultaneously. Viewing a process as a partial order results in identification of all interprocess and intraprocess concurrency in the computation and it presents a unified framework to model both sequential and parallel/distributed computation. It also provides a convenient framework to model processes with multiple threads of flow control and systems with multiple processes at each site by simplifying the data structures kept at each site. In this paper, we propose concurrency measures which capture the concurrency perceived by an event and concurrency perceived by the entire computation, when a process is viewed as a partially ordered set of events. Such measures can be used in exploiting the inherent concurrency in a computation by further distributing/consolidating the events in a distributed computation. Along with traditional measures like total execution time, these measures give a better characterization of a distributed computation.

Keywords: Distributed system, concurrency measures, logical clocks.


TR20 Bounding Logical Clocks in Distributed Systems by Ashwani Gahlot and Mukesh Singhal.

A distributed system is characterized by the lack of a global time. Many models of logical time have been proposed in the past to solve this problem. Logical clocks have been developed to deduce causality and potential causality between events in a distributed computation. These clocks are unbounded and thus result in unbounded communication overhead in a distributed computation, where each message carries as its timestamp the value of the local clock at the time of its sending.

In this paper we define the notion of bounded clocks and present algorithms to bound vector clocks in a system that supports the global flush communication primitive. Use of these algorithms results in bounding the communication overhead due to the timestamp carried by a message. Such clocks simplify distributed algorithm development by bounding the logs and other information kept by a process. We briefly discuss applications exemplifying the benefits of bounded clocks.

Key words: Bounded clocks, Causality, F-channels, Global flush primitive, Global snapshot, Debugging.
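
The unbounded clocks at issue are the standard vector clocks: each process keeps one counter per process, and every message carries the whole vector, whose components only ever grow. A textbook sketch for orientation (not the paper's bounded construction):

```python
class VectorClock:
    # Standard (unbounded) vector clock for a system of n processes.
    def __init__(self, pid, n):
        self.pid = pid
        self.v = [0] * n

    def tick(self):                      # internal event
        self.v[self.pid] += 1

    def send(self):                      # returns the message timestamp
        self.tick()
        return list(self.v)              # the whole vector travels

    def receive(self, ts):               # merge the sender's knowledge
        self.v = [max(a, b) for a, b in zip(self.v, ts)]
        self.tick()

def happened_before(a, b):
    # Causality test: a -> b iff a <= b componentwise and a != b.
    return all(x <= y for x, y in zip(a, b)) and a != b
```

Since the components grow without bound, so does the timestamp on every message; bounding the clocks bounds exactly this per-message overhead.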


TR19 Hierarchical Clocks by Ashwani Gahlot and Mukesh Singhal.

Logical clocks and vector clocks have been proposed to capture causality and potential causality between events in a distributed computation. These clocks associate timestamps with events such that the ordering between timestamps captures the causality or potential causality between events. These clocks, however, assume that a process is a sequence of events, i.e., there is no concurrency between events on the same process.

In this paper, we model a process as a partially ordered set of events. Thus, if more than one processor is available, then some events on a process may execute simultaneously. Viewing a process as a partial order results in identification of all interprocess and intra-process concurrency in the computation and it presents a unified framework to model both sequential and parallel/distributed computation. It also provides a better framework to model systems with multiple threads of flow control and systems with multiple processes at each site by simplifying the data-structures kept at each site. Bit-matrix clocks have been proposed to capture causality between events in a distributed system where a process itself is a partially ordered set of events. Bit-matrix clocks have high cost in terms of both storage and communication overheads. We introduce the notion of hierarchical clocks which, like bit-matrix clocks, capture causality between events in a system where a process is a partial order of events. We show that maintaining hierarchical clocks results in substantially reduced storage overhead at processes and reduced communication overhead for each message. We discuss the utility of such clocks in debugging of distributed programs.

Index terms: Logical clocks, vector clocks, bit-matrix clocks, debugging, distributed computation.


TR18 Task assignment on Distributed-Memory Systems with Adaptive Wormhole Routing by V. A. Dixit-Radiya & D.K. Panda. UPDATED VERSION - 2/94

Assignment of the tasks of a parallel program onto the processors of a distributed-memory system is critical to obtaining minimal program completion time by minimizing communication overhead. The wormhole-routing switching technique, with various adaptive routing strategies, is increasingly the trend for building scalable distributed-memory systems. This paper presents task assignment heuristics for such wormhole-routed systems and analyzes the effect of adaptive routing. A Temporal Communication Graph (TCG) is used to model task graphs and to identify communication steps that conflict both temporally and spatially. Heuristics are proposed to capture temporal link contention and to derive an optimal assignment iteratively by pairwise exchange of processors, associated with the critical communication edges, within d hops. The interplay between degree of routing adaptivity, topology, application characteristics, and optimal task assignment is studied through simulation experiments using random task graphs. The study indicates that even for systems supporting fully adaptive routing, optimal task assignment is necessary to reduce program completion time for communication-bound applications. The proposed heuristics are general and can be applied to programs with regular or irregular communication and to any distributed-memory host system supporting message communication over minimal paths.
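
The pairwise-exchange idea can be sketched as a toy hill climber (our own simplification: it minimizes a static communication-volume-times-distance cost and ignores the temporal link contention that the TCG captures):

```python
# Toy pairwise-exchange assignment heuristic. assign[t] = processor of task t;
# comm[i][j] = communication volume between tasks i and j;
# dist[p][q] = hop distance between processors p and q.

def comm_cost(assign, comm, dist):
    n = len(assign)
    return sum(comm[i][j] * dist[assign[i]][assign[j]]
               for i in range(n) for j in range(n))

def pairwise_exchange(assign, comm, dist):
    assign = list(assign)
    improved = True
    while improved:
        improved = False
        best = comm_cost(assign, comm, dist)
        for a in range(len(assign)):
            for b in range(a + 1, len(assign)):
                assign[a], assign[b] = assign[b], assign[a]
                c = comm_cost(assign, comm, dist)
                if c < best:
                    best, improved = c, True      # keep the improving swap
                else:
                    assign[a], assign[b] = assign[b], assign[a]  # undo
    return assign
```

On a 2 x 2 mesh with two heavily communicating task pairs, the exchange loop moves each pair onto adjacent processors.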

Index terms: task assignment, wormhole-routing, adaptive routing, link contention, temporal communication graph, critical path, message-ordering, distributed memory systems.


TR17 Supernodal Sparse Cholesky Factorization on Distributed- Memory Multiprocessors by Kalluri Eswar, P. Sadayappan, C.-H. Huang, & V. Visvanathan.

The concept of supernodes has been widely used in the design of algorithms for the solution of sparse linear systems of equations. This paper discusses the use of supernodes in the design of algorithms for sparse Cholesky factorization on distributed-memory multiprocessors. A new algorithm that is communication efficient, has good load balance, and benefits significantly from supernodes is presented. A taxonomy of distributed sparse Cholesky factorization algorithms is proposed. Performance results on an Intel iPSC/860 multiprocessor are reported.
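
For reference, the basic factorization underlying this work can be sketched as a plain column-oriented Cholesky (a dense, sequential toy, not the paper's sparse supernodal distributed algorithm):

```python
import math

def cholesky(A):
    """Column-oriented Cholesky: A = L * L^T, L lower triangular.
    A must be symmetric positive definite (given as a list of rows)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # subtract contributions of previously computed columns
        s = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(s)
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L
```

In the sparse setting, a supernode is a set of adjacent columns with identical nonzero structure below the diagonal, which lets the inner updates above be performed on dense blocks.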

Keywords: sparse matrices, sparse Cholesky factorization, parallel algorithms, distributed-memory multiprocessors, supernodes.


TR16 Optimal Fully Adaptive Wormhole Routing for Meshes by Loren Schwiebert & D.N. Jayasimha. THIS technical report has been revised, Aug. 13, 1993.

A deadlock-free fully adaptive routing algorithm for 2D meshes which is optimal in the number of virtual channels required and in the number of restrictions placed on the use of these virtual channels is presented. The routing algorithm imposes less than half as many routing restrictions as any previous fully adaptive routing algorithm. It is also proved that, ignoring symmetry, this routing algorithm is the only fully adaptive routing algorithm that achieves both of these goals. The algorithm exploits the fact that for some adaptive routing algorithms, deadlock freedom is possible even when cycles are present in the channel dependency graph. The implementation of the routing algorithm requires relatively simple router control logic. The routing algorithm requires only the minimum number of virtual channels even when extended to arbitrary-dimension meshes, yielding a dramatic reduction in the number of virtual channels needed to support fully adaptive routing. Whereas all previous algorithms required a number of virtual channels that grows exponentially with the dimension of the mesh, the new algorithm requires only 4n - 2 virtual channels for an n-dimensional mesh.
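
The savings can be made concrete with a small calculation (the 2**n column is our stand-in for an exponentially growing prior scheme, not a formula from the report):

```python
def channels_new(n):
    """Virtual channels required by the optimal algorithm, n-dimensional mesh."""
    return 4 * n - 2

# Compare against a hypothetical prior scheme needing 2**n virtual channels.
for n in (2, 4, 8):
    print(n, channels_new(n), 2 ** n)
```

Even at modest dimensions the linear 4n - 2 count quickly falls below any exponential one.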

Keywords: wormhole routing, mesh architectures, routing algorithms, deadlock-freedom, optimality.


TR15 Voxel-Based Morphing (Work in Progress Report) by Asish Law and Roni Yagel.

It is a common practice in the field of graphics to exploit the basic laws of nature and apply them to the benefit of realistic imagery. So far, these applications have been limited to rendering using realistic light models and to animating vibrations, crack propagation, and deformation using physical forces, to name a few.

Morphing is defined as the combination of generalized image warping with a cross-dissolve between image elements. In this report, the use of natural laws of physics is extended to the field of warping. A new method is proposed in which each particle (pixel or voxel) undergoes motion under the influence of elastic restoring forces generated by the dislocation of the Control Points (CPs) used for warping. The motion continues until the particle attains equilibrium once again. The set of new equilibrium points forms the warped image. The motivation for this work lies in the many ways the method can potentially be applied to the simulation of medical and biological behavior. The spring-force model has the potential to be extended to a physical model in which appropriate spring constants are chosen to simulate the deformation of different elastic materials such as bone and muscle. In this progress report, we describe our preliminary effort toward achieving this goal. The initial results presented at the end show the versatility of the method in the field of warping.
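
The relaxation-to-equilibrium idea can be sketched in one dimension (our own toy; the report works in 2D/3D with multiple control points and weighted attachments):

```python
# A particle attached by springs to displaced anchor points moves until the
# net restoring force vanishes; its rest position is the warped location.

def relax(x, anchors, k=1.0, step=0.1, tol=1e-9, max_iter=10000):
    """anchors: list of (position, weight) spring attachments in 1D."""
    for _ in range(max_iter):
        force = sum(w * k * (a - x) for a, w in anchors)
        if abs(force) < tol:
            break                 # equilibrium reached
        x += step * force         # move along the net restoring force
    return x
```

With equal weights, the equilibrium is the mean of the anchor positions, as expected for linear springs.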


TR14 Efficient Feed-Forward Volume Rendering Techniques for Vector and Parallel Processors. Raghu K. Machiraju and Roni Yagel (TR14-1993 replaces TR29-1992)

Volumes, represented by a 3D regular grid of voxels, are being increasingly employed in a variety of scientific visualization and computer graphics applications. However, rendering volumes requires an overwhelming amount of processing power. In this paper we investigate efficient techniques for rendering semi-transparent volumes on vector and parallel processors. Parallelism inherent in a regular grid is obtained by decomposing the volume into geometric primitives called beams, slices and slabs of voxels. By using the adjacency properties of voxels in beams and slices, efficient incremental transformation schemes are developed. These schemes are shown to be very suitable for fast execution on vector and pipelined processors. The slab decomposition of the volume allows the implementation of an efficient parallel feed-forward renderer which includes the splatting technique for image reconstruction and a back-to-front method for creating images. We report the implementation of this feed-forward volume renderer on a hierarchical shared memory machine, the IBM Power Visualization System (PVS). Experimental results are included to show the efficacy of the incremental transformation schemes and the performance of the parallel renderer on the IBM PVS.


TR13 On Formal Verification of Distributed Mutual Exclusion Algorithms. Jyoti Kamal and Mukesh Singhal.

This paper addresses the problem of formal verification of distributed mutual exclusion algorithms. Previously, informal and operational arguments have been used for proving various properties of distributed mutual exclusion algorithms. Such proofs are prone to errors. This paper develops a general framework for formal verification of distributed mutual exclusion algorithms. We use a finite state automaton to specify the execution of a distributed mutual exclusion algorithm. Using this specification and a state-theoretic model of a distributed computation, we develop rigorous criteria for safety, liveness, and fairness properties of distributed mutual exclusion algorithms. We apply this framework for verification of two classical distributed mutual exclusion algorithms, viz., Lamport's algorithm and Ricart-Agrawala's algorithm.
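
The flavor of automaton-based verification can be conveyed with a tiny explicit-state check (our own toy model, not the paper's framework): two processes compete for a single token, and we exhaustively explore the reachable states, asserting the safety property that no two processes are ever in the critical section (CS) together.

```python
# States are tuples (s0, s1, token_free); each process cycles
# idle -> trying -> cs -> idle, entering cs only when the token is free.

def successors(state):
    s0, s1, free = state
    procs = [s0, s1]
    for p in (0, 1):
        q = procs[:]
        if procs[p] == "idle":
            q[p] = "trying"
            yield (q[0], q[1], free)
        elif procs[p] == "trying" and free:
            q[p] = "cs"
            yield (q[0], q[1], False)      # token taken
        elif procs[p] == "cs":
            q[p] = "idle"
            yield (q[0], q[1], True)       # token released

def check_safety():
    seen, frontier = set(), [("idle", "idle", True)]
    while frontier:
        st = frontier.pop()
        if st in seen:
            continue
        seen.add(st)
        assert not (st[0] == "cs" and st[1] == "cs"), "mutual exclusion violated"
        frontier.extend(successors(st))
    return len(seen)               # number of reachable states explored
```

The paper's framework goes further, establishing liveness and fairness criteria as well, which a plain reachability sweep like this does not address.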

Key Words: Verification, Mutual exclusion algorithms, Distributed algorithms.


TR12 Superseded by TR26.ps.Z


TR11 NOTE!! This one has been changed, effective 2/94, New title, new abstract, new TR now available. "Clustering and Intra- Processor Scheduling for Explicitly-Parallel Programs on Distributed-Memory Systems," by Vibha A. Dixit-Radiya & Dhabaleswar K. Panda.

Programs for distributed-memory systems are explicitly parallel and comprised of a set of sequential tasks or processes that communicate via message-passing. The sequence of computation in each task, together with the intermediate send and receive communication steps, exhibits the temporal behavior of the program. In this paper, we show that the two common models of program representation, the precedence graph and the interaction graph models, are insufficient to capture this temporal behavior and hence are not ideal for solving the clustering and scheduling problems. We use a new Temporal Communication Graph (TCG) model to represent such explicitly-parallel programs. This model captures communication dependency and overlap of communication with computation, providing the flexibility to obtain a better estimate of the program completion time. New measures are developed for quantifying critical communication and inter-task parallelism on this model. We analyze the importance of intra-processor scheduling strategy, in addition to clustering, for reducing program completion time. A set of clustering heuristics is proposed and shown to be easily applicable even to logically asynchronous and irregular task graphs. The goodness of our heuristics is verified by applying them to various classes of random task graphs. The proposed heuristics show 10-25% improvement in completion time over existing heuristics.

Index Terms - clustering, intra-processor scheduling, explicitly-parallel programs, distributed-memory systems, temporal communication graph, critical path, degree of parallelism.


TR10 Accelerating Volume Animation by Space-Leaping Roni Yagel and Zhouhong Shi. Avail. in subdirectory TR10-1993 which has 2 files, one for words, one for figures.

In this paper we present a method for speeding the process of volume rendering a sequence of images. Speedup is based on exploiting coherency between consecutive images to shorten the path rays take through the volume. This is achieved by providing each ray with information needed to leap over the empty space and commence volume traversal at the vicinity of meaningful data. The algorithm starts by projecting the volume into a C-buffer (Coordinates-buffer) which stores, at each pixel location, the object-space coordinates of the first non-empty voxel visible from that pixel. For each change in the viewing parameters, the C-buffer is transformed accordingly. In the case of rotation the transformed C-buffer goes through a process of eliminating coordinates that possibly became hidden. The remaining values in the C-buffer serve as an estimate of the point where the new rays should start their volume traversal. This space-leaping method can be combined with existing acceleration techniques and can support any ray traversal algorithm and material modeling scheme. Unlike other space-leaping techniques it does not require 3D preprocessing, and does not suffer from any image degradation. Therefore, this method can be incorporated into any existing ray-casting-based viewing system to provide additional speedup.
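
Building the initial C-buffer can be sketched as follows (our own minimal axis-aligned version; the report transforms and prunes this buffer as the viewing parameters change):

```python
# volume[z][y][x]: rays travel along +z. For each pixel, record the depth z
# of the first non-empty voxel, or None if the ray sees only empty space.
# A ray cast for pixel (x, y) can then leap directly to depth cbuf[y][x].

def build_cbuffer(volume):
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    cbuf = [[None] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for z in range(depth):
                if volume[z][y][x] != 0:
                    cbuf[y][x] = z        # first meaningful data on this ray
                    break
    return cbuf
```

The speedup comes from the subsequent frames: instead of rebuilding this buffer, the method transforms the stored coordinates under the new view.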


TR09 Using Logging and Asynchronous Checkpointing to Implement Recoverable Distributed Shared Memory by Golden G. Richard III and Mukesh Singhal

Distributed shared memory provides a useful paradigm for developing distributed applications. As the number of processors in the system and the running time of distributed applications increase, the likelihood of processor failure increases. A method of recovering processes running in a distributed shared memory environment which minimizes lost work and the cost of recovery is desirable so that long-running applications are not adversely affected by processor failure. This paper presents a technique for achieving recoverable distributed shared memory which utilizes asynchronous process checkpoints and logging of pages accessed via read operations on the shared address space. The proposed scheme supports local process recovery without forcing rollback of operational processes during recovery. Our method is particularly useful in environments where taking process checkpoints is expensive (e.g., in some UNIX [UNIX is a trademark of Novell, Inc.] environments).
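
The checkpoint-plus-read-log idea can be sketched as follows (our own toy illustration of the general scheme; the paper's protocol operates within a full DSM coherence layer):

```python
# A process logs the contents of pages it reads from shared memory. After a
# crash it restores its last checkpoint and replays the logged reads locally,
# so no other process needs to roll back or resupply pages.

class RecoverableProcess:
    def __init__(self, shared_pages):
        self.shared = shared_pages        # stands in for the DSM page store
        self.state = {}
        self.checkpoint = {}
        self.read_log = []                # (page_id, page_contents) pairs

    def read_page(self, pid):
        page = self.shared[pid]
        self.read_log.append((pid, page)) # log value: replay needs no peers
        self.state[pid] = page
        return page

    def take_checkpoint(self):
        self.checkpoint = dict(self.state)
        self.read_log = []                # earlier log entries now obsolete

    def recover(self):
        # local recovery: restore checkpoint, then replay logged reads
        self.state = dict(self.checkpoint)
        for pid, page in self.read_log:
            self.state[pid] = page
```

Because checkpoints are asynchronous, a process only pays the checkpoint cost at times of its own choosing, with the read log covering the interval since.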

Keywords: Distributed shared memory, crash recovery, checkpointing, logging.


TR08 An Articulated Limb Motion Planner Emphasizing Human-Like Movement by David P. Miller and Richard E. Parent, PAPER COPY ONLY at present

Task level animation of articulated figures, such as the human body, requires the ability to generate collision-free goal-directed motion of individual limbs in the presence of obstacles. This paper describes a new articulated limb motion planner for goal-directed point-to-point reaching motions. The produced motion avoids obstacles while optimizing an objective function in an effort to produce human-like movement.

This two phase algorithm utilizes heuristic guided Monte Carlo techniques to create a consistent underlying paradigm. The first phase consists of a potential field based random path planner which generates a population of candidate paths. This initial population is fed into the second phase, a genetic algorithm, which iteratively refines the population as it optimizes the objective function. The refinement process works on the principle of path coherency, the idea that a family of closely related collision-free paths lie in the vicinity of a given collision-free path.

The presented algorithm is flexible in that a wide range of objective functions can be optimized. In addition, the derivatives of the function are not required. Applications of the algorithm include task level animation, ergonomics, and robotics.
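
The second, genetic-refinement phase can be sketched with a toy one-dimensional version (our own minimal GA on scalar waypoints; the paper's planner works in an articulated limb's joint space, with a potential-field planner supplying the initial population):

```python
import random

# Paths are lists of waypoints from 0.0 to 1.0; the objective here is simply
# path length. Mutation perturbs interior waypoints slightly, relying on path
# coherency: close variants of a good path tend to be good as well.

def length(path):
    return sum(abs(b - a) for a, b in zip(path, path[1:]))

def mutate(path, sigma=0.05):
    mid = [p + random.gauss(0, sigma) for p in path[1:-1]]
    return [path[0]] + mid + [path[-1]]      # endpoints stay fixed

def refine(population, generations=200, keep=5):
    for _ in range(generations):
        population.sort(key=length)
        survivors = population[:keep]        # elitism: best paths persist
        population = survivors + [mutate(random.choice(survivors))
                                  for _ in range(len(population) - keep)]
    return min(population, key=length)
```

In the toy objective the optimum is a monotone path of length 1.0; the refinement drives the population toward it without needing derivatives of the objective, mirroring the derivative-free property claimed above.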

Keywords: Computer Animation, Motion Planning, Constrained Optimization, Robotics, Ergonomics.


TR07 Few interfaces of 3D systems have been evaluated empirically. Designers frequently claim that their technique is 'better' than other techniques, but with little evidence to support their claim. StEP(3D), a standardized evaluation plan for the usability of 3D interaction techniques, uses discount methods for performance-based evaluation. It is simple enough that anyone can compare interfaces without special equipment or experience. This paper describes the development and empirical evaluation of StEP(3D). Eighteen recommendations are made for developing future StEPs based on our results and experiences.

Keywords: Discount usability evaluation, Standardized evaluation plan, 3D graphics interaction, Task analysis, Usability measures, Performance time, Satisfaction ratings, Validity and reliability


TR06 Stealth: A Liberal Approach to Distributed Scheduling for Networks of Workstations by Phillip Krueger and Davender Babbar PAPER COPY ONLY at present

Over the past several years, distributed systems composed of computer workstations have become common. To make full use of their often enormous aggregate computing capacity, distributed schedulers must be developed that are appropriate to the unique features of these systems. In this paper, we show that the approach (referred to as conservative) commonly used by distributed schedulers designed for such systems is both unduly costly and fails to exploit large portions of their computing capacities. The goal of the Stealth Distributed Scheduler is to explore a liberal approach to distributed scheduling in this environment. We discuss the design and performance of Stealth, showing that it is better able to exploit computing capacity, yet incurs less overhead than conservative distributed schedulers.


TR05 A Versatile Architecture for the Distributed Sensor Integration Problem by S.S. Iyengar, D.N. Jayasimha, & D. Nadig. PAPER COPY ONLY

In recent years there has been increasing interest in the development of distributed sensor networks (DSNs) for information gathering. This is partly because of the availability of new technology which makes the DSNs economically feasible to implement and the increasing complexity of today's information gathering tasks to which they are applied. These tasks are usually time-critical and rely on the reliable delivery of accurate information for their completion. To meet these requirements, a DSN must be able to dynamically respond to fault conditions, reconfiguring its activities as necessary to compensate for disturbances. Thus, the search for efficient, fault-tolerant architectures for DSNs has become an important area in research. A DSN consists of a set of sensors, a set of processing elements (PEs), and a communication network interconnecting the various PEs. One or more sensors is associated with each PE. We refer to the PE and its associated sensor(s) as a node.

The integration of multiple, disparate sensors into a useful sensor network involves the solution of several different problems. For an excellent discussion of the problems and the current state of the art in multisensor integration, the reader is referred to the survey paper by Luo and Kay [LuKa89]. From a computational viewpoint, however, the efficient extraction of information from noisy and possibly faulty signals emanating from many sensors requires the solution of problems relating a) to the architecture and the fault tolerance of the distributed sensor network, b) to the proper synchronization of sensor signals, and c) to the integration of information to keep the communication and the processing requirements small.

Wesson et al. [Wess81] were the first to attempt designing efficient networks for distributed sensing. They proposed the hierarchical and committee interconnection topologies. A sensor network based on a fixed number of complete binary trees fully interconnected at their roots (we will refer to this network as a flat tree network) was considered in [JaIK91, PIKM91] and the following issues were studied:

(1) the integration of information in real time when clocks at the nodes are not perfect, (2) the transmission of information without incurring heavy communication costs, and (3) the fault tolerance of the network to certain types of faults.

In this paper, which is a continuation of research reported in [JaIK91, PIKM91], we propose a new versatile architecture which has several advantages over the flat tree network. Specifically, the proposed network has better fault-tolerant properties and supports more nodes than the latter with the same diameter. We show how information integration could be achieved in this network and derive an interesting property related to such integration in the presence of faults.


TR04 Not available.


TR03 Trip-based Multicasting in Wormhole-routed Networks by Yu-Chee Tseng and Dhabaleswar K. Panda THIS TECH REPORT HAS BEEN SUPERSEDED BY 1994/TR47.ps

In this paper, we consider the single-source and multi-source multicasting problem in wormhole-routed networks. We propose a general trip-based model for any network that has at least 2 virtual channels per physical channel. The underlying concept is a node sequence called skirt-based trip, which always exists in graphs of any topology. Using such a sequence, we order nodes and edges to eliminate the possible cyclic dependency and develop a multicasting scheme. This scheme is simple, adaptive, distributed and deadlock-free.

Our model can be applied to both irregular and regular networks. The strength of our model is demonstrated by its capabilities: (1) to support fault tolerance and (2) to increase adaptivity by using multiple trips. For an n-dimensional hypercube with up to [(n+1)/2] faults, we show that fault-tolerant multicasting remains possible. For networks with more virtual channels, we demonstrate the concept of multiple trips on the mesh topology. Simulation experiments indicate no degradation in performance for faulty 8-cubes and an improvement of up to 25% for a 16 x 16 mesh with two trips.

Key Words: routing algorithm, interprocessor communication, multicast, virtual channel, wormhole-routing, path-based routing, fault-tolerance.


TR02 Barrier Synchronization in Distributed-Memory Multiprocessors using Rendezvous Primitives by Sandeep K.S. Gupta and Dhabaleswar K. Panda.

This paper deals with barrier synchronization in distributed-memory multiprocessors. We propose new rendezvous and multirendezvous synchronization primitives between two and multiple processors, respectively. The rendezvous primitive can work with either wormhole or circuit-switched routing. Two variations of the multirendezvous primitive, distance-based and synchronization-worm-based, are proposed for wormhole-routed networks. The former works with deterministic and minimal adaptive routing, while the scope of the latter includes fully adaptive routing. Compared to the traditional send and recv, these primitives take fewer communication steps to implement a barrier synchronization among a set of processors. This leads to a significant reduction in synchronization overhead for networks with high communication start-up costs. Using these primitives, we present algorithms for hypercube and mesh networks to barrier-synchronize all or an arbitrary subset of processors known at compile time. Using multirendezvous, we present a barrier synchronization algorithm for a k-ary n-cube multiprocessor. The optimal number of communication steps in this algorithm depends on the ratio of communication start-up to link-propagation cost.
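
The log-round structure that pairwise rendezvous makes possible can be sketched with a butterfly-style simulation (our own illustration of why few rounds suffice; the paper's primitives additionally exploit the network's switching technique):

```python
# On p = 2**d processors, in round k each processor rendezvous with its
# partner i XOR 2**k. Knowledge sets double each round, so after d rounds
# every processor has heard from all p processors: the barrier is complete
# in d = log2(p) rounds, versus p - 1 steps for a naive sequential scheme.

def butterfly_barrier(d):
    p = 2 ** d
    knows = [{i} for i in range(p)]           # who each processor has heard from
    for k in range(d):
        new = [set(s) for s in knows]
        for i in range(p):
            partner = i ^ (1 << k)
            new[i] |= knows[partner]          # a rendezvous exchanges knowledge
        knows = new
    return knows
```

This also shows why a single rendezvous replacing a send/recv pair halves the per-round step count: each exchange is symmetric rather than two one-way messages.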