Technical Report Series
Computer and Information Science Research Center
Department of Computer and Information Science
The Ohio State University
2015 Neil Avenue, Columbus, OH 43210-1277

**********************NOTE*********************
Abstracts for 1994
NOTE: some titles may be available ONLY via paper.
(This is noted after the title/author section.)
If you have trouble accessing a file, please send email to tfletch@cis.ohio-state.edu

1994

TR60 Bipartite Permutation Graphs with Application to the Minimum Buffer Size Problem
by Ten-Hwang Lai and Shu-Shang Wei

This paper presents a new characterization of bipartite permutation graphs and a structure theorem for (0,1)-matrices with a special consecutive-1's property. These results lead to a linear-time algorithm for the minimum buffer size problem when restricted to bipartite permutation graphs; this problem arises in relational database systems and is NP-hard for general graphs.

TR59 Designing Clustered Multiprocessor Systems under Packaging and Technological Advancements
by Debashis Basak and D.K. Panda

Packaging technologies impose various physical constraints on the bisection bandwidth, pinout, and channel width of a system, whereas processor and interconnect technologies lead to a certain demanded throughput on the network bisection. Earlier studies in the literature have proposed hierarchical and clustered interconnections by considering the effect of limited packaging constraints. Pinout technologies and the capacity of packaging modules have been ignored, often leading to configurations which are not design-feasible. In this paper, we solve this design problem by proposing a new supply-demand optimization framework. This generalized framework uses a parameterized representation of processor board area, pinout technologies (periphery or surface), channel width, and channel speed.
The family of flat k-ary n-cube topologies and their clustered variations (k-ary n-cube cluster-c) are evaluated to derive optimal configurations which can lead to cost-effective design of scalable parallel systems using wormhole routing. The analysis identifies processor board area, channel width, and pinout density as critical parameters. The study indicates that cluster-based parallel systems can deliver better performance at lower cost. It is predicted that optimal configurations for future systems will be cluster-based (2-10 processors per cluster) with 3D/4D/5D inter-cluster interconnection. The framework is general enough to capture technological trends of future years, and it is validated by the design solutions of current machines using contemporary technologies.

Keywords: Multiprocessor System Design, Scalable Systems, Clustered Architectures, Hierarchical Organization, k-ary n-cube Interconnection, Packaging Constraints, Pinout Technology.

TR58 Rapid Previewing via Volume-based Solid Modeling
by Naeem Shareef and Roni Yagel

Quick previewing of 3D models is necessary for efficient product design and rapid prototyping. An inherent weakness of most solid modeling systems is that as model complexity increases, quick 3D viewing suffers, especially on low-cost workstations. We explore an alternative to surface representation in which object space, called volume, is subdivided into a 3D grid of cubic cells, each containing information on the object(s) which occupy it. A data structure is introduced that consists of a 2D array of pointers, each holding a linked list of adjacent non-empty cells. To benefit from data coherency along one dimension, we have developed new modeling and rendering algorithms that are beam-oriented, incremental, and integer-based. To illustrate the usefulness of our approach, we apply it to Constructive Solid Geometry (CSG) modeling.
We describe our prototype system and show, by comparing it to existing systems, that our data structure and its associated algorithms, while being of finite resolution, provide for suitable and more efficient model visualization.

TR57 Micro-Architecture vs. Macro-Architecture
by Joseph E. Hollingsworth and Bruce W. Weide

The field of study commonly known as "software architecture" should be split into two subareas: micro-architecture and macro-architecture. Work to date under the name "software architecture" has concentrated largely on macro-architecture. But micro-architecture, a.k.a. software component engineering, is of equal concern.

TR56 VoxelFlow: A Parallel Volume Rendering Method for Scientific Visualization
by Asish Law, Roni Yagel, and D.N. Jayasimha

Volume visualization techniques are often applied to sampled data obtained from MRI and CT scans as well as to simulated data computed by computational fluid dynamics (CFD) methods. In such applications, a sequence of images is animated by incrementally rotating the volumetric object about its center, so that the scientist can view the structure from all angles. In this paper, we propose an efficient method, called VoxelFlow, for generating rendered images of volume data which can be used in animating rotations of objects. The algorithm uses a look-ahead scheme of voxel dataflow and exploits the communication features of modern scalable multicomputers to achieve near-linear speedup. Simulation results are shown for implementations on mesh architectures like the Intel Paragon and torus architectures like the Cray T3D. Computational and communication load-balancing issues among processors are also addressed.

TR55 Reverse Engineering of Legacy Code is Intractable
by Bruce W. Weide, Wayne D. Heym, and Joseph E. Hollingsworth

Reverse engineering of large legacy software systems is widely recognized to be a difficult problem. How bad is it?
By an argument that identifies key underlying sources of the difficulty, reverse engineering of legacy code is shown to be intractable in the usual computational complexity sense. This conclusion implies that we should not be too enthusiastic about the ultimate value of reverse engineering as the centerpiece of a cost-effective approach to constructing new generations of systems.

TR54 On the Synchronization Mechanisms in Distributed Shared Memory Systems
by Mahendra Ramachandran and Mukesh Singhal

Distributed Shared Memory (DSM) is the implementation of the shared memory programming paradigm on a distributed memory (or multicomputer) system. Programming multicomputer systems using Distributed Shared Memory as the programming model is appealing because it combines the performance advantage of distributed memory systems with the ease of programming of shared memory systems. In DSM systems, cooperating tasks communicate with one another through shared variables. Thus, DSM systems must provide synchronization mechanisms to coordinate concurrent access to these shared variables. In this paper we describe and classify the synchronization mechanisms supported by several Distributed Shared Memory systems. We classify these systems according to whether they are hardware- or software-based, whether the mechanism is integrated into the system or not, and whether the implementation of the synchronization mechanism is centralized or distributed.

Key phrases: Distributed Shared Memory, Distributed Memory Architectures, Operating Systems, Synchronization.

TR53 Global Reduction in Wormhole k-ary n-cube Networks with Multi-destination Exchange Worms
by Dhabaleswar K. Panda

This paper presents a new approach to implementing global reduction operations (including barrier synchronization) in wormhole k-ary n-cubes. The novelty lies in using a multi-destination message passing mechanism instead of single-destination (unicast) messages.
Using pairwise exchange worms along each dimension, it is shown that global reduction and barrier synchronization operations, as defined by the Message Passing Interface (MPI) standard, can be implemented with n communication start-ups, as compared to the 2n⌈log2 k⌉ start-ups required with unicast-based message passing. This leads to an asymptotic improvement by a factor of 2⌈log2 k⌉. For different values of communication start-up time and system size, the analysis indicates that the new framework can implement faster global reduction compared to the unicast-based scheme. Issues related to the new approach are studied, and the required architectural modifications to the router interface are presented. The analysis indicates that as the communication start-up time continues to decrease in new-generation systems, the proposed framework can be used to implement fast global reduction and barrier synchronization in large wormhole-routed systems without a separate control network.

Keywords: Global Reduction, Collective Communication, Wormhole Routing, Virtual Cut-through, Broadcasting, Meshes, k-ary n-cubes, and Path-based Routing.

TR52 Multi-Phase Redistribution: A Communication-Efficient Approach to Array Redistribution
by S.D. Kaushik, C.-H. Huang, J. Ramanujam, and P. Sadayappan

Distributed-memory implementations of several scientific applications require array redistribution. Array redistribution is used in languages such as High Performance Fortran to dynamically change the distribution of arrays across processors. Performing array redistribution incurs two overheads: an indexing overhead for determining the set of processors to communicate with and the array elements to be communicated, and a communication overhead for performing the necessary irregular all-to-many personalized communication. In this paper, efficient runtime methods for performing array redistribution are presented.
To reduce the indexing overhead, precise closed forms for enumerating the processors to communicate with and the array elements to be communicated are developed for two special cases of array redistribution involving block-cyclically distributed arrays. The general array redistribution problem for block-cyclically distributed arrays can be expressed in terms of these special cases. Using the developed closed forms, a distributed algorithm for scheduling the irregular communication for redistribution is developed. The generated schedule eliminates node contention and incurs the least communication overhead. The scheduling algorithm has an asymptotically lower scheduling overhead than techniques presented in the literature. Based on the developed closed forms, a cost model for estimating the communication and indexing overheads of array redistribution is developed. Using this model, a multi-phase approach for reducing the communication cost of array redistribution is presented. The key idea is to perform the redistribution as a sequence of redistributions such that the total cost of the sequence is less than that of direct redistribution. Algorithms for determining the sequence of intermediate data distributions which minimizes the total redistribution cost of multi-dimensional arrays are presented. Experimental results on the Cray T3D and IBM SP2 demonstrate the validity of the developed cost models and the efficacy of the multi-phase approach for array redistribution.

Keywords: Array Redistribution, Communication Scheduling, Distributed-Memory Computers, Data Distribution, Data Communication, High Performance Fortran, Node Contention.

TR51 Path Failure in Cylindrical Mesh Networks
by J. Ramachandran

The problem of node failures in interconnection networks has been extensively studied. In this paper we define the notion of a more catastrophic form of failure - path failure - in which a sequence of nodes from a source to a sink in the network fail.
The corresponding collapse in the structural properties of the underlying network is analyzed in quantitative terms, as a function of network size, using the notion of catastrophic failure. We examine this problem in a fixed setting, that of the cylindrical mesh, a network designed to maximize the number of communication routes between a set of producers and a set of consumers. Our proofs use interesting graph-theoretic and combinatorial techniques, and formally establish that connectivity is drastically affected by path failure.

TR50 The Polynomial Time Function Hierarchy
by J. Ramachandran

We study Krentel's polynomial time function hierarchy (PFH), which classifies and characterizes optimization functions using deterministic oracle transducers and NP sets as oracles. We introduce a quantifier syntax to describe PFH, and explicitly introduce the Pi_k^MM levels of the hierarchy. We discuss complete functions, metric reducibility, and relationships between different levels of PFH in analogy to Stockmeyer's polynomial time hierarchy (PH). We show that the closure of Pi_k^MM under metric reductions coincides with that under Turing reductions. In contrast, we show that these two reducibilities differ for FDEXT. We then proceed to build a theory of closure properties of PFH under operators such as sum, product, majority, plurality, etc. For each choice of an exponential-arity operator we consider two forms of closure: closure without context and closure under context. We present a series of results, both positive (closure is possible) and negative (closure implies collapse), on the closure of PFH for each choice of operator and form of closure.

TR49 Kolmogorov Complexity and Toda's Theorem
by J. Ramachandran

Toda's theorem states that the polynomial time hierarchy is contained in P(PP). An important building block in the proof is the lemma: NP is random reducible ...
We present a new proof of a corollary of this lemma, ....., using Kolmogorov complexity. The proof provides a new perspective for understanding Toda's theorem and adds to the growing corpus of work relating Kolmogorov complexity to structural complexity.

NOTE: where ellipses appear ( .... ), some of the text has not been reproduced because emacs does not handle the symbols.

TR48 Modulo Classes and Advice
by J. Ramachandran

We prove an upper bound on computing advice for a language A in the class of languages acceptable in deterministic polynomial time with logarithmic advice, A ... P/log, relative to A itself. We then provide conditions under which advice for a language A ... P/log can be computed in PF(A), and conditions for the existence of a set A in P/log without advice in PF(A). We show that paddable languages in MODk P/log that are also self-reducible are in MODk P, without advice. This is an extension of similar results for self-reducible paddable sets in P/log by Balcazar and Schoening [3].

Keywords: Computational complexity, Kolmogorov complexity, counting classes, advice.

NOTE: where ellipses appear ( .... ), some of the text has not been reproduced because emacs does not handle the symbols.

TR47 A Trip-based Multicasting Model in Wormhole-routed Networks with Virtual Channels
by Yu-Chee Tseng, D.K. Panda, and Ten-Hwang Lai

THIS TECH REPORT IS AN UPDATED VERSION OF 1993/TR03.ps

This paper focuses on efficient multicasting in wormhole-routed networks. A trip-based model is proposed to support adaptive, distributed, and deadlock-free multiple multicast on a network with arbitrary topology using at most two virtual channels per physical channel. This model significantly generalizes the path-based model proposed earlier [21, 22], which works only for Hamiltonian networks and cannot be applied to networks whose topology becomes arbitrary as a result of system faults.
Fundamentals of the trip-based model, including the necessary and sufficient condition for it to be deadlock-free and the use of an appropriate number of virtual channels to avoid deadlock, are investigated. The potential of this model is illustrated by applying it to hypercubes with faulty nodes. Simulation results indicate that the proposed model can implement multiple multicast on faulty hypercubes with negligible performance degradation.

Key Words: Routing algorithm, Interprocessor communication, Multicast, Virtual channel, Wormhole routing, Path-based routing, Collective Communication, and Fault tolerance.

TR46 Tracing the Missing Information
by Neelam Soundararajan

We consider two types of trace-based systems: channel-trace systems and process-trace systems. Specifications and proofs in the former type of system are usually considered much simpler than in the latter. We show that if we want to ensure that no essential information is missing, we are forced to add considerable complexity to the channel-trace systems. We conclude that specifications and proofs of any program in the process-trace systems are no more complex than in the channel-trace systems.

TR45 Mathematical Foundations and Notation of RESOLVE
by Wayne E. Heym, Timothy J. Long, William F. Ogden and Bruce W. Weide

The RESOLVE approach to reusable component-based software engineering is mathematically based. This paper discusses the logical foundations and terminology of RESOLVE, the built-in RESOLVE notation for writing mathematics, and the RESOLVE mechanisms that support description of mathematics that has no built-in notation. It is intended to serve primarily as a reference document.

TR44 Optimizing Compilation of Linear Arithmetic in a Class of Constraint Logic Programs
by Spiro Michaylov and Bill Pippin

A central issue in the optimizing compilation of Constraint Logic Programming (CLP) languages is how to compile away as much general constraint solving as possible.
Most such work relies on obtaining mode and type information by global analysis, and uses it to generate specialized code for individual constraints and calls, often with the aid of multiple specialization. Some recent work has augmented these techniques with procedure-level analysis of the inter-relationships between constraints, to detect constraints that subsume other constraints and variables that cease to be reachable at some point in a computation. In combination, these techniques have been shown to dramatically improve performance for a number of programs. Here we continue this line of investigation by considering a class of programs that accumulate and simplify systems of linear arithmetic constraints. The programs contain procedures that relate their parameters by an affine transform. For some calling patterns, the procedures repeatedly compose and simplify affine transforms, using virtually the full power of the linear arithmetic constraint solver and incurring considerable expense. We describe source-to-source translations that make this composition of transforms explicit, replacing constraint solving with ground (imperative) arithmetic. We demonstrate the translations and present experimental data showing substantial improvements in execution speed and space utilization.

TR43 Synchronization and Desynchronization in a Network of Locally Coupled Wilson-Cowan Oscillators
by Shannon Campbell and DeLiang Wang

A network of Wilson-Cowan oscillators is constructed, and its emergent properties of synchronization and desynchronization are investigated by both computer simulation and formal analysis. The network is a two-dimensional matrix, where each oscillator is coupled only to its neighbors. We show analytically that a chain of locally coupled oscillators (the piecewise-linear approximation to the Wilson-Cowan oscillator) synchronizes, and present a technique to rapidly entrain finite numbers of oscillators.
The coupling strengths change on a fast time scale based on a Hebbian rule. A global separator is introduced which receives input from and sends feedback to each oscillator in the matrix. The global separator is used to desynchronize different oscillator groups. Unlike many other models, the properties of this network emerge from local connections, which preserve spatial relationships among components and are critical for encoding Gestalt principles of feature grouping. The ability to synchronize and desynchronize oscillator groups within this network offers a promising approach for pattern segmentation and figure/ground segregation based on oscillatory correlation.

TR42 A Hierarchically-Connected Multiprocessor
by Jeffrey D. Martens

A hierarchically-connected multiprocessor model is presented and compared to existing hierarchical shared-memory multiprocessor models. The key features of this architectural model are that a portion of the hierarchical memory is contained locally within each processing element, all memory is visible to every processing element, and each memory unit is local (but not private) to some processing element. Software and programming issues are explored through discussion of producer-consumer communication, barrier execution, and a hierarchical version of guided self-scheduling.

TR41 Fast Barrier Synchronization in Wormhole k-ary n-cube Networks with Multidestination Worms
by D.K. Panda

UPDATED 3/95

This paper presents a new approach to implementing fast barrier synchronization in wormhole k-ary n-cubes. The novelty lies in using a multidestination message passing mechanism on base-routing-conformed paths instead of point-to-point unicast messages. Two different multidestination worm types, gather and broadcasting, are introduced to implement the report and wake-up phases of barrier synchronization, respectively. Using these new worms, algorithms for complete and arbitrary-set barrier synchronization are developed.
It has been shown that complete barrier synchronization in a k-ary n-cube system with e-cube routing can be implemented with 2n communication start-ups, as compared to the 2n log2 k start-ups needed with unicast-based message passing. This leads to an asymptotic improvement by a factor of log2 k. For different values of communication start-up time, link propagation delay, and number of participating nodes, a simulation study indicates that the new framework can reduce barrier synchronization cost considerably compared to the unicast-based scheme. Scalability of the proposed scheme is also shown with respect to the above system and technological parameters. The framework demonstrates potential for supporting fast barrier synchronization in large wormhole-routed systems.

Keywords: Barrier Synchronization, Collective Communication, Wormhole Routing, Virtual Cut-through, Broadcasting, Multicasting, Meshes, k-ary n-cubes, and Path-based Routing.

TR40 On Mapping Data and Computation for Parallel Sparse Cholesky Factorization
by Kalluri Eswar, C.-H. Huang, and P. Sadayappan

When performing the Cholesky factorization of a sparse matrix on a distributed-memory multiprocessor, the methods used for mapping the elements of the matrix and the operations constituting the factorization to the processors can have a significant impact on the communication overhead incurred. This paper explores how two techniques, one used when mapping dense Cholesky factorization and the other used when mapping sparse Cholesky factorization, can be integrated to achieve a communication-efficient parallel sparse Cholesky factorization. Two localizing techniques to further reduce the communication overhead are also described. The mapping strategies proposed here, as well as other previously proposed strategies, fit into the unifying framework developed in this paper. Communication statistics for sample sparse matrices are included.
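As background for the mapping strategies discussed in the TR40 abstract above, the following minimal Python sketch illustrates the standard 2D block-cyclic assignment of matrix blocks to a processor grid, the kind of element-to-processor mapping used for dense Cholesky factorization. It is an illustration of the general idea only, not the integrated mapping proposed in the report; the function name and grid sizes are hypothetical.

```python
# Illustrative sketch (not the report's algorithm): standard 2D
# block-cyclic mapping. Block (i, j) of the matrix is assigned to
# processor (i mod pr, j mod pc) on a pr x pc processor grid, so
# blocks are spread evenly and each processor's work is balanced.
def block_cyclic_owner(i, j, pr, pc):
    """Return (row, col) coordinates of the processor owning block (i, j)."""
    return (i % pr, j % pc)

# Example: which processors own the blocks of a 4 x 6 block matrix
# on a 2 x 3 processor grid.
owners = [[block_cyclic_owner(i, j, 2, 3) for j in range(6)]
          for i in range(4)]
```

With this mapping, consecutive block rows and columns cycle over the grid, which is what gives the dense scheme its load balance; sparse mappings must additionally account for the nonzero structure.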
TR39 Complete Process Recovery in Distributed Systems Using Vector Time
by Golden G. Richard, III and Mukesh Singhal

Distributed applications typically consist of a group of processes executing on processors which share neither a global clock nor shared memory and which communicate solely via message passing. As the number of processors increases, so does the probability that a processor will fail at some point during application execution in a distributed system; this is particularly true for long-running applications. In order to reduce lost work, the processes which make up a distributed application should be recoverable. This paper presents a comprehensive technique for complete process recovery in distributed systems. By complete recovery we mean the restoration of a consistent system state as well as proper handling of the lost, duplicate, and delayed messages resulting from the failure and the restoration of the system to a consistent state. Message handling has often been treated casually in previous recovery techniques. In contrast, our recovery techniques directly address message handling issues. We classify the message types which a recovery mechanism must deal with to motivate our message-handling techniques. The use of vector time allows us to reduce the overhead associated with recovery and helps expedite the recovery process after a failure.

Keywords: Message passing, distributed process recovery, complete process recovery, vector time, causality, checkpointing, message logging, optimistic recovery.

TR38 Priority Ethernets
by Frank Adelstein and Mukesh Singhal

This paper addresses issues related to the delivery of multimedia traffic on local area networks. Multimedia integrates voice and video along with text and images into existing systems. Traditional local area networks, e.g., Ethernet, were designed to handle regular data traffic, which is bursty in nature and for which variable delay is acceptable.
Multimedia traffic, on the other hand, requires a constant delay in addition to fast (or real-time) delivery. Multimedia traffic also requires a large amount of bandwidth compared to regular traffic. Multimedia is becoming important in commercial as well as non-commercial organizations, and Ethernet is a widely used local area network among most of these organizations; therefore, modification of Ethernet protocols to integrate multimedia streams is an important issue. In this paper, we present solutions to the problems associated with the transfer of multimedia data on Ethernet and qualitatively study the performance of these solutions and their suitability to multimedia traffic. Specifically, priority protocols are developed that enable priority transmission of data on Ethernets. Multimedia data can be assigned priority over regular data, and the developed priority protocols can be used to deliver multimedia data over the Ethernet in a timely manner.

TR37 Introduction to the Nagiya Editor
by Pete Ware

The Nagiya Editor is a graphical editor for representing object-oriented design. It allows the user to draw objects and indicate the relationships between these objects. Being a graphical editor, it supports commands such as printing, moving, (re)drawing, and (re)sizing. It is implemented using Xm++, a C++ library that supports the Motif widget set under the X11 Window System. Eventually, the Nagiya Editor will allow one to specify two schemas and indicate how data stored under one representation can be translated into the other representation. The Nagiya Editor will also facilitate this translation process by providing visual feedback to the user during conflict resolution.

TR36 Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems
by Ravi Prakash and Mukesh Singhal

A mobile computing system consists of mobile and stationary nodes, connected to each other by a communication network.
The presence of mobile nodes in the system places constraints on the permissible energy consumption and available communication bandwidth. To minimize the computation lost during recovery from node failures, periodic collection of a consistent snapshot of the system (a checkpoint) is required. Locating the mobile nodes contributes to the checkpointing and recovery costs. Synchronous snapshot collection algorithms, designed for static networks, either force every node in the system to take a new local snapshot or block the underlying computation during snapshot collection. Hence, they are not suitable for mobile computing systems. This paper presents a synchronous snapshot collection algorithm for mobile systems that neither forces every node to take a local snapshot nor blocks the underlying computation during snapshot collection. If a node initiates snapshot collection, local snapshots need to be taken only of those nodes that have directly or transitively affected the initiator since their last snapshots. We also propose a minimal rollback/recovery algorithm in which the computation at a node is rolled back only if it depends on operations that have been undone due to the failure of node(s). Both algorithms have low communication and storage overheads, and satisfy the low energy consumption and low bandwidth constraints of mobile computing systems.

Key words: checkpointing, causal dependency, global snapshot, mobile computing systems, portable computers, recovery.

TR35 A Note on Set Bit Enumeration
by Thomas Schwentick and J. Ramachandran

An $m$-set-bit-enumerator for a function $f$ is a source of information that, in response to the query ``How many set bits (`1's) are there in the binary representation of $f(x)$?'', computes a list of $m$ numbers, one of which is the correct value. This paper extends previous results on the hardness of set-bit enumeration for the function classes $\optP$ and $\numberP$.
Our main theorems state: if polynomial time computable $n$-set-bit-enumerators exist for all $\optP$ functions, then $\optP = \PF$; and if polynomial time computable $n^{\frac{1}{2} - \epsilon}$-set-bit-enumerators exist for all $\numberP$ functions, then $\numberP = \PF$. In both cases $n$ is essentially the length of the value of the given function.

TR34 Distributed Semaphores
by Mahendra Ramachandran and Mukesh Singhal

Semaphores provide a basic synchronization mechanism in uni- and multiprocessor systems. Supporting semaphores in distributed systems has not received much attention. DSM systems provide a shared memory programming model on distributed systems and thus increase the appeal of providing efficient, decentralized solutions for the semaphore mechanism in a distributed system. We present a distributed semaphore mechanism in this paper. We use a two-level hierarchy of processors to increase locality of reference and thus the efficiency of access.

Key phrases: Operating Systems, Process Synchronization, Semaphores, Distributed Memory Architectures.

TR33 Multidestination Message Passing Mechanism Conforming to Base Wormhole Routing Scheme
by Dhabaleswar K. Panda, Sanjay Singal and Pradeep Prabhakaran

This paper proposes a novel concept of a multidestination worm mechanism which allows a message to be propagated in a wormhole network conforming to the underlying base routing scheme (e-cube, planar, turn, or fully adaptive). Using this model, any source has the potential to deliver a message to multiple destinations along any valid path in the system conforming to the base routing scheme, while encountering only a single communication start-up. The flexibility of sending unicast messages exists under this model as a subset operation. Two schemes are developed and evaluated under this model to perform fast multicasting on 2D/3D meshes/tori.
Not only do these schemes demonstrate superiority over the Umesh (unicast-based multicast) [25] and Hamiltonian-path-based [24] schemes, they also indicate a unique and interesting result: the cost of multicast operations can be reduced or kept near-constant in e-cube systems as the degree of multicast (number of destinations per source) increases. The proposed schemes can also take advantage of routing adaptivity and irregularity in system topology to further reduce the cost of multicast operations. These results are the first in the wormhole-routing literature to propose multicasting schemes with such reduced overhead and provision for taking advantage of adaptivity. This multicast model lays a new foundation for building wormhole-routed distributed-memory systems with fast collective communication operations and distributed-shared-memory systems with fast cache-coherency support.

TR31 Repeated Redundant Inequalities in Constraint Logic Programming
by Spiro Michaylov

A number of Constraint Logic Programming systems, including CLP(R) and Prolog III, decide simultaneous linear inequalities as part of the fundamental operational step of constraint solving. While this can contribute tremendously to the usefulness of the systems, it is computationally quite expensive. Non-ground inequalities must generally be tested for consistency with the collected constraint set and then added to it, increasing its size and thus making the next such test more expensive. Future redundant inequalities in a program are those that are guaranteed to be subsumed after no more than one subsequent procedure call, usually in the context of a recursive procedure. It has been noted that such inequalities need only be tested for consistency with the current constraint set, thus resulting in dramatic savings in execution speed and space usage. In this paper we generalize the notion of future redundancy in a number of ways and thus broaden its applicability.
Thus we show how to dramatically improve the performance of a wider class of programs that rely heavily on linear inequalities. TR30 Skeletons and Techniques for the Systematic Development of Constraint Logic Programs by Spiro Michaylov We study the systematic development of Constraint Logic Programs from the viewpoint of Skeletons and Techniques as described by Kirschenbaum and Sterling. We describe a number of fundamental skeleton classes for CLP, and generalize the notion of skeletons to deal with non-structural recursion. Then we describe a range of useful techniques for extending these skeletons. Furthermore, we introduce important classes of techniques that alter the control flow of skeletons in certain well-defined and desirable ways. This work represents a step towards understanding how to develop complex CLP programs easily, and is expected to contribute to the adoption of CLP for application projects. It may also lead to the development of semi-automated program development tools. Finally, it helps to justify a substantial body of present work on CLP compiler optimizations that depends on the procedure-level structure of programs. TR29 Architectural and Communication Issues in Designing Heterogeneous Parallel Systems with Optical Interconnection by Ravi Prakash and Dhabaleswar K. Panda THIS REPORT WAS UPDATED 7/95 This paper investigates architectural and communication issues in designing heterogeneous parallel systems. As the speed of processors and parallel systems keeps increasing over the years, electronic interconnections like HIPPI and FDDI are reaching their limits in interconnecting several parallel systems to provide a heterogeneous parallel computing environment. This paper explores the suitability of a passive star-coupled optical interconnection using wavelength division multiplexing as the system interconnect to provide the high-bandwidth (Gbits/sec) communication demanded by heterogeneous systems.
Several different communication strategies (combinations of communication topologies and protocols) over Wavelength Division Multiplexed (WDM) communication media like optic fiber are investigated. A representative master-slave computational model is used to evaluate and determine a suitable communication architecture for such systems. The interplay between system speed, network speed, task granularity, and degree of parallelism is studied. It is shown that a hierarchical ALOHA-based communication strategy between the master and the slaves, implemented on top of the passive star-coupled network, leads to a considerable reduction in channel contention. Compared to direct point-to-point communication, it is demonstrated that hierarchical interconnection provides a 50%-80% reduction in task completion time for applications with medium to high degrees of parallelism. A comparable reduction in channel contention is also shown to be achieved by using tunable acousto-optic filters at master nodes. These two strategies demonstrate significant potential for building future-generation heterogeneous parallel systems capable of delivering high performance. Keywords: Heterogeneous parallel systems, hierarchical networks, optical interconnection, meta-parallelism, task scheduling, time division multiplexing, tunable filters, wavelength division multiplexing. TR28 Data-Parallel Volume Rendering Algorithms by Roni Yagel and Raghu Machiraju Images generated from volumetric datasets are increasingly being used in many biomedical disciplines, archeology, geology, high energy physics, computational chemistry, computational fluid dynamics, meteorology, astronomy, computer aided design, environmental sciences, and many others. However, given the overwhelming amount of computation required for accurate manipulation and rendering of 3D rasters, it is obvious that special-purpose hardware, parallel processing, or a combination of both is imperative.
In this presentation we consider the image composition scheme for parallel volume rendering, in which each processor is assigned a portion of the volume. A processor renders only its data by using any existing volume rendering algorithm. We describe one such parallel algorithm that also takes advantage of vector processing capabilities. The resulting images from all processors are then combined (composited) in visibility order to form the final image. The major advantage of this approach is that, as viewing and shading parameters change, only 2D partial images are communicated among processors and not 3D volume data. Through experimental results and performance analysis we show that our parallel algorithm is amenable to extremely efficient implementations on distributed-memory MIMD (Multiple Instruction Multiple Data) vector-processor architectures. It is also very suitable for hardware implementation based on image composition architectures, it supports various volume rendering algorithms, and it can be extended to provide load-balanced execution and to support rendering of unstructured grids. TR27 A Formal Method for Classifying Distributed Systems by Deborah Shands The need for communication and coordination between agents in a distributed system distinguishes distributed problems from uni-processor problems. The level of coordination achievable in a system determines which problems are solvable in that system. Often, it is not clear from the informal and operational descriptions of systems in the literature just what level of coordination is possible. This leaves us with the practical question: which properties of a given system are relevant to another system? We introduce a method for building a problem-based taxonomy of distributed systems, where a system is classified according to the distributed problems which it can solve. We expect that such a taxonomy would ease the process of applying results from one system to a new system.
The method is non-operational, based on specifications of knowledge transfer among agents over time. It has the following components: 1. A propositional modal language and associated decidable logic with knowledge and temporal operators for specifying the knowledge-transfer behavior and requirements of distributed systems, protocols and problems; 2. The concept of generic distributed problems which, along with full-information protocols, allows systems to be studied and compared in isolation from the influences of protocol choice and specific problem requirements; 3. A small model theorem which allows Kripke structures to be mechanically produced from system and protocol specifications, and a bisimulation result which allows those Kripke structures to be formally compared. To classify a system by our method, one would first formally specify the system, full-information protocol and problem. A Kripke structure modeling the system and protocol behavior can then be mechanically generated and checked to determine whether the problem specification is met. Kripke structures modeling various systems can be compared by finding bisimulations between knowledge states of the structures. This classification method is formal, mechanizable and comprehensive, with the flexibility to compare diverse systems. TR26 Sphylinders, Cone-Spheres and Rounded Polygons by Karansher Singh and Richard Parent Implicit surfaces provide an elegant approach to modeling and animating physically deformable and organic-looking objects. A major drawback of this approach is its rendering complexity. Some work has been done on developing fast, robust techniques for ray tracing and polygonization of implicit surfaces. While a fast scanline technique has been proposed for rendering spheres, spheres alone are limited as a modeling primitive.
This paper addresses techniques for fast scanline rendering of implicit surfaces in general, and provides details for the implicit primitives: spheres, sphylinders, cone-spheres and rounded polygons. These primitives are particularly useful for modeling articulated figures. TR25 Development in Logic-based Languages by Marc Kirschenbaum, Spiro Michaylov, and Leon Sterling. To help encourage the wide use of a variety of logic-based languages, we advocate the identification and use of systematic program construction methods. Our approach is to capture programming idioms in the form of skeletons, or basic control flows, and standard programming techniques for construction and manipulation of data structures. We argue that the approach, coming initially from Prolog and logic grammars, is equally applicable in the context of constraints, committed-choice and higher-order constructs, and indeed logic-based languages in general. This paper introduces our generalized view of skeletons and techniques, gives examples of the breadth of their applicability, and explores positive consequences for program development. TR24 Polyhedral Shapes as General Implicit Surface Primitives by Karansher Singh and Richard Parent Implicit surfaces provide an elegant approach to modeling and animating physically deformable and organic-looking objects. Existing implementations use simple analytically defined geometric shapes as modeling primitives. This paper discusses limitations of these shapes and argues the case for a more general primitive. A generalized primitive implicit shape definition is presented, and the use of arbitrary polyhedral structures as primitives is described. A major drawback of implicit surfaces is their rendering complexity. Techniques for rendering polyhedral primitives efficiently, with emphasis on ray tracing, are described. Finally, useful applications of the above primitive definition, such as shape change, are discussed.
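The implicit surface primitives in TR26 and TR24 above are built on scalar field functions. As a rough, hypothetical sketch (a standard metaball-style formulation for illustration, not the authors' exact definitions), a sphere primitive and a blend of several primitives can be written as:

```python
import math

def sphere_field(p, center, radius):
    """Implicit field of a sphere primitive: positive inside,
    zero on the surface, negative outside."""
    return radius - math.dist(p, center)

def blended_field(p, primitives):
    """Blend primitives with smoothly decaying (Gaussian-style)
    contributions; the surface is the iso-contour where the field == 1."""
    total = 0.0
    for center, radius in primitives:
        d = math.dist(p, center)
        total += math.exp(-(d / radius) ** 2)
    return total

# Two unit spheres close enough that their fields merge smoothly:
prims = [((0.0, 0.0, 0.0), 1.0), ((1.5, 0.0, 0.0), 1.0)]
midpoint_value = blended_field((0.75, 0.0, 0.0), prims)  # exceeds 1: inside
```

Ray tracing or polygonization then amounts to locating the iso-contour of such a field, which is the source of the rendering cost both abstracts mention.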
TR23 Measures of Message Latency for Wormhole Routed Distributed Memory Multiprocessors by Kant C. Patel and D.N. Jayasimha A traditional measure of the communication performance in a massively parallel processor (MPP) is the average message latency. In this paper we argue that the standard deviation of message latencies should be considered as an additional important measure in a multi-user MPP environment. Using a simulation-based approach, we investigate various schemes to reduce the standard deviation without adversely affecting the average latency. Simulations on a 16 x 16 wormhole-routed mesh architecture reveal that the standard deviation varies typically from 25% to 50% of the average, depending on the message generation rate, the message length, the traffic pattern, and the routing algorithm. Since the differential blocking faced by messages is one of the major causes of the large standard deviation, we take two approaches to reducing it which do not degrade the average. The first controls the relative progress with which different messages proceed towards their destinations by prioritizing them dynamically. The second employs multiple virtual channels per physical channel. We compare the performance of several schemes for each of these two approaches, using deterministic routing as our base case. A distinguishing feature of our investigation is that the simulation model accounts for the degradation in network cycle time with the increased router complexity required by the use of the priority schemes and by the virtual channels. In fact, we show how our experimental results could have been misleading had we followed the traditional "cycle counting" approach. Keywords: wormhole routing, message priorities, virtual channels, router design complexity, average message latency, standard deviation, performance, simulation. TR22 A Necessary and Sufficient Condition for Deadlock-Free Wormhole Routing by Loren Schwiebert and D.N.
Jayasimha THIS TR WAS UPDATED 12/94 An important open problem in wormhole routing has been to find a necessary and sufficient condition for deadlock-free adaptive routing. Recently, Duato has solved this problem for a restricted class of adaptive routing algorithms. In this paper, a necessary and sufficient condition is proposed that can be used for any adaptive or nonadaptive routing algorithm for wormhole routing, as long as only local information is required for routing. The underlying proof technique introduces a new type of dependency graph, the channel waiting graph, which omits most channel dependencies that cannot be used to create a deadlock configuration. The necessary and sufficient condition can be applied in a straightforward manner to most routing algorithms. This is illustrated by proving deadlock freedom for a partially adaptive nonminimal mesh routing algorithm that does not require virtual channels and a fully adaptive minimal hypercube routing algorithm with two virtual channels per physical channel. Both routing algorithms are more adaptive than any previously proposed routing algorithm with similar virtual channel requirements. Keywords: wormhole routing, routing algorithms, deadlock freedom, channel waiting graph, necessary and sufficient condition, mesh architecture, hypercube architecture. TR21 The Effects of Layering and Encapsulation on Software Development Cost and Quality by Stu Zweben, Steve Edwards, Bruce Weide, & Joe Hollingsworth Software engineers often espouse the importance of using abstraction and encapsulation in developing software components. They advocate the "layering" of new components on top of existing components, using only information about the functionality and interfaces provided by the existing components. This layering approach is in contrast to a "direct implementation" of new components, utilizing unencapsulated access to the representation data structures and code present in the existing components.
However, there is no empirical evidence that indicates whether the layering approach improves the productivity or quality of the development activity. We discuss three controlled experiments designed to gather such empirical evidence. The results support the contention that layering significantly reduces the effort required to build new components. Furthermore, the quality of the components, in terms of the number of defects introduced during their development, is at least as good using the layered approach. We also discuss some interesting issues related to the statistical analysis of experiments such as these. Index Terms: empirical study, encapsulation, software components, abstract data types, software development, software reuse. TR20 Impact of Message-Ordering in Wormhole-Routed Multicomputers by Dhabaleswar K. Panda and Vibha A. Dixit-Radiya. This paper analyzes the impact of message-ordering, between outgoing messages from a sender to multiple receivers (called multicasts), on the completion time of a program for wormhole-routed distributed-memory systems. We study how best to order a set of outgoing messages by taking into account message criticality and architectural issues like multiple injection ports, routing adaptivity, and link contention. First, the simple message-ordering algorithm of Dikaiakos et al. [8], proposed only for fully-connected one-port systems, is enhanced to obtain a static message-ordering algorithm. This enhanced algorithm is based on message criticality obtained from the temporal dependencies of the program and works for non-fully-connected systems. Next, a dynamic message-ordering algorithm is proposed which additionally considers architectural issues, including link contention and multiple ports, for wormhole-routed systems. Both algorithms are compared against naive sequential message-ordering on an event-driven simulator for Gaussian Elimination and random task graphs.
Simulation experiments show a reduction in program completion time of up to 33.1% by using static over sequential ordering. Similarly, the dynamic ordering shows an additional improvement of up to 17.7% over static ordering. This research indicates that good message-ordering schemes need to be used as the last step in the mapping and scheduling process to generate efficient programs for wormhole-routed distributed-memory systems. Index Terms - message-ordering, wormhole-routing, multicast, link contention, multiport, adaptive routing, temporal communication graph, mapping. TR19 Compiling Array Expressions for Efficient Execution on Distributed-Memory Machines by S.K.S. Gupta, S.D. Kaushik, C.-H. Huang and P. Sadayappan (Revised 3/95) Array statements are often used to express data-parallelism in scientific languages such as Fortran 90 and High Performance Fortran. In compiling array statements for a distributed-memory machine, efficient generation of communication sets and local index sets is important. We show that for arrays distributed block-cyclically on multiple processors, the local memory access sequence and communication sets can be efficiently enumerated as closed forms using regular sections. First, closed-form solutions are presented for arrays that are distributed using block or cyclic distributions. These closed forms are then used with a virtual processor approach to give an efficient solution for arrays with block-cyclic distributions. This approach is based on viewing a block-cyclic distribution as a block (or cyclic) distribution on a set of virtual processors, which are cyclically (or block-wise) mapped to physical processors. These views are referred to as virtual-block or virtual-cyclic views, depending on whether a block or cyclic distribution of the array on the virtual processors is used.
The virtual processor approach permits different schemes based on the combination of the virtual processor views chosen for the different arrays involved in an array statement. These virtualization schemes have different indexing overheads. We present a strategy for identifying the virtualization scheme which will have the best performance. Performance results on a Cray T3D system are presented for hand-compiled code for array assignments. These results show that using the virtual processor approach, efficient code can be generated for execution of array statements involving block-cyclically distributed arrays. Index Terms: Array statement, compiler optimizations, data communication, distributed-memory machines, data distribution, High Performance Fortran, Fortran 90. TR18 Lossless Compression of Volume Data by James E. Fowler and Roni Yagel Data in volume form consumes an extraordinary amount of storage space. For efficient storage and transmission of such data, compression algorithms are imperative. However, most volumetric datasets are used in biomedicine and other scientific applications where lossy compression is unacceptable. We present a lossless data-compression algorithm which, being oriented specifically toward volume data, achieves greater compression performance than the generic compression algorithms typically available on modern computer systems. Our algorithm is a combination of differential pulse-code modulation (DPCM) and Huffman coding and results in compression of around 50% for a set of volume data files. TR17 High Quality Template-Based Volume Rendering by Roni Yagel and Kim Ciula. Volumetric representation of 3D scenes is becoming widely used in scientific visualization and computer graphics. Volume rendering confronts performance hurdles due to the voluminous nature of the data, and quality hurdles due to its discreteness.
In this paper we present a three-phase template-based algorithm for expediting parallel viewing of volumetric datasets. Our approach exploits coherency between rays in parallel views by storing, in a ray-template, information that is common to all rays. From an axis-aligned plane, and within the projected region of the volume, parallel rays are cast into the volume by repeating the sequence of steps specified by the ray-template. We show that the template can also store information that is used to expedite screen-space and object-space supersampling. Finally, we utilize templates to support adaptive volume traversal. TR16 Visibility Computation for Interactive Visualization of Complex Enclosed Environments by Roni Yagel and William Ray. In many visualization applications, as well as in computer graphics, we need to consider large numbers of objects in order to render one image. In many cases rendering can be preceded by a culling phase which employs simple mechanisms to reject most of the objects. As a result, only a very small portion of the model has to actually go through the time-consuming process of hidden object removal. We report on such a culling mechanism that is based on regular space subdivision into cells. These cells are classified into opaque and transparent, and a cell-to-cell visibility algorithm determines, for a given cell, which other cells are potentially visible. Only the objects in the potentially visible set of cells are actually submitted to the hidden object removal algorithm. We report on the 2D implementation of the algorithm and its performance for walkthroughs of the interior of capillary blood vessels. We propose several improvements to our method and suggest a simple way to extend it to 3D. TR15 The Power of Carry-Save Addition by D.R. Lutz and D.N. Jayasimha A carry-save adder (CSA), or 3-2 adder, is a very fast and cheap adder that does not propagate carry bits.
Historically, carry-save addition has been used for a limited set of intermediate calculations, with the most common example being the accumulation of the partial products of a multiplication. We examine carry-save addition from a new perspective: as a binary operation in which one of the operands is a number in carry-save form. From this perspective we develop five new uses for CSAs: (1) as fast adder-comparators for evaluating whether or not X + Y = Z; (2) as the basis of an expanded instruction set that can reduce branch and data hazards and decrease the cycles per instruction (CPI) on superpipelined architectures; (3) as linear-time multipliers for very large integers; (4) as arbitrary-length, constant-time, synchronous, up-down counters; and (5) as extremely fast frequency dividers. Index terms: carry-save adders, computer arithmetic, superpipelined architectures, pipeline hazards, multipliers, counters, frequency dividers. TR14 Real-time Causal Message Ordering in Multimedia Systems by Frank Adelstein and Mukesh Singhal In multimedia systems, not only do messages that are sent to and received by multiple sites need to have a consistent order imposed by all sites, but cause-and-effect relations must be maintained. Causal ordering allows the cause-and-effect relations of messages to be maintained. This paper presents an algorithm that ensures that multimedia data with real-time deadlines are delivered to the application layer in causal order. The algorithm is designed to ensure that any message that arrives at a destination site before its deadline will be delivered to the application before the message expires. In addition, by focusing on a form of causal ordering violations caused by "the triangle inequality," this algorithm has a low overhead with respect to the amount of information that must be appended to each message. Keywords: Real-time, causal ordering, multimedia systems, delta-causality, triangle inequality.
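The carry-save representation at the heart of TR15 above is easy to illustrate in software, even though the report targets hardware. The following is a hypothetical Python sketch (names and the final zero test are our own, not the authors') of the 3-2 reduction step and of use (1), the adder-comparator for X + Y = Z:

```python
def csa(x, y, z):
    """One carry-save (3-2) step: reduce three addends to a (sum, carry)
    pair with no carry propagation; the invariant is s + c == x + y + z."""
    s = x ^ y ^ z                              # per-bit sum, carries ignored
    c = ((x & y) | (x & z) | (y & z)) << 1     # per-bit majority -> carry word
    return s, c

def adder_comparator(x, y, z, width=32):
    """Sketch of use (1): decide whether x + y == z (mod 2**width).
    x + y == z iff x + y + ~z + 1 == 0 (mod 2**width); a CSA reduces the
    three addends without propagating carries, and hardware can detect the
    all-zero result cheaply (here we simply finish the sum in Python)."""
    mask = (1 << width) - 1
    s, c = csa(x & mask, y & mask, ~z & mask)
    return (s + c + 1) & mask == 0
```

The full-adder identity behind `csa` (XOR for the sum bits, majority for the carry bits) is what lets the five uses in the abstract avoid the carry-propagation delay of an ordinary adder.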
TR13 Semantic Interpretation as Classificatory Abduction by Julie Hartigan. This paper proposes a new computational framework, called classificatory abduction, to perform some aspects of natural language understanding. Inherent in its computational design is the ability to cope with the intractability of semantic interpretation. TR12 Maximal Global Snapshot with Concurrent Initiators by Ravi Prakash and Mukesh Singhal In a distributed system, multiple nodes may initiate snapshot collection concurrently. In this paper we present a global snapshot collection algorithm that combines the information collected by each initiator. This generates a maximal, consistent global snapshot that is more recent than the snapshot collected by any initiator. Global snapshots are used to establish checkpoints for recovery from node failures. A maximal snapshot implies that the amount of computation lost during roll-back, after node failures, is minimized. We also present an efficient information dissemination strategy that nodes can employ to exchange snapshot information with each other. TR11 Multimedia on Local Area Network by Amr Elsaadany, Mukesh Singhal, and Ming T. Liu. In this paper we address the issues related to the delivery of multimedia streams on local area networks. Multimedia integrates voice and video along with text and images into existing systems. Local area networks were designed to handle regular data traffic, which is bursty in nature and for which variable delay is acceptable. Multimedia traffic, on the other hand, requires constant delay in addition to fast (or real-time) delivery. Multimedia traffic also requires large amounts of bandwidth compared to regular traffic. Since local area networks are widely in use, their modification to integrate multimedia streams is very important. In our work we consider some possible solutions to the problems associated with multimedia on local area networks, as well as studying the performance aspects of these solutions.
Key words: Multimedia, local area networks, synchronization, priority, switching hubs. TR10 Decentralized Semaphore Support in a Virtual Shared Memory System by Mahendra Ramachandran and Mukesh Singhal Distributed Shared Memory has increasingly become a desirable programming model for multicomputer systems. Such systems strike a balance between the performance attainable in distributed-memory multiprocessors and the ease of programming of shared-memory systems. In shared-memory systems, concurrent tasks communicate through shared variables, and synchronization of access to shared data is an important issue. Semaphores have traditionally been used to provide this synchronization. In this paper we propose a decentralized scheme to support semaphores in a virtual shared memory system. Our method of grouping semaphores into semaphore-pages and caching a semaphore at a processor on demand eliminates the reliability and bottleneck problems associated with centralized schemes. Key phrases: Operating Systems, Process Synchronization, Semaphores, Distributed Memory Architectures, Distributed Shared Memory. TR09 Optimal, Nonmasking Fault-Tolerant Reconfiguration of Trees and Rings by Anish Arora and Ashish Singhai We design two programs that maintain the processes of an arbitrary distributed system in a rooted spanning tree and in a unidirectional ring, respectively, in the presence of fail-stop failures and repairs of both processes and communication channels. Our programs are notable as they (i) are fully distributed, (ii) have optimal time complexity, and (iii) demonstrate two different approaches to designing nonmasking fault-tolerant programs. Keywords: distributed computing, nonmasking fault-tolerance, process and channel failure and repair, reconfiguration, stabilization. TR08 Comparing Tools and Techniques for Data Translation by Sandra A.
Mamrak and Julie Barnes A translation often is required from one specific electronic encoding of a document to another. For example, an author may wish to translate an article encoded with LaTeX to the same article encoded using Scribe, or using a macro version of the troff family. Different techniques and tools exist for achieving such translations. Techniques include using a pairwise or intermediate-form approach to the translation. Tools include programming languages, code-generating tools, and integrated, code-generating toolsets. A person faced with a translation problem must choose among the various combinations of techniques and tools, with little guidance as to the comparative effort or quality that is achievable with different approaches. In this paper we discuss the complexity of comparing the various approaches. We describe an experiment that we have undertaken to begin to generate some comparative data. Finally, we discuss the potential significance of the experimental data. (From the Introduction.) Keywords and categories: Software Engineering; Data Storage Representations; Text Processing; Computers in Other Systems; format and notation, publishing; data translation, pairwise translation, intermediate-form translation. TR07 Priority Ethernets: Promises and Challenges by Frank Adelstein and Mukesh Singhal This proposal will address the issues related to the delivery of multimedia traffic on local area networks. Multimedia integrates voice and video along with text and images into existing systems. Traditional local area networks, e.g., Ethernet, were designed to handle regular data traffic, which is bursty in nature and for which variable delay is acceptable. Multimedia traffic, on the other hand, requires constant delay in addition to fast (or real-time) delivery. Multimedia traffic also requires large amounts of bandwidth compared to regular traffic.
Multimedia is becoming important in commercial as well as non-commercial organizations, and Ethernet is a widely used local area network among most of these organizations; therefore, modification of Ethernet protocols to integrate multimedia streams is an important issue. In this research, the investigator will develop solutions to the problems associated with the transfer of multimedia data on Ethernet and will study the performance of these solutions and their suitability to multimedia traffic. Specifically, priority protocols will be developed that will enable priority transmission of data on Ethernets. Multimedia data can be assigned priority over regular data, and the developed priority protocols can be used to deliver multimedia data over the Ethernet in a timely manner. The performance of these protocols will be studied using extensive simulation. TR06 Task Scheduling in Heterogeneous Systems in the Presence of Channel Contention by Ravi Prakash & Dhabaleswar K. Panda. This paper investigates some of the communication and scheduling issues in heterogeneous systems. A representative heterogeneous system consisting of a set of computers, each architecturally different and exploiting different kinds of parallelism, is considered. These computers are connected by a high-speed, passive star-coupled photonic network that employs wavelength division multiplexing (WDM). A model scientific computation, exhibiting meta-parallelism, has been studied on this system. Two scheduling strategies have been considered. One strategy involves both subtask data and code migration to the subtask execution site, while the other (the no-code-migration strategy) requires only the data to be transmitted. Simulations of the execution of this computation on our heterogeneous system provide the following insights into the design of such a system: A master-slave model of execution can reduce the amount of communication, and is therefore scalable.
Contention for communication channels makes a significant contribution to the overall task completion time. Channel contention increases with an increase in the degree of parallelism. In the presence of fast disk I/O, the no-code-migration strategy performs better than the subtask-code-migration strategy. A hierarchical communication topology implemented on the passive star-coupled network leads to a considerable reduction in contention and task completion time. These results can be used as guidelines for developing scalable heterogeneous systems. Keywords: channel contention, heterogeneous system, hierarchical interconnection, meta-parallelism, task scheduling, wavelength division multiplexing. TR05 An Optimality Proof for Asynchronous Recovery Algorithms in Distributed Systems by Mukesh Singhal and Friedemann Mattern. We prove the optimality of asynchronous recovery algorithms for distributed systems in the sense that, irrespective of the order of roll-backs by sites, all sites incrementally converge to a unique consistent cut which is the "latest" of all past consistent cuts formed by the local recovery points of the sites. Key Words: Asynchronous recovery algorithms, distributed systems, consistent cut. TR04 On Local Certifiability of Software Components by Bruce W. Weide and Joseph E. Hollingsworth Large software systems, like other large engineered systems, consist of components that are meant to be independent except at their interfaces. An important aspect of any large system is the need for local certifiability: to be able to establish properties of components out of the context of the larger system(s) in which they are embedded, and to be sure that any properties thus certified are certain to hold even when the components are composed with others to build larger ones. This is especially important for "black-box" reusable components, which need to be prequalified for inclusion in a component library.
A good software engineering discipline must support local certifiability of important component properties, or life-cycle costs are inherently doomed to spiral out of control for larger and larger systems. No software engineering discipline in general use today supports local certifiability of most interesting properties. But local certifiability of many important properties - including, crucially, correctness with respect to an abstract specification - is possible in at least one practical discipline [Hollingsworth 92b].

TR03 Compiling for Hierarchical Shared-Memory Multiprocessors by J.D. Martens and D.N. Jayasimha

Latency and synchronization overheads have been identified as two fundamental problems in large-scale multiprocessing. Recently, a number of hierarchical shared-memory multiprocessors (HSMAs) have been proposed to alleviate these problems. These architectures form a bridge between the shared-memory and distributed-memory approaches; for example, HSMA support for consumer-initiated communication is analogous to that of shared-memory multiprocessors because of the presence of a global address space, and HSMA exploitation of data locality is analogous to that of distributed-memory multicomputers because of the presence of an explicitly hierarchical memory. Issues related to compiling for such architectures are investigated in this paper. In particular, three strategies are discussed for parallelizing nested iterative loops with subscripts of the form i +/- k (read "i plus or minus k"), where i is a loop index and k is a constant. The first method uses the abstraction of simple or regular sections to place variables in the hierarchical memory and allows vectorized communication. The second exploits the data dependence graph and in some cases results in less communication than the first. The third is a modification of the second, resulting in a larger number of writes to the hierarchical memory but allowing lower-latency reads.
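As a minimal illustration (not taken from the paper; the function name and data are hypothetical), a loop whose array subscript has the i - k form carries a constant-distance dependence of exactly the kind these strategies must analyze:

```python
# Illustrative sketch of the "i +/- k" subscript form, with k = 1.
# The read of a[i - 1] creates a constant-distance flow dependence;
# a parallelizing compiler can compute this dependence from the
# subscripts and use it to tile the iteration space and place data
# in a hierarchical memory.
def prefix_recurrence(b):
    a = [0] * len(b)
    for i in range(1, len(b)):
        a[i] = a[i - 1] + b[i]  # subscript i - k, with k = 1
    return a
```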
Key words: parallelizing compilers, iteration space tiling, data space partitioning, hierarchical memory, shared-memory multiprocessing, latency, synchronization.

TR02 Global Snapshots of a Distributed System by Ajay Kshemkalyani, Michel Raynal and Mukesh Singhal

The global state of a distributed system is an important issue in the design of distributed systems, and it is important that there be efficient ways of observing it. Unfortunately, the lack of both a globally shared memory and a global clock in a distributed system makes it difficult to observe the global state of the system. In this paper, we present several algorithms to collect a snapshot of a distributed system and compare their features. Keywords: global state, snapshot, distributed system, stable property.

TR01 Designing Large Hierarchical Multiprocessor Systems under Processor, Interconnection, and Packaging Advancements by Debashis Basak and Dhabaleswar K. Panda

In this paper we present a general framework for the architectural design of large hierarchical multiprocessor systems under rapidly changing packaging, processor, and interconnection technologies. Processor boards with larger area (A) and greater pinouts are becoming feasible. Board interconnection technology has advanced from only peripheral connections, O(sqrt(A)), to elastomeric surface connections, O(A). As processor and interconnection technology grows, there is a varying demand on the interconnection network of the system. The proposed framework is capable of taking all these changing technologies into account. Each technology is captured through one or more parameters that reflect its level of advancement. Given a set of values for these parameters, the framework guides us to the optimal topology. The framework is illustrated by considering the design problem of the currently popular k-ary n-cube cluster-c scalable architecture.
These architectures combine the scalability of k-ary n-cube wormhole-routed networks with the cost-effectiveness of processor cluster designs. Each cluster consists of c processors interconnected through a bus, MIN, or direct network. Our results indicate that under surface pinout technology an increase in cluster size (c) is associated with a growth in bisection bandwidth, whereas under periphery pinout it leads to a fall in bisection bandwidth. For systems supporting 16-bit links, cluster sizes of 2 to 3 processors with 3D/4D k-ary n-cube interconnections are optimal. Similarly, systems with 32-bit links can support larger cluster sizes (c = 7, 8) and 3D/4D interconnections. Keywords: packaging constraints, hierarchical systems, clustered systems, pinout constraint, interconnection network, k-ary n-cube systems.
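The supply-demand comparison underlying these results can be sketched numerically. The sketch below uses textbook approximations, not formulas from the report: the bisection of a k-ary n-cube torus with even k is 2*k^(n-1) links, and available pins scale as O(A) for elastomeric surface connections versus O(sqrt(A)) for periphery connections. The function names and constants are illustrative assumptions.

```python
import math

def bisection_links(k, n):
    # Textbook bisection of a k-ary n-cube torus (even k): 2 * k^(n-1) links.
    return 2 * k ** (n - 1)

def board_pinout(area, density, surface=True):
    # Pin supply: proportional to board area A for surface connections,
    # proportional to the perimeter (~ 4 * sqrt(A)) for periphery connections.
    if surface:
        return density * area
    return density * 4 * math.sqrt(area)

def max_channel_width(pins, boundary_links):
    # Widest channel (in bits) the pinout budget can supply to the
    # links crossing a board boundary.
    return pins // boundary_links
```

Under this sketch, pin supply grows faster with board area under surface technology than under periphery technology, which mirrors the report's observation that larger clusters improve bisection bandwidth only under surface pinout.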