Technical Report Series
Computer and Information Science Research Center
Department of Computer and Information Science
The Ohio State University
2015 Neil Avenue, Columbus, OH 43210-1277

**********************NOTE*********************
Abstracts for 1994
NOTE: some titles may be available ONLY via paper.
(This is noted after the title/author section.)
If you have trouble accessing a file, please send email to tfletch@cis.ohio-state.edu

1994

TR60 Bipartite Permutation Graphs with Application to the Minimum Buffer Size Problem
by Ten-Hwang Lai and Shu-Shang Wei

This paper presents a new characterization of bipartite permutation graphs and a structure theorem for (0,1)-matrices with a special consecutive-1's property. These results lead to a linear-time algorithm for the minimum buffer size problem when restricted to bipartite permutation graphs; this problem arises in relational database systems and is NP-hard for general graphs.

TR59 Designing Clustered Multiprocessor Systems under Packaging and Technological Advancements
by Debashis Basak and D.K. Panda

Packaging technologies impose various physical constraints on the bisection bandwidth, pinout, and channel width of a system, whereas processor and interconnect technologies lead to a certain demanded throughput on the network bisection. Earlier studies in the literature have proposed hierarchical and clustered interconnections by considering the effect of limited packaging constraints. Pinout technologies and the capacity of packaging modules have been ignored, often leading to configurations which are not design-feasible. In this paper, we solve this design problem by proposing a new supply-demand optimization framework. This generalized framework uses a parameterized representation of processor board area, pinout technologies (periphery or surface), channel width, and channel speed.
The family of flat k-ary n-cube topologies and their clustered variations (k-ary n-cube cluster-c) are evaluated to derive optimal configurations which can lead to cost-effective design of scalable parallel systems using wormhole routing. The analysis identifies processor board area, channel width, and pinout density as critical parameters. The study indicates that cluster-based parallel systems can deliver better performance at lower cost. It is predicted that optimal configurations for future systems will be cluster-based (2-10 processors per cluster) with 3D/4D/5D inter-cluster interconnection. The framework is general enough to capture technological trends of future years, and it is validated by the design solutions of current machines using contemporary technologies.

Keywords: Multiprocessor System Design, Scalable Systems, Clustered Architectures, Hierarchical Organization, k-ary n-cube Interconnection, Packaging Constraints, Pinout Technology.

TR58 Rapid Previewing via Volume-based Solid Modeling
by Naeem Shareef and Roni Yagel

Quick previewing of 3D models is necessary for efficient product design and rapid prototyping. An inherent weakness of most solid modeling systems is that as model complexity increases, quick 3D viewing suffers, especially on low-cost workstations. We explore an alternative to surface representation in which object space, called volume, is subdivided into a 3D grid of cubic cells, each containing information on the object(s) which occupy it. A data structure is introduced that consists of a 2D array of pointers, each holding a linked list of adjacent non-empty cells. To benefit from data coherency along one dimension, we have developed new modeling and rendering algorithms that are beam-oriented, incremental, and integer-based. To illustrate the usefulness of our approach, we apply it to Constructive Solid Geometry (CSG) modeling.
We describe our prototype system and show, by comparing it to existing systems, that our data structure and its associated algorithms, while being of finite resolution, provide for suitable and more efficient model visualization.

TR57 Micro-Architecture vs. Macro-Architecture
by Joseph E. Hollingsworth and Bruce W. Weide

The field of study commonly known as "software architecture" should be split into two subareas: micro-architecture and macro-architecture. Work to date under the name "software architecture" has concentrated largely on macro-architecture. But micro-architecture, a.k.a. software component engineering, is of equal concern.

TR56 VoxelFlow: A Parallel Volume Rendering Method for Scientific Visualization
by Asish Law, Roni Yagel, and D.N. Jayasimha

Volume visualization techniques are often applied to sampled data obtained from MRI and CT scans as well as to simulated data computed by computational fluid dynamics (CFD) methods. In such applications, a sequence of images is animated by incrementally rotating the volumetric object about its center, so that the scientist can view the structure from all angles. In this paper, we propose an efficient method, called VoxelFlow, for generating rendered images of volume data which can be used in animating rotations of objects. The algorithm uses a look-ahead scheme of voxel dataflow and exploits the communication features of modern scalable multicomputers to achieve near-linear speedup. Simulation results are shown for implementations on mesh architectures like the Intel Paragon and torus architectures like the Cray T3D. Computational and communication load-balancing issues among processors are also addressed.

TR55 Reverse Engineering of Legacy Code is Intractable
by Bruce W. Weide, Wayne D. Heym, and Joseph E. Hollingsworth

Reverse engineering of large legacy software systems is widely recognized to be a difficult problem. How bad is it?
By an argument that identifies key underlying sources of the difficulty, reverse engineering of legacy code is shown to be intractable in the usual computational complexity sense. This conclusion implies that we should not be too enthusiastic about the ultimate value of reverse engineering as the centerpiece of a cost-effective approach to constructing new generations of systems.

TR54 On the Synchronization Mechanisms in Distributed Shared Memory Systems
by Mahendra Ramachandran and Mukesh Singhal

Distributed Shared Memory (DSM) is the implementation of the shared memory programming paradigm on a distributed memory (or multicomputer) system. Programming multicomputer systems using Distributed Shared Memory as the programming model is appealing because it combines the performance advantage of distributed memory systems with the ease of programming of shared memory systems. In DSM systems, cooperating tasks communicate with one another through shared variables. Thus, DSM systems must provide synchronization mechanisms to coordinate concurrent access to these shared variables. In this paper we describe and classify the synchronization mechanisms supported by several Distributed Shared Memory systems. We classify these systems according to whether they are hardware- or software-based, whether the mechanism is integrated into the system or not, and whether the implementation of the synchronization mechanism is centralized or distributed.

Key phrases: Distributed Shared Memory, Distributed Memory Architectures, Operating Systems, Synchronization.

TR53 Global Reduction in Wormhole k-ary n-cube Networks with Multi-destination Exchange Worms
by Dhabaleswar K. Panda

This paper presents a new approach to implementing global reduction operations (including barrier synchronization) in wormhole k-ary n-cubes. The novelty lies in using a multi-destination message passing mechanism instead of single-destination (unicast) messages.
Using pairwise exchange worms along each dimension, it is shown that global reduction and barrier synchronization operations, as defined by the Message Passing Interface (MPI) standard, can be implemented with n communication start-ups, as compared to the 2n⌈log2 k⌉ start-ups required with unicast-based message passing. This leads to an asymptotic improvement by a factor of 2⌈log2 k⌉. For different values of communication start-up time and system size, the analysis indicates that the new framework can implement faster global reduction compared to the unicast-based scheme. Issues related to the new approach are studied, and the required architectural modifications to the router interface are presented. The analysis indicates that as the communication start-up time continues to decrease in new-generation systems, the proposed framework can be used to implement fast global reduction and barrier synchronization in large wormhole-routed systems without a separate control network.

Keywords: Global Reduction, Collective Communication, Wormhole Routing, Virtual Cut-through, Broadcasting, Meshes, k-ary n-cubes, and Path-based Routing.

TR52 Multi-Phase Redistribution: A Communication-Efficient Approach to Array Redistribution
by S.D. Kaushik, C.-H. Huang, J. Ramanujam, and P. Sadayappan

Distributed-memory implementations of several scientific applications require array redistribution. Array redistribution is used in languages such as High Performance Fortran to dynamically change the distribution of arrays across processors. Performing array redistribution incurs two overheads: an indexing overhead for determining the set of processors to communicate with and the array elements to be communicated, and a communication overhead for performing the necessary irregular all-to-many personalized communication. In this paper, efficient runtime methods for performing array redistribution are presented.
To reduce the indexing overhead, precise closed forms for enumerating the processors to communicate with and the array elements to be communicated are developed for two special cases of array redistribution involving block-cyclically distributed arrays. The general array redistribution problem for block-cyclically distributed arrays can be expressed in terms of these special cases. Using the developed closed forms, a distributed algorithm for scheduling the irregular communication for redistribution is developed. The generated schedule eliminates node contention and incurs the least communication overhead. The scheduling algorithm has an asymptotically lower scheduling overhead than techniques presented in the literature. Based on the developed closed forms, a cost model for estimating the communication and indexing overheads of array redistribution is developed. Using this model, a multi-phase approach for reducing the communication cost of array redistribution is presented. The key idea is to perform the redistribution as a sequence of redistributions such that the total cost of the sequence is less than that of direct redistribution. Algorithms for determining the sequence of intermediate data distributions which minimizes the total redistribution cost of multi-dimensional arrays are presented. Experimental results on the Cray T3D and IBM SP2 demonstrate the validity of the developed cost models and the efficacy of the multi-phase approach for array redistribution.

Keywords: Array Redistribution, Communication Scheduling, Distributed-Memory Computers, Data Distribution, Data Communication, High Performance Fortran, Node Contention.

TR51 Path Failure in Cylindrical Mesh Networks
by J. Ramachandran

The problem of node failures in interconnection networks has been extensively studied. In this paper we define the notion of a more catastrophic form of failure - path failure - in which a sequence of nodes from a source to a sink in the network fail.
The corresponding collapse in the structural properties of the underlying network is analyzed in quantitative terms, as a function of network size, using the notion of catastrophic failure. We examine this problem in a fixed setting, that of the cylindrical mesh, a network designed to maximize the number of communication routes between a set of producers and a set of consumers. Our proofs use interesting graph-theoretic and combinatorial techniques, and formally establish that connectivity is drastically affected by path failure.

TR50 The Polynomial Time Function Hierarchy
by J. Ramachandran

We study Krentel's polynomial time function hierarchy (PFH), which classifies and characterizes optimization functions using deterministic oracle transducers and NP sets as oracles. We introduce a quantifier syntax to describe PFH, and explicitly introduce the Pi_k^MM levels of the hierarchy. We discuss complete functions, metric reducibility, and relationships between different levels of PFH in analogy to Stockmeyer's polynomial time hierarchy (PH). We show that the closure of Pi_k^MM under metric reductions coincides with that under Turing reductions. In contrast, we show that these two reducibilities differ for FDEXT. We then proceed to build a theory of closure properties of PFH under operators such as sum, product, majority, plurality, etc. For each choice of an exponential-arity operator we consider two forms of closure: closure without context and closure under context. We present a series of results, both positive (closure is possible) and negative (closure implies collapse), on the closure of PFH for each choice of operator and form of closure.

TR49 Kolmogorov Complexity and Toda's Theorem
by J. Ramachandran

Toda's theorem states that the polynomial time hierarchy is contained in P(PP). An important building block in the proof is the lemma: NP is random reducible ...
We present a new proof of a corollary of this lemma, ....., using Kolmogorov complexity. The proof provides a new perspective for understanding Toda's theorem and adds to the growing corpus of work relating Kolmogorov complexity to structural complexity.

NOTE: where ellipses appear ( .... ), some of the text has not been reproduced because emacs does not handle the symbols.

TR48 Modulo Classes and Advice
by J. Ramachandran

We prove an upper bound on computing advice for a language A in the class of languages acceptable in deterministic polynomial time with logarithmic advice, A ... P/log, relative to A itself. We then provide conditions under which advice for a language A ... P/log can be computed in PF(A), and conditions for the existence of a set A in P/log without advice in PF(A). We show that paddable languages in MODk P/log that are also self-reducible are in MODk P, without advice. This is an extension of similar results for self-reducible paddable sets in P/log by Balcazar and Schoening [3].

Keywords: Computational complexity, Kolmogorov complexity, counting classes, advice.

NOTE: where ellipses appear ( .... ), some of the text has not been reproduced because emacs does not handle the symbols.

TR47 A Trip-based Multicasting Model in Wormhole-routed Networks with Virtual Channels
by Yu-Chee Tseng, D.K. Panda, and Ten-Hwang Lai

THIS TECH REPORT IS AN UPDATED VERSION OF 1993/TR03.ps

This paper focuses on efficient multicasting in wormhole-routed networks. A trip-based model is proposed to support adaptive, distributed, and deadlock-free multiple multicast on a network with arbitrary topology using at most two virtual channels per physical channel. This model significantly generalizes the path-based model proposed earlier [21, 22], which works only for Hamiltonian networks and cannot be applied to networks whose topology becomes arbitrary as a result of system faults.
Fundamentals of the trip-based model, including the necessary and sufficient condition for it to be deadlock-free and the use of an appropriate number of virtual channels to avoid deadlock, are investigated. The potential of this model is illustrated by applying it to hypercubes with faulty nodes. Simulation results indicate that the proposed model can implement multiple multicast on faulty hypercubes with negligible performance degradation.

Key Words: Routing algorithm, Interprocessor communication, Multicast, Virtual channel, Wormhole routing, Path-based routing, Collective Communication, and Fault tolerance.

TR46 Tracing the Missing Information
by Neelam Soundararajan

We consider two types of trace-based systems: channel-trace systems and process-trace systems. Specifications and proofs in the former type of system are usually considered much simpler than in the latter. We show that if we want to ensure that no essential information is missing, we are forced to add considerable complexity to the channel-trace systems. We conclude that specifications and proofs of any program in the process-trace systems are no more complex than in the channel-trace systems.

TR45 Mathematical Foundations and Notation of RESOLVE
by Wayne E. Heym, Timothy J. Long, William F. Ogden and Bruce W. Weide

The RESOLVE approach to reusable component-based software engineering is mathematically based. This paper discusses the logical foundations and terminology of RESOLVE, the built-in RESOLVE notation for writing mathematics, and the RESOLVE mechanisms that support description of mathematics that has no built-in notation. It is intended to serve primarily as a reference document.

TR44 Optimizing Compilation of Linear Arithmetic in a Class of Constraint Logic Programs
by Spiro Michaylov and Bill Pippin

A central issue in the optimizing compilation of Constraint Logic Programming (CLP) languages is how to compile away as much general constraint solving as possible.
Most such work relies on obtaining mode and type information by global analysis, and uses it to generate specialized code for individual constraints and calls, often with the aid of multiple specialization. Some recent work has augmented these techniques with procedure-level analysis of the inter-relationships between constraints, to detect constraints that subsume other constraints and variables that cease to be reachable at some point in a computation. In combination, these techniques have been shown to dramatically improve performance for a number of programs. Here we continue this line of investigation by considering a class of programs that accumulate and simplify systems of linear arithmetic constraints. The programs contain procedures that relate their parameters by an affine transform. For some calling patterns, the procedures repeatedly compose and simplify affine transforms, using virtually the full power of the linear arithmetic constraint solver and incurring considerable expense. We describe source-to-source translations that make this composition of transforms explicit, replacing constraint solving with ground (imperative) arithmetic. We demonstrate the translations and present experimental data showing substantial improvements in execution speed and space utilization.

TR43 Synchronization and Desynchronization in a Network of Locally Coupled Wilson-Cowan Oscillators
by Shannon Campbell and DeLiang Wang

A network of Wilson-Cowan oscillators is constructed, and its emergent properties of synchronization and desynchronization are investigated by both computer simulation and formal analysis. The network is a two-dimensional matrix, where each oscillator is coupled only to its neighbors. We show analytically that a chain of locally coupled oscillators (the piecewise-linear approximation to the Wilson-Cowan oscillator) synchronizes, and present a technique to rapidly entrain finite numbers of oscillators.
The coupling strengths change on a fast time scale based on a Hebbian rule. A global separator is introduced which receives input from and sends feedback to each oscillator in the matrix. The global separator is used to desynchronize different oscillator groups. Unlike many other models, the properties of this network emerge from local connections, which preserve spatial relationships among components and are critical for encoding Gestalt principles of feature grouping. The ability to synchronize and desynchronize oscillator groups within this network offers a promising approach for pattern segmentation and figure/ground segregation based on oscillatory correlation.

TR42 A Hierarchically-Connected Multiprocessor
by Jeffrey D. Martens

A hierarchically-connected multiprocessor model is presented and compared to existing hierarchical shared-memory multiprocessor models. The key features of this architectural model are that a portion of the hierarchical memory is contained locally within each processing element, all memory is visible to every processing element, and each memory unit is local (but not private) to some processing element. Software and programming issues are explored through discussion of producer-consumer communication, barrier execution, and a hierarchical version of guided self-scheduling.

TR41 Fast Barrier Synchronization in Wormhole k-ary n-cube Networks with Multidestination Worms
by D.K. Panda

UPDATED 3/95

This paper presents a new approach to implementing fast barrier synchronization in wormhole k-ary n-cubes. The novelty lies in using a multidestination message passing mechanism on base-routing-conformed paths instead of point-to-point unicast messages. Two different multidestination worm types, gather and broadcasting, are introduced to implement the report and wake-up phases of barrier synchronization, respectively. Using these new worms, algorithms for complete and arbitrary-set barrier synchronization are developed.
It has been shown that complete barrier synchronization in a k-ary n-cube system with e-cube routing can be implemented with 2n communication start-ups, as compared to the 2n log2 k start-ups needed with unicast-based message passing. This leads to an asymptotic improvement by a factor of log2 k. For different values of communication start-up time, link propagation delay, and number of participating nodes, a simulation study indicates that the new framework can reduce barrier synchronization cost considerably compared to the unicast-based scheme. Scalability of the proposed scheme is also shown with respect to the above system and technological parameters. The framework demonstrates potential for supporting fast barrier synchronization in large wormhole-routed systems.

Keywords: Barrier Synchronization, Collective Communication, Wormhole Routing, Virtual Cut-through, Broadcasting, Multicasting, Meshes, k-ary n-cubes, and Path-based Routing.

TR40 On Mapping Data and Computation for Parallel Sparse Cholesky Factorization
by Kalluri Eswar, C.-H. Huang, and P. Sadayappan

When performing the Cholesky factorization of a sparse matrix on a distributed-memory multiprocessor, the methods used for mapping the elements of the matrix and the operations constituting the factorization to the processors can have a significant impact on the communication overhead incurred. This paper explores how two techniques, one used when mapping dense Cholesky factorization and the other used when mapping sparse Cholesky factorization, can be integrated to achieve a communication-efficient parallel sparse Cholesky factorization. Two localizing techniques to further reduce the communication overhead are also described. The mapping strategies proposed here, as well as other previously proposed strategies, fit into the unifying framework developed in this paper. Communication statistics for sample sparse matrices are included.
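As background for the mapping strategies discussed in the TR40 abstract above, the following minimal Python sketch illustrates the standard 2D block-cyclic assignment of matrix blocks to a processor grid, the kind of element-to-processor mapping used for dense Cholesky factorization. It is an illustration of the general idea only, not the integrated mapping proposed in the report; the function name and grid sizes are hypothetical.

```python
# Illustrative sketch (not the report's algorithm): standard 2D
# block-cyclic mapping. Block (i, j) of the matrix is assigned to
# processor (i mod pr, j mod pc) on a pr x pc processor grid, so
# blocks are spread evenly and each processor's work is balanced.
def block_cyclic_owner(i, j, pr, pc):
    """Return (row, col) coordinates of the processor owning block (i, j)."""
    return (i % pr, j % pc)

# Example: which processors own the blocks of a 4 x 6 block matrix
# on a 2 x 3 processor grid.
owners = [[block_cyclic_owner(i, j, 2, 3) for j in range(6)]
          for i in range(4)]
```

With this mapping, consecutive block rows and columns cycle over the grid, which is what gives the dense scheme its load balance; sparse mappings must additionally account for the nonzero structure.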
TR39 Complete Process Recovery in Distributed Systems Using Vector Time
by Golden G. Richard, III and Mukesh Singhal

Distributed applications typically consist of a group of processes executing on processors which share neither a global clock nor shared memory and which communicate solely via message passing. As the number of processors increases, so does the probability that a processor will fail at some point during application execution in a distributed system; this is particularly true for long-running applications. In order to reduce lost work, the processes which make up a distributed application should be recoverable. This paper presents a comprehensive technique for complete process recovery in distributed systems. By complete recovery we mean the restoration of a consistent system state as well as proper handling of the lost, duplicate, and delayed messages resulting from the failure and the restoration of the system to a consistent state. Message handling has often been treated casually in previous recovery techniques. In contrast, our recovery techniques directly address message handling issues. We classify the message types which a recovery mechanism must deal with to motivate our message-handling techniques. The use of vector time allows us to reduce the overhead associated with recovery and helps expedite the recovery process after a failure.

Keywords: Message passing, distributed process recovery, complete process recovery, vector time, causality, checkpointing, message logging, optimistic recovery.

TR38 Priority Ethernets
by Frank Adelstein and Mukesh Singhal

This paper addresses issues related to the delivery of multimedia traffic on local area networks. Multimedia integrates voice and video along with text and images into existing systems. Traditional local area networks, e.g., Ethernet, were designed to handle regular data traffic, which is bursty in nature and for which variable delay is acceptable.
Multimedia traffic, on the other hand, requires a constant delay in addition to fast (or real-time) delivery. Multimedia traffic also requires a large amount of bandwidth compared to regular traffic. Multimedia is becoming important in commercial as well as non-commercial organizations, and Ethernet is a widely used local area network among most of these organizations; therefore, modification of Ethernet protocols to integrate multimedia streams is an important issue. In this paper, we present solutions to the problems associated with the transfer of multimedia data on Ethernet and qualitatively study the performance of these solutions and their suitability to multimedia traffic. Specifically, priority protocols are developed that enable priority transmission of data on Ethernets. Multimedia data can be assigned priority over regular data, and the developed priority protocols can be used to deliver multimedia data over the Ethernet in a timely manner.

TR37 Introduction to the Nagiya Editor
by Pete Ware

The Nagiya Editor is a graphical editor for representing object-oriented design. It allows the user to draw objects and indicate the relationships between these objects. Being a graphical editor, it supports commands such as printing, moving, (re)drawing, and (re)sizing. It is implemented using Xm++, a C++ library that supports the Motif widget set under the X11 Window System. Eventually, the Nagiya Editor will allow one to specify two schemas and indicate how data stored under one representation can be translated into the other representation. The Nagiya Editor will also facilitate this translation process by providing visual feedback to the user during conflict resolution.

TR36 Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems
by Ravi Prakash and Mukesh Singhal

A mobile computing system consists of mobile and stationary nodes, connected to each other by a communication network.
The presence of mobile nodes in the system places constraints on the permissible energy consumption and available communication bandwidth. To minimize the computation lost during recovery from node failures, periodic collection of a consistent snapshot of the system (a checkpoint) is required. Locating the mobile nodes contributes to the checkpointing and recovery costs. Synchronous snapshot collection algorithms, designed for static networks, either force every node in the system to take a new local snapshot or block the underlying computation during snapshot collection. Hence, they are not suitable for mobile computing systems. This paper presents a synchronous snapshot collection algorithm for mobile systems that neither forces every node to take a local snapshot nor blocks the underlying computation during snapshot collection. If a node initiates snapshot collection, local snapshots need to be taken only of those nodes that have directly or transitively affected the initiator since their last snapshots. We also propose a minimal rollback/recovery algorithm in which the computation at a node is rolled back only if it depends on operations that have been undone due to the failure of node(s). Both algorithms have low communication and storage overheads, and satisfy the low energy consumption and low bandwidth constraints of mobile computing systems.

Key words: checkpointing, causal dependency, global snapshot, mobile computing systems, portable computers, recovery.

TR35 A Note on Set Bit Enumeration
by Thomas Schwentick and J. Ramachandran

An $m$-set-bit-enumerator for a function $f$ is a source of information that, in response to the query ``How many set bits (`1's) are there in the binary representation of $f(x)$?'', computes a list of $m$ numbers, one of which is the correct value. This paper extends previous results on the hardness of set-bit enumeration for the function classes $\optP$ and $\numberP$.
Our main theorems state: if polynomial time computable $n$-set-bit-enumerators exist for all $\optP$ functions, then $\optP = \PF$; and if polynomial time computable $n^{\frac{1}{2} - \epsilon}$-set-bit-enumerators exist for all $\numberP$ functions, then $\numberP = \PF$. In both cases $n$ is essentially the length of the value of the given function.

TR34 Distributed Semaphores
by Mahendra Ramachandran and Mukesh Singhal

Semaphores provide a basic synchronization mechanism in uni- and multiprocessor systems. Supporting semaphores in distributed systems has not received much attention. DSM systems provide a shared memory programming model on distributed systems and thus increase the appeal of providing efficient, decentralized solutions for the semaphore mechanism in a distributed system. We present a distributed semaphore mechanism in this paper. We use a two-level hierarchy of processors to increase locality of reference and thus the efficiency of access.

Key phrases: Operating Systems, Process Synchronization, Semaphores, Distributed Memory Architectures.

TR33 Multidestination Message Passing Mechanism Conforming to Base Wormhole Routing Scheme
by Dhabaleswar K. Panda, Sanjay Singal and Pradeep Prabhakaran

This paper proposes a novel concept of a multidestination worm mechanism which allows a message to be propagated in a wormhole network conforming to the underlying base routing scheme (e-cube, planar, turn, or fully adaptive). Using this model, any source has the potential to deliver a message to multiple destinations along any valid path in the system conforming to the base routing scheme, while encountering only a single communication start-up. The flexibility of sending unicast messages exists under this model as a subset operation. Two schemes are developed and evaluated under this model to perform fast multicasting on 2D/3D meshes/tori.
Not only do these schemes demonstrate superiority over the Umesh (unicast-based multicast) [25] and Hamiltonian-path-based [24] schemes, they also indicate a unique and interesting result: the cost of multicast operations can be reduced or kept near-constant in e-cube systems as the degree of multicast (number of destinations per source) increases. The proposed schemes can also take advantage of routing adaptivity and irregularity in system topology to further reduce the cost of multicast operations. These results are the first in the wormhole-routing literature to propose multicasting schemes with such reduced overhead and provision for taking advantage of adaptivity. This multicast model lays a new foundation for building wormhole-routed distributed-memory systems with fast collective communication operations and distributed-shared-memory systems with fast cache-coherency support.

TR31 Repeated Redundant Inequalities in Constraint Logic Programming
by Spiro Michaylov

A number of Constraint Logic Programming systems, including CLP(R) and Prolog III, decide simultaneous linear inequalities as part of the fundamental operational step of constraint solving. While this can contribute tremendously to the usefulness of the systems, it is computationally quite expensive. Non-ground inequalities must generally be tested for consistency with the collected constraint set and then added to it, increasing its size and thus making the next such test more expensive. Future redundant inequalities in a program are those that are guaranteed to be subsumed after no more than one subsequent procedure call, usually in the context of a recursive procedure. It has been noted that such inequalities need only be tested for consistency with the current constraint set, thus resulting in dramatic savings in execution speed and space usage. In this paper we generalize the notion of future redundancy in a number of ways and thus broaden its applicability.
Thus we show how to dramatically improve the performance of a wider class of programs that rely heavily on linear inequalities. TR30 Skeletons and Techniques for the Systematic Development of Constraint Logic Programs by Spiro Michaylov We study the systematic development of Constraint Logic Programs from the viewpoint of Skeletons and Techniques as described by Kirschenbaum and Sterling. We describe a number of fundamental skeleton classes for CLP, and generalize the notion of skeletons to deal with non-structural recursion. Then we describe a range of useful techniques for extending these skeletons. Furthermore, we introduce important classes of techniques that alter the control flow of skeletons in certain well-defined and desirable ways. This work represents a step towards understanding how to develop complex CLP programs easily, and is expected to contribute to the adoption of CLP for application projects. It may also lead to the development of semi-automated program development tools. Finally, it helps to justify a substantial body of present work on CLP compiler optimizations that depends on the procedure-level structure of programs. TR29 Architectural and Communication Issues in Designing Heterogeneous Parallel Systems with Optical Interconnection by Ravi Prakash and Dhabaleswar K. Panda THIS REPORT WAS UPDATED 7/95 This paper investigates architectural and communication issues in designing heterogeneous parallel systems. As the speed of processors and parallel systems keeps increasing over the years, electronic interconnections like HIPPI and FDDI are reaching their limits in interconnecting several parallel systems to provide a heterogeneous parallel computing environment. This paper explores the suitability of a passive star-coupled optical interconnection using wavelength division multiplexing as the system interconnect to provide the high-bandwidth (Gbits/sec) communication demanded by heterogeneous systems.
Several different communication strategies (combinations of communication topologies and protocols) over Wavelength Division Multiplexed (WDM) communication media like optic fiber are investigated. A representative master-slave computational model is used to evaluate and determine a suitable communication architecture for such systems. The interplay between system speed, network speed, task granularity, and degree of parallelism is studied. It is shown that a hierarchical ALOHA-based communication strategy between the master and the slaves, implemented on top of the passive star-coupled network, leads to a considerable reduction in channel contention. Compared to direct point-to-point communication, it is demonstrated that hierarchical interconnection provides a 50%-80% reduction in task completion time for applications with medium to high degrees of parallelism. A comparable reduction in channel contention is also shown to be achieved by using tunable acousto-optic filters at master nodes. These two strategies demonstrate significant potential for building future-generation heterogeneous parallel systems capable of delivering high performance. Keywords: Heterogeneous parallel systems, hierarchical networks, optical interconnection, meta-parallelism, task scheduling, time division multiplexing, tunable filters, wavelength division multiplexing. TR28 Data-Parallel Volume Rendering Algorithms by Roni Yagel and Raghu Machiraju Images generated from volumetric datasets are increasingly being used in many biomedical disciplines, archeology, geology, high energy physics, computational chemistry, computational fluid dynamics, meteorology, astronomy, computer aided design, environmental sciences, and many others. However, given the overwhelming amount of computation required for accurate manipulation and rendering of 3D rasters, it is obvious that special-purpose hardware, parallel processing, or a combination of both is imperative.
In this presentation we consider the image composition scheme for parallel volume rendering, in which each processor is assigned a portion of the volume. A processor renders only its data by using any existing volume rendering algorithm. We describe one such parallel algorithm that also takes advantage of vector processing capabilities. The resulting images from all processors are then combined (composited) in visibility order to form the final image. The major advantage of this approach is that, as viewing and shading parameters change, only 2D partial images are communicated among processors and not 3D volume data. Through experimental results and performance analysis we show that our parallel algorithm is amenable to extremely efficient implementations on distributed-memory MIMD (Multiple Instruction Multiple Data) vector-processor architectures. It is also very suitable for hardware implementation based on image composition architectures, it supports various volume rendering algorithms, and it can be extended to provide load-balanced execution and to support rendering of unstructured grids. TR27 A Formal Method for Classifying Distributed Systems by Deborah Shands The need for communication and coordination between agents in a distributed system distinguishes distributed problems from uni-processor problems. The level of coordination achievable in a system determines which problems are solvable in that system. Often, it is not clear from the informal and operational descriptions of systems in the literature just what level of coordination is possible. This leaves us with the practical question: which properties of a given system are relevant to another system? We introduce a method for building a problem-based taxonomy of distributed systems, where a system is classified according to the distributed problems which it can solve. We expect that such a taxonomy would ease the process of applying results from one system to a new system.
The method is non-operational, based on specifications of knowledge transfer among agents over time. It has the following components: 1. A propositional modal language and associated decidable logic with knowledge and temporal operators for specifying the knowledge-transfer behavior and requirements of distributed systems, protocols and problems; 2. The concept of generic distributed problems which, along with full-information protocols, allows systems to be studied and compared in isolation from the influences of protocol choice and specific problem requirements; 3. A small model theorem which allows Kripke structures to be mechanically produced from system and protocol specifications, and a bisimulation result which allows those Kripke structures to be formally compared. To classify a system by our method, one would first formally specify the system, full-information protocol and problem. A Kripke structure modeling the system and protocol behavior can then be mechanically generated and checked to determine whether the problem specification is met. Kripke structures modeling various systems can be compared by finding bisimulations between knowledge states of the structures. This classification method is formal, mechanizable and comprehensive, with the flexibility to compare diverse systems. TR26 Sphylinders, Cone-Spheres and Rounded Polygons by Karansher Singh and Richard Parent Implicit surfaces provide an elegant approach to modeling and animating physically deformable and organic-looking objects. A major drawback of this approach is its rendering complexity. Some work has been done on developing fast, robust techniques for ray tracing and polygonization of implicit surfaces. While a fast scanline technique has been proposed for rendering spheres, spheres alone are limited as a modeling primitive.
This paper addresses techniques for fast scanline rendering of implicit surfaces in general, and provides details for the implicit primitives: spheres, sphylinders, cone-spheres and rounded polygons. These primitives are particularly useful for modeling articulated figures. TR25 Development in Logic-based Languages by Marc Kirschenbaum, Spiro Michaylov, and Leon Sterling. To help encourage the wide use of a variety of logic-based languages, we advocate the identification and use of systematic program construction methods. Our approach is to capture programming idioms in the form of skeletons, or basic control flows, and standard programming techniques for construction and manipulation of data structures. We argue that the approach, coming initially from Prolog and logic grammars, is equally applicable in the context of constraints, committed-choice and higher-order constructs, and indeed logic-based languages in general. This paper introduces our generalized view of skeletons and techniques, gives examples of the breadth of their applicability, and explores positive consequences for program development. TR24 Polyhedral Shapes as General Implicit Surface Primitives by Karansher Singh and Richard Parent Implicit surfaces provide an elegant approach to modeling and animating physically deformable and organic-looking objects. Existing implementations use simple analytically defined geometric shapes as modeling primitives. This paper discusses limitations of these shapes and argues the case for a more general primitive. A generalized primitive implicit shape definition is presented, and the use of arbitrary polyhedral structures as primitives is described. A major drawback of implicit surfaces is their rendering complexity. Techniques for rendering polyhedral primitives efficiently, with emphasis on ray tracing, are described. Finally, useful applications of the above primitive definition, such as shape change, are discussed.
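The implicit surface primitives in TR26 and TR24 above are built on scalar field functions. As a rough, hypothetical sketch (a standard metaball-style formulation for illustration, not the authors' exact definitions), a sphere primitive and a blend of several primitives can be written as:

```python
import math

def sphere_field(p, center, radius):
    """Implicit field of a sphere primitive: positive inside,
    zero on the surface, negative outside."""
    return radius - math.dist(p, center)

def blended_field(p, primitives):
    """Blend primitives with smoothly decaying (Gaussian-style)
    contributions; the surface is the iso-contour where the field == 1."""
    total = 0.0
    for center, radius in primitives:
        d = math.dist(p, center)
        total += math.exp(-(d / radius) ** 2)
    return total

# Two unit spheres close enough that their fields merge smoothly:
prims = [((0.0, 0.0, 0.0), 1.0), ((1.5, 0.0, 0.0), 1.0)]
midpoint_value = blended_field((0.75, 0.0, 0.0), prims)  # exceeds 1: inside
```

Ray tracing or polygonization then amounts to locating the iso-contour of such a field, which is the source of the rendering cost both abstracts mention.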
TR23 Measures of Message Latency for Wormhole Routed Distributed Memory Multiprocessors by Kant C. Patel and D.N. Jayasimha A traditional measure of the communication performance in a massively parallel processor (MPP) is the average message latency. In this paper we argue that the standard deviation of message latencies should be considered as an additional important measure in a multi-user MPP environment. Using a simulation-based approach, we investigate various schemes to reduce the standard deviation without adversely affecting the average latency. Simulations on a 16 x 16 wormhole-routed mesh architecture reveal that the standard deviation varies typically from 25% to 50% of the average, depending on the message generation rate, the message length, the traffic pattern, and the routing algorithm. Since the differential blocking faced by messages is one of the major causes of the large standard deviation, we take two approaches to reducing it which do not degrade the average. The first controls the relative progress with which different messages proceed towards their destinations by prioritizing them dynamically. The second employs multiple virtual channels per physical channel. We compare the performance of several schemes for each of these two approaches, using deterministic routing as our base case. A distinguishing feature of our investigation is that the simulation model accounts for the degradation in network cycle time with the increased router complexity required by the use of the priority schemes and by the virtual channels. In fact, we show how our experimental results could have been misleading had we followed the traditional "cycle counting" approach. Keywords: wormhole routing, message priorities, virtual channels, router design complexity, average message latency, standard deviation, performance, simulation. TR22 A Necessary and Sufficient Condition for Deadlock-Free Wormhole Routing by Loren Schwiebert and D.N.
Jayasimha THIS TR WAS UPDATED 12/94 An important open problem in wormhole routing has been to find a necessary and sufficient condition for deadlock-free adaptive routing. Recently, Duato has solved this problem for a restricted class of adaptive routing algorithms. In this paper, a necessary and sufficient condition is proposed that can be used for any adaptive or nonadaptive routing algorithm for wormhole routing, as long as only local information is required for routing. The underlying proof technique introduces a new type of dependency graph, the channel waiting graph, which omits most channel dependencies that cannot be used to create a deadlock configuration. The necessary and sufficient condition can be applied in a straightforward manner to most routing algorithms. This is illustrated by proving deadlock freedom for a partially adaptive nonminimal mesh routing algorithm that does not require virtual channels and a fully adaptive minimal hypercube routing algorithm with two virtual channels per physical channel. Both routing algorithms are more adaptive than any previously proposed routing algorithm with similar virtual channel requirements. Keywords: wormhole routing, routing algorithms, deadlock freedom, channel waiting graph, necessary and sufficient condition, mesh architecture, hypercube architecture. TR21 The Effects of Layering and Encapsulation on Software Development Cost and Quality by Stu Zweben, Steve Edwards, Bruce Weide, & Joe Hollingsworth Software engineers often espouse the importance of using abstraction and encapsulation in developing software components. They advocate the "layering" of new components on top of existing components, using only information about the functionality and interfaces provided by the existing components. This layering approach is in contrast to a "direct implementation" of new components, utilizing unencapsulated access to the representation data structures and code present in the existing components.
However, there is no empirical evidence that indicates whether the layering approach improves the productivity or quality of the development activity. We discuss three controlled experiments designed to gather such empirical evidence. The results support the contention that layering significantly reduces the effort required to build new components. Furthermore, the quality of the components, in terms of the number of defects introduced during their development, is at least as good using the layered approach. We also discuss some interesting issues related to the statistical analysis of experiments such as these. Index Terms: empirical study, encapsulation, software components, abstract data types, software development, software reuse. TR20 Impact of Message-Ordering in Wormhole-Routed Multicomputers by Dhabaleswar K. Panda and Vibha A. Dixit-Radiya. This paper analyzes the impact of message-ordering, between outgoing messages from a sender to multiple receivers (called multicasts), on the completion time of a program for wormhole-routed distributed-memory systems. We study how best to order a set of outgoing messages by taking into account message criticality and architectural issues like multiple injection ports, routing adaptivity, and link contention. First, the simple message-ordering algorithm of Dikaiakos et al. [8], proposed only for fully-connected one-port systems, is enhanced to obtain a static message-ordering algorithm. This enhanced algorithm is based on message criticality obtained from the temporal dependencies of the program and works for non-fully-connected systems. Next, a dynamic message-ordering algorithm is proposed which additionally considers architectural issues, including link contention and multiple ports, for wormhole-routed systems. Both algorithms are compared against naive sequential message-ordering on an event-driven simulator for Gaussian Elimination and random task graphs.
Simulation experiments show a reduction in program completion time of up to 33.1% by using static over sequential ordering. Similarly, the dynamic ordering shows an additional improvement of up to 17.7% over static ordering. This research indicates that good message-ordering schemes need to be used as the last step in the mapping and scheduling process to generate efficient programs for wormhole-routed distributed-memory systems. Index Terms - message-ordering, wormhole-routing, multicast, link contention, multiport, adaptive routing, temporal communication graph, mapping. TR19 Compiling Array Expressions for Efficient Execution on Distributed-Memory Machines by S.K.S. Gupta, S.D. Kaushik, C.-H. Huang and P. Sadayappan (Revised 3/95) Array statements are often used to express data-parallelism in scientific languages such as Fortran 90 and High Performance Fortran. In compiling array statements for a distributed-memory machine, efficient generation of communication sets and local index sets is important. We show that for arrays distributed block-cyclically on multiple processors, the local memory access sequence and communication sets can be efficiently enumerated as closed forms using regular sections. First, closed-form solutions are presented for arrays that are distributed using block or cyclic distributions. These closed forms are then used with a virtual processor approach to give an efficient solution for arrays with block-cyclic distributions. This approach is based on viewing a block-cyclic distribution as a block (or cyclic) distribution on a set of virtual processors, which are cyclically (or block-wise) mapped to physical processors. These views are referred to as virtual-block or virtual-cyclic views, depending on whether a block or cyclic distribution of the array on the virtual processors is used.
The virtual processor approach permits different schemes based on the combination of the virtual processor views chosen for the different arrays involved in an array statement. These virtualization schemes have different indexing overheads. We present a strategy for identifying the virtualization scheme which will have the best performance. Performance results on a Cray T3D system are presented for hand-compiled code for array assignments. These results show that using the virtual processor approach, efficient code can be generated for execution of array statements involving block-cyclically distributed arrays. Index Terms: Array statement, compiler optimizations, data communication, distributed-memory machines, data distribution, High Performance Fortran, Fortran 90. TR18 Lossless Compression of Volume Data by James E. Fowler and Roni Yagel Data in volume form consumes an extraordinary amount of storage space. For efficient storage and transmission of such data, compression algorithms are imperative. However, most volumetric datasets are used in biomedicine and other scientific applications where lossy compression is unacceptable. We present a lossless data-compression algorithm which, being oriented specifically toward volume data, achieves greater compression performance than the generic compression algorithms typically available on modern computer systems. Our algorithm is a combination of differential pulse-code modulation (DPCM) and Huffman coding and results in compression of around 50% for a set of volume data files. TR17 High Quality Template-Based Volume Rendering by Roni Yagel and Kim Ciula. Volumetric representation of 3D scenes is becoming widely used in scientific visualization and computer graphics. Volume rendering confronts performance hurdles due to the voluminous nature of the data, and quality hurdles due to its discreteness.
In this paper we present a three-phase template-based algorithm for expediting parallel viewing of volumetric datasets. Our approach exploits coherency between rays in parallel views by storing, in a ray-template, information that is common to all rays. From an axis-aligned plane, and within the projected region of the volume, parallel rays are cast into the volume by repeating the sequence of steps specified by the ray-template. We show that the template can also store information that is used to expedite screen-space and object-space supersampling. Finally, we utilize templates to support adaptive volume traversal. TR16 Visibility Computation for Interactive Visualization of Complex Enclosed Environments by Roni Yagel and William Ray. In many visualization applications, as well as in computer graphics, we need to consider large numbers of objects in order to render one image. In many cases rendering can be preceded by a culling phase which employs simple mechanisms to reject most of the objects. As a result, only a very small portion of the model has to actually go through the time-consuming process of hidden object removal. We report on such a culling mechanism that is based on regular space subdivision into cells. These cells are classified into opaque and transparent, and a cell-to-cell visibility algorithm determines, for a given cell, which other cells are potentially visible. Only the objects in the potentially visible set of cells are actually submitted to the hidden object removal algorithm. We report on the 2D implementation of the algorithm and its performance for walkthroughs of the interior of capillary blood vessels. We propose several improvements to our method and suggest a simple way to extend it to 3D. TR15 The Power of Carry-Save Addition by D.R. Lutz and D.N. Jayasimha A carry-save adder (CSA), or 3-2 adder, is a very fast and cheap adder that does not propagate carry bits.
Historically, carry-save addition has been used for a limited set of intermediate calculations, with the most common example being the accumulation of the partial products of a multiplication. We examine carry-save addition from a new perspective: as a binary operation in which one of the operands is a number in carry-save form. From this perspective we develop five new uses for CSAs: (1) as fast adder-comparators for evaluating whether or not X + Y = Z; (2) as the basis of an expanded instruction set that can reduce branch and data hazards and decrease the cycles per instruction (CPI) on superpipelined architectures; (3) as linear-time multipliers for very large integers; (4) as arbitrary-length, constant-time, synchronous, up-down counters; and (5) as extremely fast frequency dividers. Index terms: carry-save adders, computer arithmetic, superpipelined architectures, pipeline hazards, multipliers, counters, frequency dividers. TR14 Real-time Causal Message Ordering in Multimedia Systems by Frank Adelstein and Mukesh Singhal In multimedia systems, not only do messages that are sent to and received by multiple sites need to have a consistent order imposed by all sites, but cause-and-effect relations must be maintained. Causal ordering allows the cause-and-effect relations of messages to be maintained. This paper presents an algorithm that ensures that multimedia data with real-time deadlines are delivered to the application layer in causal order. The algorithm is designed to ensure that any message that arrives at a destination site before its deadline will be delivered to the application before the message expires. In addition, by focusing on a form of causal ordering violations caused by "the triangle inequality," this algorithm has a low overhead with respect to the amount of information that must be appended to each message. Keywords: Real-time, causal ordering, multimedia systems, delta-causality, triangle inequality.
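The carry-save representation at the heart of TR15 above is easy to illustrate in software, even though the report targets hardware. The following is a hypothetical Python sketch (names and the final zero test are our own, not the authors') of the 3-2 reduction step and of use (1), the adder-comparator for X + Y = Z:

```python
def csa(x, y, z):
    """One carry-save (3-2) step: reduce three addends to a (sum, carry)
    pair with no carry propagation; the invariant is s + c == x + y + z."""
    s = x ^ y ^ z                              # per-bit sum, carries ignored
    c = ((x & y) | (x & z) | (y & z)) << 1     # per-bit majority -> carry word
    return s, c

def adder_comparator(x, y, z, width=32):
    """Sketch of use (1): decide whether x + y == z (mod 2**width).
    x + y == z iff x + y + ~z + 1 == 0 (mod 2**width); a CSA reduces the
    three addends without propagating carries, and hardware can detect the
    all-zero result cheaply (here we simply finish the sum in Python)."""
    mask = (1 << width) - 1
    s, c = csa(x & mask, y & mask, ~z & mask)
    return (s + c + 1) & mask == 0
```

The full-adder identity behind `csa` (XOR for the sum bits, majority for the carry bits) is what lets the five uses in the abstract avoid the carry-propagation delay of an ordinary adder.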
TR13 Semantic Interpretation as Classificatory Abduction by Julie Hartigan. This paper proposes a new computational framework, called classificatory abduction, to perform some aspects of natural language understanding. Inherent in its computational design is the ability to cope with the intractability of semantic interpretation. TR12 Maximal Global Snapshot with Concurrent Initiators by Ravi Prakash and Mukesh Singhal In a distributed system, multiple nodes may initiate snapshot collection concurrently. In this paper we present a global snapshot collection algorithm that combines the information collected by each initiator. This generates a maximal, consistent global snapshot that is more recent than the snapshot collected by any initiator. Global snapshots are used to establish checkpoints for recovery from node failures. A maximal snapshot implies that the amount of computation lost during roll-back, after node failures, is minimized. We also present an efficient information dissemination strategy that nodes can employ to exchange snapshot information with each other. TR11 Multimedia on Local Area Network by Amr Elsaadany, Mukesh Singhal, and Ming T. Liu. In this paper we address the issues related to the delivery of multimedia streams on local area networks. Multimedia integrates voice and video along with text and images into existing systems. Local area networks were designed to handle regular data traffic, which is bursty in nature and for which variable delay is acceptable. Multimedia traffic, on the other hand, requires constant delay in addition to fast (or real-time) delivery. Multimedia traffic also requires large amounts of bandwidth compared to regular traffic. Since local area networks are widely in use, their modification to integrate multimedia streams is very important. In our work we consider some possible solutions to the problems associated with multimedia on local area networks, as well as studying the performance aspects of these solutions.
Key words: Multimedia, local area networks, synchronization, priority, switching hubs. TR10 Decentralized Semaphore Support in a Virtual Shared Memory System by Mahendra Ramachandran and Mukesh Singhal Distributed Shared Memory has increasingly become a desirable programming model for multicomputer systems. Such systems strike a balance between the performance attainable in distributed-memory multiprocessors and the ease of programming of shared-memory systems. In shared-memory systems, concurrent tasks communicate through shared variables, and synchronization of access to shared data is an important issue. Semaphores have traditionally been used to provide this synchronization. In this paper we propose a decentralized scheme to support semaphores in a virtual shared memory system. Our method of grouping semaphores into semaphore-pages and caching a semaphore at a processor on demand eliminates the reliability and bottleneck problems associated with centralized schemes. Key phrases: Operating Systems, Process Synchronization, Semaphores, Distributed Memory Architectures, Distributed Shared Memory. TR09 Optimal, Nonmasking Fault-Tolerant Reconfiguration of Trees and Rings by Anish Arora and Ashish Singhai We design two programs that maintain the processes of an arbitrary distributed system in a rooted spanning tree and in a unidirectional ring, respectively, in the presence of fail-stop failures and repairs of both processes and communication channels. Our programs are notable as they (i) are fully distributed, (ii) have optimal time complexity, and (iii) demonstrate two different approaches to designing nonmasking fault-tolerant programs. Keywords: distributed computing, nonmasking fault-tolerance, process and channel failure and repair, reconfiguration, stabilization. TR08 Comparing Tools and Techniques for Data Translation by Sandra A.
Mamrak and Julie Barnes A translation often is required from one specific electronic encoding of a document to another. For example, an author may wish to translate an article encoded with LaTeX to the same article encoded using Scribe, or using a macro version of the troff family. Different techniques and tools exist for achieving such translations. Techniques include using a pairwise or intermediate-form approach to the translation. Tools include programming languages, code-generating tools, and integrated, code-generating toolsets. A person faced with a translation problem must choose among the various combinations of techniques and tools, with little guidance as to the comparative effort or quality that is achievable with different approaches. In this paper we discuss the complexity of comparing the various approaches. We describe an experiment that we have undertaken to begin to generate some comparative data. Finally, we discuss the potential significance of the experimental data. (From the Introduction.) Keywords and categories: Software Engineering; Data Storage Representations; Text Processing; Computers in Other Systems; format and notation, publishing; data translation, pairwise translation, intermediate-form translation. TR07 Priority Ethernets: Promises and Challenges by Frank Adelstein and Mukesh Singhal This proposal will address the issues related to the delivery of multimedia traffic on local area networks. Multimedia integrates voice and video along with text and images into existing systems. Traditional local area networks, e.g., Ethernet, were designed to handle regular data traffic, which is bursty in nature and for which variable delay is acceptable. Multimedia traffic, on the other hand, requires constant delay in addition to fast (or real-time) delivery. Multimedia traffic also requires large amounts of bandwidth compared to regular traffic.
Multimedia is becoming important in commercial as well as non-commercial organizations, and Ethernet is a widely used local area network among most of these organizations; therefore, modification of Ethernet protocols to integrate multimedia streams is an important issue. In this research, the investigator will develop solutions to the problems associated with the transfer of multimedia data on Ethernet and will study the performance of these solutions and their suitability to multimedia traffic. Specifically, priority protocols will be developed that will enable priority transmission of data on Ethernets. Multimedia data can be assigned priority over regular data, and the developed priority protocols can be used to deliver multimedia data over the Ethernet in a timely manner. The performance of these protocols will be studied using extensive simulation. TR06 Task Scheduling in Heterogeneous Systems in the Presence of Channel Contention by Ravi Prakash & Dhabaleswar K. Panda. This paper investigates some of the communication and scheduling issues in heterogeneous systems. A representative heterogeneous system consisting of a set of computers, each architecturally different and exploiting different kinds of parallelism, is considered. These computers are connected by a high-speed, passive star-coupled photonic network that employs wavelength division multiplexing (WDM). A model scientific computation, exhibiting meta-parallelism, has been studied on this system. Two scheduling strategies have been considered. One strategy involves both subtask data and code migration to the subtask execution site, while the other (the no-code-migration strategy) requires only the data to be transmitted. Simulations of the execution of this computation on our heterogeneous system provide the following insights into the design of such a system: A master-slave model of execution can reduce the amount of communication, and is therefore scalable.
Contention for communication channels makes a significant contribution to the overall task completion time. Channel contention increases with an increase in the degree of parallelism. In the presence of fast disk I/O, the no-code-migration strategy performs better than the subtask-code-migration strategy. A hierarchical communication topology implemented on the passive star-coupled network leads to a considerable reduction in contention and task completion time. These results can be used as guidelines for developing scalable heterogeneous systems. Keywords: channel contention, heterogeneous system, hierarchical interconnection, meta-parallelism, task scheduling, wavelength division multiplexing. TR05 An Optimality Proof for Asynchronous Recovery Algorithms in Distributed Systems by Mukesh Singhal and Friedemann Mattern. We prove the optimality of asynchronous recovery algorithms for distributed systems in the sense that, irrespective of the order of roll-backs by sites, all sites incrementally converge to a unique consistent cut which is the "latest" of all past consistent cuts formed by the local recovery points of the sites. Key Words: Asynchronous recovery algorithms, distributed systems, consistent cut. TR04 On Local Certifiability of Software Components by Bruce W. Weide and Joseph E. Hollingsworth Large software systems, like other large engineered systems, consist of components that are meant to be independent except at their interfaces. An important aspect of any large system is the need for local certifiability: to be able to establish properties of components out of the context of the larger system(s) in which they are embedded, and to be sure that any properties thus certified are certain to hold even when the components are composed with others to build larger ones. This is especially important for "black-box" reusable components, which need to be prequalified for inclusion in a component library.
A good software engineering discipline must support local certifiability of important component properties, or life-cycle costs are inherently doomed to spiral out of control for larger and larger systems. No software engineering discipline in general use today supports local certifiability of most interesting properties. But local certifiability of many important properties - including, crucially, correctness with respect to an abstract specification - is possible in at least one practical discipline [Hollingsworth 92b].

TR03 Compiling for Hierarchical Shared-Memory Multiprocessors by J.D. Martens and D.N. Jayasimha

Latency and synchronization overheads have been identified as two fundamental problems in large-scale multiprocessing. Recently, a number of hierarchical shared-memory multiprocessors (HSMAs) have been proposed to alleviate these problems. These architectures form a bridge between the shared-memory and distributed-memory approaches; for example, HSMA support for consumer-initiated communication is analogous to that of shared-memory multiprocessors because of the presence of a global address space, and HSMA exploitation of data locality is analogous to that of distributed-memory multicomputers because of the presence of an explicitly hierarchical memory. Issues related to compiling for such architectures are investigated in this paper. In particular, three strategies are discussed for parallelizing nested iterative loops with subscripts of the form i +/- k (read "i plus or minus k"), where i is a loop index and k is a constant. The first method uses the abstraction of simple or regular sections to place variables in the hierarchical memory and allows vectorized communication. The second exploits the data dependence graph and in some cases results in less communication than the first. The third is a modification of the second, resulting in a larger number of writes to the hierarchical memory but allowing lower-latency reads.
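As a minimal illustration (not taken from the paper; the function name and data are hypothetical), a loop whose array subscript has the i - k form carries a constant-distance dependence of exactly the kind these strategies must analyze:

```python
# Illustrative sketch of the "i +/- k" subscript form, with k = 1.
# The read of a[i - 1] creates a constant-distance flow dependence;
# a parallelizing compiler can compute this dependence from the
# subscripts and use it to tile the iteration space and place data
# in a hierarchical memory.
def prefix_recurrence(b):
    a = [0] * len(b)
    for i in range(1, len(b)):
        a[i] = a[i - 1] + b[i]  # subscript i - k, with k = 1
    return a
```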
Key words: parallelizing compilers, iteration space tiling, data space partitioning, hierarchical memory, shared-memory multiprocessing, latency, synchronization.

TR02 Global Snapshots of a Distributed System by Ajay Kshemkalyani, Michel Raynal and Mukesh Singhal

The global state of a distributed system is an important issue in the design of distributed systems, and it is important that there be efficient ways of observing it. Unfortunately, the lack of both a globally shared memory and a global clock in a distributed system makes it difficult to observe the global state of the system. In this paper, we present several algorithms to collect a snapshot of a distributed system and compare their features. Keywords: global state, snapshot, distributed system, stable property.

TR01 Designing Large Hierarchical Multiprocessor Systems under Processor, Interconnection, and Packaging Advancements by Debashis Basak and Dhabaleswar K. Panda

In this paper we present a general framework for the architectural design of large hierarchical multiprocessor systems under rapidly changing packaging, processor, and interconnection technologies. Processor boards with larger area (A) and greater pinouts are becoming feasible. Board interconnection technology has advanced from only peripheral connections, O(sqrt(A)), to elastomeric surface connections, O(A). As processor and interconnection technology grows, there is a varying demand on the interconnection network of the system. The proposed framework is capable of taking all these changing technologies into account. Each technology is captured through one or more parameters that reflect its level of advancement. Given a set of values for these parameters, the framework guides us to the optimal topology. The framework is illustrated by considering the design problem of the currently popular k-ary n-cube cluster-c scalable architecture.
These architectures combine the scalability of k-ary n-cube wormhole-routed networks with the cost-effectiveness of processor cluster designs. Each cluster consists of c processors interconnected through a bus, MIN, or direct network. Our results indicate that under surface pinout technology an increase in cluster size (c) is associated with a growth in bisection bandwidth, whereas under periphery pinout it leads to a fall in bisection bandwidth. For systems supporting 16-bit links, cluster sizes of 2 to 3 processors with 3D/4D k-ary n-cube interconnections are optimal. Similarly, systems with 32-bit links can support larger cluster sizes (c = 7, 8) and 3D/4D interconnections. Keywords: packaging constraints, hierarchical systems, clustered systems, pinout constraint, interconnection network, k-ary n-cube systems.
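The supply-demand comparison underlying these results can be sketched numerically. The sketch below uses textbook approximations, not formulas from the report: the bisection of a k-ary n-cube torus with even k is 2*k^(n-1) links, and available pins scale as O(A) for elastomeric surface connections versus O(sqrt(A)) for periphery connections. The function names and constants are illustrative assumptions.

```python
import math

def bisection_links(k, n):
    # Textbook bisection of a k-ary n-cube torus (even k): 2 * k^(n-1) links.
    return 2 * k ** (n - 1)

def board_pinout(area, density, surface=True):
    # Pin supply: proportional to board area A for surface connections,
    # proportional to the perimeter (~ 4 * sqrt(A)) for periphery connections.
    if surface:
        return density * area
    return density * 4 * math.sqrt(area)

def max_channel_width(pins, boundary_links):
    # Widest channel (in bits) the pinout budget can supply to the
    # links crossing a board boundary.
    return pins // boundary_links
```

Under this sketch, pin supply grows faster with board area under surface technology than under periphery technology, which mirrors the report's observation that larger clusters improve bisection bandwidth only under surface pinout.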