4LP - LOW LEVEL LANGUAGE FOR THE LINE PROCESSOR - SYMPATI 2 -

Pascal ADAM - Didier JUVIN - Hassan ESSAFI
Pascal FERNANDEZ - Jean-Luc BASILLE

CEA SACLAY - D.LETI/DEIN/SIR 91191 GIF/YVETTE FRANCE
Tél : (1) 69.08.56.26

Laboratoire CERFIA - 50A Chemin des maraîchers 31077 TOULOUSE FRANCE
Tél : 61.52.13.52.

ABSTRACT

Image processing can take advantage of line processor structures. We have conceived a language that makes it possible to program a line processor in a way that eliminates many of the problems previously found when using the non-conventional structures.

We shall briefly remind the line processor concept and give some details about the structure our two laboratories are developing in a collaboration. Then we shall present the programming language facilities and illustrate these facilities with some examples. Finally some performances from simulated results will show processing time. This machine with up to 128 Processing Elements (P.E) has a theoretical power of 10 Mips per P.E.

RESUME

Les structures de processeur ligne sont particulièrement bien adaptées au traitement d'images. Aussi nous présentons dans ce papier un langage de bas niveau permettant de programmer le processeur ligne SYMPATI 2.

Ce langage élimine la plupart des problèmes de programmation rencontrés en général sur les calculateurs parallèles.

Nous rappellerons la structure de SYMPATI 2 calculateur SIMD développé conjointement par le CEA/D.LETI et le CERFIA Toulouse, avant d'introduire le langage 4LP et enfin nous présenterons les temps de calcul obtenus pour différents algorithmes.

Ce calculateur pouvant disposer de 16 à 128 processeurs élémentaires a une puissance théorique de 10 Mips par P.E.

1. THE LINE PROCESSOR STRUCTURE

Parallel structures may be considered as VON NEUMANN variations when duplicating some part of the so-called Von Neumann structure (BAS185). Among these different structures that might be involved in an image processing system, the line processor concept is a good choice particular for pixel-level processing when considering the iconic domain, that is extracting features from the images (BAS186).

A line processor is a kind of processor array, a broader generic term that could include pyramids as well as the more traditional bidimensional processor arrays.

In accordance with the two fundamental criteria, efficiency and cost, we have good reason to believe that the line processor concept is a satisfactory middle road, particularly if we wish to design a image processor not only at a reasonable price, but also compatible with a personal computer configuration (BAS186).

A line processor exactly matches the linear structure as shown on figure 1. It also has certain characteristics that provide more possibilities than a simple linear structure would offer (BAS187).

![Figure 1: The linear structure](image)

Let us present the main characteristics available in our system.

(i). We have adopted the helicoidal scheme for organizing the data in the processing element's memory.

According to the addressing system taking place in each processing element, it makes it possible to access, either by row or by column, to any N pixels segment, where N is the number of processing elements involved. This data organization preserves the topological properties of the image. So, two connected pixels of the 4 adjacency will belong to connected PE. This property permit the line processor to work either by row, either by column without any access conflict (two P.E will never require a value belonging to the same memory bank).
(ii). The processor to processor connections makes it possible to access directly in one cycle to any point of the 3 x 3 neighborhood when a processing element processes a pixel of a row or a column segment. Furthermore it provides fast access to larger neighborhoods up to 31 x 31.

Each processing element comprises two parts:
- addressing,
- processing.

The addressing system transforms the relative Euclidean coordinates of the row/column segment to be processed, into memory addresses for the considered pixels. So the data organization is transparent to the user. This addressing system also includes a masking module, which is very useful when only a window of the image is to be processed.

The processing part is more classical. It consists of a 16 bit wide ALU, a multiplier, a shifting module, a 16 bit registers Scratch-pad and an indicator set for the conditional sequences.

The different data paths make it possible for each processing element to access data in its neighbor's memory up to a distance 2 , or in the neighbor’s scratch-pad, up to a distance of 3, right or left.

An independent data path is reserved for input/output operations.

2. DESCRIPTION OF THE PROGRAMMING ENVIRONMENT

One major difficulty for user with the non-conventional structure is the way to program it. Let us present now the language that we have conceived in order to eliminate such difficulties.

4LP is a low level language developed for the command unit with its own instruction set, some instructions are directly executable by the processing elements, others are executable by the command unit itself for the parallelism management.

Due to the SIMD structure 4LP is also a highly parallel language. Furthermore it makes it possible to focus one's attention on the job to be processed on one pixel. This job will be automatically processed for all the pixels of the considered window, in any of the scanning modes presented below (figure 2 - 3).

3. PRESENTATION OF 4LP

The basic program structure is the following one:

```plaintext
PROG  Name; /* comments BLOC structure
.
Instructions;
.
BLOC 1  Instructions;
.
Instructions;
.
BLOC n  Instructions;
.
Instructions;
.
STOP;  LEND;
```

A 4LP program consists of PROG followed by a name and a semi-colon, lists of instructions, BLOCS, and a STOP directive followed by a semi-colon.

One BLOC consists of a LEXT (EXTERNAL LOOP) directive followed by an instruction list, a LINT (INTERNAL LOOP) directive an other instruction list and finally a LEND (END of LOOP) directive (figure 4).

LEXT, LINT are the command unit directives for the external and internal scanning loop start control, respectively.

LEND is the command unit directive for the end of LEXT and LINT loops (figure 4).

The number of blocks is unlimited. Initialization of the following parameters (images, windows, scanning, ...) can't be modified inside a micro-program.

THE DIFFERENT KINDS OF INSTRUCTIONS

. ELEMTARY INSTRUCTION
. MCP : NO OPERATION
. BRANCH INSTRUCTION : GOTO (COND) LABEL
. CONDITIONAL INSTRUCTION

These instructions are all executed in a cycle of 120 ns.

THE ELEMTARY INSTRUCTION

It consists of one destination, an ALU function, 0,1 or 2 operands, an optional indicator set up.

The functions are the usual logic and arithmetic ones, as well as the multiplication.

There are 4 binary indicators that may memorize one ALU flag.

The principal destinations are the following ones:

- A PE's personal scratch pad register.
- SD,SP,S7.SP (8 bit registers)
- D0,SP,D3.SP (16 bit registers)
- XD,SP,X1.SP (16 bit)
- A PE's personal memory.
- A PE's special register for the addressing part.
- The ALU latch.
The principal operands are following ones:

- A scratch pad register belonging to either the PE, either one of its neighbors at a distance 1, 2, 3.
  . S0.SP -- PE's scratch-pad register.
  . S0.Li -- left neighbor PE, distance i = 1,2,3.
  . S0.Ri -- right neighbor PE, distance i = 1,2,3.
- A memory belonging to any PE, at a distance right or left up to 2.
- A constant value given by its address in the Command Unit memory.
- The ALU Latch

**Memory description**

Each PE may only write in its own memory. Depending on the memory organization (helicoidal or tabulated mode), the syntax is different.

1.) Tabulated mode :

Each PE has a special register named INDEX which contains the address of the memory value in a tabulated memory organization

EX : S0.SP = TAB_i the memory value at the address pointed by INDEX is loaded in the S0.SP register.

2.) Helicoidal mode :

In this mode, the segment propagation is managed by the command unit (LEXT and LIND instructions). For each segment position, we access to any pixel by giving its relative coordinates i, j.

A PE can read any pixel value (except that i, j are limited to (+ 31)). Due to the architecture, a direct memory transfer is possible between two PE if the distance is +1 or +2. This is very useful because most of the time the PE asking for a pixel value is not the one which belongs it.

*Example of an elementary instruction :

S0.SP, L = AND M1.0 :-3, +3), L IO = C ;
S0.SP and the Latch (L) receive the result of the AND function between the LATCH value and the pixel (-3, 3) of the image number 0 window type 1. The indicator IO is set according to the Carry ALU flag.

This instruction is executed by all of the PE at the same time in 120 ns.

The BRANCH Instruction GOTO (cond) LABEL
(1) Branch when all PE's indicator selected are set up.
(2) Branch when any PE's indicator selected is set up.
(3) Unconditional branch.
4. ALGORITHMS AND RESULTS

The prototype of the machine will be available by June 1989. We already have the complete development environment including the compiler, desassembler, debugger and the simulator giving the exact computing time that will be required on the machine.

The algorithms, which results are presented now, have been simulated.

Results on 256x256x8 images.

<table>
<thead>
<tr>
<th>Algorithm</th>
<th>32PE</th>
<th>128PE</th>
</tr>
</thead>
<tbody>
<tr>
<td>Thresholding</td>
<td>1,2 ms</td>
<td>0,4 ms</td>
</tr>
<tr>
<td>Local variance</td>
<td>14,5 ms</td>
<td>3,6 ms</td>
</tr>
<tr>
<td>Convolution 3x3</td>
<td>5,3 ms</td>
<td>1,3 ms</td>
</tr>
<tr>
<td>Local minimum</td>
<td>4,5 ms</td>
<td>1,2 ms</td>
</tr>
<tr>
<td>Motion Extractor</td>
<td>9,2 ms</td>
<td>2,2 ms</td>
</tr>
<tr>
<td>Reduction 256^2 -&gt; 128^2</td>
<td>7 ms</td>
<td>6,1 ms</td>
</tr>
<tr>
<td>Axial Symmetric (horizontal axe)</td>
<td>10,8 ms</td>
<td>9,3 ms</td>
</tr>
</tbody>
</table>

5. CONCLUSION

We tried to show that 4LP is relevant for the pixel level processing. It provides an easy way for programming any algorithm operating on a window up to 31x31 without taking care of the data organisation.

The results obtained as for now lead to a very favorable efficient cost/ratio.

We reach the Quality factor of 6 for the Abingdon Cross Benchmark (Preston Kundall - CARNegie MELlon University) this is the best evaluated class for existing multi-processor machine.

We used YACC and LEX UNIX compiler development tools to write the 4LP compiler (kern86).

6. REFERENCES

- (BASIL87) BASILLE J.L., JUVIN D., ESSAFI H., LATIL J.Y. - Caractéristiques Architecturales d'un processeur ligne pour le traitement d'images, 6ème Congrès AFNET RFIA, Antibes 16-20 novembre 1987.