Budapest, Hungary, 17-19 September 2007
©EDA Publishing/THERMINIC 2007 ISBN: 978-2-35500-002-7
Reducing Average and Peak Temperatures of VLSI
CMOS Digital Circuits by Means of Heuristic
Scheduling Algorithm
Wladyslaw Szczesniak
Faculty of Electronics, Telecommunications and Informatics
Gdansk University of Technology, 80-952 Gdansk, Poland
Abstract: This paper presents the BPD (Balanced Power Dissipation) heuristic scheduling algorithm, applied to VLSI CMOS digital circuits/systems in order to reduce the global computational demand and to balance the power dissipation of the computational units of the designed digital VLSI CMOS system during the task assignment stage. This results in a reduction of the average and peak temperatures of VLSI CMOS digital circuits. The elaborated algorithm is based on balanced power dissipation of local computational (processing) units and does not deteriorate the throughput of the whole VLSI CMOS digital system.
I. INTRODUCTION
This paper concerns the application of a heuristic scheduling algorithm that balances the load of tasks assigned to computation units (cu_i) while reducing the total cost (co) of the digital VLSI CMOS system, but without deteriorating its global computational efficiency measured e.g. by its throughput (Th). In this paper the universal measure named the cost (co) represents the power supply consumption of the VLSI CMOS digital system.
In the literature we can find a variety of methods concerning computational task assignment to different computation units (cu_i) [1], [2], [4], [5], [11]. Such assignment is especially important for reducing the supply power demand [6], [7], [8], [9]. In this paper the design objective function taken into account is the cost of the system (co). This measure has a straightforward influence on the average and peak temperature of the IC.
The presented BPD (Balanced Power Dissipation) algorithm can be applied to reduce the global computational demand, and it provides balanced power dissipation of the digital VLSI CMOS system at the task assignment stage.
II. PROBLEM FORMULATION
The task to be computed is described by the task graph G_T(V, E) as presented in Fig. 1. An individual computational task (ct) in graph G_T is represented by a vertex v_i in the set of vertices (the resource types are marked by symbols in Fig. 1). Each ct has to be assigned to one of the computational units of a given resource type
Fig. 1. An example of a computational task represented by the task graph G_T (where a, b, c, d, e, f, g, k, m represent the sets of input data), annotated with p-labels; the number of discrete time units dtu for each cu type (2, 1 and 1 dtu, respectively) is given in the rectangle.
cu_t_j ∈ {cu_t_j} in the proper discrete time (dtu). The value |{cu_t_j}| is equal to the number of computational units of j-type available during the design process, e.g. two types for the graph in Fig. 1. Each edge e_ij ∈ E represents a data set. The process of assigning v_i ∈ V onto cu_t_j is constrained by a limited set of resources ({cu_t_1} ∪ ... ∪ {cu_t_j} ∪ ...) and the maximal number of time slots dtu_M.
Each type of processing unit is capable of processing a specified type of computational task. The problem of assigning tasks to processors as specified here is NP-complete [3].
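The problem data of Section II can be sketched in code. The following Python fragment is an illustrative model only (the class and variable names are assumptions, not taken from the paper): a task graph G_T with precedence edges, a per-task resource type and dtu count, and the limited pool of unit types.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the Section II problem data: a task graph
# G_T(V, E) whose vertices are computational tasks (ct) and whose
# edges carry data sets, plus a limited pool of cu types.

@dataclass
class Task:
    name: str
    cu_type: str          # resource type this task must run on
    n_dtu: int            # n_i: discrete time units needed on that type
    preds: list = field(default_factory=list)  # predecessors (edges e_ij)

# A tiny instance in the spirit of Fig. 1 (names and values illustrative).
tasks = {
    "h": Task("h", "A", 1),
    "i": Task("i", "B", 2, preds=["h"]),
    "j": Task("j", "A", 1, preds=["h"]),
    "q": Task("q", "C", 2, preds=["i", "j"]),
}

# Resource constraint: number of available units per type, |{cu_t_j}|.
available_units = {"A": 2, "B": 1, "C": 1}
dtu_M = 10  # maximal number of time slots

# Any feasible schedule must respect the precedence edges; a simple
# depth-first topological order makes that constraint explicit.
def topological_order(tasks):
    order, seen = [], set()
    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for p in tasks[name].preds:
            visit(p)
        order.append(name)
    for name in tasks:
        visit(name)
    return order

print(topological_order(tasks))   # ['h', 'i', 'j', 'q']
```

The dtu_M bound and the per-type unit counts are the two resource constraints the BPD algorithm has to respect while slowing units down.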
III. THE BPD ALGORITHM
While assigning ct_i to cu_t_j, the BPD algorithm takes the different values of the throughput (Th) of each cu_t_j into consideration. The function describing the normalized throughput (Thn_i = Th_i / min_j Th_j) of different computational units versus their costs is shown in Fig. 2. The main loop of the BPD algorithm is given in the following listing:
3.  foreach ( C_k ∈ C_su )
4.    if ( !fsc_fulfiled( C_k ) )
5.      C_su = C_su \ C_k
6.  if ( |C_su| == 0 )
7.    break
8.  if ( |C_su| > 1 )
9.    foreach ( C_k ∈ C_su )
10.     if ( there_is_same_cu( C_k ) )
11.       C_l = row_of_the_same_cu( C_k )
12.       foreach ( v_i ∈ C_k )
13.         if ( fvi( v_j ) > fvi( v_i ) )
14.           interchange( v_i, v_j )
15.   C_ms = argmax_{C_k ∈ C_su} fck( C_k )
16. else
17.   C_ms = HEAD( C_su )
18. SG_backup = SG
19. foreach ( v_i ∈ C_ms )
20.   insert_idle_tasks( v_i )
21.   if ( !delay_all_successors( v_i ) )
22.     SG = SG_backup
23.     C_su = C_su \ C_ms
24.     break
fsc_fulfiled( C_k ) (line 4)
This function performs the check for the free space condition (fsc), defined by the following formula:

    n_i · l_k ≤ r_o    (1)

where n_i is the number of dtu needed to perform the task, l_k is the number of tasks assigned to the C_k computational unit row, and r_o is the number of free task slots after the first occurrence of a task in the C_k computational unit row.
Formula (1) checks whether the number of free dtu slots is sufficient for the idle tasks that introduce the longer processing time of a cu_i. Every C_k selected for the increased number of dtu has to fulfil condition (1). Despite the fact that cascading tasks from the other C_k's are not taken into consideration while calculating fsc, condition (1) is sufficient to quickly pre-reject some C_k from the C_su set before starting the time-consuming delaying process.
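The pre-rejection test can be sketched in a few lines. This is a minimal illustration assuming the reconstructed form of condition (1), n_i · l_k ≤ r_o; the function name mirrors the listing, the argument values are made up.

```python
# Free-space condition (1), assuming the reconstructed reading
# n_i * l_k <= r_o: slowing each of the l_k tasks in row C_k by n_i
# extra dtu must fit into the r_o free slots of that row.

def fsc_fulfiled(n_i: int, l_k: int, r_o: int) -> bool:
    """True if row C_k has enough free dtu slots for the idle tasks."""
    return n_i * l_k <= r_o

# A row with 3 tasks, each needing 2 extra dtu, and 7 free slots passes:
print(fsc_fulfiled(2, 3, 7))   # True
# With only 5 free slots the row is pre-rejected from C_su:
print(fsc_fulfiled(2, 3, 5))   # False
```

Because the test is a single comparison, it is far cheaper than the delaying process it guards, which is exactly why it is applied first.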
there_is_same_cu( C_k ) (line 10)
This function simply indicates whether there is another cu_i of exactly the same type as the one assigned to C_k, i.e. capable of performing the same type of task in the same time.

Fig. 2. Normalized throughput (Thn) of different computational units versus their cost (co_i).

row_of_the_same_cu( C_k ) (line 11)
This function returns a row containing the tasks of exactly the same type as C_k.
fvi( v_i ) (line 13)
This function calculates the f_vi factor for the v_i task according to the following formula:

    f_vi = ( i_vi + o_vi + f_avi − s_vi − n_i ) / ( p_i + 1 )    (2)

where i_vi is the number of independent inputs of task v_i, o_vi is the number of system outputs of task v_i, f_avi = dtu_M − (cs_e + n_i), dtu_M is the maximal number of dtu admissible, cs_e is the number of the dtu which v_i is assigned to, s_vi is the number of tasks of the same type as task v_i in the path of G_T below task v_i, n_i is the number of dtu needed to perform the task, and p_i is the p-label of task v_i. The minimal p-label of a task equals 0, hence the addition of 1 in the denominator is necessary to avoid dividing by 0.
The f_vi value of a task indicates its suitability for being slowed down. When there is more than one row of the same type, it is used to create the best interconnected rows by interchanging tasks.
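As a worked illustration of the suitability factor, the sketch below evaluates f_vi for one task. The operators between the numerator terms are hard to recover from the source, so the code assumes the reading i_vi + o_vi + f_avi − s_vi − n_i over p_i + 1 and should be treated as illustrative, not as the paper's exact formula.

```python
# Suitability factor f_vi of formula (2), under the assumed reading
# (i_vi + o_vi + f_avi - s_vi - n_i) / (p_i + 1).

def f_avi(dtu_M: int, cs_e: int, n_i: int) -> int:
    # Slack left for task v_i before the dtu_M limit is reached.
    return dtu_M - (cs_e + n_i)

def fvi(i_vi, o_vi, s_vi, n_i, p_i, dtu_M, cs_e):
    return (i_vi + o_vi + f_avi(dtu_M, cs_e, n_i) - s_vi - n_i) / (p_i + 1)

# A task with plenty of slack, no same-type tasks below it and a minimal
# p-label scores high, i.e. it is a good candidate for slowing down:
print(fvi(i_vi=2, o_vi=1, s_vi=0, n_i=1, p_i=0, dtu_M=10, cs_e=3))  # 8.0
```

Larger p_i (a task deeper on a long path) shrinks f_vi, which matches the intuition that tasks on critical paths should not be slowed down.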
interchange( v_i, v_j ) (line 14)
This function swaps the v_i and v_j tasks, so that v_i is located in the task slots earlier occupied by v_j and vice versa.
fck( C_k ) (line 15)
The value of the function is given by:

    f_ck = P_dCk^n · l_Ck    (3)

where P_dCk^n is the normalized computational load of the C_k computational unit row (the cu assigned to row C_k; P_dCk^n is normalized to the cu_i having the lowest value of P_di), and l_Ck is the number of tasks in the C_k computational unit row.
The fck function is responsible for selecting from the C_su set the row (C_ms) most suitable for inserting idle tasks. It chooses the row assigned to the cu_t_i that has the highest throughput demand; hence it gives the highest throughput demand reduction when the processing element assigned to row C_k is slowed down.
insert_idle_tasks( v_i ) (line 20)
This function simply adds new task slots with idle tasks after the v_i task. If there is an empty task slot after the last dtu occupied by v_i, then an idle task is added there. However, when there is no empty room for a new idle task, the next task in the row of v_i is delayed. Next, the data interconnections between v_i and its successor tasks must be checked. This is done by the delay_all_successors function described below.
delay_all_successors( v_i ) (line 21)
This function checks whether all the data needed to perform each successor task (s_i) of v_i are available on time, by checking the condition:

    end_dtu( v_i ) ≤ start_step( s_i )    (4)

If it is not fulfilled, then the successor is delayed by as many dtu as needed, so that start_step( s_i ) = end_dtu( v_i ).
Such a delay implies the need for checking all the sets of data and interconnections between the successors of s_i. If the delay is not possible due to the dtu_M constraint, increasing the number of dtu of v_i (and of the computational unit it is assigned to) fails.
In such a case the row containing v_i, i.e. C_ms, is removed from the C_su set, and the process starts from the beginning with the decreased |C_su|.
For the simple benchmark shown in Fig. 1 the results are presented in detail. The obtained results are shown in the form of scheduling graphs SG: Fig. 3 presents the first stage of the algorithm, while the second stage is given in Fig. 4.

Fig. 3. Results (represented as the scheduling graph SG) of the assignment of the computational task represented by graph G_T in Fig. 1 for the given resource set, obtained after the first stage of the BPD algorithm.
In Fig. 4, lowering the cost of the appropriate computational units is represented by an inserted marker symbol. It means that the unit's throughput can be twice as low without deteriorating the efficiency of the whole computational system. In our example, the computational units cu_a_2 and cu_c show that their throughput can be lowered twice, resulting in a cost reduction of the designed computational system. Moreover, the value of the throughput obtained earlier does not deteriorate.
IV. EXPERIMENTAL RESULTS
This section presents the results obtained by applying the BPD algorithm to selected benchmarks [12].
Cost reduction is calculated for each computational task based on the number of computational units of each type before and after application of the BPD algorithm. The cost of each computational unit type assumed for the cost calculation is directly proportional to its throughput. To simplify the comparison of computational efficiency we assume that Th_i can be lowered by the factor 0.5. Table II presents the assumed normalized cost due to the throughput of each computational unit type.
TABLE II
THE ASSUMED NORMALIZED COST DUE TO THE THROUGHPUT OF EACH
COMPUTATIONAL UNIT TYPE (identical for cu_a, cu_b, cu_c, cu_d, cu_e)

Th_i : 8 4 2 1
co   : 8 4 2 1
TABLE III
THE ASSUMED NORMALIZED POWER DISSIPATION DUE TO THE THROUGHPUT OF
EACH COMPUTATIONAL UNIT TYPE (identical for cu_a, cu_b, cu_c, cu_d, cu_e)

Th_i : 8 4 2 1
P_di : 8 4 2 1
Tables IV and V and Fig. 5 show the number of computational units of each type before and after applying the BPD algorithm, respectively.
TABLE IV
NUMBER OF COMPUTATIONAL UNITS OF EACH TYPE BEFORE APPLYING THE BPD
ALGORITHM

         cu_a  cu_b  cu_c  cu_d  cu_e
s298       17     5    10     7    31
c5315     158    56    68     7   169
s382        4     9     7     6    30
s444        4    17     7     8    30
s526       32     9    14    14    38
s5378      47               119   333
TABLE V
NUMBER OF COMPUTATIONAL UNITS OF EACH TYPE AFTER APPLYING THE BPD
ALGORITHM

        cu_a            cu_b            cu_c
Th_i    8 4 2 1         8 4 2 1         8 4 2 1
s298    3 11 3 2 3 3 2 5
c5315   100 12 8 38 53 3 62 2 4
s382    3 1 7 1 1 7
s444    2 2 11 6 2 1 1 3
s526    6 18 8 3 4 2 1 10 3
s5378   28 18 1

        cu_d            cu_e
Th_i    8 4 2 1         8 4 2 1
s298    3 3 1 23 2 5 1
c5315   7 130 14 3 22
s382    5 1 13 9 5 3
s444    8 18 7 5
s526    7 6 1 9 22 6 1
s5378   93 1 4 21 285 20 4 24
Fig. 4. Results (represented as the scheduling graph SG) of the assignment of the computational task represented by graph G_T in Fig. 1 for the given resource set, obtained after the second stage of the BPD algorithm (the inserted marker symbol denotes decreasing the computational efficiency of the chosen computational unit).
Fig. 5. Percentage of computational units of each type after applying the BPD algorithm for s298 (a), c5315 (b), s382 (c), s444 (d), s526 (e), s5378 (f).
TABLE VI
EXPERIMENTAL RESULTS

Benchmark name   Number of graph vertices   Cost reduction [%]
s298                119                       31.25
c5315              1994                       17.66
s382                158                       23.44
s444                181                       26.52
s526                193                       42.87
s5378              2779                       13.15
The resulting cost reduction, together with the number of G_T graph vertices of each benchmark computational task, is reported in Table VI.
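The cost-reduction figures of Table VI follow from summing the normalized unit costs before and after BPD. The sketch below is a hedged illustration of that bookkeeping (the counts used are made up, not taken from Tables IV and V); per Table II, a unit's normalized cost co equals its throughput level.

```python
# Hedged sketch of the cost-reduction computation behind Table VI:
# total cost = sum over units of the normalized cost co, which by
# Table II equals the unit's throughput level Th_i.

def total_cost(unit_counts: dict) -> int:
    """unit_counts maps throughput level Th_i -> number of units."""
    return sum(th * n for th, n in unit_counts.items())

before = {8: 10}                 # ten units, all at the highest throughput
after = {8: 4, 4: 8, 2: 6}       # some units slowed down by BPD

reduction = 100 * (1 - total_cost(after) / total_cost(before))
print(f"{reduction:.2f}% cost reduction")   # prints: 5.00% cost reduction
```

Applying the same computation to the real unit counts of Tables IV and V yields the 13.15 to 42.87 per cent reductions reported in Table VI.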
V. CONCLUSIONS
In this paper the BPD heuristic scheduling algorithm for load-balanced power dissipation, resulting in a reduction of the average and peak temperatures of digital VLSI CMOS circuits/systems, was presented. The objective function introduced is measured by the cost reduction of VLSI CMOS digital circuits, which directly depends on the power dissipated in the IC. The main idea of the BPD algorithm is to decrease the cost of chosen computational units by adjusting their efficiency to real needs without deteriorating the computational efficiency of the whole system.
The applied BPD algorithm has been verified on the chosen set of benchmarks. Experimental results demonstrated a 13 to 43 per cent cost reduction of the computing system, achieved without deterioration of the system throughput under the assumed cost-to-throughput dependency. This reduction has a straightforward influence on decreasing the average and peak temperatures of a VLSI CMOS system and results in increasing its reliability.
ACKNOWLEDGMENT
The work was partially supported by the State Committee
for Scientific Research (MNiI, Poland) through the grant
3 T11B 015 27.
REFERENCES
[1] L. Benini, A. Bogliolo, and G. De Micheli, "A survey of design techniques for system-level dynamic power management," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 8, 2000, pp. 287-298.
[2] S. Borkar, "Low power design challenges for the decade," Proc. Asia and South Pacific Design Automation Conf. (ASP-DAC), 2001, pp. 293-296.
[3] J. Błażewicz, K. Ecker, E. Pesch, G. Schmidt, and J. Węglarz, Scheduling Computer and Manufacturing Processes, Springer, Heidelberg, 1996.
[4] P. Cappello and D. Mourloukos, "A scalable, robust network for parallel computing," Proceedings of the 2001 Joint ACM-ISCOPE Conference on Java Grande, pp. 78-86, June 2001.
[5] V. Karamcheti and A. A. Chien, "A hierarchical load-balancing framework for dynamic multithreaded computations," Proceedings of the 1998 ACM/IEEE Conference on Supercomputing, 1998 (CD-ROM).
[6] L. Kruse et al., "Estimation of lower and upper bounds on the power consumption from scheduled data flow graphs," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 9, no. 1, 2001, pp. 3-14.
[7] W.-Ch. Kwon and T. Kim, "Low-power embedded system design: Optimal voltage allocation techniques for dynamically variable voltage processors," Proceedings of the 40th Design Automation Conference, pp. 125-130, June 2003.
[8] P. Michel, U. Lauther, and P. Duzy, The Synthesis Approach to Digital System Design, Kluwer Academic Publishers, Dordrecht, 1992.
[9] W. Szcześniak, B. Voss, M. Theisen, J. Becker, and M. Glesner, "Influence of high-level synthesis on average and peak temperatures of CMOS circuits," Microelectronics Journal, vol. 32, 2001, pp. 855-862.
[10] W. Szcześniak and P. Szcześniak, "Low power digital CMOS VLSI circuit design with different heuristic algorithms," Proceedings of the 2nd IEEE Int. Conference on Circuits and Systems for Communications, ICCSC 2004 (CD-ROM).
[11] Y.-J. Yeh, S.-Y. Kuo, and J.-Y. Jou, "Converter-free multiple-voltage scaling techniques for low-power CMOS digital design," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 20, 2001, pp. 172-176.
[12] Collaborative Benchmarking Laboratory, ftp.cbl.ncsu.edu