The Share Scheduler Revisited
Software Systems Research Group
Department of Computer Science
University of Sydney
The Fair Share Scheduler allocates computing resources between users,
independant of the number of processes.
Succeeding versions have all used the same algorithm,
but implementation details have affected its accuracy.
Changes are described that affect process priority decay,
and group share adjustement.
The Fair Share Scheduler is described in [Kay88],
but basically it consists of a user-level scheduler
that controls the operation of a process-level scheduler.
Users are given `shares' to control their entitlement to resources.
Resource usage is remembered over time,
so that entitlement is reduced for heavy usage,
and increased for light usage,
with respect to other users.
CPU time is scheduled among processes according to their owners' entitlements,
independent of the number of processes each owner possesses.
Simulations of the algorithm demonstrate the fair distribution of the CPU
amongst users with different entitlements and processes,
but the implementation is less than perfect,
primarily brought about by compromises made
when interfacing the algorithm to the existing process-level scheduler.
This paper discusses the changes made to improve the first implementation (Share1)
- the second implementation is called Share2.
The user-level scheduler runs occasionally to calculate
the user-level parameters controlling the process-level scheduler.
Decay the accumulated usage for each user, and add recent resource usage.
Work down the share hierarchy adjusting effective share
so that each group gets its allocated share.
Ensure no user gets too great a share due to recent low usage.
Use the effective share and usage numbers to calculate each user's process priority increment.
The process-level scheduler selects processes to run off the priority queue,
and charges the current process for CPU usage.
It contains the following steps:
Choose next process to run:
the priority queue is sorted by value
- the lower the priority value
- the more frequently the process will be selected.
Adjust priority of current process:
CPU usage by a process incurs a `priority' cost proportional to usage and `nice',
that is added to its priority value and moves it down the priority queue.
Charge user for process resource consumption:
each resource used by the current process adds
a charge to the user to be processed later by the user-level scheduler.
this step is necessary to prevent unbounded growth in the priority value,
and also to prevent new processes monopolising the queue with their low initial values.
The implementation of the `fair-share' algorithm
led to some discrepancies in its operation,
primarily brought about by compromises made
when interfacing the algorithm to the existing low-level scheduler:
The CPU usage of each process was charged by a statistical sampling method:
each `tick' of the clock would charge the whole `tick' to the current process.
Processes which ran synchronised to the clock (eg: by sleep system calls) could
avoid being charged, as also could processes which ran for very short periods.
The actual process priority used in the priority queue was very small
(an effective range of about 32).
The `real' process priority was kept in a larger number,
and normalised into the actual priority,
but even so,
the small number led to processes of users with small differences in share
having equal priority during periods between share scheduler processing.
Process priorities provide points on a linear scale,
with the effective share of a process represented both by its position on the priority queue,
and by how much it moves down the queue after being charged for resource usage.
Process priorities were decayed at a high rate,
much faster than the share `usage',
so that processes belonging to low share
users (with large numbers for priority)
were consequently decayed faster than high-share users,
thereby obtaining a greater share than allocated.
There was no easy way to count the number of `active' processes on the priority queue,
so a `rate' number was kept for each user which represented recent active process activity,
decayed over time.
This `rate' was also used in the group adjustment,
by biasing it depending on the recent share of the group,
in order to change the effective share.
Inter-process sharing was allowed by altering the process decay rate
depending on the `nice' value for each process,
and at the same time charging less for `nicer' processes.
The effects of this mechanism were difficult to predict,
especially the relationship between nice and charge,
which if too extreme would allow a suffiently `nice' process
to obtain a greater share of the machine than intended
(since the charge, and hence the usage of the user, would be incrementing slower).
Root got 100% of the machine whenever it wanted.
This required normal machine management practices to be modified in favour
of running any CPU-bound management tasks as some non-`root' user,
as otherwise `root's processes were capable of hogging the machine completely.
Simulations of the algorithm as implemented
showed that process priority decay adversely affected share [Hog91].
A rapid decay of `usage' also adversely affects share,
but the reason for decaying usage is to forget past share of the machine,
so this is considered an intended `feature', rather than a problem.
Some sites also experienced problems with runaway user `rate' figures under certain conditions.
The delivery of the operating system source code
for a new range of machines with good floating point
provided the impetus for a fresh start.
It was decided that the scheduler should be re-implemented
to benefit from our experience gained with Share 1.
The new CPU architecture allows accurate measurments of the CPU usage of each process,
therefore providing accurate charging even for very small time-slices.
The process priority was changed to be a 64-bit floating point number.
(The old priority field was maintained as for the previous algorithm,
but only for the benefit of programs that needed it, such as `process status'.)
Process priorities for `active' processes are still decayed,
although at a slower rate than previously.
The old scheduler had a process priority half-life of 10 seconds,
the new one uses about 30 seconds.
This still causes a bias of share in favour of low-share users.
If process priorities are not decayed, then they grow without bound.
The problem then becomes one of dealing with new processes
- what priority should a new process have,
if there are active processes on the queue
whose long run time means that have accumulated enormous priorities?
This would also allow a high-share process to monopolise the CPU
for a long period while its priority climbed from an initial low value.
The problem becomes acute when dealing
with a user who starts many short-lived processes.
It is clear that the purpose of the process queue
is to quantise the CPU sharing with a small enough granularity
such that all users get a chance to achieve their share of the CPU
within small intervals of time.
A fast process priority decay seems unavoidable.
It should instead be possible to bias the charge a process incurs for CPU usage
to take account of its effective share
- the smaller the share, the greater the bias, to account for priority decay.
This isn't done.
Instead the bias in favour of small-share users is considered a `feature',
that can be countered if necessary
by decreasing the allocated share for small share users,
`Nice' is now handled by an extra variable that affects how far
down the priority queue a process is moved when it is charged for using the CPU.
The `nicer' the process, the further down the queue it is moved.
In addition, it is now possible for a user to distribute share between
processes without affecting overall share (at the same resource charge),
as well as giving share away to other users in return for being charged less.
The decrease in share for each increment of `niceness' is logarithmic,
whereas the decrease in resource charge is linear.
The adjustment of share down the hierarchy of groups and users now
uses a concept of `active' users.
The whole share of a group is divided between those users
that were active in the last share scheduler period, rather
than all users, allowing groups with a single active user to
maintain the group's share of the resources.
There are problems with this method, in that
a group consisting of short-lived active users
will get a greater share of the machine than one
with a single long-running user, since the `usage'
of each new user will be small compared with the long-running user.
There needs to be a stronger mechanism for maintaining group share,
that is invariant with type of user.
The super-user `root' is now shared,
with `root's share of the machine settable as a percentage at run time,
independant of the shares of active users.
Two problems still remain:
Process priority decay biases the alorithm in favour of small-share users.
This bias is quite small, and would seem to be acceptable as a `feature',
rather than calling it a `bug'.
If necessary, small share users can be given a smaller number of shares to take account of the bias.
Futher work is necessary to determine if the algorithm can be made to offset this bias,
perhaps by increasing the per-user process priority increment for small-share users.
The group adjustment is still susceptible
to variations in resource consumption behaviour between groups
if the decay in per-user `usage' is short
compared to the average length of time a user is active.
For instance, if one group has steady users active for long periods,
while another exhibits a non-repeating series of short-duration consumers,
the current implementation gives a greater share to the latter group.
A benchmark program is needed to test the implementation and compare changes.
D. Hogan ,
`Analysis and Simulation of SHARE'
Software Systems Research Group, 91/9/18.1 (1991).
J. Kay and P. Lauder ,
`A Fair Share Scheduler',
Communications of the ACM, 31(1):44-55 (1988).
- Author's address:
|Department of Computer Science|
|University of Sydney|
|NSW 2006, Australia|