Hyper-heuristics Using Reinforcement Learning for the Examination Timetabling Problem
Abstract. Selection hyper-heuristics are general-purpose search methods that control a set of low-level heuristics and have been successfully applied to various problem domains. A key to designing an effective selection hyper-heuristic is finding the right combination of heuristic selection and move acceptance methods, which are invoked successively within an iterative single-point-based search framework. The examination timetabling problem is a well-known, challenging real-world problem faced recurrently by many educational institutions across the world. In this study, we investigate various reinforcement learning techniques for heuristic selection embedded into a selection hyper-heuristic using simulated annealing with reheating for examination timetabling. Reinforcement learning maintains a utility score for each low-level heuristic. At each iteration, a heuristic is selected based on those adaptively updated utility scores and applied to the solution in hand with the goal of improvement. All selection hyper-heuristics using the different reinforcement learning schemes are tested on the examination timetabling benchmark of ITC 2007. The results show that ε-decay-greedy reinforcement learning, which chooses the low-level heuristic with the maximum utility score and otherwise, with a decaying probability, chooses a random low-level heuristic, performs best. Although the proposed tuned approach does not perform as well as the state-of-the-art, it delivers better performance than some existing hyper-heuristics.
Keywords: reinforcement learning, hyper-heuristic, exam timetabling
1 Introduction
The search for suitable solutions to examination timetabling problems requires a combination of practical and experiment-based techniques [1]. The main focus of current hyper-heuristic research is the design and adaptation of heuristic methods to automate optimisation for hard computational search problems. Over the last few decades, a great deal of research has been conducted on the application of hyper-heuristic techniques to examination timetabling problems, and a number of review papers highlighting these efforts have been published [2], [3]. Experimental results generated by various techniques in the literature have been reported, focusing on success measures including the generality of the solver and its time cost [2].
For a given examination timetabling problem, whether a technique can generate a feasible, workable solution within a practical time limit is an important factor in assessing its success. An examination scheduling track, based on the post-enrolment examination timetabling problem, was introduced in the second International Timetabling Competition (ITC2007) [4]. This track introduced 12 real-world datasets drawn from anonymised data from several institutions worldwide, which are used to test the solver designed in this paper.
The process of solving the examination timetabling problem commonly conforms to two main stages: an initial construction stage and a subsequent improvement stage. In the first stage, a feasible solution is constructed. Along with feasibility, some attention is given to the relative quality of the solution, to benefit the improvement stage, which is more computationally time-consuming. In the second stage, the solution undergoes a series of improvement transformations until the process terminates; here, this occurs when the time limit is reached. The improvement process outlined in this paper uses a hyper-heuristic search methodology to search for higher-quality solutions with respect to the given objectives [6].
In the literature, research on the application of hyper-heuristic techniques to examination timetabling problems has demonstrated their potential effectiveness in this area. Some research has focused on combining reinforcement learning with hyper-heuristics, yielding competitive results which suggest that reinforcement learning can improve the performance of hyper-heuristics in terms of solution quality. Reinforcement learning has also been effective in solving resource allocation problems. In this work, rather than applying reinforcement learning directly to examination timetabling, reinforcement learning algorithms are combined with hyper-heuristics within the improvement process, since learning directly on the problem could be rather time- and resource-consuming due to its complexity. This is also an interesting way to explore how the generality of the solution process can be improved using reinforcement learning, and to assess how combining reinforcement learning algorithms with hyper-heuristics may widen the application of solvers to other types of resource allocation problems.
In the approach described here, the initial feasible solution is created using squeaky wheel construction, with hyper-heuristics employed for the optimisation and improvement phase. The hyper-heuristic method consists of two levels: the higher-level search over the pool of low-level heuristics, and the low-level heuristic search within the problem domain. At the higher level, a Simulated Annealing (SA) method is combined with different reinforcement learning methods to enable intelligent selection of the low-level heuristics; in each iteration, information is gathered to record the cumulative performance of the low-level heuristics. The selection methods inspired by reinforcement learning are three different greedy methods and the softmax method, which are compared with a simple random method to analyse performance differences. For the low-level heuristics, we designed six small-move heuristics, three large-move heuristics and two directed heuristics.
The remainder of the paper is organised as follows: Section 2 briefly describes the background of this paper, the examination timetabling problem and the hyper-heuristic method. Section 3 describes the squeaky wheel constructor and our RL-SA hyper-heuristic method. Section 4 describes the experimental environment and the time parameters used during experimentation. Section 5 presents and discusses the results, and Section 6 concludes the paper with a brief discussion on the effectiveness of the technique and potential future research areas.
2 Background
2.1 The Examination Timetabling Problem
Examination timetabling is one of the academic timetabling problems and has been proven to be NP-hard [7]. Solving an examination timetabling problem requires allocating the given exams to suitable timeslots and rooms under both hard and soft constraints. All hard constraints must be satisfied for a solution to be feasible, while the degree of satisfaction of the soft constraints determines the quality of a feasible solution. The solving process therefore consists of two main parts: first, the generation of a feasible solution that meets all the hard constraints; second, minimisation of the current solution's soft-constraint violations while maintaining feasibility. Moreover, the quality of feasible timetables can be calculated numerically from the satisfaction level of the soft constraints, which helps in the evaluation of the corresponding algorithms. In examination timetabling problems, all student enrolment data is generally known before the scheduling process, which allows accurate use of the available resources during the examination period.
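Concretely, for a feasible timetable the evaluation reduces to a weighted count of soft-constraint violations. The sketch below illustrates this; the constraint names and weights are hypothetical examples, not the actual benchmark weights:

```python
# Illustrative sketch: for feasible timetables, quality is a weighted sum of
# soft-constraint violations (lower is better). The constraint names and
# weights below are hypothetical examples.
def quality(violations, weights):
    """violations: soft-constraint name -> violation count
    weights:    soft-constraint name -> penalty weight"""
    return sum(weights[c] * v for c, v in violations.items())

# e.g. 3 students with two exams in a row, 1 student with two exams in a day
penalty = quality({"two_in_a_row": 3, "two_in_a_day": 1},
                  {"two_in_a_row": 7, "two_in_a_day": 5})
# penalty == 7*3 + 5*1 == 26
```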
In real-world examination timetabling problems, the types and parameter values of the hard and soft constraints vary between institutions. In particular, the soft constraints usually reflect an institution's preferences. As research interest in examination timetabling problems has increased, benchmarks have emerged which identify the common soft and hard constraints of real-world institutions using standard, well-defined measures. These benchmarks advance the field by allowing meaningful scientific comparison and the public exchange of research achievements and expertise in the domain.
Carter et al. introduced a set of 13 benchmark examination datasets in 1996 [5], drawn from three Canadian high schools, five Canadian universities, one American university, one British university and one university in Saudi Arabia. These datasets have been widely tested and used in examination timetabling research [2]. These datasets were supplemented by a series of new datasets, drawn from anonymised data provided by several institutions worldwide for the 2007 International Timetabling Competition (ITC2007).
2.1.1 Related work in examination timetabling problems
Research on solving examination timetabling problems involves a variety of methodologies. Combining memetic algorithms with a hill climbing method was proposed for the timetabling problem by Ender Ozcan and Alpay Alkan in 2007 [9], offering an interesting approach to evolutionary timetabling with initially promising experimental results. In [10] and [11], large neighbourhood search techniques were introduced. One approach was based on an improvement-graph construction of solution adjacency and identified improving moves by solving negative-cost subset-disjoint graph cycles using a modified shortest-path label-correcting algorithm [10]. Another investigated a constructive local search approach for examination timetabling, employing an adaptive decomposition strategy [11]. In 2010, genetic algorithms [8] were used to solve a real-world timetabling problem at Nigerian universities using a hierarchy of constraints, introduced in conjunction with a new real-world examination timetabling dataset from the University of Agriculture, Abeokuta, Nigeria. Tabu search, a meta-heuristic algorithm, was combined with large neighbourhood search for the examination timetabling problem in [12]; there, the capacitated examination timetabling problem is treated as a partitioning problem and improvement moves are kept in a tabu list for a certain number of iterations.
Two ant algorithm methods were proposed for the examination timetabling problem in [13]. The timetabling problems were formulated as combinatorial optimization problems, as ant algorithms had been demonstrated to be a powerful solution approach to such problems. The proposed ant algorithms were tested on the Toronto benchmark and the experimental results were competitive with the literature.
The use of simulated annealing in the solution of the multiobjective examination timetabling problem was described in [14]. The algorithm proposed was based on graph theory, with the application of three variants of the proposed simulated annealing method.
An extended great deluge algorithm was employed in a two-phased approach to solve the examination timetabling datasets introduced as part of the ITC2007 competition [15]. The approach exploits the inherent advantage of the extended great deluge technique in escaping from local optima while maintaining a relatively simple set of neighbourhood moves. Variable neighbourhood techniques [16] and multi-objective approaches [17] have also been applied to examination timetabling problems.
Late acceptance hyper-heuristics were introduced by Burke and Bykov [18]. Traditionally, neighbourhood search compares the candidate solution with the solution immediately preceding it; in late acceptance, the candidate is instead compared with the solution that was current a number of iterations earlier. Late acceptance techniques are able to produce competitive results in a short timeframe. Reinforcement learning techniques are used to influence heuristic selection in hyper-heuristics: a memory log of heuristic actions is kept during execution, with successful actions rewarded and unsuccessful actions punished, so that successful actions are chosen more often and unsuccessful ones less often across the search.
Other methodologies which have produced relatively promising experimental results on examination timetabling problems include case-based reasoning [19], fuzzy approaches [20] and neural networks [21].
2.2 Hyperheuristics
Hyper-heuristics comprise a set of approaches that share the common goal of automating the design and adaptation of heuristic methods in order to solve hard computational search problems [23]. A heuristic refers either to a whole search algorithm or to a particular decision process sitting within some repetitive control structure, while a meta-heuristic refers to an overall control strategy exploring a solution search space [23]. Unlike meta-heuristics, hyper-heuristics search a space of low-level heuristics [24].
Specifically, a hyper-heuristic can be seen as a process which, given a particular problem instance and a number of low-level heuristics, manages the selection of which low-level heuristic to apply at any given time, until a stopping condition is met [23]. The hyper-heuristic operates at a higher level of abstraction, without knowledge of the domain in which it operates; it only has access to a set of low-level heuristics, simple local search operators or domain-dependent heuristics [24]. The higher-level strategy, for selecting or generating heuristics, can be a heuristic, a sophisticated knowledge-based technique, or a learning mechanism [22].
A variety of hyper-heuristic categorisations have been proposed in the literature. Considering the nature of the heuristic search space, hyper-heuristics can be split into two groups: 1) heuristic selection ("heuristics to choose heuristics"), where the hyper-heuristic uses a set of known domain-dependent low-level heuristics; and 2) heuristic generation ("heuristics to generate heuristics"), where the hyper-heuristic evolves new low-level heuristics from components of the existing set. Considering the source of feedback during learning, hyper-heuristics can be categorised into three types: 1) online learning, learning while solving a given problem instance; 2) offline learning, learning from a set of training instances a method that generalises to unseen instances; and 3) no learning, using no feedback from the search process [22], [23].
Both the traditional single-point based search hyper-heuristic, where a single low-level heuristic is selected at a time and applied to a single working solution, and the cooperative hyper-heuristic framework, which allows parallel execution of multiple low-level heuristics that cooperate by sharing different working solutions, are well researched in the literature [24]. The latter enables the strengths of one low-level heuristic to compensate for the weaknesses of another [25]. The traditional single-point based search hyper-heuristic framework relies on two key components: a heuristic selection method and a move acceptance criterion. Operating on a single solution, low-level heuristics are repeatedly selected and applied, and a decision is made as to whether to accept each move, until some termination criterion is met [25].
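The select-apply-accept loop of the single-point framework can be sketched as follows. This is an illustrative minimal framework, not the implementation used in this paper; simple-random selection stands in for the learning-based selectors introduced later:

```python
import random
import time

def hyperheuristic(initial, heuristics, quality, accept, time_limit):
    """Single-point selection hyper-heuristic loop: repeatedly select a
    low-level heuristic, apply it to the working solution, and let the
    move-acceptance rule decide whether to keep the result."""
    current, f_current = initial, quality(initial)
    start = time.time()
    while time.time() - start < time_limit:
        h = random.choice(heuristics)        # simple-random selection
        candidate = h(current)               # apply the low-level heuristic
        f_candidate = quality(candidate)
        if accept(f_candidate - f_current):  # move acceptance on the delta
            current, f_current = candidate, f_candidate
    return current

# Toy usage: minimise |x| with a single +/-1 perturbation heuristic and
# accept-non-worsening as the move-acceptance rule.
best = hyperheuristic(50, [lambda x: x + random.choice((-1, 1))],
                      abs, lambda delta: delta <= 0, time_limit=0.05)
```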
2.2.1 Related work in Hyperheuristics
Graph colouring hyper-heuristics were investigated as a new constructive hyper-heuristic method for examination timetabling problems in [26]. The author utilised hierarchical hybridisations of four low-level graph colouring heuristics, including largest degree, saturation degree, largest coloured degree and largest enrolment, to produce four ordered lists; the sequence in which exams are selected for scheduling is based on the generated lists. Furthermore, a roulette wheel selection mechanism is included in the algorithm to improve the effectiveness of timeslot selection, aiming to probabilistically select an appropriate timeslot for the chosen exam. The proposed approach was tested on the Carter benchmarks as well as the ITC2007 benchmarks, and the experimental results showed that the graph colouring constructive hyper-heuristic is capable of producing good results, outperforming other approaches on various benchmark instances.
Inspired by the musical improvisation process, the Harmony Search Algorithm (HSA) is a relatively recent meta-heuristic and is used within the Harmony Search-based Hyper-heuristic (HSHH) for solving examination timetabling problems [27]. The Harmony Search algorithm operates at a high level of abstraction, intelligently evolving a series of low-level heuristics in the improvement process, where each low-level heuristic equates to a move or swap strategy. The authors tested the proposed method on the ITC2007 benchmark, comprising 12 datasets of varying complexity and size, and it produced competitive results.
Monte Carlo based hyper-heuristics were designed and implemented to solve examination timetabling problems in [28]. Simulated annealing with a reheating scheme was introduced into this Monte Carlo based hyper-heuristic. However, the method compared poorly with other hyper-heuristic methods; this was believed to be due to fundamental weaknesses such as the choice of rewarding mechanism and the evaluation of the utility values used for heuristic selection. The conclusions suggested that using a choice function rather than incorporating simulated annealing might improve the method.
Burke et al. brought the concept of reinforcement learning to hyper-heuristics in [29]. In that paper, hyper-heuristic methodologies were identified that search the heuristic selection space and use selected low-level heuristics to search the problem domain. Their proposed method is an iterative hyper-heuristic framework formed of a single candidate solution and multiple perturbative low-level heuristics; the algorithm consists of heuristic selection and move acceptance under certain termination criteria. Inspired by related work in this area, one of the main focuses of this study is to analyse the working of learning-based heuristic selection within the automated search for examination timetabling solutions.
3 RL-SA Based Hyper-heuristic Algorithm
The hyper-heuristic construction for this research consists of two stages: first, an initial construction phase which generates a feasible solution; second, a further improvement phase to achieve a better solution. In the construction stage we use squeaky wheel construction, while the optimisation stage is based on simulated annealing, combining reinforcement learning with hyper-heuristics.
3.1 Squeaky Wheel Construction
Squeaky Wheel (adaptive) construction [30] is an iterative construction process, building an initial schedule by placing one exam at a time, in the order determined by a weighted sequence. There are a number of different methods for determining the initial order of the weighted list; the technique presented here calculates the initial ordering based on examination size and the number of conflicts. Each exam is assigned to the first available time and room combination, where possible, ensuring that a feasible solution is maintained while minimising soft constraint violations. If an exam cannot be scheduled in the current iteration, it is left unscheduled and the constructor moves on to the next exam. When an exam is scheduled, a weighting based upon its current penalty, as defined by the various soft constraint violations, is added to the stored weighting in the weighted list; if an exam cannot be scheduled, a suitably large weighting is used instead. Once an attempt has been made to schedule all exams, the weighted list is re-sorted so that the exams with the highest weighted value (the most difficult exams) are scheduled first on the next iteration, or "run", of the constructor. The weightings held in the list evolve over the duration of the entire construction process.
Algorithm 1: Squeaky Wheel
Read the problem file into memory and build conflict and suitability matrices;
Calculate an initial weighting based on predefined criteria;
weightedList stores the scheduling order of all the exams
1. while stopping criteria not met do
2.   foreach exam ei in weightedList do
3.     for all suitable timeslots ti of ei do
4.       if CanSchedule(ei, ti) then
5.         evaluate penalty and store best (bestTi); // minimise violation of soft constraints
6.       end
7.     end
8.     if multiple best found then
9.       Schedule(ei, randomBestTi) and store associated weighting in weightedList;
10.    end
11.    else if bestTi found then
12.      Schedule(ei, bestTi) and store associated weighting in weightedList;
13.    end
14.    else
15.      Leave exam unscheduled and add a large weighting in weightedList;
16.    end
17.  end
18.  Sort(weightedList); // leave hard-to-schedule exams until later
19. end
3.2 RL-SA Based Hyper-heuristic Optimization
The hyper-heuristic consists of three main parts: the move acceptance method (a simulated annealing algorithm), the heuristic selection method (reinforcement learning algorithms) and a set of low-level heuristics. The selection method selects a low-level heuristic, which proposes a move to optimise the current solution; the move acceptance method then decides whether to accept this move. For each low-level heuristic, a utility value represents its performance and is updated dynamically; the utility update methods are also based on reinforcement learning algorithms.
Algorithm 2: RL-SA Algorithm
Input – n: number of heuristics; u: array holding a utility value for each heuristic; totalTime; T0: initial temperature
1. // initialisation
2. Generate a random complete solution Scurrent;
3. Initialise utility values;
4. fcurrent = f0 = quality(Scurrent);
5. startTime = t = time(); level = fcurrent;
6. T ← T0; sinceImp = 0; numNonImprovingReheat = 0; maxReheatTimes = 5;
7. bool bImprovementFound = false;
8. // main loop executes until the total running time allowed is exceeded
9. while (t < totalTime) {
10.   // heuristic selection
11.   i = selectHeuristic(u); // select a heuristic using the utility values
12.   Stemp = applyHeuristic(i);
13.   ftemp = quality(Stemp);
14.   t = time() – startTime;
15.   // move acceptance
16.   if (ftemp < fcurrent) then {
17.     ui = reward(ui); // improving move
18.     Scurrent = Stemp;}
19.   else {
20.     ui = punish(ui);} // worsening move
21.   Δ = ftemp − fcurrent;
22.   r ← random ∈ [0, 1];
23.   if (Δ < 0 || r < P(Δ, T)) then { // accept if improving or with the Boltzmann probability P
24.     if (Δ < 0) {Scurrent = Stemp;} else sinceImp++;}
25.   if (sinceImp == reheatFrequency) {sinceImp = 0; // reset
26.     if (!bImprovementFound) numNonImprovingReheat++;
27.     else bImprovementFound = false;
28.     if (numNonImprovingReheat > maxReheatTimes) {
29.       numNonImprovingReheat = 0;
30.       T = (T0 − T) / 2.0; // reheat
31.       bestEval = (long)Math.Round(bestEval * 1.1);}}
32.   T = T * 0.9; // decrease temperature according to the Lundy and Mees rule
33. } // terminate when the given maximum CPU time is exceeded
34. return Scurrent;
3.2.1 Simulated Annealing
Simulated Annealing (SA) is a global stochastic optimization technique that has been widely used in several types of combinatorial optimization problems ([32], [33], [34], [35]). It is a variant of local search which allows uphill moves to be accepted in a controlled manner. The typical SA algorithm accepts a new solution if its cost is lower than that of the current solution; even if the cost is greater, there is a probability that the solution will be accepted. This acceptance criterion, along with a reheating scheme, is particularly effective in helping the improvement process escape from local optima.
For the hyper-heuristic, simulated annealing is used for move acceptance. The decision as to whether to accept a new solution depends on whether it improves on the current solution or, failing that, on whether a uniformly drawn random number falls below the Boltzmann probability. The current "temperature" of the improvement process is reduced at each iteration of the search and directly influences the acceptance probability. The reheating scheme is invoked when the iteration count reaches the given reheat criterion, at which point the temperature is increased rather than reduced.
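A minimal sketch of this move-acceptance scheme, assuming the standard Boltzmann criterion exp(−Δ/T), geometric cooling with the rate 0.9 quoted in Algorithm 2, and the halving-style reheat from the same algorithm:

```python
import math
import random

def sa_accept(delta, temperature):
    """SA move acceptance: always accept an improving move; accept a
    worsening move with the Boltzmann probability exp(-delta / T)."""
    if delta < 0:
        return True
    return random.random() < math.exp(-delta / temperature)

def cool(temperature, rate=0.9):
    """Geometric cooling applied each iteration (rate 0.9 as in Algorithm 2)."""
    return temperature * rate

def reheat(temperature, initial_temperature):
    """Reheating step, mirroring the (T0 - T) / 2 update in Algorithm 2,
    invoked when the reheat criterion is met without improvement."""
    return (initial_temperature - temperature) / 2.0
```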
3.2.2 Reinforcement Learning
Reinforcement Learning (RL) is a general term for a set of widely used approaches which provide a learning mechanism for determining how to behave given certain outcomes, or "how to map situations to actions", through trial-and-error interactions [31]. The proposed heuristic selection component involves the design of four different selection methods as well as four different low-level heuristic utility reward/punish methods inspired by reinforcement learning algorithms.
3.2.2.1 Selection methods
(0) Equal Sequence
In this selection method, each low-level heuristic is selected in sequence to ensure each is applied in equal measure. This is inspired by the maximise-exploration principle used within reinforcement learning.
(1) Greedy
In this selection method, the process checks whether all the low-level heuristics have been selected at least once; if not, a random low-level heuristic is selected, and this continues until every heuristic has been tried. From that point on, the low-level heuristic with the highest current utility value is selected; where more than one heuristic shares the highest value, one is chosen at random from the contenders.
(2) ε-Greedy
In this selection method, random selection is employed until every low-level heuristic has been selected at least once. After that, if a uniformly drawn random number is smaller than the greedy probability ε, a low-level heuristic is selected at random; otherwise, the low-level heuristic with the highest utility value is selected. In determining the most effective value for ε, experimental tests were performed with ε set to 0.1 and to 0.01.
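An illustrative sketch of this selection rule over an array of utility scores (the function name and the random tie-breaking detail are assumptions):

```python
import random

def epsilon_greedy(utilities, epsilon=0.1):
    """epsilon-greedy heuristic selection: with probability epsilon choose a
    random low-level heuristic, otherwise choose one with the maximum
    utility score (ties broken at random). Returns the heuristic's index."""
    if random.random() < epsilon:
        return random.randrange(len(utilities))
    best = max(utilities)
    return random.choice([i for i, u in enumerate(utilities) if u == best])
```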
(3) ε-Decay-Greedy
Similar to the previous method, except that the greedy probability ε is dynamic, decreasing in value as the optimisation process continues. The rationale is that as the hyper-heuristic learns, random selection of low-level heuristics becomes less necessary. The algorithm for this selection method is outlined below.
Algorithm 3: ε-decay-greedy
input: iteration: current iteration count; llhList: list of all the low-level heuristics
1. int index; // index of the selected low-level heuristic
2. double eDecay = 1.0 / Math.Sqrt(iteration);
3. Random rand = new Random();
4. if (rand.NextDouble() < eDecay || !checkSelected(llhList)) {
5.   // with probability eDecay (or until every heuristic has been tried), select at random
6.   index = rand.Next(0, llhList.Count);}
7. else {
8.   // select one of the highest-utility heuristics, ties broken at random
9.   index = llhMaxUtilityIndex(llhList);
10.  index = randomDuplicateSelectionSum(index, llhList);}
11. return index;
(4) Softmax selection
Unlike the previously described selection methods, this method calculates a softmax probability value for each low-level heuristic and uses these probabilities to select among them.
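A sketch of softmax selection over the utility scores; the temperature parameter tau is an assumed tuning knob not specified in the text (small tau approaches greedy selection, large tau approaches uniform random selection):

```python
import math
import random

def softmax_select(utilities, tau=1.0):
    """Select a low-level heuristic index i with probability proportional to
    exp(u_i / tau), i.e. the softmax (Boltzmann) distribution over utilities."""
    m = max(utilities)  # subtract the max for numerical stability
    weights = [math.exp((u - m) / tau) for u in utilities]
    r = random.random() * sum(weights)
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if r < cumulative:
            return i
    return len(utilities) - 1  # guard against floating-point round-off
```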
3.2.2.2 Utility update methods
Within the process of updating the utility values, a positive "reward" (increment) is given to a low-level heuristic if it was effective in generating a better solution; otherwise a "punish" (decrement) value is applied. For all of the utility update methods, the lower bound of the utility value is set to 0, the upper bound to 40 and the initial value to 20; these values were reached through trial experimentation.
(5) Sum utility
In this method the reward value is +1 and the punish value is −1.
(6) Average utility value/Monte Carlo
In this method the reward value is +1 and the punish value is −1. The sum utility is then divided by the iteration count to give an average.
(7) Discount sum
This method is similar to the first utility update method, although the positive reward and negative punish values are dynamic: the reward value is calculated as (+1) × 0.9^(selection count), and the punish value as (−1) × 0.9^(selection count).
Algorithm 4: Discount Sum Utility method
Input: bool flag: reward or punish; int index; llhList: heuristic array list; selectedTimes; discountFactor = 0.9
1. for (int i = 0; i < llhList.Count; i++) {
2.   if (llhList[i].LlhId == index) {
3.     if (flag) llhList[i].utility += (+1) * Math.Pow(discountFactor, selectedTimes);
4.     else llhList[i].utility += (-1) * Math.Pow(discountFactor, selectedTimes);}}
(8) Temporal difference
The idea behind this utility update method is to estimate the possible future reward based on the current performance of each low-level heuristic. First, the method calculates the sum utility with the reward and punish values (again +1 and −1 respectively). Second, an estimated reward is calculated: the probability of the current low-level heuristic being selected in the next step is 1.0 / (number of low-level heuristics), and its reward probability (the probability that it improves the solution in the next iteration if selected) is taken as 0.5. The estimated reward is therefore (selection probability) × (reward probability), and it is added to the current sum utility to give the heuristic's final utility value. The step size value was experimentally tested between 1 and 5 to determine the most effective setting.
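The update described above can be sketched as follows. This is a simplified reading of the text, using the utility bounds from Section 3.2.2.2; the exact role of the step size in the authors' implementation is not fully specified, so it is omitted here:

```python
def td_utility(utility, improved, n_heuristics,
               reward_prob=0.5, lower=0.0, upper=40.0):
    """Temporal-difference-style utility update: apply the +1/-1 sum-utility
    update, then add an estimated future reward of
    (selection probability) x (reward probability), clamped to the bounds."""
    utility += 1.0 if improved else -1.0           # sum-utility part
    estimate = (1.0 / n_heuristics) * reward_prob  # estimated future reward
    return max(lower, min(upper, utility + estimate))
```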
3.3.3 Low-level heuristic design
For this work, 19 low-level heuristics were implemented, grouped into four main types and described below. Most of these low-level heuristics are either 1-opt or 2-opt, and there is a mixture of random and deterministic selection of exams and slots. We purposely test low-level heuristics with simple moves rather than intelligent, complex moves, because we want to ensure that the hyper-heuristic can recognise good moves and make intelligent decisions based on these simple moves. Furthermore, we want the problem-domain heuristics to be easy to implement and the hyper-heuristic to be more generalised.
3.3.3.1 Small perturbative moves
Table 1: small heuristics 
H1 (random period assignment): 
Select a Random Exam in Period P1 with Room assignment R1.
Assign Exam to Random Period P2 (if feasible), do not change Room assignment. 
H2 (random room assignment): 
Select a Random Exam in Period P1 with Room assignment R1.
Move the Exam to a feasible Random Room R2 – do not change the Period. 
H3 (random assignment): 
Select a Random Exam in Period P1 with Room assignment R1.
Assign Exam to Random Period P2 (if feasible) and Random Room R2 which can accommodate the exam (if feasible – exclusive constraint). 
H4 (random period swap): 
Select 2 Random Exams, one in Period P1 with Room assignment R1 and the other in Period P2 (≠P1) with Room assignment R2.
Swap their Periods (if both feasible) – no swapping of Rooms. 
H5 (random room swap): 
Select 1 Random Exam with Room assignment R1 (ignore the period – ensure not exclusive).
Select another Random Exam with a similar capacity requirement assigned to Room R2 (≠R1). Swap their Rooms (if both moves are feasible, e.g. neither room is exclusive) – no swapping of Periods. 
H6 (random exam swap): 
Select 1 Random Exam with Room assignment R1 in Period P1.
Select another Random Exam with Room assignment R2 in Period P2. Swap their Rooms (if both moves are feasible, e.g. neither room is exclusive) and also swap their Periods. 
3.3.3.2 Large perturbative moves
Table 2: large heuristics 
H7 (random group of exams swap based on period): 
Select two Random Periods P1 and P2 and swap ALL exams between them. 
H8 (random group of exams move based on period): 
Select one Period P1 and move each exam assigned to that period to a new feasible period, one by one (APPLY H1). 
H7_2 (random group of exams swap based on room): 
Select two Random Rooms R1 and R2 and swap ALL exams between them. 
H8_2 (random group of exams move based on room): 
Select one Room R1 and move each exam assigned to that room to a new feasible room, one by one (APPLY H2). 
H7_3 (random group of exams swap based on timeslot): 
Select two Random Timeslots T1 and T2 and swap ALL exams between them. 
H8_3 (random group of exams move based on timeslot): 
Select one Timeslot T1 and move each exam assigned to that timeslot to a new feasible timeslot, one by one (APPLY H3). 
Note: Timeslot in the above table refers to the combination of period and room.
3.3.3.3 Very large moves
Table 3: shuffle heuristics 
H9 (random group of exams shuffle based on period): 
Shuffle all periods at random. Loop over all Periods i=P_1..P_K; H6(i, Random(i+1,K)). 
H9_2 (random group of exams shuffle based on room): 
Shuffle all rooms at random. Loop over all Rooms i=R_1..R_K; H6(i, Random(i+1,K)). 
H9_3 (random group of exams shuffle based on timeslot): 
Shuffle all timeslots at random. Loop over all Timeslots i=T_1..T_K; H6(i, Random(i+1,K)). 
3.3.3.4 Directed moves
Table 4: directed heuristics 
H10 (ranked exams group move – consider both hard and soft constraint violations): 
Generate a weightedList based on all hard and soft constraint violations for exams.
For the top 10 exams in the list, reallocate them to a random new timeslot. The new timeslot is selected from the suitabletimeslotList. Update weightedList whenever a move is made. 
H10_2 (ranked exams group swap – consider both hard and soft constraint violations): 
Generate a weightedList based on all hard and soft constraint violations for exams.
From the top 10 exams in the list, select two random exams and swap their timeslots. Update the weightedList whenever a move is made. 
H11 (ranked exams group move – consider only soft constraint violations): 
Generate a softconstraintList based on all soft violations for exams.
For the top 10 exams in the list, reallocate them to a random new timeslot. The new timeslot is selected from the suitabletimeslotList. Update softconstraintList whenever a move is made. 
H11_2 (ranked exams group swap – consider only soft constraint violations): 
Generate a softconstraintList based on all soft constraint violations for exams.
From the top 10 exams in the list, select two random exams, swap their timeslot. Update softconstraintList whenever a move is made. 
4 Experimental Environment
The algorithm was implemented and tested on a PC with a 2.9 GHz Intel Core i7 processor, 8 GB RAM and macOS 10.12.5. The program was coded in C# targeting the .NET Framework 4.5. For each problem set, the program was executed for ten runs, with a 270-second time limit per run determined by the benchmarking application released by the competition organizers. During initial experimentation it was found that allowing the adaptive construction phase to run for approximately 10% of the total execution time gave the best results.
5 Results and Analysis
As described above, four different selection methods and four different utility update methods were designed and implemented. Note that for an exam timetabling solution, a lower score means a better solution, since the score represents the total penalty for violations of hard and soft constraints: the fewer the violations and the smaller the penalty, the higher the quality of the resultant solution.
5.1 Benchmark data sets
The algorithm was tested on the 2007 International Timetabling Competition (ITC2007) benchmark data sets. These were chosen because they are based on real examination problem sets with a richer set of constraints, which to a certain extent helps to determine how this approach may transfer to similar real-world problems.
Table 5: benchmark attributes
Datasets  Exams  Students  Periods  Rooms  Conflict Density  Period Hard Constraints  Room Hard Constraints 
Exam 1  607  7891  54  7  5.05%  12  0 
Exam 2  870  12743  40  49  1.17%  12  2 
Exam 3  934  16439  36  48  2.62%  170  15 
Exam 4  273  5045  21  1  15.00%  40  0 
Exam 5  1018  9253  42  3  0.87%  27  0 
Exam 6  242  7909  16  8  6.16%  23  0 
Exam 7  1096  14676  80  15  1.93%  28  0 
Exam 8  598  7718  80  8  4.55%  20  1 
Exam 9  169  655  25  3  7.84%  10  0 
Exam 10  214  1577  32  48  4.97%  58  0 
Exam 11  934  16439  26  40  2.62%  170  15 
Exam 12  78  1653  12  50  18.45%  9  7 
Table 5 lists the main characteristics of each of the examination datasets provided by the organizers of ITC2007. The conflict density measures the proportion of examinations that are in conflict due to student enrolments, defining how tightly the problem is constrained by student module choice. The conflict density for most of the datasets is quite low, which reflects the amount of choice available to students within a modern curriculum, with a large variation in course or subject choices between students. The problem size, measured by the number of exams and students, varies across the datasets: the largest could be argued to be either Exam 3/Exam 11 or Exam 7, and the smallest either Exam 9 or Exam 12. The number of periods and rooms available also has a measurable effect on the difficulty of constructing a feasible solution. Exam 3 and Exam 11 are almost identical, but Exam 11 has a much smaller set of period resources available. The differences between Exam 3 and Exam 11 reflect a "real-world" situation where an examination session has been shortened to minimize space and staff costs, while keeping all other existing constraints where possible.
The following tests focus on datasets 7, 9, 11 and 12 as they include the largest and smallest data sizes and conflict densities, providing a reasonable variation for testing purposes. The tests on these four datasets were used to guide the design of the final hyperheuristic reheating scheme, selection method and utility value update method.
5.2 Simulated Annealing operator tests
For the simulated annealing process, four groups of experiments were carried out to determine the best setting for the reheating frequency, and whether to combine the reheating scheme with a reset of the low-level heuristics' utility values on reaching the maximum reheating count (set to 5).
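The reheating scheme tested in the following subsections can be sketched as follows (a Python illustration with hypothetical parameter names; the actual implementation is C#, and the reheat temperature and counters shown here are assumptions, not values from the paper):

```python
def reheat_step(iteration, temperature, reheat_count,
                reheat_frequency=1000, t_reheat=1.0, max_reheats=5):
    # Every `reheat_frequency` iterations the SA temperature is raised
    # back to `t_reheat`. When the maximum reheat count (5 in this
    # work) is reached, the caller is signalled to reset the low-level
    # heuristics' utility values (see Section 5.2.2).
    reset_utilities = False
    if iteration > 0 and iteration % reheat_frequency == 0:
        temperature = t_reheat
        reheat_count += 1
        if reheat_count >= max_reheats:
            reset_utilities = True
            reheat_count = 0
    return temperature, reheat_count, reset_utilities
```

Coupling the utility reset to the reheat counter means exploration of the heuristic set restarts at the same moments the annealing process itself is re-energised.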
5.2.1 Reheating frequency
The reheating frequency is tested at values of 1000 and 10000 on datasets 7, 9, 11 and 12, with ten runs for each dataset. Here, a simple random selection method and the sum utility update method are used, and the utility values are not reset when the maximum reheating count is reached.
Table 6: reheating frequency tests
Dataset  AVG (freq = 1000)  BEST (freq = 1000)  AVG (freq = 10000)  BEST (freq = 10000) 
Dataset 7  8631  8497  8858  8805 
Dataset 9  1347  1278  1352  1300 
Dataset 11  35782  35266  35504  35611 
Dataset 12  5985  5965  6034  5911 
The results presented in Table 6 suggest that a reheat frequency of 1000 is generally better; therefore, in the following tests the reheat frequency is set to 1000.
5.2.2 Resetting low-level heuristic utility values
The utility values of the low-level heuristics can become similar after a certain number of iterations, which can adversely affect the selection methods and render them ineffective. We test both resetting and not resetting the utility values of the low-level heuristics when the maximum reheating count is reached, using a simple greedy selection method and the sum utility update method.
Table 7: reset low-level heuristic utility value tests
Dataset  AVG (reset)  BEST (reset)  AVG (no reset)  BEST (no reset) 
Dataset 7  7530  6721  7890  7528 
Dataset 9  1340  1320  1520  1506 
Dataset 11  35733  35519  35673  35375 
Dataset 12  5995  5955  6015  6165 
The results presented in Table 7 suggest that performance is generally improved by resetting the utility values of all low-level heuristics on reaching the maximum reheating count. Therefore, this approach is adopted in the selection method tests outlined in the next section.
5.3 Selection Method test
As described above, four different selection methods were designed, some with different operator settings. Tests were performed on all selection methods with different operator values on datasets 7, 9, 11 and 12, with ten runs per dataset. These tests use the sum utility update method. The best results for each selection method are presented in the table below.
Table 8: comparison of selection methods (BEST results)
Dataset  equal sequence  greedy  ε-greedy (ε = 0.01)  ε-greedy (ε = 0.1)  decay greedy  softmax (T = 0.1)  softmax (T = 1) 
Dataset 7  8644  7792  8381  7953  6501  8856  8912 
Dataset 9  1372  1303  1343  1306  1288  1276  1396 
Dataset 11  35840  35295  35038  35281  34054  35832  35441 
Dataset 12  6065  5965  5965  5965  5965  6065  5965 
In this test, two values of ε, 0.1 and 0.01, are tested for the ε-greedy method, with ε = 0.1 beating ε = 0.01 in most cases. For the softmax selection method, the temperature setting controls the balance of exploration and exploitation. The tested temperature values of 0.1 and 1 are chosen based on the literature; the results suggest that 0.1 performs better on average, although the advantage is small. Overall, the decay greedy method achieves the lowest best results over the chosen datasets.
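The decay greedy method described in the abstract chooses the highest-utility heuristic with a probability that decays over time, otherwise choosing at random. A minimal Python sketch of that idea (the exponential decay schedule and its rate are illustrative assumptions; the paper does not specify the exact schedule here):

```python
import random

def decay_greedy_select(utilities, iteration, decay=0.999):
    # With probability p, which decays as the search progresses,
    # exploit the heuristic with the highest utility score; otherwise
    # explore by picking a low-level heuristic uniformly at random.
    p = decay ** iteration  # hypothetical decay schedule
    if random.random() < p:
        return max(range(len(utilities)), key=lambda i: utilities[i])
    return random.randrange(len(utilities))
```

Early in the search this behaves like pure greedy selection; late in the search it approaches uniform random selection, shifting the balance from exploitation towards exploration.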
5.4 Utility update method
This test uses the decay greedy selection method combined with the different utility update methods. The temporal difference utility update method uses an estimation step-size value of 1 in order to establish the effectiveness of this approach.
Table 9: comparison of utility value update methods (BEST results)
Dataset  Sum  Average  Discount  Temporal Difference 
Dataset 7  6501  8824  6721  9233 
Dataset 9  1288  1370  1258  1423 
Dataset 11  34954  35529  34390  34399 
Dataset 12  6065  6065  5883  5916 
The results show that the discount utility update method performs best on three of the four chosen datasets, and is adopted in the remaining experiments.
5.5 Comparison with the published literature
Having chosen the best selection method (decay greedy) and the best utility update method (discount utility), experimentation is carried out over all 12 datasets, with the best evaluation value recorded over 10 runs for each. A comparison with the current best techniques in the literature is presented in Table 10 below.
Table 10: best result comparison to the literature (BEST results)
Dataset  RLSAHH  DSO  Müller ITC2007  Adaptive Linear Combination  Graph Coloring  Multistage algorithm 
dataset 1  6059  5186  4370  5231  6234  5814 
dataset 2  863  405  400  433  395  1062 
dataset 3  14027  9399  10049  9265  13302  14179 
dataset 4  20031  19031  18141  17787  17940  20207 
dataset 5  3637  3117  2988  3083  3900  3986 
dataset 6  26910  26055  26585  26060  27000  27755 
dataset 7  6572  3997  4213  10712  6214  6885 
dataset 8  10485  7303  7742  12713  8552  10449 
dataset 9  1267  1048  1030  1111  N/A  N/A 
dataset 10  14357  14789  16682  14825  N/A  N/A 
dataset 11  34054  30311  34129  28891  N/A  N/A 
dataset 12  5509  5369  5535  6181  N/A  N/A 
The results obtained show that the proposed method is reasonably competitive: it achieves the best result for dataset 10 and approaches the best for several others. The approach can therefore be considered promising and worth further research and experimentation.
5.6 Comparison with other hyperheuristic methods
It is interesting to examine the experimental results of other hyperheuristic techniques which have been applied to solving the examination timetabling problem using the ITC 2007 benchmarks. A comparison with the approach undertaken in this work is useful to determine whether it improves on the current work using hyperheuristics.
Table 11: comparison to other hyperheuristic methods in the literature (BEST results)
Dataset  RLSAHH  HSHH  GCCHH  EAHH  AIH 
dataset 1  6059  11823  6234  8559  6235 
dataset 2  863  976  395  830  2974 
dataset 3  14027  26670  13002  11576  15832 
dataset 4  20031  ////  17940  21901  35106 
dataset 5  3637  6772  3900  3969  4873 
dataset 6  26910  30980  27000  28340  31756 
dataset 7  6572  11762  6214  8167  11562 
dataset 8  10485  16286  8552  12658  20994 
dataset 9  1267  –  –  –  – 
dataset 10  14357  –  –  –  – 
dataset 11  34054  –  –  –  – 
dataset 12  5509  –  –  –  – 
The results in Table 11 show that this approach achieves the best results on three of the datasets compared with the other hyperheuristics applied to the ITC2007 examination timetabling benchmark. For the remaining datasets the results are relatively good (above average), suggesting that this work makes a positive contribution to research on hyperheuristics.
6 Conclusion
This paper outlined the implementation of a hyperheuristic to solve the examination timetabling problem. The construction stage uses an adaptive squeaky wheel algorithm to generate a feasible and relatively good initial solution. The improvement phase focuses on combining reinforcement learning algorithms with a hyperheuristic, with the examination timetabling problem as the target. It uses four selection methods and four utility value update methods inspired by reinforcement learning, with a series of experiments performed to test the effectiveness of the approach.
The main aim of this work is to identify a general approach to utilizing reinforcement learning algorithms in solving complicated resource allocation problems. The examination timetabling problem was selected due to its high complexity, which makes it unsuitable to be solved directly by reinforcement learning alone. The work and results presented show that combining reinforcement learning with a hyperheuristic is a workable way to solve such complicated problems, contributing to the automatic scheduling research area. The multiple selection methods and utility update methods inspired by reinforcement learning fit the learning component of the hyperheuristic. The examination timetabling problem, and in particular the chosen benchmarks, closely reflect real-world scheduling problems, bringing theory to real-world problem solving. In addition, the tests carried out also investigate how reheating influences the simulated annealing search process.
Future work is planned to extend the reinforcement learning hyperheuristic to other resource allocation optimization problems, as well as to incorporate further reinforcement learning algorithms into this area. Another strand of research will investigate how unsupervised learning algorithms can better fit into the optimization research area. The ultimate aim is to show that this approach works in general for a wide variety of optimisation problems, with encouraging results already obtained towards this goal.
References
1. B. McCollum, "A perspective on bridging the gap between theory and practice in university timetabling," Practice and Theory of Automated Timetabling VI, vol. 3867, pp. 3–23, 2007.
2. R. Qu, E. K. Burke, B. McCollum, L. T. G. Merlot, and S. Y. Lee, "A survey of search methodologies and automated system development for examination timetabling," Journal of Scheduling, vol. 12, no. 1, pp. 55–89, Oct. 2008.
3. S. Kristiansen and T. R. Stidsen, "A Comprehensive Study of Educational Timetabling – a Survey," Department of Management Engineering, Technical University of Denmark, Nov. 2013.
4. B. McCollum, A. Schaerf, B. Paechter, P. McMullan, R. Lewis, A. J. Parkes, L. D. Gaspero, R. Qu, and E. K. Burke, "Setting the Research Agenda in Automated Timetabling: The Second International Timetabling Competition," INFORMS Journal on Computing, vol. 22, no. 1, pp. 120–130, May 2009.
5. M. W. Carter, G. Laporte, and S. Y. Lee, "Examination Timetabling: Algorithmic Strategies and Applications," The Journal of the Operational Research Society, vol. 47, no. 3, pp. 373–383, 1996.
6. E. K. Burke, G. Kendall, B. McCollum, and P. McMullan, "Constructive versus improvement heuristics: an investigation of examination timetabling," 3rd Multidisciplinary International Scheduling Conference: Theory and Applications, pp. 28–31, 2007.
7. T. B. Cooper and J. H. Kingston, "The Complexity of Timetable Construction Problems," Lecture Notes in Computer Science, vol. 1153, pp. 281–295, 1996.
8. O. Arogundade, A. Akinwale, and O. Aweda, "Genetic Algorithm Approach for a Real-World University Examination Timetabling Problem," Int. J. Comput. Appl., vol. 12, no. 5, Dec. 2010.
9. E. Özcan and A. Alkan, "A Memetic Algorithm for Solving a Timetabling Problem: An Incremental Strategy," in Multidisciplinary International Scheduling: Theory and Applications (MISTA 2007), Paris, France, 2007.
10. S. Abdullah, S. Ahmadi, K. Burke, and M. Dror, "Investigating Ahuja-Orlin's large neighbourhood search approach for examination timetabling," OR Spectrum, vol. 29, no. 1, pp. 351–372, 2007.
11. A. Syariza, E. Burke, A. Bargiela, B. McCollum, and E. Özcan, "A Constructive Approach to Examination Timetabling based on Adaptive Decomposition and Ordering," Ann. Oper. Res., vol. 218, no. 1, pp. 3–21, 2014.
12. F. Glover, Tabu Search: Fundamentals and Uses. University of Colorado, Boulder.
13. M. Eley, "Ant algorithms for the exam timetabling problem," in International Conference on the Practice and Theory of Automated Timetabling, Springer Berlin Heidelberg, 2006.
14. J. Thompson and K. Dowsland, "Variants of simulated annealing for the examination timetabling problem," Ann. Oper. Res., vol. 63, pp. 105–128, 1996.
15. B. McCollum et al., "An extended great deluge approach to the examination timetabling problem," in Proceedings of the 4th Multidisciplinary International Scheduling Conference: Theory and Applications (MISTA 2009), pp. 424–434, 2009.
16. E. K. Burke, J. Eckersley, B. McCollum, S. Petrovic, and R. Qu, "Hybrid variable neighbourhood approaches to university exam timetabling," Eur. J. Oper. Res., vol. 206, no. 1, pp. 46–53, 2010.
17. L. Mumford, "A multiobjective framework for heavily constrained examination timetabling problems," Ann. Oper. Res., vol. 180, no. 1, pp. 3–31, 2010.
18. E. Özcan, Y. Bykov, M. Birben, and E. Burke, "Examination timetabling using late acceptance hyper-heuristics," in Proceedings of the 2009 IEEE Congress on Evolutionary Computation (CEC 2009), Trondheim, Norway.
19. E. Burke, S. Petrovic, and R. Qu, "Case based heuristic selection for timetabling problems," Journal of Scheduling, vol. 9, no. 2, pp. 115–132, 2006.
20. H. Asmuni, E. Burke, J. Garibaldi, B. McCollum, and A. Parkes, "An investigation of fuzzy multiple heuristic orderings in the construction of university examination timetables," Comput. Oper. Res., vol. 36, no. 4, pp. 981–1001, 2009.
21. P. Corr, B. McCollum, M. McGreevy, and P. McMullan, "A new neural network based construction heuristic for the examination timetabling problem," in Lecture Notes in Computer Science, vol. 4193: Parallel Problem Solving from Nature – PPSN IX, pp. 392–401, 2006.
22. E. K. Burke, M. Gendreau, M. Hyde, et al., "Hyper-heuristics: A survey of the state of the art," Journal of the Operational Research Society, vol. 64, no. 12, pp. 1695–1724, 2013.
23. E. K. Burke, M. Hyde, G. Kendall, et al., "A classification of hyper-heuristic approaches," in Handbook of Metaheuristics, Springer US, 2010, pp. 449–468.
24. D. Ouelhadj and S. Petrovic, "A cooperative hyper-heuristic search framework," Journal of Heuristics, vol. 16, no. 6, pp. 835–857, 2010.
25. P. Ross, "Hyper-heuristics," in Search Methodologies, Springer US, 2005, pp. 529–556.
26. N. R. Sabar, M. Ayob, R. Qu, and G. Kendall, "A graph coloring constructive hyper-heuristic for examination timetabling problems," Applied Intelligence, vol. 37, pp. 1–11, 2012.
27. K. Anwar, A. T. Khader, M. A. Al-Betar, and M. A. Awadallah, "Harmony Search-based Hyper-heuristic for examination timetabling," in 2013 IEEE 9th International Colloquium on Signal Processing and its Applications (CSPA), 2013, pp. 176–181.
28. E. Burke, G. Kendall, M. Mısır, and E. Özcan, "Monte Carlo hyper-heuristics for examination timetabling," Annals of Operations Research, vol. 196, pp. 73–90, 2012.
29. E. Özcan, M. Mısır, G. Ochoa, et al., "A Reinforcement Learning – Great-Deluge Hyper-Heuristic," in Modeling, Analysis, and Applications in Metaheuristic Computing: Advancements and Trends, 2012.
30. D. P. Clements and D. E. Joslin, "Squeaky Wheel Optimization," Journal of Artificial Intelligence Research, vol. 10, pp. 353–373, May 1999.
31. R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
32. D. Abramson, "Constructing School Timetables using Simulated Annealing: Sequential and Parallel Algorithms," Management Science, vol. 37, no. 1, pp. 98–113, 1991.
33. D. Abramson, M. Krishnamoorthy, and H. Dang, "Simulated Annealing Cooling Schedules for the School Timetabling Problem," 1997. Available: http://citeseer.nj.nec.com/104097.html
34. E. Poupaert and Y. Deville, "Simulated Annealing with Estimated Temperature," AI Communications, vol. 13, pp. 19–26, 2000.
35. J. Thompson and K. A. Dowsland, "General Cooling Schedules for a Simulated Annealing Based Timetabling System," in Practice and Theory of Automated Timetabling, First International Conference: Selected Papers (E. Burke and P. Ross, Eds.), Edinburgh, U.K., August/September 1995; Lecture Notes in Computer Science 1153, Springer-Verlag, pp. 345–363, 1996.
35. J. Thompson, and K. A. Dowsland, ”General Cooling Schedules for a Simulated Annealing Based Timetabling System”, In: Practice and Theory of Automated Timetabling, First International Conference: Selected Papers, (E. Burke & P. Ross , Eds.), Edinburgh, U.K., August/September 1995. Lecture Notes in Computer Science 1153, SpringerVerlag, 345363, 1996.