Random Sample-based Software Defect Prediction with Semi-supervised Learning
Table of Contents
Figure 2: SBFL Working Process
Figure 3: Spectrum-based Fault Localization in a nutshell
Figure 4: Log-log plot of module sizes with LOC
Figure 5: Flow graph of a program 
Figure 6: Comparison of F-measures with the classifier of conventional learners and semi-supervised learners
Figure 7: The box-plot of the F-measures for JM1 datasets with different sampling rate
Figure 8: The box-plot of the F-measures for MW1 datasets with different sampling rate
Figure 9: The box-plot of the F-measures for KC2 datasets with different sampling rate
Figure 10: The box-plot of the F-measures for PC4 datasets with different sampling rate
Figure 11: The box-plot of the F-measures for PC1 datasets with different sampling rate
Figure 12: The box-plot of the F-measures for PC3 datasets with different sampling rate
Figure 13: The box-plot of the F-measures for PC5 datasets with different sampling rate
SQA Software Quality Assurance
LOC Lines of Code
AUC Area Under the ROC Curve
AdaBoost Adaptive Boosting
RCForest Random Committee Forest
NASA The National Aeronautics and Space Administration
ML Machine Learning
ST Software Testing
ET Exhaustive Testing
TRM Technical Review Meeting
SF Spam Filtering
OA Online Applications
HRS Hybrid Recommender System
SBFL Spectrum Based Fault Localization
TC Test cases
KNN K Nearest Neighbors Algorithm
SVM Support Vector Machine Algorithm
ID3 Iterative Dichotomiser 3
CART Classification And Regression Tree
MARS Multivariate Adaptive Regression Splines
Software Defects: A software defect is introduced when the outputs of the software do not meet the expected requirements. Software defects are also called software bugs or faults.
AUC: AUC means Area Under the ROC Curve. It is used in classification analysis to identify which of several models predicts best, and it is a comprehensive measure that supports decisions in prediction research. Some uses of AUC can be found in , , and .
Software Quality Assurance: Software quality assurance is a process that helps developers deliver quality software products to customers. It consists of specifications that the software development staff follow. An overview of software quality assurance can be found in .
Exhaustive Testing: Exhaustive testing is a testing approach that tests the software with all possible combinations of data for a specific function. In most cases, it is not possible to test with all possible data or values.
Spam Filtering: Spam filtering is the process of filtering unwanted emails and removing them from the email inbox. In machine learning, several classifiers are used to filter emails.
Technical Review Meeting: A technical review meeting is a component of software quality assurance. It helps developers make decisions through discussion with other technical staff. Detailed guidelines are described in .
Software Failure: Software fails when the software project contains faults. Software can fail for different reasons. An overview of the reasons for software failure is described in .
Software Testing: Software testing is a software development activity that is applied to ensure the quality of the software.
µ Sampling rate
Dn List of datasets
Cn List of classifiers
P Parameter settings per classifier
F F-measure per classifier on datasets
AUC AUC per classifier on datasets
tp Number of defective modules that are predicted as defective
fp Number of defect-free modules that are predicted as defective
fn Number of defective modules that are predicted as defect-free
tn Number of defect-free modules that are predicted as defect-free
a11 Statement executed and found bug
a10 Statement executed and found no bug
a01 Statement not executed and found bug
a00 Statement not executed and found no bug
S Successful test cases
F Failed test cases
L Halstead program length
D Halstead difficulty
n1 Unique program operators
n2 Unique program operands
N1 Total number of program operators
N2 Total number of program operands
T Halstead time
E Halstead effort
Software defect prediction has drawn much attention in recent years because it supports rapid development and quality releases of software. It helps to identify defective modules and to better understand and control the quality of the software. Machine learning techniques are currently applied to predict defect-prone modules. However, current methods fail to address two issues. Firstly, data from previous projects is used to predict defects, but such data is often not available, or it is not similar to the new modules in terms of functionality. It is also hard to test all the modules because testing is costly and time-consuming. So, for a large software project, we can choose a small part of the modules, prepare test data for them, and predict whether each of the remaining modules is defective or defect-free. Secondly, not all modules of the software contain defects. Only some modules are defective and most are defect-free, so the datasets are imbalanced. In this work, we address these two practical issues and describe two methods for predicting defects: random sampling with conventional learners and random sampling with semi-supervised learners. We conduct our experiments on a popular and widely used repository, the PROMISE NASA datasets. Our experiments show that random sampling with semi-supervised learners performs better than random sampling with conventional learners for defect prediction. We use two evaluation metrics, F-measure and AUC, to measure performance. The experimental results show that the semi-supervised learners achieve on average an F-measure of 85.84% and an AUC of 82.28%, which is a significant improvement over the conventional machine learners. The results also show that a smaller sampling rate can achieve the highest prediction performance, which makes the approach practical to apply.
Keywords: Software Quality, Software Defect Prediction, Semi-supervised Learning, Random Sampling, Machine Learning
Software Quality Assurance (SQA) is an indispensable part of ensuring the quality of software. SQA is a process that aims at a defect-free and successful project. To ensure quality, SQA comprises a set of activities, e.g., formal technical reviews (i.e., meetings with all staff to discuss the requirements that relate to quality), intensive testing (i.e., exhaustive testing of the project, which is impossible in most cases), manual code checking (i.e., a formal type of review or static testing, led by a trained moderator and governed by rules and checklists, that lists the defects found), applying the testing strategy (i.e., one of the many strategies maintained by the development and testing teams and followed by management), ensuring process adherence (i.e., a product evaluation and process monitoring task), technical review meetings and reports (i.e., reports that keep the information of technical reviews relevant to SQA), performing SQA audits (i.e., checking the activities performed by developers), controlling change (i.e., controlling the impact of changes), applying software engineering techniques (i.e., techniques that help the development staff achieve high-quality products), and a quality management plan (i.e., a plan designed for the SQA process).
SQA ensures the quality of software projects through well-organized processes and activities, but it is time-consuming and costly. In a practical testing environment, much effort, time, and expense are spent on modules to find only a few defects, or even a single unexpected defect. In this case, software defect prediction can help by predicting the defect-prone modules among all modules, which is useful for developers and for development organizations. Recently, software defect prediction methods have been applied to support SQA activities in a cost-effective way by predicting defect-prone modules.
A number of studies and prediction models (see , , , , and ) have been proposed to predict defect-prone modules among all the modules of software projects. Many researchers have investigated static code attributes obtained from selected metrics (e.g., Lines of Code, McCabe, etc.) and proposed prediction models based on these code attributes. Other work uses change history data to predict defect-prone modules. These approaches extract features from software modules that may be defective or defect-free and then apply statistical or machine learning methods to predict the defect-prone modules of a software project.
Many constraints, e.g., quick releases, limited testing resources, and rapidly changing customer demands, make it hard to deliver defect-free modules. Meeting customer demands would also require exhaustive testing, which is costly and time-consuming and needs a lot of effort and testing resources.
In practice, it is difficult to test all the modules. Sometimes only functionality tests are run for a quick release, without following the rules of SQA. When releasing quality software, whether by testing all the modules or by testing some modules and predicting whether the rest are defective or defect-free, the following issues arise.
Firstly, to release quality software that satisfies customers, some important modules are tested and the rest are predicted using data from previous projects that are functionally similar to the new project. But such similarity is neither obvious nor always valid.
Secondly, not all modules contain defects. Some modules may be defective and the others are defect-free. It is not necessary to test all the modules if only a minority of them contain defects.
To address these problems, we propose random sampling based software defect prediction to find defect-prone modules. We apply semi-supervised machine learners, bearing in mind that few modules are defective, and compare them with conventional machine learners. In our approach, we conduct our experiments on a small percentage of the modules, of which only a small percentage is defective.
To deal with the first problem (i.e., the use of previous project data), we select previous project data (i.e., training data), choose a small amount of it to learn from, and construct a classification model to predict the rest of the modules. For the current project, we can then select a small part of the project's modules and build the prediction model using the current project's data. Our experiments show that a small sample can achieve the same prediction results as a large sample. In other words, increasing the sampling rate does not significantly change the prediction results.
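The sampling step described above can be sketched as follows. This is a minimal illustration, not the procedure used in the experiments: the function name `sample_modules`, the `seed` parameter, and the rounding rule are assumptions made for the example. The sampled modules would be tested and labeled, and a classifier trained on them would predict the remaining ones.

```python
import random


def sample_modules(modules, rate, seed=0):
    """Randomly split modules into a sample (to test and label) and the rest (to predict).

    `rate` plays the role of the sampling rate (e.g. 0.05 for 5%).
    """
    if not 0.0 < rate <= 1.0:
        raise ValueError("rate must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    # Assumption: take at least one module, and never more than are available.
    k = min(len(modules), max(1, round(rate * len(modules))))
    chosen = set(rng.sample(range(len(modules)), k))
    sampled = [m for i, m in enumerate(modules) if i in chosen]
    remaining = [m for i, m in enumerate(modules) if i not in chosen]
    return sampled, remaining


if __name__ == "__main__":
    modules = [f"module_{i}" for i in range(200)]  # hypothetical module names
    sampled, remaining = sample_modules(modules, rate=0.05)
    print(len(sampled), "modules sampled,", len(remaining), "left to predict")
```

Repeating the split with different seeds gives the kind of repeated runs from which a distribution of scores per sampling rate can be drawn.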
To deal with the second issue (i.e., not all modules are defective), we select semi-supervised machine learners, bearing in mind that only a small percentage of the modules is defective and the rest are defect-free. We compare the semi-supervised learners with conventional machine learners trained on the same training sets. We use F-measure and AUC to measure the performance of the classifiers. We find that the semi-supervised learner RCForest performs better than the conventional learners.
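As a brief illustration of the first metric, the F-measure can be computed from confusion-matrix counts. The sketch below assumes the standard definition (harmonic mean of precision and recall) with "defective" as the positive class; the function name `f_measure` and the example counts are hypothetical.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall for the defective class.

    tp: defective modules predicted as defective
    fp: defect-free modules predicted as defective
    fn: defective modules predicted as defect-free
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0  # no correct positive predictions at all
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to show the calculation.
    print(round(f_measure(tp=20, fp=5, fn=7), 4))
```

Because true negatives do not enter the formula, the F-measure is not inflated by the large number of defect-free modules in an imbalanced dataset.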
This thesis has five chapters, described below.
Chapter 1: Chapter 1 describes the background of this thesis, the problem statement (i.e., the current status of the problem), and the contribution of our work.
Chapter 2: Chapter 2 presents the related work. This literature review is by no means a complete study. I list the defect prediction models that have been commonly used in the last few years. These are methods based on program execution information and methods based on extracted static code properties. Program execution information based methods are in turn of two types: the spectrum-based technique and the slice-based technique. In this chapter, I show the working process of these two techniques by analysing two C programs. After that, I discuss the defect prediction models based on extracted static code properties, where machine learning methods are employed to obtain effective models and good prediction performance. I also list two issues that arise when machine learners use previous project data for defect prediction.
Lastly, this chapter gives a brief introduction to defect prediction models and shows how they work by diagnosing C programs.
Chapter 3: Chapter 3 describes the classification algorithms, both the conventional machine learners and the semi-supervised learners. It shows how the classifiers are used with random sampling and gives the working process of the selected classifiers step by step as algorithms.
Chapter 4: Chapter 4 presents the experimental results. It reports the performance of the classifiers on each dataset. The semi-supervised machine learner RCForest shows significantly better performance on the imbalanced datasets. Box-plots show that increasing the sampling rate does not significantly affect the prediction results. This chapter also reports the F-measure and AUC for the sampling rates 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, and 50%. For all sampling rates, RCForest achieves the highest F-measure and AUC among the machine learners.
Chapter 5: Chapter 5 concludes the thesis. It summarizes the experimental results and highlights points that can be used in a real software development environment.
This chapter contains the literature review of our study. The review is not a complete study; it covers the software defect prediction techniques that are used in the literature.
Automatic identification of defect-prone modules is an important part of software quality assurance. It contributes to the success of products and supports quality assurance. Recently, software defect prediction models have been proposed to make quality assurance easier and to help produce quality products. Generally, these models can be categorized in two ways:
1. Program Execution Information Based
2. Extract Static Code Properties Based
Both categories are described below.
According to the literature, program execution information based software defect prediction identifies suspicious program behaviours and falls into two categories:
A. Spectrum-based technique
B. Slice-based technique
The description and working procedure of these techniques are given below.
Spectrum-based Technique
In this segment, we describe program spectra and program elements and how they are used in spectrum-based fault localization. We also describe twelve types of program spectra and how they help to locate faults.
A program spectrum characterizes the behaviour of a program and represents a signature of the program. According to Le, program spectra are a log of the execution traces of a program generated when it is run. In other words, program spectra are program traces collected during the execution of a program.
More generally, a program spectrum is a collection of execution traces or data about program entities (e.g., statements, branches, paths or basic blocks, methods, components, etc.) for a specific test suite. Table 1 gives a comprehensive overview of program spectra.
Statement Spectra. Statement spectra record the statements that are executed at run-time. Statement-count spectra count how many times each statement is executed, whereas statement-hit spectra record only whether each statement was executed.
Branch / Block Spectra. Branch spectra record the conditional branches or blocks that are executed at run-time. A block-hit spectrum indicates whether or not a conditional branch or block of code was executed in a particular run. A block-count (branch-count) spectrum gives the number of times that a conditional branch was executed.
Path Spectra. Path spectra come in two forms, path-hit and path-count. A path-hit spectrum indicates whether or not a path was executed. A path-count spectrum gives the number of times that the path was executed.
Complete-path Spectra. The complete path of a program is recorded in complete-path spectra.
Data-dependence Spectra. Data-dependence spectra record the set of definition-use pairs. A data-dependence-hit spectrum indicates whether or not a definition-use pair was exercised, and a data-dependence-count spectrum gives the number of times that it was exercised.
Output Spectra. Output spectra record the output that is produced during program execution.
Execution-trace Spectra. Execution-trace spectra record the sequence of program statements that are executed.
Time Spectra. Time spectra record the execution time of program elements (e.g., program functions).
|2||Statement-count||A statement execution count|
|3||Block-hit||Executed conditional branches|
|4||Block-count||A conditional branch execution count|
|6||Path-count||Each path execution count|
|7||Complete-path||Complete path execution|
|8||Data-dependence-hit||Definition-use pairs execution|
|9||Data-dependence-count||A definition-use pair execution count|
|11||Execution-trace||Execution trace produced|
|12||Time-Spectra||Execution time of program elements|
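The difference between the hit and count variants listed in Table 1 can be shown with a small sketch. It assumes that an execution trace is available as a list of executed statement numbers; the function name `statement_spectra` and the example trace are hypothetical.

```python
from collections import Counter


def statement_spectra(trace):
    """Derive statement-hit and statement-count spectra from one execution trace.

    `trace` is the sequence of statement numbers in the order they were executed.
    """
    counts = Counter(trace)  # statement-count: how often each statement ran
    hits = set(counts)       # statement-hit: which statements ran at all
    return hits, counts


if __name__ == "__main__":
    # Hypothetical trace: statements 2 and 3 form a loop body run several times.
    trace = [1, 2, 3, 2, 3, 2, 4]
    hits, counts = statement_spectra(trace)
    print("hit:", sorted(hits))
    print("count:", dict(sorted(counts.items())))
```

The same idea carries over to blocks, paths, and definition-use pairs by changing what a trace entry identifies.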
Program Elements and Activity
Program elements are the statements, branches, paths or basic blocks, methods, and components of a program. In SBFL, various formulas are applied to the program elements to obtain suspiciousness scores, which are used to find the faults of the program. The program elements are ranked by their suspiciousness scores through statistical analysis. Developers can then locate bugs by investigating the ranked list of program entities.
Origin of Spectrum-based Fault Localization
Spectrum-based fault localization (SBFL) is a dynamic process. The technique analyses program spectra to correlate failures with program elements and assigns a suspiciousness score to each element using various formulas. The program elements are ranked on the basis of these scores, which helps to locate bugs as well as the root cause of failures.
The main idea of SBFL follows Pinpoint, a traditional problem determination technique. This subsection gives an overview of how the Pinpoint framework locates bugs.
Pinpoint is a framework for analysing the root cause of a failure. It is developed on the J2EE platform. To find the root cause of a failure, the framework does not need any knowledge of the application components. This problem diagnosis tool is targeted at large and dynamic application environments such as e-commerce systems, web-based email services, and search engines.
Pinpoint consists of three parts:
- Communication layer that traces the client request
- Failure detection
- Data clustering analysis
Communication Layer. The communication layer works with client request traces. To find the root cause of a failure, Pinpoint dynamically records the components (i.e., program elements) that are used to satisfy each individual client request. For more details, see Figure 1.
Failure Detection. Pinpoint monitors the system and maintains a fault log to detect whether each request succeeds or fails. It also records a trace log for each client request, which is called live tracing.
Data Clustering Analysis. Pinpoint combines the trace log (i.e., the record of each client request) with the fault log (i.e., the pass and fail status) to detect failures. In data clustering analysis, the data analysis engine identifies faulty components (i.e., faulty software statements or program elements) based on statistical analysis of the trace log and fault log of the components.
Diagnosis of a Program
This section describes the diagnosis process based on spectrum-based fault localization. In this process, two types of logs are recorded for the test cases and program elements: a fault log and a trace log of the executed program elements. Program elements are exercised by test cases, and the success and fail statuses are recorded. Figure 2 illustrates the diagnosis process.
After the two types of log data are collected, SBFL algorithms (called similarity coefficients) are applied to the data to obtain suspiciousness scores. Statistical analysis of these scores helps to identify bugs in the selected code elements.
|Program Element||TC1||TC2||TC3||Suspiciousness Score||Rank|
|(Success / Fail)||S||S||F|
Spectrum-based fault localization analyses the results of test cases (i.e., success and fail) and the hit information of each statement. Table 2 gives an overview of the diagnosis process. It contains the program elements (i.e., statements), the test cases, the test results, and the suspiciousness score and rank of each program element, which indicate the buggy statement. The rank list is created by statistical analysis, where a higher rank indicates a more defect-prone statement. According to the sample suspiciousness data in the table, statement 2 is the most suspicious statement.
Similarity Coefficient / Algorithms
Spectrum-based fault localization identifies which lines of code most likely cause test failures or carry bugs. This approach saves considerable time and effort in the software testing industry. To run it, test cases are needed to execute the program elements. Two types of data are recorded:
1. A list of test results (i.e., the success or fail status of each test case run on the program elements)
2. Logs of executed statements (i.e., arranged by test case)
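The two recorded data types can be combined into per-statement counts, which is the input that similarity coefficients work on. The sketch below is illustrative only: it assumes the coverage log is a matrix with one row per test case and one column per statement, reads "found bug" as "the test case failed", and uses a hypothetical function name `spectrum_counts` and made-up example data.

```python
def spectrum_counts(coverage, outcomes):
    """Count, per statement, how executions line up with test outcomes.

    coverage[t][s] is True if test case t executed statement s.
    outcomes[t] is "S" (success) or "F" (fail) for test case t.
    Returns one dict per statement with the keys a11, a10, a01, a00.
    """
    n_statements = len(coverage[0]) if coverage else 0
    result = []
    for s in range(n_statements):
        counts = {"a11": 0, "a10": 0, "a01": 0, "a00": 0}
        for row, outcome in zip(coverage, outcomes):
            failed = outcome == "F"
            if row[s] and failed:
                counts["a11"] += 1  # executed, test failed
            elif row[s]:
                counts["a10"] += 1  # executed, test passed
            elif failed:
                counts["a01"] += 1  # not executed, test failed
            else:
                counts["a00"] += 1  # not executed, test passed
        result.append(counts)
    return result


if __name__ == "__main__":
    # Hypothetical data: three test cases (two pass, one fails), three statements.
    coverage = [
        [True, True, False],   # TC1
        [True, False, True],   # TC2
        [True, True, False],   # TC3
    ]
    outcomes = ["S", "S", "F"]
    for number, counts in enumerate(spectrum_counts(coverage, outcomes), start=1):
        print("statement", number, counts)
```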
To analyse the test results (i.e., success or failure data) and the trace log (i.e., the logs of executed statements), many algorithms or similarity coefficients are used to generate the suspiciousness score of each program element.
More than 40 algorithms are found in the literature. In our review, we list well-known algorithms that have been used for diagnosis accuracy, such as Tarantula, AMPLE, Ochiai, and Jaccard. Table 4 describes these algorithms, and they are explained in the section below. To calculate the suspiciousness score, these algorithms use a binary combination of 0 and 1 that symbolizes whether or not the statement is hit and whether or not a fault is found. The notation is explained in Table 3.
In this section, one similarity coefficient is discussed in detail, together with how it is applied to locate faults in program code.
|a11||Statement executed and found bug|
|a10||Statement executed and found no bug|
|a01||Statement not executed and found bug|
|a00||Statement not executed and found no bug|
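With the notation of Table 3, three of the coefficients can be sketched in a few lines. The formulas below are the forms commonly cited for Tarantula, Ochiai, and Jaccard; it is an assumption that Eqs. 2.1, 2.3, and 2.4 of this thesis match them exactly, and returning 0 when a denominator is zero is a choice made for this example. AMPLE is left out because it works on method call sequences rather than on these four counts alone.

```python
import math


def tarantula(a11, a10, a01, a00):
    """Share of failing runs that hit the statement, relative to passing runs."""
    fail_ratio = a11 / (a11 + a01) if a11 + a01 else 0.0
    pass_ratio = a10 / (a10 + a00) if a10 + a00 else 0.0
    total = fail_ratio + pass_ratio
    return fail_ratio / total if total else 0.0


def ochiai(a11, a10, a01, a00):
    """a11 / sqrt((a11 + a01) * (a11 + a10)); a00 is unused by this coefficient."""
    denominator = math.sqrt((a11 + a01) * (a11 + a10))
    return a11 / denominator if denominator else 0.0


def jaccard(a11, a10, a01, a00):
    """a11 / (a11 + a01 + a10); a00 is unused by this coefficient."""
    denominator = a11 + a01 + a10
    return a11 / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical counts for one statement.
    counts = dict(a11=1, a10=1, a01=0, a00=1)
    for coefficient in (tarantula, ochiai, jaccard):
        print(coefficient.__name__, round(coefficient(**counts), 4))
```

Sorting the statements by any one of these scores in descending order yields the rank list that developers inspect.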
|Algorithm||Where Introduced||Where Used||Implementation Language||Used Spectra||Formula|
|Tarantula||||, , ||C||Statement-hit spectra||Eq. 2.1|
|AMPLE||||, , , , ||Java||Hit spectra of method call sequences||Eq. 2.2|
|Ochiai||||||C||Block-hit spectra||Eq. 2.3|
|Jaccard||||, ||C||Statement-hit spectra||Eq. 2.4|
In the software testing industry, testers accumulate large amounts of testing data in different testing environments. Sometimes, test case data are used to identify defect-prone code or states.
Tarantula takes advantage of these data (i.e., the success or fail information of test cases) to identify bugs in program elements (i.e., statements). In addition, Tarantula uses the statement coverage information from successful and failed runs and assigns a suspiciousness score to every program statement. James A. Jones et al. developed Tarantula, which delineates a program and its executions based on a test suite. It is an example of spectrum-based fault localization.
Its benefits reported in the literature are given below:
a) This system improves software quality,
b) It reduces the number of delivered faults,
c) It aims to locate fault in specific area of the program, and
d) It reduces the time and cost in debugging.
Tarantula applies statement-hit spectra to C programs and visualizes the suspiciousness of each statement. If we diagnose the program with statement-hit spectra, the suspiciousness score of each statement is calculated by Eq. 2.1.
|6||mn = tm;|
|7||sum = tm;||sum = tm;||sum = tm;|
|8||numb = 1;||numb = 1;||numb = 1;|
|10||while(tm >= 0)||while(tm >= 0)||while(tm >= 0)||while(tm >= 0)|
|12||If (max <tm)|
|13||max = tm;|
|15||mn = tm;|
|16||sum +=tm;||sum +=tm;||sum +=tm;|
|18||tm = readInt();||tm = readInt();||tm = readInt();||tm = readInt();|
|21||av = sum /numb;||av = sum /numb;|
|22||printf(“Max = %d”,max);|
|23||printf(“Min = %d”,mn);|
|24||printf(“Max = %d”,mv);||printf(“Max = %d”,mv);|
|25||printf(“Avg = %d”,sum);||printf(“Avg = %d”,sum);|
|26||printf(“Num = %d”,numb);||printf(“Num = %d”, numb);|
In this chapter, we discuss how the datasets are prepared using McCabe, Halstead, and LOC metrics. An example illustrates how the datasets are formed. We also describe each metric in detail.
All the data come from the NASA PROMISE datasets. We use seven datasets from the repository: JM1, MW1, KC2, PC4, PC1, PC3, and PC5. The datasets are described in detail in Tables 5.1 and 5.2. They were prepared by extracting McCabe, Halstead, and LOC metrics from source code. To explain the Halstead attributes, we consider the following C program.
main()
{
float x, y, z, average;
scanf("%f %f %f", &x, &y, &z);
average = (x + y + z) / 3;
printf("Average is = %f", average);
}
From this code, we extract the HALSTEAD_OPERATOR, HALSTEAD_OPERANDS, HALSTEAD_PROGRAM_LENGTH, HALSTEAD_VOLUME, HALSTEAD_DIFFICULTY, HALSTEAD_EFFORT, HALSTEAD_TIME, and HALSTEAD_DELIVERED_BUGS attributes.
Here, the number of unique operators, n1, is 10. These are main, ( ), , float, &, =, +, /, printf, and scanf.
The number of unique operands, n2, is 7. These are x, y, z, average, "%f %f %f", 3, and "Average is = %f". The total number of operators, N1, is 16 and the total number of operands, N2, is 15.
So, in total, N is 31, and HALSTEAD_OPERATOR and HALSTEAD_OPERANDS are 16 and 15 respectively.
HALSTEAD_PROGRAM_LENGTH, $L = n_1 \log_2 n_1 + n_2 \log_2 n_2 = 10 \log_2 10 + 7 \log_2 7 \approx 52.9$
where $n_1 = 10$ and $n_2 = 7$. So, HALSTEAD_PROGRAM_LENGTH = 52.9.
HALSTEAD_VOLUME, $V = N \log_2 n = 31 \log_2 17 \approx 126.7$
where $N = N_1 + N_2 = 31$ and $n = n_1 + n_2 = 17$. So, HALSTEAD_VOLUME is 126.7.
HALSTEAD_DIFFICULTY, $D = \frac{n_1}{2} \times \frac{N_2}{n_2} = \frac{10}{2} \times \frac{15}{7} \approx 10.7$
For HALSTEAD_DIFFICULTY, the number of unique operators, $n_1$, is 10, and we get the value 10.7.
HALSTEAD_EFFORT, $E = D \times V = 10.7 \times 126.7 \approx 1355.7$
HALSTEAD_TIME, $T = \frac{E}{18} = \frac{1355.7}{18} \approx 75.3$ seconds
HALSTEAD_DELIVERED_BUGS, $B \approx 0.04$
For HALSTEAD_EFFORT, HALSTEAD_TIME, and HALSTEAD_DELIVERED_BUGS, we get 1355.7, 75.3 seconds, and 0.04 respectively.
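The Halstead calculations above can be reproduced with a short sketch. The function name `halstead` and the dictionary keys are hypothetical, and the delivered-bugs estimate is left out because its formula is not shown in this section. Because the code does not round intermediate values, the effort and time it reports differ slightly from the hand-rounded figures in the text.

```python
import math


def halstead(n1, n2, N1, N2):
    """Halstead measures from unique (n1, n2) and total (N1, N2) operator/operand counts."""
    N = N1 + N2  # total number of operators and operands
    n = n1 + n2  # vocabulary: unique operators plus unique operands
    length = n1 * math.log2(n1) + n2 * math.log2(n2)
    volume = N * math.log2(n)
    difficulty = (n1 / 2) * (N2 / n2)
    effort = difficulty * volume
    time_seconds = effort / 18
    return {
        "length": length,
        "volume": volume,
        "difficulty": difficulty,
        "effort": effort,
        "time_seconds": time_seconds,
    }


if __name__ == "__main__":
    # Counts taken from the C example: n1 = 10, n2 = 7, N1 = 16, N2 = 15.
    for name, value in halstead(10, 7, 16, 15).items():
        print(f"{name}: {value:.1f}")
```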
These metrics were also used to extract attributes from the NASA programs. The next sections give more details of all the datasets used in our experiments, including how many modules and attributes each dataset contains and its ratio of defective to defect-free modules.
The PROMISE NASA repository is a well-known data repository whose programs were used for ground systems or satellite systems. The programs were written in either C or C++. The datasets were created by extracting metrics from the programs. In this section, I describe the dataset attributes together with the extraction metrics.
The JM1 dataset was created from the NASA Metrics Data Program. The program is written in C and is a real-time predictive ground system. The data were obtained by extracting McCabe and Halstead metrics from the source code. JM1 contains 7782 modules (instances) and 22 attributes: five (5) different lines-of-code measures, three (3) McCabe measures, twelve (12) Halstead measures (i.e., 4 base and 8 derived Halstead measures), one (1) branch count, and one (1) goal field. The goal field indicates whether a module is defective or not.
The MW1 dataset contains 38 attributes and 253 modules, of which only 27 are defective and 226 are defect-free. The KC2 dataset has 22 attributes and 522 modules, of which 107 are defective. The PC4 dataset has 38 attributes and 1458 modules, of which 178 are defective. The PC1 dataset contains 38 attributes and 705 modules, of which only 61 are defective. The PC3 dataset has 38 attributes and 1077 modules. This dataset contains 943 defective modules. This dataset contains the highest number of defective modules among all of our experimental datasets. Lastly, the PC5 dataset contains 39 attributes and 17186 modules and is the largest of the experimental datasets. It contains 516 defective and 16670 defect-free modules, so the ratio of defective to defect-free modules in PC5 is approximately 1:32.3.
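The degree of imbalance follows directly from the module counts. The sketch below uses the counts quoted above for five of the datasets; the function name `defect_summary` and the dictionary layout are assumptions made for the illustration.

```python
def defect_summary(total, defective):
    """Return the defect rate and the number of defect-free modules per defective one."""
    defect_free = total - defective
    return defective / total, defect_free / defective


# (total modules, defective modules) as quoted in the text above.
DATASETS = {
    "MW1": (253, 27),
    "KC2": (522, 107),
    "PC4": (1458, 178),
    "PC1": (705, 61),
    "PC5": (17186, 516),
}

if __name__ == "__main__":
    for name, (total, defective) in DATASETS.items():
        rate, per_defective = defect_summary(total, defective)
        print(f"{name}: {rate:.1%} defective, ratio 1:{per_defective:.1f}")
```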
Every dataset contains one attribute named label. This attribute marks the defective modules with a Boolean identifier, 0 or 1; sometimes Y or N is used instead. The defective and defect-free modules are identified in all the datasets as follows:
The datasets differ in size and functionality. Some datasets are larger than others, and some contain fewer defective modules. The JM1 dataset contains the most defective modules compared with the other datasets. The datasets are described in detail in Table 10 in section 5.1.
Figure 4 portrays the modules of all datasets. This log-log plot shows the module sizes in our data. For example, there are 253 modules in the MW1 dataset; most of them are under 100 lines of code, and very few are longer than 100 lines of code. There are 7782 modules in the JM1 dataset; some modules are longer than 1000 lines of code, though most are under 1000 lines of code. The largest dataset, PC5, contains 17186 modules; most of them are under 1000 lines of code, and very few are about 1000 lines of code long.
As an example, we show the first module of the JM1 dataset in Table 8. According to the label attribute, the first module is defect-free. It consists of 14 lines of code with 7 branches. Table 8 gives the detailed description of the first module of the JM1 dataset.
The JM1 dataset contains 5 LOC counts, 12 Halstead attributes, 3 McCabe attributes, 1 branch count, and 1 output defect measure. The LOC counts are LOC_BLANK, LOC_CODE_AND_COMMENT, LOC_COMMENTS, LOC_EXECUTABLE, and LOC_TOTAL.
All LOC counts are numeric. From the given C example, we get 0 LOC_BLANK, 0 LOC_CODE_AND_COMMENT, 0 LOC_COMMENTS, 1 LOC_EXECUTABLE, and 7 LOC_TOTAL. For the first module of the JM1 dataset, we get 1 LOC_BLANK, 0 LOC_CODE_AND_COMMENT, 0 LOC_COMMENTS, 11 LOC_EXECUTABLE, and 14 LOC_TOTAL.
The JM1 dataset contains 3 McCabe attributes: CYCLOMATIC COMPLEXITY, DESIGN COMPLEXITY, and ESSENTIAL COMPLEXITY.
CYCLOMATIC COMPLEXITY measures the complexity of the decision structure of a program or module. It counts the linearly independent paths that should be tested. Its minimum value is always 1, meaning that a module or program contains at least one flow path. It helps developers and testers find the independent paths of the program, which improves code coverage: testers can ensure that every path has been tested at least once. CYCLOMATIC COMPLEXITY is computed by V(G) = E − N + 2, where E is the number of edges and N is the number of nodes of the flow graph. It can also be computed by V(G) = P + 1, where P is the number of predicate nodes (i.e., nodes that contain a condition). A CYCLOMATIC COMPLEXITY between 1 and 10 means that the program is well structured, well written, and highly testable. Between 10 and 20, the code is complex and has medium testability. Between 20 and 40, the code is very complex and has low testability. Above 40, the program is not testable at all.
To compute the CYCLOMATIC COMPLEXITY of the program from the flow graph in Figure 5, we use equation 3.1.
$$V(G) = E - N + 2 \tag{3.1}$$
where E is the total number of edges and N is the total number of nodes.
From Figure 5, we get E = 9 edges and N = 7 nodes, so the CYCLOMATIC COMPLEXITY of the program is computed from the flow graph as follows.
V (G) = 9 − 7 + 2 = 4
The value of V(G) is 4, which lies between 1 and 10, so the program is well written and highly testable.
In our dataset, the CYCLOMATIC COMPLEXITY of the first module of JM1 is also 4. This means that it contains 4 linearly independent paths; it is well written and highly testable.
DESIGN COMPLEXITY is another essential McCabe measure. The DESIGN COMPLEXITY of a module is calculated from the reduced flow graph of the module. It measures the decision logic that controls calls to subroutines. It also identifies managerial modules, which gives an indication of design reliability, integration, and testability.
ESSENTIAL COMPLEXITY measures unstructured constructs. In other words, it measures the “structuredness” of the decision logic of a software module. Its value ranges from 1 to v, depending on the amount of unstructured decision logic in the code. It is used to predict maintenance effort and helps in the modularization process. The first module of the JM1 dataset has a DESIGN COMPLEXITY of 3 and an ESSENTIAL COMPLEXITY of 1.
JM1 contains 12 Halstead attributes, all numeric. One attribute is content related (i.e., HALSTEAD_CONTENT). Two attributes are effort and error related (i.e., HALSTEAD_EFFORT and HALSTEAD_ERROR_EST). Four attributes are operator and operand related (i.e., NUM_OPERANDS, NUM_OPERATORS, NUM_UNIQUE_OPERANDS, and NUM_UNIQUE_OPERATORS). The others relate to difficulty, length, level, programming time, and volume (i.e., HALSTEAD_DIFFICULTY, HALSTEAD_LENGTH, HALSTEAD_LEVEL, HALSTEAD_PROG_TIME, and HALSTEAD_VOLUME).
In detail, the program length is the sum of the total number of operators and the total number of operands, as given in equation 3.2.
$$N = N_1 + N_2 \tag{3.2}$$
where N is the program length, N1 is the total number of operators, and N2 is the total number of operands.
The vocabulary size is the sum of the number of unique operators and the number of unique operands, as given in equation 3.3.
$$n = n_1 + n_2 \tag{3.3}$$
where n is the vocabulary size, n1 is the number of unique operators, and n2 is the number of unique operands.
Program volume describes the size of the implemented algorithm and carries information about the operations it performs. Halstead's volume is computed by equation 3.4.
$$V = N \log_2 n \tag{3.4}$$
The Halstead difficulty level, or error proneness, depends on the number of unique operators, the total number of operands, and the number of unique operands, and is computed by equation 3.5.
$$D = \frac{n_1}{2} \times \frac{N_2}{n_2} \tag{3.5}$$
The program level is the inverse of the difficulty, as depicted in equation 3.6.
$$L = \frac{1}{D} \tag{3.6}$$
The Halstead effort is the product of the program volume and the difficulty, as given in equation 3.7.
$$E = V \times D \tag{3.7}$$
The implementation time, T, is proportional to the effort, E. Halstead's implementation time, in seconds, is calculated by dividing the effort by 18, as given in equation 3.8.
$$T = \frac{E}{18} \tag{3.8}$$
Halstead metrics have several advantages. They do not require control flow analysis of the program. They help to predict implementation time, error rate, and effort, and they are useful for project scheduling. On the other hand, they depend on operators and operands, and they are of no use at the design level of a program.
We have discussed the Halstead measures in section 3.1 with an example. Table 3.1 also shows the values of the Halstead attributes for the first module of the JM1 dataset.
As another example, we list the first module of the MW1 dataset, described by LOC, McCabe, and Halstead measures. This module is also non-defective. The MW1 dataset contains 38 measures (mentioned earlier), which is why this module consists of 38 values (i.e., 7 LOC measures, 12 Halstead measures, 4 McCabe measures, and 15 miscellaneous measures). This module consists of 25 lines of code.
Dataset for the first module of MW1 =
5,7,13,0,7,8,4,0.16,4,2,4,1,26,1,0,25,2,73.59, 5.95, 2602.12, 0.15, 84, 0.17, 144.56, 437.59, 0.25, 2, 4, 24, 0.11, 37, 47, 28, 9, 38, 21.88, 25, N.
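The Halstead relations referred to as equations 3.2 to 3.8 can be checked against this module with a short Python sketch. It assumes that the values 37, 47, 28, and 9 in the listing are NUM_OPERANDS, NUM_OPERATORS, NUM_UNIQUE_OPERANDS, and NUM_UNIQUE_OPERATORS, in that order; the listing itself does not name its columns, so this mapping is an assumption, and the function name is hypothetical.

```python
import math


def halstead(n1: int, n2: int, N1: int, N2: int) -> dict:
    """Halstead measures from operator and operand counts.

    n1, n2: number of unique operators and unique operands
    N1, N2: total number of operators and operands
    """
    length = N1 + N2                          # equation 3.2
    vocabulary = n1 + n2                      # equation 3.3
    volume = length * math.log2(vocabulary)   # equation 3.4
    difficulty = (n1 / 2) * (N2 / n2)         # equation 3.5
    level = 1 / difficulty                    # equation 3.6
    effort = volume * difficulty              # equation 3.7
    time_seconds = effort / 18                # equation 3.8
    return {
        "length": length,
        "vocabulary": vocabulary,
        "volume": volume,
        "difficulty": difficulty,
        "level": level,
        "effort": effort,
        "time_seconds": time_seconds,
    }


# Assumed mapping of the MW1 values: 47 operators, 37 operands,
# 9 unique operators, 28 unique operands.
measures = halstead(n1=9, n2=28, N1=47, N2=37)
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```

Under that assumption, the computed length, volume, difficulty, level, effort, and time round to 84, 437.59, 5.95, 0.17, 2602.12, and 144.56, all of which appear in the listing above.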
This chapter describes the conventional machine learners and our proposed classifier. The advantages and disadvantages of the conventional machine learning classifiers are also listed. The technique for comparing the proposed learner with the conventional learners is described, and the proposed method is illustrated with examples.
Predicting defect-prone modules is imperative for large or cross-platform software projects. In particular, software defect prediction aims to predict defects among the software modules. Machine learning classification can be applied to learn from known modules and to predict the defect-proneness of unknown ones. Software metrics are extracted from each software module, and the modules are manually labeled as “defective” or “non-defective”. Classification algorithms then learn from these modules. In our experiment, we analysed defective and defect-free modules from historical data for prediction. We applied random sample-based software defect prediction with conventional and semi-supervised machine learners (e.g., Logistic Regression, Naive Bayes, and J48) for the classification.
In practice, a software system contains hundreds or even thousands of modules. It is hard for development organizations, or even for customers, to test all of them; doing so is time-consuming and expensive. To control and improve software quality, developers can use machine learning algorithms. In past years, researchers have used machine learners to learn from selected software modules whose labels are known. The conventional machine learners commonly used by researchers for defect prediction are Naive Bayes, Logistic Regression, J48, and AdaBoost. These classifiers are briefly explained below.
Before discussing the Naive Bayes classifier, this section gives some background on how a classifier is built.
A classifier can be established in three ways:
- A model of the classification rule is built directly, which gives a discriminative classifier (e.g., KNN, SVM, and decision trees)
- A model of the class probability given the input data is built, which also gives a discriminative classifier (e.g., multi-layer perceptrons)
- A probabilistic model of the data within each class is built, which gives a probabilistic classifier (e.g., model-based classifiers and Naive Bayes)
Naive Bayes is a simple probabilistic approach. As an old and very popular approach, it is used to calculate conditional class probabilities. This classifier is based on Bayes' theorem and assumes independence among the predictors. It works well with small training sets and obtains good results in most cases. As a statistical classification method, it provides a way to calculate probabilities. As a supervised learning approach, it can handle both categorical and continuous-valued attributes.
According to the basics of probability, P(X) is a prior probability; P(X1|X2) and P(X2|X1) are conditional probabilities; and for X = (X1, X2), P(X) = P(X1, X2) is the joint probability, with P(X1, X2) = P(X2|X1)P(X1).
The Bayesian rule is given in equation 4.1.
$$P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)} \tag{4.1}$$
where P(C|X) is the posterior, P(X|C) is the likelihood, P(C) is the prior, and P(X) is the evidence.
So, in words, we can write equation 4.2,
$$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}} \tag{4.2}$$
Thomas Bayes proposed this theorem for statistical calculation. Naive Bayes is well suited to filtering spam messages. For example, in a spam filter, the prior is the probability that a message is spam, the likelihood is the probability of the given words (the inputs) appearing in a spam message, and the evidence is the probability of those words appearing in any message, estimated from the training data.
$$P(C \mid X_1, X_2, \ldots, X_n) = \frac{P(X_1, X_2, \ldots, X_n \mid C)\,P(C)}{P(X_1, X_2, \ldots, X_n)} \tag{4.3}$$
where X1, X2, up to Xn are the inputs, i.e., the words from the training data. The classifier is called naive because it assumes that all the input attributes are independent: one word does not affect the other words in deciding whether or not a message is spam. That is how the Naive Bayes classifier works.
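The spam-filter reading of the Bayesian rule can be illustrated with a minimal Python sketch. The four training messages are hypothetical toy data, not taken from any dataset used in this work, and the add-one (Laplace) smoothing is an assumption introduced only to avoid zero probabilities; the function names are likewise illustrative.

```python
import math
from collections import Counter

# Hypothetical toy training data (message text, class label).
TRAIN = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting schedule today", "ham"),
    ("project meeting notes", "ham"),
]


def train(samples):
    """Count classes and per-class word occurrences."""
    class_counts = Counter(label for _, label in samples)
    word_counts = {label: Counter() for label in class_counts}
    for text, label in samples:
        word_counts[label].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocab


def posterior(text, model):
    """Return P(C | words) for every class C under the naive assumption."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    joint = {}
    for label, count in class_counts.items():
        # log prior + sum of log likelihoods (words treated as independent)
        log_score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            # add-one smoothing: an assumption, not part of the original text
            log_score += math.log((word_counts[label][word] + 1) / denom)
        joint[label] = math.exp(log_score)
    evidence = sum(joint.values())  # P(X): normalises the posteriors
    return {label: value / evidence for label, value in joint.items()}


model = train(TRAIN)
probs = posterior("cheap money", model)
print(max(probs, key=probs.get), probs)
```

With this toy data, the message "cheap money" is assigned to the spam class because both words occur only in the spam examples.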
In practice, this classifier is also used for text classification (e.g., finding articles of interest or classifying web pages by topic), hybrid recommender systems (i.e., systems that apply data mining and machine learning), online applications (e.g., simple emotion modeling), and software defect prediction, where it identifies fault-prone modules in large software systems. Despite these practical uses, it does not work well on imbalanced datasets, and it loses accuracy when the class-conditional independence assumption does not hold.
It is worth mentioning that, because of its simplicity, elegance, and robustness, this classifier is widely used even on datasets with dependent features, where relationships exist among the attributes.
Logistic regression is a statistical technique used to create prediction models. It is also called logit regression. It relates a dependent variable to one or more predictor variables. It also helps researchers to evaluate quantities of interest (e.g., averages, test scores, and outcome variables) in particular datasets. It does not work for prediction if the researchers fail to select suitable independent variables, and it is not a useful regression method when the selected independent variables are irrelevant. However, it performs well when predicting binary outcomes (e.g., whether a candidate is selected or rejected for a job).
Another point is that logistic regression assumes that the data points are independent of each other. If the data points are related, the model overweights those observations (e.g., when comparing two groups that share the same data). It may also fail to predict well on high-dimensional datasets.
In more detail, logistic regression is named after the logistic function, which is given in equation 4.4.
$$f(x) = \frac{L}{1 + e^{-k(x - x_0)}} \tag{4.4}$$
where e is Euler's number (e ≈ 2.71828), L is the maximum value of the curve, k is the steepness of the curve, x is a real number, and x0 is the x-value of the midpoint of the curve.
The logistic function is used to describe the properties of population growth in an environment (e.g., in economics). It produces an S-shaped curve. It takes any real number and, when L = 1, maps it to a value between 0 and 1. The equation of logistic regression is similar to that of linear regression.
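The S-shaped mapping can be seen in a minimal Python sketch of the logistic function. The default parameter values (L = 1, k = 1, x0 = 0) are assumptions chosen to give the standard sigmoid, and the function name is illustrative.

```python
import math


def logistic(x: float, L: float = 1.0, k: float = 1.0, x0: float = 0.0) -> float:
    """Logistic function with maximum L, steepness k, and midpoint x0."""
    return L / (1.0 + math.exp(-k * (x - x0)))


# With the default parameters, large negative inputs approach 0,
# the midpoint maps to 0.5, and large positive inputs approach 1.
for x in (-6, -2, 0, 2, 6):
    print(x, round(logistic(x), 4))
```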
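For example, when modeling gender (male or female) from height, the logistic regression model can give the probability that a person is male given their height, as written in equation 4.5.
$$P(\text{gender} = \text{male} \mid \text{height}) \tag{4.5}$$
Writing the input as X and the default class as Y = 1, equation 4.5 takes the general form described in equation 4.6.
$$P(X) = P(Y = 1 \mid X) \tag{4.6}$$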
It is important to note that the final prediction must be binary (i.e., 0 or 1), so the predicted probability is converted into a class value. To learn a logistic regression model, the coefficients are estimated from the training set using maximum-likelihood estimation, which makes assumptions about the distribution of the training data. The best coefficients give predictions close to 1 for the default class.
Tree-based learning builds decision trees from a labeled training set. A decision, or class label prediction in machine learning terms, is obtained by following a path from the root to a leaf node. Decision trees can handle and learn from multidimensional data.
A decision tree has a flow-chart-like tree structure. The internal nodes denote tests on attributes, the branches are the outcomes of the tests, and the leaves are class labels. The root is the topmost node of the tree. Many decision tree algorithms exist in the literature, including the following.
- ID3 (Iterative Dichotomiser 3)
- CART (Classification And Regression Tree, also called hierarchical optimal discriminant analysis)
- CHAID (CHi-squared Automatic Interaction Detector, whose outputs are highly visual)
- MARS (Multivariate Adaptive Regression Splines, a non-parametric regression technique)
The tree-based method J48 (i.e., an implementation of C4.5) is widely used in different application domains (e.g., problem diagnosis, astronomy, artificial intelligence, financial and banking analysis, and molecular biology). Decision trees are also used to produce practical classifiers and to learn to predict a given problem, which is a significant research topic today. They perform well in multi-variable analyses and can select complex, rule-related features. They work with a variety of data points, including noisy and incomplete variables (i.e., missing values). Learning and classification from the training data are simple and fast, with good accuracy.
Despite these strengths, the resulting information is not always exact and can be influenced by other domains. In most cases, a decision tree requires a target variable in the training examples in order to make predictions. It is overly sensitive to irrelevant and noisy data, which appear in most practical cases. When trees are built from the training examples, they may reproduce noise or outliers, which is another shortcoming of decision trees.
Boosting is an example of an ensemble method, which combines a series of base classifiers. An ensemble tends to give better results when its member models each show significant performance. Here, performance is measured with metrics such as recall, precision, and F-measure. The goal of ensemble methods is to combine different methods or predictors in one learning algorithm. There are two categories of ensemble methods, listed below.
- Averaging methods: several classifiers are built independently and their predictions are averaged.
- Boosting methods: boosting is a popular method used in supervised learning to reduce bias and variance.
To understand the concept of boosting (introduced in 1990), consider an example. Suppose you are a patient with several symptoms. Instead of consulting one doctor, you may consult many. You can give each doctor a weight based on the accuracy of their previous diagnoses and combine the weighted diagnoses to reach a final diagnosis. This is the logic behind boosting.
Suppose we have a set of weak classifiers, c1 to cT, and a combined classifier, C, formed by a weighted majority vote over the T classifiers. Each classifier ct has a weight at. The combined classifier C for training data X is illustrated in equation 4.7.
Where X ∈ Xi
AdaBoost, short for Adaptive Boosting, is a linear combination of classifiers with good generalization properties (i.e., it tends not to overfit). This flexible classifier initially assigns an equal weight to each training example. It can be used for textual, numeric, and discrete classification, though it is vulnerable to noisy data (i.e., it can overfit noise and outliers). As an advantage, it performs multiple rounds with different data weights and finally gives a combined prediction.
By definition, AdaBoost is an algorithm for building a strong classifier from weak ones. AdaBoost is defined in equation 4.8.
where ct(X) are the weak classifiers and f(X) is the final classifier, or strong hypothesis.
In summary, boosting combines weak learners in order to correct classification errors. AdaBoost is a well-known boosting algorithm, mostly used for binary classification problems, that learns from weak learners.
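The weighted majority vote behind boosting can be sketched in Python as follows. This is a sketch of the combination step only, not of how AdaBoost learns its weights. The three threshold rules, their weights, and the metric names are hypothetical, and the code assumes that each weak classifier returns +1 (defective) or -1 (defect-free) and that a tie is resolved as +1.

```python
def weighted_vote(classifiers, weights, x):
    """Combine weak classifiers (each returning -1 or +1) by a weighted vote."""
    score = sum(a_t * c_t(x) for c_t, a_t in zip(classifiers, weights))
    return 1 if score >= 0 else -1  # assumption: ties count as +1


# Hypothetical weak classifiers: simple threshold rules on single metrics.
weak_classifiers = [
    lambda m: 1 if m["loc"] > 100 else -1,
    lambda m: 1 if m["cyclomatic"] > 10 else -1,
    lambda m: 1 if m["branches"] > 20 else -1,
]
weights = [0.4, 0.9, 0.3]  # hypothetical weights a_t

module = {"loc": 150, "cyclomatic": 4, "branches": 30}
print(weighted_vote(weak_classifiers, weights, module))
```

In this example two of the three rules vote +1, but the combined vote is -1 because the rule with the largest weight disagrees, which mirrors the doctors example: a more trusted opinion counts for more.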
We propose a semi-supervised approach, RCForest, in which Random Forest is used as the base classifier within a Random Committee ensemble. To understand semi-supervised learning, it helps to know first about supervised and unsupervised learning, the machine learning tasks that work with labeled and unlabeled datasets, respectively.
Supervised learning analyses labeled training data and uses it to map new examples to labels. In short, learning is supervised if every input in the training examples has a target. For example, suppose you have a labeled training dataset for admission to a school, in which each student is labeled admit or reject based on exam score and age. Based on this labeled data, new students can be categorized as admit or reject from their age and test score.
In unsupervised learning, the dataset is unlabeled, and the data are divided into groups based on similarities. In the previous example, students can be grouped by exam score and age (e.g., age<25 and score>60). After the students' data have been grouped, a decision can be taken on admission to the school. This is unsupervised learning from the student dataset.
Semi-supervised learning uses both supervised and unsupervised learning to learn from the training data. In practice, a lot of unlabeled data is collected, while only a little of it can be labeled. In semi-supervised learning, learning performance is improved by analysing the unlabeled data with the help of the labeled data. In most cases, labeled training data is limited while unlabeled data is abundant; clusters built from the abundant unlabeled data help the learner trained on the labeled data to predict. This approach is called semi-supervised learning.
Random Committee is a semi-supervised learning algorithm that builds an ensemble of base classifiers and averages their predictions. It is a disagreement-based semi-supervised learner that works with labeled and unlabeled datasets. It works well on class-imbalanced datasets. It also builds classifiers on random training sets. It uses the labeled data directly to improve prediction and to refine the classifiers.
To construct RCForest as an ensemble algorithm, we use Random Committee with Random Forest as the base classifier. Random Forest is one of the best-known methods used by data scientists. It combines tree predictors by constructing a number of decision trees. In real environments, single decision trees often have high variance, which random forests mitigate by combining the votes of the trees over the classes. For example, suppose there are four trees: the first predicts class 1, the second class 2, the third class 1, and the fourth class 1. The class of a new instance is found by combining the four trees, so in this example the instance is assigned to class 1. The Random Forests algorithm was developed by Leo Breiman and Adele Cutler; it is easy to learn and applicable in real life. Random Forests have many useful features, e.g., accuracy, applicability to large datasets, handling of thousands of input variables, prediction for new inputs, and effectiveness on data with missing values.
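The four-tree example, and the averaging of predictions used by a committee, can be sketched in Python as follows. This is an illustration of the two combination rules only, not the WEKA implementation; the function names and the probability values of the three committee members are hypothetical.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Class predicted by most trees (ties go to the class seen first)."""
    return Counter(tree_predictions).most_common(1)[0][0]


def average_committee(member_probs):
    """Average per-class probabilities over members; return best class and averages."""
    classes = member_probs[0].keys()
    averages = {
        c: sum(p[c] for p in member_probs) / len(member_probs) for c in classes
    }
    return max(averages, key=averages.get), averages


# The four trees from the example vote for classes 1, 2, 1, and 1.
print(majority_vote([1, 2, 1, 1]))

# Hypothetical class probabilities produced by three committee members.
members = [
    {"defective": 0.7, "defect-free": 0.3},
    {"defective": 0.4, "defect-free": 0.6},
    {"defective": 0.6, "defect-free": 0.4},
]
label, averages = average_committee(members)
print(label, averages)
```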
In detail, RCForest is built from the ensemble algorithm Random Committee and the base classifier Random Forest, which uses different random number seeds (i.e., the default) and investigates a randomly chosen number of attributes. For our dataset, let LD and UD denote the labeled and unlabeled data, and let N be the number of random trees built on LD. The pseudo-code of RCForest is shown in Algorithm 1.
Algorithm 1: Pseudo code of RCForest
LD: the labeled dataset
UD: the unlabeled dataset
N: the number of random trees
RT: the prediction with Random Committee with Random Tree for each dataset with label
In step 1, Random Trees are constructed for the labeled data, LD. Step 2 sets the random tree for each labeled dataset. In steps 3 to 7, the random committee ensemble is constructed over the trees. The experiment is conducted in the WEKA environment. The RCForest classifier is applied to all the datasets along with the conventional classifiers. Algorithm 2 illustrates the working procedure of each classifier for all the datasets.
Modern software systems require quality improvement to fulfill customer demands. They must also be as defect-free as possible even as they grow in size, i.e., to hundreds or even thousands of modules. Releasing defect-free software is a big challenge, and testing is costly when resources are limited. In practice, it is quite hard for developers and development organizations to test all the modules.
In a real environment, only a few modules can be tested, based on the requirements of the software project, in the expectation that only a few modules are fault-prone. In this case, developers can select a module or a small part of the developed software for testing. If a selected module is found to be defective, it can be considered labeled data. Machine learning algorithms can then be used to find defect-prone modules based on the labeled data (i.e., the defective modules, which are very few) and the unlabeled data (i.e., the defect-free modules in this circumstance).
In our experiment, we applied our semi-supervised learning approach, RCForest, to the PROMISE NASA datasets and compared it with conventional machine learners (e.g., J48, Logistic Regression, AdaBoost, and Naive Bayes). As training data, we took a small part of each dataset, bearing in mind that few modules are faulty. We first selected 5% of the data (i.e., a small sample) and later increased the sample up to 50% of the historical data for defect prediction. The key point of this study is to obtain better predictions from conventional and semi-supervised learners while keeping the sample small, which makes defect prediction cost-effective.
The pseudo code of our experiment is presented in Algorithm 2 and is briefly described as follows. Let Dn denote the list of datasets and Cn the list of classifiers. We apply a regression-based method (i.e., Logistic Regression), a tree-based method (i.e., J48), a statistical method (i.e., Naïve Bayes), an ensemble method (i.e., AdaBoost), and a semi-supervised method (i.e., RCForest). For our experiment, we selected seven datasets (i.e., JM1, MW1, KC2, PC4, PC1, PC3, and PC5) for the prediction of defects. Every dataset contains defective and defect-free modules labeled Y or N (sometimes 0 or 1), where Y means defective and N means defect-free. Defective modules make up a smaller percentage of each dataset than defect-free modules. We select the training set from each defect dataset. P denotes the parameter settings per classifier. To take the training set from a dataset, we use μ as the sampling rate. We increase the sampling rate up to 50% for each dataset, in increments of 5% per iteration.
In step 1, we select the datasets from the list of datasets. In step 2, we select the classifiers from the list of classifiers.
In steps 3 to 12, we select a dataset from the list and perform the experiment on it. These steps are repeated for each dataset. The main operation in these steps is performed on the training set selected from the full dataset. After these steps, we obtain the F-measure and AUC values for each dataset and each classifier at each sampling rate, μ.
Algorithm 2: Pseudo code of the experiment of RCForest and other conventional learners
Dn: list of datasets
Cn: list of classifiers
µ: sampling rate
P: parameter settings per classifier
F: F-measures per classifier on dataset
AUC: AUC per classifier on dataset
1. Set D (D ∈ Dn) as the dataset or training set
2. Set C (C ∈ Cn) // the list of classifiers (i.e., c1, c2, c3, c4, c5)
3. foreach d ∈ D do
4. for µ ← 5; µ ≤ 50 do
5. foreach c ∈ C do
6. p = ModelP(µ, c, P[c])
7. model = BuildClassifier(µ, c, p) // Prepare the model from the sampling rate for each classifier
8. F[c, d] = ApplyClassifier(model, d) // Compute the F-measure of the model on the data
9. AUC[c, d] = ApplyClassifier(model, d) // Compute the AUC of the model on the data
10. end foreach
11. µ ← µ + 5
12. end for; end foreach
13. output F, AUC // F-measures and AUC per classifier on the datasets
First, in steps 4 to 11, we select 5% of the dataset as the sample, according to the sampling rate, and take it into the experiment. In step 5, we choose a classifier from the list of classifiers to apply to the training set. In steps 6 and 7, we prepare the model from the training set given by μ and the selected classifier, c, and use it to build the classifier. After that, in steps 8 and 9, we obtain the F-measure and AUC by applying the classifier.
Lastly, in step 11, we increase the sampling rate by 5% and repeat steps 4 to 11 for the selected classifier, c. At step 13, we obtain the F-measure and AUC values for each dataset with the selected classifier at each sampling rate, which shows the prediction performance of each classifier.
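The loop structure of Algorithm 2 can be sketched in Python as follows. This is a minimal sketch under several assumptions, not the WEKA experiment: the data are synthetic (LOC, label) pairs, `LocThreshold` is a hypothetical stand-in learner rather than one of the five classifiers, only the F-measure is computed, and each model is evaluated on the modules left out of the sample.

```python
import random


def f_measure(y_true, y_pred, positive="Y"):
    """Harmonic mean of precision and recall for the defective class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


class LocThreshold:
    """Hypothetical learner: flags modules whose LOC exceeds a learned cut."""

    def fit(self, rows):
        bad = [loc for loc, label in rows if label == "Y"]
        good = [loc for loc, label in rows if label == "N"]
        if bad and good:
            self.cut = (sum(bad) / len(bad) + sum(good) / len(good)) / 2
        else:
            self.cut = float("inf")  # one class missing: predict "N" everywhere
        return self

    def predict(self, rows):
        return ["Y" if loc > self.cut else "N" for loc, _ in rows]


def run_experiment(datasets, learners, rates=range(5, 55, 5), seed=0):
    rng = random.Random(seed)
    results = {}
    for d_name, rows in datasets.items():            # step 3
        for mu in rates:                              # steps 4 and 11
            shuffled = rows[:]
            rng.shuffle(shuffled)
            n_train = max(1, len(rows) * mu // 100)
            train, test = shuffled[:n_train], shuffled[n_train:]
            for c_name, make in learners.items():     # step 5
                model = make().fit(train)             # steps 6 and 7
                truth = [label for _, label in test]
                results[(c_name, d_name, mu)] = f_measure(truth, model.predict(test))  # step 8
    return results


# Synthetic data: larger modules are defective more often (an assumption).
data_rng = random.Random(1)
toy = []
for _ in range(200):
    loc = data_rng.randint(5, 300)
    toy.append((loc, "Y" if data_rng.random() < loc / 600 else "N"))

results = run_experiment({"toy": toy}, {"loc-threshold": LocThreshold})
for key in sorted(results, key=lambda k: k[2]):
    print(key, round(results[key], 3))
```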
As an example, we consider the JM1 dataset to illustrate the working method. This dataset contains 22 attributes, including one goal attribute (i.e., label) that specifies whether a module is defective. It contains 7782 modules, of which 1622 are defective and 6110 are defect-free. The ratio of defect-free to defective modules is 3.78.
In step 1, we select the JM1 dataset from the list of datasets. In step 2, we select a classifier from the list of classifiers. After selecting the classifier, we repeat steps 3 to 12 iteratively. In step 4, we set the sampling rate to 5% to select the training data. In step 7, we prepare the model and apply the selected classifier (i.e., Naive Bayes) to the training data. In steps 8 and 9, we obtain the F-measure and AUC by applying the classifier, where the model and the 5% training data are the inputs of the Naive Bayes classifier. This completes the first iteration. In the second iteration of steps 4 to 12, the sampling rate is increased to 10%, and the F-measure and AUC are collected for the 10% training set with the Naive Bayes classifier. In this way, the sampling rate (i.e., the training data) is increased by 5% in every iteration in step 11. The iterations continue until the sampling rate reaches 50%, at which point we have collected the F-measure and AUC of the Naive Bayes classifier for the JM1 dataset. After that, we change the classifier to J48 and repeat the iteration process, increasing the sampling rate by 5% each time and collecting the F-measure and AUC. In this way, we apply our five classifiers (i.e., Naive Bayes, J48, Logistic Regression, AdaBoost, and RCForest) to the JM1 dataset and collect the F-measure and AUC for the 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, and 50% sampling rates. The F-measure and AUC values of the conventional and semi-supervised learners on the JM1 dataset are listed in Table 11.
We then change the dataset and apply our classifiers one by one, again increasing the sampling rate by 5% in each iteration up to 50%. We list the F-measure and AUC of the MW1, PC4, PC1, PC3, PC5, and KC2 datasets in Tables 12, 13, 14, 15, 16, and 17, respectively.
This chapter describes the PROMISE NASA datasets used in the experiment. It also presents the experimental results that compare the conventional machine learners with the proposed semi-supervised learner, together with their analysis.
To evaluate the semi-supervised method RCForest and sample-based software bug prediction, we perform our experiments on datasets from the NASA PROMISE repository, which are made available for predicting software defects. This study uses the JM1, MW1, KC2, PC4, PC1, PC3, and PC5 datasets. Each dataset contains several program modules, where a module is a small unit of functionality described by quality metrics. The datasets were produced by extracting McCabe and Halstead measures from the source code of the programs. Each module in a dataset is marked as defective or defect-free. Tim Menzies is the donor of these datasets. All of the underlying programs were written in either C or C++. Table 9 contains the basic information about the datasets.
The detailed description of the datasets is recorded in Table 10, which shows the number of attributes, instances, defective modules, and defect-free modules, together with the ratio of defect-free to defective modules.
From Table 10, it can be noted that some datasets, e.g., JM1 and KC2, contain 22 attributes, one of which is the primary attribute that identifies a module as defective or defect-free. The remaining 21 attributes are the quality metrics of the program, i.e., typical software metrics such as LOC counts, McCabe complexity measures, and Halstead measures. The tabulated elements of the datasets are used in our experiment.
|Dataset||Language||Description|
|JM1||C||Real-time predictive ground system; uses simulations to generate predictions|
|MW1||C++||A zero gravity experiment system related to combustion|
|KC2||C++||Storage management system for receiving and processing ground data|
|PC4||C||Flight software for earth orbiting satellite|
|PC1||C||Flight software for earth orbiting satellite|
|PC3||C||Flight software for earth orbiting satellite|
|PC5||C||Flight software for earth orbiting satellite|
The JM1 dataset contains 21 attributes plus one class attribute that marks a module as defective or defect-free, i.e., 22 attributes in total, with 7782 instances including 1622 defective modules and 6110 defect-free modules. The MW1 dataset contains 37 attributes plus one class attribute that marks a module as defective or defect-free, with 253 instances including 27 defective modules and 226 defect-free modules. The KC2 dataset contains 21 attributes plus one class attribute, with 522 instances. The attributes, instances, defective instances, and defect-free instances of the remaining datasets (i.e., PC4, PC1, PC3, and PC5) are shown in Table 10. According to the information given in Table 10, the ratios of defect-free to defective modules of the datasets (i.e., JM1, MW1, KC2, PC4, PC1, PC3, and PC5) are 3.78, 8.37, 3.38, 7.19, 10.56, 7.04, and 32.31, respectively.
We first perform our experiments with conventional machine learners using random sampling. For each dataset, we randomly select a small portion of the modules as the training set according to the sampling rate. The selected modules serve as the labeled training set, and the remaining modules are used as unlabeled data, i.e., as the test set. For example, if a project contains 1000 modules and the sampling rate (u) is 5%, then 50 modules are taken as labeled training data and the remaining 950 modules (95%) form the test set of unlabeled data.
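The split described above can be sketched as follows. This is a minimal illustration in plain Python rather than the WEKA setup used in the experiments; the function name `random_sample_split` and the `seed` parameter are assumptions introduced here for reproducibility.

```python
import random


def random_sample_split(modules, sampling_rate, seed=None):
    """Split modules into a labeled training sample and the remaining test set."""
    if not 0 < sampling_rate < 1:
        raise ValueError("sampling_rate must be strictly between 0 and 1")
    rng = random.Random(seed)
    n_train = round(len(modules) * sampling_rate)
    # Draw training indices without replacement.
    train_idx = set(rng.sample(range(len(modules)), n_train))
    train = [m for i, m in enumerate(modules) if i in train_idx]
    test = [m for i, m in enumerate(modules) if i not in train_idx]
    return train, test


# The example from the text: 1000 modules at a 5% sampling rate.
modules = list(range(1000))
train, test = random_sample_split(modules, 0.05, seed=0)
print(len(train), len(test))  # 50 950
```

Repeating the call with the other sampling rates (10%, 15%, and so on up to 50%) would give the remaining training and test partitions.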
In practice, it is difficult to test every module of a large project; doing so is costly and time-consuming. In our experiment, we therefore use ten sampling rates (i.e., 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, and 50%), each giving the fraction of the data that is labeled and used for learning. We use four machine learners, namely J48, Naïve Bayes, Logistic Regression, and AdaBoost. Each dataset is sampled at each sampling rate, and the conventional machine learners are applied to it. We consider two widely used evaluation measures, namely F-measure and AUC. The experiments are conducted in the WEKA environment.
In real-life applications, large amounts of unlabeled data can be collected easily, whereas collecting labeled data requires expertise and human effort. Semi-supervised learning is a machine learning technique in which labeled and unlabeled data work together. A small amount of labeled data is used to build a learning model, and this model then exploits the knowledge contained in the vast amount of unlabeled data to improve learning performance.
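To make the idea concrete, the sketch below shows generic self-training with a nearest-centroid classifier: a model built from a few labeled points repeatedly assigns labels to the unlabeled points it is most confident about and then refits. This is only an illustration of the general semi-supervised principle, not the RCForest method used in this work; all function names and the toy feature vectors are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length feature vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def class_centroids(labeled):
    """Centroids of class 0 (defect-free) and class 1 (defective)."""
    c0 = centroid([x for x, y in labeled if y == 0])
    c1 = centroid([x for x, y in labeled if y == 1])
    return c0, c1


def predict(x, c0, c1):
    """Assign x to the class whose centroid is nearer."""
    return 0 if math.dist(x, c0) < math.dist(x, c1) else 1


def self_train(labeled, unlabeled, rounds=5, batch=2):
    """Repeatedly pseudo-label the most confident unlabeled points.

    `labeled` must contain at least one example of each class.
    Returns the final centroids and the enlarged labeled set.
    """
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        c0, c1 = class_centroids(labeled)
        # Confidence: how much nearer a point is to one centroid than the other.
        ranked = sorted(
            unlabeled,
            key=lambda x: abs(math.dist(x, c0) - math.dist(x, c1)),
            reverse=True,
        )
        for x in ranked[:batch]:
            labeled.append((x, predict(x, c0, c1)))
            unlabeled.remove(x)
    c0, c1 = class_centroids(labeled)
    return c0, c1, labeled


# Toy data (hypothetical): two labeled modules and five unlabeled ones.
labeled = [([1.0, 1.0], 0), ([9.0, 9.0], 1)]
unlabeled = [[2.0, 1.5], [8.0, 8.5], [1.5, 2.0], [8.5, 9.5], [3.0, 3.0]]
c0, c1, grown = self_train(labeled, unlabeled)
print(len(grown))                   # 7: all five unlabeled points were pseudo-labeled
print(predict([2.5, 2.0], c0, c1))  # 0
```

The design choice to add only the most confident points in each round limits the risk that early mistakes are reinforced by later rounds.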
In our experiment, we adopt semi-supervised learning with random sampling. Because all the modules in our datasets are labeled, we again use ten sampling rates (i.e., 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, and 50%), as in the random sampling experiment with conventional machine learners. The sampled training set is treated as labeled data, and the rest of the data is treated as unlabeled data and used as the test set.
RCForest is a semi-supervised learning method that uses both labeled and unlabeled data. It also copes with imbalanced class distributions, which makes it suitable when the classes of the data are imbalanced. All of our datasets are imbalanced: Table 10 gives the ratio of defect-free to defective modules, and defective modules are fewer than defect-free modules in every dataset. Here, we again use F-measure and AUC, which are widely used for evaluation.
We evaluate the performance of all the conventional learners and the semi-supervised learner with F-measure and AUC (Area Under the ROC Curve). F-measure summarizes precision and recall. Recall is the fraction of defective modules that are correctly predicted as defective. Precision is the fraction of modules predicted as defective that are actually defective. Recall, Precision, and F-measure are defined as follows.
$$\text{Recall} = \frac{tp}{tp + fn}$$
$$\text{Precision} = \frac{tp}{tp + fp}$$
$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
where tp, fp, tn, and fn are defined as follows:
tp is the number of defective modules predicted as defective;
fp is the number of defect-free modules predicted as defective;
tn is the number of defect-free modules predicted as defect-free; and
fn is the number of defective modules predicted as defect-free.
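The three measures can be computed directly from these counts. The sketch below uses the standard definitions of precision, recall, and F-measure; the function name and the example counts are made up for illustration and do not come from the experiments.

```python
def precision_recall_f(tp: int, fp: int, fn: int):
    """Return (precision, recall, F-measure) from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts: 20 true positives, 5 false positives, 7 false negatives.
p, r, f = precision_recall_f(tp=20, fp=5, fn=7)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
# precision=0.800 recall=0.741 F=0.769
```

Note that tn does not enter any of the three formulas, which is why F-measure is often preferred over plain accuracy when defect-free modules dominate.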
High precision and recall yield a high F-measure, which we report with our experimental results. The other performance metric is AUC (Area Under the ROC Curve), where ROC stands for Receiver Operating Characteristic; it helps to measure the quality of the classification. AUC shows the potential of the classifier used in the experiments. A high AUC indicates that the classifier has the potential to produce good classifications on the data used in the experiments. Even so, a classifier with a high AUC can still give poor classifications if the training data is not suitable for classification.
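One way to see what AUC measures is through its rank interpretation: it equals the probability that a randomly chosen defective module receives a higher score than a randomly chosen defect-free module, with ties counted as one half. The sketch below computes AUC this way; it is a minimal quadratic-time illustration with made-up scores, not the WEKA routine used in the experiments.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (defective, defect-free) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one module of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(pos) * len(neg))


# Hypothetical predictions: 1 = defective, 0 = defect-free.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
print(round(auc_from_scores(labels, scores), 4))  # 0.8333
```

Here five of the six defective/defect-free pairs are ordered correctly, giving 5/6. An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.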
In our experiment, we first run the conventional machine learners, namely Naïve Bayes, Logistic Regression, AdaBoost, and J48, on the PROMISE defect datasets. We then run the semi-supervised learner, RCForest, on the same datasets at the same sampling rates (u).
The F-measure and AUC results are given in Tables 11 to 17.
Figure 6 compares the F-measures of the conventional learners and the semi-supervised learner across the datasets. For the JM1 dataset, the lowest F-measure is 0.6948, for the AdaBoost classifier, and the highest is 0.7475, for the RCForest classifier.
In detail, the semi-supervised classifier, RCForest, performs better than the conventional machine learning classifiers. The lowest F-measures are 0.8280 and 0.7026 for Naïve Bayes (on MW1 and PC3, respectively), 0.9640 for Logistic Regression (on PC5), and 0.7948 for J48 (on KC2). RCForest obtains the highest F-measure on JM1, MW1, PC3, PC5, and KC2, but not on PC4 and PC1, where Logistic Regression and RCForest obtain almost the same F-measures: 0.8747 and 0.875 on PC4, and 0.8888 and 0.8844 on PC1, respectively. RCForest therefore obtains the highest F-measure on almost all of the datasets. Figure 6 shows that the F-measure improvement achieved by RCForest is significant.
Our experimental results show that increasing the sample size does not substantially change the prediction results. In short, the prediction results do not depend strongly on the sample size.
For example, in Figure 7, for the JM1 dataset, increasing the sampling rate from 5% to 50% increases the F-measure by 0.07 (i.e., from 0.713 to 0.784) using J48, by 0.01 (i.e., from 0.72 to 0.73) using Logistic Regression, by 0.02 (i.e., from 0.73 to 0.75) using Naïve Bayes, by 0.01 (i.e., from 0.69 to 0.70) using AdaBoost, and by 0.08 (i.e., from 0.69 to 0.77) using RCForest. This shows that increasing the sample size does not significantly improve the prediction results.
Figure 7 depicts the box-plot of the F-measures for the JM1 dataset with sampling rates from 5% to 50%. It also shows that the distance between the upper quartile and the lower quartile is very narrow for every classifier on JM1. A few data points fall below the lower whisker for the RCForest classifier and above the upper whisker for the J48 classifier.
Figure 8 shows the box-plot of the F-measures for the MW1 dataset with sampling rates from 5% to 50%. As the sampling rate increases from 5% to 50%, the F-measure increases by 0.1 (i.e., from 0.76 to 0.86) using J48, by 0.03 (i.e., from 0.86 to 0.89) using Logistic Regression, by 0.06 (i.e., from 0.80 to 0.86) using Naïve Bayes, by 0.11 (i.e., from 0.78 to 0.89) using AdaBoost, and by 0.04 (i.e., from 0.86 to 0.90) using RCForest. The distance between the lower quartile and the upper quartile is very narrow. The experiment on this dataset shows that increasing the sampling rate does not noticeably improve the prediction results.
Figure 9 presents the box-plot of the F-measures for the KC2 dataset with sampling rates from 5% to 50%. It also shows the spread of the F-measures obtained with the J48, Logistic Regression, Naïve Bayes, AdaBoost, and RCForest learners.
As the sampling rate increases from 5% to 50%, the F-measure increases by 0.042 (i.e., from 0.772 to 0.814) using J48, by 0.055 (i.e., from 0.769 to 0.824) using Logistic Regression, by 0.014 (i.e., from 0.799 to 0.813) using Naïve Bayes, by 0.025 (i.e., from 0.807 to 0.832) using AdaBoost, and by 0.054 (i.e., from 0.795 to 0.849) using RCForest. In Figure 9, the distance between the lower quartile and the upper quartile is very narrow, which is significant and again shows that increasing the sample size does not improve the prediction results.
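The per-classifier gains reported for each dataset are simple differences between the F-measure at the 50% and at the 5% sampling rate. The short sketch below reproduces that arithmetic from the KC2 endpoint values quoted above; the dictionary layout is an illustrative assumption.

```python
# (F-measure at 5%, F-measure at 50%) for KC2, as quoted in the text.
kc2_f_measure = {
    "J48": (0.772, 0.814),
    "Logistic Regression": (0.769, 0.824),
    "Naive Bayes": (0.799, 0.813),
    "AdaBoost": (0.807, 0.832),
    "RCForest": (0.795, 0.849),
}

# Gain = F-measure at the highest rate minus F-measure at the lowest rate.
gains = {
    name: round(high - low, 3)
    for name, (low, high) in kc2_f_measure.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain:.3f}")
```

Even the largest gain here is below 0.06, which is the basis for the observation that a tenfold increase in labeled data changes the F-measure only slightly on this dataset.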
Figure 10 shows the box-plot of the F-measures for the PC4 dataset with sampling rates from 5% to 50%. As the sampling rate increases from 5% to 50%, the F-measure increases by 0.06 (i.e., from 0.82 to 0.88) using J48, by 0.05 (i.e., from 0.84 to 0.89) using Logistic Regression, by 0.12 (i.e., from 0.74 to 0.86) using Naïve Bayes, by 0.05 (i.e., from 0.83 to 0.88) using AdaBoost, and by 0.06 (i.e., from 0.84 to 0.90) using RCForest.
Here, the lowest F-measure among the classifiers is 0.74 and the highest is 0.90, as shown in Figure 10. The range of the F-measures is therefore narrow, like the distance between the lower quartile and the upper quartile, which is significant.
Figure 11 shows the box-plot of the F-measures for the PC1 dataset with sampling rates from 5% to 50%. As with the other datasets, as the sampling rate increases from 5% to 50%, the F-measure increases by 0.01 (i.e., from 0.87 to 0.88) using J48, by 0.02 (i.e., from 0.88 to 0.90) using Logistic Regression, by 0.23 (i.e., from 0.67 to 0.90) using Naïve Bayes, by 0.04 (i.e., from 0.86 to 0.90) using AdaBoost, and by 0.03 (i.e., from 0.87 to 0.90) using RCForest. The difference of 0.23 for the Naïve Bayes classifier is larger than for the other classifiers.
In the box-plot in Figure 11, the distance between the lower quartile and the upper quartile is very narrow. Only a few data points fall below the lower whisker, for the Naïve Bayes classifier. For the PC1 dataset, increasing the sample size again does not improve the prediction results.
Figure 12 shows the box-plot of the F-measures for the PC3 dataset with sampling rates from 5% to 50%. As the sampling rate increases from 5% to 50%, the F-measure increases by 0.05 (i.e., from 0.78 to 0.83) using J48, by 0.03 (i.e., from 0.79 to 0.82) using Logistic Regression, by 0.48 (i.e., from 0.34 to 0.82) using Naïve Bayes, by 0.04 (i.e., from 0.79 to 0.83) using AdaBoost, and by 0.02 (i.e., from 0.82 to 0.84) using RCForest. For the Naïve Bayes classifier, the difference between the lowest and highest F-measures is 0.48, which is considerably higher than for the other classifiers. In terms of quartile distance, all the boxes are very narrow. This again shows that increasing the sampling rate does not improve defect prediction.
Figure 13 shows the box-plot of the F-measures for the PC5 dataset with sampling rates from 5% to 50%. Comparing the lowest and highest F-measures, we see that increasing the sample size again does not affect defect prediction for the PC5 dataset.
As the sampling rate changes from 5% to 50%, the F-measures for the PC5 dataset change only slightly. For example, in Figure 13, the F-measure increases by 0.01 (i.e., from 0.96 to 0.97) using J48, by 0.02 (i.e., from 0.95 to 0.97) using Logistic Regression, by 0.002 (i.e., from 0.965 to 0.967) using Naïve Bayes, by 0.013 (i.e., from 0.955 to 0.968) using AdaBoost, and by 0.022 (i.e., from 0.965 to 0.987) using RCForest.
|JM1||J48||Logistic Regression||Naive Bayes||AdaBoost||RCForest|
|MW1||J48||Logistic Regression||Naive Bayes||AdaBoost||RCForest|
|PC4||J48||Logistic Regression||Naive Bayes||AdaBoost||RCForest|
|PC1||J48||Logistic Regression||Naive Bayes||AdaBoost||RCForest|
|PC3||J48||Logistic Regression||Naive Bayes||AdaBoost||RCForest|
|PC5||J48||Logistic Regression||Naive Bayes||AdaBoost||RCForest|
|KC2||J48||Logistic Regression||Naive Bayes||AdaBoost||RCForest|
Tables 11 to 17 contain the F-measure and AUC values for sampling rates from 5% to 50%. They show the improvement that RCForest brings to defect prediction.
With the semi-supervised learner, RCForest, we obtain a high average AUC for each dataset. The average AUC values of RCForest for JM1, MW1, PC4, PC1, PC3, PC5, and KC2 are 0.6704, 0.6963, 0.8921, 0.8928, 0.7987, 0.9682, and 0.8342, respectively, which are significant. This also indicates that RCForest has the potential to produce good classifications over the test examples.
In summary, our results suggest that increasing the sampling rate does not improve defect prediction, which means that a smaller sample can achieve nearly the same prediction performance. Predicting defects from a smaller sample helps to reduce the testing cost significantly. The semi-supervised learner, RCForest, shows a significant performance improvement in the classification and prediction of defects compared with the conventional learners.
Large software projects contain many modules, both defective and defect-free, and the defects need to be fixed before the software is handed over to customers. Effective prediction of defect-prone modules helps developers to achieve the goal of software quality assurance and to improve the quality of the software. A number of machine learners have been applied to predicting software defects, which is a key challenge in today's software industry. In a large software project, the numbers of defective and defect-free modules are not similar: mostly, defective modules are fewer than defect-free modules, so the data for predicting defective modules are imbalanced. Moreover, prediction usually relies on data from previous projects, which are not always available and may not be valid.
To address these two issues, in our work we apply two approaches with machine learners: random sampling with conventional learners and random sampling with semi-supervised learners. According to our experiment, the semi-supervised machine learner shows significantly better performance than the conventional machine learners.
The results of our experiments also show that RCForest, a semi-supervised learner, achieves higher performance in terms of F-measure and AUC, which are on average 85.84% and 82.28%, respectively. RCForest performs better on imbalanced datasets such as the PROMISE NASA defect datasets.
We also find that increasing the sampling rate does not affect defect prediction. We increase the sampling rate in steps of 5%, i.e., from 5% to 50%, for each classifier and find that a small sample gives almost the same prediction results as a large sample.
In our approach, we implement random sample-based defect prediction. To get effective prediction, the training set would normally have to be large; in this respect, the proposed technique works well for defect prediction. For better prediction, the sample should be selected carefully so that defects can be identified accurately. In our experiment, all the datasets are taken from the PROMISE NASA repository, which is publicly available. To obtain the desired prediction rate, our approach needs to be applied to real-life projects that must ensure quality. This will be our future work, which could have a considerable impact on industrial practice.
 Mike Y Chen, Emre Kiciman, Eugene Fratkin, Armando Fox, and Eric Brewer. Pinpoint: Problem determination in large, dynamic internet services. In Dependable Systems and Networks, 2002. DSN 2002. Proceedings. International Conference on, pages 595–604. IEEE, 2002.
 Learn Mccabe’s Cyclomatic Complexity with Example, 2017 (accessed April 18, 2017). URL http://www.guru99.com/cyclomatic-complexity.html.
 Mary Jean Harrold, Gregg Rothermel, Rui Wu, and Liu Yi. An empirical investigation of program spectra. In ACM SIGPLAN Notices, volume 33, pages 83–90. ACM, 1998.
 Rui Abreu, Peter Zoeteweij, Rob Golsteijn, and Arjan JC Van Gemund. A practical evaluation of spectrum-based fault localization. Journal of Systems and Software, 82(11):1780–1792, 2009.
 Mark Sullivan and Ram Chillarege. Software defects and their impact on system availability: A study of field failures in operating systems. In FTCS, volume 21, pages 2–9, 1991.
 Jin Huang and Charles X Ling. Using auc and accuracy in evaluating learning algorithms. IEEE Transactions on knowledge and Data Engineering, 17(3):299–310, 2005.
 Junjie Hu, Haiqin Yang, Michael R Lyu, Irwin King, and Anthony Man-Cho So. Online nonlinear auc maximization for imbalanced data sets. IEEE Transactions on Neural Networks and Learning Systems, 2017.
 Yu-Hsuan Lin, Po-Hsien Lin, Chih-Lin Chiang, Yang-Han Lee, CC Yang, TB Kuo, and ShH Liln. Incorporation of mobile application (app) measures into the diagnosis of smartphone addiction. The Journal of clinical psychiatry, 2017.
 Michael Unterkalmsteiner and Tony Gorschek. Requirements quality assurance in industry: Why, what and how? In International Working Conference on Requirements Engineering: Foundation for Software Quality, pages 77–84. Springer, 2017.
 D Richard Kuhn, Dolores R Wallace, and Albert M Gallo. Software fault interactions and implications for software testing. IEEE transactions on software engineering, 30(6):418–421, 2004.
 Ion Androutsopoulos, John Koutsias, Konstantinos V Chandrinos, George Paliouras, and Constantine D Spyropoulos. An evaluation of naive bayesian anti-spam filtering. arXiv preprint cs/0006013, 2000.
 James S Collofello. The software technical review process. Technical report, DTIC Document, 1988.
 Robert N Charette. Why software fails [software failure]. Ieee Spectrum, 42(9): 42–49, 2005.
 Glenford J Myers, Corey Sandler, and Tom Badgett. The art of software testing. John Wiley & Sons, 2011.
 Software Quality Assurance, 2017 (accessed April 6, 2017). URL http://softwaretestingfundamentals.com/software-quality-assurance/.
 Software Project Management, 2017 (accessed April 6, 2017). URL http://www.zeepedia.com/read.php?software_quality_assurance_activities_software_project_management&b=18&c=19.
 Prabhjot Kaur. Quality Assurance: Five Crucial Activities For Software Testing, 2017 (accessed April 6, 2017). URL https://www.grazitti.com/blog/quality-assurance-five-crucial-activities-to-improve-the—/.
 Henning Femmer, Daniel Méndez Fernández, Stefan Wagner, and Sebastian Eder. Rapid quality assurance with requirements smells. Journal of Systems and Software, 123:190–213, 2017.
 Jaroslaw Hryszko and Lech Madeyski. Assessment of the software defect prediction cost effectiveness in an industrial project. In Software Engineering: Challenges and Solutions, pages 77–90. Springer, 2017.
 Jeff Tian. Software quality engineering: testing, quality assurance, and quantifiable improvement. John Wiley & Sons, 2005.
 Tim Menzies and Justin S Di Stefano. How good is your blind spot sampling policy. In High Assurance Systems Engineering, 2004. Proceedings. Eighth IEEE International Symposium on, pages 129–138. IEEE, 2004.
 Sadia Sharmin, Md Rifat Arefin, M Abdullah-Al Wadud, Naushin Nower, and Mohammad Shoyaib. Sal: An effective method for software defect prediction. In Computer and Information Technology (ICCIT), 2015 18th International Conference on, pages 184–189. IEEE, 2015.
 Zhi-Wu Zhang, Xiao-Yuan Jing, and Tie-Jian Wang. Label propagation based semi-supervised learning for software defect prediction. Automated Software Engineering, 24(1):47–69, 2017.
 Ming Li, Hongyu Zhang, Rongxin Wu, and Zhi-Hua Zhou. Sample-based software defect prediction with active and semi-supervised learning. Automated Software Engineering, 19(2):201–230, 2012.
 Yuan Jiang, Ming Li, and Zhi-Hua Zhou. Software defect detection with rocus. Journal of Computer Science and Technology, 26(2):328–342, 2011.
 Tien-Duy B Le, Ferdian Thung, and David Lo. Theory and practice, do they match? a case with spectrum-based fault localization. In ICSM, pages 380–383, 2013.
 Xiaoyuan Xie, Tsong Yueh Chen, Fei-Ching Kuo, and Baowen Xu. A theoretical analysis of the risk evaluation formulas for spectrum-based fault localization. ACM Transactions on Software Engineering and Methodology (TOSEM), 22(4):31, 2013.
 Rui Abreu, Peter Zoeteweij, and Arjan JC Van Gemund. An evaluation of similarity coefficients for software fault localization. In Dependable Computing, 2006. PRDC’06. 12th Pacific Rim International Symposium on, pages 39–46. IEEE, 2006.
 Aritra Bandyopadhyay and Sudipto Ghosh. On the effectiveness of the tarantula fault localization technique for different fault classes. In High-Assurance Systems Engineering (HASE), 2011 IEEE 13th International Symposium on, pages 317–324. IEEE, 2011.
 James A Jones, Mary Jean Harrold, and John T Stasko. Visualization for fault localization. In Proceedings of ICSE 2001 Workshop on Software Visualization, Toronto, Ontario, Canada, pages 71–75. Citeseer, 2001.
 Valentin Dallmeier, Christian Lindig, and Andreas Zeller. Lightweight defect localization for java. In European conference on object-oriented programming, pages 528–550. Springer, 2005.
 James A Jones and Mary Jean Harrold. Empirical evaluation of the tarantula automatic fault-localization technique. In Proceedings of the 20th IEEE/ACM international Conference on Automated software engineering, pages 273–282. ACM, 2005.
 James A Jones, Mary Jean Harrold, and John Stasko. Visualization of test information to assist fault localization. In Proceedings of the 24th international conference on Software engineering, pages 467–477. ACM, 2002.
 Lee Naish, Hua Jie Lee, and Kotagiri Ramamohanarao. A model for spectra-based software diagnosis. ACM Transactions on software engineering and methodology (TOSEM), 20(3):11, 2011.
 Jeongho Kim, Jonghee Park, and Eunseok Lee. A new spectrum-based fault localization with the technique of test case optimization. J. Inf. Sci. Eng., 32(1):177–196, 2016.
 Jeongho Kim and Eunseok Lee. Empirical evaluation of existing algorithms of spectrum based fault localization. In Information Networking (ICOIN), 2014 International Conference on, pages 346–351. IEEE, 2014.
 Rui Abreu, Peter Zoeteweij, and Arjan JC Van Gemund. On the accuracy of spectrum-based fault localization. In Testing: Academic and Industrial Conference Practice and Research Techniques-MUTATION, 2007. TAICPART-MUTATION 2007, pages 89–98. IEEE, 2007.
 Anil K Jain and Richard C Dubes. Algorithms for clustering data. Prentice-Hall, Inc., 1988.
 Andréia da Silva Meyer, Antonio Augusto Franco Garcia, Anete Pereira de Souza, and Cláudio Lopes de Souza Jr. Comparison of similarity coefficients used for cluster analysis with dominant markers in maize (zea mays l). Genetics and Molecular Biology, 27(1):83–91, 2004.
 Benoit Baudry, Franck Fleurey, and Yves Le Traon. Improving test suites for efficient fault localization. In Proceedings of the 28th international conference on Software engineering, pages 82–91. ACM, 2006.
 Mark David Weiser. Program slices: formal, psychological, and practical investigations of an automatic program abstraction method. 1979.
 W Eric Wong, Ruizhi Gao, Yihao Li, Rui Abreu, and Franz Wotawa. A survey on software fault localization. IEEE Transactions on Software Engineering, 42(8): 707–740, 2016.
 Tutorial on Defect Prediction, 2017 (accessed April 12, 2017). URL http://openscience.us/repo/defect/tut.html.
 T.J. McCabe. A complexity measure. IEEE Transactions on Software Engineering, 2(4):308–320, December 1976.
 Ahmed H Yousef. Extracting software static defect models using data mining. Ain Shams Engineering Journal, 6(1):133–144, 2015.
 M.H. Halstead. Elements of Software Science. Elsevier, 1977.
 Hao Tang, Tian Lan, Dan Hao, and Lu Zhang. Enhancing defect prediction with static defect analysis. In Proceedings of the 7th Asia-Pacific Symposium on Internetware, pages 43–51. ACM, 2015.
 Chubato Wondaferaw Yohannese and Tianrui Li. A combined-learning based framework for improved software fault prediction. INTERNATIONAL JOURNAL OF COMPUTATIONAL INTELLIGENCE SYSTEMS, 10(1):647–662, 2017.
 Shomona Jacob and Geetha Raju. Software defect prediction in large space systems through hybrid feature selection and classification. International Arab Journal of Information Technology (IAJIT), 14(2), 2017.
 O Martínez Mozos, Cyrill Stachniss, and Wolfram Burgard. Supervised learning of places from range data using adaboost. In Robotics and Automation, 2005. ICRA 2005. Proceedings of the 2005 IEEE International Conference on, pages 1730–1735. IEEE, 2005.
 Wang Tao, LI Weihua, SHI Haobin, and LIU Zun. Software defect prediction based on classifiers ensemble. JOURNAL OF INFORMATION &COMPUTATIONAL SCIENCE, 8(16):4241–4254, 2011.
 Neeraj Bhargava, Girja Sharma, Ritu Bhargava, and Manish Mathuria. Decision tree analysis on j48 algorithm for data mining. Proceedings of International Journal of Advanced Research in Computer Science and Software Engineering, 3(6), 2013.
 David Bowes, Tracy Hall, and Jean Petrić. Software defect prediction: do different classifiers find the same defects? Software Quality Journal, pages 1–28, 2017.
 YU Qiao, Shujuan JIANG, and Yanmei ZHANG. The performance stability of defect prediction models with class imbalance: An empirical study. 2017.
 Wei Fu and Tim Menzies. Revisiting unsupervised learning for defect prediction. arXiv preprint arXiv:1703.00132, 2017.
 Thomas J McCabe and Charles W Butler. Design complexity measurement and testing. Communications of the ACM, 32(12):1415–1425, 1989.
 Module Design Complexity, 2017 (accessed April 18, 2017). URL http://www.ieee-stc.org/proceedings/2009/pdfs/tjm2273.pdf.
 More Complex Equals Less Secure-McCabe, 2017 (accessed April 18, 2017). URL http://www.mccabe.com/pdf/MoreComplexEqualsLessSecure-McCabe.pdf.
 Dipesh Joshi. McCabe Complexity Metrics, 2017 (accessed April 18, 2017). URL http://gtumaterial.com/wp-content/uploads/2015/04/McCabe-Complexity-Metrics.pdf.
 Halstead’s Software Science, 2017 (accessed April 18, 2017). URL http://www.whiteboxtest.com/Halstead-software-science.php.
 MW1 Dataset, 2017 (accessed March 27, 2017). URL https://zenodo.org/record/268490#.WNjfiBKGM6h.
 Sotiris B Kotsiantis, I Zaharakis, and P Pintelas. Supervised machine learning: A review of classification techniques, 2007.
 MC Prasad, Lilly Florence, and Arti Arya. A study on software metrics based software defect prediction using data mining and machine learning techniques. International Journal of Database Theory and Application, 8(3):179–190, 2015.
 PA Selvaraj and Dr P Thangaraj. Support vector machine for software defect prediction. International Journal of Engineering & Technology Research, 1(2):68–76, 2013.
 Ashraf Uddin. Naive Bayes Classifier in Java with database connectivity, 2017 (accessed April 6, 2017). URL http://ashrafsau.blogspot.in/2012/11/naive-bayes-classifier-in-java-with.html.
 M Anbu and GS Anandha Mala. Investigation of software defect prediction using data mining framework. Research Journal of Applied Sciences, Engineering and Technology, 11(1):63–69, 2015.
 David D Lewis. Naive (bayes) at forty: The independence assumption in information retrieval. In European conference on machine learning, pages 4–15. Springer, 1998.
 Jürgen Janssen and Wilfried Laatz. Naive bayes. In Statistische Datenanalyse mit SPSS, pages 557–569. Springer, 2017.
 Andrew McCallum, Kamal Nigam, et al. A comparison of event models for naive bayes text classification. In AAAI-98 workshop on learning for text categorization, volume 752, pages 41–48. Citeseer, 1998.
 Irina Rish. An empirical study of the naive bayes classifier. In IJCAI 2001 workshop on empirical methods in artificial intelligence, volume 3, pages 41–46. IBM New York, 2001.
 Naive Bayes, 2017 (accessed April 20, 2017). URL http://web.cs.hacettepe.edu.tr/~pinar/courses/VBM687/lectures/NaiveBayes.pdf.
 Nick Robinson. The Disadvantages of Logistic Regression, 2017 (accessed April 7, 2017). URL http://classroom.synonym.com/disadvantages-logistic-regression-8574447.html.
 Miho Ohsaki, Peng Wang, Kenji Matsuda, Shigeru Katagiri, Hideyuki Watanabe, and Anca Ralescu. Confusion-matrix-based kernel logistic regression for imbalanced data classification. IEEE Transactions on Knowledge and Data Engineering, 2017.
 David W Hosmer Jr, Stanley Lemeshow, and Rodney X Sturdivant. Applied logistic regression, volume 398. John Wiley & Sons, 2013.
 Robert M Young. 75.9 euler’s constant. The Mathematical Gazette, 75(472):187– 190, 1991.
 J. Ross Quinlan. Induction of decision trees. Machine learning, 1(1):81–106, 1986.
 Decision Trees, 2017 (accessed April 6, 2017). URL http://www.cs.ubbcluj.ro/~gabis/DocDiplome/DT/DecisionTrees.pdf.
 Jiawei Han, Jian Pei, and Micheline Kamber. Data mining: concepts and techniques. Elsevier, 2011.
 Kelemen Zsolt Benk Erika. Boosting Methods, 2017 (accessed April 21, 2017). URL http://www.cs.ubbcluj.ro/~csatol/mach_learn/bemutato/BenkKelemen_Boosting.pdf.
 Jiri Matas and Jan Sochman. AdaBoost, 2017 (accessed April 6, 2017). URL http://www.robots.ox.ac.uk/~az/lectures/cv/adaboost_matas.pdf.
 Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on Computational learning theory, pages 92–100. ACM, 1998.
 Xiaojin Zhu. Semi-supervised learning. In Encyclopedia of Machine Learning, pages 892–897. Springer, 2011.
 Introduction to Semi-supervised Learning, 2017 (accessed April 6, 2017). URL https://mitpress.mit.edu/sites/default/files/titles/content/9780262033589_sch_0001.pdf.
 Piyush Rai. Semi-supervised Learning, 2017 (accessed April 6, 2017). URL https://www.cs.utah.edu/~piyush/teaching/8-11-slides.pdf.
 Leo Breiman and Adele Cutler. Random Forests, 2017 (accessed April 23, 2017). URL https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm.
 Ming Cheng, Guoqing Wu, Mengting Yuan, and Hongyan Wan. Semi-supervised software defect prediction using task-driven dictionary learning. Chinese Journal of Electronics, 25(6):1089–1096, 2016.
 Yoav Freund, H Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the query by committee algorithm. Machine learning, 28(2):133–168, 1997.
 McCabe and Halsted. NASA PROMISE repository, 2017 (accessed March 27, 2017). URL http://openscience.us/repo/defect/mccabehalsted/.
 JM1 Dataset, 2017 (accessed March 27, 2017). URL https://zenodo.org/record/268514#.WNO6uhJ946g.
 KC2 Dataset, 2017 (accessed March 27, 2017). URL https://terapromise.csc.ncsu.edu/repo/defect/mccabehalsted/kc/kc2/kc2arff.
 PC4 Dataset, 2017 (accessed March 27, 2017). URL http://openscience.us/repo/defect/mccabehalsted/pc4.html.
 PC1 Dataset, 2017 (accessed March 27, 2017). URL http://openscience.us/repo/defect/mccabehalsted/pc1.html.
 PC3 Dataset, 2017 (accessed March 27, 2017). URL http://openscience.us/repo/defect/mccabehalsted/pc3.html.
 PC5 Dataset, 2017 (accessed March 27, 2017). URL http://openscience.us/repo/defect/mccabehalsted/pc5.html.
 Andrew P Bradley. The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern recognition, 30(7):1145–1159, 1997.
 Ian H Witten, Eibe Frank, Mark A Hall, and Christopher J Pal. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2016.
 Ana Stanescu and Doina Caragea. An empirical study of ensemble-based semi-supervised learning approaches for imbalanced splice site datasets. BMC systems biology, 9(5):S1, 2015.
 Wuying Liu and Ting Wang. Multi-field learning for email spam filtering. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pages 745–746. ACM, 2010.
 Tom Fawcett. An introduction to roc analysis. Pattern recognition letters, 27(8): 861–874, 2006.