The aim of this empirical project is to find out which variables affect an individual's wages. I begin by describing the variables, distinguishing qualitative data, which describe a quality (for example, a person is very tall), from quantitative data, which describe a quantity (for example, the person is 6ft). A model is then constructed from these variables and a number of regressions are run to identify which factors affect an individual's wages. The regression results are compared with standard microeconomic theory; the project falls under microeconomics because the model deals with individuals and households.
The article The Racial Wage Discrimination in America, published by The Economist, explored the differences in earnings between races in America. It states that black workers, whether men or women, earn less than white workers in America. It suggests that one reason for this is that ‘the average black worker has less education and experience than his white counterpart.’ It also reports that firms are ‘one and a half times more likely to interview a person they think is white than one they think is black, even if both have identical qualifications.’
ECONOMIC THEORY OF WAGES
Economic theory suggests that firms pay higher wages to attract the best pool of workers, namely those with the most education, because these workers are expected to be more productive than workers with less education. The theory therefore predicts a positive relationship between the level of education and the level of wages received. Theories of labour market discrimination also suggest that in the U.S. white men are paid more than white women and black men are paid more than black women. There is also a strand of microeconomics which discusses the benefits of good looks: people who are considered better looking tend to earn higher wages than those who are not.
The data were obtained from the Boston College Department of Economics (http://ideas.repec.org/s/boc/bocins.html), which hosts a variety of Stata datasets for econometrics. The downloaded file, wage 2, was created by Jeffrey M. Wooldridge. The selected data, which are already in Stata format, come from the 2000 wages survey in Boston. However, this survey is dated and is not the most recent data available.
The dataset has 935 observations and is a cross-sectional dataset on wages. A cross-sectional dataset is collected by observing a number of subjects (for example individuals, with their age, experience and so on) at one point in time, or without regard to differences in time. Analysing this sort of dataset consists of comparing the differences between subjects.
Below are the variables used:-
Wages:- This is the dependent variable, which is affected by the other variables. It shows the gross monthly earnings of an individual in dollars and is quantitative.
Hours:- This is the first of the independent variables. It shows the average weekly hours which an individual has worked and is quantitative.
IQ:- A quantitative independent variable showing the IQ of the individual, i.e. the score that the individual received on the test.
Age:- A quantitative independent variable showing the age of the individual when the survey was conducted.
Married:- This is the first qualitative independent dummy variable. If the individual is married at the time of the survey the value is 1 (married = 1); otherwise the value is 0 (not married = 0).
Education:- This quantitative independent variable shows the individual's years of education. A high value shows that the individual has spent longer in education, which is expected to make them more employable.
Black:- This is the second qualitative independent dummy variable, which shows whether the individual is black. If the individual is black the value is 1 (black = 1); otherwise the value is 0 (non-black = 0).
Sibs:- This quantitative independent variable shows the number of siblings the individual has. A high value means the individual has many siblings; a low value or zero means the individual has few or no siblings.
The model which I will first estimate is:-
Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + β6X6 + β7X7 + u
Y = Monthly Wages, X1 = Hours Worked, X2 = IQ, X3 = Age, X4 = Married, X5 = Education, X6 = Black, X7 = Sibs
Number of Data observations = 935
A regression will be run using the variables stated. Under the classical assumptions, ordinary least squares is the best linear unbiased estimator of the coefficients. A t-test will be carried out on each coefficient to show which variables are individually significant. An F-test will then be carried out to show whether the restricted or the unrestricted model is preferred. The data will next be checked for multicollinearity, which will show which variables are positively correlated, negatively correlated or uncorrelated. Finally, a test for heteroskedasticity will be carried out to show whether the variance of the errors differs across observations. The tests are carried out in this order because the F-test determines which model is used for the multicollinearity check and the heteroskedasticity test.
The F-test will be run at the 5% significance level and compares the restricted model with the unrestricted model to find out which is preferred. A correlation matrix will help to check for multicollinearity by showing the correlations among the independent variables and between them and the dependent variable. The test for heteroskedasticity will show whether the usual standard errors can be relied on; if heteroskedasticity is confirmed, heteroskedasticity-robust standard errors will be used.
The first regression uses all of the variables from the dataset which was provided:-
reg wage hours IQ age married educ black sibs
SSR = 123332195    R-squared = 0.1924
    wage |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
   hours |  -3.483357    1.667437   -2.09    0.037    -6.755746   -0.2109685
      IQ |   4.158012   0.9960817    4.17    0.000     2.203176     6.112849
     age |   19.41802    3.875994    5.01    0.000     11.81128     27.02476
 married |   174.9624    38.97636    4.49    0.000     98.47022     251.4545
    educ |   44.10633      6.4159    6.87    0.000     31.51496      56.6977
   black |  -113.9696    39.96556   -2.85    0.004     -192.403    -35.53611
    sibs |  -4.458277    5.581175   -0.80    0.425    -15.41148     6.494926
   _cons |  -675.0723    183.4671   -3.68    0.000    -1035.131    -315.0134
The critical value at the 5% significance level is 1.96, and the null hypothesis for each coefficient is H0: βj = 0.
If the absolute value of the t-statistic is greater than the critical value (1.96), I reject the null hypothesis and the variable is significant at the 5% level (95% confidence). If it is less than the critical value (1.96), I fail to reject the null hypothesis and the variable is insignificant at that level.
All variables are significant apart from the siblings variable (sibs), which has a t-statistic of -0.80 and a p-value of 0.425. Its absolute t-statistic is well below 1.96 and its p-value is far above 0.05, so sibs will be omitted from my model. A second regression is therefore run on the restricted model. As the regression command below shows, sibs has been omitted from the regression:-
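The t-test decision rule can be checked by hand from the reported coefficients and standard errors. The following is a minimal Python sketch of that arithmetic, not part of the original Stata analysis; it assumes the large-sample normal critical value (about 1.96) rather than the exact t distribution with 927 degrees of freedom.

```python
from statistics import NormalDist

# Coefficients and standard errors copied from the first regression output.
estimates = {
    "hours": (-3.483357, 1.667437),
    "IQ": (4.158012, 0.9960817),
    "age": (19.41802, 3.875994),
    "married": (174.9624, 38.97636),
    "educ": (44.10633, 6.4159),
    "black": (-113.9696, 39.96556),
    "sibs": (-4.458277, 5.581175),
}

# Two-sided 5% critical value from the standard normal (assumption:
# with 927 degrees of freedom the t distribution is very close to it).
critical = NormalDist().inv_cdf(0.975)

# t-statistic for H0: beta_j = 0 is the coefficient over its standard error.
t_stats = {name: coef / se for name, (coef, se) in estimates.items()}

# Variables whose absolute t-statistic falls below the critical value.
insignificant = [name for name, t in t_stats.items() if abs(t) < critical]

for name, t in t_stats.items():
    verdict = "significant" if abs(t) > critical else "not significant"
    print(f"{name:8s} t = {t:6.2f}  {verdict}")
```

Only sibs falls below the critical value in absolute terms, which is the basis for dropping it from the model.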
reg wage hours IQ age married educ black
SSR = 123417089    R-squared = 0.1919
    wage |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
   hours |  -3.479966    1.667106   -2.09    0.037    -6.751701   -0.2082306
      IQ |   4.246344    .9897316    4.29    0.000     2.303972     6.188716
     age |   19.55043    3.871693    5.05    0.000     11.95214     27.14872
 married |   174.8716    38.96859    4.49    0.000      98.3948     251.3484
    educ |   44.72156    6.368262    7.02    0.000      32.2237     57.21943
   black |   -121.036    38.96662   -3.11    0.002    -197.5089     -44.5631
   _cons |  -708.9586    178.4606   -3.97    0.000    -1059.192    -358.7254
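The null hypothesis is H0: β7 = 0, where β7 is the coefficient on sibs. The F-statistic is

\[ F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n - k - 1)} \]

With n = 935, k = 7 and q = 1, the degrees of freedom are n - k - 1 = 927, so

\[ F = \frac{(123417089 - 123332195)/1}{123332195/(935 - 7 - 1)} = \frac{84894}{133044.4391} = 0.6380875486 \]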
Critical value at 1% level of significance, F(1, 927) = 6.66 (approximately)
Critical value at 5% level of significance, F(1, 927) = 3.85 (approximately)
As the F-value is less than both of the critical values (F-value < critical value), the null hypothesis is not rejected, so the restricted model is preferred to the unrestricted model. This supports omitting the siblings variable from my model: there is no statistically significant evidence that it is related to the individual's earnings once the other variables are controlled for. The restricted model will now be used for the remainder of my results on multicollinearity and heteroskedasticity.
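The F-statistic can be reproduced from the two reported SSR values. The Python sketch below is an arithmetic check rather than part of the original analysis. Because there is a single restriction and 927 residual degrees of freedom, it uses the large-sample chi-squared(1) distribution as an approximation for both the 5% critical value and the p-value; the exact F(1, 927) figures differ slightly.

```python
import math

ssr_restricted = 123417089    # SSR of the model without sibs
ssr_unrestricted = 123332195  # SSR of the model with sibs
n, k, q = 935, 7, 1           # observations, regressors, restrictions

df_resid = n - k - 1
f_value = ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / df_resid)

# Large-sample approximation: F(1, df) is close to chi-squared(1), whose
# upper-tail probability at x equals erfc(sqrt(x / 2)).
p_approx = math.erfc(math.sqrt(f_value / 2))
critical_5pct_approx = 1.96 ** 2

print(f"F = {f_value:.4f}, df = (1, {df_resid})")
print(f"approximate p-value = {p_approx:.3f}")
print("reject H0" if f_value > critical_5pct_approx else "do not reject H0")
```

With one restriction the F-statistic is the square of the t-statistic on sibs, so the approximate p-value should be close to the 0.425 reported in the first regression.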
A correlation matrix is used to check for multicollinearity, i.e. to see which variables in the model are correlated. The command used, which is run on all 935 observations, is:-
. cor wage hours IQ age married educ black
There is a moderately strong positive correlation between two independent variables, years of education (educ) and IQ (0.5157), but this is not large enough to indicate a multicollinearity problem in the model. The restricted model will therefore not be altered. There is also a positive correlation between the independent variable educ (education) and the dependent variable wage (0.3271), although I would have expected this to be slightly higher. There is a slight negative relationship between the independent variable hours and wage. I would have expected there to be a strong positive relationship, as I would have expected the number of hours worked to increase the monthly wage of an individual.
The R-squared of the regression measures the fraction of the variance of the dependent variable (Yi) which is explained by the independent variables (Xi). The R-squared of the unrestricted model was 0.1924, meaning that 19.24% of the variation in the dependent variable is explained by the independent variables in the model. The R-squared of the restricted model is 0.1919, i.e. 19.19%. The change in R-squared is very small. It arises because one independent variable has been omitted from the restricted model, and R-squared cannot increase when a variable is removed.
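One way to put the educ and IQ correlation in context is a variance inflation factor (VIF). The Python sketch below uses a simplification that is my assumption, not something done in the original analysis: it treats educ and IQ as if they were the only two regressors, so that the VIF is 1 / (1 - r^2). The true VIF would use the R-squared from regressing educ on all the other regressors, and the threshold of 10 is only a common rule of thumb.

```python
# Pairwise correlation between educ and IQ, taken from the text.
r_educ_iq = 0.5157

# Two-regressor simplification: the auxiliary R-squared equals r squared.
aux_r_squared = r_educ_iq ** 2
vif_pairwise = 1 / (1 - aux_r_squared)

# Rule-of-thumb threshold (an assumption, not a formal test).
RULE_OF_THUMB = 10

print(f"auxiliary R-squared = {aux_r_squared:.4f}")
print(f"pairwise VIF = {vif_pairwise:.3f}")
print("possible concern" if vif_pairwise > RULE_OF_THUMB else "no sign of severe multicollinearity")
```

A value this close to 1 means the correlation inflates the coefficient variances only modestly under this simplification.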
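Because R-squared can only fall when a variable is dropped, the adjusted R-squared, which penalises extra regressors, is a fairer comparison of the two models. The Python sketch below computes it from the reported (rounded) R-squared values; the adjusted figures are my own calculation and were not reported in the original output.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared for a model with k regressors and n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 935
adj_unrestricted = adjusted_r2(0.1924, n, k=7)  # model including sibs
adj_restricted = adjusted_r2(0.1919, n, k=6)    # model without sibs

print(f"adjusted R-squared, unrestricted = {adj_unrestricted:.4f}")
print(f"adjusted R-squared, restricted   = {adj_restricted:.4f}")
```

The restricted model has the slightly higher adjusted R-squared, which is what is expected when the dropped variable has an absolute t-statistic below 1.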
The next test is a test for heteroskedasticity, which shows whether the variance of the errors differs across observations. The results are as follows:-
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of wage
chi2(1) = 38.52
Prob > chi2 = 0.0000
I reject the null hypothesis of constant variance (homoskedasticity): the calculated chi-squared statistic of 38.52 is greater than the 5% critical value for one degree of freedom (3.84), and the p-value is 0.0000. The estimated model therefore shows evidence of heteroskedasticity. Heteroskedasticity is quite common in cross-sectional data, so I was expecting my model to be heteroskedastic.
As my model is subject to heteroskedasticity, the regression is re-run with heteroskedasticity-robust standard errors to correct for this problem.
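The decision for the Breusch-Pagan test can be verified from the reported statistic alone. The Python sketch below is an arithmetic check, not part of the Stata output. It relies on the fact that a chi-squared variable with one degree of freedom is a squared standard normal, so its 5% critical value is the square of 1.96 and its upper-tail probability at x is erfc(sqrt(x / 2)).

```python
import math
from statistics import NormalDist

chi2_stat = 38.52  # reported chi2(1) statistic

# 5% critical value of chi-squared(1): square of the two-sided normal value.
critical_5pct = NormalDist().inv_cdf(0.975) ** 2

# Upper-tail probability of chi-squared(1) at the observed statistic.
p_value = math.erfc(math.sqrt(chi2_stat / 2))

print(f"critical value = {critical_5pct:.3f}")
print(f"p-value = {p_value:.2e}")
print("reject constant variance" if chi2_stat > critical_5pct else "do not reject")
```

The p-value is far smaller than the four decimal places Stata displays, which is why the output shows 0.0000.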
reg wage hours IQ age married educ black, robust
Linear regression Number of obs = 935
F( 6, 928) = 36.01
Prob > F = 0.0000
R-squared = 0.1919
Root MSE = 364.68
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
hours | -3.479966 2.238777 -1.55 0.120 -7.873618 .9136859
IQ | 4.246344 .9319879 4.56 0.000 2.417296 6.075392
age | 19.55043 3.905086 5.01 0.000 11.88661 27.21425
married| 174.8716 34.84146 5.02 0.000 106.4944 243.2488
educ | 44.72156 6.481587 6.90 0.000 32.0013 57.44183
black | -121.036 32.0796 -3.77 0.000 -183.993 -58.07905
_cons | -708.9586 206.279 -3.44 0.001 -1113.786 -304.1311
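The effect of the robust standard errors on individual significance can be checked by recomputing the t-statistics. The Python sketch below is an arithmetic check, not part of the Stata analysis; it copies the coefficients and both sets of standard errors from the restricted and robust outputs and assumes the large-sample critical value of 1.96.

```python
# Coefficients are identical in the restricted and robust regressions.
coef = {"hours": -3.479966, "IQ": 4.246344, "age": 19.55043,
        "married": 174.8716, "educ": 44.72156, "black": -121.036}

# Conventional standard errors (restricted regression output).
se_conventional = {"hours": 1.667106, "IQ": 0.9897316, "age": 3.871693,
                   "married": 38.96859, "educ": 6.368262, "black": 38.96662}

# Heteroskedasticity-robust standard errors (robust regression output).
se_robust = {"hours": 2.238777, "IQ": 0.9319879, "age": 3.905086,
             "married": 34.84146, "educ": 6.481587, "black": 32.0796}

CRITICAL = 1.96  # large-sample two-sided 5% critical value (assumption)

def is_significant(name: str, se: dict) -> bool:
    return abs(coef[name] / se[name]) > CRITICAL

# Variables whose 5% significance differs between the two sets of errors.
changed = [v for v in coef
           if is_significant(v, se_conventional) != is_significant(v, se_robust)]

for v in coef:
    print(f"{v:8s} t = {coef[v] / se_conventional[v]:6.2f}  "
          f"robust t = {coef[v] / se_robust[v]:6.2f}")
print("significance changes for:", changed)
```

Only hours changes status: its t-statistic moves from about -2.09 to about -1.55, so it is no longer significant at the 5% level once robust standard errors are used.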
The robust regression reports White heteroskedasticity-corrected standard errors; the coefficient estimates themselves are unchanged from the restricted model. The coefficient values can be taken as the β values for my model. Being married (174.8716) has the largest coefficient for the dependent variable, wages (Yi). As married is a dummy variable (married = 1, not married = 0), being married is associated with gross monthly earnings that are 174.8716 dollars higher.
Education (educ) has a large and highly significant coefficient of 44.72156, which means that each additional year of education increases the individual's gross monthly earnings by 44.72156 dollars. Age also has a positive effect on gross monthly earnings: as the individual's age increases by one year, gross monthly earnings increase by 19.55043 dollars for an individual in 2000 in Boston. IQ also has a positive impact on earnings. The coefficient of 4.246344 shows a weak positive effect: each additional IQ point increases gross monthly earnings by 4.246344 dollars.
Hours has an inverse relationship with wages. As the individual increases average weekly hours by one, gross monthly earnings fall by 3.479966 dollars, although with robust standard errors this coefficient is no longer significant at the 5% level (p = 0.120). Finally, the variable black has a large negative relationship with gross earnings. As black is a dummy variable (black = 1, not black = 0), being black is associated with gross monthly earnings that are 121.036 dollars lower.
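As a rough illustration of how the coefficients combine, the Python sketch below computes a fitted monthly wage from the final regression. The profile values (40 hours, IQ of 100, age 33, 12 years of education) are made up for illustration and do not come from the dataset; the function name predict_wage is likewise hypothetical.

```python
# Coefficients from the final (robust) regression output.
COEF = {"hours": -3.479966, "IQ": 4.246344, "age": 19.55043,
        "married": 174.8716, "educ": 44.72156, "black": -121.036}
CONS = -708.9586

def predict_wage(profile: dict) -> float:
    """Fitted gross monthly earnings in dollars for one individual."""
    return CONS + sum(COEF[name] * profile[name] for name in COEF)

# Hypothetical individual (illustrative values, not taken from the data).
person = {"hours": 40, "IQ": 100, "age": 33, "married": 1, "educ": 12, "black": 0}

baseline = predict_wage(person)
one_more_year = predict_wage({**person, "educ": 13})
if_black = predict_wage({**person, "black": 1})

print(f"fitted wage = {baseline:.2f}")
print(f"effect of one more year of education = {one_more_year - baseline:.2f}")
print(f"effect of black = 1 = {if_black - baseline:.2f}")
```

Changing one variable while holding the others fixed reproduces the corresponding coefficient, which is the sense in which each coefficient is interpreted above.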
Before obtaining the results, I had a number of initial ideas which differ from the results obtained. My initial expectation was that working more hours would raise an individual's monthly income, but the coefficient on hours is negative. In the dataset used, being married and being black had the largest effects on the monthly income of an individual. The final regression showed that being black lowered gross monthly earnings by 121.036 dollars and being married raised them by 174.8716 dollars. The result for black does not come as a surprise: the article discussed above reports that black workers earn less than white workers, citing lower average education and experience as well as discrimination in hiring. Age had less of an impact on monthly wages, with only a 19.55043 dollar increase in wages for a one-year increase in age. I found this relatively strange, as I would have expected older individuals to be paid noticeably more than younger individuals. However, younger individuals could be more qualified than older individuals, which could explain why age did not have as much impact on the monthly wage of an individual. If I were to do this project again I would include a sex variable. I believe that the gender of an individual would have a considerable effect on the monthly wage, and including it would make my results more useful.
I had relatively low R-squared values for my unrestricted model. This is because the variables in the unrestricted model explain only a small part of the variation in the dependent variable. If I were to do this project again, I would take this into consideration and use more independent variables to help explain the dependent variable; one would be the sex of the individual, as discussed above. I would also like to use data from a variety of different states rather than only Boston, as this would make my model more widely applicable.
I was not able to add an independent variable to the dataset myself, as the variables came with the dataset, but I would search for further datasets containing variables which I would expect to affect wages.