Only do questions 3, 4, and 5.
3. This question requires you to download data obtained from Statistics Canada. If you are working on campus go to www.odesi.ca (off campus users must first sign into the McMaster library via libaccess at library.mcmaster.ca/libaccess, search for odesi via the library search facilities then select odesi from these search results). Next, select the “Find data” field in odesi and search for “Labour Force Survey
June, 2020”, then scroll down and select the Labour Force Survey, June 2020 [Canada]. Next, click on the “Explore & Download” icon, then click on the download icon (i.e., the square diskette icon along the upper right of the browser pane), then click on “Select Data Format” and scroll down and select “Comma Separated Value file” (csv). After a brief pause, this will download the data to your hard drive (you may have to extract the file from a zip archive, depending on which operating system you are using). Finally, make sure that you place this csv file in the same directory/folder as your R code file (the csv file ought to have the name LFS-71M0001-E-2020-June_F1.csv), and in RStudio select the menu item Session -> Set Working Directory -> To Source File Location. There will be another file with (almost) the same name but with the extension .pdf; this is the documentation that describes the variables in this data set. Note that it would be prudent to retain this file as we will use it in future assignments (this question is worth 8 marks).
Next, open RStudio, make sure this csv file and your R Markdown script are in the same directory (in RStudio open the Files tab (lower right pane by default) and refresh the file listing if necessary). Then read the file as follows:
lfp <- read.csv("LFS-71M0001-E-2020-June_F1.csv")
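If the read fails, the usual cause is that the working directory does not contain the csv file. A minimal sketch of a defensive read, assuming the file name given above (the object name csv_name is introduced here only for illustration):

```r
# File name as stated in the assignment; csv_name is a hypothetical helper name.
csv_name <- "LFS-71M0001-E-2020-June_F1.csv"

if (file.exists(csv_name)) {
  lfp <- read.csv(csv_name)
  # Quick sanity checks: dimensions, and whether the earnings variable is present.
  print(dim(lfp))
  print("HRLYEARN" %in% names(lfp))
} else {
  # getwd() shows where R is currently looking for the file.
  message("File not found in working directory: ", getwd())
}
```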
This data set contains some interesting variables on the labour force status of a random subset of Canadians. We will focus on the variable HRLYEARN (hourly earnings) described on page 22 of the pdf file LFS-71M0001-E-2020-June.pdf. We will also consider other variables so that we can condition our analysis on them by restricting attention to subsets of the data, e.g., full-time workers only (FTPTMAIN==1) reporting positive earnings. We also look at the highest educational attainment for people in the survey and consider both high school graduates (EDUC==2) and those holding a bachelor's degree (EDUC==5). To construct these subsets we can use the R command subset as follows (the ampersand & is the logical "and" operator; see ?subset for details on the subset command):
hs <- subset(lfp, FTPTMAIN==1 & EDUC==2 & HRLYEARN > 0)$HRLYEARN
ba <- subset(lfp, FTPTMAIN==1 & EDUC==5 & HRLYEARN > 0)$HRLYEARN
These commands simply tell R to take the subset of the data frame lfp consisting of full-time workers who report positive earnings and hold either a high school diploma or a university bachelor's degree, then retain only the variable HRLYEARN and store it in the variables named hs (hourly earnings for high school graduates) and ba (hourly earnings for university graduates). The following questions ask you to compute various descriptive statistics and graphical summaries of these two variables.
Note that nothing will be printed out by running the two lines above – they simply create subsets of the data for subsequent use.
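To see what subset() does with these conditions, here is a small sketch on a made-up data frame. The data frame toy and its values are hypothetical and are not taken from the survey; only the variable names match the commands above.

```r
# Hypothetical toy data standing in for lfp; the values are invented.
toy <- data.frame(
  FTPTMAIN = c(1, 1, 2, 1, 1, 1),
  EDUC     = c(2, 5, 2, 2, 5, 2),
  HRLYEARN = c(20.5, 35.0, 15.0, NA, 42.25, 0)
)

# A row is kept only if all three conditions hold.
# subset() also drops rows where the condition evaluates to NA.
toy_hs <- subset(toy, FTPTMAIN == 1 & EDUC == 2 & HRLYEARN > 0)$HRLYEARN
toy_ba <- subset(toy, FTPTMAIN == 1 & EDUC == 5 & HRLYEARN > 0)$HRLYEARN

print(toy_hs)  # earnings for full-time high school rows with positive earnings
print(toy_ba)  # earnings for full-time bachelor's rows with positive earnings
```

In this toy example the part-time row, the row with missing earnings, and the row with zero earnings are all excluded.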
i. Report the five-number summary for each subset (hint: fivenum(hs) etc.). Indicate what each number tells us (hint: see help by typing ?fivenum in the console pane).
ii. What can you say about relative wages of high school and university graduates?
iii. Using Sturges’ rule, how many classes would you construct for the hs and ba wage data (hint: length() gives you the length of the vector, and log10() may also be useful, so something like round(1+3.3*log10(length(hs))) might do the trick for the hs data at least)?
iv. Plot histograms for the hs and ba data on separate graphs (hint: hist()).
v. Does the number of classes in each histogram correspond to Sturges’ rule?
vi. Plot density curves for the hs and ba data on the same graph and add a legend (hint: first use something like plot(density(…),col="blue",lty=1), where you fill in the (…) part with the name of your data object, e.g., hs, then lines(density(…),col="red",lty=2), then see the help page by typing ?legend in the console pane). Note that you can add a legend using something like
legend("topright",c("High School","University"),
lty=c(1,2),col=c("blue","red"),bty="n")
vii. What do these density curves tell us about the distribution of hourly wages for high school versus university graduates?
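The workflow in parts i through vi can be sketched on simulated data before running it on hs and ba. Everything below uses made-up log-normal draws (sim_hs and sim_ba are hypothetical names, and the distribution parameters are arbitrary), so the numbers say nothing about the actual survey.

```r
set.seed(1)  # arbitrary seed so the simulated draws are reproducible

# Simulated stand-ins for hs and ba; these are NOT survey values.
sim_hs <- rlnorm(500, meanlog = 3.0, sdlog = 0.4)
sim_ba <- rlnorm(300, meanlog = 3.4, sdlog = 0.5)

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum.
print(fivenum(sim_hs))
print(fivenum(sim_ba))

# Sturges' rule in the form given in the hint.
k_hs <- round(1 + 3.3 * log10(length(sim_hs)))
k_ba <- round(1 + 3.3 * log10(length(sim_ba)))

# hist() treats the requested number of breaks as a suggestion and then
# chooses "pretty" break points, so the bars drawn may differ from k_hs / k_ba.
h_hs <- hist(sim_hs, main = "Simulated high school", xlab = "Hourly earnings")
h_ba <- hist(sim_ba, main = "Simulated university", xlab = "Hourly earnings")
cat("hs: Sturges =", k_hs, ", classes drawn =", length(h_hs$counts), "\n")
cat("ba: Sturges =", k_ba, ", classes drawn =", length(h_ba$counts), "\n")

# Overlaid density curves; axis limits cover both curves so neither is clipped.
d_hs <- density(sim_hs)
d_ba <- density(sim_ba)
plot(d_hs, col = "blue", lty = 1,
     xlim = range(d_hs$x, d_ba$x), ylim = c(0, max(d_hs$y, d_ba$y)),
     main = "Simulated hourly earnings", xlab = "Hourly earnings")
lines(d_ba, col = "red", lty = 2)
legend("topright", c("High School", "University"),
       lty = c(1, 2), col = c("blue", "red"), bty = "n")
```

Setting xlim and ylim explicitly is a design choice: plot() sizes the axes for the first curve only, so a taller or wider second curve would otherwise be cut off.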
4. Consider the following data on annual profits (in millions of dollars) for all firms in the textbook publishing industry in Canada (ignore the ## [1] and ## [13] that appear at the beginning of each line; this is simply the way R displays a vector of numbers):
## [1] 7.20 8.85 17.80 10.40 10.60 18.60 12.30 3.67 6.57 7.77 16.10 11.80
## [13] 12.00 10.60 7.22
To set these values in a vector in R, if desired, you can use the command profits <- c(…) where … are the values above separated by commas, e.g., profits <- c(3.67, 6.57, etc.)
i. How many observations are there (i.e., what is n, the sample size?)
ii. What is the minimum, maximum, and range?
iii. How many classes would you create if you used Sturges’ rule?
iv. What are the class widths and class boundaries based on your answers to the previous two questions, using Sturges’ rule, the sample minimum as the first lower class boundary, and the sample maximum as the last upper class boundary?
v. Complete the table below showing the absolute frequency, relative frequency, cumulative frequency, and cumulative relative frequency for the above data. For this question you will need to do some manual data entry in the table skeleton provided below after you have figured out what the counts are based on your answers to the previous set of questions. In particular, you are to use Sturges’ rule (above) to obtain the desired number of classes, and use the range of the data (above) when constructing your class boundaries (note that you need to have a blank line between each new row that you add to the table, and the last class must be closed at the right – this question is worth 8 marks).
                                      Cumulative   Cumulative
           Absolute     Relative      Absolute     Relative
Class      Frequency    Frequency     Frequency    Frequency

[…,…)      …            …             …            …

[…,…)      …            …             …            …

[…,…]      …            …             …            …
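The counts for a table like this can be cross-checked in R with cut() and table(). The sketch below uses a small hypothetical vector x (not the profits data) and assumes equal-width classes running from the sample minimum to the sample maximum, as the question describes.

```r
# Hypothetical values for illustration only (not the profits data above).
x <- c(2, 4, 4, 5, 7, 8, 9, 10)

n <- length(x)
k <- round(1 + 3.3 * log10(n))        # number of classes by Sturges' rule
width <- (max(x) - min(x)) / k        # equal class width spanning the range
breaks <- seq(min(x), max(x), length.out = k + 1)

# right = FALSE gives classes of the form [a, b);
# include.lowest = TRUE closes the last class on the right: [a, b].
classes <- cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE)

abs_freq <- as.vector(table(classes))
freq_table <- data.frame(
  Class      = levels(classes),
  AbsFreq    = abs_freq,
  RelFreq    = abs_freq / n,
  CumAbsFreq = cumsum(abs_freq),
  CumRelFreq = cumsum(abs_freq) / n
)
print(freq_table)
```

With real data the break points are often not round numbers, so it is worth confirming that the absolute frequencies sum to n and that the last cumulative relative frequency equals 1.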
5. Since we use the summation operator ($\sum_{i=1}^{n}$) often in class, let’s make sure we understand how to calculate objects that can be expressed succinctly using this operator.
i. Care must be exercised when expanding certain sums and quantities. Let the sample size be $n = 3$, and let $X_1 = 1$, $X_2 = -1$, and $X_3 = 3$. Demonstrate in R that it is generally not true that $\sum_{i=1}^{n} X_i^2 = \left(\sum_{i=1}^{n} X_i\right)^2$ (this question is worth 2 marks).
ii. Using the same data as in the previous question, compute the sample mean $\bar{X} = \sum_{i=1}^{n} X_i / n$, then compute the sample standard deviation $\hat{\sigma} = \sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 / (n - 1)}$ in two ways: longhand (you can use R and use longhand notation, e.g., X[1], X[2], and X[3] or 1, -1, and 3, whichever you prefer), then using R functions such as mean() and sd() (this question is worth 2 marks).
iii. Express $\sum_{i=1}^{n} K$, where K is a constant (i.e., a number that does not change, hence has no subscript i), in terms of n and K only (hint: a constant does not have a subscript as it does not change with i, but it is being added/summed, so type out a string of n constants etc.). Then for K = 3 and n = 5 determine $\sum_{i=1}^{n} K$ using your result expressed purely in terms of n and K (i.e., without a summation sign; this question is worth 2 bonus marks, and you do not use R, rather use your powerful sense of logic and type out your answer with an explanation).
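For parts i and ii, the following sketch shows one way to lay out the R calculations using the three values given in the question. The object names (sum_of_squares, xbar_long, and so on) are chosen here for illustration. Part iii is left out because it is to be answered without R.

```r
X <- c(1, -1, 3)
n <- length(X)

# Part i: the sum of the squares versus the square of the sum.
sum_of_squares <- sum(X^2)   # X1^2 + X2^2 + X3^2
square_of_sum  <- sum(X)^2   # (X1 + X2 + X3)^2
print(c(sum_of_squares, square_of_sum))
print(sum_of_squares == square_of_sum)

# Part ii, longhand: write out each term of the sums explicitly.
xbar_long <- (X[1] + X[2] + X[3]) / n
sd_long <- sqrt(((X[1] - xbar_long)^2 +
                 (X[2] - xbar_long)^2 +
                 (X[3] - xbar_long)^2) / (n - 1))

# Part ii, using built-in functions; sd() divides by n - 1.
xbar_builtin <- mean(X)
sd_builtin   <- sd(X)

print(c(xbar_long, xbar_builtin))
print(c(sd_long, sd_builtin))
```

The two approaches in part ii should agree, which is a useful check that the longhand expression was typed correctly.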