Hints for EDA of the Old Faithful Geyser Data

Some Hints if you’re Stuck

Here are some initial questions you might attempt to answer:

1. How many data points exist in this data set? What does the data represent?
2. What type of data would you classify this as? (I.e., numerical vs. categorical? Discrete vs. continuous?)
3. Calculate the mean and median of the faithful$eruptions data.
4. Create a line graph of the faithful$eruptions data. What trends or patterns do you see?
5. Use the table() function to find counts of the different values in the eruption data. Does this work, and is the result useful? Why or why not?
6. Create a histogram of the faithful$eruptions data. What is a good value to use for the breaks parameter? Can you set the axis labels to meaningful values? How would you describe the distribution? Based on this plot, what is your estimate of the mode of the data?

Now, using your results from both the calculations in (3) and the histogram in (6), how would you describe the average eruption length to someone? What is the value and why did you choose that?
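If you want a starting point for the mean/median question and the histogram question, here is a minimal sketch. The value breaks=20 and the label text are assumptions chosen for illustration, not "the" right answer; try several values of breaks and watch how the shape changes.

```r
# Eruption durations in minutes, from the built-in faithful data frame
eruptions <- faithful$eruptions

# The two usual measures of center
eruption_mean <- mean(eruptions)
eruption_median <- median(eruptions)

# Histogram with readable axis labels.
# breaks = 20 is only a suggestion: hist() may choose a nearby "pretty" number of bins.
h <- hist(eruptions, breaks = 20,
          xlab = "Eruption length (min)", ylab = "Count",
          main = "Eruption lengths at Old Faithful")

# Overlay the mean (red) and median (green) to compare them with the peaks
abline(v = eruption_mean, col = "red")
abline(v = eruption_median, col = "green")
```

Look at where the two vertical lines fall relative to the tallest bars before deciding which number you would report as the "average".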

Inspecting the Data

How many rows and how many columns are there? You can probably already see this above, but if you needed to calculate it, you could write:

dim(faithful)

## [1] 272   2

and similarly, to see the column names, we could write:

names(faithful)

## [1] "eruptions" "waiting"
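A few other base R functions give a quick overview of the same data frame. This is just a sketch of commonly used calls; none of them is required for the exercise, and the variable names n_rows and n_cols are illustrative.

```r
# Structure: one line per column with its type and first few values
str(faithful)

# First six rows of the data frame
head(faithful)

# Min, quartiles, median and mean for every column
summary(faithful)

# Row and column counts individually; together they match dim(faithful)
n_rows <- nrow(faithful)
n_cols <- ncol(faithful)
```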

Plots of the Data

Here’s what you most likely found:

First the point plot:

plot(faithful$waiting, type="p", ylab="Waiting time (min)")

Then a line plot, with the mean added in red:

plot(faithful$waiting, type="l", ylab="Waiting time (min)")
abline(h=mean(faithful$waiting), col="red")

It’s almost hard to believe those are the same data, right? Just by connecting the dots, a much stronger pattern emerges. There seems to be some regularity to this. I don’t know what it is yet.

What patterns do people see in this data?

plot(table(faithful$waiting), xlab="Waiting Time (min)", ylab="Count")
abline(v=mean(faithful$waiting), col="red")
abline(v=median(faithful$waiting), col="green")

Ok, what does this show? If you’re confused about this, think about (look at?!) what the table() function outputs and recognize that’s what we’re plotting here. What did table() do on the WDL data?

table(faithful$waiting)

## 
## 43 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 62 63 64 65 66 67 68 69 70 
##  1  3  5  4  3  5  5  6  5  7  9  6  4  3  4  7  6  4  3  4  3  2  1  1  2  4 
## 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 96 
##  5  1  7  6  8  9 12 15 10  8 13 12 14 10  6  6  2  6  3  6  1  1  2  1  1

table(faithful$waiting)[table(faithful$waiting)==max(table(faithful$waiting))]

## 78 
## 15
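The expression above works, but it calls table() three times. A shorter equivalent, offered as a sketch with illustrative variable names, stores the table once and uses which.max(). Note that which.max() returns only the first maximum, so it would hide a tie, whereas the expression above returns every tied value.

```r
# Count how often each waiting time occurs
waiting_counts <- table(faithful$waiting)

# which.max() gives the position of the first largest count;
# the name of that element is the waiting time itself (stored as text)
mode_position <- which.max(waiting_counts)
waiting_mode <- as.numeric(names(waiting_counts)[mode_position])
mode_count <- as.integer(waiting_counts[mode_position])
```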

Calculate the Descriptive Statistics

For both the faithful$eruptions and faithful$waiting data.

Can you calculate the mean, median and mode?
- Hint: use sort() and/or table(), or any other information above, to calculate the mode.
Can you calculate the range, standard deviation and IQR?
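One way to compute these summaries for both columns at once is sketched below. The helper describe() is hypothetical, defined here only for illustration. The mode is left out because base R has no built-in function for the statistical mode.

```r
# describe() is a hypothetical helper defined here, not a base R function.
# It returns the same set of summaries for any numeric vector.
describe <- function(x) {
  c(mean   = mean(x),
    median = median(x),
    min    = min(x),
    max    = max(x),
    sd     = sd(x),
    IQR    = IQR(x))
}

# sapply() over a data frame applies the function column by column
# and returns a matrix with one column per variable
stats <- sapply(faithful, describe)
print(round(stats, 2))
```

The min and max rows together give the range; range(x) returns the same two numbers in one call.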

Questions for Discussion

Why are the mean and median different for this data set? See above plot with mean and median added. Also: are the mean and median always different? What is the mode here? How would we calculate it?
What’s the best estimate of how long someone would have to wait for the next eruption? This really comes down to what we mean by the “mean” (ha!)…

Does there seem to be a pattern in the data as shown by the point or line plot? What is the pattern? We talked about this above already, and hopefully it’s obvious that the line plot is superior. It won’t always be. Why is it ‘ok’ to add lines in this case?
Which plot is most useful for understanding the data? I think the answer is: it depends.

Scatter Plot of Eruption length against Wait Time

plot(faithful$waiting, faithful$eruptions, xlab="Waiting time (min)", ylab="Eruption length (min)", ylim=c(0, 6), main="Eruptions at Old Faithful", pch=1, col=2, type="p")
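To put a number on the relationship the scatter plot suggests, one option is the Pearson correlation coefficient. This is a sketch, not part of the original exercise, and the variable name r is illustrative. Keep in mind that a single coefficient summarizes only the linear trend and says nothing about any clustering you may see in the plot.

```r
# Pearson correlation between waiting time and eruption length;
# values near +1 indicate a strong increasing linear relationship
r <- cor(faithful$waiting, faithful$eruptions)
print(r)
```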