---
output:
  pdf_document: default
  html_document: default
---
# Indexing Vectors with [ ] {#vectorindexing}
```{r, echo = FALSE}
knitr::opts_chunk$set(collapse = TRUE)
```
```{r, fig.cap= "", echo = FALSE, fig.align='center'}
# knitr::include_graphics(c("images/legoship.jpg"))
```
```{r, echo = FALSE}
boat.df <- data.frame(
  boat.names = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j"),
  boat.colors = c("black", "green", "pink", "blue", "blue",
                  "green", "green", "yellow", "black", "black"),
  boat.ages = c(143, 53, 356, 23, 647, 24, 532, 43, 66, 86),
  boat.prices = c(53, 87, 54, 66, 264, 32, 532, 58, 99, 132),
  boat.costs = c(52, 80, 20, 100, 189, 12, 520, 68, 80, 100)
)
knitr::kable(boat.df)
```
```{r eval = TRUE}
# Boat sale. Creating the data vectors
boat.names <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
boat.colors <- c("black", "green", "pink", "blue", "blue",
                 "green", "green", "yellow", "black", "black")
boat.ages <- c(143, 53, 356, 23, 647, 24, 532, 43, 66, 86)
boat.prices <- c(53, 87, 54, 66, 264, 32, 532, 58, 99, 132)
boat.costs <- c(52, 80, 20, 100, 189, 12, 520, 68, 80, 100)
# What was the price of the first boat?
boat.prices[1]
# What were the ages of the first 5 boats?
boat.ages[1:5]
# What were the names of the black boats?
boat.names[boat.colors == "black"]
# What were the prices of either green or yellow boats?
boat.prices[boat.colors == "green" | boat.colors == "yellow"]
# Change the price of boat "s" to 100 (no boat is named "s" in this data, so nothing changes)
boat.prices[boat.names == "s"] <- 100
# What was the median price of black boats less than 100 years old?
median(boat.prices[boat.colors == "black" & boat.ages < 100])
# How many pink boats were there?
sum(boat.colors == "pink")
# What proportion of boats were older than 100 years?
mean(boat.ages > 100)
```
By now you should be a whiz at applying functions like `mean()` and `table()` to vectors. However, in many analyses, you won't want to calculate statistics of an entire vector. Instead, you will want to access specific *subsets* of values of a vector based on some criteria. For example, you may want to access values in a specific location in the vector (e.g., the first 10 elements), values that meet some criterion within that vector (e.g., all values greater than 0), or values that meet a criterion in a *different* vector (e.g., all values of age where sex is Female). To access specific values of a vector in R, we use *indexing* with brackets `[]`. In general, whatever you put inside the brackets tells R which values of the vector you want. There are two main ways to use indexing to access subsets of data in a vector: numerical and logical indexing.
## Numerical Indexing
With numerical indexing, you enter a vector of integers corresponding to the values in the vector you want to access in the form `a[index]`, where `a` is the vector, and `index` is a vector of index values. For example, let's use numerical indexing to get values from our boat vectors.
```{r}
# What is the first boat name?
boat.names[1]
# What are the first five boat colors?
boat.colors[1:5]
# What is every second boat age among the first five?
boat.ages[seq(1, 5, by = 2)]
```
You can use any indexing vector as long as it contains integers. You can even access the same elements multiple times:
```{r}
# What is the first boat age (3 times)
boat.ages[c(1, 1, 1)]
```
If it makes your code clearer, you can define an indexing object before doing your actual indexing. For example, let's define an object called `my.index` and use this object to index our data vector:
```{r}
my.index <- 3:5
boat.names[my.index]
```
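Because an index is just a vector of integers, it can also be computed rather than typed out. The following is a minimal sketch using a made-up vector `ex.lengths` (a hypothetical name, not part of the boat data) that uses `length()` to reach the last element, reverse the order, and take every second element of the whole vector:

```{r}
# Hypothetical example data (not part of the boat sale)
ex.lengths <- c(12, 30, 25, 18, 40)

# The last element: length() gives the final position
ex.lengths[length(ex.lengths)]

# All elements in reverse order
ex.lengths[length(ex.lengths):1]

# Every second element across the whole vector
ex.lengths[seq(1, length(ex.lengths), by = 2)]
```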
## Logical Indexing
```{r, fig.cap= "Logical indexing. Good for R aliens and R pirates.", fig.margin = TRUE, echo = FALSE, out.width = "50%", fig.align='center'}
knitr::include_graphics(c("images/logic.jpg"))
```
The second way to index vectors is with *logical vectors*. A logical vector is a vector that *only* contains TRUE and FALSE values. In R, true values are designated with TRUE, and false values with FALSE. When you index a vector with a logical vector, R will return values of the vector for which the indexing vector is TRUE. If that was confusing, think about it this way: a logical vector, combined with the brackets `[ ]`, acts as a *filter* for the vector it is indexing. It only lets values of the vector pass through for which the logical vector is TRUE.
```{r, fig.cap= "FALSE values in a logical vector are like lots of mini-Gandalfs. In this example, I am indexing a vector x with a logical vector y (y for example could be x > 0, so all positive values of x are TRUE and all negative values are FALSE). The result is a vector of length 2, which are the values of x for which the logical vector y was true. Gandalf stopped all the values of x for which y was FALSE.", fig.margin = TRUE, echo = FALSE, out.width = "50%", fig.align='center'}
knitr::include_graphics(c("images/indexgandolf.png"))
```
You could create logical vectors directly using `c()`. For example, I could access every other value of the following vector as follows:
```{r}
a <- c(1, 2, 3, 4, 5)
a[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
```
As you can see, R returns all values of the vector `a` for which the logical vector is TRUE.
```{r comparison, fig.cap = "Logical comparison operators in R", echo = FALSE}
par(mar = rep(.1, 4))
plot(1, xlim = c(0, 1.1), ylim = c(0, 10),
     xlab = "", ylab = "", xaxt = "n", yaxt = "n",
     type = "n")
text(rep(0, 9), 9:1,
     labels = c("==", "!=", "<", "<=",
                ">", ">=", "|", "!", "%in%"),
     adj = 0, cex = 2)
text(rep(.2, 9), 9:1,
     labels = c("equal", "not equal", "less than",
                "less than or equal", "greater than",
                "greater than or equal", "or", "not", "in the set"),
     adj = 0, cex = 2)
```
However, creating logical vectors using `c()` is tedious. Instead, it's better to create logical vectors from *existing vectors* using comparison operators like `<` (less than), `==` (equal to), and `!=` (not equal to). The most common comparison operators are listed in Figure \@ref(fig:comparison). For example, let's create some logical vectors from our boat vectors:
```{r}
# Which ages are > 100?
boat.ages > 100
# Which ages are equal to 23?
boat.ages == 23
# Which boat names are equal to c?
boat.names == "c"
```
You can also create logical vectors by comparing a vector to another vector of the same length. When you do this, R will compare values in the same position (i.e., the first values will be compared, then the second values, etc.). For example, we can compare the `boat.prices` and `boat.costs` vectors to see which boats sold for a higher price than their cost:
```{r}
# Which boats had a higher price than cost?
boat.prices > boat.costs
# Which boats had a lower price than cost?
boat.prices < boat.costs
```
Once you've created a logical vector using a comparison operator, you can use it to index any vector with the same length. Here, I'll use logical vectors to get the prices of boats whose ages were greater than 100:
```{r}
# What were the prices of boats older than 100?
boat.prices[boat.ages > 100]
```
Here's how logical indexing works step-by-step:
```{r}
# Which boats are older than 100 years?
boat.ages > 100
# Writing the logical index by hand (you'd never do this!)
# Show me all of the boat prices where the logical vector is TRUE:
boat.prices[c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)]
# Doing it all in one step! You get the same answer:
boat.prices[boat.ages > 100]
```
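It can also help to store the logical vector in its own object and reuse it. As a sketch, with small made-up vectors (`ex.names`, `ex.ages`, and `ex.prices` are hypothetical, chosen only for illustration), one stored filter can index several vectors of the same length:

```{r}
# Hypothetical example data
ex.names  <- c("a", "b", "c", "d")
ex.ages   <- c(143, 53, 356, 23)
ex.prices <- c(53, 87, 54, 66)

# Store the filter once...
ex.old <- ex.ages > 100
ex.old

# ...then reuse it to index different vectors
ex.names[ex.old]
ex.prices[ex.old]
```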
### `&` (and), `|` (or), `%in%`
In addition to using single comparison operators, you can combine multiple logical vectors using the OR (`|`) and AND (`&`) operators. For each position, the OR `|` operation returns TRUE if any of the logical vectors is TRUE, while the AND `&` operation returns TRUE only if all of the logical vectors are TRUE. This is especially powerful when you want to create a logical vector based on criteria from multiple vectors.
For example, let's create a logical vector indicating which boats had a price greater than 200 OR less than 100, and then use that vector to see what the names of these boats were:
```{r}
# Which boats had prices greater than 200 OR less than 100?
boat.prices > 200 | boat.prices < 100
# What were the NAMES of these boats
boat.names[boat.prices > 200 | boat.prices < 100]
```
You can combine as many logical vectors as you want (as long as they all have the same length!):
```{r}
# Boat names of boats with a color of black OR with a price > 100
boat.names[boat.colors == "black" | boat.prices > 100]
# Names of blue boats with a price greater than 200
boat.names[boat.colors == "blue" & boat.prices > 200]
```
Combining logical vectors in this way lets you build increasingly complex selection criteria. For example, the following logical vector returns TRUE for cases where the boat colors are black OR brown, AND where the price was less than 100:
```{r}
# Which boats were either black or brown, AND had a price less than 100?
(boat.colors == "black" | boat.colors == "brown") & boat.prices < 100
# What were the names of these boats?
boat.names[(boat.colors == "black" | boat.colors == "brown") & boat.prices < 100]
```
When using multiple criteria, make sure to use parentheses where appropriate. Without the parentheses above, R would evaluate the `&` before the `|` and return a different answer.
The `%in%` operation helps you to easily create multiple OR arguments. Imagine you have a vector of categorical data that can take on many different values. For example, you could have a vector `x` indicating people's favorite letters.
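The effect of the parentheses can be checked on a small made-up example (`ex.col` and `ex.cost` are hypothetical names used only here). The second expression below drops the parentheses, so R reads it as black OR (brown AND cheaper than 100):

```{r}
# Hypothetical example data
ex.col  <- c("black", "brown", "black", "green")
ex.cost <- c(50, 80, 150, 60)

# With parentheses: (black OR brown) AND cheaper than 100
(ex.col == "black" | ex.col == "brown") & ex.cost < 100

# Without parentheses: black OR (brown AND cheaper than 100)
ex.col == "black" | ex.col == "brown" & ex.cost < 100
```

The two results differ in the third position: the expensive black item is excluded by the first expression but included by the second.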
```{r}
x <- c("a", "t", "a", "b", "z")
```
Now, let's say you want to create a logical vector indicating which values are either a or b or c or d. You could create this logical vector with multiple | (OR) commands:
```{r}
x == "a" | x == "b" | x == "c" | x == "d"
```
However, this takes a long time to write. Thankfully, the `%in%` operation allows you to combine multiple OR comparisons much faster. To use the `%in%` function, just put it in between the original vector, and a new vector of possible values. The `%in%` function goes through every value in the vector x, and returns TRUE if it finds it in the vector of possible values -- otherwise it returns FALSE.
```{r}
x %in% c("a", "b", "c", "d")
```
As you can see, the result is identical to our previous result.
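If you would rather check this than eyeball it, `identical()` compares the two results directly. This sketch uses its own copy of the letters (`ex.letters` and `ex.set` are hypothetical names); storing the set of possible values in an object also makes it easy to reuse:

```{r}
# Hypothetical copies of the data and the set of possible values
ex.letters <- c("a", "t", "a", "b", "z")
ex.set <- c("a", "b", "c", "d")

# The long way, with repeated OR comparisons
ex.or <- ex.letters == "a" | ex.letters == "b" | ex.letters == "c" | ex.letters == "d"

# The short way, with %in%
ex.in <- ex.letters %in% ex.set

# Are the two logical vectors exactly the same?
identical(ex.or, ex.in)
```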
### Counts and percentages from logical vectors
Arithmetic functions such as `sum()` and `mean()` treat TRUE values as 1 and FALSE values as 0. This allows us to easily answer questions like "How many values in a data vector are greater than 0?" or "What percentage of values are equal to 5?" by applying the `sum()` or `mean()` function to a logical vector.
We'll start with a vector x of length 8, containing 3 positive numbers and 5 negative numbers.
```{r}
x <- c(1, 2, 3, -5, -5, -5, -5, -5)
```
We can create a logical vector to see which values are greater than 0:
```{r}
x > 0
```
Now, we'll use `sum()` and `mean()` on that logical vector to see how many of the values in x are positive, and what proportion are positive. We should find that there are 3 TRUE values, and that 37.5\% of the values (3 / 8) are TRUE.
```{r}
sum(x > 0)
mean(x > 0)
```
This is a *really* powerful tool. Pretty much *any* time you want to answer a question like "How many of X are Y" or "What percent of X are Y", you can use the `sum()` or `mean()` function with a logical vector as an argument.
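Note that `mean()` of a logical vector returns a proportion between 0 and 1; multiply it by 100 to get a percentage. A minimal sketch with a made-up vector `ex.vals` (a hypothetical name):

```{r}
# Hypothetical example data
ex.vals <- c(4, -1, 7, 0, -3)

# How many values are positive?
sum(ex.vals > 0)

# mean() is the number of TRUE values divided by the length
mean(ex.vals > 0)
sum(ex.vals > 0) / length(ex.vals)

# The same result expressed as a percentage
mean(ex.vals > 0) * 100
```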
### Additional Logical functions
R has lots of special functions that take vectors as arguments, and return logical vectors based on multiple criteria. For example, you can use the `is.na()` function to test which values of a vector are missing. Table \@ref(tab:logicalfunctions) contains some that I frequently use:
| Function| Description|Example | Result|
|:--------------|:-----------------|:-----------------------|----|
| `is.na(x)`| Which values in x are NA?|`is.na(c(2, NA, 5))` | `r is.na(c(2, NA, 5))`|
| `is.finite(x)`| Which values in x are finite numbers (not NA, NaN, or infinite)? | `is.finite(c(NA, 89, 0))` | `r is.finite(c(NA, 89, 0))`|
| `duplicated(x)`| Which values in x repeat an earlier value? | `duplicated(c(1, 4, 1, 2))` | `r duplicated(c(1, 4, 1, 2))`|
| `which(x)`| Which positions in x are TRUE? | `which(c(TRUE, FALSE, TRUE))` | `r which(c(TRUE, FALSE, TRUE))`|
Table: (\#tab:logicalfunctions) Functions to create and use logical vectors.
Logical vectors aren't just good for indexing; you can also use them to find out which values in a vector satisfy some criterion. To do this, use the function `which()`. If you apply `which()` to a logical vector, R will return the positions of the values that are TRUE. For example:
```{r}
# A vector of sex information
sex <- c("m", "m", "f", "m", "f", "f")
# Which values of sex are m?
which(sex == "m")
# Which values of sex are f?
which(sex == "f")
```
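The positions returned by `which()` are ordinary integers, so they can be used for numerical indexing. As a sketch, with made-up data (`ex.sex`, `ex.height`, and `ex.pos` are hypothetical names), indexing by those positions gives the same values as indexing by the logical vector itself:

```{r}
# Hypothetical example data
ex.sex    <- c("m", "m", "f", "m", "f", "f")
ex.height <- c(180, 175, 160, 170, 165, 158)

# Positions of the "m" values
ex.pos <- which(ex.sex == "m")
ex.pos

# Numerical indexing with those positions...
ex.height[ex.pos]

# ...matches logical indexing with the comparison itself
ex.height[ex.sex == "m"]
```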
## Changing values of a vector
Now that you know how to index a vector, you can easily change specific values in a vector using the assignment (`<-`) operation. To do this, just assign a vector of new values to the indexed values of the original vector:
Let's create a vector `a` which contains 10 1s:
```{r}
a <- rep(1, 10)
```
Now, let's change the first 5 values in the vector to 9s by indexing the first five values, and assigning the value of 9:
```{r}
a[1:5] <- 9
a
```
Now let's change the last 5 values to 0s. We'll index the values 6 through 10, and assign a value of 0.
```{r}
a[6:10] <- 0
a
```
Of course, you can also change values of a vector using a logical indexing vector. For example, let's say you have a vector of numbers that should be from 1 to 10. If values are outside of this range, you want to set them to either the minimum (1) or maximum (10) value:
```{r}
# x is a vector of numbers that should be from 1 to 10
x <- c(5, -5, 7, 4, 11, 5, -2)
# Assign values less than 1 to 1
x[x < 1] <- 1
# Assign values greater than 10 to 10
x[x > 10] <- 10
# Print the result!
x
```
As you can see, our new values of x are now never less than 1 or greater than 10!
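The same clamping can also be sketched with base R's `pmin()` and `pmax()`, which take element-wise minima and maxima. This is an alternative to the indexing approach above, not a replacement for it; `ex.raw` and `ex.clamped` are hypothetical names:

```{r}
# Hypothetical copy of the data
ex.raw <- c(5, -5, 7, 4, 11, 5, -2)

# Cap values at 10 with pmin(), then raise values below 1 up to 1 with pmax()
ex.clamped <- pmax(pmin(ex.raw, 10), 1)
ex.clamped
```

Unlike assignment with indexing, this leaves the original vector untouched and returns a new one.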
**A note on indexing...**
Technically, when you assign new values to a vector, you should always assign a vector of the same length as the number of values that you are updating. For example, given a vector a with 10 1s:
```{r}
a <- rep(1, 10)
```
To update the first 5 values with 5 9s, we should assign a new vector of 5 9s:
```{r}
a[1:5] <- c(9, 9, 9, 9, 9)
a
```
However, if we repeat this code but just assign a single 9, R will repeat the value as many times as necessary to fill the indexed value of the vector. That's why the following code still works:
```{r}
a[1:5] <- 9
a
```
In many other languages this code wouldn't work because we're trying to replace 5 values with just 1. However, this is a case where R bends the rules a bit by repeating (recycling) the shorter vector.
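This repetition is not limited to single values. A small sketch, using a hypothetical vector `ex.a`: a replacement of length 2 is repeated to fill 4 positions.

```{r}
# Hypothetical example vector of six 1s
ex.a <- rep(1, 6)

# Replace 4 values with a length-2 vector: the pair 9, 0 is used twice
ex.a[1:4] <- c(9, 0)
ex.a
```

If the number of values being replaced is not a multiple of the replacement length, R still fills the positions but issues a warning, so it is safest to assign either a single value or a vector of matching length.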
### Ex: Fixing invalid responses to a Happiness survey
```{r, fig.cap= "", fig.margin = TRUE, echo = FALSE, out.width = "50%", fig.align='center'}
knitr::include_graphics(c("images/happiness.png"))
```
Assigning and indexing is a particularly helpful tool when, for example, you want to remove invalid values in a vector before performing an analysis. For example, let's say you asked 10 people how happy they were on a scale of 1 to 5 and received the following responses:
```{r}
happy <- c(1, 4, 2, 999, 2, 3, -2, 3, 2, 999)
```
As you can see, we have some invalid values (999 and -2) in this vector. To remove them, we'll use logical indexing to change the invalid values (999 and -2) to NA. We'll create a logical vector indicating which values of `happy` are *invalid* using the `%in%` operation. Because we want to see which values are *invalid*, we'll add the `== FALSE` condition (If we don't, the index will tell us which values *are* valid).
```{r}
# Which values of happy are NOT in the set 1:5?
invalid <- (happy %in% 1:5) == FALSE
invalid
```
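An equivalent way to write `== FALSE` is the NOT operator `!`, which flips TRUE and FALSE. A sketch with a hypothetical copy of the responses, `ex.happy`, showing that both forms flag the same values:

```{r}
# Hypothetical copy of the survey responses
ex.happy <- c(1, 4, 2, 999, 2, 3, -2, 3, 2, 999)

# Both expressions flag the values that are NOT in 1:5
ex.invalid.a <- (ex.happy %in% 1:5) == FALSE
ex.invalid.b <- !(ex.happy %in% 1:5)

# The two logical vectors are the same
identical(ex.invalid.a, ex.invalid.b)

# Positions of the invalid responses
which(ex.invalid.b)
```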
Now that we have a logical index `invalid` telling us which values are invalid (that is, not in the set 1 through 5), we'll index `happy` with `invalid`, and assign the invalid values as NA:
```{r}
# Convert any invalid values in happy to NA
happy[invalid] <- NA
happy
```
We can also recode all the invalid values of `happy` in one line as follows:
```{r}
# Convert all values of happy that are NOT integers from 1 to 5 to NA
happy[(happy %in% 1:5) == FALSE] <- NA
```
As you can see, `happy` now has NAs for previously invalid values. Now we can take a `mean()` of the vector and see the mean of the valid responses.
```{r}
# Include na.rm = TRUE to ignore NA values
mean(happy, na.rm = TRUE)
```
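The `na.rm = TRUE` argument matters: by default, `mean()` returns NA if any value is missing. A sketch with a hypothetical vector `ex.resp`, which also counts the missing values by applying `sum()` to the result of `is.na()`:

```{r}
# Hypothetical responses with invalid values already set to NA
ex.resp <- c(1, 4, 2, NA, 2, 3, NA, 3, 2, NA)

# Without na.rm, a single NA makes the result NA
mean(ex.resp)

# How many responses are missing?
sum(is.na(ex.resp))

# Mean of the valid responses only
mean(ex.resp, na.rm = TRUE)
```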
## Test your R Might!: Movie data
```{r, fig.cap= "", fig.margin = TRUE, echo = FALSE, out.width = "100%", fig.align='center'}
knitr::include_graphics(c("images/moviecollage.png"))
```
Table \@ref(tab:moviedata) contains data about 10 of my favorite movies.
```{r moviedata, echo = FALSE}
movie.data <- data.frame("movie" = c("Whatever Works", "It Follows", "Love and Mercy",
                                     "The Goonies", "Jiro Dreams of Sushi",
                                     "There Will be Blood", "Moon",
                                     "Spice World", "Serenity", "Finding Vivian Maier"),
                         year = c(2009, 2015, 2015, 1985, 2012, 2007, 2009, 1988, 2005, 2014),
                         boxoffice = c(35, 15, 15, 62, 3, 10, 321, 79, 39, 1.5),
                         genre = c("Comedy", "Horror", "Drama", "Adventure", "Documentary",
                                   "Drama", "Science Fiction", "Comedy", "Science Fiction",
                                   "Documentary"),
                         time = c(92, 97, 120, 90, 81, 158, 97, -84, 119, 84),
                         rating = c("PG-13", "R", "R", "PG", "G", "R", "R",
                                    "PG-13", "PG-13", "Unrated"))
knitr::kable(movie.data, caption = "Some of my favorite movies")
```
0. Create new data vectors for each column.
1. What is the name of the 10th movie in the list?
2. What are the genres of the first 4 movies?
3. Some joker put Spice World in the movie names -- it should be "The Naked Gun". Please correct the name.
4. What were the names of the movies made before 1990?
5. How many movies were Dramas? What percent of the 10 movies were Dramas?
6. One of the values in the `time` vector is invalid. Convert any invalid values in this vector to NA. Then, calculate the mean movie time.
7. What were the names of the Comedy movies? What were their boxoffice totals? (Two separate questions)
8. What were the names of the movies that made less than \$50 million AND were Comedies?
9. What was the median boxoffice revenue of movies rated either G or PG?
10. What percent of the movies were rated R OR were Comedies?