1
00:00:00,049 --> 00:00:04,760
tapply is useful because it splits up a
vector
tapply() 很有用 因为它可以把一个向量

2
00:00:04,760 --> 00:00:08,750
into little pieces and it applies a
summary statistic or
分解成多个小块 对它们应用一个函数

3
00:00:08,750 --> 00:00:11,650
function to those little pieces, and then
after it applies
或者计算它们的概括统计量

4
00:00:11,650 --> 00:00:14,020
a function it kind of brings the pieces
back together again.
之后再把这些小块重新组合起来

5
00:00:14,020 --> 00:00:17,783
So split is not a loop function, but
it's a very handy
split() 不是一个循环函数 但它使用起来相当方便

6
00:00:17,783 --> 00:00:22,770
function that can be used in conjunction
with functions like lapply or sapply.
它可以和 lapply() 以及 sapply() 一起使用

7
00:00:22,770 --> 00:00:24,990
And so I just want to mention it here.
我就在这里提一下

8
00:00:24,990 --> 00:00:26,530
So split takes a vector.
split() 的参数是一个向量

9
00:00:26,530 --> 00:00:29,030
So it's kind of like tapply,
所以它跟 tapply() 函数类似

10
00:00:29,030 --> 00:00:31,790
but without applying the
summary statistics.
只不过不会计算概括统计量

11
00:00:31,790 --> 00:00:33,310
So what it does, is it takes a vector, or
它所做的就是接收一个向量或是对象 x

12
00:00:33,310 --> 00:00:37,490
an object x, and it takes a factor
variable f,
然后接收一个因子变量 f

13
00:00:37,490 --> 00:00:39,450
which again identifies the levels of a group.
这个因子变量被用来指定分组的水平 (level)

14
00:00:40,510 --> 00:00:43,980
And then it splits the object x into the
最后把对象 x 根据 f 进行分组

15
00:00:43,980 --> 00:00:47,000
number of groups that are identified in
factor f.
最后把对象 x 根据 f 进行分组

16
00:00:47,000 --> 00:00:50,190
So for example, if f has three levels
identifying three
例如 如果 f 有3个水平

17
00:00:50,190 --> 00:00:55,160
groups, then the split function will split
x into three groups.
那么 split() 会把 x 分成三个组

18
00:00:55,160 --> 00:00:58,460
And then once you've got those
groups split apart,
一旦你把这些组分开了之后

19
00:00:58,460 --> 00:01:03,641
you can use lapply or sapply to apply a
function to those individual groups.
你就可以对这些独立的组使用 lapply() 或是 sapply() 了
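
To make the split-then-apply idea concrete, here is a minimal sketch in R. The vector `x`, the factor `f`, and the name `pieces` are made up for illustration and are not from the lecture.

```r
# Toy data (hypothetical): six numbers in three groups
x <- c(5, 3, 8, 1, 9, 2)
f <- factor(c("a", "b", "c", "a", "b", "c"))

# split() returns a list with one element per level of f
pieces <- split(x, f)
print(pieces)

# Apply a function to each group separately
group_means <- sapply(pieces, mean)
print(group_means)
```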

20
00:01:06,040 --> 00:01:10,446
So here is a simpler example, similar
to what I had before.
这里有个更简单的例子 和之前那个差不多

21
00:01:10,446 --> 00:01:16,100
As with the tapply example, I've simulated
10 normal random variables
在 tapply() 的例子里 我模拟生成了10个均值为0的正态随机变量

22
00:01:16,100 --> 00:01:20,310
with mean zero, 10 uniforms, and 10
normals with mean one.
10个均匀随机变量 以及10个均值为1的正态随机变量

23
00:01:20,310 --> 00:01:22,550
And I've created my factor variable here.
我也设定好了因子变量

24
00:01:22,550 --> 00:01:24,960
And now I'm just going to split the vector
into three parts.
现在我要做的就是把这个向量分解成三个部分

25
00:01:24,960 --> 00:01:27,410
Because the factor variable has
three levels.
因为因子变量的水平为3

26
00:01:27,410 --> 00:01:31,810
So now you can see when I split the x
vector.
你可以看到我把 x 向量分解后

27
00:01:31,810 --> 00:01:36,440
I got a list back, and the first
element is 10 normals, the second element
得到了一个列表 它的第一个元素是10个正态随机变量

28
00:01:36,440 --> 00:01:37,940
is 10 uniforms and the third element,
which
第二个元素是10个均匀随机变量

29
00:01:37,940 --> 00:01:41,360
gets a little cut off here, is 10 normals
again.
第三个元素也是10个正态随机变量 (没有完全显示)

30
00:01:41,360 --> 00:01:42,930
So that's what the split function does.
以上就是 split() 的功能

31
00:01:42,930 --> 00:01:46,130
So split always
returns a list back.
它总是会返回一个列表

32
00:01:46,130 --> 00:01:50,422
And so if you want to do something with
this list, you can use lapply or sapply.
如果你想要对这个列表进行操作的话 你可以使用 lapply() 或是 sapply()

33
00:01:50,422 --> 00:01:56,920
So here, for example, it is common to use
来看这个例子 通常我们都会

34
00:01:56,920 --> 00:02:00,040
the lapply function in conjunction with
the split function, so
把 lapply() 和 split() 放在一起用

35
00:02:00,040 --> 00:02:03,222
the idea is that you split something and then
lapply a function over it.
也就是你对一个对象使用 split() 然后再用 lapply()

36
00:02:03,222 --> 00:02:08,130
Now, in this case, this use of lapply and
split is not necessary, because
在这个例子里 并不需要这样使用 lapply() 和 split()

37
00:02:08,130 --> 00:02:11,050
you can use the tapply function which will
do the same exact thing.
你可以使用 tapply() 来达到完全一样的效果
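
A sketch of this comparison in R follows. The slide's exact code is not in the transcript, so the calls below (`gl(3, 10)` for the factor, `set.seed(1)` for reproducibility) are assumptions that match the spoken description.

```r
set.seed(1)  # assumed, only so the sketch is reproducible

# 10 normals with mean 0, 10 uniforms, 10 normals with mean 1
x <- c(rnorm(10), runif(10), rnorm(10, 1))
f <- gl(3, 10)  # factor with 3 levels, each repeated 10 times

# Route 1: split into a list, then apply mean to each piece
s <- split(x, f)
means_split <- sapply(s, mean)

# Route 2: tapply does the split and the apply in one call
means_tapply <- tapply(x, f, mean)

print(means_split)
print(means_tapply)
```

Both routes give the same three group means; tapply is simply the more compact way to write it for a plain vector.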

38
00:02:12,290 --> 00:02:15,060
It's not any more efficient or any worse to
do it this
这两种做法区别不大

39
00:02:15,060 --> 00:02:22,130
way but the tapply function is a little
bit more compact.
只是 tapply() 更紧凑一点而已

40
00:02:22,130 --> 00:02:24,490
But the nice thing about using
the split function is
split() 的好处在于

41
00:02:24,490 --> 00:02:28,490
that it can be used to split much more
complicated types of objects.
它可以用来分解类型更加复杂的对象

42
00:02:28,490 --> 00:02:30,210
So for example, here I've got a data
frame.
例如 我这里有一个数据框

43
00:02:30,210 --> 00:02:32,750
I'm loading the datasets package and
I'm
我加载了一个叫 datasets (数据集) 的包

44
00:02:32,750 --> 00:02:35,480
looking at the airquality data frame from
the datasets package.
然后观察里面的 airquality (空气质量) 数据框

45
00:02:35,480 --> 00:02:38,550
So you can see that this is the first six
rows of this
你可以看到数据的前6行

46
00:02:38,550 --> 00:02:42,540
data frame. I think there are about a hundred-some
rows total in this data frame.
我估计这个数据框大概有100行

47
00:02:42,540 --> 00:02:43,930
And you see there are measurements on
可以看到这里面有

48
00:02:43,930 --> 00:02:47,510
ozone, solar radiation, wind, and
temperature, and
Ozone (臭氧)、Solar.R (太阳辐射)、Wind (风力)、Temp (温度) 等的测量值

49
00:02:47,510 --> 00:02:49,620
then the month and the day within that
month.
接着是 Month (月份) 和 Day (日期)

50
00:02:50,750 --> 00:02:53,680
And so one thing I might want to do is
calculate, for
我想要计算比方说 臭氧、太阳辐射

51
00:02:53,680 --> 00:02:56,290
example the mean of ozone, solar
radiation,
我想要计算比方说 臭氧、太阳辐射

52
00:02:56,290 --> 00:02:59,830
wind and temperature within each
month.
风力和温度在一个月内的平均值

53
00:02:59,830 --> 00:03:03,310
So in each month there are, you
know, 30-some observations.
对于每个月来说 一般都会有30个的观测值

54
00:03:03,310 --> 00:03:06,130
And I want to calculate the mean within
each month.
我想要计算每个月的平均值

55
00:03:06,130 --> 00:03:07,180
All right, so how do I do this?
那么我要怎么做呢？

56
00:03:07,180 --> 00:03:13,250
Well, what I'd like to do is I'd like to
split the data frame into monthly pieces.
我要做的是把这个数据框按月分组

57
00:03:13,250 --> 00:03:18,360
And then once I've split the data frame into
separate months, I can just calculate the
一旦我按月分组之后 我就只需要

58
00:03:18,360 --> 00:03:23,768
means, the column means, using either apply
or colMeans, on those other variables.
利用 apply() 或是 colMeans() 来计算不同变量的列均值

59
00:03:23,768 --> 00:03:27,459
[SOUND].
利用 apply() 或是 colMeans() 来计算不同变量的列均值

60
00:03:27,459 --> 00:03:29,080
So that's what I've done here.
这是我的结果

61
00:03:29,080 --> 00:03:32,360
What I've done is split the airquality
data frame,
我所做的就是分解 airquality 数据框

62
00:03:32,360 --> 00:03:35,670
and the factor I'm going to use to split
is the month variable.
用来分解的因子是 Month (月) 这个变量

63
00:03:35,670 --> 00:03:38,460
So the month variable in the data frame, technically
speaking, is not
严格来说 这个数据框里的 Month 变量并不是因子变量

64
00:03:38,460 --> 00:03:42,010
a factor variable but it can be converted
into a factor variable,
但是它可以被转换成一个因子变量

65
00:03:42,010 --> 00:03:46,030
because it only takes the values 5, 6, 7,
8 and 9.
因为它只有5、6、7、8、9这几个值

66
00:03:46,030 --> 00:03:48,750
Basically because the measurements are
only taken in
主要是因为这些数据

67
00:03:48,750 --> 00:03:50,770
the, kind of, warmer months of the year.
都是在一年中比较温暖的月份里采集的

68
00:03:50,770 --> 00:03:53,080
So here I've split the airquality data frame
according
这里我已经根据 Month 变量

69
00:03:53,080 --> 00:03:56,700
to the month variable, and then I'm
going to apply
把 airquality 分解了

70
00:03:56,700 --> 00:03:59,770
an anonymous function, and what the anonymous
function here does is
接着我要使用一个匿名函数 它的作用是

71
00:03:59,770 --> 00:04:03,660
it takes the column means of just the
ozone, solar radiation and wind.
取 Ozone, Solar.R 和 Wind 的列均值

72
00:04:03,660 --> 00:04:05,550
So I'm not going to take the mean of
temperature here.
这里我不取 Temp 的均值

73
00:04:05,550 --> 00:04:08,530
So I'm just going to take the column means
of
我只需要这3个变量每个月的列均值

74
00:04:08,530 --> 00:04:12,930
those three variables for each of the
monthly data frames.
我只需要这3个变量每个月的列均值

75
00:04:12,930 --> 00:04:14,460
So here you can see the results.
你可以看到结果是这样的

76
00:04:14,460 --> 00:04:17,070
You can't see them all, but you can see
most of them, and
这里没有完全显示出来 但你可以看到

77
00:04:17,070 --> 00:04:20,640
lapply is returning a list back, where
each element of the list is
lapply() 对每个月份都返回了一个列表

78
00:04:20,640 --> 00:04:24,060
a vector of length three, which is
the mean for ozone,
列表里的每个元素都是一个长度为3的向量 它们分别是

79
00:04:24,060 --> 00:04:26,935
the mean for solar radiation and the mean
for wind, within that month.
Ozone、Solar.R 以及 Wind 在当月的平均值
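
The step just described can be sketched in R as follows. This is a reconstruction from the spoken description, not the slide's literal code; the names `s` and `monthly` are assumptions.

```r
library(datasets)

# Split the data frame into a list of five monthly data frames
s <- split(airquality, airquality$Month)

# For each monthly data frame, take the column means of three variables
monthly <- lapply(s, function(d) {
  colMeans(d[, c("Ozone", "Solar.R", "Wind")])
})

# A list with one named vector of length 3 per month;
# a mean is NA whenever that column has missing values in that month
print(monthly)
```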

80
00:04:26,935 --> 00:04:29,220
And you can see that
你可以看到

81
00:04:29,220 --> 00:04:31,890
for most of the months the ozone
value is
大多数月份里 Ozone 的值

82
00:04:31,890 --> 00:04:34,110
NA and that's because when I take the mean
of that
都是 NA 这是因为 当我计算那一列的平均值时

83
00:04:34,110 --> 00:04:37,220
column, there are missing
values in that column,
里面有一些缺失值

84
00:04:37,220 --> 00:04:40,210
and I can't take the mean if there are
missing values.
有缺失值就会导致无法计算平均值

85
00:04:40,210 --> 00:04:44,430
So the result when I take the mean
is that I just get a missing value back.
因此结果是我得到了一个缺失值

86
00:04:46,170 --> 00:04:48,210
So one thing I can do is I can.
但在修复这个缺失值问题之前 我也可以先调用 sapply()

87
00:04:48,210 --> 00:04:52,312
So before I fix the missing value problem,
I can also call sapply here.
但在修复这个缺失值问题之前 我也可以先调用 sapply()

88
00:04:52,312 --> 00:04:54,226
And the idea is that sapply, instead of
它的原理是

89
00:04:54,226 --> 00:04:57,880
returning me a list, it will simplify the
result because each element
它会简化返回的结果 而不是返回一个列表

90
00:04:57,880 --> 00:05:00,635
of the returned list is a vector
of length 3.
因为返回的列表里的每个元素都含有一个长度为3的向量

91
00:05:00,635 --> 00:05:01,818
They're all the same length.
它们的长度都是一样的

92
00:05:01,818 --> 00:05:06,490
So what it'll do is put all these
numbers into a matrix
所以我要做的就是把这些数字放进一个矩阵

93
00:05:06,490 --> 00:05:09,030
with three rows and, in this case, 5
columns.
在这里是3行5列

94
00:05:09,030 --> 00:05:11,410
So here you can see the monthly means
你可以看到3个变量的月平均值

95
00:05:11,410 --> 00:05:13,400
for each of the three variables, in a much
more
格式更加的紧凑

96
00:05:13,400 --> 00:05:16,540
compact format: it's in a matrix instead
of a list.
这是一个矩阵 而不是列表
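
Swapping lapply for sapply gives the matrix form. Again this is a sketch under the same assumptions as before (the names `s` and `m` are illustrative).

```r
library(datasets)

s <- split(airquality, airquality$Month)

# Every piece yields a vector of length 3, so sapply can simplify
# the result to a matrix: one row per variable, one column per month
m <- sapply(s, function(d) {
  colMeans(d[, c("Ozone", "Solar.R", "Wind")])
})

print(dim(m))
print(m)
```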

97
00:05:16,540 --> 00:05:19,540
Of course I still get NAs for a lot of
them, because of the missing values
当然 源数据里缺失值的存在

98
00:05:19,540 --> 00:05:21,456
in the original data.
导致我这里还有很多的 NA

99
00:05:21,456 --> 00:05:24,583
So one thing I can do is pass
the na.rm argument to colMeans,
所以我要给 colMeans() 传递一个 na.rm 参数

100
00:05:24,583 --> 00:05:26,943
and that will remove the missing values
所以我要给 colMeans() 传递一个 na.rm 参数

101
00:05:26,943 --> 00:05:30,022
from each column before calculating
the mean.
从而在计算平均值之前移除每列的缺失值

102
00:05:30,022 --> 00:05:34,314
And now when I call sapply on the
split list, I can get the
接着当我再调用 sapply() 的时候

103
00:05:34,314 --> 00:05:36,756
means of the observed values for each of
我就能得到这5个月中

104
00:05:36,756 --> 00:05:40,600
the three variables for each of the five
months.
3个变量各自的观测值的平均值了
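
A sketch of the fix in R: passing `na.rm = TRUE` to colMeans inside the anonymous function. The variable names are illustrative, not taken from the slide.

```r
library(datasets)

s <- split(airquality, airquality$Month)

# na.rm = TRUE drops the missing values in each column
# before the mean is computed
m <- sapply(s, function(d) {
  colMeans(d[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE)
})

print(m)
```

Each entry is now the mean of the observed values only, so no NA remains in the matrix.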

105
00:05:40,600 --> 00:05:44,744
So split is a very handy function for
splitting arbitrary
所以 split() 是一个非常方便的函数

106
00:05:44,744 --> 00:05:47,112
objects according to the levels of the
它根据因子的水平来分解任意的对象

107
00:05:47,112 --> 00:05:51,160
factor and then applying any type of
function
然后再对分解后列表中的元素应用任意类型的函数

108
00:05:51,160 --> 00:05:53,460
to those split elements of that list.
然后再对分解后列表中的元素应用任意类型的函数

109
00:05:53,460 --> 00:05:55,740
And so here I split a data frame, but you can
split
这里我分解了一个数据框

110
00:05:55,740 --> 00:05:58,125
lists or other kinds
of things too.
你也可以分解诸如列表之类的其它对象

111
00:05:58,125 --> 00:06:01,490
[SOUND].
你也可以分解诸如列表之类的其它对象

112
00:06:01,490 --> 00:06:04,340
So the last thing I want to talk about is
splitting on more than one level.
最后我想说一下基于多个水平的分解

113
00:06:04,340 --> 00:06:06,430
So in the past couple of examples
在之前的几个例子中

114
00:06:06,430 --> 00:06:09,252
I've only had a single factor
variable.
我都只有一个因子变量

115
00:06:09,252 --> 00:06:09,796
And I've
并且无论对象是向量还是数据框

116
00:06:09,796 --> 00:06:13,490
split whatever the object is, whether a vector
or a data frame,
并且无论对象是向量还是数据框

117
00:06:13,490 --> 00:06:15,690
according to the levels of that single
factor.
我都根据这一个因子水平来分解它

118
00:06:15,690 --> 00:06:17,810
But you might have more than one factor.
但是你可能会有多个因子

119
00:06:17,810 --> 00:06:20,104
For example, you might have a variable
that, you
比如 你可能会有一个变量来表示 gender (性别)

120
00:06:20,104 --> 00:06:22,040
know, is gender, so it has male and
female.
那么它就会包含 male (男) female (女)

121
00:06:22,040 --> 00:06:23,530
And you might have another variable
可能还有另一个变量

122
00:06:23,530 --> 00:06:25,100
that has, for example, race.
比如 race (种族)

123
00:06:25,100 --> 00:06:26,260
And so, you might want to look at
因此 你可能会想要观察

124
00:06:26,260 --> 00:06:29,420
the combination of the levels within those
factors.
这些因子产生的不同水平的组合

125
00:06:29,420 --> 00:06:34,200
And so here I've got f1,
which is a factor with two levels.
这里我有一个 f1 它有2个水平

126
00:06:34,200 --> 00:06:35,112
And so I've simulated
我模拟生成了

127
00:06:35,112 --> 00:06:37,707
a normal random variable with 10
observations.
一个有10个观测值的正态随机变量

128
00:06:38,810 --> 00:06:41,360
I've got a factor with two levels, each
repeated
我有一个水平为2的因子 每个水平重复5次

129
00:06:41,360 --> 00:06:44,210
five times, and then I've got another
factor with five levels.
另外还有一个水平为5的因子

130
00:06:44,210 --> 00:06:45,210
each repeated two times.
每个水平重复2次

131
00:06:45,210 --> 00:06:51,290
So those are my two
grouping variables here.
这样就得到了两组变量

132
00:06:51,290 --> 00:06:54,340
And I want to look at the
combination of the two.
我想看一看这两个因子的组合

133
00:06:54,340 --> 00:06:58,430
So I can use the interaction function to
combine all the levels
我可以使用 interaction 函数

134
00:06:58,430 --> 00:07:00,970
of the first one with all the levels
of the second one.
来把它们所有的水平都组合起来

135
00:07:00,970 --> 00:07:04,430
And so because there are two levels in the
first
因为第一个因子的水平为2

136
00:07:04,430 --> 00:07:06,510
factor and there are five levels in the
second factor,
第二个因子的水平为5

137
00:07:06,510 --> 00:07:09,350
there is a total of
10 different
所以当你把它们组合起来后

138
00:07:09,350 --> 00:07:12,060
levels that you can have when you combine
the two together.
总共有10个不同水平的组合

139
00:07:12,060 --> 00:07:13,780
So you see, when I call
the
因此 当我调用了

140
00:07:13,780 --> 00:07:17,210
interaction function I get another factor,
that kind of concatenates the
interaction() 之后 就得到了另一个因子

141
00:07:17,210 --> 00:07:19,480
levels of one with the other, and you can
see that
它把两个因子的水平数连接起来

142
00:07:19,480 --> 00:07:21,530
it prints out that there is a total of ten
levels.
可以看到 输出后它的水平为10
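
Here is a small R sketch of that setup. The transcript only describes the factors in words, so building them with `gl()` is an assumption, as are the names `f1`, `f2`, and `fi`.

```r
set.seed(1)    # assumed, for reproducibility only
x <- rnorm(10) # 10 normal observations

f1 <- gl(2, 5) # 2 levels, each repeated 5 times
f2 <- gl(5, 2) # 5 levels, each repeated 2 times

# interaction() combines every level of f1 with every level of f2
fi <- interaction(f1, f2)
print(fi)
print(nlevels(fi)) # 2 * 5 combined levels
```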

143
00:07:21,530 --> 00:07:22,030
Okay.
好的

144
00:07:24,440 --> 00:07:31,170
So now I can split my numeric vector
x according to the levels of the two different factors.
现在我可以根据这两个不同的水平来分解数值向量 x 了

145
00:07:31,170 --> 00:07:32,950
Now one
thing: when
记着一点

146
00:07:32,950 --> 00:07:36,900
I use the split function I don't have to
use the interaction function.
使用 split 函数时 并不一定要同时使用 interaction 函数

147
00:07:36,900 --> 00:07:40,150
I can just pass it a list with the two
factors and it will
我可以直接传递一个包含两个因子的列表给 split()

148
00:07:40,150 --> 00:07:42,990
automatically call the interaction
function for me,
它会自动调用 interaction()

149
00:07:42,990 --> 00:07:45,710
and create that 10 level interaction
factor.
并生成一个水平为10的因子

150
00:07:45,710 --> 00:07:47,180
So I can just pass the list of these two
所以我可以直接传递包含这两个因子的列表

151
00:07:47,180 --> 00:07:49,890
factors in, and you can see that it
returns
函数就会给我返回一个

152
00:07:49,890 --> 00:07:51,750
me a list with one element for each of
函数就会给我返回一个

153
00:07:51,750 --> 00:07:54,690
the 10 different interaction
factor levels.
有10个不同交互因子水平的列表

154
00:07:54,690 --> 00:07:57,270
And then you get the elements
of
然后这10个水平中的数值因子的元素 (无意义的口误)

155
00:07:57,270 --> 00:08:00,500
the numeric vector that are within those
10 levels.
然后这10个水平中的数值因子的元素 (无意义的口误)

156
00:08:00,500 --> 00:08:02,930
Now of course, although there
are 10 levels between
尽管这两个因子可以组成10个不同的水平

157
00:08:02,930 --> 00:08:06,690
the two different factors, we don't
exactly observe every single combination.
但我们不一定会在每个水平上都进行观测

158
00:08:06,690 --> 00:08:08,470
And so there are some empty levels here
and you
因此在某些水平上会有没有内容

159
00:08:08,470 --> 00:08:11,060
can see that some of these levels have
nothing in them.
你可以看到它们里面什么都没有

160
00:08:11,060 --> 00:08:16,030
They have zero elements, whereas other
levels have a number in them.
它们含有0个元素 其它的水平上则有一个数字
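
To see the empty levels, here is a sketch in R that passes a list of two factors straight to split(). As before, `gl()` and the variable names are assumptions based on the spoken description.

```r
set.seed(1)    # assumed, for reproducibility only
x <- rnorm(10)
f1 <- gl(2, 5) # 2 levels, each repeated 5 times
f2 <- gl(5, 2) # 5 levels, each repeated 2 times

# A list of factors is combined via interaction() inside split()
s <- split(x, list(f1, f2))

# One element per combined level; unobserved combinations are empty
print(lengths(s))
str(s)
```

With these two factors only 6 of the 10 combinations actually occur, so 4 elements of the list have length zero.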

161
00:08:16,030 --> 00:08:17,700
And so, so one thing you can do.
所以你可以做的是

162
00:08:17,700 --> 00:08:21,156
Well, first, I could take this
list
首先 如果你想的话 你可以对这个列表

163
00:08:21,156 --> 00:08:23,650
and lapply or sapply a function over it,
if I wanted to.
使用 lapply() 或是 sapply()

164
00:08:23,650 --> 00:08:28,480
But, sometimes it's a little bit handy to
not have to keep these empty levels.
但有时去掉这些空的水平会更好

165
00:08:28,480 --> 00:08:30,600
So the split function has an argument
called drop.
而 split() 有一个叫做 drop 的参数

166
00:08:31,700 --> 00:08:34,770
And if you specify drop = TRUE, it
will drop
如果你把 drop 设置为 TRUE

167
00:08:34,770 --> 00:08:38,150
the empty levels that are created by the
splitting.
它就会去掉所有在分解过程中生成的空的水平

168
00:08:38,150 --> 00:08:41,572
And this can be very handy when
you're combining
当你在对多个不同的因子进行组合的时候

169
00:08:41,572 --> 00:08:43,130
multiple different factors.
这会非常的方便

170
00:08:43,130 --> 00:08:46,570
If you're only using a single factor, then
that argument usually doesn't really
如果你只使用一个因子的话 这个参数就没有什么用处

171
00:08:46,570 --> 00:08:51,110
do anything, because you'll just use all
the levels.
因为你会把所有的水平都用上

172
00:08:51,110 --> 00:08:53,780
But if you have multiple factors then
typically you're going to have empty
但是如果你有多个因子的话

173
00:08:53,780 --> 00:08:55,090
levels, just because you don't observe
一般来说你会得到一些空的水平

174
00:08:55,090 --> 00:08:57,310
every single combination of the two
factors.
因为你没有观测这两个因子的所有组合

175
00:08:58,320 --> 00:09:01,596
So drop = TRUE will drop those
empty levels, and then
所以 drop 为 TRUE 时 函数会去掉所有空的水平

176
00:09:01,596 --> 00:09:03,728
you'll be returned a list
with
然后你就会得到一个列表

177
00:09:03,728 --> 00:09:06,075
only the levels that have observations in
them.
它仅包含那些有观测值的水平
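
A short R sketch of the drop argument, using the same assumed setup (`gl()` factors and illustrative names) as the description above.

```r
set.seed(1)    # assumed, for reproducibility only
x <- rnorm(10)
f1 <- gl(2, 5)
f2 <- gl(5, 2)

# Default: all 10 combined levels are kept, including empty ones
s_all <- split(x, list(f1, f2))

# drop = TRUE keeps only the combinations that were observed
s_obs <- split(x, list(f1, f2), drop = TRUE)

print(length(s_all))
print(length(s_obs))
str(s_obs)
```

After dropping, every remaining element has at least one observation, which makes a following lapply or sapply call simpler to read.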

178
00:09:06,075 --> 00:09:06,587
[SOUND].
【教育无边界字幕组】三又木君 | hazard1990 | HikaruSama


