dplyr
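First, install and load two packages, dplyr and hflights. The former provides the data-manipulation functions; the latter is the dataset (the raw data come from the Bureau of Transportation Statistics, U.S. Department of Transportation: http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120).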
install.packages(c("dplyr", "hflights"))
library("dplyr"); library("hflights")
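glimpse() (a quick look)
Next, load the dataset as a data frame named hf and use glimpse() to get an overview of its structure and contents. The result shows 21 variables (columns) and 227,496 observations (rows).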
> hf <- data.frame(hflights)
> glimpse(hf)
Observations: 227,496
Variables: 21
$ Year (int) 2011, 2011, 2011, 2011, 2011, 2011, ...
$ Month (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ DayofMonth (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1...
$ DayOfWeek (int) 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, ...
$ DepTime (int) 1400, 1401, 1352, 1403, 1405, 1359, ...
$ ArrTime (int) 1500, 1501, 1502, 1513, 1507, 1503, ...
$ UniqueCarrier (chr) "AA", "AA", "AA", "AA", "AA", "AA", ...
$ FlightNum (int) 428, 428, 428, 428, 428, 428, 428, 4...
$ TailNum (chr) "N576AA", "N557AA", "N541AA", "N403A...
$ ActualElapsedTime (int) 60, 60, 70, 70, 62, 64, 70, 59, 71, ...
$ AirTime (int) 40, 45, 48, 39, 44, 45, 43, 40, 41, ...
$ ArrDelay (int) -10, -9, -8, 3, -3, -7, -1, -16, 44,...
$ DepDelay (int) 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, ...
$ Origin (chr) "IAH", "IAH", "IAH", "IAH", "IAH", "...
$ Dest (chr) "DFW", "DFW", "DFW", "DFW", "DFW", "...
$ Distance (int) 224, 224, 224, 224, 224, 224, 224, 2...
$ TaxiIn (int) 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4,...
$ TaxiOut (int) 13, 9, 17, 22, 9, 13, 15, 12, 22, 19...
$ Cancelled (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ CancellationCode (chr) "", "", "", "", "", "", "", "", "", ...
$ Diverted (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
The variables have the following meanings:
Variable | Description |
---|---|
Year, Month, DayofMonth | date of departure |
DayOfWeek | day of week of departure (useful for removing weekend effects) |
DepTime, ArrTime | departure and arrival times (in local time, hhmm) |
UniqueCarrier | unique abbreviation for a carrier |
FlightNum | flight number |
TailNum | airplane tail number |
ActualElapsedTime | elapsed time of flight, in minutes |
AirTime | flight time, in minutes |
ArrDelay, DepDelay | arrival and departure delays, in minutes |
Origin, Dest | origin and destination airport codes |
Distance | distance of flight, in miles |
TaxiIn, TaxiOut | taxi in and out times in minutes |
Cancelled | cancelled indicator: 1 = Yes, 0 = No |
CancellationCode | reason for cancellation: A = carrier, B = weather, C = national air system, D = security |
Diverted | diverted indicator: 1 = Yes, 0 = No |
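%>% (pipe)
With the binary operator %>% acting as a pipe, the call f(x, y) can be rewritten as x %>% f(y), which makes chaining convenient. For example, third(second(first(x, a), b), c) becomes x %>% first(a) %>% second(b) %>% third(c). The following call gives the same result as the previous example.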
> hflights %>% data.frame() %>% glimpse() # Equivalent to glimpse(data.frame(hflights)).
Observations: 227,496
Variables: 21
$ Year (int) 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, ...
$ Month (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ DayofMonth (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
$ DayOfWeek (int) 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, ...
… (output omitted) …
$ CancellationCode (chr) "", "", "", "", "", "", "", "", "", "", "", "", "", ""...
$ Diverted (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
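The equivalence between nested and piped calls can be checked on a small vector. This is a minimal sketch; the vector x and the names nested and piped are made up for illustration.

```r
library(dplyr)  # makes %>% available

x <- c(4, 9, 16)  # hypothetical toy input

# Nested form: read from the inside out.
nested <- round(mean(sqrt(x)), 1)

# Piped form: read left to right; the left-hand side becomes the first argument.
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```

Both forms evaluate the same calls in the same order; the pipe only changes how the code reads.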
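distinct() (unique rows)
This function keeps the distinct rows of a data frame, that is, it removes duplicate rows. The example below selects the columns from UniqueCarrier through TailNum from hf, removes the duplicate rows, and shows the first 5 unique rows.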
> hf %>% select(UniqueCarrier:TailNum) %>% distinct() %>% head(5)
UniqueCarrier FlightNum TailNum
1 AA 428 N576AA
2 AA 428 N557AA
3 AA 428 N541AA
4 AA 428 N403AA
5 AA 428 N492AA
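To get the number of unique rows rather than the rows themselves, the result of distinct() can be passed to nrow(). The sketch below uses a made-up toy data frame (toy, carrier, and flight are hypothetical names) and also shows the .keep_all argument, assuming a dplyr version that supports it.

```r
library(dplyr)

# Hypothetical toy data: rows 1 and 2 are exact duplicates.
toy <- data.frame(
  carrier = c("AA", "AA", "AA", "WN"),
  flight  = c(428, 428, 500, 33),
  stringsAsFactors = FALSE
)

# Drop duplicate rows, then count what is left.
unique_rows <- toy %>% distinct()
n_unique <- nrow(unique_rows)

# Distinct on one column only; .keep_all = TRUE keeps the other
# columns, taken from the first row seen for each carrier.
by_carrier <- toy %>% distinct(carrier, .keep_all = TRUE)
```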
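arrange() (sort)
The example below uses arrange() to sort hf in ascending order of Year, Month, and DayofMonth, and in descending order of DepTime and then ArrTime, and shows the first 5 rows of the result. Each descending column needs its own desc() call, because desc() takes a single vector.
> hf %>% arrange(Year, Month, DayofMonth, desc(DepTime), desc(ArrTime)) %>% head(5)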
Year Month DayofMonth DayOfWeek DepTime ArrTime
1 2011 1 1 6 2250 53
2 2011 1 1 6 2223 2329
3 2011 1 1 6 2149 2323
4 2011 1 1 6 2142 2222
5 2011 1 1 6 2139 2303
UniqueCarrier FlightNum TailNum ActualElapsedTime AirTime
1 CO 1644 N37409 243 227
2 XE 2515 N12921 66 48
3 CO 1597 N16217 214 196
4 XE 3033 N11536 40 24
5 XE 2192 N16963 84 67
ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut
1 109 120 IAH SMF 1609 5 11
2 64 63 IAH HRL 295 7 11
3 38 39 IAH ONT 1334 6 12
4 7 7 IAH LCH 127 3 13
5 25 24 IAH MAF 429 4 13
Cancelled CancellationCode Diverted
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
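The effect of mixing ascending and descending keys is easier to see on a few rows. In this sketch the data frame toy and its columns day, dep, and arr are invented for illustration; the second descending key only matters where the first one ties.

```r
library(dplyr)

# Hypothetical toy data: two flights on day 1 share the same departure time.
toy <- data.frame(
  day = c(1, 1, 1, 2),
  dep = c(900, 1200, 1200, 800),
  arr = c(1000, 1300, 1330, 900)
)

# day ascending; dep descending; arr descending breaks ties in dep.
sorted <- toy %>% arrange(day, desc(dep), desc(arr))
sorted
```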
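slice() (select rows by position)
slice() selects rows of a data frame by their index position. The example below selects rows 2 through 4, 3 rows in total.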
> hf %>% slice(2:4)
Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum
1 2011 1 2 7 1401 1501 AA 428 N557AA
2 2011 1 3 1 1352 1502 AA 428 N541AA
3 2011 1 4 2 1403 1513 AA 428 N403AA
ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut
1 60 45 -9 1 IAH DFW 224 6 9
2 70 48 -8 -8 IAH DFW 224 5 17
3 70 39 3 3 IAH DFW 224 9 22
Cancelled CancellationCode Diverted
1 0 0
2 0 0
3 0 0
The example below selects the second-to-last and the last row, 2 rows in total.
> hf %>% slice((n()-1):n())
Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum
1 2011 12 6 2 656 812 WN 621 N727SW
2 2011 12 6 2 1600 1713 WN 1597 N745SW
ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut
1 76 64 -13 -4 HOU TUL 453 3 9
2 73 59 -12 0 HOU TUL 453 3 11
Cancelled CancellationCode Diverted
1 0 0
2 0 0
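filter() (select rows by condition)
Relational databases have no inherent notion of row order, so index positions cannot be used directly when dplyr serves as a front end to such a database. The first slice() example above can be rewritten with filter() as follows: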
> hf %>% filter(between(row_number(), 2, 4))
Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum
1 2011 1 2 7 1401 1501 AA 428 N557AA
2 2011 1 3 1 1352 1502 AA 428 N541AA
3 2011 1 4 2 1403 1513 AA 428 N403AA
ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut
1 60 45 -9 1 IAH DFW 224 6 9
2 70 48 -8 -8 IAH DFW 224 5 17
3 70 39 3 3 IAH DFW 224 9 22
Cancelled CancellationCode Diverted
1 0 0
2 0 0
3 0 0
The second slice() example above can be rewritten with filter() as follows:
> hf %>% filter(between(row_number(), n()-1, n()))
Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum
1 2011 12 6 2 656 812 WN 621 N727SW
2 2011 12 6 2 1600 1713 WN 1597 N745SW
ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut
1 76 64 -13 -4 HOU TUL 453 3 9
2 73 59 -12 0 HOU TUL 453 3 11
Cancelled CancellationCode Diverted
1 0 0
2 0 0
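Use slice() when the index positions are known; to pick rows according to whether particular columns satisfy some conditions, use filter(). The example below selects the rows whose departure delay exceeds 30 minutes "or" whose destination is Los Angeles (LAX), and shows the last 4 rows of the result.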
> filter(hf, DepDelay > 30 | Dest == "LAX") %>% tail(4)
Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum
28026 2011 12 6 2 1352 1749 WN 3085 N510SW
28027 2011 12 6 2 1850 2046 WN 39 N754SW
28028 2011 12 6 2 1723 1845 WN 33 N698SW
28029 2011 12 6 2 2023 2109 WN 207 N354SW
ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut
28026 177 163 59 72 HOU PHL 1336 5 9
28027 176 157 71 70 HOU PHX 1020 4 15
28028 202 192 70 78 HOU SAN 1313 3 7
28029 46 38 29 43 HOU SAT 192 4 4
Cancelled CancellationCode Diverted
28026 0 0
28027 0 0
28028 0 0
28029 0 0
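The example below selects the rows whose departure delay exceeds 30 minutes "and" whose destination is Los Angeles (LAX), and shows the last 3 rows of the result.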
> filter(hf, DepDelay > 30 & Dest == "LAX") %>% tail(3)
Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum
692 2011 12 20 2 1423 1554 WN 706 N260WN
693 2011 12 21 3 1424 1557 WN 706 N433LV
694 2011 12 23 5 2207 2334 WN 1053 N290WN
ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut
692 211 194 14 33 HOU LAX 1390 9 8
693 213 194 17 34 HOU LAX 1390 12 7
694 207 194 14 37 HOU LAX 1390 6 7
Cancelled CancellationCode Diverted
692 0 0
693 0 0
694 0 0
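When several conditions are passed as separate arguments, they are combined with "and".
> hf %>% filter(DepDelay > 30 , Dest == "LAX") %>% tail(2) # Same condition as above, but only the last two rows are shown.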
Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum
693 2011 12 21 3 1424 1557 WN 706 N433LV
694 2011 12 23 5 2207 2334 WN 1053 N290WN
ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut
693 213 194 17 34 HOU LAX 1390 12 7
694 207 194 14 37 HOU LAX 1390 6 7
Cancelled CancellationCode Diverted
693 0 0
694 0 0
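The three ways of combining conditions can be compared side by side on a handful of rows. This is a sketch on invented data (toy, delay, and dest are hypothetical names), not on hflights.

```r
library(dplyr)

# Hypothetical toy data.
toy <- data.frame(
  delay = c(45, 10, 60, 5),
  dest  = c("LAX", "LAX", "DFW", "DFW"),
  stringsAsFactors = FALSE
)

# "or": a row is kept if at least one condition holds.
either <- toy %>% filter(delay > 30 | dest == "LAX")

# "and", written with &.
both_amp <- toy %>% filter(delay > 30 & dest == "LAX")

# "and", written as separate arguments; same rows as both_amp.
both_comma <- toy %>% filter(delay > 30, dest == "LAX")
```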
select() (select columns)
Select all columns and show the first 2 rows.
> hf %>% select(everything()) %>% head(2)
Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum
5424 2011 1 1 6 1400 1500 AA 428 N576AA
5425 2011 1 2 7 1401 1501 AA 428 N557AA
ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut
5424 60 40 -10 0 IAH DFW 224 7 13
5425 60 45 -9 1 IAH DFW 224 6 9
Cancelled CancellationCode Diverted
5424 0 0
5425 0 0
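The next example selects the 4 columns Origin, Dest, TaxiIn, and TaxiOut in that order, followed by all remaining columns, which amounts to reordering the columns. It shows the first 2 rows.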
> hf %>% select(Origin, Dest, TaxiIn, TaxiOut, everything()) %>% head(2)
Origin Dest TaxiIn TaxiOut Year Month DayofMonth DayOfWeek DepTime ArrTime
5424 IAH DFW 7 13 2011 1 1 6 1400 1500
5425 IAH DFW 6 9 2011 1 2 7 1401 1501
UniqueCarrier FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay
5424 AA 428 N576AA 60 40 -10 0
5425 AA 428 N557AA 60 45 -9 1
Distance Cancelled CancellationCode Diverted
5424 224 0 0
5425 224 0 0
Select the 4 consecutive columns from Year through DayOfWeek plus the 2 columns TailNum and Distance, 6 columns in total, and show the first 3 rows.
> hf %>% select(Year:DayOfWeek, TailNum, Distance) %>% head(3)
Year Month DayofMonth DayOfWeek TailNum Distance
5424 2011 1 1 6 N576AA 224
5425 2011 1 2 7 N557AA 224
5426 2011 1 3 1 N541AA 224
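If the elements of a character vector are the names of the columns to select, the vector can be used directly via one_of(). The next example selects the same columns as the previous one.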
> cols <- c("Year", "Month", "DayofMonth", "DayOfWeek", "TailNum", "Distance")
> hf %>% select(one_of(cols)) %>% head(3) # Same result as above.
Year Month DayofMonth DayOfWeek TailNum Distance
5424 2011 1 1 6 N576AA 224
5425 2011 1 2 7 N557AA 224
5426 2011 1 3 1 N541AA 224
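If the elements of a character vector are the names of the columns to exclude, the vector can be used directly as well. The next example selects the complement of the previous one.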
> hf %>% select(-one_of(cols)) %>% head(3)
DepTime ArrTime UniqueCarrier FlightNum ActualElapsedTime AirTime ArrDelay
5424 1400 1500 AA 428 60 40 -10
5425 1401 1501 AA 428 60 45 -9
5426 1352 1502 AA 428 70 48 -8
DepDelay Origin Dest TaxiIn TaxiOut Cancelled CancellationCode Diverted
5424 0 IAH DFW 7 13 0 0
5425 1 IAH DFW 6 9 0 0
5426 -8 IAH DFW 5 17 0 0
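In more recent releases one_of() is marked as superseded in favour of all_of() and any_of(). The sketch below assumes dplyr 1.0.0 or later and uses a made-up one-row data frame (toy is hypothetical): all_of() raises an error if a name is missing, while any_of() silently skips it.

```r
library(dplyr)  # assumes dplyr >= 1.0.0, which re-exports all_of() and any_of()

# Hypothetical one-row data frame.
toy <- data.frame(Year = 2011, Month = 1, TailNum = "N576AA",
                  Distance = 224, stringsAsFactors = FALSE)
cols <- c("Year", "TailNum")

kept    <- toy %>% select(all_of(cols))    # only the named columns
dropped <- toy %>% select(-all_of(cols))   # everything except them

# any_of() ignores names that are not present in the data.
lenient <- toy %>% select(any_of(c("Year", "NoSuchColumn")))
```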
Select the columns whose names contain "arr" (ignoring case) and show the last 3 rows.
> hf %>% select(contains("arr", ignore.case = TRUE)) %>% tail(3)
ArrTime UniqueCarrier ArrDelay
6083257 1031 WN -4
6083258 812 WN -13
6083259 1713 WN -12
Exclude the columns whose names contain "arr" (ignoring case) and show the last 2 rows.
> hf %>% select(-contains("arr", ignore.case = TRUE)) %>% tail(2)
Year Month DayofMonth DayOfWeek DepTime FlightNum TailNum ActualElapsedTime
6083258 2011 12 6 2 656 621 N727SW 76
6083259 2011 12 6 2 1600 1597 N745SW 73
AirTime DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled CancellationCode
6083258 64 -4 HOU TUL 453 3 9 0
6083259 59 0 HOU TUL 453 3 11 0
Diverted
6083258 0
6083259 0
Select the columns whose names start with "Day" (case-sensitive) and show the first 3 rows.
> hf %>% select(starts_with("Day", ignore.case = FALSE)) %>% head(3)
DayofMonth DayOfWeek
5424 1 6
5425 2 7
5426 3 1
Exclude the columns whose names start with "Day" (case-sensitive) and show the first 2 rows.
> hf %>% select(-starts_with("Day", ignore.case = FALSE)) %>% head(2)
Year Month DepTime ArrTime UniqueCarrier FlightNum TailNum ActualElapsedTime
5424 2011 1 1400 1500 AA 428 N576AA 60
5425 2011 1 1401 1501 AA 428 N557AA 60
AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled
5424 40 -10 0 IAH DFW 224 7 13 0
5425 45 -9 1 IAH DFW 224 6 9 0
CancellationCode Diverted
5424 0
5425 0
Select the columns whose names end with "Time" (ignoring case) and show the last 3 rows.
> hf %>% select(ends_with("Time", ignore.case = TRUE)) %>% tail(3)
DepTime ArrTime ActualElapsedTime AirTime
6083257 912 1031 79 61
6083258 656 812 76 64
6083259 1600 1713 73 59
Exclude the columns whose names end with "Time" (ignoring case) and show the last 2 rows.
> hf %>% select(-ends_with("Time", ignore.case = TRUE)) %>% tail(2)
Year Month DayofMonth DayOfWeek UniqueCarrier FlightNum TailNum ArrDelay
6083258 2011 12 6 2 WN 621 N727SW -13
6083259 2011 12 6 2 WN 1597 N745SW -12
DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled CancellationCode
6083258 -4 HOU TUL 453 3 9 0
6083259 0 HOU TUL 453 3 11 0
Diverted
6083258 0
6083259 0
Select the columns whose names match the regular expression ".m." and show the first 3 rows.
> hf %>% select(matches(".m.")) %>% head(3)
DayofMonth DepTime ArrTime ActualElapsedTime AirTime
5424 1 1400 1500 60 40
5425 2 1401 1501 60 45
5426 3 1352 1502 70 48
Exclude the columns whose names match the regular expression ".m." and show the first 2 rows.
> hf %>% select(-matches(".m.")) %>% head(2)
Year Month DayOfWeek UniqueCarrier FlightNum TailNum ArrDelay DepDelay Origin
5424 2011 1 6 AA 428 N576AA -10 0 IAH
5425 2011 1 7 AA 428 N557AA -9 1 IAH
Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
5424 DFW 224 7 13 0 0
5425 DFW 224 6 9 0 0
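The pattern ".m." means "any character, then m, then any character", and matches() ignores case by default, which is why DayofMonth is selected while Month and FlightNum are not. The sketch below reproduces that logic with base R's grepl() on a subset of the column names; it illustrates the pattern and is not how select() is implemented.

```r
# A subset of the hflights column names.
cols <- c("Year", "Month", "DayofMonth", "DepTime", "FlightNum", "AirTime")

# ".m." needs one character on each side of an m (or M, since case is ignored).
# "Month" has nothing before its M; "FlightNum" has nothing after its m.
hits <- cols[grepl(".m.", cols, ignore.case = TRUE)]
hits
```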
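Columns can be renamed while they are selected: the new name goes on the left of the equals sign and the old name on the right. The next example renames DepTime to 出發時間 (departure time) and ArrTime to 到達時間 (arrival time) and shows the first 2 rows.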
> hf %>% select(出發時間 = DepTime, 到達時間 = ArrTime) %>% head(2)
出發時間 到達時間
5424 1400 1500
5425 1401 1501
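rename() (rename columns)
Columns can also be renamed with rename(). It differs from the select() example above in that rename() keeps all columns, whereas select() keeps only the selected ones.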
> hf %>% rename(出發時間 = DepTime, 到達時間 = ArrTime) %>% head(2)
Year Month DayofMonth DayOfWeek 出發時間 到達時間 UniqueCarrier FlightNum TailNum
5424 2011 1 1 6 1400 1500 AA 428 N576AA
5425 2011 1 2 7 1401 1501 AA 428 N557AA
ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut
5424 60 40 -10 0 IAH DFW 224 7 13
5425 60 45 -9 1 IAH DFW 224 6 9
Cancelled CancellationCode Diverted
5424 0 0
5425 0 0
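mutate() (add columns)
mutate() adds new columns at the far right of an existing data frame. The example below adds 機上地面時間 (time spent on the ground while aboard, ActualElapsedTime minus AirTime) and 延遲時間合計 (total delay, ArrDelay plus DepDelay).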
> hf %>% mutate(機上地面時間 = ActualElapsedTime - AirTime, 延遲時間合計 = ArrDelay + DepDelay) %>% head(3)
Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum
1 2011 1 1 6 1400 1500 AA 428 N576AA
2 2011 1 2 7 1401 1501 AA 428 N557AA
3 2011 1 3 1 1352 1502 AA 428 N541AA
ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut
1 60 40 -10 0 IAH DFW 224 7 13
2 60 45 -9 1 IAH DFW 224 6 9
3 70 48 -8 -8 IAH DFW 224 5 17
Cancelled CancellationCode Diverted 機上地面時間 延遲時間合計
1 0 0 20 -10
2 0 0 15 -8
3 0 0 22 -16
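transmute() (add columns, dropping the rest)
New columns can also be created with transmute(). It differs from the mutate() example above in that mutate() keeps all the original columns, whereas transmute() keeps only the newly created ones.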
> hf %>% transmute(機上地面時間 = ActualElapsedTime - AirTime,
+ 延遲時間合計 = ArrDelay + DepDelay) %>% head(3)
機上地面時間 延遲時間合計
1 20 -10
2 15 -8
3 22 -16
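The difference between the two functions shows up directly in the column names of the result. This is a minimal sketch on invented data; toy, elapsed, air, and ground are hypothetical names.

```r
library(dplyr)

# Hypothetical toy data.
toy <- data.frame(elapsed = c(60, 70), air = c(40, 48))

with_all <- toy %>% mutate(ground = elapsed - air)     # original columns + ground
only_new <- toy %>% transmute(ground = elapsed - air)  # ground only

names(with_all)
names(only_new)
```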
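summarise(), summarize() (summarise)
summarise() or summarize() collapses many values into a single value. The example below computes the mean and standard deviation of the actual elapsed time (ActualElapsedTime), the maximum flight distance, and the number of rows.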
> hf %>% summarise(AET_mean = mean(ActualElapsedTime, na.rm = TRUE),
+ AET_sd = sd(ActualElapsedTime, na.rm = TRUE),
+ max(Distance), n())
AET_mean AET_sd max(Distance) n()
1 129.3237 59.28584 3904 227496
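group_by() (group)
group_by() splits a data frame into groups. It is usually combined with summarise() to obtain summary values for each group. The example below groups by origin airport (Origin) and then computes, for each group, the mean and standard deviation of the actual elapsed time (ActualElapsedTime), the maximum flight distance, and the number of rows.
> hf %>% group_by(Origin) %>%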
+ summarise(AET_mean = mean(ActualElapsedTime, na.rm = TRUE),
+ AET_sd = sd(ActualElapsedTime, na.rm = TRUE),
+ max(Distance), n())
Source: local data frame [2 x 5]
Origin AET_mean AET_sd max(Distance) n()
(chr) (dbl) (dbl) (int) (int)
1 HOU 101.4465 51.31246 1642 52299
2 IAH 137.6125 58.96823 3904 175197
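Giving every summary an explicit name avoids column headers such as max(Distance) and n(). The sketch below applies the same grouping pattern to a small invented data frame (toy, origin, elapsed, mean_elapsed, and flights are hypothetical names) and includes a missing value to show what na.rm = TRUE does.

```r
library(dplyr)

# Hypothetical toy data with one missing elapsed time.
toy <- data.frame(
  origin  = c("HOU", "HOU", "IAH", "IAH", "IAH"),
  elapsed = c(100, NA, 120, 140, 160),
  stringsAsFactors = FALSE
)

by_origin <- toy %>%
  group_by(origin) %>%
  summarise(
    mean_elapsed = mean(elapsed, na.rm = TRUE),  # NA is dropped before averaging
    flights      = n()                           # n() still counts every row
  )
by_origin
```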
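count() (count)
count() counts the rows for each combination of the given variables. The example below counts every combination of origin (Origin) and destination (Dest), sorts by count, and shows the 5 combinations with the highest counts.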
> hf %>% count(Origin, Dest, sort = TRUE) %>% head(5)
Source: local data frame [5 x 3]
Groups: Origin [1]
Origin Dest n
(chr) (chr) (int)
1 HOU DAL 8243
2 HOU MSY 3362
3 HOU ATL 2889
4 HOU DFW 2424
5 HOU HRL 2309
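count() can be read as shorthand for grouping, counting with n(), and optionally sorting. The sketch below compares the two spellings on invented data (toy, origin, and dest are hypothetical names); whether the result stays grouped has varied between dplyr versions, so the long form calls ungroup() explicitly.

```r
library(dplyr)

# Hypothetical toy data: 7 flights over 4 origin/destination pairs.
toy <- data.frame(
  origin = c("HOU", "HOU", "IAH", "HOU", "IAH", "IAH", "IAH"),
  dest   = c("DAL", "DAL", "DFW", "MSY", "DFW", "DFW", "LAX"),
  stringsAsFactors = FALSE
)

# Short form.
short <- toy %>% count(origin, dest, sort = TRUE)

# Long form: group, count rows per group, drop the grouping, sort by count.
long <- toy %>%
  group_by(origin, dest) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  arrange(desc(n))
```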
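sample_n(), sample_frac() (sampling)
sample_n() and sample_frac() draw a random sample of rows from a data frame; the former takes the number of rows to draw, the latter the fraction of rows. The example below uses sample_n() to draw 3 rows with replacement, weighted by Month.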
> set.seed(1)
> hf %>% sample_n(3, replace = TRUE, weight = Month)
Year Month DayofMonth DayOfWeek DepTime ArrTime
1568379 2011 4 6 3 1006 1333
2199156 2011 5 30 1 2118 2204
3221709 2011 7 22 5 559 957
UniqueCarrier FlightNum TailNum ActualElapsedTime
1568379 CO 1544 N78511 147
2199156 XE 2261 N14904 46
3221709 WN 525 N493WN 178
AirTime ArrDelay DepDelay Origin Dest Distance
1568379 128 -20 -4 IAH CLE 1091
2199156 33 -6 -2 IAH AEX 190
3221709 158 -3 -1 HOU BWI 1246
TaxiIn TaxiOut Cancelled CancellationCode Diverted
1568379 6 13 0 0
2199156 2 11 0 0
3221709 11 9 0 0
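The example below uses sample_frac() to draw a fraction of 1.3187e-05 of the rows, which corresponds to 3 of the 227,496 rows. Because set.seed() is again called with the argument 1, the result is the same as in the previous example.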
> set.seed(1)
> hf %>% sample_frac(1.3187e-05, replace = TRUE, weight = Month)
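The fraction 1.3187e-05 is simply 3 divided by the number of rows. The arithmetic below is a sketch in base R; it assumes that sample_frac() rounds the product of the fraction and the row count to the nearest whole number, and the variable names are made up for illustration.

```r
n_rows <- 227496      # number of rows in hf, as reported by glimpse()
frac   <- 1.3187e-05  # fraction used in the sample_frac() call

exact  <- 3 / n_rows              # the fraction that corresponds to exactly 3 rows
n_draw <- round(n_rows * frac)    # assumed rounding step

signif(exact, 5)
n_draw
```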
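When sampling is done after grouping, the sample size or fraction applies to each group. The example below selects the 8 columns from TailNum through Distance, groups by origin (Origin), and draws 3 rows from each group with replacement, weighted by Distance. The result is as follows: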
> set.seed(1)
> hf %>% select(TailNum:Distance) %>%
+ group_by(Origin) %>%
+ sample_n(3, replace = TRUE, weight = Distance)
Source: local data frame [6 x 8]
Groups: Origin [2]
TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
(chr) (int) (int) (int) (int) (chr)
1 N771SA 223 210 -22 5 HOU
2 N396SW 116 103 4 3 HOU
3 N391SW 177 150 48 6 HOU
4 N754SK 134 107 -1 2 IAH
5 N719SK 161 129 -9 -4 IAH
6 N14947 96 73 -4 -5 IAH
Variables not shown: Dest (chr), Distance (int)