dplyr

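First, install and load two packages, dplyr and hflights. The former provides the functions used here; the latter is a dataset (the original data come from the Bureau of Transportation Statistics, U.S. Department of Transportation: http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120).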

install.packages(c("dplyr", "hflights"))
library("dplyr"); library("hflights")

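glimpse() (a quick look)

Next, read the dataset in as a data frame named hf and use glimpse() to get an overview of its structure and contents. The result shows that it has 21 variables (columns) and 227,496 observations (rows).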

> hf <- data.frame(hflights)
> glimpse(hf)
Observations: 227,496
Variables: 21
$ Year              (int) 2011, 2011, 2011, 2011, 2011, 2011, ...
$ Month             (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ DayofMonth        (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1...
$ DayOfWeek         (int) 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, ...
$ DepTime           (int) 1400, 1401, 1352, 1403, 1405, 1359, ...
$ ArrTime           (int) 1500, 1501, 1502, 1513, 1507, 1503, ...
$ UniqueCarrier     (chr) "AA", "AA", "AA", "AA", "AA", "AA", ...
$ FlightNum         (int) 428, 428, 428, 428, 428, 428, 428, 4...
$ TailNum           (chr) "N576AA", "N557AA", "N541AA", "N403A...
$ ActualElapsedTime (int) 60, 60, 70, 70, 62, 64, 70, 59, 71, ...
$ AirTime           (int) 40, 45, 48, 39, 44, 45, 43, 40, 41, ...
$ ArrDelay          (int) -10, -9, -8, 3, -3, -7, -1, -16, 44,...
$ DepDelay          (int) 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, ...
$ Origin            (chr) "IAH", "IAH", "IAH", "IAH", "IAH", "...
$ Dest              (chr) "DFW", "DFW", "DFW", "DFW", "DFW", "...
$ Distance          (int) 224, 224, 224, 224, 224, 224, 224, 2...
$ TaxiIn            (int) 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4,...
$ TaxiOut           (int) 13, 9, 17, 22, 9, 13, 15, 12, 22, 19...
$ Cancelled         (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ CancellationCode  (chr) "", "", "", "", "", "", "", "", "", ...
$ Diverted          (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

The variables have the following meanings:

Variable                 Description
Year, Month, DayofMonth  date of departure
DayOfWeek                day of week of departure (useful for removing weekend effects)
DepTime, ArrTime         departure and arrival times (in local time, hhmm)
UniqueCarrier            unique abbreviation for a carrier
FlightNum                flight number
TailNum                  airplane tail number
ActualElapsedTime        elapsed time of flight, in minutes
AirTime                  flight time, in minutes
ArrDelay, DepDelay       arrival and departure delays, in minutes
Origin, Dest             origin and destination airport codes
Distance                 distance of flight, in miles
TaxiIn, TaxiOut          taxi in and out times, in minutes
Cancelled                cancelled indicator: 1 = Yes, 0 = No
CancellationCode         reason for cancellation: A = carrier, B = weather, C = national air system, D = security
Diverted                 diverted indicator: 1 = Yes, 0 = No

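%>% (pipe)

With the binary operator %>% acting as a "pipe", a call f(x, y) can be rewritten as x %>% f(y), which makes chained operations (chaining) easier to read. For example, third(second(first(x, a), b), c) becomes x %>% first(a) %>% second(b) %>% third(c). The following call gives the same result as the previous example.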

> hflights %>% data.frame() %>% glimpse() # Equivalent to glimpse(data.frame(hflights)).
Observations: 227,496
Variables: 21
$ Year              (int) 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, ...
$ Month             (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ DayofMonth        (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
$ DayOfWeek         (int) 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, ...
…(output omitted)…
$ CancellationCode  (chr) "", "", "", "", "", "", "", "", "", "", "", "", "", ""...
$ Diverted          (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
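
The equivalence between nested calls and a pipe chain can be checked on a small scale. The sketch below is only an illustration: add() and multiply() are hypothetical helper functions made up for this example, not part of dplyr.

```r
library(dplyr)  # dplyr makes the %>% operator available

# Hypothetical helpers, defined here for illustration only.
add      <- function(x, y) x + y
multiply <- function(x, y) x * y

# Nested form: read from the inside out.
nested <- multiply(add(1, 2), 10)

# Piped form: read from left to right; the left-hand side becomes
# the first argument of the function on the right.
piped <- 1 %>% add(2) %>% multiply(10)

identical(nested, piped)
```

Both forms compute the same value; the piped form simply lists the steps in the order in which they are applied.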

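distinct() (unique rows)

This function keeps the distinct rows of a data frame, i.e. it removes duplicated rows. The example below selects the three columns UniqueCarrier through TailNum from hf, removes the duplicated rows, and shows the first 5 of the remaining unique rows.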

> hf %>% select(UniqueCarrier:TailNum) %>% distinct() %>% head(5)
  UniqueCarrier FlightNum TailNum
1            AA       428  N576AA
2            AA       428  N557AA
3            AA       428  N541AA
4            AA       428  N403AA
5            AA       428  N492AA
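
To obtain the number of unique rows rather than the rows themselves, distinct() can be followed by nrow(). The sketch below uses a small made-up data frame (the name flights and its values are assumptions for illustration, not taken from hflights).

```r
library(dplyr)

# Toy data, made up for illustration: rows 1 and 3 are exact duplicates.
flights <- data.frame(
  carrier = c("AA", "AA", "AA", "WN"),
  flight  = c(428L, 428L, 428L, 621L),
  tail    = c("N576AA", "N557AA", "N576AA", "N727SW"),
  stringsAsFactors = FALSE
)

# Number of distinct rows across all columns.
n_unique <- flights %>% distinct() %>% nrow()

# Number of distinct values of a single column.
n_carriers <- flights %>% distinct(carrier) %>% nrow()
```

Passing column names to distinct() restricts the comparison to those columns.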

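arrange() (sort)

The example below uses arrange() to sort hf in ascending order of year, month and day, and in descending order of departure time and arrival time, then shows the first 5 rows of the result. Note that desc() takes a single column, so each column to be sorted in descending order gets its own desc() call.

> hf %>% arrange(Year, Month, DayofMonth, desc(DepTime), desc(ArrTime)) %>% head(5)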
  Year Month DayofMonth DayOfWeek DepTime ArrTime
1 2011     1          1         6    2250      53
2 2011     1          1         6    2223    2329
3 2011     1          1         6    2149    2323
4 2011     1          1         6    2142    2222
5 2011     1          1         6    2139    2303
  UniqueCarrier FlightNum TailNum ActualElapsedTime AirTime
1            CO      1644  N37409               243     227
2            XE      2515  N12921                66      48
3            CO      1597  N16217               214     196
4            XE      3033  N11536                40      24
5            XE      2192  N16963                84      67
  ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut
1      109      120    IAH  SMF     1609      5      11
2       64       63    IAH  HRL      295      7      11
3       38       39    IAH  ONT     1334      6      12
4        7        7    IAH  LCH      127      3      13
5       25       24    IAH  MAF      429      4      13
  Cancelled CancellationCode Diverted
1         0                         0
2         0                         0
3         0                         0
4         0                         0
5         0                         0
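
The mixed ascending and descending sort can be tried on a small made-up data frame, where the effect of each key is easy to see. The column names day, dep and arr are assumptions for this sketch.

```r
library(dplyr)

# Toy data, made up for illustration.
toy <- data.frame(
  day = c(1L, 1L, 2L, 1L),
  dep = c(900L, 2250L, 700L, 2250L),
  arr = c(1000L, 53L, 800L, 2329L)
)

# day ascending; within a day, dep descending; ties on dep broken by
# arr descending. Each descending key has its own desc() call.
sorted <- toy %>% arrange(day, desc(dep), desc(arr))
sorted
```

The two rows with dep equal to 2250 are ordered by arr, which shows that the third key only matters when the earlier keys tie.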

slice() (select rows)

slice() selects rows of a data frame by their index positions. The example below selects rows 2 through 4, 3 rows in total.

> hf %>% slice(2:4)
  Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum
1 2011     1          2         7    1401    1501            AA       428  N557AA
2 2011     1          3         1    1352    1502            AA       428  N541AA
3 2011     1          4         2    1403    1513            AA       428  N403AA
  ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut
1                60      45       -9        1    IAH  DFW      224      6       9
2                70      48       -8       -8    IAH  DFW      224      5      17
3                70      39        3        3    IAH  DFW      224      9      22
  Cancelled CancellationCode Diverted
1         0                         0
2         0                         0
3         0                         0

The example below selects the second-to-last and the last rows, 2 rows in total.

> hf %>% slice((n()-1):n())
  Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum
1 2011    12          6         2     656     812            WN       621  N727SW
2 2011    12          6         2    1600    1713            WN      1597  N745SW
  ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut
1                76      64      -13       -4    HOU  TUL      453      3       9
2                73      59      -12        0    HOU  TUL      453      3      11
  Cancelled CancellationCode Diverted
1         0                         0
2         0                         0

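filter() (select rows)

A relational database has no inherent notion of row order, so index positions cannot be used directly when dplyr serves as a front end to such a database. The first slice() example above would then be rewritten with filter() as follows: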

> hf %>% filter(between(row_number(), 2, 4))
  Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum
1 2011     1          2         7    1401    1501            AA       428  N557AA
2 2011     1          3         1    1352    1502            AA       428  N541AA
3 2011     1          4         2    1403    1513            AA       428  N403AA
  ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut
1                60      45       -9        1    IAH  DFW      224      6       9
2                70      48       -8       -8    IAH  DFW      224      5      17
3                70      39        3        3    IAH  DFW      224      9      22
  Cancelled CancellationCode Diverted
1         0                         0
2         0                         0
3         0                         0

The second slice() example above would be rewritten with filter() as follows:

> hf %>% filter(between(row_number(), n()-1, n()))
  Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum
1 2011    12          6         2     656     812            WN       621  N727SW
2 2011    12          6         2    1600    1713            WN      1597  N745SW
  ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut
1                76      64      -13       -4    HOU  TUL      453      3       9
2                73      59      -12        0    HOU  TUL      453      3      11
  Cancelled CancellationCode Diverted
1         0                         0
2         0                         0
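
That the slice() and filter() forms pick the same rows can be checked on a small local data frame. This is a sketch on made-up data (the names toy, id and value are assumptions); it does not involve a database back end.

```r
library(dplyr)

# Toy data, made up for illustration.
toy <- data.frame(id = 1:6, value = letters[1:6], stringsAsFactors = FALSE)

# Rows 2 to 4, by position and by a condition on row_number().
by_slice  <- toy %>% slice(2:4)
by_filter <- toy %>% filter(between(row_number(), 2, 4))

# Last two rows, using n() for the total number of rows.
last_two_slice  <- toy %>% slice((n() - 1):n())
last_two_filter <- toy %>% filter(between(row_number(), n() - 1, n()))
```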

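Use slice() when the index positions are known; to pick rows according to whether particular columns meet some condition, use filter(). The example below selects the rows whose departure delay exceeds 30 minutes "or" whose destination is Los Angeles (LAX), and shows the last 4 rows of the result.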

> filter(hf, DepDelay > 30 | Dest == "LAX") %>% tail(4)
      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum
28026 2011    12          6         2    1352    1749            WN      3085  N510SW
28027 2011    12          6         2    1850    2046            WN        39  N754SW
28028 2011    12          6         2    1723    1845            WN        33  N698SW
28029 2011    12          6         2    2023    2109            WN       207  N354SW
      ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut
28026               177     163       59       72    HOU  PHL     1336      5       9
28027               176     157       71       70    HOU  PHX     1020      4      15
28028               202     192       70       78    HOU  SAN     1313      3       7
28029                46      38       29       43    HOU  SAT      192      4       4
      Cancelled CancellationCode Diverted
28026         0                         0
28027         0                         0
28028         0                         0
28029         0                         0

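The example below selects the rows whose departure delay exceeds 30 minutes "and" whose destination is Los Angeles (LAX), and shows the last 3 rows of the result.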

> filter(hf, DepDelay > 30 & Dest == "LAX") %>% tail(3)
    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum
692 2011    12         20         2    1423    1554            WN       706  N260WN
693 2011    12         21         3    1424    1557            WN       706  N433LV
694 2011    12         23         5    2207    2334            WN      1053  N290WN
    ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut
692               211     194       14       33    HOU  LAX     1390      9       8
693               213     194       17       34    HOU  LAX     1390     12       7
694               207     194       14       37    HOU  LAX     1390      6       7
    Cancelled CancellationCode Diverted
692         0                         0
693         0                         0
694         0                         0

When several conditions are given as separate arguments, they are combined with "and".

> hf %>% filter(DepDelay > 30 , Dest == "LAX") %>% tail(2) # Same conditions as above, but only two rows are shown.
    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum
693 2011    12         21         3    1424    1557            WN       706  N433LV
694 2011    12         23         5    2207    2334            WN      1053  N290WN
    ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut
693               213     194       17       34    HOU  LAX     1390     12       7
694               207     194       14       37    HOU  LAX     1390      6       7
    Cancelled CancellationCode Diverted
693         0                         0
694         0                         0
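
The three ways of combining conditions can be compared side by side on a small made-up data frame (the names toy, delay and dest are assumptions for this sketch).

```r
library(dplyr)

# Toy data, made up for illustration.
toy <- data.frame(
  delay = c(45L, 10L, 35L, 5L),
  dest  = c("LAX", "LAX", "DFW", "DFW"),
  stringsAsFactors = FALSE
)

either   <- toy %>% filter(delay > 30 | dest == "LAX")  # "or": rows 1, 2, 3
both_amp <- toy %>% filter(delay > 30 & dest == "LAX")  # "and": row 1
both_sep <- toy %>% filter(delay > 30, dest == "LAX")   # comma acts as "and"
```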

select() (select columns)

Select all columns and show the first 2 rows.

> hf %>% select(everything()) %>% head(2)
     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum
5424 2011     1          1         6    1400    1500            AA       428  N576AA
5425 2011     1          2         7    1401    1501            AA       428  N557AA
     ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut
5424                60      40      -10        0    IAH  DFW      224      7      13
5425                60      45       -9        1    IAH  DFW      224      6       9
     Cancelled CancellationCode Diverted
5424         0                         0
5425         0                         0

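The next example selects the 4 columns Origin, Dest, TaxiIn and TaxiOut in that order, followed by all remaining columns, which amounts to reordering the columns. The first 2 rows are shown.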

> hf %>% select(Origin, Dest, TaxiIn, TaxiOut, everything()) %>% head(2)
     Origin Dest TaxiIn TaxiOut Year Month DayofMonth DayOfWeek DepTime ArrTime
5424    IAH  DFW      7      13 2011     1          1         6    1400    1500
5425    IAH  DFW      6       9 2011     1          2         7    1401    1501
     UniqueCarrier FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay
5424            AA       428  N576AA                60      40      -10        0
5425            AA       428  N557AA                60      45       -9        1
     Distance Cancelled CancellationCode Diverted
5424      224         0                         0
5425      224         0                         0

Select the 4 consecutive columns from Year through DayOfWeek plus the 2 columns TailNum and Distance, 6 columns in total, and show the first 3 rows.

> hf %>% select(Year:DayOfWeek, TailNum, Distance) %>% head(3)
     Year Month DayofMonth DayOfWeek TailNum Distance
5424 2011     1          1         6  N576AA      224
5425 2011     1          2         7  N557AA      224
5426 2011     1          3         1  N541AA      224

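If the elements of a character vector are the names of the columns to select, the vector can be used directly. The next example selects the same columns as the previous one.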

> cols <- c("Year", "Month", "DayofMonth", "DayOfWeek", "TailNum", "Distance")
> hf %>% select(one_of(cols)) %>% head(3) # Same result as above.
     Year Month DayofMonth DayOfWeek TailNum Distance
5424 2011     1          1         6  N576AA      224
5425 2011     1          2         7  N557AA      224
5426 2011     1          3         1  N541AA      224

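A character vector can likewise hold the names of the columns to exclude. The next example selects exactly the columns that the previous one left out.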

> hf %>% select(-one_of(cols)) %>% head(3)
     DepTime ArrTime UniqueCarrier FlightNum ActualElapsedTime AirTime ArrDelay
5424    1400    1500            AA       428                60      40      -10
5425    1401    1501            AA       428                60      45       -9
5426    1352    1502            AA       428                70      48       -8
     DepDelay Origin Dest TaxiIn TaxiOut Cancelled CancellationCode Diverted
5424        0    IAH  DFW      7      13         0                         0
5425        1    IAH  DFW      6       9         0                         0
5426       -8    IAH  DFW      5      17         0                         0
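
In more recent dplyr releases the same selection by character vector can also be written with all_of() and any_of(). The sketch below assumes an installed dplyr version that provides these two helpers, and uses a made-up one-row data frame.

```r
library(dplyr)

# Toy data, made up for illustration.
toy <- data.frame(Year = 2011L, Month = 1L, TailNum = "N576AA",
                  Distance = 224L, stringsAsFactors = FALSE)
cols <- c("Year", "TailNum")

# all_of(): every name in cols must exist in the data frame.
kept <- toy %>% select(all_of(cols))

# any_of(): names that do not exist are silently ignored,
# which is convenient when dropping columns.
dropped <- toy %>% select(-any_of(cols))
```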

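Select the columns whose names contain "arr" (ignoring case) and show the last 3 rows. UniqueCarrier is included because "Carrier" contains "arr".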

> hf %>% select(contains("arr", ignore.case = TRUE)) %>% tail(3)
        ArrTime UniqueCarrier ArrDelay
6083257    1031            WN       -4
6083258     812            WN      -13
6083259    1713            WN      -12

Exclude the columns whose names contain "arr" (ignoring case) and show the last 2 rows.

> hf %>% select(-contains("arr", ignore.case = TRUE)) %>% tail(2)
        Year Month DayofMonth DayOfWeek DepTime FlightNum TailNum ActualElapsedTime
6083258 2011    12          6         2     656       621  N727SW                76
6083259 2011    12          6         2    1600      1597  N745SW                73
        AirTime DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled CancellationCode
6083258      64       -4    HOU  TUL      453      3       9         0                 
6083259      59        0    HOU  TUL      453      3      11         0                 
        Diverted
6083258        0
6083259        0
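
Because contains() matches anywhere in the name, it can pick up more columns than intended. The sketch below contrasts it with starts_with() on a made-up one-row data frame that reuses four of the hflights column names.

```r
library(dplyr)

# Toy data, made up for illustration.
toy <- data.frame(ArrTime = 1500L, UniqueCarrier = "AA",
                  ArrDelay = -10L, DepTime = 1400L,
                  stringsAsFactors = FALSE)

# "arr" anywhere in the name: also matches UniqueCarrier.
with_arr <- names(select(toy, contains("arr")))

# "Arr" at the start of the name only.
starts_arr <- names(select(toy, starts_with("Arr")))
```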

Select the columns whose names start with "Day" (case-sensitive) and show the first 3 rows.

> hf %>% select(starts_with("Day", ignore.case = FALSE)) %>% head(3)
     DayofMonth DayOfWeek
5424          1         6
5425          2         7
5426          3         1

Exclude the columns whose names start with "Day" (case-sensitive) and show the first 2 rows.

> hf %>% select(-starts_with("Day", ignore.case = FALSE)) %>% head(2)
     Year Month DepTime ArrTime UniqueCarrier FlightNum TailNum ActualElapsedTime
5424 2011     1    1400    1500            AA       428  N576AA                60
5425 2011     1    1401    1501            AA       428  N557AA                60
     AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled
5424      40      -10        0    IAH  DFW      224      7      13         0
5425      45       -9        1    IAH  DFW      224      6       9         0
     CancellationCode Diverted
5424                         0
5425                         0

Select the columns whose names end with "Time" (ignoring case) and show the last 3 rows.

> hf %>% select(ends_with("Time", ignore.case = TRUE)) %>% tail(3)
        DepTime ArrTime ActualElapsedTime AirTime
6083257     912    1031                79      61
6083258     656     812                76      64
6083259    1600    1713                73      59

Exclude the columns whose names end with "Time" (ignoring case) and show the last 2 rows.

> hf %>% select(-ends_with("Time", ignore.case = TRUE)) %>% tail(2)
        Year Month DayofMonth DayOfWeek UniqueCarrier FlightNum TailNum ArrDelay
6083258 2011    12          6         2            WN       621  N727SW      -13
6083259 2011    12          6         2            WN      1597  N745SW      -12
        DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled CancellationCode
6083258       -4    HOU  TUL      453      3       9         0                 
6083259        0    HOU  TUL      453      3      11         0                 
        Diverted
6083258        0
6083259        0

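Select the columns whose names match the regular expression ".m." and show the first 3 rows. matches() ignores case by default, which is why DayofMonth is matched here (through "fMo"); Month is not matched, because no character precedes its "M".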

> hf %>% select(matches(".m.")) %>% head(3)
     DayofMonth DepTime ArrTime ActualElapsedTime AirTime
5424          1    1400    1500                60      40
5425          2    1401    1501                60      45
5426          3    1352    1502                70      48

Exclude the columns whose names match the regular expression ".m." and show the first 2 rows.

> hf %>% select(-matches(".m.")) %>% head(2)
     Year Month DayOfWeek UniqueCarrier FlightNum TailNum ArrDelay DepDelay Origin
5424 2011     1         6            AA       428  N576AA      -10        0    IAH
5425 2011     1         7            AA       428  N557AA       -9        1    IAH
     Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
5424  DFW      224      7      13         0                         0
5425  DFW      224      6       9         0                         0
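
The effect of case sensitivity on the ".m." pattern can be seen by comparing the default with ignore.case = FALSE. The sketch below uses a made-up one-row data frame that reuses four of the hflights column names.

```r
library(dplyr)

# Toy data, made up for illustration.
toy <- data.frame(Month = 1L, DayofMonth = 1L, DepTime = 1400L,
                  Dest = "DFW", stringsAsFactors = FALSE)

# Default: case is ignored, so "fMo" in DayofMonth and "ime" in DepTime match.
ignoring_case <- names(select(toy, matches(".m.")))

# Case-sensitive: only a lowercase "m" with a character on each side matches.
case_sensitive <- names(select(toy, matches(".m.", ignore.case = FALSE)))
```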

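Columns can be renamed while they are selected: the new name goes to the left of the equals sign and the old name to the right. The next example renames DepTime to 出發時間 (departure time) and ArrTime to 到達時間 (arrival time), and shows the first 2 rows.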

> hf %>% select(出發時間 = DepTime, 到達時間 = ArrTime) %>% head(2)
     出發時間 到達時間
5424     1400     1500
5425     1401     1501

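rename() (rename)

Columns can also be renamed with rename(). Unlike renaming inside select() as in the previous example, rename() keeps all columns, whereas select() keeps only the selected ones.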

> hf %>% rename(出發時間 = DepTime, 到達時間 = ArrTime) %>% head(2)
     Year Month DayofMonth DayOfWeek 出發時間 到達時間 UniqueCarrier FlightNum TailNum
5424 2011     1          1         6     1400     1500            AA       428  N576AA
5425 2011     1          2         7     1401     1501            AA       428  N557AA
     ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut
5424                60      40      -10        0    IAH  DFW      224      7      13
5425                60      45       -9        1    IAH  DFW      224      6       9
     Cancelled CancellationCode Diverted
5424         0                         0
5425         0                         0
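
The difference between the two ways of renaming shows up directly in the resulting column names. The sketch below uses a made-up one-row data frame; the new name dep is an assumption chosen for illustration.

```r
library(dplyr)

# Toy data, made up for illustration.
toy <- data.frame(DepTime = 1400L, ArrTime = 1500L, Dest = "DFW",
                  stringsAsFactors = FALSE)

# select() with renaming: only the selected column survives.
names_select <- names(select(toy, dep = DepTime))

# rename(): every column survives, one of them under a new name.
names_rename <- names(rename(toy, dep = DepTime))
```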

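mutate() (add columns)

mutate() adds new columns at the right-hand end of an existing data frame. The example below adds 機上地面時間 (time spent on the ground while on board, i.e. actual elapsed time minus air time) and 延遲時間合計 (total delay, i.e. arrival delay plus departure delay).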

> hf %>% mutate(機上地面時間 = ActualElapsedTime - AirTime, 延遲時間合計 = ArrDelay + DepDelay) %>% head(3)
  Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum
1 2011     1          1         6    1400    1500            AA       428  N576AA
2 2011     1          2         7    1401    1501            AA       428  N557AA
3 2011     1          3         1    1352    1502            AA       428  N541AA
  ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut
1                60      40      -10        0    IAH  DFW      224      7      13
2                60      45       -9        1    IAH  DFW      224      6       9
3                70      48       -8       -8    IAH  DFW      224      5      17
  Cancelled CancellationCode Diverted 機上地面時間 延遲時間合計
1         0                         0           20          -10
2         0                         0           15           -8
3         0                         0           22          -16

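transmute() (add columns)

New columns can also be created with transmute(). It differs from mutate() in the previous example in that mutate() keeps all the original columns, whereas transmute() keeps only the newly created ones.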

> hf %>% transmute(機上地面時間 = ActualElapsedTime - AirTime, 
+                  延遲時間合計 = ArrDelay + DepDelay) %>% head(3)
  機上地面時間 延遲時間合計
1           20          -10
2           15           -8
3           22          -16
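
The contrast between mutate() and transmute() can be reduced to a two-row sketch. The data frame and the column names elapsed, air and ground are made up for illustration.

```r
library(dplyr)

# Toy data, made up for illustration (minutes).
toy <- data.frame(elapsed = c(60L, 70L), air = c(40L, 48L))

# mutate(): original columns plus the new one.
with_mutate <- toy %>% mutate(ground = elapsed - air)

# transmute(): only the new column.
with_transmute <- toy %>% transmute(ground = elapsed - air)
```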

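summarise(), summarize() (summarise)

summarise() or summarize() reduces multiple values to a single value. The example below computes the mean and standard deviation of the actual elapsed time (ActualElapsedTime), the maximum flight distance, and the number of rows.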

> hf %>% summarise(AET_mean = mean(ActualElapsedTime, na.rm = TRUE), 
+                  AET_sd = sd(ActualElapsedTime, na.rm = TRUE), 
+                  max(Distance), n())
  AET_mean   AET_sd max(Distance)    n()
1 129.3237 59.28584          3904 227496

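group_by() (group)

group_by() splits a data frame into groups. It is usually combined with summarise() to obtain summary values for each group. The example below groups by origin airport (Origin) and then computes, for each group, the mean and standard deviation of the actual elapsed time (ActualElapsedTime), the maximum flight distance, and the number of rows.

> hf %>% group_by(Origin) %>% 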
+      summarise(AET_mean = mean(ActualElapsedTime, na.rm = TRUE), 
+                AET_sd = sd(ActualElapsedTime, na.rm = TRUE), 
+                max(Distance), n())
Source: local data frame [2 x 5]

  Origin AET_mean   AET_sd max(Distance)    n()
   (chr)    (dbl)    (dbl)         (int)  (int)
1    HOU 101.4465 51.31246          1642  52299
2    IAH 137.6125 58.96823          3904 175197
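
Giving every summary an explicit name avoids result columns called max(Distance) and n(), which are awkward to refer to later. The sketch below shows this on a made-up data frame; all names and values are assumptions for illustration.

```r
library(dplyr)

# Toy data, made up for illustration; one elapsed time is missing.
toy <- data.frame(
  origin   = c("HOU", "HOU", "IAH", "IAH", "IAH"),
  elapsed  = c(60, 80, 100, 120, NA),
  distance = c(200L, 300L, 900L, 1200L, 700L),
  stringsAsFactors = FALSE
)

by_origin <- toy %>%
  group_by(origin) %>%
  summarise(
    elapsed_mean = mean(elapsed, na.rm = TRUE),  # NA is dropped before averaging
    max_distance = max(distance),
    flights      = n()                           # rows per group, NA rows included
  )
by_origin
```

Note that n() counts all rows of a group, including the one whose elapsed time is missing, while mean(..., na.rm = TRUE) ignores it.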

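count() (count)

count() counts the number of rows for each combination of the given variables. The example below counts the rows for each combination of origin airport and destination airport, sorts by count (sort = TRUE), and shows the first 5 rows of the result. In the output shown, the result is still grouped by Origin and all 5 rows belong to HOU.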

> hf %>% count(Origin, Dest, sort = TRUE) %>% head(5)
Source: local data frame [5 x 3]
Groups: Origin [1]

  Origin  Dest     n
   (chr) (chr) (int)
1    HOU   DAL  8243
2    HOU   MSY  3362
3    HOU   ATL  2889
4    HOU   DFW  2424
5    HOU   HRL  2309
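
count() can be read as shorthand for grouping, summarising with n(), and sorting. The sketch below spells this out on a made-up data frame and assumes a recent dplyr version, in which count() on an ungrouped data frame returns an ungrouped result sorted across all combinations.

```r
library(dplyr)

# Toy data, made up for illustration.
toy <- data.frame(
  origin = c("HOU", "HOU", "IAH", "HOU", "IAH", "IAH", "IAH"),
  dest   = c("DAL", "DAL", "DFW", "MSY", "DFW", "DFW", "LAX"),
  stringsAsFactors = FALSE
)

# Shorthand.
counted <- toy %>% count(origin, dest, sort = TRUE)

# Long form: group, count rows per group, drop the grouping, sort.
by_hand <- toy %>%
  group_by(origin, dest) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  arrange(desc(n))
```

Calling ungroup() before arrange() makes the sort apply to the whole table rather than depend on any remaining grouping.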

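sample_n(), sample_frac() (sampling)

sample_n() or sample_frac() draws a sample of rows from a data frame; the former takes the number of rows to draw, the latter the fraction of rows. The example below uses sample_n() to draw 3 rows with replacement, weighted by month.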

> set.seed(1)
> hf %>% sample_n(3, replace = TRUE, weight = Month)
        Year Month DayofMonth DayOfWeek DepTime ArrTime
1568379 2011     4          6         3    1006    1333
2199156 2011     5         30         1    2118    2204
3221709 2011     7         22         5     559     957
        UniqueCarrier FlightNum TailNum ActualElapsedTime
1568379            CO      1544  N78511               147
2199156            XE      2261  N14904                46
3221709            WN       525  N493WN               178
        AirTime ArrDelay DepDelay Origin Dest Distance
1568379     128      -20       -4    IAH  CLE     1091
2199156      33       -6       -2    IAH  AEX      190
3221709     158       -3       -1    HOU  BWI     1246
        TaxiIn TaxiOut Cancelled CancellationCode Diverted
1568379      6      13         0                         0
2199156      2      11         0                         0
3221709     11       9         0                         0

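The example below uses sample_frac() to draw a fraction of 1.3187e-05 of the rows, which amounts to about 3 of the 227,496 rows. Because set.seed() is again called with the same argument 1, the result is the same as in the previous example.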

> set.seed(1)
> hf %>% sample_frac(1.3187e-05, replace = TRUE, weight = Month)
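
The fraction 1.3187e-05 is chosen so that it corresponds to 3 rows. The sketch below checks that arithmetic, assuming that sample_frac() rounds the product of the fraction and the row count to the nearest integer, and then shows the same idea on a made-up 10-row data frame.

```r
library(dplyr)

# Row count as reported by glimpse(hf), and the fraction used above.
n_rows   <- 227496
fraction <- 1.3187e-05

# Assumed rule: sample size = round(fraction * number of rows).
expected_size <- round(n_rows * fraction)

# Toy data, made up for illustration: 30% of 10 rows is 3 rows.
toy <- data.frame(id = 1:10)
set.seed(1)
drawn <- toy %>% sample_frac(0.3, replace = TRUE)
nrow(drawn)
```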

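When sampling after grouping, the number or fraction to draw applies to each group separately. The example below selects the 8 columns from TailNum through Distance, groups by origin airport, and draws 3 rows from each group with replacement, weighted by distance. The result is as follows: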

> set.seed(1)
> hf %>% select(TailNum:Distance) %>% 
+        group_by(Origin) %>% 
+        sample_n(3, replace = TRUE, weight = Distance)
Source: local data frame [6 x 8]
Groups: Origin [2]

  TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
    (chr)             (int)   (int)    (int)    (int)  (chr)
1  N771SA               223     210      -22        5    HOU
2  N396SW               116     103        4        3    HOU
3  N391SW               177     150       48        6    HOU
4  N754SK               134     107       -1        2    IAH
5  N719SK               161     129       -9       -4    IAH
6  N14947                96      73       -4       -5    IAH
Variables not shown: Dest (chr), Distance (int)
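
That the sample size applies per group can be confirmed by counting the sampled rows in each group. The sketch below uses a made-up data frame with groups of unequal size; all names and values are assumptions for illustration.

```r
library(dplyr)

# Toy data, made up for illustration: 4 rows for HOU, 6 rows for IAH.
toy <- data.frame(
  origin   = rep(c("HOU", "IAH"), times = c(4, 6)),
  distance = c(100, 200, 300, 400, 150, 250, 350, 450, 550, 650),
  stringsAsFactors = FALSE
)

set.seed(1)
sampled <- toy %>%
  group_by(origin) %>%
  sample_n(2, replace = TRUE, weight = distance)  # 2 rows from each group

# Sampled rows per group.
table(sampled$origin)
```

Each group contributes 2 rows regardless of its original size, so the result has 4 rows in total.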
