swirl Notes (Part 2)

2. Lessons in Getting and Cleaning Data:

1: Manipulating Data with dplyr

2: Grouping and Chaining with dplyr

3: Tidying Data with tidyr

4: Dates and Times with lubridate

2.1 Manipulating Data with dplyr

·The first step of working with data in dplyr is to load the data into what the package authors call a ‘data frame tbl’ or ‘tbl_df’.

> mydf <- read.csv(path2csv, stringsAsFactors = FALSE)  # read the csv file into a regular data frame

> cran <- tbl_df(mydf)  # wrap it in a new tbl_df

·dplyr supplies five “verbs” that cover most fundamental data manipulation tasks: select(), filter(), arrange(), mutate(), and summarize().

> select(cran,ip_id,package,country)

# A tibble: 225,468 x 3

ip_id package      country

<int> <chr>        <chr>

1     1 htmltools    US

2     2 tseries      US

3     3 party        US

4     3 Hmisc        US

5     4 digest       CA

6     3 randomForest US

7     3 plyr         US

8     5 whisker      US

9     6 Rcpp         CN

10     7 hflights     US

# … with 225,458 more rows

·A natural question is "How do I select a subset of rows?" That's where the filter() function comes in. filter(cran, r_version == "3.1.1", country == "US") returns all rows of cran corresponding to downloads from users in the US running R version 3.1.1. filter(cran, country == "US" | country == "IN") gives us all rows for which the country variable equals either "US" or "IN".

> filter(cran, r_version == "3.1.1", country == "US")

# A tibble: 1,588 x 11

X date   time    size r_ve~ r_ar~ r_os  pack~ vers~ coun~

<int> <chr>  <chr>  <int> <chr> <chr> <chr> <chr> <chr> <chr>

1  2216 2014-~ 00:4~ 3.85e5 3.1.1 x86_~ darw~ colo~ 1.2-4 US

2 17332 2014-~ 03:3~ 1.97e5 3.1.1 x86_~ darw~ httr  0.3   US

3 17465 2014-~ 03:2~ 2.33e4 3.1.1 x86_~ darw~ snow  0.3-~ US

4 18844 2014-~ 03:5~ 1.91e5 3.1.1 x86_~ darw~ maxL~ 1.2-0 US

5 30182 2014-~ 04:1~ 7.77e4 3.1.1 i386  ming~ rand~ 4.6-7 US

6 30193 2014-~ 04:0~ 2.35e6 3.1.1 i386  ming~ ggpl~ 1.0.0 US

7 30195 2014-~ 04:0~ 2.99e5 3.1.1 i386  ming~ fExt~ 3010~ US

8 30217 2014-~ 04:3~ 5.68e5 3.1.1 i386  ming~ rJava 0.9-6 US

9 30245 2014-~ 04:1~ 5.27e5 3.1.1 i386  ming~ LPCM  0.44~ US

10 30354 2014-~ 04:3~ 1.76e6 3.1.1 i386  ming~ mgcv  1.8-1 US

# … with 1,578 more rows, and 1 more variable: ip_id <int>

·To get only the rows for which r_version is not missing, we can use filter(cran, !is.na(r_version)).

·To order the rows of a dataset according to the values of a particular variable, use arrange(). To order the ROWS of cran2 so that ip_id is in ascending order (from small to large), type arrange(cran2, ip_id). To do the same in descending order, change the second argument to desc(ip_id), where desc() stands for "descending". arrange(cran2, package, ip_id) will first arrange by package names (ascending alphabetically), then by ip_id.
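A compact sketch of these calls (cran2 is the smaller tbl used in the lesson; any tbl containing ip_id and package behaves the same way):

arrange(cran2, ip_id)           # ascending ip_id
arrange(cran2, desc(ip_id))     # descending ip_id
arrange(cran2, package, ip_id)  # by package name first, then by ip_id within each package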

·The mutate() function is commonly used to create a new variable based on the value of one or more variables already in a dataset.
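For example, a sketch using the size column already present in cran (size_mb is just an illustrative name for the new column):

mutate(cran, size_mb = size / 2^20)  # add a column giving the download size in megabytes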

·The summarize() function collapses the dataset to a single row.
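A minimal sketch, again using cran's size column:

summarize(cran, avg_bytes = mean(size))  # one row holding the mean download size across all of cran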

 

2.2 Grouping and Chaining with dplyr

·To break up your dataset into groups of rows based on the values of one or more variables, use the group_by() function.

> by_package<-group_by(cran,package)

> by_package

# A tibble: 225,468 x 11

# Groups:   package [6,023]

X date   time    size r_ve~ r_ar~ r_os  pack~ vers~ coun~

<int> <chr>  <chr>  <int> <chr> <chr> <chr> <chr> <chr> <chr>

1     1 2014-~ 00:5~ 8.06e4 3.1.0 x86_~ ming~ html~ 0.2.4 US

2     2 2014-~ 00:5~ 3.22e5 3.1.0 x86_~ ming~ tser~ 0.10~ US

3     3 2014-~ 00:4~ 7.48e5 3.1.0 x86_~ linu~ party 1.0-~ US

4     4 2014-~ 00:4~ 6.06e5 3.1.0 x86_~ linu~ Hmisc 3.14~ US

5     5 2014-~ 00:4~ 7.98e4 3.0.2 x86_~ linu~ dige~ 0.6.4 CA

6     6 2014-~ 00:4~ 7.77e4 3.1.0 x86_~ linu~ rand~ 4.6-7 US

7     7 2014-~ 00:4~ 3.94e5 3.1.0 x86_~ linu~ plyr  1.8.1 US

8     8 2014-~ 00:4~ 2.82e4 3.0.2 x86_~ linu~ whis~ 0.3-2 US

9     9 2014-~ 00:5~ 5.93e3 <NA>  <NA>  <NA>  Rcpp  0.10~ CN

10    10 2014-~ 00:1~ 2.21e6 3.0.2 x86_~ linu~ hfli~ 0.1   US

# … with 225,458 more rows, and 1 more variable: ip_id <int>

At the top of the output above, you'll see 'Groups: package', which tells us that this tbl has been grouped by the package variable. Everything else looks the same, but now any operation we apply to the grouped data will take place on a per-package basis.

> pack_sum<-summarize(by_package,

count = n(),

unique = n_distinct(ip_id),

countries = n_distinct(country),

avg_bytes = mean(size))

> pack_sum

# A tibble: 6,023 x 5

package     count unique countries avg_bytes

<chr>       <int>  <int>     <int>     <dbl>

1 A3             25     24        10     62195

2 abc            29     25        16   4826665

3 abcdeFBA       15     15         9    455980

4 ABCExtremes    18     17         9     22904

5 ABCoptim       16     15         9     17807

6 ABCp2          18     17        10     30473

7 abctools       19     19        11   2589394

8 abd            17     16        10    453631

9 abf2           13     13         9     35693

10 abind         396    365        50     32939

# … with 6,013 more rows

The 'count' column, created with n(), contains the total number of rows (i.e. downloads) for each package. The 'unique' column, created with n_distinct(ip_id), gives the total number of unique downloads for each package, as measured by the number of distinct ip_id's. The 'countries' column, created with n_distinct(country), provides the number of countries in which each package was downloaded. And finally, the 'avg_bytes' column, created with mean(size), contains the mean download size (in bytes) for each package.

·To find the value of 'count' that splits the data into the top 1% and bottom 99% of packages based on total downloads, use quantile().

> quantile(pack_sum$count,probs=0.99)

99%

679.56

Use filter() to select all rows from pack_sum for which ‘count’ is strictly greater (>) than 679.

> top_counts<-filter(pack_sum,count>679)

> top_counts

# A tibble: 61 x 5

package    count unique countries avg_bytes

<chr>      <int>  <int>     <int>     <dbl>

1 bitops      1549   1408        76     28715

2 car         1008    837        64   1229122

3 caTools      812    699        64    176589

4 colorspace  1683   1433        80    357411

5 data.table   680    564        59   1252721

6 DBI         2599    492        48    206933

7 devtools     769    560        55    212933

8 dichromat   1486   1257        74    134732

9 digest      2210   1894        83    120549

10 doSNOW       740     75        24      8364

# … with 51 more rows

·Since dplyr only shows us the first 10 rows, we can use the View() function to see more.

·Use arrange() to order the rows of top_counts based on the 'count' column. We want the packages with the highest number of downloads at the top, which means we want 'count' to be in descending order.

> top_counts_sorted<-arrange(top_counts,desc(count))

> top_counts_sorted

# A tibble: 61 x 5

package  count unique countries avg_bytes

<chr>    <int>  <int>     <int>     <dbl>

1 ggplot2   4602   1680        81   2427716

2 Rcpp      3195   2044        84   2512100

3 plyr      2908   1754        81    799123

4 rJava     2773    963        70    633522

5 DBI       2599    492        48    206933

6 LPCM      2335     17        10    526814

7 stringr   2267   1948        82     65277

8 digest    2210   1894        83    120549

9 reshape2  2032   1652        76    330128

10 foreach   1984    485        53    358070

# … with 51 more rows

·Three examples:

1.

> by_package <- group_by(cran, package)

> pack_sum <- summarize(by_package,

count = n(),

unique = n_distinct(ip_id),

countries = n_distinct(country),

avg_bytes = mean(size))

> top_countries <- filter(pack_sum, countries > 60)

> result1 <- arrange(top_countries, desc(countries), avg_bytes)

> print(result1)

2.

> result2 <-

arrange(

filter(

summarize(

group_by(cran,

package

),

count = n(),

unique = n_distinct(ip_id),

countries = n_distinct(country),

avg_bytes = mean(size)

),

countries > 60

),

desc(countries),

avg_bytes

)

print(result2)

3. The benefit of %>% is that it allows us to chain function calls in a linear fashion. The code to the right of %>% operates on the result from the code to the left of %>%.

> result3 <-

cran %>%

group_by(package) %>%

summarize(count = n(),

unique = n_distinct(ip_id),

countries = n_distinct(country),

avg_bytes = mean(size)

) %>%

filter(countries > 60) %>%

arrange(desc(countries), avg_bytes)

print(result3)

The results of the last three scripts are all identical.

Another example using %>%:

cran %>%

select(ip_id, country, package, size) %>%

mutate(size_mb = size / 2^20) %>%

filter(size_mb<=0.5) %>%

print()

 

2.3 Tidying Data with tidyr

·Tidy data satisfies three conditions:

(1) Each variable forms a column

(2) Each observation forms a row

(3) Each type of observational unit forms a table

Any dataset that doesn't satisfy these conditions is considered 'messy' data.

·We want one column for each of these variables. We'll use the gather() function from tidyr to accomplish this, calling gather() with the following arguments: students, sex, count, -grade. Note the minus sign before grade, which says we want to gather all columns EXCEPT grade.
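Spelled out as a call (assuming students is the lesson's data frame with a grade column plus one column per sex):

gather(students, sex, count, -grade)  # key = sex, value = count; gather every column except grade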

·tidyr offers a convenient separate() function for separating one column into multiple columns.
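A small sketch with hypothetical names: if a tbl res has a column sex_class holding values such as "male_1", separate() splits it at the non-alphanumeric character by default:

separate(res, col = sex_class, into = c("sex", "class"))  # one column becomes two: sex and class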

·Use the spread() function from tidyr to spread a key-value pair across multiple columns.
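A self-contained sketch of spread() (the data frame below is made up purely for illustration):

# hypothetical long-format data: one row per (name, key, value) combination
long <- data.frame(name  = c("Sally", "Sally", "Jeff", "Jeff"),
                   key   = c("midterm", "final", "midterm", "final"),
                   value = c("A", "C", "D", "E"))

spread(long, key, value)  # one column per distinct key, filled with the matching value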

·At first glance, there doesn’t seem to be much of a problem with students4. All columns are variables and all rows are observations. However, notice that each id, name, and sex is repeated twice, which seems quite redundant. This is a hint that our data contains multiple observational units in a single table. Use select() to break students4 into two separate tables — one containing basic student information (id, name, and sex) and the other containing grades (id, class, midterm, final).

> students4

id   name  sex   class  midterm  final

1  168  Brian   F     1       B     B

2  168  Brian   F     5       A     C

3  588  Sally   M     1       A     C

4  588  Sally   M     3       B     C

5  710  Jeff    M     2       D     E

6  710  Jeff    M     4       A     C

7  731  Roger   F     2       C     A

8  731  Roger   F     5       B     A

9  908  Karen   M     3       C     C

10 908  Karen   M     4       A     A

> student_info <- students4 %>%

select(id, name, sex) %>%

print

id  name sex

1  168 Brian   F

2  168 Brian   F

3  588 Sally   M

4  588 Sally   M

5  710  Jeff   M

6  710  Jeff   M

7  731 Roger   F

8  731 Roger   F

9  908 Karen   M

10 908 Karen   M

> student_info <- students4 %>%

select(id, name, sex) %>%

unique() %>% # remove duplicate rows from student_info

print

id  name sex

1 168 Brian   F

3 588 Sally   M

5 710  Jeff   M

7 731 Roger   F

9 908 Karen   M

> gradebook <- students4 %>%

select(id, class, midterm, final) %>%

print

id class midterm final

1  168     1       B     B

2  168     5       A     C

3  588     1       A     C

4  588     3       B     C

5  710     2       D     E

6  710     4       A     C

7  731     2       C     A

8  731     5       B     A

9  908     3       C     C

10 908     4       A     A

An example:

> sat

# A tibble: 6 x 10

score_range read_male read_fem read_total math_male math_fem math_total write_male write_fem write_total

<chr>   <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>

1 700-800 40151  38898  79049  74461  46040 120501  31574  39101  70675

2 600-6~ 121950 126084 248034 162564 133954 296518 100963 125368 226331

3 500-5~ 227141 259553 486694 233141 257678 490819 202326 247239 449565

4 400-4~ 242554 296793 539347 204670 288696 493366 262623 302933 565556

5 300-3~ 113568 133473 247041  82468 131025 213493 146106 144381 290487

6 200-2~ 30728  29154  59882  18788  26562  45350  32500  24933  57433

> sat %>%

select(-contains("total")) %>% # select all columns that do NOT contain the word "total"

gather(part_sex, count, -score_range) %>% # gather all columns EXCEPT score_range, using key = part_sex and value = count

separate(col = part_sex, into = c("part", "sex")) %>% # separate part_sex into two variables (columns), called "part" and "sex"

print

# A tibble: 36 x 4

score_range part  sex    count

* <chr>       <chr> <chr>  <int>

1 700-800     read  male   40151

2 600-690     read  male  121950

3 500-590     read  male  227141

4 400-490     read  male  242554

5 300-390     read  male  113568

6 200-290     read  male   30728

7 700-800     read  fem    38898

8 600-690     read  fem   126084

9 500-590     read  fem   259553

10 400-490     read  fem   296793

# … with 26 more rows

> sat %>%

select(-contains("total")) %>%

gather(part_sex, count, -score_range) %>%

separate(part_sex, c("part", "sex")) %>%

group_by(part, sex) %>% # group the data by part and sex

mutate(total = sum(count),

prop = count / total) %>% # add two new columns, computed automatically group by group: total = sum(count), prop = count / total

print

# A tibble: 36 x 6

# Groups:   part, sex [6]

score_range part  sex    count  total   prop

<chr>       <chr> <chr>  <int>  <int>  <dbl>

1 700-800     read  male   40151 776092 0.0517

2 600-690     read  male  121950 776092 0.157

3 500-590     read  male  227141 776092 0.293

4 400-490     read  male  242554 776092 0.313

5 300-390     read  male  113568 776092 0.146

6 200-290     read  male   30728 776092 0.0396

7 700-800     read  fem    38898 883955 0.0440

8 600-690     read  fem   126084 883955 0.143

9 500-590     read  fem   259553 883955 0.294

10 400-490     read  fem   296793 883955 0.336

# … with 26 more rows

 

2.4 Dates and Times with lubridate

·This lesson is only guaranteed to work with an "en_US.UTF-8" locale. Type Sys.getlocale("LC_TIME") to view your locale.

> Sys.getlocale("LC_TIME")

[1] "Chinese (Simplified)_People's Republic of China.936"
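Since the locale shown above is not en_US.UTF-8, you can switch LC_TIME for the current session (a sketch; the exact locale name your system accepts may differ):

Sys.setlocale("LC_TIME", "en_US.UTF-8")  # on Windows, "English" is often accepted instead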

·The today() function returns today’s date.

> this_day<-today()

> this_day

[1] "2018-06-05"

There are three components to this date. In order, they are year, month, and day.

> year(this_day)

[1] 2018
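The other two components can be extracted the same way; month() and day() are standard lubridate accessors, and the commented values follow from the date shown above:

month(this_day)  # 6
day(this_day)    # 5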

We can also get the day of the week from this_day using the wday() function.

> wday(this_day)

[1] 3

> wday(this_day,label=TRUE)

[1] 周二

Levels: 周日 < 周一 < 周二 < 周三 < 周四 < 周五 < 周六

(These labels come from the Chinese locale shown earlier; 周二 is Tuesday. In an en_US locale the same call would print Tue, with levels Sun < Mon < Tue < Wed < Thu < Fri < Sat.)

·The now() function returns the date-time representing this exact moment in time.

> this_moment<-now()

> this_moment

[1] "2018-06-05 13:56:25 CST"

We can also use hour(), minute(), and second() to extract specific time information.

> second(this_moment)

[1] 25.01769
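hour() and minute() work the same way (their values depend on when now() was called, so no output is shown here):

hour(this_moment)
minute(this_moment)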

lubridate offers a variety of functions for parsing date-times. These functions take the form of ymd(), dmy(), hms(), ymd_hms(), etc., where each letter in the name of the function stands for the location of years (y), months (m), days (d), hours (h), minutes (m), and/or seconds (s) in the date-time being read in.

> my_date <- ymd("1989-05-17")

> my_date

[1] "1989-05-17"

> dmy(25081985)

[1] "1985-08-25"

> hms("03:22:14")

[1] "3H 22M 14S"
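ymd_hms() combines the two, parsing a full date-time in one call (the string below is just an illustrative value; by default the result is in UTC):

ymd_hms("2014-08-23 17:23:02")  # [1] "2014-08-23 17:23:02 UTC"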

·The update() function allows us to update one or more components of a date-time.

> update(this_moment, hours=8, minutes=34, seconds=55)

[1] "2018-06-05 08:34:55 CST"

It’s important to recognize that the previous command does not alter this_moment unless we reassign the result to this_moment.
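In other words, to actually change this_moment you would reassign the result, for example:

this_moment <- update(this_moment, hours = 8, minutes = 34, seconds = 55)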

·To find the current date in New York, we'll use the now() function again. This time, however, we'll specify the time zone that we want: "America/New_York". Store the result in a variable called nyc.

> nyc <- now("America/New_York")

> nyc

[1] "2018-06-05 05:24:47 EDT"

To add two days to nyc, we can take advantage of the fact that lubridate allows arithmetic operators on dates and times, so we can simply use the expression nyc + days(2).

> depart<-nyc+days(2)

> depart

[1] "2018-06-07 05:24:47 EDT"

> depart<-update(depart, hours=17, minutes=34)

> arrive<-depart + hours(15) + minutes(50)

The with_tz() function returns a date-time as it would appear in another time zone.

> arrive <- with_tz(arrive, "Asia/Hong_Kong")

> arrive

[1] "2018-06-08 21:24:47 HKT"

Suppose the last time you and a friend were together was in Singapore on June 17, 2008. Naturally, you'd like to know exactly how long it has been.

> last_time <- mdy("June 17, 2008", tz = "Singapore")

> last_time

[1] "2008-06-17 +08"

Pull up the documentation for interval(), which we’ll use to explore how much time has passed between arrive and last_time.

> how_long<-interval(last_time,end=arrive)

> as.period(how_long)

[1] "9y 11m 22d 21H 24M 47S"

Type stopwatch() (a helper function defined by the swirl lesson) to see how long you've been working!

> stopwatch()

[1] "4H 8M 59.8285088539124S"

Everything in these swirl notes comes from the notes I took while working through the swirl package in R; all of the material is drawn from swirl itself.

 

swirl Notes (Part 1)