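swirl Notes (Part 1)

Everything in these swirl notes comes from the notes I took while working through the swirl package in R. All of the content is drawn from swirl itself.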

Basic operation:

Type 0 to exit.

| When you are at the R prompt (>):

| — Typing skip() allows you to skip the current question.

| — Typing play() lets you experiment with R on your own; swirl will ignore what you do…

| — UNTIL you type nxt() which will regain swirl’s attention.

| — Typing bye() causes swirl to exit. Your progress will be saved.

| — Typing main() returns you to swirl’s main menu.

| — Typing info() displays these options again.

 

Courses:

1: R Programming: The basics of programming in R

2: Getting and Cleaning Data

3: Exploratory Data Analysis: The basics of exploring data in R

4: Statistical Inference: The basics of statistical inference in R

5: Regression Models: The basics of regression modeling in R

 

1. Lessons of R Programming:

1: Basic Building Blocks      2: Workspace and Files

3: Sequences of Numbers      4: Vectors

5: Missing Values            6: Subsetting Vectors

7: Matrices and Data Frames   8: Logic

9: Functions                10: lapply and sapply

11: vapply and tapply         12: Looking at Data

13: Simulation              14: Dates and Times

15: Base Graphics

1.1 Basic Building Blocks

Any object that contains data is called a data structure and numeric vectors are the simplest type of data structure in R. In fact, even a single number is considered a vector of length one.

The easiest way to create a vector is with the c() function, which stands for ‘concatenate’ or ‘combine’.

If you want more information on the c() function, type ?c without the parentheses that normally follow a function name.

To take the square root, use the sqrt() function and to take the absolute value, use the abs() function.

Type info() for more options.

Enter my_div <- z / my_sqrt. The spaces on either side of the `/` sign are not required, but can often improve readability by making code appear less cluttered. In the end, it’s personal preference.

When given two vectors of the same length, R simply performs the specified arithmetic operation (`+`, `-`, `*`, etc.) element-by-element. If the vectors are of different lengths, R ‘recycles’ the shorter vector until it is the same length as the longer vector. If the length of the shorter vector does not divide evenly into the length of the longer vector, R will still apply the ‘recycling’ method, but will throw a warning to let you know something fishy might be going on.
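
The recycling rule is easier to see on small vectors. This is a minimal sketch; the variable names (x, y, short, recycled, uneven) are made up for illustration and do not come from the swirl lesson.

```r
# Element-wise arithmetic on two vectors of equal length
x <- c(1, 2, 3, 4)
y <- c(10, 20, 30, 40)
x + y                      # 11 22 33 44

# The shorter vector is recycled: c(1, 2) is treated as c(1, 2, 1, 2)
short <- c(1, 2)
recycled <- x * short      # 1 4 3 8

# Lengths 4 and 3 do not divide evenly: R still recycles, but warns that
# the longer object length is not a multiple of the shorter object length
uneven <- x + c(1, 2, 3)   # 2 4 6 5
```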

In many programming environments, the up arrow will cycle through previous commands.

You can type the first two letters of the variable name, then hit the Tab key (possibly more than once). Most programming environments will provide a list of variables that you’ve created that begin with ‘my’.

 

1.2 Workspace and Files

Determine which directory your R session is using as its current working directory using getwd().

List all the objects in your local workspace using ls().

List all the files in your working directory using list.files() or dir().

Using the args() function on a function name is also a handy way to see what arguments a function can take.

Use dir.create() to create a directory in the current working directory. Create a file in your working directory using file.create(), passing the file name as a quoted string.

Set your working directory with the setwd() command.

Check whether "filename" exists in the working directory using file.exists("filename").

Access information about the file "filename" by using file.info("filename"). You can use the $ operator to grab specific items.

Change the name of the file "filename" to "filename2" by using file.rename("filename", "filename2").

Make a copy of "filename" called "filename2" using file.copy("filename", "filename2").

Provide the relative path to the file "filename" by using file.path("filename").

Create a directory in the current working directory called "testdir2" and a subdirectory for it called "testdir3", all in one command by using dir.create() and file.path(): dir.create(file.path("testdir2", "testdir3"), recursive = TRUE)
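
The file functions above can be tried safely in a throwaway directory. The sketch below assumes a temporary directory created with tempfile(); the file names (mytest.R, mytest2.R, mytest3.R) and the variable names are placeholders chosen for the example.

```r
# Work inside a fresh temporary directory so the real working directory is untouched
demo_dir <- tempfile("swirl_demo")
dir.create(demo_dir)
old_wd <- setwd(demo_dir)              # setwd() returns the previous directory

file.create("mytest.R")                # create an empty file
exists_before <- file.exists("mytest.R")
size <- file.info("mytest.R")$size     # `$` grabs one item of the file info
file.rename("mytest.R", "mytest2.R")   # rename
file.copy("mytest2.R", "mytest3.R")    # copy

# Build a nested path and create both directories in one call
nested <- file.path("testdir2", "testdir3")
dir.create(nested, recursive = TRUE)

files <- list.files()                  # same listing as dir()
setwd(old_wd)                          # restore the original working directory
```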

 

1.3 Sequences of Numbers

The most basic use of seq() does exactly the same thing as the `:` operator.

Suppose we want a new vector (1, 2, 3, …) that is the same length as my_seq. One possibility is to combine the `:` operator and the length() function like this: 1:length(my_seq). Another option is to use seq(along.with = my_seq) or, equivalently, seq_along(my_seq).

One more function related to creating sequences of numbers is rep(), which stands for ‘replicate’.
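
A short sketch of these sequence helpers follows. The definition of my_seq used here (30 evenly spaced values from 5 to 10) is an assumption made for the example, as are the other variable names.

```r
# An example vector to index; any vector would do
my_seq <- seq(5, 10, length.out = 30)

# Three ways to build 1, 2, ..., length(my_seq)
a <- 1:length(my_seq)
b <- seq(along.with = my_seq)
d <- seq_along(my_seq)
same <- identical(a, b) && identical(b, d)   # TRUE

# rep() replicates values
zeros  <- rep(0, times = 3)                  # 0 0 0
cycled <- rep(c(0, 1, 2), times = 2)         # 0 1 2 0 1 2
paired <- rep(c(0, 1, 2), each = 2)          # 0 0 1 1 2 2
```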

 

1.4 Vectors

If we want to join the elements together into one continuous character string (i.e. a character vector of length 1), we can do this using the paste() function.

Type paste(filename, collapse = " ") now. Make sure there's a space between the double quotes in the `collapse` argument. The `collapse` argument tells R that when we join together the elements of the filename character vector, we'd like to separate them with single spaces; the `sep` argument plays the same role when several separate arguments are pasted together. To add (or 'concatenate') something to the end of filename, use the c() function like this: c(filename, "your_addition_here").
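
The difference between `collapse` and `sep` shows up clearly in a small example. The vector words and its contents are invented for illustration.

```r
words <- c("My", "name", "is")

# collapse joins the elements of ONE vector into a single string
joined <- paste(words, collapse = " ")            # "My name is"

# c() appends an element; the result is still a character vector
longer <- c(words, "Swirl")
full <- paste(longer, collapse = " ")             # "My name is Swirl"

# sep separates SEVERAL arguments, pasted element by element
greeting <- paste("Hello", "world!", sep = " ")   # "Hello world!"
labels <- paste(1:3, c("X", "Y", "Z"), sep = "")  # "1X" "2Y" "3Z"
```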

 

1.5 Missing Values

In R, NA is used to represent any value that is ‘not available’ or ‘missing’ (in the statistical sense).

NaN stands for 'not a number'.

Inf stands for infinity.
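
A few lines are enough to see how these special values behave. The vector x is a made-up example; is.na() and the na.rm argument of mean() are standard base R tools for working with NA, although they are not mentioned in the notes above.

```r
x <- c(44, NA, 5, NA)

tripled <- x * 3             # NA propagates: 132 NA 15 NA
missing <- is.na(x)          # FALSE TRUE FALSE TRUE
n_missing <- sum(missing)    # 2, because TRUE counts as 1

avg <- mean(x, na.rm = TRUE) # 24.5, ignoring the missing values

nan1 <- 0 / 0                # NaN
nan2 <- Inf - Inf            # NaN
inf1 <- 1 / 0                # Inf
```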

 

1.6 Subsetting Vectors

The way you tell R that you want to select some particular elements (i.e. a ‘subset’) from a vector is by placing an ‘index vector’ in square brackets immediately following the name of the vector.

We can also get the names of vect by passing vect as an argument to the names() function.

Then, we can add the `names` attribute to vect after the fact with names(vect) <- c("", "", ""), filling in one name per element.

Check whether two vectors are the same by passing them as arguments to the identical() function.
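
The index-vector idea covers positions, logical conditions, negative indices, and names. The sketch below uses invented data; only the function names come from the notes.

```r
x <- c(3.2, NA, -1.5, 7, NA, 0.4)

first_three <- x[1:3]                   # by position
no_na       <- x[!is.na(x)]             # logical index: drop missing values
positives   <- x[!is.na(x) & x > 0]     # 3.2 7.0 0.4
dropped     <- x[c(-2, -5)]             # negative indices exclude elements

# Named vectors: set names at creation or afterwards
vect  <- c(foo = 11, bar = 2, norf = NA)
vect2 <- c(11, 2, NA)
names(vect2) <- c("foo", "bar", "norf")

same <- identical(vect, vect2)          # TRUE
bar  <- vect["bar"]                     # subset by name
```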

 

1.7 Matrices and Data Frames

The dim() function tells us the ‘dimensions’ of an object. Another way to see this is by calling the attributes() function.

The identical() function will tell us if its first two arguments are the same.

Use the cbind() function to ‘combine columns’.

Matrices can only contain ONE class of data. Therefore, when we tried to combine a character vector with a numeric matrix, R was forced to ‘coerce’ the numbers to characters.

The data.frame() function takes any number of arguments and returns a single object of class `data.frame` that is composed of the original objects.

Use the colnames() function to set the `colnames` attribute for our data frame.
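
The steps in this lesson can be sketched roughly as follows. The patient names and column names are illustrative assumptions, not data from a real study.

```r
# Giving a vector a dim attribute turns it into a matrix
my_vector <- 1:20
dim(my_vector) <- c(4, 5)
my_matrix <- my_vector

# matrix() builds the same object directly
my_matrix2 <- matrix(1:20, nrow = 4, ncol = 5)
same <- identical(my_matrix, my_matrix2)      # TRUE

patients <- c("Bill", "Gina", "Kelly", "Sean")

# cbind() on a character vector and a numeric matrix coerces everything to character
coerced <- cbind(patients, my_matrix)

# data.frame() keeps each column's own class
my_data <- data.frame(patients, my_matrix)
colnames(my_data) <- c("patient", "age", "weight", "bp", "rating", "test")
```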

 

1.8 Logic

The equals operator `==` tests whether two boolean values or numbers are equal, the not equals operator `!=` tests whether two boolean values or numbers are unequal, and the NOT operator `!` negates logical expressions so that TRUE expressions become FALSE and FALSE expressions become TRUE.

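You can use the `&` operator to evaluate AND across a vector. The `&&` version of AND only evaluates the first member of a vector. The `|` version of OR evaluates OR across an entire vector, while the `||` version of OR only evaluates the first member of a vector. Note that since R 4.3.0, giving `&&` or `||` an operand of length greater than one is an error, so use them only with single values.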

All AND operators are evaluated before OR operators.

The xor() function stands for exclusive OR. If one argument evaluates to TRUE and one argument evaluates to FALSE, then this function will return TRUE, otherwise it will return FALSE.

The which() function takes a logical vector as an argument and returns the indices of the vector that are TRUE.

The any() function will return TRUE if one or more of the elements in the logical vector is TRUE. The all() function will return TRUE if every element in the logical vector is TRUE.
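
The logical operators and helper functions can be checked on a few toy values. The vector ints is made up; the scalar `&&` and `||` are used only with single values here.

```r
# Vectorised AND / OR work element by element
and_vec <- c(TRUE, FALSE, FALSE) & c(TRUE, TRUE, FALSE)    # TRUE FALSE FALSE
or_vec  <- c(TRUE, FALSE, FALSE) | c(FALSE, FALSE, TRUE)   # TRUE FALSE TRUE

# AND is evaluated before OR: FALSE || (TRUE && TRUE)
precedence <- 5 > 8 || 6 != 8 && 4 > 3.9                   # TRUE

# Exclusive OR: TRUE only when exactly one side is TRUE
exclusive <- xor(5 == 6, !FALSE)                           # TRUE

ints <- c(3, 12, 7, 15, 1)
big_idx <- which(ints > 10)    # 2 4 (positions, not values)
any_neg <- any(ints < 0)       # FALSE
all_pos <- all(ints > 0)       # TRUE
```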

 

1.9 Functions

The Sys.Date() function returns today's date. It prints like a string, but the value is an object of class Date.

Most functions in R return a value. Functions like Sys.Date() return a value based on your computer’s environment, while other functions manipulate input data in order to compute a return value.

The mean() function takes a vector of numbers as input, and returns the average of all of the numbers in the input vector.

Inputs to functions are often called arguments. Providing arguments to a function is also sometimes called passing arguments to that function. Arguments you want to pass to a function go inside the function’s parentheses.

If you want to see the source code for any function, just type the function name without any arguments or parentheses.

You may be wondering if there is a way you can see a function’s arguments (besides looking at the documentation). You can use the args() function!

The function for standard deviation is called sd().

You can pass a function as an argument without first defining the passed function. Functions that are not named are appropriately known as anonymous functions.

Let’s use the evaluate function to explore how anonymous functions work. For the first argument of the evaluate function we’re going to write a tiny function that fits on one line. In the second argument we’ll pass some data to the tiny anonymous function in the first argument.

# evaluate() calls the function `func` on the data `dat`
evaluate <- function(func, dat){
  func(dat)
}

Try using evaluate() along with an anonymous function to return the first element of the vector c(8, 4, 0). Your anonymous function should only take one argument which should be a variable `x`.

> evaluate(function(x){x[1]}, c(8, 4, 0))

Now try using evaluate() along with an anonymous function to return the last element of the vector c(8, 4, 0). Your anonymous function should only take one argument which should be a variable `x`.

> evaluate(function(x){x[length(x)]}, c(8, 4, 0))

The first argument of paste() is `...`, which is referred to as an ellipsis or simply dot-dot-dot. The ellipsis allows an indefinite number of arguments to be passed into a function. In the case of paste() any number of strings can be passed as arguments and paste() will return all of the strings combined into one string.

The ellipses can be used to pass on arguments to other functions that are used within the function you’re writing. Usually a function that has the ellipses as an argument has the ellipses as the last argument.

The lesson states this as a strict rule: all arguments after an ellipsis must have default values. More precisely, arguments placed after an ellipsis can only be matched by their full name, so in practice they are usually given defaults. Take a look at the simon_says function below:

simon_says <- function(...){
  paste("Simon says:", ...)
}

You can also "unpack" arguments from an ellipsis when you use the ellipsis as an argument in a function. Below is an example function that is supposed to add two explicitly named arguments called alpha and beta.

add_alpha_and_beta <- function(...){
  # First we must capture the ellipsis inside of a list and then assign
  # the list to a variable. Let's name this variable `args`.
  args <- list(...)

  # We're now going to assume that there are two named arguments within
  # args with the names `alpha` and `beta`. We can extract named arguments
  # from the args list by using the name of the argument and double
  # brackets. The `args` variable is just a regular list after all!
  alpha <- args[["alpha"]]
  beta  <- args[["beta"]]

  # Then we return the sum of alpha and beta.
  alpha + beta
}
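
To see the ellipsis in action, here is a small sketch with a hypothetical telegram() function whose trailing argument has a default value, plus a compact copy of the unpacking pattern so the block runs on its own. The function and argument names are assumptions made for the example.

```r
# `stop_word` comes after `...`, so it can only be set by name
telegram <- function(..., stop_word = "STOP") {
  paste("START", ..., stop_word)
}

msg1 <- telegram("Good", "morning")                     # "START Good morning STOP"
msg2 <- telegram("Good", "morning", stop_word = "END")  # "START Good morning END"

# Unpacking named arguments from the ellipsis
add_alpha_and_beta <- function(...) {
  args <- list(...)                 # capture everything passed in
  args[["alpha"]] + args[["beta"]]  # pull out the two named values
}

total <- add_alpha_and_beta(alpha = 1, beta = 2)        # 3
```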

 

1.10 lapply and sapply

The lapply() function takes a list as input, applies a function to each element of the list, then returns a list of the same length as the original one. Type cls_list <- lapply(flags, class) to apply the class() function to each column of the flags dataset and store the result in a variable called cls_list. The 'l' in 'lapply' stands for 'list'.

sapply() allows you to automate this process by calling lapply() behind the scenes, but then attempting to simplify (hence the 's' in 'sapply') the result for you.

The unique() function returns a vector with all duplicate elements removed. In other words, unique() returns a vector of only the ‘unique’ elements.
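
The flags dataset ships with the swirl lesson, so the sketch below substitutes a tiny made-up data frame (flags_demo) with the same flavour. Its columns and values are invented for illustration.

```r
# A small stand-in for the flags dataset
flags_demo <- data.frame(
  name = c("A", "B", "C"),
  red  = c(1L, 0L, 1L),
  area = c(648.5, 29, 2388),
  stringsAsFactors = FALSE
)

# lapply() always returns a list, one element per column
cls_list <- lapply(flags_demo, class)

# sapply() simplifies the same result to a named character vector
cls_vect <- sapply(flags_demo, class)

# Apply a summary function to the numeric columns only
col_means <- sapply(flags_demo[, c("red", "area")], mean)

# unique() removes duplicate elements
uniq <- unique(c(3, 4, 5, 5, 5, 6, 6))   # 3 4 5 6
```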

 

1.11 vapply and tapply

Whereas sapply() tries to ‘guess’ the correct format of the result, vapply() allows you to specify it explicitly. If the result doesn’t match the format you specify, vapply() will throw an error, causing the operation to stop. This can prevent significant problems in your code that might be caused by getting unexpected return values from sapply().

You might think of vapply() as being ‘safer’ than sapply(), since it requires you to specify the format of the output in advance, instead of just allowing R to ‘guess’ what you wanted. In addition, vapply() may perform faster than sapply() for large datasets. However, when doing data analysis interactively (at the prompt), sapply() saves you some typing and will often be good enough.

As a data analyst, you’ll often wish to split your data up into groups based on the value of some variable, then apply a function to the members of each group. The next function we’ll look at, tapply(), does exactly that.

Use tapply(flags$animate, flags$landmass, mean) to apply the mean function to the ‘animate’ variable separately for each of the six landmass groups, thus giving us the proportion of flags containing an animate image WITHIN each landmass group.
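
Both functions can be tried on a small invented stand-in for the flags data (flags_demo, with made-up landmass and animate values). The tryCatch() call is only there to show that vapply() stops with an error when the declared format does not match.

```r
flags_demo <- data.frame(
  landmass = c(1, 1, 2, 2, 2, 3),
  animate  = c(1, 0, 0, 0, 1, 1)
)

# vapply(): declare that each result must be a single character string
cls <- vapply(flags_demo, class, character(1))

# Declaring the wrong format raises an error instead of guessing
wrong <- tryCatch(
  vapply(flags_demo, class, numeric(1)),
  error = function(e) "error"
)

# tapply(): mean of `animate` within each landmass group
props <- tapply(flags_demo$animate, flags_demo$landmass, mean)
# group 1: 0.5, group 2: 1/3, group 3: 1
```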

 

1.12 Looking at Data

Whenever you’re working with a new dataset, the first thing you should do is look at it! What is the format of the data? What are the dimensions? What are the variable names? How are the variables stored? Are there missing data? Are there any flaws in the data?

Start with class(). This will give us a clue as to the overall structure of the data.

Use dim() to see exactly how many rows and columns we’re dealing with.

You can also use nrow() to see only the number of rows. And ncol() to see only the number of columns.

Use object.size() to know how much space the dataset is occupying in memory.

names() will return a character vector of column names.

The head() function allows you to preview the top of the dataset. The same applies for using tail() to preview the end of the dataset.

summary() provides different output for each variable, depending on its class.

The beauty of str() is that it combines many of the features of the other functions you’ve already seen, all in a concise and readable format.
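
A typical first look at a dataset strings these functions together. The sketch uses the built-in mtcars data frame as a stand-in for the lesson's dataset, which is an assumption made so the example runs anywhere.

```r
cls     <- class(mtcars)        # "data.frame"
dims    <- dim(mtcars)          # 32 rows, 11 columns
n_rows  <- nrow(mtcars)
n_cols  <- ncol(mtcars)
mem     <- object.size(mtcars)  # memory used by the object
vars    <- names(mtcars)        # character vector of column names

top    <- head(mtcars, 3)       # first 3 rows (the default is 6)
bottom <- tail(mtcars, 3)       # last 3 rows

summary(mtcars)                 # per-variable summary, depends on class
str(mtcars)                     # compact overview of structure and types
```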

 

1.13 Simulation

sample(1:6, 4, replace = TRUE) instructs R to randomly select four numbers between 1 and 6, WITH replacement. The sample() function can also be used to permute, or rearrange, the elements of a vector.

Use sample() to draw a sample of size 100 from the vector c(0,1), with replacement. Since the coin is unfair, we must attach specific probabilities to the values 0 (tails) and 1 (heads) with a fourth argument, prob = c(0.3, 0.7). sample(c(0,1), 100, replace = TRUE, prob = c(0.3,0.7)).

We can use rbinom() to simulate a binomial random variable. A binomial random variable represents the number of ‘successes’ (heads) in a given number of independent ‘trials’ (coin flips). Therefore, we can generate a single random variable that represents the number of heads in 100 flips of our unfair coin using rbinom(1, size = 100, prob = 0.7). Note that you only specify the probability of ‘success’ (heads) and NOT the probability of ‘failure’ (tails). Equivalently, if we want to see all of the 0s and 1s, we can request 100 observations, each of size 1, with success probability of 0.7. rbinom(100,size=1,prob=0.7).

The standard normal distribution has mean 0 and standard deviation 1. The default values for the ‘mean’ and ‘sd’ arguments to rnorm() are 0 and 1, respectively. Thus, rnorm(10) will generate 10 random numbers from a standard normal distribution.

Generate 5 random values from a Poisson distribution with mean 10. You can use the function rpois(5,10). Use replicate(100, rpois(5, 10)) to perform this operation 100 times.

We can find the mean of each column using the colMeans() function.

Plot a histogram with hist().
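
The simulation functions above fit together as in the sketch below. The call to set.seed() is an addition, not part of the notes, that makes the random draws repeatable; the variable names are also illustrative. Results are random, so no specific values are shown.

```r
set.seed(42)                                   # reproducible draws (assumption)

rolls    <- sample(1:6, 4, replace = TRUE)     # four dice rolls
shuffled <- sample(LETTERS)                    # a permutation of A..Z

# Unfair coin: 0 = tails (p = 0.3), 1 = heads (p = 0.7)
flips <- sample(c(0, 1), 100, replace = TRUE, prob = c(0.3, 0.7))

heads  <- rbinom(1, size = 100, prob = 0.7)    # number of heads in 100 flips
flips2 <- rbinom(100, size = 1, prob = 0.7)    # the 100 individual 0/1 outcomes

z <- rnorm(10)                                 # standard normal: mean 0, sd 1

# 100 repetitions of 5 Poisson draws: a 5 x 100 matrix, one column per repetition
my_pois <- replicate(100, rpois(5, 10))
cm <- colMeans(my_pois)                        # mean of each column
hist(cm)                                       # distribution of the column means
```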

 

1.14 Dates and Times

To see the exact number of days since 1970-01-01, we can use the unclass() function on a Date object.

Use Sys.Date() to get the current date. To get the current date and time, use the Sys.time() function with no arguments.

The weekdays() function will return the day of week from any date or time object. The quarters() function returns the quarter of the year (Q1-Q4) from any date or time object.

strptime() converts character vectors to POSIXlt. The input doesn't have to be in a particular format (such as YYYY-MM-DD), because you describe its layout with a format string.

For finding the difference in times, you can use difftime(), which allows you to specify a ‘units’ parameter.
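
The date and time functions can be sketched on fixed values so the results do not depend on when the code runs. The specific dates, the format string, and the UTC time zone are assumptions chosen for the example.

```r
d1 <- as.Date("1970-01-10")
days_since_epoch <- unclass(d1)   # 9 days after 1970-01-01

today <- Sys.Date()               # current date (class Date)
now   <- Sys.time()               # current date and time (class POSIXct)

weekdays(d1)                      # day of the week (name depends on locale)
q <- quarters(d1)                 # "Q1"

# strptime(): the format string describes how the input is laid out
t2 <- strptime("17/10/1986 08:24", "%d/%m/%Y %H:%M", tz = "UTC")

# difftime(): choose the units of the difference explicitly
gap <- difftime(
  as.POSIXct("2020-01-02", tz = "UTC"),
  as.POSIXct("2020-01-01", tz = "UTC"),
  units = "hours"
)                                 # 24 hours
```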

 

1.15 Base Graphics

Anytime that you load up a new data frame, you should explore it before using it.

Instead of adding data columns directly as input arguments, as we did with plot(), it is often handy to pass in the entire data frame. This is what the "data" argument in boxplot() allows. boxplot(), like many R functions, also takes a "formula" argument, generally an expression with a tilde ("~") which indicates the relationship between the input variables. This allows you to enter something like y ~ x to plot y (on the y-axis) against x (on the x-axis).

Like plot(), hist() is best used by just passing in a single vector.
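
The three plotting calls can be compared side by side on the built-in mtcars data frame, which is used here as an assumed example dataset. The axis labels are illustrative. boxplot() and hist() also return their computed statistics, which the sketch stores in bp and h.

```r
# plot(): pass the x and y columns directly
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")

# boxplot(): formula y ~ x plus the whole data frame
bp <- boxplot(mpg ~ cyl, data = mtcars,
              xlab = "Cylinders", ylab = "Miles per gallon")

# hist(): a single numeric vector is all it needs
h <- hist(mtcars$mpg, xlab = "Miles per gallon")
```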