Data Science (4): Exploratory Data Analysis

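This series of Data Science articles comes from my notes on the Data Science specialization on Coursera; unless otherwise noted, all content comes from the course slides.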

Week1:

Lesson1 Graphs

  • Principles of analytic graphics

1.Show comparisons

Evidence for a hypothesis is always relative to another competing hypothesis.

Always ask “Compared to What?”

2.Show causality, mechanism, explanation, systematic structure

What is your causal framework for thinking about a question?

3.Show multivariate data

More than 2 variables.

The real world is multivariate.

Need to “escape flatland”.

4.Integrate multiple modes of evidence

Completely integrate words, numbers, images, diagrams.

Data graphics should make use of many modes of data presentation.

Do not let the tool drive the analysis.

5.Describe and document the evidence with appropriate labels, scales, sources, etc.

A data graphic should tell a complete story that is credible.

6.Content is king

Analytical presentations ultimately stand or fall depending on the quality, relevance, and integrity of their content.

  • Exploratory graphs resources

R Graph Gallery

R Bloggers

 

Lesson2 Plotting

  • Plotting systems in R

1.The Base plotting system

Start with the plot() function.

Use annotation functions to add/modify: text(), lines(), points(), axis().

Cannot go back once plot has started.

2.The Lattice system

Plots are created with a single function call (xyplot(), bwplot(), etc.).

Cannot “add” to the plot once it is created.

3.The ggplot2 system

Mixes elements of Base and Lattice.

  • The Base plotting system in R

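Histogram: hist()

Scatterplot: with(data,plot(x,y))

Boxplot: boxplot(y~x,data,xlab="x-axis label",ylab="y-axis label")

Parameters: pch plotting symbol (default is an open circle), lty line type (default is solid), lwd line width, col color, xlab x-axis label, ylab y-axis label.

Functions: lines adds lines to a plot, points adds points, text adds text labels, title adds annotations (titles and axis labels), mtext adds text in the margins, axis specifies axis ticks.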
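
As a rough illustration of the calls and parameters listed above, here is a minimal sketch using the airquality dataset from the datasets package. Writing to a temporary PDF is an assumption made only so that the sketch also runs without a screen device; the colors and label text are arbitrary choices.

```r
library(datasets)

# Write to a temporary PDF so the sketch also runs without a screen device
out <- tempfile(fileext = ".pdf")
pdf(file = out)

# Histogram of ozone (hist() drops missing values itself)
h <- hist(airquality$Ozone, col = "green", xlab = "Ozone", main = "Ozone")

# Boxplot of ozone by month, with axis labels
boxplot(Ozone ~ Month, airquality, xlab = "Month", ylab = "Ozone")

# Scatterplot with a non-default symbol, followed by annotation calls
with(airquality, plot(Wind, Ozone, pch = 19, col = "blue"))
abline(h = median(airquality$Ozone, na.rm = TRUE), lty = 2, lwd = 2)  # dashed line
mtext("airquality dataset", side = 3)                                 # margin text

dev.off()
```

Each high-level call (hist, boxplot, plot) starts a new page in the file, and the annotation calls add to whichever plot was started last.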

  • Example

> library(datasets)

> with(airquality,plot(Wind,Ozone))

> title(main="Ozone and Wind in New York City")#Add a title

> with(subset(airquality,Month==5),points(Wind,Ozone,col="blue"))#Take the subset where Month equals 5 and mark those points in blue

> with(subset(airquality,Month!=5),points(Wind,Ozone,col="red"))#Mark the points from all other months in red

> model<-lm(Ozone~Wind,airquality)#Fit a linear model to the data

> abline(model,lwd=2)#Add the fitted regression line

> par(mfrow=c(1,2))#Set the mfrow parameter to draw two plots side by side

> with(airquality,{

+ plot(Wind,Ozone,main="Ozone and Wind")

+ plot(Solar.R,Ozone,main="Ozone and Solar Radiation")

+ })
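
A possible variation on the example above, sketched under the assumption that an empty canvas is set up first with type="n" and that a legend is wanted; the legend labels and the temporary output file are illustrative choices, not part of the course code.

```r
library(datasets)

# Temporary PDF so the sketch also runs without a screen device
out <- tempfile(fileext = ".pdf")
pdf(file = out)

may   <- subset(airquality, Month == 5)
other <- subset(airquality, Month != 5)

# type = "n" sets up axes and labels without drawing any points
with(airquality, plot(Wind, Ozone, type = "n",
                      main = "Ozone and Wind in New York City"))
with(may,   points(Wind, Ozone, col = "blue"))
with(other, points(Wind, Ozone, col = "red"))
legend("topright", pch = 1, col = c("blue", "red"),
       legend = c("May", "Other months"))

# Fitted line and its slope
model <- lm(Ozone ~ Wind, airquality)
abline(model, lwd = 2)
slope <- coef(model)[["Wind"]]

dev.off()
```

Starting from an empty plot means each point is drawn only once, in its final color.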

 

Lesson3 Graphics devices

  • Plotting to a PDF file

> pdf(file="myplot.pdf")#Open PDF device; create 'myplot.pdf' in my working directory

> with(data,plot(var1,var2))#Create plot and send to a file (no plot appears on screen)

> title(main="plot title")#Annotate plot; still nothing on screen

> dev.off()#Close the PDF file device
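
The open, draw, close sequence above can be checked end to end with a small sketch. The faithful dataset and the temporary output path are assumptions chosen for illustration.

```r
library(datasets)

out <- tempfile(fileext = ".pdf")   # hypothetical output location

pdf(file = out)                     # open the PDF device
opened <- names(dev.cur())          # name of the currently active device
with(faithful, plot(eruptions, waiting))
title(main = "Old Faithful Geyser data")
dev.off()                           # close the device; the file is complete only now

file.exists(out)
```

Forgetting dev.off() is the usual reason a PDF turns out empty or unreadable.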

  • Two basic types of file devices: vector and bitmap devices

Vector formats (good for line drawings): pdf, svg (Scalable Vector Graphics), win.metafile (Windows metafile), postscript.

Bitmap formats (good for natural scenes): png (Portable Network Graphics), jpeg, tiff, bmp.

  • Copying plots (copy a plot from the screen to a file)

> dev.copy(png,file="filename.png")#Copy plot to a PNG file

> dev.off()#Close the PNG device
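
A minimal sketch of the same copy step that does not need a screen. It assumes a file-less PDF device with recording switched on as a stand-in for the screen device, and it copies to PDF rather than PNG so that it does not depend on PNG support in the local R build.

```r
library(datasets)

pdf(NULL)                               # stand-in for a screen device; writes no file
dev.control(displaylist = "enable")     # record drawing so that it can be copied
with(faithful, plot(eruptions, waiting))
title(main = "Old Faithful Geyser data")

out <- tempfile(fileext = ".pdf")       # hypothetical output location
dev.copy(pdf, file = out)               # replay the recorded plot on a new PDF device
dev.off()                               # close the copy
dev.off()                               # close the stand-in device
```

On a real screen device, recording is already on, so only the dev.copy() and dev.off() calls are needed.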

 

 

Week2:

Lesson1 Lattice plotting system

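  • lattice package (built on top of the grid package)

xyplot() scatterplot: xyplot(y ~ x | f * g, data), where f and g are categorical conditioning variables

bwplot() box-and-whisker plot

histogram() histogram

stripplot()

dotplot()

splom() scatterplot matrix

levelplot()/contourplot() plot image data

  • Base plotting system functions cannot be used on lattice plots

Margins, spacing, and labels are set automatically.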
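
As an illustration of the formula interface described above, here is a small sketch with the airquality dataset. The choice of variables and the layout argument are assumptions made for the example.

```r
library(lattice)
library(datasets)

# Ozone against Wind, one panel per month (Month is converted to a factor)
airquality2 <- transform(airquality, Month = factor(Month))
p <- xyplot(Ozone ~ Wind | Month, data = airquality2, layout = c(5, 1))

# Lattice functions return an object of class "trellis"; nothing is drawn
# until the object is printed (auto-printing does this at the console).
class(p)
# print(p)
```

Because the whole plot is described in one call, panels, margins, and labels are laid out together.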

 

Lesson2 ggplot2

  • Web site: http://ggplot2.org
  • qplot(x, y, data=data frame)

> library(ggplot2)

> str(mpg)

> qplot(displ,hwy,data=mpg)

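> qplot(displ,hwy,data=mpg,geom=c("point","smooth"))#Add specific geoms: points plus a smoother

> qplot(hwy,data=mpg,fill=drv)

> qplot(displ,hwy,data=mpg,facets=.~drv)#Create separate plots (facets) for each value of drv

stats#Statistical transformations

scales#How data values map to aesthetics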
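
The qplot() calls above can also be written with the layered ggplot() interface. The following is a sketch of one possible equivalent, assuming the ggplot2 package is installed; it is not taken from the course notes.

```r
library(ggplot2)

# Rough ggplot() equivalent of
# qplot(displ, hwy, data = mpg, geom = c("point", "smooth"), facets = . ~ drv)
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +          # geom: points
  geom_smooth() +         # stat: a smoothing transformation of the data
  facet_grid(. ~ drv)     # facets: one panel per drive type

# print(p) draws the plot
```

Each component is added as its own layer, which makes it easier to see where geoms, stats, and facets come in.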

 

 

Week3:

Lesson1 Hierarchical clustering

  • Distance metrics

Euclidean distance: straight-line distance.

Manhattan distance: distance along grid lines. Equals the sum of the absolute differences of the coordinates.
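
A small worked example of the two metrics, using two made-up points, (0, 0) and (3, 4), chosen so that the arithmetic is easy to check by hand.

```r
# Two points in the plane, one per row
pts <- rbind(a = c(0, 0), b = c(3, 4))

euclid    <- dist(pts)                        # Euclidean is the default
manhattan <- dist(pts, method = "manhattan")

# The same values computed from the definitions
euclid_manual    <- sqrt(sum((pts["a", ] - pts["b", ])^2))   # sqrt(3^2 + 4^2) = 5
manhattan_manual <- sum(abs(pts["a", ] - pts["b", ]))        # |3| + |4| = 7
```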

  • Computation

> set.seed(1234)#Set the random number seed so the results are reproducible; without a seed, the generated random numbers cannot be reproduced

> par(mar = c(0, 0, 0, 0))

> x<-rnorm(12, mean = rep(1:3, each = 4),sd = 0.2)

> y<-rnorm(12, mean = rep(c(1, 2, 1), each = 4),sd = 0.2)

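> plot(x, y, col = "blue", pch = 19, cex = 2)

> text(x + 0.05, y + 0.05, labels = as.character(1:12))#Label the points

This produces a scatterplot of the 12 points, each labeled with its index.

> dataframe<-data.frame(x = x, y = y)

> dist(dataframe)#Compute the distances between all pairs of points; Euclidean by default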

            1          2          3          4          5          6
2  0.34120511
3  0.57493739 0.24102750
4  0.26381786 0.52578819 0.71861759
5  1.69424700 1.35818182 1.11952883 1.80666768
6  1.65812902 1.31960442 1.08338841 1.78081321 0.08150268
7  1.49823399 1.16620981 0.92568723 1.60131659 0.21110433 0.21666557
8  1.99149025 1.69093111 1.45648906 2.02849490 0.61704200 0.69791931
9  2.13629539 1.83167669 1.67835968 2.35675598 1.18349654 1.11500116
10 2.06419586 1.76999236 1.63109790 2.29239480 1.23847877 1.16550201
11 2.14702468 1.85183204 1.71074417 2.37461984 1.28153948 1.21077373
12 2.05664233 1.74662555 1.58658782 2.27232243 1.07700974 1.00777231
            7          8          9         10         11
2
3
4
5
6
7
8  0.65062566
9  1.28582631 1.76460709
10 1.32063059 1.83517785 0.14090406
11 1.37369662 1.86999431 0.11624471 0.08317570
12 1.17740375 1.66223814 0.10848966 0.19128645 0.20802789

> distxy<-dist(dataframe)

> hcl<-hclust(distxy)#Hierarchical clustering of the distance matrix (complete linkage by default)

> plot(hcl)#Draw the dendrogram
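
The dendrogram itself does not assign points to clusters. One way to get explicit labels, sketched here with cutree() and an assumed choice of three groups, is to cut the tree at a fixed number of clusters.

```r
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)

hc <- hclust(dist(data.frame(x = x, y = y)))  # complete linkage by default

# Cut the tree into three groups; returns one cluster label per point
groups <- cutree(hc, k = 3)
table(groups)
```

The number of groups (k = 3) is a choice made by the analyst; cutting at a different height gives a different partition.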

 

Lesson2 K-Means clustering & dimension reduction

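  • k-means clustering

A method for partitioning a set of observations into a fixed number of clusters. It requires a distance metric, the number of clusters, and an initial guess at the cluster centroids.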
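
To make the idea concrete, here is a hand-written sketch of the assign-and-update loop behind k-means. It is an illustration only, not the implementation used by kmeans(); the starting centroids (data points 1, 5, and 9) are an arbitrary assumption, and the sketch does not handle clusters that become empty.

```r
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
pts <- cbind(x, y)

# Initial guess: three of the data points (an arbitrary choice for illustration)
centers <- pts[c(1, 5, 9), ]

for (iter in 1:10) {
  # Assignment step: squared Euclidean distance from each point to each centroid
  d <- sapply(1:3, function(k) rowSums(sweep(pts, 2, centers[k, ])^2))
  cluster <- apply(d, 1, which.min)

  # Update step: move each centroid to the mean of its assigned points
  new_centers <- t(sapply(1:3, function(k) colMeans(pts[cluster == k, , drop = FALSE])))
  if (all(abs(new_centers - centers) < 1e-12)) break  # converged
  centers <- new_centers
}

cluster
```

The two steps repeat until the centroids stop moving, which is why the result depends on the initial guess.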

> set.seed(1234)

> par(mar=c(0,0,0,0))

> x<-rnorm(12,mean=rep(1:3,each=4),sd=0.2)

> y<-rnorm(12,mean=rep(c(1,2,1),each=4),sd=0.2)

> plot(x,y,col="blue",pch=19,cex=2)

> text(x+0.05,y+0.05,labels=as.character(1:12))#Generate random points and label them

> dataframe<-data.frame(x,y)

> kmeansobj<-kmeans(dataframe,centers=3)#Ask for three clusters (three centroids)

> names(kmeansobj)#kmeans() returns a list; show the names of its elements

[1] "cluster"      "centers"      "totss"        "withinss"

[5] "tot.withinss" "betweenss"    "size"         "iter"

[9] "ifault"

> kmeansobj$cluster#Inspect the cluster element: which cluster each of data points 1 to 12 belongs to

[1] 3 3 3 3 1 1 1 1 2 2 2 2

> par(mar=rep(0.2,4))

> plot(x,y,col=kmeansobj$cluster,pch=19,cex=2)#Plot the points, colored by the cluster each belongs to

> points(kmeansobj$centers,col=1:3,pch=3,cex=3,lwd=3)#Add the cluster centroids

> set.seed(1234)

> datamatrix<-as.matrix(dataframe)[sample(1:12),]#Shuffle the rows

> kmeansobj2<-kmeans(datamatrix,centers=3)#Run k-means again on the shuffled data

> par(mfrow=c(1,2),mar=c(2,4,0.1,0.1))

> image(t(datamatrix)[,nrow(datamatrix):1],yaxt="n")#Heatmap of the shuffled data

> image(t(datamatrix)[,order(kmeansobj2$cluster)],yaxt="n")#Heatmap with the rows reordered by cluster
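
Since k-means depends on its starting centroids, a common precaution is to try several random starts. The sketch below uses the nstart argument of kmeans() with an assumed value of 10 and also looks at some of the other elements of the returned list.

```r
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)

# nstart = 10 tries ten random starts and keeps the best solution
fit <- kmeans(data.frame(x, y), centers = 3, nstart = 10)

fit$size                                  # number of points in each cluster
fit$tot.withinss + fit$betweenss          # equals fit$totss
```

The total sum of squares splits into a within-cluster part and a between-cluster part; a good clustering makes the within-cluster part small.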

  • Example

> set.seed(12345)

> par(mar=rep(0.2,4))

> datamatrix<-matrix(rnorm(400),nrow=40)#Generate random normally distributed data

> image(1:10,1:40,t(datamatrix)[,nrow(datamatrix):1])#Draw the matrix

> heatmap(datamatrix)#Run a hierarchical cluster analysis on the rows and columns

> set.seed(678910)

> for (i in 1:40){#Loop over all rows

+     coinflip<-rbinom(1,size=1,prob=0.5)#Flip a coin

+     if (coinflip){

+         datamatrix[i,]<-datamatrix[i,]+rep(c(0,3),each=5)#If the coin comes up 1, add a pattern: the first five columns keep mean 0 and the last five get mean 3

+     }

+ }

> image(1:10,1:40,t(datamatrix)[,nrow(datamatrix):1])#Redraw the matrix heatmap
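
After the pattern is added, the shifted rows are scattered through the matrix. One way to bring them together, sketched here as an assumption about a sensible next step, is to reorder the rows by a hierarchical clustering and then look at the column means.

```r
set.seed(12345)
datamatrix <- matrix(rnorm(400), nrow = 40)

set.seed(678910)
for (i in 1:40) {
  if (rbinom(1, size = 1, prob = 0.5)) {
    datamatrix[i, ] <- datamatrix[i, ] + rep(c(0, 3), each = 5)
  }
}

# Cluster the rows and reorder the matrix so that similar rows sit together
hh <- hclust(dist(datamatrix))
ordered <- datamatrix[hh$order, ]

# Column means: the shifted columns (6 to 10) stand out from the first five
round(colMeans(ordered), 2)
```

Plotting ordered with image() in place of datamatrix would show the pattern as a block instead of scattered stripes.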

 

Lesson3 Working with color

  • grDevices package

colors() lists the names of colors you can use in any plotting function

colorRamp() takes a palette of colors and returns a function that takes values between 0 and 1, indicating the extremes of the color palette (similar to the gray() function).

> pal<-colorRamp(c("red","blue"))#Interpolate between red and blue

> pal(0)#Columns 1, 2, 3 are red, green, and blue, each with 256 levels from 0 to 255; an input of 0 gives pure red

     [,1] [,2] [,3]

[1,]  255    0    0

> pal(1)#An input of 1 gives pure blue; green is not involved

     [,1] [,2] [,3]

[1,]    0    0  255

> pal(0.5)

      [,1] [,2]  [,3]

[1,] 127.5    0 127.5
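
The function returned by colorRamp() also accepts a vector of values. The sketch below passes five evenly spaced values and converts the resulting matrix to color strings with rgb(); the number of steps is an arbitrary choice.

```r
pal <- colorRamp(c("red", "blue"))

# Several values at once: one row of red, green, blue per input value
m <- pal(seq(0, 1, length.out = 5))

# Convert the 0 to 255 values to hexadecimal color strings
hex <- rgb(m, maxColorValue = 255)
```

The conversion is needed because most plotting functions expect color names or hexadecimal strings rather than a numeric matrix.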

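colorRampPalette() returns a function that takes an integer argument (the number of colors wanted) rather than a value between 0 and 1, and returns hexadecimal color strings.

> pal<-colorRampPalette(c("red","blue"))

> pal(2)#Take two colors: the result is pure red and pure blue. In FF0000, each pair of digits is the hexadecimal value for red, green, and blue; FF is the maximum and 00 the minimum

[1] "#FF0000" "#0000FF"

> pal(10)#Take 10 colors

[1] "#FF0000" "#E2001C" "#C60038" "#AA0055" "#8D0071" "#71008D"

[7] "#5500AA" "#3800C6" "#1C00E2" "#0000FF"

> image(volcano,col=pal(2))#Draw the image; volcano is a dataset that ships with R, and the colors come from pal()

> image(volcano,col=pal(20))#Use 20 colors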
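
To read the hexadecimal strings back as numbers, col2rgb() can be used. This is a small sketch; for instance, E2 in hexadecimal is 226 and 1C is 28.

```r
pal <- colorRampPalette(c("red", "blue"))
cols <- pal(10)

# col2rgb() decodes color strings into red, green, blue values (0 to 255)
rgb_values <- col2rgb(cols)
rgb_values[, 2]   # components of the second color

# Decoding a single string by hand
e2 <- col2rgb("#E2001C")
```

Along the palette the red component falls while the blue component rises, and green stays at zero throughout.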

  • RColorBrewer package

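3 types of palettes: sequential (ordered data), diverging (data that deviate from a midpoint), qualitative (unordered categories)

brewer.pal() takes two arguments: the number of colors needed and the name of the palette.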
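
A short sketch of how brewer.pal() can be combined with colorRampPalette(), assuming the RColorBrewer package is installed. The "BuGn" palette and the number of shades are arbitrary choices.

```r
library(RColorBrewer)

cols <- brewer.pal(3, "BuGn")    # three colors from the sequential "BuGn" palette
pal  <- colorRampPalette(cols)   # interpolate between them for more shades

shades <- pal(20)
# image(volcano, col = shades) would draw the volcano data with these shades
```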

  • smoothScatter()

> x<-rnorm(10000)

> y<-rnorm(10000)

> plot(x,y)

> smoothScatter(x,y)#Draw a 2D histogram of the points plotted above, rendered in color (blue by default)

  • rgb()

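Takes four arguments for red, green, blue, and transparency (alpha), each ranging from 0 to 1. It can therefore: 1. convert decimal color values to hexadecimal; 2. set the transparency of a color.

> rgb(1,1,1)

[1] "#FFFFFF"

> rgb(0.4,0.5,0.6)

[1] "#668099"

> rgb(1,1,1,0.5)

[1] "#FFFFFF80"

> plot(x,y,col=rgb(0,0,1,0.2),pch=19)#pch=19 draws solid circles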
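
Two further uses of rgb(), sketched as a supplement: the maxColorValue argument for values on a 0 to 255 scale, and col2rgb() with alpha = TRUE to read the transparency back.

```r
# Values on a 0 to 255 scale via maxColorValue
red <- rgb(255, 0, 0, maxColorValue = 255)

# Semi-transparent blue, as used in the plot call above
blue20 <- rgb(0, 0, 1, 0.2)

# Decode back into components, including the alpha channel
parts <- col2rgb(blue20, alpha = TRUE)
```

The alpha value is stored as a fourth pair of hexadecimal digits at the end of the color string.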

 

 

Data Science (1): The Data Scientist's Toolbox

Data Science (2): R Programming

Data Science (3): Getting and Cleaning Data