Data Science (1): The Data Scientist's Toolbox

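This series of articles on data science is drawn from my notes on the Data Science specialization on Coursera. Unless otherwise noted, the content comes from the course slides.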

Week1:

The data science Venn diagram

Hacking Skills: programming and data processing skills

Math & Statistics Knowledge: mathematics and statistics

Substantive Expertise: domain knowledge of the field under study

What do data scientists do

  • Define the question
  • Define the ideal data set
  • Determine what data you can access
  • Obtain the data
  • Clean the data
  • Exploratory data analysis
  • Statistical prediction/modeling
  • Interpret results
  • Challenge results
  • Synthesize/write up results
  • Create reproducible code
  • Distribute results

Install R

https://cran.r-project.org

 

Week2: Installing the Toolbox

Command line interface

Commands:

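  • pwd: print the current working directory
  • clear: clear the screen
  • ls: list the files in the current working directory
  • cd: change the working directory (cd .. moves up one level)
  • mkdir: create a new directory
  • touch: create a new empty file
  • cp: copy a file; -r copies a directory recursively
  • rm (remove): delete a file; -r deletes a directory
  • mv (move): move or rename a file
  • date: show the current date and time
  • echo: print the given text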
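The commands above can be tried together in a short session. This is only a sketch: the file and directory names are made up, `mktemp -d` and the `>` redirection are additions that do not appear in the list, and `clear` is left out because it needs an interactive terminal.

```shell
#!/bin/sh
# Work inside a throwaway directory so nothing else on disk is touched.
# (mktemp is an assumption here; it is not one of the commands listed above.)
sandbox=$(mktemp -d)
cd "$sandbox"
pwd                             # print the current working directory

mkdir notes                     # create a new directory
touch notes/todo.txt            # create a new empty file
echo "hello" > notes/todo.txt   # echo prints text; ">" redirects it into the file

cp notes/todo.txt copy.txt      # copy a file
cp -r notes notes-backup        # copy a directory recursively
mv copy.txt renamed.txt         # rename (move) a file
ls                              # list what is in the current directory

rm renamed.txt                  # delete a file
rm -r notes-backup              # delete a directory and its contents
cd ..                           # go up one level
date                            # show the current date and time
```

After the session only the `notes` directory and its `todo.txt` remain inside the throwaway directory.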

Git: a version control system

https://git-scm.com/download/win

for Windows: Git Bash

GitHub: remote hosting and management of repositories

https://github.com

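  • Create a local repository (repo) to work in: create a folder (mkdir /…/test-repo), move into it (cd /…/test-repo), initialize a local repository (git init), and point it at the remote repository (git remote add origin https://github.com/yourusername/test-repo.git)

Copy a remote repository to your local machine (git clone https://github.com/yourusername/reponame.git)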

  • workspace → (1) → index → (2) → local repository → (3) → remote repository

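1. Add files from the workspace to the index: add new files (git add .), stage modifications and deletions of already tracked files (git add -u), or do both (git add -A)

2. Commit from the index to the local repository: git commit -m "message"

3. Push the changes to GitHub: git push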

  • Create a branch (git checkout -b branchname), see which branch you are on (git branch), switch back to the main branch (git checkout master)
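The three steps and the branch commands can be rehearsed without a GitHub account. In the sketch below a local bare repository stands in for GitHub, and the identity, file name, and branch name `feature` are placeholders. It assumes git is installed; the default branch name is read from git rather than hard-coded, since it may be `master` or `main` depending on configuration.

```shell
#!/bin/sh
# Assumption: a local bare repository plays the role of the GitHub remote.
work=$(mktemp -d)
git init -q --bare "$work/remote.git"

mkdir "$work/test-repo"
cd "$work/test-repo"
git init -q
git config user.name "Example User"        # placeholder identity, local to this repo
git config user.email "user@example.com"
git remote add origin "$work/remote.git"   # point at the stand-in remote

echo "# test-repo" > README.md
git add -A                                 # 1: workspace -> index
git commit -q -m "Add README"              # 2: index -> local repository
main_branch=$(git rev-parse --abbrev-ref HEAD)   # master or main
git push -q -u origin "$main_branch"       # 3: local repository -> remote

git checkout -q -b feature                 # create a branch and switch to it
git branch                                 # list branches; * marks the current one
git checkout -q "$main_branch"             # switch back to the main branch
```

With a real GitHub repository, only the URL passed to `git remote add origin` changes.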

Basic markdown

## second-level heading, ### third-level heading

* unordered list item
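To see this syntax in a real file, the sketch below writes a small Markdown document from the command line. The file name and its contents are invented for illustration.

```shell
#!/bin/sh
# Write a tiny Markdown file into a throwaway directory (contents are illustrative).
md_dir=$(mktemp -d)
cat > "$md_dir/README.md" <<'EOF'
## Second-level heading

### Third-level heading

* first item
* second item
EOF
cat "$md_dir/README.md"   # print the file to check what was written
```

GitHub renders a README.md like this one on the repository's front page.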

Install R packages and Rtools

install.packages() installs a package from the R console; library() loads an installed package

https://cran.r-project.org/bin/windows/Rtools

http://www.rstudio.com/products/RStudio RStudio

install.packages("devtools") installs devtools, library(devtools) loads it, and find_rtools() returns "[1] TRUE" when Rtools is set up correctly

 

Week3: Conceptual Issues

Types of data science questions

  • Descriptive

Goal: Just to describe a set of data.

Description and interpretation are different steps.

Descriptions usually cannot be generalized without additional statistical modeling. In other words, you can only describe what you observe, not what might or will happen.

  • Exploratory

Goal: Find relationships you did not know about.

Good for discovering new connections.

It is useful for defining future study projects.

It is usually not the final say.

Exploratory analyses alone usually should not be used for generalizing or predicting.

Correlation does not imply causation.

  • Inferential

Goal: Use a relatively small sample of data to say something about a bigger population.

Inference involves estimating both the quantity you care about and your uncertainty about your estimate.

Inference depends heavily on both the population and the sampling scheme.
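As a minimal worked illustration of estimating a quantity together with its uncertainty (this example is not from the course slides), assume a simple random sample x_1, …, x_n from the population and a sample large enough for the normal approximation. The population mean can then be estimated by the sample mean, with the standard error as the measure of uncertainty:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
\mathrm{SE}(\bar{x}) = \frac{s}{\sqrt{n}}, \qquad
\text{approx. 95\% interval: } \bar{x} \pm 1.96\,\mathrm{SE}(\bar{x})
```

Both assumptions matter: under a different sampling scheme the same formula for the standard error would generally not apply.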

  • Predictive

Goal: To use the data on some objects to predict values for another object.

If X predicts Y it does not mean that X causes Y.

Accurate prediction depends heavily on measuring the right variables.

  • Causal

Goal: To find out what happens to one variable when you make another variable change.

Randomized studies or randomized controlled trials are generally used to identify causation.

There are approaches to inferring causation in non-randomized studies, but they are complicated and sensitive to assumptions.

Causal relationships are usually identified as average effects.

  • Mechanistic

Goal: Understand the exact changes in variables that lead to exact changes in other variables for individual objects.

Generally the only random component when you're doing a mechanistic analysis is measurement error.

Data

Definition in Wikipedia: Data are values of qualitative or quantitative variables, belonging to a set of items.

“The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”—John Tukey

Experimental design

Have replication

Measure variability

Generalize to the problem you care about

Are transparent

  • Beware data dredging

Data dredging means analyzing data without first devising a specific hypothesis about the underlying causality.
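A small calculation shows why this matters. As an illustrative assumption (not from the course slides), suppose m independent hypotheses are each tested at significance level α when no real effect exists. The probability of at least one spurious "finding" is:

```latex
P(\text{at least one false positive}) = 1 - (1 - \alpha)^{m},
\qquad \alpha = 0.05,\ m = 20:\quad 1 - 0.95^{20} \approx 0.64
```

So testing twenty unplanned relationships gives roughly a 64% chance of at least one false positive, even when nothing is really there.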