基于伯克利ds100的数据科学笔记pandas部分

notebook

  1. Ctrl + Return(或者 Cmd + Return在 Mac 上): 评估当前单元格
  2. Shift + Return评估当前单元格并移至下一个单元格
  3. ESC命令模式 (在使用以下任何命令之前可能需要按)
  4. a: 在上面创建一个单元格
  5. b在下面创建一个单元格
  6. dd: 删除一个单元格
  7. z撤消上一次单元格操作
  8. m将单元格转换为 Markdown
  9. y将单元格转换为代码

pandas

创建表格DataFrame

按列创建

1
2
3
4
5
6
fruit_info = pd.DataFrame(
data = {'fruit': ['apple', 'orange', 'banana', 'raspberry'],
'color': ['red', 'orange', 'yellow', 'pink'],
'price': [1.0, 0.75, 0.35, 0.05]
})
fruit_info

按行创建

1
2
3
4
fruit_info2 = pd.DataFrame(
[("red", "apple", 1.0), ("orange", "orange", 0.75), ("yellow", "banana", 0.35),
("pink", "raspberry", 0.05)],
columns = ["color", "fruit", "price"])

选择行和列

  • loc 两个详细运算符中的第一个是 loc,它需要两个参数。 第一个是一个或多个行 标签 ,第二个是一个或多个列 标签

所需的行或列可以单独提供、以切片表示法或作为列表提供。 单独提供会返回series 单独切片参数默认返回行

● loc selects items by label. First argument is rows, second argument is columns. ● iloc selects items by number. First argument is rows, second argument is columns. ● [] only takes one argument, which may be: ○ A slice of row numbers. ○ A list of column labels. ○ A single column label. That is,[] is context sensitive.

  • 列名一般唯一,可以强制重复
  • 行名可以重复 pandas支持裸操作,但此时无法使用切片
1
2
# Here we're providing a list of fruits as single argument to []
fruit_info[["fruit", "color", "price"]]

其他行列操作

添加行和列

1
2
fruit_info['rank1'] = [1,2,3,4]
fruit_info.loc[:,'rank2'] = [1,2,3,4]

使用 .drop()方法来 删除 这两个 rank1rank2您创建的列。 确保使用 axis参数正确。 注意 drop不会更改表,而是返回一个具有较少列或行的新表,除非您设置可选 inplace范围

1
2
3
fruit_info
fruit_info_original = fruit_info.drop(['rank1','rank2'],axis=1)
fruit_info_original

重命名

1
2
fruit_info_caps=fruit_info_original.rename(columns={'fruit':'F','color':'C','price':'P'})
fruit_info_caps

布尔数组筛选和查询

1
2
result = baby_names[(baby_names['Year']==2000) & (baby_names['Count']>3000)]
result_using_query = baby_names.query('Year==2000 & Count>3000')

slides实例

提取列数据:

1
2
elections["Candidate"].tail(5).to_frame()

提取目录:

1
2
For row labels, use DataFrame.index:
For column labels, use DataFrame.columns:

支持布尔数组检索

1
2
elections[elections["Party"] == "Independent"]
elections[(elections["Result"] == "win") & (elections["%"] < 47)]

支持用其他数组检索

1
2
a_parties = ["Anti-Masonic", "American", "Anti-Monopoly", "American Independent"]
elections[elections["Party"].isin(a_parties)]

查询类似sql

1
elections.query('Result == "win" and Party not in @parties')

匿名函数辅助查询

1
2
3
4
What if we wanted to find the longest names in California?
babynames.query('Sex == "M" and Year == 2020')
.sort_values("Name", key = lambda x: x.str.len(),
ascending = False)

创建删除列

1
2
3
4
5
babyname_lengths = babynames["Name"].str.len()
#add that series to the dataframe as a column
babynames["name_lengths"] = babyname_lengths
babynames = babynames.sort_values(by = "name_lengths", ascending=False)
babynames = babynames.drop("name_lengths", axis = 'columns')//删除列(默认为行)

map方法使用函数统计数据

1
2
3
4
def dr_ea_count(string):
return string.count('dr') + string.count('ea')
babynames["dr_ea_count"] = babynames["Name"].map(dr_ea_count)
babynames = babynames.sort_values(by = "dr_ea_count", ascending=False)

字典创建表格

1
2
3
4
5
6
7
8
#build dictionary where each entry is the rtp for a given name
#e.g. rtps["jennifer"] should be 0.0231
rtps = {}
for name in babynames["Name"].unique():
counts_of_current_name = female_babynames[female_babynames["Name"] == name]["Count"]
rtps[name] = ratio_to_peak(counts_of_current_name)
#convert to series
rtps = pd.Series(rtps)

使用group建立新表格

1
female_babynames.groupby("Name").agg(ratio_to_peak)

groupby的其他用法

1
2
3
4
5
6
7
Given a DataFrameGroupBy object, can use various functions to generate DataFrames (or
Series). agg is only one choice:
● agg: Creates a new DataFrame with one aggregated row per subframe.
● max: Creates a new DataFrame aggregated using the max function.
● size: Creates a new Series with the size of each subframe.
● filter: Creates a copy of the original DataFrame, but keeping only rows from subframes
that obey the provided condition.

合并表格

1
2
merged = pd.merge(left = elections, right = male_2020_babynames,
left_on = "First Name", right_on = "Name")