2017-06-18

Python pandas syntax

Handling N/A values, converting text to numbers, moving averages, one-hot encoding

Handling N/A values in pandas

To fill N/A values with data, use .fillna
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None

pad or ffill: fill with the previous valid value
backfill or bfill: fill with the next valid value
To fill with the column mean instead:
people['age'].fillna(people['age'].mean())
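The three filling strategies can be compared on a tiny column. This is a minimal sketch: the `people` frame and its values are made up for illustration. It uses `.ffill()` and `.bfill()`, which do the same thing as `method='ffill'` and `method='bfill'`; recent pandas releases deprecate the `method` argument of `fillna` in favour of these two methods.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for this example.
people = pd.DataFrame({"age": [20.0, np.nan, 40.0, np.nan]})

forward = people["age"].ffill()    # previous valid value (pad / ffill)
backward = people["age"].bfill()   # next valid value (backfill / bfill)
mean_filled = people["age"].fillna(people["age"].mean())  # mean of 20 and 40

print(forward.tolist())
print(backward.tolist())
print(mean_filled.tolist())
```

Note that a backward fill leaves the last row empty here, because no later value exists to copy from.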
To drop rows that contain N/A values, use .dropna
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
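A short sketch of how the `how` and `subset` arguments change which rows are dropped. The frame `df` and its columns `a` and `b` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 has one N/A, row 1 is all N/A, row 2 is complete.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

any_dropped = df.dropna()                 # how='any': drop rows with at least one N/A
all_dropped = df.dropna(how="all")        # drop only rows where every value is N/A
subset_dropped = df.dropna(subset=["a"])  # only check column 'a'

print(list(any_dropped.index))
print(list(all_dropped.index))
print(list(subset_dropped.index))
```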
To fill gaps by interpolation, use .interpolate
Series.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction='forward', downcast=None)
method : {'linear', 'time', 'index', 'values', 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'barycentric', 'krogh', 'polynomial', 'spline', 'piecewise_polynomial', 'from_derivatives', 'pchip', 'akima'}

Please note that only method='linear' is supported for DataFrames/Series with a MultiIndex
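As an illustration of the default linear method, the sketch below fills a two-value gap and then repeats the call with `limit=1`. The series `s` is invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap of two missing values.
s = pd.Series([1.0, np.nan, np.nan, 4.0])

linear = s.interpolate(method="linear")            # evenly spaced between 1 and 4
limited = s.interpolate(method="linear", limit=1)  # fill at most 1 consecutive N/A

print(linear.tolist())
print(limited.tolist())
```

With `limit=1` only the first missing value in the gap is filled; the second stays N/A.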

Converting text to numeric format

Use .to_numeric
pandas.to_numeric(arg, errors='raise', downcast=None)
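The `errors` argument decides what happens to text that is not a number. A minimal sketch, with a made-up series `s`:

```python
import pandas as pd

# Hypothetical input: two numeric strings and one that cannot be parsed.
s = pd.Series(["1", "2.5", "abc"])

# errors='coerce' turns unparsable entries into NaN instead of failing.
coerced = pd.to_numeric(s, errors="coerce")
print(coerced.tolist())

# errors='raise' (the default) stops at the first bad value.
raised = False
try:
    pd.to_numeric(s)
except ValueError as exc:
    raised = True
    print("raise mode failed:", exc)
```

Coercing and then calling .fillna or .dropna is one way to combine this with the N/A handling above.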

Moving average

Use .rolling
DataFrame.rolling(window, min_periods=None, freq=None, center=False, win_type=None, on=None, axis=0, closed=None)
ex. three-day moving average: stock['Close'].rolling(window=3).mean()
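A small sketch of the three-day average on invented closing prices. The `stock` frame is hypothetical; the second call shows how `min_periods` affects the first rows.

```python
import pandas as pd

# Hypothetical closing prices.
stock = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0]})

# Default: a window needs 3 observations, so the first two rows are NaN.
ma3 = stock["Close"].rolling(window=3).mean()

# min_periods=1 averages whatever is available in the window so far.
ma3_partial = stock["Close"].rolling(window=3, min_periods=1).mean()

print(ma3.tolist())
print(ma3_partial.tolist())
```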

Handling spaces in column names

When a column name contains spaces (ex. "總計"), the column cannot be read by its expected name, so remove the spaces:
p2.columns = [c.replace(' ', '') for c in p2.columns]
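The list comprehension above can be tried on a small frame. This is a sketch only: the column names, including the space inside "總 計", are assumptions made for the example.

```python
import pandas as pd

# Hypothetical frame whose headers contain stray spaces.
p2 = pd.DataFrame({"總 計": [1, 2], " name ": ["a", "b"]})

# Remove every space character from every column name.
p2.columns = [c.replace(" ", "") for c in p2.columns]

print(list(p2.columns))
print(p2["總計"].tolist())  # the column is now reachable by its clean name
```

The vectorised form `p2.columns.str.replace(" ", "", regex=False)` should give the same result.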

One-hot encoding

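pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False)
Converts a column's contents (a categorical variable) into 0/1 values for later regression.
ex. a direction column with four categories (East, West, South, North) can be coded with three 0/1 columns, with the last category as the all-zero baseline:
East  100
West  010
South 001
North 000
By default get_dummies returns one column per category (four here); drop_first=True returns k-1 columns by removing the first category level.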
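The direction example can be reproduced as follows. This is a sketch with English labels substituted for the four directions; `dtype=int` is used only so the result prints as 0/1.

```python
import pandas as pd

# Hypothetical categorical column with four directions.
direction = pd.Series(["East", "West", "South", "North"], name="direction")

# One 0/1 column per category (4 columns).
full = pd.get_dummies(direction, dtype=int)

# k-1 encoding: drop_first=True removes the first category level,
# which becomes the all-zero baseline row.
reduced = pd.get_dummies(direction, drop_first=True, dtype=int)

print(full)
print(reduced)
```

Which category ends up as the baseline depends on the order of the category levels, so it need not be the same one as in the table above.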

house1 = pd.concat([house, pd.get_dummies(house['Brick']), pd.get_dummies(house['Neighborhood'])], axis=1)

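concat joins several pandas objects. With axis=0 the objects are stacked vertically: rows are appended and columns are matched by name. With axis=1 they are placed side by side: columns are appended and rows are matched by index value. To keep the original columns and add the new ones next to them, set axis=1.
Usually axis=0 is said to be "column-wise" and axis=1 "row-wise" (for a reduction such as mean, axis=0 gives one result per column and axis=1 one result per row).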
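The difference between the two axes is easiest to see from the resulting shapes. A minimal sketch: the `house` data is invented, and `prefix="Brick"` is added here only to make the new column names readable.

```python
import pandas as pd

# Hypothetical housing data.
house = pd.DataFrame({"Price": [100, 200, 150], "Brick": ["Yes", "No", "Yes"]})
dummies = pd.get_dummies(house["Brick"], prefix="Brick", dtype=int)

# axis=1: same rows, dummy columns added to the right of the originals.
house1 = pd.concat([house, dummies], axis=1)

# axis=0: rows of both frames stacked; columns missing from one frame become NaN.
stacked = pd.concat([house, dummies], axis=0)

print(house1.shape, list(house1.columns))
print(stacked.shape)
```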

Data cleaning

age = '建筑年代：1998\r\n'
There are two ways to parse it:
df['age'].map(lambda e: 2017 - int(e.strip().strip('建筑年代：')))  # building age as of 2017
df['age'].str.extract(r'建筑年代：(\d+)')  # build year, still as a string
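Both approaches can be checked against each other on sample rows. In this sketch the second row is invented, and `expand=False` plus `astype(int)` are added so that the regex version also ends with a numeric building age.

```python
import pandas as pd

# Raw values in the same shape as the example; the second row is made up.
df = pd.DataFrame({"age": ["建筑年代：1998\r\n", "建筑年代：2005\r\n"]})

# Way 1: strip whitespace, then strip the label characters, then subtract.
by_map = df["age"].map(lambda e: 2017 - int(e.strip().strip("建筑年代：")))

# Way 2: capture the digits with a regex (raw string avoids escape warnings).
year = df["age"].str.extract(r"建筑年代：(\d+)", expand=False).astype(int)
by_extract = 2017 - year

print(by_map.tolist())
print(by_extract.tolist())
```

The regex version is the more defensive of the two, since `strip` with a character set would also remove any of those characters appearing at the end of the value.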

Reading an Excel file

import pandas as pd
erp = pd.ExcelFile("creditcardfromerp.xlsx")
erp_FIT = erp.parse('FIT')  # choose which sheet to read
erp_GIT = erp.parse('GIT')
erp_ALL = pd.concat([erp_FIT, erp_GIT], ignore_index=True)  # merge the dataframes and discard the original index to avoid duplicate index values
erp_ALL.head()
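The effect of `ignore_index=True` can be seen without an Excel file. In this sketch the two frames stand in for the parsed sheets, and the `amount` column is hypothetical.

```python
import pandas as pd

# Stand-ins for the two parsed sheets; values are invented.
erp_FIT = pd.DataFrame({"amount": [100, 200]})
erp_GIT = pd.DataFrame({"amount": [300]})

kept = pd.concat([erp_FIT, erp_GIT])                        # index repeats: 0, 1, 0
erp_ALL = pd.concat([erp_FIT, erp_GIT], ignore_index=True)  # fresh index: 0, 1, 2

print(list(kept.index))
print(list(erp_ALL.index))
```

A single sheet can also be read directly with `pd.read_excel(path, sheet_name='FIT')`, which avoids creating the ExcelFile object when only one sheet is needed.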

Using apply

erp_ALL["卡號"].apply(lambda x: x[-4:])  # take the last 4 characters of each card number
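A quick sketch of the same slice on made-up card numbers, next to the vectorised `.str` accessor, which gives the same result for string columns. The values are invented; the column name "卡號" follows the example above.

```python
import pandas as pd

# Hypothetical card numbers stored as strings.
cards = pd.Series(["1234567890123456", "9999000011112222"], name="卡號")

last4_apply = cards.apply(lambda x: x[-4:])  # element-wise Python slicing
last4_str = cards.str[-4:]                   # same slice via the .str accessor

print(last4_apply.tolist())
print(last4_str.tolist())
```

If the column was read as numbers rather than text, the lambda fails because integers cannot be sliced; converting with `.astype(str)` first is one way around that.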