How to use Python's csv module to extract data for a specific time range from different stock CSV files

To parameterize the data in a CSV file, two things are needed:
1. Reading the file row by row, which uses a list.
2. Looking up a cell's value by its column name, which uses a dictionary.
import csv
#Open the file with DictReader so that each row can be read as a dictionary keyed by column name
bid_info = csv.DictReader(open('bid_info.csv', 'r'))
dict_data = []
#Read the file row by row (DictReader skips the header line automatically)
for lines in bid_info:
    dict_data.append(lines)
row_num = len(dict_data)
# print('this is all the data---' + str(dict_data))

#Loop over every row
i = 0
while i < row_num:
    print('this is ' + str(i) + ' row----' + str(dict_data[i]))
    print(dict_data[i]['a'])   #get the value of a cell by its column name, e.g. column 'a'
    i += 1
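The snippet above only dumps every row of a single file. To actually answer the question in the title – pulling out the rows that fall inside a specific time range from several stock CSV files – a pandas-based sketch along the following lines could be used. The folder layout, the 'date' column name and the date range are assumptions made for illustration, not taken from the original post.

import glob
import os
import pandas as pd

#Assumed layout: one CSV per stock under stocks/, each file containing a 'date' column
frames = []
for path in glob.glob('stocks/*.csv'):
    df = pd.read_csv(path, parse_dates=['date'])
    #Keep only the rows that fall inside the wanted time window (assumed range)
    mask = (df['date'] >= '2017-01-01') & (df['date'] <= '2017-06-30')
    df = df.loc[mask].copy()
    df['code'] = os.path.splitext(os.path.basename(path))[0]   #remember which stock each row came from
    frames.append(df)

selected = pd.concat(frames, ignore_index=True)
print(selected.head())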
12 Useful Pandas Techniques in Python for Data Manipulation

Introduction

Python is fast becoming the preferred language for data scientists – and for good reasons. It provides the larger ecosystem of a programming language and the depth of good scientific computation libraries. If you are starting to learn Python, have a look at a Python learning path first.

Among its scientific computation libraries, I found Pandas to be the most useful for data science operations. Pandas, along with Scikit-learn, provides almost the entire stack needed by a data scientist. This article focuses on providing 12 ways for data manipulation in Python. I've also shared some tips & tricks which will allow you to work faster. I would recommend that you look at some code for data exploration before going ahead.

To help you understand better, I've taken a data set to perform these operations and manipulations.

Data Set: I've used the data set of the loan prediction problem. Download the data set and get started.

Let's get started

I'll start by importing modules and loading the data set into the Python environment:

import pandas as pd
import numpy as np
data = pd.read_csv("train.csv", index_col="Loan_ID")

#1 – Boolean Indexing

What do you do if you want to filter the values of a column based on conditions from another set of columns? For instance, we want a list of all females who are not graduates and got a loan. Boolean indexing can help here. You can use the following code:

data.loc[(data["Gender"]=="Female") & (data["Education"]=="Not Graduate") & (data["Loan_Status"]=="Y"), ["Gender","Education","Loan_Status"]]

Read More: Pandas Selecting and Indexing

#2 – Apply Function

This is one of the commonly used functions for playing with data and creating new variables. apply returns some value after passing each row/column of a data frame through some function. The function can be either built-in or user-defined. For instance, here it can be used to find the number of missing values in each row and column.

#Create a new function:
def num_missing(x):
    return sum(x.isnull())
#Applying per column:
print "Missing values per column:"
print data.apply(num_missing, axis=0) #axis=0 defines that function is to be applied on each column
#Applying per row:
print "\nMissing values per row:"
print data.apply(num_missing, axis=1).head() #axis=1 defines that function is to be applied on each row

Thus we get the desired result.

Note: the head() function is used in the second output because it contains many rows.

Read More: Pandas Reference (apply)

#3 – Imputing missing values

fillna() does it in one go. It is used for updating missing values with the overall mean/mode/median of the column. Let's impute the 'Gender', 'Married' and 'Self_Employed' columns with their respective modes.

#First we import a function to determine the mode
from scipy.stats import mode
mode(data['Gender'])

Output: ModeResult(mode=array(['Male'], dtype=object), count=array([489]))

This returns both the mode and its count. Remember that the mode can be an array, as there can be multiple values with a high frequency. By default we will always take the first one using:

mode(data['Gender']).mode[0]

Now we can fill in the missing values and check using technique #2.

#Impute the values:
data['Gender'].fillna(mode(data['Gender']).mode[0], inplace=True)
data['Married'].fillna(mode(data['Married']).mode[0], inplace=True)
data['Self_Employed'].fillna(mode(data['Self_Employed']).mode[0], inplace=True)
#Now check the #missing values again to confirm:
print data.apply(num_missing, axis=0)

Hence, it is confirmed that the missing values are imputed. Please note that this is the most primitive form of imputation. Other, more sophisticated techniques include modeling the missing values or using grouped averages (mean/mode/median). I'll cover that part in my next articles.

Read More: Pandas Reference (fillna)

#4 – Pivot Table

Pandas can be used to create MS Excel style pivot tables. For instance, in this case, a key column is "LoanAmount", which has missing values. We can impute it using the mean amount of each 'Gender', 'Married' and 'Self_Employed' group. The mean 'LoanAmount' of each group can be determined as:

#Determine pivot table
impute_grps = data.pivot_table(values=["LoanAmount"], index=["Gender","Married","Self_Employed"], aggfunc=np.mean)
print impute_grps

Read More: Pandas Reference (Pivot Table)

#5 – Multi-Indexing

If you notice the output of the previous step, it has a strange property. Each index is made up of a combination of 3 values. This is called Multi-Indexing. It helps in performing operations really fast.

Continuing the example from above, we have the values for each group, but they have not yet been imputed. This can be done using the various techniques learned so far.

#iterate only through rows with missing LoanAmount
for i,row in data.loc[data['LoanAmount'].isnull(),:].iterrows():
    ind = tuple([row['Gender'],row['Married'],row['Self_Employed']])
    data.loc[i,'LoanAmount'] = impute_grps.loc[ind].values[0]
#Now check the #missing values again to confirm:
print data.apply(num_missing, axis=0)

Note:
1. Multi-indexing requires a tuple for defining groups of indices in the loc statement. That is why a tuple is used inside the loop above.
2. The .values[0] suffix is required because, by default, a Series element is returned whose index does not match that of the dataframe. In this case, a direct assignment would give an error.
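To make the note concrete, a single group can be looked up in the multi-indexed pivot table directly with a tuple; the particular category values ('Male', 'Yes', 'No') below are just an assumed example combination, not taken from the article.

#Sketch: fetch the mean LoanAmount of one (Gender, Married, Self_Employed) group
print(impute_grps.loc[('Male','Yes','No')])            #a Series indexed by 'LoanAmount'
print(impute_grps.loc[('Male','Yes','No')].values[0])  #the scalar value used for imputation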
#6 – Crosstab

This function is used to get an initial "feel" (view) of the data. Here we can validate some basic hypotheses. For instance, in this case, "Credit_History" is expected to affect the loan status significantly. This can be tested using cross-tabulation as shown below:

pd.crosstab(data["Credit_History"],data["Loan_Status"],margins=True)

These are absolute numbers. But percentages can be more intuitive for making some quick insights. We can do this using the apply function:

def percConvert(ser):
    return ser/float(ser[-1])
pd.crosstab(data["Credit_History"],data["Loan_Status"],margins=True).apply(percConvert, axis=1)

Now it is evident that people with a credit history have much higher chances of getting a loan: 80% of people with a credit history got a loan, compared to only 9% of those without one.

But that's not all. It tells an interesting story. Since I know that having a credit history is super important, what if I predict the loan status to be Y for those with a credit history and N otherwise? Surprisingly, we'll be right 82+378=460 times out of 614, which is a whopping 75%! I won't blame you if you're wondering why on earth we need statistical models. But trust me, increasing the accuracy by even 0.001% beyond this mark is a challenging task. Would you take this challenge?

Note: 75% is on the train set. The test set will be slightly different, but close. Also, I hope this gives some intuition into why even a 0.05% increase in accuracy can result in a jump of 500 ranks on the Kaggle leaderboard.
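As a quick check on that arithmetic, a short sketch along these lines would reproduce the baseline on the training data. The rule itself – predict Y whenever Credit_History equals 1 – is the one described above; treating rows with a missing Credit_History as N is my own assumption.

#Sketch: score the "credit history => loan approved" rule on the training set
pred = np.where(data["Credit_History"]==1, 'Y', 'N')   #rows with missing Credit_History fall into 'N'
print((pred == data["Loan_Status"]).mean())            #roughly 460/614, i.e. about 0.75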
#7 – Merge DataFrames

Merging dataframes becomes essential when we have information coming from different sources to be collated. Consider a hypothetical case where the average property rates (INR per sq meter) are available for different property types. Let's define a dataframe as:

prop_rates = pd.DataFrame([1000, 5000, 12000], index=['Rural','Semiurban','Urban'],columns=['rates'])
prop_rates

Now we can merge this information with the original dataframe as:

data_merged = data.merge(right=prop_rates, how='inner',left_on='Property_Area',right_index=True, sort=False)
data_merged.pivot_table(values='Credit_History',index=['Property_Area','rates'], aggfunc=len)

The pivot table validates a successful merge operation. Note that the 'values' argument is irrelevant here because we are simply counting the values.

Read More: Pandas Reference (merge)

#8 – Sorting DataFrames

Pandas allows easy sorting based on multiple columns. This can be done as:

data_sorted = data.sort_values(['ApplicantIncome','CoapplicantIncome'], ascending=False)
data_sorted[['ApplicantIncome','CoapplicantIncome']].head(10)

Note: Pandas' "sort" function is now deprecated. We should use "sort_values" instead.

Read More: Pandas Reference (sort_values)

#9 – Plotting (Boxplot & Histogram)

Many of you might be unaware that boxplots and histograms can be plotted directly in Pandas, so calling matplotlib separately is not necessary. It's just a one-line command. For instance, if we want to compare the distribution of ApplicantIncome by Loan_Status:

import matplotlib.pyplot as plt
%matplotlib inline
data.boxplot(column="ApplicantIncome",by="Loan_Status")
data.hist(column="ApplicantIncome",by="Loan_Status",bins=30)

This shows that income is not a big deciding factor on its own, as there is no appreciable difference between the people who received and who were denied the loan.

Read More: Pandas Reference (boxplot) | Pandas Reference (hist)

#10 – Cut function for binning

Sometimes numerical values make more sense if clustered together. For example, if we're trying to model traffic (number of cars on the road) against the time of day (in minutes), the exact minute of an hour might not be that relevant for predicting traffic compared to the actual period of the day, like "Morning", "Afternoon", "Evening", "Night", "Late Night". Modeling traffic this way will be more intuitive and will avoid overfitting.

Here we define a simple function which can be re-used for binning any variable fairly easily.

#Binning:
def binning(col, cut_points, labels=None):
    #Define min and max values:
    minval = col.min()
    maxval = col.max()
    #create list by adding min and max to cut_points
    break_points = [minval] + cut_points + [maxval]
    #if no labels provided, use default labels 0 ... (n-1)
    if not labels:
        labels = range(len(cut_points)+1)
    #Binning using cut function of pandas
    colBin = pd.cut(col,bins=break_points,labels=labels,include_lowest=True)
    return colBin
#Binning LoanAmount:
cut_points = [90,140,190]
labels = ["low","medium","high","very high"]
data["LoanAmount_Bin"] = binning(data["LoanAmount"], cut_points, labels)
print pd.value_counts(data["LoanAmount_Bin"], sort=False)

Read More: Pandas Reference (cut)

#11 – Coding nominal data

Often we find a case where we have to modify the categories of a nominal variable. This can be due to various reasons:

1. Some algorithms (like logistic regression) require all inputs to be numeric, so nominal variables are mostly coded as 0, 1, ..., (n-1).
2. Sometimes a category might be represented in two ways. For example, temperature might be recorded as "High", "Medium", "Low", "H", "low". Here both "High" and "H" refer to the same category; similarly, "Low" and "low" differ only in case. But Python would read them as different levels.
3. Some categories might have very low frequencies, and it's generally a good idea to combine them.

Here I've defined a generic function which takes its input as a dictionary and codes the values using the 'replace' function in Pandas.

#Define a generic function using Pandas replace function
def coding(col, codeDict):
    colCoded = pd.Series(col, copy=True)
    for key, value in codeDict.items():
        colCoded.replace(key, value, inplace=True)
    return colCoded
#Coding LoanStatus as Y=1, N=0:
print 'Before Coding:'
print pd.value_counts(data["Loan_Status"])
data["Loan_Status_Coded"] = coding(data["Loan_Status"], {'N':0,'Y':1})
print '\nAfter Coding:'
print pd.value_counts(data["Loan_Status_Coded"])

Similar counts before and after prove the coding worked.

Read More: Pandas Reference (replace)

#12 – Iterating over rows of a dataframe

This is not a frequently used operation. Still, you don't want to get stuck, right? At times you may need to iterate through all rows using a for loop. For instance, one common problem we face is the incorrect treatment of variables in Python. This generally happens when:

1. Nominal variables with numeric categories are treated as numerical.
2. Numeric variables with characters entered in one of the rows (due to a data error) are considered categorical.

So it's generally a good idea to manually define the column types. If we check the data types of all columns:

#Check current type:
data.dtypes

Here we see that Credit_History is a nominal variable but appears as float. A good way to tackle such issues is to create a csv file with column names and types. This way, we can write a generic function to read the file and assign the column data types. For instance, here I have created a csv file named datatypes.csv.

#Load the file:
colTypes = pd.read_csv('datatypes.csv')
print colTypes

After loading this file, we can iterate through each row and assign the datatype from the 'type' column to the variable named in the 'feature' column.

#Iterate through each row and assign variable type.
#Note: astype is used to assign types
for i, row in colTypes.iterrows():
    #i: dataframe index; row: each row in series format
    if row['type']=="categorical":
        data[row['feature']]=data[row['feature']].astype(np.object)
    elif row['type']=="continuous":
        data[row['feature']]=data[row['feature']].astype(np.float)
print data.dtypes

Now the Credit_History column is modified to 'object' type, which is used for representing nominal variables in Pandas.

End Notes

In this article, we covered various functions of Pandas which can make our life easier while performing data exploration and feature engineering. We also defined some generic functions which can be reused for achieving a similar objective on different datasets.

Also See: If you have any doubts pertaining to Pandas or Python in general, feel free to discuss with us.

Did you find the article useful? Do you use some better (easier/faster) techniques for performing the tasks discussed above? Do you think there are better alternatives to Pandas in Python? We'll be glad if you share your thoughts as comments below.