There are lots of different file types that you can load
into pandas, and each has quite a few options that you can use
during the loading process. I won’t go into detail on most of these, but the
book on the left (Python for Data Analysis) covers most of the options. I am
going to cover the following file types and a few of the options you can use when loading
the data. I won’t post GitHub-hosted code for all of them, as most are basically
a single line of code plus the pandas import.
CSV:
pd.read_csv('filename.csv', sep=',', encoding='ISO-8859-1', parse_dates=[1, 2], dayfirst=True)
pd.read_csv('filename.csv', encoding='ISO-8859-1', dtype={'ColumnName': float}, low_memory=False, skiprows=[0])
So above there are two different statements loading a csv file
called filename.csv. The sep argument is the delimiter that has been used; for a csv this is
normally a comma. This is followed by the encoding type. I have then parsed
dates in columns 1 and 2 (remembering that Python starts counting at 0, so if opened in
Excel these would be columns B and C). Finally, I have said that this is using
British date formatting, with the day first.
In the second example I have included three other arguments. The
first, dtype, allows you to specify the data type of the column being loaded.
In this case I had a large number that would potentially be stored as either
text or scientific notation if I did not specify it as float on load. Looking
at online sources, and from personal experience, when you are loading the same
file in again and again it is worthwhile setting the datatypes; see the answer
to the stackoverflow question on this link.
Secondly, I have added the low_memory=False option, which
is linked to the dtype option above. If pandas does not know the datatype
of every column, it cannot tell what type to give each column until it
has fully loaded the file; for example, an id column could be a number for 10,000 rows but
row 10,001 could be a letter. low_memory=False has been useful for me when creating
a generic import for csv files where I am just quickly importing some data.
Thirdly, I have added the skiprows option. This allows you to
skip certain rows on load. In this case I am not loading the first row,
which is really useful when you keep getting sent documents that have two header rows,
or the date in the first row, or something annoying like that. I use all of the
above options on a daily basis to help automate my data analysis and report
production.
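To tie these options together, here is a minimal sketch of a combined load. The filename and column name are just placeholders for illustration, not from a real file.

import pandas as pd

# hypothetical file and column names, just for illustration
df = pd.read_csv('sales_export.csv',
                 sep=',',
                 encoding='ISO-8859-1',
                 parse_dates=[1, 2],        # columns B and C in Excel terms
                 dayfirst=True,             # British date format, day before month
                 dtype={'OrderValue': float},
                 low_memory=False,
                 skiprows=[0])              # skip an extra header row
print(df.dtypes)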
Text file:
pd.read_table('filename.txt', low_memory=False)
As I have gone into detail on the csv load I don’t need to
cover as much for text documents. Basically, the above loads in a text file,
and by default it assumes the data is tab delimited.
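read_table is essentially read_csv with a tab separator, so an equivalent sketch (the filename is a placeholder) would be:

import pandas as pd

# same as read_table: read_csv with a tab delimiter
df = pd.read_csv('filename.txt', sep='\t', low_memory=False)
print(df.head())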
Excel File:
As much as I enjoy using xlwings for exporting data to
Excel, it is not needed in most cases for loading data in. Instead the following
line can be used to import data into pandas.
pd.read_excel(latest_file, sheet_name = 'Sheet1')
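If you need more than one sheet, passing sheet_name=None returns a dict of dataframes, one per sheet. A minimal sketch, where the workbook name is made up:

import pandas as pd

# hypothetical workbook; sheet_name=None loads every sheet into a dict
sheets = pd.read_excel('report.xlsx', sheet_name=None)
for name, df in sheets.items():
    print(name, df.shape)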
SQL:
pd.read_sql('select * from table', cnxn )
The above allows you to run a SQL query and load the result straight into a pandas
dataframe, where cnxn is a database connection object.
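The cnxn object is whatever database connection you already have. As a self-contained sketch, here is the same idea against an in-memory SQLite database; the table and column names are made up for illustration:

import sqlite3
import pandas as pd

# in-memory SQLite database with a made-up table, just for illustration
cnxn = sqlite3.connect(':memory:')
cnxn.execute('CREATE TABLE orders (id INTEGER, value REAL)')
cnxn.executemany('INSERT INTO orders VALUES (?, ?)', [(1, 9.99), (2, 14.50)])

df = pd.read_sql('SELECT * FROM orders', cnxn)
print(df)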
Other file types:
There are lots of other useful functions; for example, you
can convert the clipboard to a dataframe, or read HDF5 or JSON files.
Basically, with most types of data you can get it into a pandas
dataframe.
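For example, a couple of one-liners; the JSON and HDF5 filenames here are just placeholders:

import pandas as pd

df_clip = pd.read_clipboard()         # whatever table is currently on the clipboard
df_json = pd.read_json('data.json')   # hypothetical JSON file
df_hdf = pd.read_hdf('store.h5')      # needs PyTables; works without a key if the file holds one object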