Wednesday, 3 January 2018

Loading data into Pandas

There are lots of different types of files that you can load into pandas and these each have quite a few different options that you can use during the loading process. I won’t go into detail on most of these but the book on the left (Python for Data Analysis) covers most of the options. I am going to cover the following and a few of the options you can use when loading the data. I won’t post code github hosted code for all of them as it is basically import pandas with a single line of code.

CSV:

pd.read_csv('filename.csv',sep = ',', encoding = "ISO-8859-1",parse_dates=[1,2], dayfirst=True)

pd.read_csv('filename.csv', encoding = "ISO-8859-1", dtype={ColumnName:float}, low_memory = False,skiprows=[0])

So above there are 2 different statements loading a csv file called filename. The sep is the delimiter that has been used, for a csv this is normally a comma, this is followed by the encoding type. I have then parsed dates in columns 1 and 2 (remembering that python starts at 0 so if opened in Excel these will be columns B and C. Finally, I have said that this is using British date formatting with the dates first.

In the second example I have included 3 other arguments. The first, Dtype, allows you to specify the data type of the column being loaded. So in this case I had a large number that would potentially be stored as either text or scientific notation if I did not specify it as float on load. Looking at online sources and from personal experience when you are loading the same file in again and again it is worthwhile setting the datatypes, see the answer to the stackoverflow question on this link.

Secondly, I have added the low memory = False option which is linked to the datatypes option above. If pandas does not know the datatype of all the columns it cannot tell what attribute to give to each column until it has fully loaded the file, as for example, an id could be a number 10000 times but row 10001 is a letter. Low memory = false has been useful for me when creating a generic import for csv files where I am just quickly importing some data.

Thirdly, I have added the skiprows opton. This allows you to not load certain rows. So in this case I am not going to load in the first row, this is really useful when you keep getting sent documents that have 2 header rows or the date in the first row or something annoying like that. I use all of the above options on a daily basis to help automate my data analysis and report production.

Text file:
 pd.read_table(‘filename.txt', low_memory = False)

As I have gone into detail on the load csv I don’t need to cover so much for the text documents. Basically the above loads in a text file. This assumes by default that the data is tab delimited.

Excel File:

As much as I enjoy using XLwings for exporting data from Excel it is not needed in most cases to load data in. Instead the following line can be used for importing data into pandas.

pd.read_excel(latest_file, sheet_name = 'Sheet1')

SQL:

pd.read_sql('select * from table', cnxn )

The above allows you to run a sql query into a pandas dataframe.  

Other file types:

There are lots of other useful functions, for example you can convert the clipboard to a dataframe, read HDF5 files or JSON files.

Basically with most types of data you can get it into a pandas dataframe. 

No comments:

Post a Comment