====== Creating a DataFrame ======

===== Create empty DataFrame =====

  import pandas
  index = pandas.date_range('2013-03-01', '2013-03-02 23:45:00', freq='15min')
  df = pandas.DataFrame(index=index, columns=['A', 'B', 'C'])

===== Create DataFrame from pandas.Series =====

  import pandas
  s1 = pandas.Series([1, 2, 3], index=[0, 1, 2])
  s2 = pandas.Series(['a', 'b', 'c'], index=[2, 3, 4])
  # note: pandas uses the index of both series to join the correct rows together
  df = pandas.DataFrame({'column1': s1, 'column2': s2})
  print df
    column1 column2
  0       1     NaN
  1       2     NaN
  2       3       a
  3     NaN       b
  4     NaN       c

===== Import data from CSV =====

  import pandas
  data = pandas.read_csv("data.csv", index_col="time", parse_dates=True)

Parse the index as timestamps:

  import pandas
  data = pandas.read_csv("data.csv", parse_dates=True)  # tries to parse the index as timestamps

Parse some columns as timestamps:

  import pandas
  # parse_dates accepts a list of ints or names, a list of lists, or a dict
  data = pandas.read_csv("data.csv", index_col="time", parse_dates=["from_time", "to_time"])

===== Import data from DB =====

  import pandas.io.sql as sql
  import psycopg2
  c = psycopg2.connect(host='localhost', database='big', user='sec', password='ret')
  frame = sql.read_sql("select * from my_table", c)

====== Working with DataFrames ======

===== Multi-axis indexing =====

[[http://pandas.pydata.org/pandas-docs/dev/indexing.html|Starting in 0.11.0]], more explicit location based indexing is supported. Pandas now supports three types of multi-axis indexing: by label (**loc**), by position (**iloc**) and mixed (**ix**):

  df.loc['a']             # row with index label 'a'
  df.time_stamp.iloc[0]   # first row, column "time_stamp"
  df.time_stamp.iloc[-1]  # last row, column "time_stamp"

**ix** supports mixed integer and label based access: it tries to interpret its arguments as labels first, then falls back to integer positional access. The documentation warns that it is better to be explicit and use .iloc or .loc whenever possible.

===== Iterate over rows =====

  for (index_value, series) in data_frame.iterrows():
      pass  # do something with the row

Maybe a little bit more intuitive, but not so sure about the performance:

  for (idx, col1, col2) in zip(df.index, df.col1, df.col2):
      pass  # do something with the values

===== Iterate over columns =====

  for (column_name, timeseries) in data_frame.iterkv():
      pass  # do something with the column

===== Add Column =====

  import numpy
  import pandas
  # simple version
  data['new_column'] = 0
  # complicated version (not sure if this has an advantage)
  data['new_column'] = pandas.Series(numpy.zeros(len(data)), index=data.index)

===== Extend =====

Suppose you have a DataFrame with an incomplete timeseries of a day, and you want to fill the missing values with NaN:

  complete_day = pandas.date_range('2013-05-18', periods=96, freq='15min')
  complete_data = incomplete_data.ix[complete_day]

===== Sort =====

  data_frame.sort(inplace=True, ascending=True)  # sort by index

===== Group =====

  from pandas.tseries.resample import TimeGrouper
  for name, group in data.groupby(TimeGrouper(freq='15min')):
      print name, group

===== Get row index =====

From [[http://stackoverflow.com/questions/18199288/getting-the-integer-index-of-a-pandas-dataframe-row-fulfilling-a-condition|SO]]:

  from numpy import where
  import pandas
  from datetime import datetime
  index = pandas.date_range('2013-03-01', '2013-03-02 23:45:00', freq='15min')
  df = pandas.DataFrame(index=index, columns=['A', 'B', 'C'])
  ts = datetime.strptime("2013-03-01 01", "%Y-%m-%d %H")
  print where(df.index == ts)[0]  # row index: 4

The value returned is an array since there could be more than one row with a particular index or value in a column.
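If the index is unique, Index.get_loc returns the integer position directly; a minimal sketch, reusing the df and ts from the example above (get_loc raises a KeyError if the timestamp is not present):

  # df.index is unique here, so get_loc returns a single integer position
  print df.index.get_loc(ts)  # 4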
===== Select =====

Suppose you have a pandas DataFrame with a DatetimeIndex and you want all entries in the hours 0-3:

  data[data.index.hour < 4]

For multiple criteria it is easier to define an explicit variable:

  # don't forget the parentheses or you get an error. Also note the bitwise or-operator!
  where = (data.speed < 10) | (data.speed > 90)
  data[where]

===== Modify Selection =====

This can be very counter-intuitive in pandas. The correct way to modify a DataFrame via boolean selection is to **specify the field to be modified FIRST and the boolean index afterwards**.

  data.ix[where, 'is_filtered'] = 1  # probably this is the recommended version?
  # Be aware that the version where the boolean index is specified first DOES NOT WORK:
  # data[where]['is_filtered'] = 1  # THIS IS WRONG
  # because data[where] calls __getitem__, which creates a COPY for boolean-indexed slices

See http://stackoverflow.com/questions/14192741/understanding-pandas-dataframe-indexing

===== Transform =====

  from pandas.tseries.resample import TimeGrouper

  def filter_func(grp):
      grp = grp.copy()  # otherwise the original dataframe will be modified!
      where = (grp['speed'] < 10) | (grp['speed'] > 90)
      grp.ix[where, 'filtered'] = 1  # see "Modify Selection" above
      return grp

  filtered_data = data.groupby(TimeGrouper(freq='15min')).transform(filter_func)

====== Combining DataFrames ======

===== Combining Data with different intervals =====

The Series.asof function is well suited for that: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.Series.asof.html

Example: join every row with the 15min median of the previous time interval:

  data = ....
  # note: label="right" causes every record in data to be joined with the PREVIOUS median
  median15 = data.resample("15min", how="median", label="right")
  data['last_median'] = [median15.speed.asof(t) for t in data.index]

===== Join =====

Details in the [[http://pandas.pydata.org/pandas-docs/dev/merging.html|Pandas doc]]:

  result = df1.join(df2, how="outer", lsuffix="_df1", rsuffix="_df2")

===== Merge =====

  # add column
  data['new_column'] = 0
  # add entries (append returns a new DataFrame)
  data = data.append(otherdata)

====== Aligning ======

  from pandas import Series, date_range
  from numpy.random import randn
  rng1 = date_range('2011-04-01', periods=5, freq='H')
  rng2 = date_range('2011-04-02', periods=7, freq='H')
  df1 = Series(randn(len(rng1)), index=rng1)
  df2 = Series(randn(len(rng2)), index=rng2)
  # align returns two aligned series, default: outer join
  adf1, adf2 = df1.align(df2)

====== Stack / Pivot ======

===== Pivoting =====

Suppose you have data like this:

^ time  ^ attribute ^ value ^
| 12:00 | speed     | 45.0  |
| 12:00 | count     | 150   |

But you prefer to have data like this:

^ time  ^ speed ^ count ^
| 12:00 | 45.0  | 150   |

  from datetime import time
  import pandas
  df = pandas.DataFrame(dict(timestamp=[time(12, 0), time(12, 0)],
                             attribute=["speed", "count"],
                             value=[45.0, 150.0]))
  df.pivot(index="timestamp", columns="attribute", values="value")

====== Output / Conversion ======

Pandas offers a lot of good output functionality. Some examples:

  df.to_csv('tmp.csv')
  df.to_excel('tmp.xls')

To create an Excel file with several sheets:

  import pandas as pd
  writer = pd.ExcelWriter('tmp.xls')
  df1.to_excel(writer, sheet_name='sheet1')
  df2.to_excel(writer, sheet_name='sheet2')
  ...
  writer.save()
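To read the sheets back, pandas.ExcelFile can be used; a minimal sketch, assuming the file and sheet names from the example above:

  import pandas
  xls = pandas.ExcelFile('tmp.xls')
  df1 = xls.parse('sheet1')  # one DataFrame per sheet
  df2 = xls.parse('sheet2')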
Save a DataFrame into a database; if_exists can be fail, replace or append:

  from pandas.io import sql
  import psycopg2
  # database connection
  c = psycopg2.connect(host='localhost', database='big', user='sec', password='ret')
  # if_exists checks if the table exists and acts accordingly:
  #   fail: throw a ValueError if the table exists
  #   replace: drop the table and create a new one
  #   append: append the data to the table
  sql.write_frame(df, 'table_name', c, if_exists='append')

===== Convert from Unix Timestamp =====

  import pandas as pd
  import numpy as np
  import time
  # create a dataframe with unix timestamps
  value = np.arange(100)
  ts = time.time() + value
  df = pd.DataFrame(dict(time=ts, value=value))
  # convert unix timestamp to a valid numpy datetime
  df['time'] = df.time.astype('M8[s]')

===== DataFrame to numpy recarray =====

The integrated method to_records creates a numpy recarray:

  recarray = data.to_records()

To select only specific columns:

  recarray = data[['col1', 'col2']].to_records(index=False)

===== DataFrame to matrix =====

  ma = data.as_matrix(['col1', 'col2'])

====== Plot ======

===== Plot days on top of each other =====

  import pylab
  data = ...  # contains data of several days
  pylab.plot(data.index.time, data.value1)

===== Colouring =====

Colouring can be done by selecting a [[http://matplotlib.org/examples/color/colormaps_reference.html|colormap]]:

  df.plot(colormap='cubehelix')
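The colormap sets the line colours when several columns are plotted at once; a minimal sketch with made-up data (four random walks, not from the examples above):

  import pandas
  import pylab
  from numpy.random import randn
  # four columns of cumulated random numbers, just to have something to colour
  df = pandas.DataFrame(randn(50, 4)).cumsum()
  df.plot(colormap='cubehelix')
  pylab.show()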