====== Creating a DataFrame ======
===== Create empty DataFrame =====
import pandas
index = pandas.date_range('2013-03-01', '2013-03-02 23:45:00', freq='15min')
df = pandas.DataFrame(index=index, columns=['A','B', 'C'])
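A quick sketch of what this produces: one row per 15-minute step over two days (192 rows), with every value NaN.

```python
import pandas

# 2013-03-01 00:00 .. 2013-03-02 23:45 at 15 min -> 2 days * 96 rows = 192 rows
index = pandas.date_range('2013-03-01', '2013-03-02 23:45:00', freq='15min')
df = pandas.DataFrame(index=index, columns=['A', 'B', 'C'])
```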
===== Create DataFrame from pandas.Series =====
import pandas
s1 = pandas.Series([1,2,3], index=[0,1,2])
s2 = pandas.Series(['a','b','c'], index=[2,3,4])
df = pandas.DataFrame( {'column1':s1, 'column2':s2})
# Note:
# pandas uses the index of both series to join the correct rows together:
print(df)
   column1 column2
0        1     NaN
1        2     NaN
2        3       a
3      NaN       b
4      NaN       c
===== Import data from CSV =====
import pandas
data = pandas.read_csv("data.csv", index_col="time", parse_dates=True)
Parse index as timestamp:
import pandas
data = pandas.read_csv("data.csv", parse_dates=True) # tries to parse index as timestamp
Parse some columns as timestamp
import pandas
data = pandas.read_csv("data.csv", index_col="time", parse_dates=["from_time", "to_time"]) # list of ints or names, list of lists, or dict
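read_csv also accepts any file-like object, so the parsing behaviour can be checked without a file on disk. A sketch with made-up inline data (the column names are just examples):

```python
import io
import pandas

csv_text = "time,value\n2013-03-01 00:00,1\n2013-03-01 00:15,2\n"
# parse_dates=True parses the index column into timestamps
data = pandas.read_csv(io.StringIO(csv_text), index_col="time", parse_dates=True)
```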
===== Import data from DB =====
import pandas.io.sql as sql
import psycopg2
c = psycopg2.connect(host='localhost', database='big', user='sec', password='ret')
frame = sql.read_sql("select * from my_table", c)
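read_sql works with any DB-API connection; a sketch with an in-memory SQLite database instead of PostgreSQL (table name and columns are made up):

```python
import sqlite3
import pandas

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE my_table (id INTEGER, speed REAL)')
conn.execute('INSERT INTO my_table VALUES (1, 42.0)')
frame = pandas.read_sql('select * from my_table', conn)
```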
====== Working with DataFrames ======
===== Multi-axis indexing =====
[[http://pandas.pydata.org/pandas-docs/dev/indexing.html|Starting in 0.11.0]], more explicit location based indexing is supported. Pandas now supports three types of multi-axis indexing, by label (**loc**), by position (**iloc**) and mixed (**ix**):
df.loc['a'] # row with index a
df.time_stamp.iloc[0] # first row, column "time_stamp"
df.time_stamp.iloc[-1] # last row, column "time_stamp"
**ix** supports mixed integer and label based access: it tries to interpret its arguments as labels first, then falls back to integer positional access. The documentation warns that it is better to be explicit and use **.iloc** or **.loc** whenever possible.
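A sketch of why the distinction matters: with an integer index that is not in positional order, loc and iloc return different elements.

```python
import pandas

s = pandas.Series([10, 20, 30], index=[2, 0, 1])
by_label = s.loc[0]      # label 0 -> the second element
by_position = s.iloc[0]  # position 0 -> the first element
```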
===== Iterate over rows =====
for (index_value, series) in data_frame.iterrows():
    pass  # do something with the row
A zip over the index and the needed columns may be a little more intuitive and avoids building a Series per row, but measure before relying on it for performance:
for (idx, col1, col2) in zip(df.index, df.col1, df.col2):
    pass  # do something with the values
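itertuples is another option; it yields named tuples and avoids building a Series per row, which usually makes it faster than iterrows. A sketch (column names are examples):

```python
import pandas

df = pandas.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
total = 0
for row in df.itertuples():
    total += row.col1 * row.col2  # attribute access on the named tuple
```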
===== Iterate over columns =====
for (column_name, timeseries) in data_frame.items():  # formerly iteritems()/iterkv()
    pass  # do something with the column
===== Add Column =====
# simple version
data['new_column'] = 0
# complicated version (not sure if this has an advantage)
data['new_column'] = pandas.Series(numpy.zeros(len(data)), index=data.index)
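A new column can also be derived from existing ones; pandas aligns on the index, so no explicit index handling is needed. A sketch:

```python
import pandas

data = pandas.DataFrame({'value': [1, 2, 3]})
data['new_column'] = 0               # broadcast a scalar
data['double'] = data['value'] * 2   # element-wise from an existing column
```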
===== Extend =====
Suppose you have a DataFrame with an incomplete timeseries of a day, and you want to fill in the missing timestamps as NaN rows:
complete_day = pandas.date_range('2013-05-18', periods=96, freq='15min')
complete_data = incomplete_data.reindex(complete_day)  # missing timestamps become NaN rows
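A self-contained sketch of the idea with reindex, which fills missing labels with NaN (the two known timestamps are made up):

```python
import pandas

complete_day = pandas.date_range('2013-05-18', periods=96, freq='15min')
# only two of the 96 timestamps are present
incomplete_data = pandas.DataFrame(
    {'value': [1.0, 2.0]},
    index=[complete_day[0], complete_day[4]])
complete_data = incomplete_data.reindex(complete_day)
```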
===== Sort =====
data_frame.sort_index(inplace=True, ascending=True)  # sort by index
===== Group =====
from pandas.tseries.resample import TimeGrouper
for name, group in data.groupby(TimeGrouper(freq='15min')):
    print(name, group)
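In current pandas, pandas.Grouper replaces TimeGrouper; a runnable sketch:

```python
import pandas

idx = pandas.date_range('2013-03-01', periods=6, freq='5min')
data = pandas.DataFrame({'value': range(6)}, index=idx)
# six 5-minute points fall into two 15-minute bins of three rows each
sizes = {name: len(group)
         for name, group in data.groupby(pandas.Grouper(freq='15min'))}
```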
===== Get row index =====
From [[http://stackoverflow.com/questions/18199288/getting-the-integer-index-of-a-pandas-dataframe-row-fulfilling-a-condition|SO]]
from numpy import where
import pandas
from datetime import datetime
index = pandas.date_range('2013-03-01', '2013-03-02 23:45:00', freq='15min')
df = pandas.DataFrame(index=index, columns=['A','B', 'C'])
ts = datetime.strptime("2013-03-01 01","%Y-%m-%d %H") # row index: 4
print where(df.index == ts)[0] # The value returned is an array since there could be more than one row with a particular index or value in a column.
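When the label is known to exist, Index.get_loc gives the integer position directly:

```python
import pandas
from datetime import datetime

index = pandas.date_range('2013-03-01', '2013-03-02 23:45:00', freq='15min')
df = pandas.DataFrame(index=index, columns=['A', 'B', 'C'])
ts = datetime.strptime("2013-03-01 01", "%Y-%m-%d %H")
pos = df.index.get_loc(ts)  # integer position of the label
```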
===== Select =====
Suppose you have a pandas DataFrame with a DatetimeIndex and you want all entries in the hours 0-3:
data[data.index.hour < 4]
For multiple criteria it is easier to define an explicit variable:
where = (data.speed < 10) | (data.speed > 90)
# don't forget the () or you get an error. Also note the bit-wise or-operator!
data[where]
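A runnable sketch of both selections (the speed column is made up, as above):

```python
import pandas

idx = pandas.date_range('2013-03-01', periods=24, freq='60min')
data = pandas.DataFrame({'speed': range(0, 120, 5)}, index=idx)
night = data[data.index.hour < 4]                        # hours 0-3
outliers = data[(data.speed < 10) | (data.speed > 90)]   # combined criteria
```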
===== Modify Selection =====
This can be very counter-intuitive in pandas. The correct way to modify a DataFrame via boolean selection is a **single indexing call that takes the boolean index and the field to be modified together**.
data.loc[where, 'is_filtered'] = 1 # one .loc call: this is the recommended version
# Be aware that the version where the boolean index is specified first DOES NOT WORK
# data[where]['is_filtered'] = 1 # THIS IS WRONG
# because data[where] calls __getitem__ which creates a COPY for boolean-indexed-slices
see http://stackoverflow.com/questions/14192741/understanding-pandas-dataframe-indexing
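A self-contained sketch of the working pattern:

```python
import pandas

data = pandas.DataFrame({'speed': [5.0, 50.0, 95.0]})
data['is_filtered'] = 0
where = (data.speed < 10) | (data.speed > 90)
data.loc[where, 'is_filtered'] = 1  # single .loc call, no chained indexing
```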
===== Transform =====
from pandas.tseries.resample import TimeGrouper
def filter_func(grp):
    grp = grp.copy()  # otherwise the original dataframe will be modified!
    where = (grp['speed'] < 10) | (grp['speed'] > 90)
    grp.loc[where, 'filtered'] = 1  # avoid chained indexing (grp.filtered[where]) here too
    return grp
filtered_data = data.groupby(TimeGrouper(freq='15min')).transform(filter_func)
====== Combining DataFrames ======
===== Combining Data with different intervals =====
The Series.asof function is well suited for this: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.Series.asof.html
Example: join every row with the 15-minute median of the previous time interval
data = ....
median15 = data.resample("15min", how="median", label="right") # note: label="right" causes every record in data to be joined with the PREVIOUS median
data['last_median'] = [median15.speed.asof(t) for t in data.index]
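A minimal sketch of asof on its own: it returns the last value whose index is at or before the given timestamp.

```python
import pandas

idx = pandas.date_range('2013-03-01', periods=4, freq='15min')
median15 = pandas.Series([1.0, 2.0, 3.0, 4.0], index=idx)
val = median15.asof(pandas.Timestamp('2013-03-01 00:20'))  # last index <= 00:20 is 00:15
```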
===== Join =====
Details in [[http://pandas.pydata.org/pandas-docs/dev/merging.html|Pandas doc]]
result = df1.join(df2, how="outer", lsuffix="_df1", rsuffix="_df2")
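A runnable sketch (frame contents are made up; the suffixes disambiguate the overlapping column name):

```python
import pandas

df1 = pandas.DataFrame({'v': [1, 2]}, index=['a', 'b'])
df2 = pandas.DataFrame({'v': [3, 4]}, index=['b', 'c'])
# outer join on the index: labels a, b, c
result = df1.join(df2, how='outer', lsuffix='_df1', rsuffix='_df2')
```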
===== Merge =====
# add column
data['new_column'] = 0
# add entries: returns a new frame (append() was removed in newer pandas, use concat)
data = pandas.concat([data, otherdata])
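A runnable sketch of appending rows with pandas.concat:

```python
import pandas

data = pandas.DataFrame({'A': [1, 2]})
otherdata = pandas.DataFrame({'A': [3]})
combined = pandas.concat([data, otherdata], ignore_index=True)  # renumber the index
```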
====== Aligning ======
from pandas import Series, date_range
from numpy.random import randn
rng1 = date_range('2011-04-01', periods=5, freq='H')
rng2 = date_range('2011-04-02', periods=7, freq='H')
df1 = Series(randn(len(rng1)), index=rng1)
df2 = Series(randn(len(rng2)), index=rng2)
# align
adf1, adf2 = df1.align(df2) # returns two aligned dataframes, default: outer join
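A check of the result: the outer join puts both series on the union of the two indexes.

```python
from pandas import Series, date_range

rng1 = date_range('2011-04-01', periods=5, freq='60min')
rng2 = date_range('2011-04-02', periods=7, freq='60min')
s1 = Series(range(5), index=rng1)
s2 = Series(range(7), index=rng2)
a1, a2 = s1.align(s2)  # default: outer join -> 5 + 7 disjoint labels = 12 rows
```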
====== Stack / Pivot ======
===== Pivoting =====
Suppose you have data like this:
^ time ^attribute ^ value ^
|12:00 |speed |45.0 |
|12:00 |count |150 |
But you prefer to have data like this:
^ time ^speed ^ count ^
|12:00 |45.0 |150 |
from datetime import time
import pandas
df = pandas.DataFrame(dict(timestamp=[time(12,0), time(12,0)], attribute=["speed","count"], value=[45.0, 150.0]))
df.pivot(index="timestamp", columns="attribute", values="value")
====== Output / Conversion ======
Pandas offers a lot of good output functionality. Some examples:
sf.to_csv('tmp.csv')
sf.to_excel('tmp.xls')
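to_csv also writes to any file-like object, which is handy for inspecting the output without touching the disk (a sketch):

```python
import io
import pandas

sf = pandas.DataFrame({'A': [1, 2]})
buf = io.StringIO()
sf.to_csv(buf)  # index is written as the first column by default
text = buf.getvalue()
```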
to create an Excel file with multiple sheets:
writer = pd.ExcelWriter('tmp.xls')
df1.to_excel(writer, sheet_name='sheet1')
df2.to_excel(writer, sheet_name='sheet2')
...
writer.save()
Save a DataFrame into a database; **if_exists** can be fail, replace or append:
from pandas.io import sql
import psycopg2
# database connection
c = psycopg2.connect(host='localhost', database='big', user='sec', password='ret')
# if_exists checks if the table exists and acts accordingly
# fail: throw a ValueError if the table exists
# replace: drop the table and create a new one
# append: append the data to the table
sql.write_frame(sf, 'table_name', c, if_exists='append')
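In current pandas, write_frame is gone and DataFrame.to_sql replaces it. A sketch with an in-memory SQLite database (the table name is made up):

```python
import sqlite3
import pandas

sf = pandas.DataFrame({'A': [1, 2]})
conn = sqlite3.connect(':memory:')
sf.to_sql('table_name', conn, if_exists='append', index=False)
back = pandas.read_sql('select * from table_name', conn)
```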
===== Convert from Unix Timestamp =====
import pandas as pd
import numpy as np
import time
# create a dataframe with unix timestamp
value = np.arange(100)
ts = time.time() + value
df = pd.DataFrame(dict(time=ts, value=value))
# convert unix timestamp to a valid numpy datetime
df['time'] = df.time.astype('M8[s]')
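In current pandas, pandas.to_datetime with unit='s' is the explicit way to do the same conversion:

```python
import pandas

df = pandas.DataFrame({'time': [0, 86400], 'value': [1, 2]})
df['time'] = pandas.to_datetime(df['time'], unit='s')  # seconds since the epoch
```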
===== DataFrame to numpy recarray =====
The built-in method **to_records** creates a numpy recarray:
recarray = data.to_records()
To select only specific columns
recarray = data[['col1','col2']].to_records(index=False)
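A quick check of the result: the column selection determines the record fields.

```python
import pandas

data = pandas.DataFrame({'col1': [1, 2], 'col2': [3.0, 4.0]})
recarray = data[['col1', 'col2']].to_records(index=False)
```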
===== DataFrame to matrix =====
ma = data[['col1', 'col2']].values  # as_matrix() was removed in newer pandas
====== Plot ======
===== Plot days on top of each other =====
data = ... # contains data of several days
pylab.plot(data.index.time, data.value1)
===== Colouring =====
Colouring can be done by selecting a [[http://matplotlib.org/examples/color/colormaps_reference.html|colormap]]:
df.plot(colormap='cubehelix')