====== Creating a DataFrame ======
===== Create empty DataFrame =====
import pandas
index = pandas.date_range('2013-03-01', '2013-03-02 23:45:00', freq='15min')
df = pandas.DataFrame(index=index, columns=['A','B', 'C'])
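A quick sketch of what this produces: one row per 15-minute step over two days (192 rows), with every value NaN.

```python
import pandas

# 2013-03-01 00:00 .. 2013-03-02 23:45 at 15 min -> 2 days * 96 rows = 192 rows
index = pandas.date_range('2013-03-01', '2013-03-02 23:45:00', freq='15min')
df = pandas.DataFrame(index=index, columns=['A', 'B', 'C'])
```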
===== Create DataFrame from pandas.Series =====
import pandas
s1 = pandas.Series([1,2,3], index=[0,1,2])
s2 = pandas.Series(['a','b','c'], index=[2,3,4])
df = pandas.DataFrame( {'column1':s1, 'column2':s2})
# Note:
# pandas uses the index of both series to join the correct rows together:
print(df)
   column1 column2
0        1     NaN
1        2     NaN
2        3       a
3      NaN       b
4      NaN       c
===== Import data from CSV =====
import pandas
data = pandas.read_csv("data.csv", index_col="time", parse_dates=True)
Parse index as timestamp:
import pandas
data = pandas.read_csv("data.csv", parse_dates=True) # tries to parse index as timestamp
Parse some columns as timestamp
import pandas
data = pandas.read_csv("data.csv", index_col="time", parse_dates=["from_time", "to_time"]) # list of ints or names, list of lists, or dict
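read_csv also accepts any file-like object, so the parsing behaviour can be checked without a file on disk. A sketch with made-up inline data (the column names are just examples):

```python
import io
import pandas

csv_text = "time,value\n2013-03-01 00:00,1\n2013-03-01 00:15,2\n"
# parse_dates=True parses the index column into timestamps
data = pandas.read_csv(io.StringIO(csv_text), index_col="time", parse_dates=True)
```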
===== Import data from DB =====
import pandas.io.sql as sql
import psycopg2
c = psycopg2.connect(host='localhost', database='big', user='sec', password='ret')
frame = sql.read_sql("select * from my_table", c)
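read_sql works with any DB-API connection; a sketch with an in-memory SQLite database instead of PostgreSQL (table name and columns are made up):

```python
import sqlite3
import pandas

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE my_table (id INTEGER, speed REAL)')
conn.execute('INSERT INTO my_table VALUES (1, 42.0)')
frame = pandas.read_sql('select * from my_table', conn)
```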
====== Working with DataFrames ======
===== Multi-axis indexing =====
[[http://pandas.pydata.org/pandas-docs/dev/indexing.html|Starting in 0.11.0]], more explicit location based indexing is supported. Pandas now supports three types of multi-axis indexing, by label (**loc**), by position (**iloc**) and mixed (**ix**):
df.loc['a'] # row with index a
df.time_stamp.iloc[0] # first row, column "time_stamp"
df.time_stamp.iloc[-1] # last row, column "time_stamp"
**ix** supports mixed integer and label based access: it tries to interpret its arguments as labels first, then falls back to integer positional access. The documentation warns that it is better to be explicit and use **.iloc** or **.loc** whenever possible.
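A sketch of why the distinction matters: with an integer index that is not in positional order, loc and iloc return different elements.

```python
import pandas

s = pandas.Series([10, 20, 30], index=[2, 0, 1])
by_label = s.loc[0]      # label 0 -> the second element
by_position = s.iloc[0]  # position 0 -> the first element
```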
===== Iterate over rows =====
for (index_value, series) in data_frame.iterrows():
    pass  # do something with the row
A zip over the index and the needed columns may be a little more intuitive and avoids building a Series per row, but measure before relying on it for performance:
for (idx, col1, col2) in zip(df.index, df.col1, df.col2):
    pass  # do something with the values
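itertuples is another option; it yields named tuples and avoids building a Series per row, which usually makes it faster than iterrows. A sketch (column names are examples):

```python
import pandas

df = pandas.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
total = 0
for row in df.itertuples():
    total += row.col1 * row.col2  # attribute access on the named tuple
```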
===== Iterate over columns =====
for (column_name, timeseries) in data_frame.items():  # formerly iteritems()/iterkv()
    pass  # do something with the column
===== Add Column =====
# simple version
data['new_column'] = 0
# complicated version (not sure if this has an advantage)
data['new_column'] = pandas.Series(numpy.zeros(len(data)), index=data.index)
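A new column can also be derived from existing ones; pandas aligns on the index, so no explicit index handling is needed. A sketch:

```python
import pandas

data = pandas.DataFrame({'value': [1, 2, 3]})
data['new_column'] = 0               # broadcast a scalar
data['double'] = data['value'] * 2   # element-wise from an existing column
```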
===== Extend =====
Suppose you have a DataFrame with an incomplete timeseries of a day, and you want to fill in the missing timestamps as NaN rows:
complete_day = pandas.date_range('2013-05-18', periods=96, freq='15min')
complete_data = incomplete_data.reindex(complete_day)  # missing timestamps become NaN rows
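A self-contained sketch of the idea with reindex, which fills missing labels with NaN (the two known timestamps are made up):

```python
import pandas

complete_day = pandas.date_range('2013-05-18', periods=96, freq='15min')
# only two of the 96 timestamps are present
incomplete_data = pandas.DataFrame(
    {'value': [1.0, 2.0]},
    index=[complete_day[0], complete_day[4]])
complete_data = incomplete_data.reindex(complete_day)
```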
===== Sort =====
data_frame.sort_index(inplace=True, ascending=True)  # sort by index
===== Group =====
from pandas.tseries.resample import TimeGrouper
for name, group in data.groupby(TimeGrouper(freq='15min')):
    print(name, group)
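In current pandas, pandas.Grouper replaces TimeGrouper; a runnable sketch:

```python
import pandas

idx = pandas.date_range('2013-03-01', periods=6, freq='5min')
data = pandas.DataFrame({'value': range(6)}, index=idx)
# six 5-minute points fall into two 15-minute bins of three rows each
sizes = {name: len(group)
         for name, group in data.groupby(pandas.Grouper(freq='15min'))}
```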
===== Get row index =====
From [[http://stackoverflow.com/questions/18199288/getting-the-integer-index-of-a-pandas-dataframe-row-fulfilling-a-condition|SO]]
from numpy import where
import pandas
from datetime import datetime
index = pandas.date_range('2013-03-01', '2013-03-02 23:45:00', freq='15min')
df = pandas.DataFrame(index=index, columns=['A','B', 'C'])
ts = datetime.strptime("2013-03-01 01","%Y-%m-%d %H") # row index: 4
print where(df.index == ts)[0] # The value returned is an array since there could be more than one row with a particular index or value in a column.
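When the label is known to exist, Index.get_loc gives the integer position directly:

```python
import pandas
from datetime import datetime

index = pandas.date_range('2013-03-01', '2013-03-02 23:45:00', freq='15min')
df = pandas.DataFrame(index=index, columns=['A', 'B', 'C'])
ts = datetime.strptime("2013-03-01 01", "%Y-%m-%d %H")
pos = df.index.get_loc(ts)  # integer position of the label
```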
===== Select =====
Suppose you have a pandas DataFrame with a DatetimeIndex and you want all entries in the hours 0-3:
data[data.index.hour < 4]
For multiple criteria it is easier to define an explicit variable:
where = (data.speed < 10) | (data.speed > 90)
# don't forget the () or you get an error. Also note the bit-wise or-operator!
data[where]
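A runnable sketch of both selections (the speed column is made up, as above):

```python
import pandas

idx = pandas.date_range('2013-03-01', periods=24, freq='60min')
data = pandas.DataFrame({'speed': range(0, 120, 5)}, index=idx)
night = data[data.index.hour < 4]                        # hours 0-3
outliers = data[(data.speed < 10) | (data.speed > 90)]   # combined criteria
```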
===== Modify Selection =====
This can be very counter-intuitive in pandas. The correct way to modify a DataFrame via boolean selection is a **single indexing call that takes the boolean index and the field to be modified together**.
data.loc[where, 'is_filtered'] = 1 # one .loc call: this is the recommended version
# Be aware that the version where the boolean index is specified first DOES NOT WORK
# data[where]['is_filtered'] = 1 # THIS IS WRONG
# because data[where] calls __getitem__ which creates a COPY for boolean-indexed-slices
see http://stackoverflow.com/questions/14192741/understanding-pandas-dataframe-indexing
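A self-contained sketch of the working pattern:

```python
import pandas

data = pandas.DataFrame({'speed': [5.0, 50.0, 95.0]})
data['is_filtered'] = 0
where = (data.speed < 10) | (data.speed > 90)
data.loc[where, 'is_filtered'] = 1  # single .loc call, no chained indexing
```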
===== Transform =====
from pandas.tseries.resample import TimeGrouper
def filter_func(grp):
    grp = grp.copy()  # otherwise the original dataframe will be modified!
    where = (grp['speed'] < 10) | (grp['speed'] > 90)
    grp.loc[where, 'filtered'] = 1  # avoid chained indexing (grp.filtered[where]) here too
    return grp
filtered_data = data.groupby(TimeGrouper(freq='15min')).transform(filter_func)
====== Combining DataFrames ======
===== Combining Data with different intervals =====
The Series.asof function is well suited for this: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.Series.asof.html
Example: join every row with the 15-minute median of the previous time interval
data = ....
median15 = data.resample("15min", how="median", label="right") # note: label="right" causes every record in data to be joined with the PREVIOUS median
data['last_median'] = [median15.speed.asof(t) for t in data.index]
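A minimal sketch of asof on its own: it returns the last value whose index is at or before the given timestamp.

```python
import pandas

idx = pandas.date_range('2013-03-01', periods=4, freq='15min')
median15 = pandas.Series([1.0, 2.0, 3.0, 4.0], index=idx)
val = median15.asof(pandas.Timestamp('2013-03-01 00:20'))  # last index <= 00:20 is 00:15
```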
===== Join =====
Details in [[http://pandas.pydata.org/pandas-docs/dev/merging.html|Pandas doc]]
result = df1.join(df2, how="outer", lsuffix="_df1", rsuffix="_df2")
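A runnable sketch (frame contents are made up; the suffixes disambiguate the overlapping column name):

```python
import pandas

df1 = pandas.DataFrame({'v': [1, 2]}, index=['a', 'b'])
df2 = pandas.DataFrame({'v': [3, 4]}, index=['b', 'c'])
# outer join on the index: labels a, b, c
result = df1.join(df2, how='outer', lsuffix='_df1', rsuffix='_df2')
```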
===== Merge =====
# add column
data['new_column'] = 0
# add entries: returns a new frame (append() was removed in newer pandas, use concat)
data = pandas.concat([data, otherdata])
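A runnable sketch of appending rows with pandas.concat:

```python
import pandas

data = pandas.DataFrame({'A': [1, 2]})
otherdata = pandas.DataFrame({'A': [3]})
combined = pandas.concat([data, otherdata], ignore_index=True)  # renumber the index
```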
====== Aligning ======
from pandas import Series, date_range
from numpy.random import randn
rng1 = date_range('2011-04-01', periods=5, freq='H')
rng2 = date_range('2011-04-02', periods=7, freq='H')
df1 = Series(randn(len(rng1)), index=rng1)
df2 = Series(randn(len(rng2)), index=rng2)
# align
adf1, adf2 = df1.align(df2) # returns two aligned dataframes, default: outer join
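A check of the result: the outer join puts both series on the union of the two indexes.

```python
from pandas import Series, date_range

rng1 = date_range('2011-04-01', periods=5, freq='60min')
rng2 = date_range('2011-04-02', periods=7, freq='60min')
s1 = Series(range(5), index=rng1)
s2 = Series(range(7), index=rng2)
a1, a2 = s1.align(s2)  # default: outer join -> 5 + 7 disjoint labels = 12 rows
```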
====== Stack / Pivot ======
===== Pivoting =====
Suppose you have data like this:
^ time ^attribute ^ value ^
|12:00 |speed |45.0 |
|12:00 |count |150 |
But you prefer to have data like this:
^ time ^speed ^ count ^
|12:00 |45.0 |150 |
from datetime import time
import pandas
df = pandas.DataFrame(dict(timestamp=[time(12,0), time(12,0)], attribute=["speed","count"], value=[45.0, 150.0]))
df.pivot(index="timestamp", columns="attribute", values="value")
====== Output / Conversion ======
Pandas offers a lot of good output functionality. Some examples:
sf.to_csv('tmp.csv')
sf.to_excel('tmp.xls')
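to_csv also writes to any file-like object, which is handy for inspecting the output without touching the disk (a sketch):

```python
import io
import pandas

sf = pandas.DataFrame({'A': [1, 2]})
buf = io.StringIO()
sf.to_csv(buf)  # index is written as the first column by default
text = buf.getvalue()
```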
to create an Excel file with multiple sheets:
writer = pd.ExcelWriter('tmp.xls')
df1.to_excel(writer, sheet_name='sheet1')
df2.to_excel(writer, sheet_name='sheet2')
...
writer.save()
Save a DataFrame into a database; **if_exists** can be fail, replace or append:
from pandas.io import sql
import psycopg2
# database connection
c = psycopg2.connect(host='localhost', database='big', user='sec', password='ret')
# if_exists checks if the table exists and acts accordingly
# fail: throw a ValueError if the table exists
# replace: drop the table and create a new one
# append: append the data to the table
sql.write_frame(sf, 'table_name', c, if_exists='append')
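In current pandas, write_frame is gone and DataFrame.to_sql replaces it. A sketch with an in-memory SQLite database (the table name is made up):

```python
import sqlite3
import pandas

sf = pandas.DataFrame({'A': [1, 2]})
conn = sqlite3.connect(':memory:')
sf.to_sql('table_name', conn, if_exists='append', index=False)
back = pandas.read_sql('select * from table_name', conn)
```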
===== Convert from Unix Timestamp =====
import pandas as pd
import numpy as np
import time
# create a dataframe with unix timestamp
value = np.arange(100)
ts = time.time() + value
df = pd.DataFrame(dict(time=ts, value=value))
# convert unix timestamp to a valid numpy datetime
df['time'] = df.time.astype('M8[s]')
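In current pandas, pandas.to_datetime with unit='s' is the explicit way to do the same conversion:

```python
import pandas

df = pandas.DataFrame({'time': [0, 86400], 'value': [1, 2]})
df['time'] = pandas.to_datetime(df['time'], unit='s')  # seconds since the epoch
```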
===== DataFrame to numpy recarray =====
The built-in method **to_records** creates a numpy recarray:
recarray = data.to_records()
To select only specific columns
recarray = data[['col1','col2']].to_records(index=False)
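A quick check of the result: the column selection determines the record fields.

```python
import pandas

data = pandas.DataFrame({'col1': [1, 2], 'col2': [3.0, 4.0]})
recarray = data[['col1', 'col2']].to_records(index=False)
```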
===== DataFrame to matrix =====
ma = data[['col1', 'col2']].values  # as_matrix() was removed in newer pandas
====== Plot ======
===== Plot days on top of each other =====
data = ... # contains data of several days
pylab.plot(data.index.time, data.value1)
===== Colouring =====
Colouring can be done by selecting a [[http://matplotlib.org/examples/color/colormaps_reference.html|colormap]]:
df.plot(colormap='cubehelix')