User Tools

Site Tools


pandas

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
pandas [2015/07/30 18:22]
mantis [Modify Selection]
pandas [2016/05/20 10:24] (current)
mantis [Import data from DB]
Line 1: Line 1:
 +====== Creating a DataFrame ======
  
 +===== Create empty DataFrame =====
 +<code python>
 +import pandas
 +
 +index = pandas.date_range('​2013-03-01',​ '​2013-03-02 23:​45:​00',​ freq='​15min'​)
 +df = pandas.DataFrame(index=index,​ columns=['​A','​B',​ '​C'​])
 +</​code>​
 +
 +===== Create DataFrame from pandas.Series =====
 +<code python>
 +import pandas
 +
 +s1 = pandas.Series([1,​2,​3],​ index=[0,​1,​2])
 +s2 = pandas.Series(['​a','​b','​c'​],​ index=[2,​3,​4])
 +
 +df = pandas.DataFrame( {'​column1':​s1,​ '​column2':​s2})  ​
 +
 +# Note:
 +# pandas uses the index of both series to join the correct rows together:
 +
 +print df
 +
 +   ​column1 column2
 +0        1     NaN
 +1        2     NaN
 +2        3       a
 +3      NaN       b
 +4      NaN       c
 +
 +</​code>​
 +===== Import data from CSV =====
 +<code python>
 +import pandas
 +data = pandas.read_csv("​data.csv",​ index_col="​time",​ parse_dates=True) ​
 +</​code>​
 +
 +Parse index as timestamp:
 +<code python>
 +import pandas
 +data = pandas.read_csv("​data.csv",​ parse_dates=True) # tries to parse index as timestamp
 +</​code>​
 +
 +Parse some columns as timestamp
 +<code python>
 +import pandas
 +data = pandas.read_csv("​data.csv",​ index_col="​time",​ parse_dates=["​from_time",​ "​to_time"​]) # list of ints or names, list of lists, or dict
 +</​code>​
 +
 +
 +===== Import data from DB =====
 +<code python>
 +import pandas.io.sql as sql
 +import psycopg2
 +
 +c = psycopg2.connect(host='​localhost',​ database='​big',​ user='​sec',​ password='​ret'​)
 +frame = sql.read_sql("​select * from my_table",​ c)
 +</​code>​
 +
 +
 +
 +
 +====== Working with DataFrames ======
 +
 +===== Multi-axis indexing =====
 +
 +[[http://​pandas.pydata.org/​pandas-docs/​dev/​indexing.html|Starting in 0.11.0]], more explicit location based indexing is supported. Pandas now supports three types of multi-axis indexing, by label (**loc**), by position (**iloc**) and mixed (**ix**):
 +
 +<code python>
 +df.loc['​a'​] # row with index a
 +</​code>​
 +
 +<code python>
 +df.time_stamp.iloc[0] ​ # first row, column "​time_stamp"​
 +df.time_stamp.iloc[-1] # last row, column "​time_stamp"​
 +</​code>​
 +
 +ix supports mixed integer and label based access: it will try to use arguments as labels first, then fallback to integer positional access). The documentations warns that it is better to be explicit and use .iloc or .loc whenever possible.
 +===== Iterate over rows =====
 +
 +<code python>
 +for (index_value,​ series) in data_frame.iterrows():​
 +   # do something
 +</​code>​
 +
 +
 +maybe a little bit more intuitive, but not so sure about the performance
 +
 +<code python>
 +for (idx, col1, col2) in zip(df.index,​ df.col1, df.col2):
 +    # do something
 +</​code>​
 +===== Iterate over columns =====
 +<code python>
 +for (column_name,​ timeseries) in data_frame.iterkv():​
 +   # do something
 +</​code>​
 +
 +
 +===== Add Column =====
 +<code python>
 +# simple version
 +data['​new_column'​] = 0 
 +
 +# complicated version (not sure if this has an advantage)
 +data['​new_column'​] = pandas.Series(numpy.zeros(len(data)),​ index=data.index)
 +</​code>​
 +
 +===== Extend =====
 +Suppose you have a dataframe with an incomplete timeseries of a day, and you want to fill the missing values with nan:
 +<code python>
 +    complete_day = pandas.date_range('​2013-05-18',​ periods=96, freq='​15min'​)
 +    complete_data=incomplete_data.ix[complete_day]
 +</​code>​
 +
 +===== Sort =====
 +<code python>
 +data_frame.sort(inplace=True,​ ascending=True) # sort by index
 +</​code>​
 +
 +===== Group =====
 +<code python>
 +from pandas.tseries.resample import TimeGrouper
 +for name, group in groupby(TimeGrouper(freq='​15min'​)):​
 +    print name, group
 +</​code>​
 +
 +===== Get row index =====
 +
 +From [[http://​stackoverflow.com/​questions/​18199288/​getting-the-integer-index-of-a-pandas-dataframe-row-fulfilling-a-condition|SO]]
 +
 +<code python>
 +from numpy import where
 +import pandas
 +from datetime import datetime
 + 
 +index = pandas.date_range('​2013-03-01',​ '​2013-03-02 23:​45:​00',​ freq='​15min'​)
 +df = pandas.DataFrame(index=index,​ columns=['​A','​B',​ '​C'​])
 +ts = datetime.strptime("​2013-03-01 01","​%Y-%m-%d %H") # row index: 4
 +
 +print where(df.index == ts)[0] # The value returned is an array since there could be more than one row with a particular index or value in a column.
 +</​code>​
 +
 +===== Select =====
 +Suppose you have a pandas DataFrame with a Datetimeindex and you want all entries in the hours 0-3 
 +<code python>
 +data[data.index.hour < 4]
 +</​code>​
 +
 +For multiple criteria it is easier to define an explicit variable:
 +
 +<code python>
 +where = (data.speed > 10 ) |  ( data.speed < 90 )  ​
 +# dont forget the ()  or you get an error. Also note the bit-wise or-operator!
 +data[where]
 +</​code>​
 +
 +===== Modify Selection =====
 +This can be very counter-intuitive in pandas. The correct way to modify a DataFrame via boolean selection is to **specify the field to be modifed FIRST and the boolean index afterwards**.
 +
 +<code python>
 +data.ix[where,​ '​is_filtered'​] = 1 # probably this is the recommended version? ​
 +
 +
 +# Be aware that the version where the boolean index is specified first DOES NOT WORK
 +# data[where]['​is_filtered'​] = 1    # THIS IS WRONG
 +# because data[where] calls __getitem__ which creates a COPY for boolean-indexed-slices
 +</​code>​
 +
 +see http://​stackoverflow.com/​questions/​14192741/​understanding-pandas-dataframe-indexing  ​
 +
 +===== Transform =====
 +
 +<code python>
 +from pandas.tseries.resample import TimeGrouper
 +def filter_func(grp):​
 +    grp = grp.copy() ​ # otherwise the original dataframe will be modified!
 +    where = ( grp['​speed'​] < 10 ) | (grp['​speed'​] > 90 ) 
 +    ​
 +    grp.filtered[where] = 1
 +    return grp
 +
 +filtered_data = data.groupby(TimeGrouper(freq='​15min'​)).transform(filter_func)
 +
 +</​code>​
 +
 +
 +
 +====== Combining DataFrames ======
 +
 +===== Combining Data with different intervals =====
 +The Series.asof Function is well suited to do that: http://​pandas.pydata.org/​pandas-docs/​dev/​generated/​pandas.Series.asof.html
 +
 +Example: Join every row with the 15min Median of the previous time interval
 +<code python>
 +data = ....
 +median15 = data.resample("​15min",​ how="​median",​ label="​right"​) # note: label="​right"​ causes every record in data to be joined with the PREVIOUS median
 +data['​last_median'​] ​ =  [median15.speed.asof(t) for t in data.index]
 +
 +</​code>​
 +===== Join =====
 +Details in [[http://​pandas.pydata.org/​pandas-docs/​dev/​merging.html|Pandas doc]]
 +
 +<code python>
 +result = df1.join(df2,​ how="​outer",​ lsuffix="​_df1",​ rsuffix="​_df2"​)
 +</​code>​
 +
 +===== Merge =====
 +
 +
 +
 +<code python>
 +# add column
 +data['​new_column'​] = 0 
 +
 +# add entries
 +data.append(otherdata)
 +</​code>​
 +
 +
 +====== Aligning ======
 +
 +<code python>
 +from pandas import Series, date_range
 +from numpy.random import randn
 +
 +rng1 = date_range('​2011-04-01',​ periods=5, freq='​H'​)
 +rng2 = date_range('​2011-04-02',​ periods=7, freq='​H'​)
 +
 +df1 = Series(randn(len(rng1)),​ index=rng1)
 +df2 = Series(randn(len(rng2)),​ index=rng2)
 +
 +# align
 +adf1, adf2 = df1.align(df2) # returns two aligned dataframes, default: outer join
 +
 +</​code>​
 +
 +====== Stack / Pivot ======
 +===== Pivoting =====
 +
 +Suppose you have data like this:
 +^ time      ^attribute ​     ^ value          ^
 +|12:​00 ​     |speed ​         |45.0            |
 +|12:​00 ​     |count ​         |150             |
 +
 +But you prefer to have data like this: 
 +^ time      ^speed ​     ^ count          ^
 +|12:​00 ​     |45.0       ​|150 ​            |
 +
 +
 +<code python>
 +from datetime import time 
 +import pandas
 +
 +df = pandas.DataFrame(dict(timestamp=[time(12,​0),​ time(12,​0)],​ attribute=["​speed","​count"​],​ value=[45.0,​ 150.0]))
 +df.pivot(index="​timestamp",​ columns="​attribute",​ values="​value"​)
 +</​code>​
 +
 +
 +
 +
 +====== Output / Conversion ======
 +
 +Pandas offers a lot of good output functionality. Some examples
 +
 +<code python>
 +sf.to_csv('​tmp.csv'​)
 +sf.to_excel('​tmp.xls'​)
 +</​code>​
 +
 +to create a excel file with more sheets:
 +
 +<code python>
 +writer = pd.ExcelWriter('​tmp.xls'​)
 +df1.to_excel(writer,​ sheet_name='​sheet1'​)
 +df2.to_excel(writer,​ sheet_name='​sheet2'​)
 +...
 +writer.save()
 +</​code>​
 +
 +save dataframe into a database
 +if_exists can be fail, replace or append
 +
 +<code python>
 +from pandas.io import sql
 +import psycopg2
 +
 +# database connection
 +c = psycopg2.connect(host='​localhost',​ database='​big',​ user='​sec',​ password='​ret'​)
 +
 +# if_exists checks if the table exists and acts accordingly
 +#   fail: throw a ValueError if the table exists
 +#   ​replace:​ drop the table and create a new one
 +#   ​append:​ append the data to the table
 +sql.write_frame(sf,​ '​table_name',​ con, if_exists='​append'​)
 +</​code>​
 +
 +
 +===== Convert from Unix Timestamp =====
 +
 +<code python>
 +import pandas as pd
 +import numpy as np
 +import time
 +
 +# create a dataframe with unix timestamp
 +value = np.arange(100)
 +ts = time.time() + value
 +df = pd.DataFrame(dict(time=ts,​ value=value))
 +
 +# convert unix timestamp to a valid numpy datetime
 +df['​time'​] = df.time.astype('​M8[s]'​)
 +
 +</​code>​
 +
 +
 +===== DataFrame to numpy recarray ======
 +
 +The integrated method to_records creates a numpy recarray
 +
 +<code python>
 +recarray = data.to_records()
 +</​code>​
 +
 +To select only specific columns
 +<code python>
 +recarray = data[['​col1','​col2'​]].to_records(index=False)
 +</​code>​
 +
 +===== DataFrame to matrix =====
 +
 +<code python>
 +ma = data.as_matrix(['​col1',​ '​col2'​])
 +</​code>​
 +
 +
 +====== Plot ======
 +===== Plot days on top of each other =====
 +
 +<code python>
 +data = ... #  contains data of several days
 +pylab.plot(data.index.time,​ data.value1)
 +</​code>​
 +
 +===== Colouring =====
 +
 +Colouring can be done by selecting a [[http://​matplotlib.org/​examples/​color/​colormaps_reference.html|colormap]]:​
 +
 +<code python>
 +df.plot(colormap='​cubehelix'​)
 +</​code>​
pandas.txt ยท Last modified: 2016/05/20 10:24 by mantis