23. Reshaping variables to observations and back

SAS:

\\# wide to long
proc transpose data = mylib.mydata
out = mylib.mylong;
var q1 - q4;
by id workshop gender;
run;
data mylib.mylong;
  set mylib.mylong(rename = (COL1 = value));
  time = input(substr(_name_, 2), 1.);
  drop _name_;
run;

\\# long to wide
proc transpose data = mylib.mylong;
  out= mylib.mywide prefix = q;
  by id workshop gender;
  id time;
  var value;
run;

data mylib.mywide;
  set mylib.mywide(drop = _name_);
run;

R:

library("reshap2")
mychanges <- c(
  q1 = "time1",
  q2 = "time2",
  q3 = "time3",
  q4 = "time4")
  mydata <- rename(mydata, mychanges)
  mydata$subject <- factor(1:8)

\# reshaping from wide to long

 library("reshap2")
  mylong <- melt(mydata)

\# again, specifying arguments

mylong <- melt(mydata, 
    id.vars = c("subject", "workshop", "gender"),
    measure.vars = c("time1", "time2", "time3", "time4"),
    value.name = "variable")

    \\# reshaping from long to wide 
     mywide <- dcast(mylong, subject + workshop + gender ~ variable)

PYTHON:

#1. Reshaping by pivoting DataFrame objects:

Data is often stored in CSV files or databases in so-called 'stacked' or 'record' format:

df.pivot(index = 'date', columns = 'variable', values = 'value')

#2. Reshaping by stacking and unstacking

Closely related to the pivot function are the related stack and unstack functions currently available on series and DataFrame. Here are essentially what these functions do:

stack: 'pivot' a level of the (possibly hierarchical) column labels, returning a DataFrame with an index with a new inner-most level of row labels.

unstack: inverse operation from stack: 'pivot' a level of the (possibly hierarchical) row index to the column axis, producing a reshaped DataFrame with a new inner-most level of column labels.

If the columns have a MultiIndex, you can choose which level to stack. The stacked level becomes the new lowest level in a MultiIndex on the columns; with a 'stacked' DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack is unstack, which by default unstacks the last level. Notice that the stack and unstack methods implicitly sort the index levels involved. Hence a call to stack and then unstack will result in a sorted copy of the original DataFrame or Series.

stacked.unstack()

stacked.unstack(1)

stacked.unstack('second') # If the indexes have names, you can use the level names instead of specifying the level members

#3. Multiple Levels

You may also stack or unstack more than one level at a time by passing a list of levels, in which case the end result is as if each level in the list were processed individually.

df.stack(level = ['animal', 'hair_length'])

#4. Reshaping by Melt

The melt() function is useful to massage a DataFrame into a format where one or more columns are identifier variables, while all other columns, considered measured variables, are 'unpivoted' to the row axis, leaving just two non-identifier columns, 'variable' and 'value'. The names of those columns can be customized by supplying the varname and value_name parameters.

cheese = pd.DataFrame({'first': ['Join', 'Mary'], 'last', : ['Doe', 'Bo'], 'height' : [5.5, 6.0], 'weight': [130, 150]})

pd.melt(cheese, id_vars = ['first', 'last'])

out :

    first   last   variable   value

0 Join Doe height 5.5

1 Mary Bo height 6.0

2 Join Doe weight 130.0

  1. Join Bo weight 150.0

pd.melt(cheese, idvars = ['first', 'last'], var_name = 'quantity')

Get the same results with ones above

Another way to transform is to use the wide_to_long panel data convenience function

#5. Combining with stats and GroupBy

It should be no shock that combining pivot/stack/unstack with GroupBy and the basic Series and DataFrame statistical functions can produce some very expressive and fast data manipulations.

df.stack() .mean(1). unstack() == df.groupby(level =1, axis = 1).mean()

df.stack().groupby(level = 1).mean()

df.mean().unstack(0)

#6. Pivot tables

The function pandas.pivot_table can be used to create spreadsheet-style pivot tables

pd.pivot_table(df, values = 'D', index = ['A', 'B'], columns = ['C'])

#7. Adding margins

If you pass margins = True to pivot_table, special All columns and rows will be added with partial group aggregates across the categories on the rows and columns .

df.pivot_table(index = ['A', 'B'], columns = 'C', margins = True, aggfunc = np.std)

#8. Cross tabulations

Use the crosstab function to compute a cross-tabulation of two factors. By default crosstab computes a frequency table of the factors unless an array of values and an aggregation function are passed.

df = pd.DataFrame({'A': [1,2,2,2,2], 'B': [3,3,4,4,4], 'C': [1.1, NP.NAN, 1, 1]})

pd.crosstab(df.A, df.B)

$9. Normalization

Frequency tables can also be normalized to show percentages rather than counts using the normalize argument:

pd.crosstab(df.A, df.B, normalize = True)

or

pd.crosstab(df.A, df.B, normalize = 'columns')

or

pd.crosstab(df.A, df.B, values = df.C, aggfunc = np.sum, normalize = True, margins = True)

#10. Computing indicator /dummy variables

To convert a categorical variable into a 'dummy' or 'indicator' DataFrame, fro example a column in a DataFrame ( a series) which has k distinct values, can derive a DataFrame containing k columns of 1s and 0s.

df = pd.DataFrame({'key': list('bbacab'), 'data1', : range(6)})

pd.get_dummies(df['key'])

or

dummies = pd.get_dummies (df['key'], prefix = 'key')

This function is often used along with discretization functions like cut:

values = np.random.randn(10)

bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

pd.get_dummies(pd.cut(values, bins))

get_dummies() also accepts a DataFrame. By default all categorical variables(categorical in the statistical sense, those with object or categorical dtype) are encoded as dummy variables.

df = pd.DataFrame ({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'], 'C': [1,2,3]})

pd.get_dummies(df)

out:

  C  A\__a  A\_b B\_b B\_c_

0 1 1 0 0 1

1 2 0 0 0 1

2 3 1 0 1 0

or you can control the columns that are encoded with the columns keyword.

pd.get_dummies(df, columns = ['A'])

#11. Factorizing values

To encode 1-d values as an enumerated type use factorize:

24. sorting data frames

SAS :

proc sort data = mylib.mydata;
  by workshop;
run;

proc sort data = mylib.mydata;
  by descending gender workshop;
  run;

PYTHON:

obj = series(range(4), index = ['d', 'a', 'b', 'c']) obj.sort_index ()

with a dataframe, you can sort by index on either axis:

frame = dataframe(np.arrange(8). reshape((2, 4)), index = ['three', 'one'], columns = ['d', 'a', 'b', 'c'])

frame.sort_index()

frame.sort_index(axis = 1)

frame.sort_index(axis = 1, ascending = False)

\# To sort a series by its values, use its order method:
in : obj = series(4,7,-3, 2]) obj.order()

\# any missing values are sorted to the end of the series by default

in: obj = series([4, np.nan, 7, np.nan, -3, 2]) obj.order()

\# on dataframe, you may want to sort by the values in one or more columns. to do so, pass one or more column names to the by options:

in : frame = dataframe({'b': [4,7,-3, 2], 'a': [0, 1, 0, 1]})

frame.sort_index(by ='b') frame.sort_index (by = 'a', 'b'])

R:

  \\# show first four observations in order
    mydata[c(1,2,3,4), ]
  \\# show them in reverse order
    mydata [c(4,3,2,1),]
  \\# create order variable for workshop
    myW <- order(mydata$gender, mydata$workshop)
  \\# creating order variable for descending (-) workshop then gender
    myWdG <- order(-mydata$workshop, mydata$gender)
  \\# print data in WG order
  mydata[myWdG, ]
  \\# save data in WdG order
    mydataSorted <- mydata[myWdG, ]

25. Converting data structure

26. Character string manipulations

SAS:

data mylib.giants;
  infile '\myRfolder\giants.txt'
  MISSOVER DSD LRECL = 32767
  INPUT name $char 14. @16 born mmddyy10. @27 died yymmdd10.;
  format born mmddyy10. died yymmdd10.;
  myVarlength = length(name)
  born= strip(born)

  data mylib.giants;
    set mylib.giants;
   mylower = lowcase(name);
   myupper = upcase(name)
   myproper = propcase(name)
 run;

 data mylib.giants;
   set mylib.giants;
   myFirst5 = substr(name, 1 ,5)
     \# split names using substr;
       myblank = find(name, " ");
       myfirst = strip(substr(name, 1, myBlank));
       mylast = strip(substr(name, myBlank));
     \# splip name using scan;
       myfirst = scan(name, 1, " ");
       mylast = scan(name, 2, " ");
       myfirst = tranwrd(myfirst, "R.A.", "Ronald A.");
       length mylastfirst $ 17;
       mylastfirst = strip(mylast)||","||strip(myfirst);
         or call CATX(",", mylastfirst, mylast, myfirst);

      data tukey;
        set mylib.giants;
          where mylast = "Tukey";
      run;

      data tukey;
        set mylib.giants;
          where find(mylast, "key");
      run;

      data mysubset;
        set mylib.giants;
        where mylast in ("Box", "Bayes", "Fisher", "Tukey");
      run;

      data fishorkey;
        set mylib.giants;
        if find(mylast, "Box") |
           find(mylast, "Bayes") |
           find(mylast, "Fish")|
           find(mylast, "key");
           run;

       data ArthruM;
         set mylib.giants;
           firstletter = substr(mylast, 1,1);
           if "A" <= firstletter <= "M";
           run;

PYTHON:

1. split a commma-separated string into a broken pieces

val = 'a,b, guido' val.split(,)

out : ['a', 'b', 'guido']

split is often combined with strip to trim whitespace(including newlines)

pieces = [x.strip() for x in val.split(',')

out : ['a', 'b', 'guido']

2. join together

in : first + '::' + second '::' + third out : 'a::b::guido'

3. detect a substring, through index and find

in : 'guido' in val val.index(,) out : 1 val.find(':') out : -1

note the difference between find and index is that index raises an exception if the string isn't found(versus returning -1)

4. relatively, count returns the number of occurrences of a particular substring

in: val.count(',') out : 2

5. replace will substitute occurrences of one pattern for another. this is commonly used to delete patterns, too, by passing an empty string:

in: val.replace(',', '::') out: 'a::b:: guido'

in: val.place(',', '') out: 'ab guido'

6. python built-in string methods

count: return the number of non-overlapping occurrences of substring in the string
endwith, startwith: returns true if sting ends with suffix
join: use string as delimiter for concatenating a sequence of other strings
index: return position of first character in substring if found in the string. raises valueError if not found.
find: return position of first character of first occurrence of substring in the string. like index, but return -1 if not found
rfind: return position of first character of last occurrence of substring in the string, returens -1 if not found
replace: replace occurrences of string with another string
strip,rstrip, lstrip: trim whitespace, including newlines;

split: break string into list of substrings using passed delimiter
lower, upper
ljust, rjust: left justify or right justify, respectively. pad opposite side of string with spaces to return a string a minimum width.

R:

gender <- c("m", “f", "m", NA, "m", "f", "m", "f")
options(width = 58)
library("stringr")
myVars <- str_c("Var", LETTERS[1:6])
setwd("c:/myRfolder")
giants <- read.fwf(
  file  = "giants.txt",
  width = c(15, 11, 11),
  col.names = c("name", "born", "died"),
  colClasses = c("character", "character", "POSIXct")

  str_length(giants$name)
  giants[giants$name =="R.A. Fisher", ]
  giants[giants$name == "R.A. Fisher   ", ]
  giant$name <- str_trim(giants$name)
  attach(giants)
  str_length(name)
  toupper(name)
  tolower(name)
  library("ggplot2")
  firstupper(tolower(name))
  str_sub(name, 1, 5)
  myNamesMatrix <- str_split_fixed(name, " ", 2)
  myfirst <- myNamesMatrix [ ,1 ]
  myLast <- myNamesMatrix [, 2]
  myFirst <- str_replace_all(myfirst, "R.A.", "Ronald A.")
  mylastFirst <- str_c(mylast, ",", "myfirst)
  myobs <myLast =="Tukey"
  myObs <- which(myLast == "Tukey")
  giants[myObs, ]

myObs <- str_detect(myLast, "key")

 myTable<- c("Box", "Bayes", "Fisher", "Tukey")
  myObs <- mylast %in% myTable

myObs <- str_detect(mylast, "Box|Bayes|Fish|key")

myAthruM <- str_detect(myLastFirst, "^[A-M]")

27. Dates and Times

1. Calculating durations

SAS:

Infile '\myRfolder\giants.txt' MISSOVER DSD LRECL = 32767 input name $char14. @16 born mmddyy10. @27 died mmddyy10.; proc print; run;

proc print; format died born mmddyy10.; run;

data mylib.giants; set mylib.giants; age = (died - born)/365.2425; longAgo = (today() - died)/365.2425; run;

proc print; format died born mmddyy10. age longAgo 5.2; run;

PYTHON:
R:

giants<- read.fwf( file = "giants.txt", width = c(15,11, 11) col.names = c("name", "born", "died") colClasses = c("character", "character", "POSIXct"), row.names = "name", strip.white = TRUE; )

library("lubridate") giants$born <- mdy(giants$born)

as.POSIXct( c(-2520460800, -3558556800, -2207952000, -1721347200, -2952201600), origin = "1960-01-01, tz = "UTC")

age<- difftime(died, born, units = "secs") age <- difftime(died, born) # default age in days

giants$age <- round(as.numeric(age/365.2425), 2)

difftime(now(), died)/365.2425

2. adding durations to date-time variables

SAS:

data mylib.giants; set mylib.giants; died = born +age; run;

PYTHON:
R:

age <- as.duration( c(2286057600, 2495664000, 2485382400, 2685916800, 1935705600) )

3. accessing date-time elements

SAS:

data mylib.giants; set mylib.giants; myYear = YEAR(born); myMonth =MONTH(born); myDay = DAY(born);

PYTHON:
R:

year(born) month(born) day(born) # day of month wday(born) # day of week

4. creating date-time variables from elements

SAS:

data mylib.giants; set mylib.giants; born = MDY(myMonth, myDay, myYear); run;

PYTHON:

In working with time series data, we will frequently seek to:

1) generate sequences of fixed-frequency dates and time spans

2) conform or convert time series to a particular frequency

3) compute 'relative' dates based on various non-standard time increments, or 'roll' dates forward or backward

Overview

Following table shows the type of time-related classes pandas can handle and how to create them.

Class Remarks How to create
Timestamp Represents a single time stamp to_datetime, Timestamp
DatetimeIndex Index of Timestamp to_datetime, date_range, DatetimeIndee
Period Represents a single time span Period
PeriodIndex Index of Period period_range, periodIndex

#1. Time Stamps vs. Time Spans

Time-stamped data is the most basic type of timeseries data that associates values with points in time. For pandas objects it means using the point in time.

pd.Timestamp(datetime(2012, 5, 1) -> Timestamp('2012-05-01 00:00:00')

pd.Timestamp('2012-05-01')

pd.Timestamp(2012,5,1)

However, in many cases it is more natural to associate things like change variables with a time span instead. The span represented by period can be specified explicitly, or inferred from datetime string format.

pd.Period('2011-01') -> Period('2011-01', M)

pd.Period('2012--05', freq = 'D') -> Period ('2012-05-01', 'D')

#2. Converting to Timestamps

To convert a Series or list-like object of date-like objects e.g. strings, epochs, or a mixture, you can use the to_datetime function. When passed a Series, this returns a Series , while a list-like is converted to a DatetimeIndex:

pd.to_datetime(pd.Series(['Jul 31, 2009', '2010-01-10', none]))

pd.to_datetime(['2005/11/23', '2010.12.31'])

If you use dates which start with the day first(i.e. European style), you can pass the dayfirst flag:

pd.to_datetime(['04-01-2012 10:00'], dayfirst = True)

if you pass a single string to todatetime, it returns single Timestamp. Also, Timestamp can accept the string input. note that Timestamp doesn't accept string parsing option like dayfirst or format, use to_datetime if these are required.

pd.to_datetime('2010/11/12') -> Timestamp('2010-11-12 00:00:00')

pd.Timestamp('2010/11/12') -> Timestamp('2010-11-12 00:00:00)

You can also pass a DataFrame of integer or string columns to assemble into a series of Timestamps.

df = pd.DataFrame({'year': [2015, 2016], 'month': [2,3], 'day': [4,5], 'hour': [2,3]})

pd.to_datetime(df)

pd.to_datetime looks for standard designations of the datetime component in the column names, including :

required: year, month, day

optional: hour, minute, second, millsecond, microsecond, nanosecond

#3. Invalid data

in version 0.17.0, the default for to_datetime is now errors = 'raise', rather than errors = 'ignore'. this means that invalid parsing will raise rather than return the original input as in previous versions.

Pass errors = 'coerce' to convert invalid data to NaT:

pd.to_datetime(['2009/07/31', 'asd'], errors = 'coerce')

#4. Epoch Timestamps

it is also possible to convert integer or float epoch times. the default unit for these is nanoseconds (since these are how timestamps are stored). However, often epochs are stored in another unit which can be specified:

pd.to_datetime([1349720105, 1349806505, 1349892905, 1349979305, 1350065705], unit = 's')

#5. Generating ranges of timestamps

To generate an index with time stamps, you can use either the DatetimeIndex or index constructor and pass in a list of datetime objects:

dates = [datetime(2012, 5,1), datetime(2012, 5,2), datetime(2012, 5, 3)]

# note the frequency information

    index = pd.DatetimeIndex\(dates\)

# Automatically converted to DatetimeIndex

    index = pd.Index\(dates\)

Practically, this becomes very cumbersome because we often need a very long index with a large number of timestamps. if we need timestamps on a regular frequency, we can use the pandas functions date_range and bdate_range to create timestamp indexes.

index = pd.date_range('2001-1-1', periods = 1000, freq = 'M')

or

start = datetime(2011, 1,1)

end = datetime(2012, 1,1)

rng = pd.date_range(start, end)

R:

myYear <- year(died) myMonth <- month(died) myDay <- day(died)

myDateString <<- paste(myYear, myMonth, myDay, SEP = "/") died2 <- ymd(myDateString)

5. logical comparisons with date-time variables

SAS:

data born1900s; set mylib.giants; if born > "01jan1900"d; run;

R:

giants[born >mdy("1/1/1900"), ]

6. formatting date-time output

proc format; picture myFormatI LOW-HIGH = '%B %d, %Y is day %j of %Y' (DATATYPE = DATE) RUN;

proc print data = mylib.giants; var born; format born myFormatI40.; run;

data null; set mylib.giants; put name $char14. born myFormatII34.; run;

7. two-digit years

8. date-time conclusion

28. Loops

29. Functions

results matching ""

    No results matching ""