# Statistical data analysis with Pandas and Pingouin (extended)

This chapter is an extension of the Statistical data analysis with Pingouin and Reading data with Pandas chapter from the Python Basics course.

We start again from scratch and first create an artificial test data set:

## Exercise 1: Create some dummy data samples

1. Create a new script and define the following NumPy arrays as dummy data arrays:
np.random.seed(1)
Group_A = np.random.randn(10) * 10 + 5
Group_B = np.random.randn(10) * 10 + 2

2. Plot the averages of Group_A and Group_B in a bar-plot in a new figure window:
• use the plot command:

plt.bar([1, 2], ["Mean of Group A", "Mean of Group B"]).

Hint: “Mean of Group A” and “Mean of Group B” are just placeholders! Replace this with the according NumPy averaging command :-)

• define the x-tick labels via plt.xticks([1,2], labels=["A", "B"])
• add appropriate x- and y-labels and a title to your plot.
• save your plot as a PDF.
3. Plot the values of Group_A and Group_B, respectively, in a boxplot in another figure window:
• set the figure aspect ratio to 5x6 via fig=plt.figure(2, figsize=(5,6))
• use the plot command plt.boxplot([Group_A, Group_B])
• define the x-tick labels as in 2.
• add appropriate x- and y-labels and a title to your plot.
• save your plot as a PDF.
4. Same as 3., but now use the command plt.violinplot([Group_A, Group_B], showmedians=True) to plot a violin plot in another figure.
# Your solution 1.1-1.2 here



Toggle solution

# Your solution 1.3 here:



Toggle solution

# Your solution 1.4 here:



Toggle solution

## Statistical signficance test

We would like to know, whether the difference between the two groups is significant or not.

To perform a significance test in Python we use Pingouin , a compact package that provides the most important test tools for a significance study.

Info: It is worth visiting the Pingouin website . It provides a very good overview of available significance tests and also a decision tree that helps to select the correct test for the respective data set.

Screenshots from pingouin-stats.org/guidelines.html (taken on March 2, 2021):

Let’s assume that our data is normally distributed and the two samples are independent. The corresponding test would be an unpaired, two-sample student’s t-test. The corresponding Pingouin command is:

import pingouin as pg
test_result = pg.ttest(Group_A, Group_B, paired=False)
#print(test_result)
pg.ttest(Group_A, Group_B, paired=False)

T dof tail p-val CI95% cohen-d BF10 power
T-test 0.718772 18 two-sided 0.481509 [-7.16, 14.61] 0.321445 0.477 0.104604

As we can see, the output of Pingouin is not just a single value, e.g., the p-value, but a table of useful statistical properties:

• T : T-value
• p-val : p-value
• dof : degrees of freedom
• cohen-d : Cohen’s d effect size
• CI95% : 95% confidence intervals of the difference in means
• power : achieved power of the test ( = 1 - type II error)
• BF10 : Bayes Factor of the alternative hypothesis

To be more correct, this output is actually a so-called Pandas DataFrame, a type of variable construct. Pandas DataFrames are a handy and powerful representation of tables in Python, which has - similar to a 2D NumPy array - rows and columns, that can be directly accessed via their column names, called key (similar to the keys of dictionaries). In very long tables/DataFrames, by default not all columns/keys are shown. To get a list of all available keys, use the following command: DataFrame.keys()

# show all keys of the test_result DataFrame:
print(test_result.keys())


Index(['T', 'dof', 'tail', 'p-val', 'CI95%', 'cohen-d', 'BF10', 'power'], dtype='object')

To access a specific column/key, just use the same syntax as for dictionaries and add .values (in order to extract the values from the DataFrame structure and retrieve a NumPy array):

print(f"the p-value of our test is:",
f"{test_result['p-val'].values}")


the p-value of our test is: [0.48150881]

## Exercise 2

Add a new cell to the script of the solution from Exercise 1. In the new cell, extend your script by the following functions:

1. Write an if-statement that checks whether the p-value of the significance test is lower or greater than 0.05. If the p-value is lower, then print out “there is a significant difference” together with the according p-value. Otherwise print out “no significant difference”, again, together with the according p-value.
2. Extent your if-statement, so that also Cohen’s d effect size is printed out, but only if the p-value indicates a significant difference.
3. Plug both, the significance test and your written if-statement into a function called normal_unpaired_sigtest(A, B), which also returns the p-value back to your main script.
4. In your main script, call your newly created function with the Group_A and Group_B data.
5. Modify the Group_A definition by Group_A = np.random.randn(10)*10+15 and re-run your script.
6. Create another function for the two-sample, non-parametric, un-paired signficance test:
• visit pingouin-stats.org/guidelines.html to find the appropriate test
• copy your function from 2. and rename the copy to notnormal_unpaired_sigtest(A, B)
# Your solution 2.2 - 2.4:



no significant difference (p-value: [0.48150881])

Toggle solution

# Your solution 2.6 here:



no significant difference (p-value: [0.42735531])

Toggle solution
Toggle full code solution

## Let’s apply the analysis to real world data

In order to import some data, which are stored in Excel files, we use the Pandas command pd.read_excel(path_to_file, index_col=0):

import pandas as pd
import os

import numpy as np
import matplotlib.pyplot as plt
import pingouin as pg

# Define file paths:
file_path = "Data/Pandas_1/"

""" file_path is the main root path.

The definition above is a so-called relative file path -
relative from the folder, that contains the script you
are currently executing.

A so-called absolute path (and therefore indepenent from
your script's folder) can be defined as follows:

file_path = "/Users/Fabrizio/Python/Kurs/Data/Pandas_1/"

or

file_path = "C:/Users/Fabrizio/Python/Kurs/Data/Pandas_1/"

"""

file_name_1 = "Group_A_data.xls"
file_name_2 = "Group_B_data.xls"

file_1 = os.path.join(file_path, file_name_1)
#  actually it does: file_1 = file_path + file_name_1
file_2 = os.path.join(file_path, file_name_2)

""" The os.path.join() command just sticks the different
file-path components together. You can also just write:

file_1 = file_path + file_name_1
file_2 = file_path + file_name_2

Never forget to put the requiered '/' at the end of
the file_path defintion::

file_path = "Data/Pandas_1/"  ⟵ okay
file_path = "Data/Pandas_1"   ⟵ not okay
"""

# Read the Excel files with Pandas into a Pandas Dataframe:

# some print test:
Group_A_df.iloc[0:10] # with the iloc command we can access
# DataFrame rows by standard indexing and
# slicing rules. Here, we didn't want to
# print out all rows of the DataFrame.
#print(Group_A_df.iloc[0:10])

Data
0 18.074201
1 13.086849
2 2.272670
3 20.283832
4 26.357661
5 45.342706
6 47.139108
7 1.427794
8 7.846927
9 26.665037

The two Excel files are imported as DataFrames into Group_A_df and Group_B_df, respectively. In the next step, we extract the DataFrame data into two NumPy arrays:

# Extracting the DataFrame import data:
Group_A = Group_A_df["Data"].values
Group_B = Group_B_df["Data"].values


## A few Pandas basics

As stated in the introductory course, Pandas can do much more than just Excle file reading. The Pandas DataFrame comes – simialar to NumPy – equipped with a bunch of useful built-in function:

print(f"shape of Group A data: {Group_A_df.shape}")
print(f"shape of Group B data: {Group_B_df.shape}")


shape of Group A data: (50, 1)
shape of Group B data: (50, 1)

We see that the data of our two groups is one-dimensional, and

print(f"keys (columns) in Group A data: {Group_A_df.keys()}")
print(f"keys (columns) in Group B data: {Group_B_df.keys()}")


keys (columns) in Group A data: Index(['Data'], dtype='object')
keys (columns) in Group B data: Index(['Data'], dtype='object')

the main (and only - take a look at the Excel files-) column’s key-name is “Data”.

We can do some further first examination of our data:

print(f"first 5 rows in Group A data: {Group_A_df.head(5)}")


first 5 rows in Group A data: Data
0 18.074201
1 13.086849
2 2.272670
3 20.283832
4 26.357661

print(f"last 5 rows in Group A data: {Group_A_df.tail(5)}")


last 5 rows in Group A data: Data
45 24.808727
46 9.782053
47 2.438080
48 31.641609
49 24.247804

print(f"average of Group A data: {Group_A_df.mean().values}",
f"+/- {Group_A_df.std().values}",)
print(f"average of Group B data: {Group_B_df.mean().values}",
f"+/- {Group_B_df.std().values}",)


average of Group A data: [17.47999084] +/- [14.40494573] average of Group B data: [14.43276578] +/- [11.85600989]

# general descriptive statistis of your dataframe:
Group_A_df.describe()

Data
count 50.000000
mean 17.479991
std 14.404946
min 0.922640
25% 7.307007
50% 13.226441
75% 26.161001
max 60.000000

## Exercise 3

1. Copy your script from the Pingouin Exercise 2.
2. Add the Pandas Excel file import commands from above to your script.
3. Uncomment or redefine your Group_A and Groud_B variable definitions by the following expression:
Group_A = Group_A_df["Data"].values
Group_B = Group_B_df["Data"].values

# Your solution 3 here (full script):



here is a significant difference (p-value: [0.02157014], cohen-d:[0.46704775])

Toggle full script solution

## Exercise 4: Extended real world data example

In the second part of our statistics tutorial we expand our script, so that it can analyse Excel tables with more than just one column. Also, we work with .xlsx files instead of .xls files. Unfortuantely, I just discovered that xlrd has explicitly removed the support for anything other than .xls files (due to “potential security vulnerabilities”, source ). To overcome this problem, please install the

• openpyxl

package.

1. Copy the Excel file reading part from your previous script (or copy it from “Solution 3 (Full script)”) into a new script.
2. Change the main data path to Data/Pandas_3/ and adjust the file names to the Group A and Group B data in that folder.
3. In order to be able to read .xlsx files, add the argument engine='openpyxl to the Pandas read command, e.g.
pd.read_excel(file_1, index_col=0, engine='openpyxl')

4. Similar to dictionaries, print out all column names (“keys”) of the read Excel files, e.g.:
for key in Group_A_df.keys():
print(f"{key}")

# Your solution 4.1-4.4 here:



Mouse 1
Mouse 2
Mouse 3
Mouse 4
Mouse 5
Mouse 6
Mouse 7
Mouse 8
Mouse 9
Mouse 10

Toggle solution

Group_A_df.keys()


Index(['Mouse 1', 'Mouse 2', 'Mouse 3', 'Mouse 4', 'Mouse 5', 'Mouse 6', 'Mouse 7', 'Mouse 8', 'Mouse 9', 'Mouse 10'], dtype='object')

Group_A_df.head(7)

Mouse 1 Mouse 2 Mouse 3 Mouse 4 Mouse 5 Mouse 6 Mouse 7 Mouse 8 Mouse 9 Mouse 10
0 18.074201 23.066936 60.000000 20.711322 60.000000 28.763525 42.427278 15.598740 13.465411 13.196517
1 13.086849 10.323111 19.974086 38.248952 25.587832 53.735452 16.635397 16.495615 44.782381 23.109105
2 2.272670 17.602005 22.705996 19.140288 24.739332 12.594575 41.966899 13.633330 25.891547 14.908281
3 20.283832 28.179423 36.580909 38.064844 29.033290 9.057918 60.000000 20.301198 18.622461 13.755511
4 26.357661 60.000000 29.650851 20.149923 44.583438 5.217009 48.860477 37.083126 18.306000 9.591119
5 45.342706 9.259463 10.029454 35.167710 14.413221 26.552184 3.281541 9.668325 15.764203 19.503650
6 47.139108 25.335681 10.954845 19.216845 45.489924 48.642575 26.566834 3.103062 12.845208 34.832844
print(Group_A_df.shape)


(50, 10)

#print(Group_A_df.describe())
Group_A_df.describe()

Mouse 1 Mouse 2 Mouse 3 Mouse 4 Mouse 5 Mouse 6 Mouse 7 Mouse 8 Mouse 9 Mouse 10
count 50.000000 50.000000 50.000000 50.000000 50.000000 50.000000 50.000000 50.000000 50.000000 50.000000
mean 17.479991 24.648196 27.360234 26.124974 28.200733 18.042231 21.736532 18.573062 26.391034 24.842926
std 14.404946 14.363733 12.986157 11.570712 11.908364 13.552271 15.680920 13.948133 12.477335 12.394661
min 0.922640 9.259463 9.270425 9.429189 9.777123 0.062906 0.376473 0.472434 9.382286 9.549946
25% 7.307007 13.435652 19.699324 18.230936 18.421391 7.620068 7.607388 7.634176 17.837596 14.267775
50% 13.226441 19.900726 24.955239 24.172694 26.158784 14.903794 18.857441 16.047178 23.196840 23.064810
75% 26.161001 28.072406 33.454855 32.736302 34.073416 27.297384 35.213392 26.562242 34.536572 31.782738
max 60.000000 60.000000 60.000000 60.000000 60.000000 53.735452 60.000000 56.025684 60.000000 60.000000

Exercise 4 continued:

1. Calculate and print the median of each column of each data table. Hint: Take a look, how we calculated the mean with the built-in Pandas mean function shown in Section 5.
2. Save the calculated medians into the variables Group_A and Group_B.
3. Copy the remaining plot and analysis part from your previous script/”Solution 3 (Full script)” into your script and run your pipeline.
4. Re-run your script with the Group A2 and Group B2 data.
# Your solution 4c.1-4c.2 here:



Group A medians: [13.22644106 19.90072561 24.95523869 24.17269392 26.15878378 14.9037943 18.85744129 16.04717778 23.19683969 23.06480993]
Group B medians: [23.59495388 11.61376136 18.68395653 20.8270858 21.82640825 14.64554514 20.27448066 17.02615332 11.41902818 17.77334372]

Toggle solution

print(f"{type(Group_A_df)}")
print(f"{type(Group_A)}")


<class 'pandas.core.frame.DataFrame'>
<class 'numpy.ndarray'>

dummy = Group_A_df.values
print(f"{type(dummy)}")
print(f"{dummy.shape}")


<class 'numpy.ndarray'>
(50, 10)

# the other way around: create a new dataframe from a given
# NumPy array:
df2 = pd.DataFrame(data=dummy)
df2

0 1 2 3 4 5 6 7 8 9
0 18.074201 23.066936 60.000000 20.711322 60.000000 28.763525 42.427278 15.598740 13.465411 13.196517
1 13.086849 10.323111 19.974086 38.248952 25.587832 53.735452 16.635397 16.495615 44.782381 23.109105
2 2.272670 17.602005 22.705996 19.140288 24.739332 12.594575 41.966899 13.633330 25.891547 14.908281
3 20.283832 28.179423 36.580909 38.064844 29.033290 9.057918 60.000000 20.301198 18.622461 13.755511
4 26.357661 60.000000 29.650851 20.149923 44.583438 5.217009 48.860477 37.083126 18.306000 9.591119
5 45.342706 9.259463 10.029454 35.167710 14.413221 26.552184 3.281541 9.668325 15.764203 19.503650
6 47.139108 25.335681 10.954845 19.216845 45.489924 48.642575 26.566834 3.103062 12.845208 34.832844
7 1.427794 12.321790 31.902779 18.709343 22.324181 13.551915 20.291572 28.838343 41.150187 42.268679
8 7.846927 60.000000 26.200162 26.720212 19.633351 31.817556 33.593050 2.118014 40.808585 24.551126
9 26.665037 14.995325 20.649107 25.755907 18.867057 10.498226 2.170260 3.044227 13.793600 14.054273
10 0.922640 60.000000 11.326179 23.560684 22.050686 0.172303 32.466781 25.394094 23.179101 26.588131
11 4.326630 27.751354 21.395401 19.939066 23.044234 28.377354 41.063390 2.535108 47.836518 50.371135
12 9.157503 37.444383 22.488169 27.579036 33.046569 30.581410 24.489401 11.408426 12.707566 23.685907
13 29.647979 60.000000 10.276356 18.532377 50.371736 14.600049 6.179677 0.599478 9.884708 31.209116
14 8.742000 13.402487 13.829782 40.112201 33.829378 15.207540 7.091670 32.604269 30.323622 31.973945
15 25.571023 18.113789 30.915942 26.935311 45.771603 15.868325 9.154542 26.951625 24.522718 14.945380
16 10.862005 43.721057 54.936913 25.452780 34.154762 4.038185 13.411006 41.784804 20.798726 9.617624
17 4.951999 11.777482 19.827426 12.798417 34.864567 1.881381 50.341528 24.640953 13.600837 24.316330
18 12.602611 44.989141 31.756518 60.000000 50.767570 20.788953 10.415866 5.427315 32.450598 38.791586
19 11.634356 18.588259 24.700600 17.620996 43.425558 45.023374 21.967371 9.176215 27.008075 24.106791
20 34.598150 23.195795 26.490087 12.446582 30.696737 5.961492 2.909811 0.472434 39.430237 44.301112
21 8.010439 17.133937 21.597454 17.322758 12.003452 13.956856 17.423311 4.319208 49.054840 47.313228
22 14.213581 13.013907 42.684656 52.144749 43.825369 20.861080 39.967672 16.577972 13.891236 35.827849
23 17.283408 45.924542 51.593317 55.568849 33.648052 15.822137 35.904421 8.848024 24.704299 11.604016
24 7.231297 18.351918 15.818481 28.129001 31.764006 7.846267 22.262767 12.155190 25.912714 27.630061
25 37.431148 14.271629 10.372286 18.130455 9.777123 43.152748 3.728966 49.090144 18.187796 51.745546
26 60.000000 27.090419 28.996761 14.138737 44.961747 15.467501 2.651519 18.647221 19.375649 19.259184
27 15.263485 24.453000 39.158395 29.291996 24.467756 28.393238 30.909583 15.293277 35.231897 26.813921
28 19.726142 19.591767 17.227544 15.210792 31.066936 7.544668 17.086464 16.578254 21.923898 18.220372
29 53.398856 35.515243 24.882107 23.198237 16.201861 11.142045 0.376473 3.806705 37.843522 21.460346
30 1.999801 41.338417 19.234954 34.182026 15.984169 20.502174 37.557506 10.702520 18.995861 11.977590
31 23.165529 9.897616 10.463554 9.429189 15.821055 14.059610 5.548009 30.102911 18.080977 15.059878
32 17.817273 26.074054 26.262574 31.639008 13.894720 49.942595 0.544130 7.229560 27.405940 18.695652
33 26.506697 12.475491 28.759075 26.614726 29.884389 6.496984 13.228164 29.737218 19.913229 17.977265
34 1.204476 37.155684 9.270425 18.976416 34.428537 22.663872 12.654010 5.705055 17.756469 36.962189
35 7.534135 11.614547 30.238235 47.987283 20.302559 3.249226 17.072194 22.333842 9.382286 28.667410
36 10.269124 25.800587 49.434714 39.438373 32.977586 5.673104 1.781834 48.147802 49.861266 9.549946
37 1.679551 22.546229 25.028371 34.567647 15.254913 18.676291 39.989288 14.035692 26.532216 22.792065
38 34.870874 12.062923 33.691626 19.715068 18.272835 27.545784 1.809349 46.898462 18.139465 11.661021
39 3.548040 20.209684 19.656624 31.864163 25.997314 0.553830 35.753506 3.232649 17.635220 11.559999
40 13.366033 13.707039 34.500235 23.116394 27.482170 31.895751 12.781879 18.189214 17.681920 33.061807
41 3.950274 10.492658 33.823570 15.996040 19.019034 4.585402 22.200923 56.025684 40.288950 12.298093
42 4.664843 32.356748 35.535390 14.709471 17.504142 22.546489 27.582803 14.589289 48.318855 22.106622
43 8.136822 13.535150 21.903188 10.697075 10.426118 28.748127 16.332092 33.934607 26.286119 60.000000
44 28.295757 22.036282 48.424268 24.784703 41.437238 0.062906 40.008105 18.073233 23.214578 39.247507
45 24.808727 11.418740 24.051138 33.027015 24.838067 17.460130 12.595379 31.639951 10.456252 10.361826
46 9.782053 12.168356 32.744544 14.103582 32.015211 8.679609 25.330024 21.230605 60.000000 23.020514
47 2.438080 17.025026 14.555416 25.361097 15.838776 12.936283 5.739870 4.977422 29.575528 25.947061
48 31.641609 18.787834 21.511244 14.716543 26.320254 12.441082 43.123952 14.849948 44.978574 13.966244
49 24.247804 26.292905 60.000000 35.324496 17.926226 6.274465 29.598034 20.822737 21.749836 27.680942
df2.to_csv("Data/outcsv_df2.csv")
# there is also an Excel export. But I have to look it up (one
# needs an "ExcelWriter" module)

# Your solution 4c.3-4c.4 here:



no significant difference (p-value: [0.1858791])

Toggle solution

## Outlook

We can also easily and quickly plot all data columns of an Excel file via:

# Matplotlib:
plt.plot(Group_A_df)


# or use the Pandas built-in plot function:
Group_A_df.plot()


Also, we can easily plot the Grand average over all columns via:

#Group_A_df.plot()
Group_A_df.plot(xlabel="time", ylabel="voltage")
#Group_A_df.mean(axis=1).plot()
Group_A_df.mean(axis=1).plot(lw=5, c="k", label="average")
plt.legend(loc="best",bbox_to_anchor=(1.05, 1), title="Legend:")


updated: