# Chapter 13: Statistical Analysis with Pingouin

*Pingouin* ^{} is a compact package that provides the most important test tools for a significance study.

**Info**: It is worth visiting the *Pingouin* website ^{}. It provides a very good overview of available significance tests and also a decision tree that helps to select the correct test for the respective data set.

Screenshots from pingouin-stats.org/guidelines.html ^{} (taken on March 2, 2021):

Let’s recap our last example from the *Pandas* chapter:

```
# Example from the Pandas chapter (just with the Violin plot):
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import pingouin as pg
# Define file paths:
file_path = "Data/Pandas_1/"
file_name_1 = "Group_A_data.xls"
file_name_2 = "Group_B_data.xls"
file_1 = os.path.join(file_path, file_name_1)
file_2 = os.path.join(file_path, file_name_2)
# Read the Excel files with Pandas into a Pandas Dataframe:
Group_A_df = pd.read_excel(file_1, index_col=0)
Group_B_df = pd.read_excel(file_2, index_col=0)
# Broadcast the DataFrame data into the approproate variables:
Group_A = Group_A_df["Data"].values
Group_B = Group_B_df["Data"].values
# VIOLIN-PLOTS:
fig=plt.figure(3, figsize=(5,6))
fig.clf()
plt.violinplot([Group_A, Group_B], showmedians=True)
plt.xticks([1,2], labels=["A", "B"])
plt.xlabel("Groups")
plt.ylabel("measurements")
plt.title("Violin plot")
plt.tight_layout
# plt.ylim(-40, 40)
plt.show()
fig.savefig("violinplot with data.pdf", dpi=120)
```

We would like to know whether the difference between the two groups is significant or not. Let’s assume, that our data is normally distributed and the two samples are independent. The corresponding test would be an unpaired, two-sample student’s t-test. The corresponding *Pingouin* command is the following:

```
test_result = pg.ttest(Group_A, Group_B, paired=False)
print(test_result)
```

T | dof | tail | p-val | CI95% | cohen-d | BF10 | power | |
---|---|---|---|---|---|---|---|---|

T-test | 1.154936 | 98 | two-sided | 0.250925 | [-2.19, 8.28] | 0.230987 | 0.381 | 0.208113 |

As we can see, the output of this *Pingouin* command is not just a single value, e.g., the *p*-value, but a table. To be more correct, this table actually is a *Pandas* DataFrame. Besides the *p*-value (`p-val`

), the DataFrame contains other useful statistical properties:

`T`

:*T*-value`p-val`

:*p*-value`dof`

: degrees of freedom`cohen-d`

: Cohen’s*d*effect size^{}`CI95%`

: 95% confidence intervals of the difference in means`power`

: achieved power of the test ( = 1 - type II error)`BF10`

: Bayes Factor of the alternative hypothesis

Note, that for DataFrames with many keys, by default the `print`

command doesn’t print out all of them. To get a list of all available keys, use the command `DataFrame.keys()`

:

```
# applied to our test_result-DataFrame:
print(test_result.keys())
```

`Index(['T', 'dof', 'tail', 'p-val', 'CI95%', 'cohen-d', 'BF10', 'power'], dtype='object')`

To access the data of a specific key, just use the same syntax as for dictionaries and add “`.values`

” in order to “extract” the value from the DataFrame structure:

```
print(f"the p-value of our test is:",
f"{test_result['p-val'].values}")
```

`the p-value of our test is: [0.25092508]`

## Exercise 1

Copy the solution of the *Pandas* example above into a new script. Extend your script by the following functionalities:

- Apply the significance test to the
`Group_A`

and`Group_B`

data. - Write an
*if*-statement that checks, whether the*p*-value of the significance test is lower or greater than 0.05. If the*p*-value is lower, then print out “*there is a significant difference*” together with the according*p*-value, otherwise print out “*no significant difference*” together with the according*p*-value. - Plug-in the significance check-up into a function, e.g. named
`unpaired_two_sample_ttest(...)`

.

```
# Your solution 1.1 and 1.2 here:
```

`no significant difference (p-value: [0.25092508])`

## Toggle solution

```
# Your solution 1.3 here:
```

`Group_A vs. Group_B:`

`no significant difference (p-value: [0.25092508])`

`Group_A vs. Group_B2:`

`significant difference (p-value: [0.02157014])`

## Toggle solution

## Outlook

Now imagine, that you have several Excel files to be read and processed. In this case it’s highly recommended to plug-in, e.g., the data import and the analysis part into *for*-loops, while the imported data have to be plugged into iterable *NumPy* Arrays. Also think about that you have several data project folders and you don’t want to look into each of them. In order prevent your routine from crashing, you could implement an *if*-statement that checks during the data import, whether files could be imported or not. And, whenever possible, plug-in complex and repetitive calculations and procedures into a function definition in order to keep your code readble and to save lines of unnecessary code.