This is Chapter 9 of Py4Bio.
NumPy is great for numerical computation.
However, in biology (and in any other data-science domain), we often have to deal with datasets that do not consist only of numbers.
Think about a spreadsheet, or a table.
Your data will contain sample IDs, information about the samples, and categorical as well as numerical variables.
Your data will probably be messy. Before doing any specific analysis, you will need to clean it.
Welcome to Pandas.
Pandas is a powerful open-source data manipulation and analysis library for the Python programming language, built on top of NumPy. It provides efficient data structures, such as DataFrame and Series, which are well suited to the large tabular datasets commonly encountered in bioinformatics.
Once installed, you can import it into your Python script:
import pandas as pd
Data Structures in Pandas: Series and DataFrame
A Series is a one-dimensional array-like object that can hold data of any type, such as integers, strings, or floats. Each element in a Series has an associated index.
import pandas as pd
# Example: expression values for several genes in a single sample
genes = ['Gene1', 'Gene2', 'Gene3', 'Gene4']
expression_values = [10.5, 7.8, 13.2, 8.1]
gene_expression = pd.Series(expression_values, index=genes)
print(gene_expression)
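Because every value in a Series carries a label, it can be looked up by that label and used in element-wise arithmetic. The following is a minimal sketch that reuses the illustrative values from the example above; the variable names `gene3_value`, `doubled`, and `mean_expression` are chosen here for illustration only.

```python
import pandas as pd

# Same illustrative values as in the Series example
gene_expression = pd.Series(
    [10.5, 7.8, 13.2, 8.1],
    index=['Gene1', 'Gene2', 'Gene3', 'Gene4'],
)

# Look up a single value by its index label
gene3_value = gene_expression['Gene3']

# Arithmetic is applied element-wise and the index is preserved
doubled = gene_expression * 2

# Summary statistics operate on the whole Series
mean_expression = gene_expression.mean()

print(gene3_value)
print(doubled)
print(mean_expression)
```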
A DataFrame is a two-dimensional table, much like a database table or an Excel spreadsheet, where each column can be a different type (e.g., strings, numbers). This is the most widely used structure for storing bioinformatics data.
# Sample sequencing data in a dictionary
data = {
    'Gene': ['Gene1', 'Gene2', 'Gene3', 'Gene4'],
    'Sample1': [10.5, 7.8, 13.2, 8.1],
    'Sample2': [9.3, 6.5, 14.1, 7.9],
    'Sample3': [11.2, 8.4, 12.9, 8.5]
}
# Convert the dictionary into a DataFrame
df = pd.DataFrame(data)
print(df)
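Once a DataFrame exists, it is usually worth checking its dimensions and column types before going further. The sketch below rebuilds the same illustrative table and inspects it with `shape`, `dtypes`, and `describe()`; the variable names are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Gene': ['Gene1', 'Gene2', 'Gene3', 'Gene4'],
    'Sample1': [10.5, 7.8, 13.2, 8.1],
    'Sample2': [9.3, 6.5, 14.1, 7.9],
    'Sample3': [11.2, 8.4, 12.9, 8.5],
})

# Number of rows and columns
n_rows, n_cols = df.shape

# Data type of each column (text for 'Gene', floats for the samples)
column_types = df.dtypes

# Count, mean, standard deviation, and quartiles of the numeric columns
sample_summary = df.describe()

print(n_rows, n_cols)
print(column_types)
print(sample_summary)
```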
However, most of the time you will load your data from a file.
Pandas can read and write a variety of file formats, such as CSV, Excel, and TSV. In bioinformatics, data is often stored in these formats.
# Loading a CSV file containing gene expression data
df = pd.read_csv('gene_expression.csv')
print(df.head())
Pandas also supports reading tab-delimited files, common in bioinformatics.
# Loading data from a tab-delimited file
df = pd.read_csv('gene_expression.tsv', sep='\t')
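The file names above are placeholders, so those snippets only run if such files exist. As a self-contained sketch, the same calls can be tried on in-memory text with `io.StringIO`; the two-gene table and the use of `index_col='Gene'` are assumptions made for this illustration.

```python
import io
import pandas as pd

# Small in-memory stand-ins for a CSV file and a TSV file
csv_text = "Gene,Sample1,Sample2\nGene1,10.5,9.3\nGene2,7.8,6.5\n"
tsv_text = csv_text.replace(",", "\t")

# index_col uses the 'Gene' column as row labels instead of 0, 1, 2, ...
df_csv = pd.read_csv(io.StringIO(csv_text), index_col='Gene')
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t', index_col='Gene')

print(df_csv)
print(df_csv.equals(df_tsv))  # both formats yield the same table
```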
Indexing and Subsetting in Pandas: Descriptions and Examples
Indexing in Pandas refers to selecting specific rows, columns, or both from a DataFrame or Series. The index is essentially the label for rows, and in Pandas, it can be a simple range of integers, dates, or any other values. You can access and manipulate data in Pandas using labels, integer-based indices, or Boolean conditions.
There are several ways to index and subset data, such as using .loc[], .iloc[], and .at[]. Each of these has its specific use case for selecting data based on labels or positions.
The .loc[] indexer is used to index a DataFrame or Series by labels. This is useful when you want to select data using row and column labels (index names).
import pandas as pd
# Sample gene expression data
data = {
    'Gene': ['Gene1', 'Gene2', 'Gene3', 'Gene4'],
    'Sample1': [10.5, 7.8, 13.2, 8.1],
    'Sample2': [9.3, 6.5, 14.1, 7.9],
    'Sample3': [11.2, 8.4, 12.9, 8.5]
}
# Create DataFrame
df = pd.DataFrame(data)
# Set 'Gene' as index
df.set_index('Gene', inplace=True)
# Accessing the expression values of 'Gene1' using .loc[]
gene1_expression = df.loc['Gene1']
print(gene1_expression)
Here, the .loc[] indexer is used to access the expression values for Gene1 across the samples. The gene name ('Gene1') is used as the label (index), and the corresponding row is returned as a Series with the expression values for Sample1, Sample2, and Sample3.
Indexing by Position: .iloc[]
The .iloc[] indexer is used to index a DataFrame or Series by integer positions. This is useful when you want to select rows or columns based on their position rather than their label.
# Accessing the first row and second column using .iloc[]
first_row_second_column = df.iloc[0, 1]
print(first_row_second_column)
In this example, .iloc[] is used to access the value at the first row (index 0) and the second column (index 1) in the DataFrame. This gives the expression value for the first gene in Sample2.
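The .at[] accessor mentioned earlier is not demonstrated in this chapter. As a minimal sketch, it reads or writes exactly one cell, addressed by row label and column label; the two-sample table below is a shortened version of the illustrative data used above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Sample1': [10.5, 7.8, 13.2, 8.1],
        'Sample2': [9.3, 6.5, 14.1, 7.9],
    },
    index=pd.Index(['Gene1', 'Gene2', 'Gene3', 'Gene4'], name='Gene'),
)

# Read a single cell by row label and column label
value = df.at['Gene3', 'Sample2']
print(value)

# Overwrite that single cell in place
df.at['Gene3', 'Sample2'] = 14.5
print(df.loc['Gene3', 'Sample2'])
```

For a single value, .at[] and .loc[] return the same result; .at[] only accepts scalar labels, whereas .loc[] also accepts lists, slices, and Boolean masks.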
Subsetting in Pandas
Subsetting in Pandas refers to the process of selecting a specific subset of data from a DataFrame or Series. This can be done by filtering rows or columns using labels, conditions, or positional indexing. Subsetting is often used to focus on specific groups or measurements in biological datasets.
You can subset a DataFrame by selecting one or more columns using column labels.
# Subsetting columns 'Sample1' and 'Sample2'
subset_columns = df[['Sample1', 'Sample2']]
print(subset_columns)
Subsetting by Row with Conditions
You can also subset data by applying conditions to filter rows. This is particularly useful in bioinformatics for selecting genes based on certain criteria, such as expression levels above a threshold.
# Subsetting rows where expression in 'Sample1' is greater than 10
subset_rows_condition = df[df['Sample1'] > 10]
print(subset_rows_condition)
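Several conditions can be combined in one filter. The sketch below uses `&` (and), `|` (or), and `~` (not); each condition needs its own parentheses because these operators bind more tightly than comparisons. The thresholds are arbitrary values chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Sample1': [10.5, 7.8, 13.2, 8.1],
        'Sample2': [9.3, 6.5, 14.1, 7.9],
    },
    index=pd.Index(['Gene1', 'Gene2', 'Gene3', 'Gene4'], name='Gene'),
)

# Genes above 10 in both samples
high_in_both = df[(df['Sample1'] > 10) & (df['Sample2'] > 10)]

# Genes that are either very low or very high in Sample1
low_or_high = df[(df['Sample1'] < 8) | (df['Sample1'] > 13)]

# Genes that are NOT above 10 in Sample1
not_high = df[~(df['Sample1'] > 10)]

print(high_in_both)
print(low_or_high)
print(not_high)
```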
Subsetting with .loc[] for Row and Column Selection
# Subsetting specific rows and columns using .loc[]
subset_loc = df.loc[['Gene1', 'Gene3'], ['Sample1', 'Sample2']]
print(subset_loc)
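Besides lists of labels, .loc[] also accepts label slices and Boolean masks. One detail worth illustrating is that label slices include both end points, while positional slices with .iloc[] exclude the stop position. The sketch below rebuilds the illustrative table with 'Gene' as the index.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Sample1': [10.5, 7.8, 13.2, 8.1],
        'Sample2': [9.3, 6.5, 14.1, 7.9],
        'Sample3': [11.2, 8.4, 12.9, 8.5],
    },
    index=pd.Index(['Gene1', 'Gene2', 'Gene3', 'Gene4'], name='Gene'),
)

# Label slices include both end points
label_slice = df.loc['Gene2':'Gene3', 'Sample1':'Sample2']

# The equivalent positional slice excludes the stop position
position_slice = df.iloc[1:3, 0:2]

# A Boolean mask for the rows combined with a list of column labels
high_sample1 = df.loc[df['Sample1'] > 10, ['Sample3']]

print(label_slice)
print(position_slice)
print(high_sample1)
```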
Advanced use of Pandas
Data Manipulation and Cleaning
Example: Cleaning a gene expression dataset by filling in missing values.
import pandas as pd
import numpy as np
# Sample gene expression data
data = {
    'Gene': ['Gene1', 'Gene2', 'Gene3', 'Gene4'],
    'Sample1': [10.5, np.nan, 13.2, 8.1],
    'Sample2': [9.3, 6.5, np.nan, 7.9],
    'Sample3': [11.2, 8.4, 12.9, np.nan]
}
# Create DataFrame
df = pd.DataFrame(data)
# Handle missing data (fill with the mean of each numeric column)
# numeric_only=True skips the text column 'Gene', which has no mean
df.fillna(df.mean(numeric_only=True), inplace=True)
print(df)
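Filling with the column mean is only one option. The sketch below shows two other common cleaning steps, counting and dropping missing values, followed by a log2(x + 1) transformation of the sample columns. The transformation and the name `sample_cols` are assumptions added for illustration; they are not prescribed by this chapter.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Gene': ['Gene1', 'Gene2', 'Gene3', 'Gene4'],
    'Sample1': [10.5, np.nan, 13.2, 8.1],
    'Sample2': [9.3, 6.5, np.nan, 7.9],
    'Sample3': [11.2, 8.4, 12.9, np.nan],
})

# Count the missing values in each column
missing_per_column = df.isna().sum()

# Keep only the rows without any missing value
complete_rows = df.dropna()

# Log-transform the numeric columns; missing values stay missing
sample_cols = ['Sample1', 'Sample2', 'Sample3']
log_expr = np.log2(df[sample_cols] + 1)

print(missing_per_column)
print(complete_rows)
print(log_expr)
```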
Data Aggregation and Grouping
Example: Grouping gene expression data by gene and experimental condition (e.g., control vs. treatment) and calculating the mean expression for each group.
# Sample data with a 'Condition' column
data = {
    'Gene': ['Gene1', 'Gene1', 'Gene2', 'Gene2', 'Gene3', 'Gene3'],
    'Condition': ['Control', 'Treatment', 'Control', 'Treatment', 'Control', 'Treatment'],
    'Expression': [10.5, 12.2, 7.8, 8.5, 13.2, 15.1]
}
df = pd.DataFrame(data)
# Group by 'Gene' and 'Condition' and calculate the mean expression of each group
grouped = df.groupby(['Gene', 'Condition']).mean()
print(grouped)
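A grouping can also produce several statistics at once. As a minimal sketch, the example below groups the same illustrative data by condition only and uses `agg()` to compute the mean, maximum, and number of measurements per condition.

```python
import pandas as pd

df = pd.DataFrame({
    'Gene': ['Gene1', 'Gene1', 'Gene2', 'Gene2', 'Gene3', 'Gene3'],
    'Condition': ['Control', 'Treatment', 'Control', 'Treatment', 'Control', 'Treatment'],
    'Expression': [10.5, 12.2, 7.8, 8.5, 13.2, 15.1],
})

# One row per condition, one column per statistic
summary = df.groupby('Condition')['Expression'].agg(['mean', 'max', 'count'])
print(summary)
```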
Data Indexing and Subsetting
Indexing by Label: .loc[]
# Accessing data for 'Gene2' using .loc[]
gene2_data = df.loc[df['Gene'] == 'Gene2']
print(gene2_data)
Indexing by Position: .iloc[]
# Accessing the first row and second column using .iloc[]
first_row_second_col = df.iloc[0, 1]
print(first_row_second_col)
Boolean Indexing
# Sample gene expression data
data = {
    'Gene': ['Gene1', 'Gene2', 'Gene3', 'Gene4'],
    'Sample1': [10.5, 7.8, 13.2, 8.1],
    'Sample2': [9.3, 6.5, 14.1, 7.9],
    'Sample3': [11.2, 8.4, 12.9, 8.5]
}
df = pd.DataFrame(data)
# Filter genes with expression in 'Sample1' greater than 10
high_expression_genes = df[df['Sample1'] > 10]
print(high_expression_genes)
Merging and Joining Data
# Gene expression data
expr_data = {
    'Gene': ['Gene1', 'Gene2', 'Gene3', 'Gene4'],
    'Expression': [10.5, 7.8, 13.2, 8.1]
}
# Gene metadata (species of origin for each gene)
metadata = {
    'Gene': ['Gene1', 'Gene2', 'Gene3', 'Gene4'],
    'Species': ['Human', 'Mouse', 'Human', 'Mouse']
}
df_expr = pd.DataFrame(expr_data)
df_metadata = pd.DataFrame(metadata)
# Merging the data on 'Gene'
merged_df = pd.merge(df_expr, df_metadata, on='Gene')
print(merged_df)
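In the example above both tables contain exactly the same genes. Real tables often do not, and the `how` parameter of `pd.merge` then decides which rows are kept. The sketch below uses two deliberately mismatched tables (made up for this illustration) to compare the default inner join with a left join and an outer join.

```python
import pandas as pd

# Gene1 has no metadata; Gene4 has no expression value
df_expr = pd.DataFrame({
    'Gene': ['Gene1', 'Gene2', 'Gene3'],
    'Expression': [10.5, 7.8, 13.2],
})
df_metadata = pd.DataFrame({
    'Gene': ['Gene2', 'Gene3', 'Gene4'],
    'Species': ['Mouse', 'Human', 'Mouse'],
})

# Default (how='inner'): only genes present in both tables
inner = pd.merge(df_expr, df_metadata, on='Gene')

# how='left': every gene from df_expr, with NaN where metadata is missing
left = pd.merge(df_expr, df_metadata, on='Gene', how='left')

# how='outer': genes from either table; indicator=True adds a '_merge'
# column recording where each row came from
outer = pd.merge(df_expr, df_metadata, on='Gene', how='outer', indicator=True)

print(inner)
print(left)
print(outer)
```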
Reshaping and Pivoting
Pivoting gene expression data to show expression values for different conditions as columns.
# Sample data
data = {
    'Gene': ['Gene1', 'Gene1', 'Gene2', 'Gene2', 'Gene3', 'Gene3'],
    'Condition': ['Control', 'Treatment', 'Control', 'Treatment', 'Control', 'Treatment'],
    'Expression': [10.5, 12.2, 7.8, 8.5, 13.2, 15.1]
}
df = pd.DataFrame(data)
# Pivot the data
pivoted_df = df.pivot(index='Gene', columns='Condition', values='Expression')
print(pivoted_df)
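Melting the pivoted table to turn the condition columns back into a single column.
# Melting the pivoted dataset (reset_index() turns 'Gene' back into a regular column)
melted_df = pd.melt(pivoted_df.reset_index(), id_vars=['Gene'], value_vars=['Control', 'Treatment'], var_name='Condition', value_name='Expression')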
print(melted_df)
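`pivot` requires each combination of index and column values to occur only once. With replicate measurements that is not the case, and `pivot_table` can be used to aggregate the duplicates instead. The replicate values below are made up for this sketch, and averaging with `aggfunc='mean'` is one possible choice among several.

```python
import pandas as pd

# Two control replicates per gene (made-up values)
df = pd.DataFrame({
    'Gene': ['Gene1', 'Gene1', 'Gene1', 'Gene2', 'Gene2', 'Gene2'],
    'Condition': ['Control', 'Control', 'Treatment', 'Control', 'Control', 'Treatment'],
    'Expression': [10.0, 11.0, 12.2, 7.0, 8.0, 8.5],
})

# pivot cannot decide which replicate to keep and raises a ValueError
try:
    df.pivot(index='Gene', columns='Condition', values='Expression')
except ValueError as err:
    print('pivot failed:', err)

# pivot_table aggregates the replicates, here by averaging them
table = df.pivot_table(index='Gene', columns='Condition',
                       values='Expression', aggfunc='mean')
print(table)
```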
Time Series Data Handling
Handling time series gene expression data, where data is recorded over multiple time points.
# Sample time-series data
time_data = {
    'Gene': ['Gene1', 'Gene1', 'Gene2', 'Gene2', 'Gene3', 'Gene3'],
    'Time': ['0h', '24h', '0h', '24h', '0h', '24h'],
    'Expression': [10.5, 11.2, 7.8, 8.0, 13.2, 15.0]
}
df = pd.DataFrame(time_data)
# Convert 'Time' to timedeltas: '0h' and '24h' are elapsed times, and
# pd.to_datetime with format='%Hh' fails on '24h' because hour 24 is out of range
df['Time'] = pd.to_timedelta(df['Time'])
# Set 'Gene' and 'Time' as a two-level index
df.set_index(['Gene', 'Time'], inplace=True)
print(df)
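Labels such as '0h' and '24h' describe elapsed time rather than calendar dates. One way to handle them, sketched here under that assumption, is to parse them with `pd.to_timedelta` and, where plain numbers are more convenient, convert the result to hours. The '48h' time point is added only for illustration.

```python
import pandas as pd

time_points = pd.Series(['0h', '24h', '48h'])

# Parse the labels as elapsed time
elapsed = pd.to_timedelta(time_points)

# Convert to a plain number of hours, e.g. for plotting or modelling
hours = elapsed.dt.total_seconds() / 3600

print(elapsed)
print(hours)
```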
Basic Data Visualization
import matplotlib.pyplot as plt
# Plotting the data: time points on the x-axis, one line per gene
df.reset_index().groupby(['Gene', 'Time'])['Expression'].mean().unstack('Gene').plot(kind='line')
plt.ylabel('Gene Expression')
plt.title('Gene Expression Over Time')
plt.show()
Handling Large Datasets
Reading a large gene expression file in chunks.
Instead of loading the whole file at once, Pandas returns it piece by piece, so that only one chunk has to fit in memory at a time.
# Reading large dataset in chunks
chunk_size = 1000
for chunk in pd.read_csv('large_gene_expression_data.csv', chunksize=chunk_size):
    process_chunk(chunk)  # Define your processing function here
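Since `process_chunk` and the file name above are placeholders, here is a self-contained sketch of the same pattern. It reads a small in-memory table (standing in for a large file) in chunks of four rows and accumulates a running total, so the overall mean is obtained without ever holding all rows at once. The data and the chunk size are made up for illustration.

```python
import io
import pandas as pd

# Small in-memory stand-in for a large CSV file (10 rows)
csv_text = "Gene,Expression\n" + "\n".join(
    f"Gene{i},{i * 1.5}" for i in range(1, 11)
)

total = 0.0
n_rows = 0
# Each chunk is an ordinary DataFrame with at most 4 rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk['Expression'].sum()
    n_rows += len(chunk)

mean_expression = total / n_rows
print(n_rows, mean_expression)
```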
Exporting Data
Saving processed data to a CSV file.
# Save processed data to CSV
df.to_csv('processed_gene_expression.csv')
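By default, `to_csv` also writes the index as the first column. When the index is just the automatic 0, 1, 2, ... numbering, it is usually left out with `index=False`, and `sep='\t'` produces a tab-delimited file. The sketch below writes to an in-memory buffer instead of a file so that the result can be inspected directly; the two-gene table is made up for illustration.

```python
import io
import pandas as pd

df = pd.DataFrame({
    'Gene': ['Gene1', 'Gene2'],
    'Expression': [10.5, 7.8],
})

# Write tab-delimited text without the automatic row numbers
buffer = io.StringIO()
df.to_csv(buffer, sep='\t', index=False)
tsv_text = buffer.getvalue()
print(tsv_text)

# Reading the text back recovers the same columns and values
round_trip = pd.read_csv(io.StringIO(tsv_text), sep='\t')
print(round_trip)
```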