Bioinformatics Class Project Day 3

Author

Solomon Chak

Published

April 24, 2023

1 Introduction

Last week, we set out to test the hypothesis that TEs are contributing to the evolution of the worker-queen differentiation in eusocial shrimp. We began with raw RNAseq data from three workers and three queens. I have assembled a transcriptome from the RNAseq data and ran Repeat Masker on the transcriptome assembly. We analyzed the raw RNAseq data with Galaxy to find the differential expressed genes and Finally, we loaded and explored two data sets in R:

df - List of transcripts that are SHARED in all individuals with results related to differential expression between queens and workers (mainly log2FoldChange and padj).
df_transcript_te - List of transposable elements (TEs) and their locations in the transcripts that has TEs.

Our task today is to combine these data sets into a master data set and to decide what analyses can test whether the data support the hypothesis.

2 Load libraries

library("tidyr")
library("ggpubr")
library("stringr")
library("plyr")
library("dplyr")

3 Make a master data set

How do we combine df and df_transcript_te? Examine these files to see if there’s common column.

Note that df has only transcripts that are mapped across all individuals, while df_transcript_te has only transcripts that have TEs. None of these files have ALL transcripts in the assembled transcriptome. But we do not need that information.

Write some codes to combine df and df_transcript_te and save the it as df_tranTE. Look at the different ways to combine dataframes using dplyr here, decide which one should be used.

# your code

# This should return true if you did this correctly
# nrow(df_tranTE) == nrow(df)

4 Discussion

If the hypothesis is true, what pattern would you expect to see in df_tranTE? These are your predictions/analysis.
What are the predictor and response in each analysis?
What statistic tests should be used for each analysis?

5 Predictions

Below, we will run through a few example analysis using all TEs.

You do not need to report Prediction 1 in your project report. Instead, I’ll provide a sample report based on Prediction 1.
For Prediction 2 and Prediction 3, you will need to
- (1) report the main analysis here (with all TEs) and
- (2) modify the codes to run it for your assigned TE class and report it in the project report.
  
  Name TE_class
  
  Akhila DNA
  
  Diandra DNA
  
  Mohab LTR
  
  Rebeca LTR
  
  Sanjna LINE
  
  Steven LINE
  
  Tyla SINE
  
  Solomon SINE

Name	TE_class
Akhila	DNA
Diandra	DNA
Mohab	LTR
Rebeca	LTR
Sanjna	LINE
Steven	LINE
Tyla	SINE
Solomon	SINE

Prediction 1

If TEs are contributing to the evolution of the worker-queen differentiation (our hypothesis), there would be more TEs in DE genes than non-DE genes.

What are the predictor & response variables?

Predictor = DE (Y or N, categorical)
Response = TE (Y or N, categorical)

What statistic tests should be used?

Chi-sq test of independence

To do this, we will need to create two new columns DE and TE in df_tranTE that indicated whether the gene is DE or not, and has TE or not. We can use mutate in combination with case_when.

df_tranTE = mutate(df_tranTE, 
                   DE = case_when(padj < 0.05 ~ 'DE', 
                                  padj >= 0.05 ~ 'NS'),
                   TE = case_when(TE_class %in% c("DNA", 
                                                  "LTR", 
                                                  "LINE", 
                                                  "SINE") ~ 'TE',
                                  .default = 'Not')
                   )
# In the second case_when(), when the TE_class is not the ones in c(), the cell will be assigned "Not"

Then we can use table to make a contingency table that has the count of transcripts in all combinations of the columns DE and TE.

obs_table = table(df_tranTE$TE, df_tranTE$DE)
obs_table

     
         DE    NS
  Not   224 26108
  TE     39  3462

Now write some codes to run the Chi-squared test of independence.

# Your code

Remember that for a chi-square test of independence, the null hypothesis is that the two variables (TE and DE) are independent. When p < 0.05, we can reject the null hypothesis. When p > 0.05, we fail to reject the null hypothesis.

What does the test result tell you?

Let’s plot this using the raw data df_tranTE.

ggplot(data = df_tranTE, aes(x = TE, fill = TE)) + 
  geom_bar() + 
  facet_grid(DE~., scales = "free_y") +
  labs(x = "Transposable Elements", y = "Count")

We can further test whether TE_class affect whether a transcript is DE or not.

chisq.test(with(df_tranTE, table(TE_class, DE)))

Warning in chisq.test(with(df_tranTE, table(TE_class, DE))): Chi-squared
approximation may be incorrect


    Pearson's Chi-squared test

data:  with(df_tranTE, table(TE_class, DE))
X-squared = 1.2042, df = 3, p-value = 0.752

What does this result tell you?

Summary Statement

In summary we can say that (note the use of 3 significant figures in reporting stats.):

The proportions of transcripts having TEs or not are not statistically different between transcripts that are and are not differentially expressed between queens and workers (Chi-squared = 2.160, d.f. = 1, p = 0.142). Further, whether a transcript is differentially expressed or not is not dependent on the major TE class (DNA, LTR, LINE, and SINE) (Chi-squared = 1.204, d.f. = 3, p = 0.752). These results do not support our hypothesis that TEs are contributing to the evolution of the worker-queen differentiation.

Prediction 2

If the hypothesis is true (TEs are contributing to the evolution of the worker-queen differentiation), transcripts that has TE may be more strongly differentially expressed in queens.

What are the predictor & response variables?

Predictor = TE (Y or N, categorical & binomial)
Response = log2FoldChange (continuous)
- Note that log2FoldChange can be -ve and +ve. If we use it as is, we will be testing whether transcripts with TE are more UP regulated. We are not doing that here.
- Instead, we should use log(abs(log2FoldChange)), which tests whether transcripts with TE are differentially expressed REGARDLESS of being up/down regulated. The log transformation is used to normalize the data.

What statistic tests should be used?

t-test

Write some codes to perform the t.test and plot the data.

# Your code

The null hypothesis of a t-test is that the means of the two groups are not different. You can reject the null when p < 0.05, but fail to reject the null when p > 0.05. Use the reported means of the two group to interpret which group has higher means (or less negative). The difference between the two groups can also be seen in the plot.

Tip

In your analysis, filter df_tranTE to have your assigned TE class OR is.na(TE_class) . Then rerun the t-test and plotting.

Prediction 3

If the hypothesis is true (TEs are contributing to the evolution of the worker-queen differentiation), transcripts that has longer TE may be more strongly differentially expressed.

What are the predictor & response variables?

Predictor = TE_length (continuous)
- Note that not all genes in df_tranTE has TE, so we’ll need to filter it to just rows with TEs.
- We should transform it with log(TE_length) to conform with normality.
Response = log2FoldChange (continuous)
- Note again that log2FoldChange can be analyzed in two ways. We will use log(abs(log2FoldChange)) here.

What statistic tests should be used?

Linear regression

Write some codes to perform the filter and regression steps. And plot the data with a linear regression line (use linetype = “dashed” for non-significant results).

# Your code

What does the result tell you?

For your analysis, filter df_tranTE to have your assigned TE class. Then do the regression and plotting.