Final project

Author

Diandra Gordon

Published

April 26, 2023

1 Introduction

See here for the background of the project.

Here, we hypothesize that transposable elements (TEs) are contributing to the differentiation between queens and workers in eusocial shrimps. We expect to find more TEs in differentially expressed (DE) genes than in non-DE genes, and transcripts that contain longer TEs may be more strongly differentially expressed. Apart from the contribution of all TEs, we will also test whether DNA transposons contribute significantly to worker-queen differentiation. DNA transposons move from one genomic location to another by a cut-and-paste mechanism. They are powerful forces of genetic change and have played a significant role in the evolution of many genomes (Munoz-Lopez & Garcia 2010). They can also be used to introduce foreign DNA into a genome, analyse the regulatory genome, study embryonic development, identify genes and pathways implicated in disease, and contribute to gene therapy.

2 Methods & Results

From the eusocial shrimp Synalpheus elizabethae, we obtained RNA-seq data from three queens and three workers. The transcriptome was assembled with Trinity (Haas et al. 2013), and transposable elements were identified with RepeatMasker (Tarailo-Graovac and Chen 2009). We performed differential gene expression analysis using Galaxy (Blankenberg et al. 2010) and DESeq2 (Love et al. 2014). All statistical analyses were performed in R and are detailed below.

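# Load plyr before dplyr so that dplyr's verbs are not masked by plyr
library("plyr")
library("tidyr")
library("dplyr")
library("ggpubr")   # also attaches ggplot2, which provides ggplot()
library("stringr")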

2.1 Could TE transcripts be strongly differentially expressed?

Prediction: If the hypothesis is true, transcripts that contain TEs will be more strongly differentially expressed between queens and workers than transcripts without TEs.

t.test(log(abs(log2FoldChange))~TE, data = df_tranTE)

    Welch Two Sample t-test

data:  log(abs(log2FoldChange)) by TE
t = -6.6975, df = 4541.1, p-value = 2.379e-11
alternative hypothesis: true difference in means between group Not and group TE is not equal to 0
95 percent confidence interval:
 -0.1872263 -0.1024362
sample estimates:
mean in group Not  mean in group TE 
       -0.7029916        -0.5581603 
# Plot your data 
ggplot(data = df_tranTE, aes(x = TE, y = log(abs(log2FoldChange)))) + 
  geom_boxplot() 

Summary: We rejected the null hypothesis. Transcripts that contain TEs show a significantly greater magnitude of differential expression than transcripts without TEs (t = -6.6975, df = 4541.1, p = 2.379e-11).
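
Because the response is log-transformed, the difference between the group means is easier to interpret after back-transforming. The short calculation below is a sketch that uses only the two group means reported in the t-test output above; it converts their difference into a ratio of geometric means of |log2FoldChange|. The variable names are illustrative and do not come from the original analysis.

```r
# Group means of log(abs(log2FoldChange)) copied from the Welch t-test output
mean_log_not <- -0.7029916  # transcripts without a TE
mean_log_te  <- -0.5581603  # transcripts with a TE

# Difference on the natural-log scale (TE minus Not)
diff_log <- mean_log_te - mean_log_not

# exp() of a difference in log means is a ratio of geometric means
ratio_geo <- exp(diff_log)

print(round(diff_log, 4))
print(round(ratio_geo, 3))
```

Read this way, the geometric-mean |log2FoldChange| of TE-containing transcripts is roughly 16% larger than that of TE-free transcripts.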

# Keep transcripts whose TE class is DNA, plus TE-free transcripts (TE_class is NA)

df_tranTE_DNA = filter(df_tranTE, TE_class == "DNA" | is.na(TE_class))

# use t.test
t.test(log(abs(log2FoldChange))~TE, data = df_tranTE_DNA)

    Welch Two Sample t-test

data:  log(abs(log2FoldChange)) by TE
t = -3.8127, df = 1606.9, p-value = 0.0001426
alternative hypothesis: true difference in means between group Not and group TE is not equal to 0
95 percent confidence interval:
 -0.18244839 -0.05849529
sample estimates:
mean in group Not  mean in group TE 
       -0.7029916        -0.5825198 
# What are the predictor & response variables?

# Predictor = TE (TE or Not; categorical, binary)
# Response = log(abs(log2FoldChange)) (continuous)

Summary: We rejected the null hypothesis. Transcripts that contain DNA transposons show a significantly greater magnitude of differential expression than transcripts without TEs (t = -3.8127, df = 1606.9, p = 0.0001426).
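
The subsetting step used for this test keeps two kinds of rows: transcripts whose TE belongs to the DNA class and transcripts with no TE at all (where TE_class is NA), so the DNA-transposon group is compared against the same TE-free baseline as before. A minimal base R sketch on a made-up data frame illustrates the logic. The object `toy` and all of its values are hypothetical, not the study data, and base R indexing is used here in place of dplyr's filter().

```r
# Hypothetical toy data; column names mirror df_tranTE, values are invented
toy <- data.frame(
  log2FoldChange = c(0.8, -1.2, 0.3, 2.1, -0.4, 0.6, -1.5, 0.9),
  TE       = c("TE", "TE", "TE", "TE", "Not", "Not", "Not", "Not"),
  TE_class = c("DNA", "LINE", "DNA", "LTR", NA, NA, NA, NA)
)

# Base R equivalent of filter(df, TE_class == "DNA" | is.na(TE_class)):
# NA | TRUE evaluates to TRUE, so TE-free rows are retained
keep <- toy$TE_class == "DNA" | is.na(toy$TE_class)
toy_dna <- toy[keep, ]

# Cross-tabulate to confirm only DNA TEs and TE-free transcripts remain
print(table(toy_dna$TE, toy_dna$TE_class, useNA = "ifany"))

# Same Welch t-test formula as in the analysis, applied to the toy subset
res <- t.test(log(abs(log2FoldChange)) ~ TE, data = toy_dna)
print(res$estimate)
```

The cross-tabulation is a quick check that no other TE class slipped into the comparison.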

# Plot your data 
ggplot(data = df_tranTE_DNA, aes(x = TE, y = log(abs(log2FoldChange)))) + 
  geom_boxplot() +
  labs(x = "DNA")

2.2 Does TE length have an effect on differential expression?

Prediction: If the hypothesis is true, transcripts that contain longer TEs will be more strongly differentially expressed.

# Use linear regression

lm_HA = lm(formula = log(abs(log2FoldChange)) ~ log(TE_length), data = df_tranTE)
summary(lm_HA)

Call:
lm(formula = log(abs(log2FoldChange)) ~ log(TE_length), data = df_tranTE)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.1856 -0.6526  0.1265  0.7574  3.8145 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -0.539360   0.120689  -4.469 8.11e-06 ***
log(TE_length) -0.004298   0.027203  -0.158    0.874    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.198 on 3499 degrees of freedom
  (26332 observations deleted due to missingness)
Multiple R-squared:  7.136e-06, Adjusted R-squared:  -0.0002787 
F-statistic: 0.02497 on 1 and 3499 DF,  p-value: 0.8745

Summary: We failed to reject the null hypothesis. We found no evidence that TE length affects the magnitude of differential expression; there is no significant relationship between log(abs(log2FoldChange)) and log(TE_length) (F = 0.02497, df = 1 and 3499, p = 0.8745).
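
Since both variables are log-transformed, the slope can be read as an elasticity: multiplying TE length by a factor k multiplies the expected |log2FoldChange| by k raised to the slope. The sketch below uses only the estimate, standard error, and residual degrees of freedom reported in the regression output above to compute an approximate 95% confidence interval and the implied effect of doubling TE length. The variable names are illustrative, and small differences from the printed F statistic are due to rounding of the inputs.

```r
# Values copied from the lm() summary row for log(TE_length)
slope    <- -0.004298
se       <-  0.027203
df_resid <- 3499

# t statistic and its square (equals the F statistic in simple regression)
t_val <- slope / se
f_val <- t_val^2

# 95% confidence interval for the slope from the t distribution
t_crit <- qt(0.975, df_resid)
ci <- slope + c(-1, 1) * t_crit * se

# Multiplicative change in |log2FoldChange| when TE length doubles
fold_per_doubling <- 2^slope

print(round(c(t = t_val, F = f_val), 4))
print(round(ci, 4))
print(round(fold_per_doubling, 4))
```

The interval spans zero, and doubling TE length changes the expected magnitude by well under 1%, which is consistent with the non-significant slope.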

# Repeat the linear regression using only the DNA transposon subset (df_tranTE_DNA)

lm_HA_DNA = lm(formula = log(abs(log2FoldChange)) ~ log(TE_length), data = df_tranTE_DNA)
summary(lm_HA_DNA)

Call:
lm(formula = log(abs(log2FoldChange)) ~ log(TE_length), data = df_tranTE_DNA)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.3724 -0.6296  0.1058  0.7590  3.0876 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -0.72527    0.19490  -3.721 0.000206 ***
log(TE_length)  0.03261    0.04396   0.742 0.458389    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.159 on 1426 degrees of freedom
  (26332 observations deleted due to missingness)
Multiple R-squared:  0.0003856, Adjusted R-squared:  -0.0003154 
F-statistic: 0.5501 on 1 and 1426 DF,  p-value: 0.4584

Summary: We failed to reject the null hypothesis. We found no evidence that the length of DNA transposons affects the magnitude of differential expression (F = 0.5501, df = 1 and 1426, p = 0.4584).

# Plot your data
ggplot(data = df_tranTE_DNA, aes(x = log(TE_length), y = log(abs(log2FoldChange)))) +
  geom_point() +
  geom_smooth(method = lm)
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 26332 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 26332 rows containing missing values (`geom_point()`).
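
The warnings above arise because transcripts without a TE have no TE_length, so their rows are dropped at plot time. One way to make that explicit, sketched here in base R on a hypothetical toy data frame (the names and values are invented), is to keep only rows with finite values on both axes before plotting or modelling.

```r
# Hypothetical toy data: TE-free transcripts have NA for TE_length
toy <- data.frame(
  log2FoldChange = c(0.8, -1.2, 0.3, 2.1, -0.4, 0.6),
  TE_length      = c(350, 1200, NA, 90, NA, 640)
)

# Compute both transformed variables once
toy$log_len <- log(toy$TE_length)
toy$log_lfc <- log(abs(toy$log2FoldChange))

# Keep rows where both values are finite (drops NA, NaN, and +/-Inf)
toy_complete <- toy[is.finite(toy$log_len) & is.finite(toy$log_lfc), ]

print(nrow(toy))
print(nrow(toy_complete))

# The regression now uses exactly the rows that would be plotted
fit <- lm(log_lfc ~ log_len, data = toy_complete)
print(nobs(fit))
```

Passing a data frame filtered this way to ggplot() would avoid the removed-rows warnings and keeps the plotted points identical to the observations used by lm().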

3 Conclusion

In summary, we found that transcripts containing TEs are more strongly differentially expressed than transcripts without TEs. The same holds when only DNA transposons are considered, whereas TE length showed no significant relationship with the strength of differential expression. These results support the hypothesis that TEs are contributing to the evolution of worker-queen differentiation.

One caveat is that a dichotomous classification of transcripts as merely differentially expressed or not may conceal important patterns in the strength of differential expression. Further analysis should consider using a continuous variable based on the fold change of differential expression.