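library("plyr")    # load plyr before dplyr so dplyr functions (e.g. summarise, mutate) are not masked
library("dplyr")
library("tidyr")
library("ggpubr")  # also attaches ggplot2, which is used for the plots below
library("stringr")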
Final project
1 Introduction
See here for the background of the project.
Here, we hypothesize that transposable elements (TEs) contribute to the differentiation between queens and workers in eusocial shrimps. We expect to find more TEs in differentially expressed (DE) genes than in non-DE genes, and we expect transcripts with longer TEs to be more strongly differentially expressed. Apart from the contribution of all TEs, we also test whether DNA transposons contribute significantly to worker-queen differentiation. DNA transposons move from one genomic location to another by a cut-and-paste mechanism. They are powerful forces of genetic change and have played a significant role in the evolution of many genomes (Munoz-Lopez and Garcia 2010). They can also be used to introduce foreign DNA into a genome, analyse the regulatory genome, study embryonic development, identify genes and pathways implicated in disease, and contribute to gene therapy.
2 Methods & Results
From the eusocial shrimp Synalpheus elizabethae, we obtained RNA-seq data from three queens and three workers. The transcriptome was assembled with Trinity (Haas et al. 2013), and transposable elements were identified with RepeatMasker (Tarailo-Graovac and Chen 2009). We performed differential gene expression analysis using Galaxy (Blankenberg et al. 2010) and DESeq2 (Love et al. 2014). All statistical analyses were performed in R and are detailed below.
2.1 Are TE transcripts more strongly differentially expressed?
Prediction: If the hypothesis is true, transcripts that contain a TE will be more strongly differentially expressed between queens and workers than transcripts without a TE.
t.test(log(abs(log2FoldChange))~TE, data = df_tranTE)
Welch Two Sample t-test
data: log(abs(log2FoldChange)) by TE
t = -6.6975, df = 4541.1, p-value = 2.379e-11
alternative hypothesis: true difference in means between group Not and group TE is not equal to 0
95 percent confidence interval:
-0.1872263 -0.1024362
sample estimates:
mean in group Not mean in group TE
-0.7029916 -0.5581603
# Plot your data
ggplot(data = df_tranTE, aes(x = TE, y = log(abs(log2FoldChange)))) +
geom_boxplot()
In summary, we rejected the null hypothesis: transcripts containing a TE have a significantly larger mean log(|log2FoldChange|) than transcripts without a TE (Welch t = -6.6975, df = 4541.1, p = 2.379e-11).
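Because the response is the natural log of |log2FoldChange|, the difference between the two group means can be back-transformed into a ratio of geometric means, which may be easier to interpret as an effect size. The sketch below only reuses the numbers printed in the t-test output above; the variable names are illustrative and not part of the original analysis.

```r
# Group means of log(|log2FoldChange|), copied from the t-test output above
mean_not <- -0.7029916
mean_te  <- -0.5581603

# 95% CI for (mean_not - mean_te), also copied from the output
ci_diff <- c(-0.1872263, -0.1024362)

# A difference in log means corresponds to a ratio of geometric means (TE / Not)
ratio_te_vs_not <- exp(mean_te - mean_not)

# Flip the sign of the CI so it describes (TE - Not), then back-transform
ratio_ci <- exp(sort(-ci_diff))

round(ratio_te_vs_not, 3)
round(ratio_ci, 3)
```

A ratio above 1 means that the typical |log2FoldChange| is larger in TE-containing transcripts, so the statistically significant difference corresponds to a fairly modest shift in magnitude.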
# Keep transcripts whose TE class is DNA, plus transcripts without any TE (TE_class is NA)
df_tranTE_DNA <- filter(df_tranTE, TE_class == "DNA" | is.na(TE_class))
# use t.test
t.test(log(abs(log2FoldChange))~TE, data = df_tranTE_DNA)
Welch Two Sample t-test
data: log(abs(log2FoldChange)) by TE
t = -3.8127, df = 1606.9, p-value = 0.0001426
alternative hypothesis: true difference in means between group Not and group TE is not equal to 0
95 percent confidence interval:
-0.18244839 -0.05849529
sample estimates:
mean in group Not mean in group TE
-0.7029916 -0.5825198
# What are the predictor and response variables?
# Predictor = TE (TE or Not; categorical, binary)
# Response = log(abs(log2FoldChange)) (continuous)
In summary, we rejected the null hypothesis: transcripts containing a DNA transposon have a significantly larger mean log(|log2FoldChange|) than transcripts without a TE (Welch t = -3.8127, df = 1606.9, p = 0.0001426).
# Plot your data
ggplot(data = df_tranTE_DNA, aes(x = TE, y = log(abs(log2FoldChange)))) +
geom_boxplot() +
labs(x = "DNA")
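The DNA-only comparison relies on keeping the TE-free transcripts as the control group, which is why the filter condition includes `is.na(TE_class)`. The sketch below illustrates this on a small made-up data frame; `toy` and its values are hypothetical, and it assumes, as the filter above implies, that `TE_class` is `NA` for transcripts without a TE.

```r
library(dplyr)

# Hypothetical toy data: TE_class is NA for transcripts without a TE
toy <- data.frame(
  transcript = c("t1", "t2", "t3", "t4"),
  TE         = c("TE", "TE", "Not", "Not"),
  TE_class   = c("DNA", "LTR", NA, NA)
)

# Without the is.na() clause the comparison returns NA for TE-free rows,
# and filter() drops them, leaving no "Not" group to compare against
dna_only <- filter(toy, TE_class == "DNA")

# With the is.na() clause the TE-free rows are kept as the control group
dna_vs_none <- filter(toy, TE_class == "DNA" | is.na(TE_class))

table(dna_only$TE)
table(dna_vs_none$TE)
```

In both cases transcripts carrying other TE classes (here the hypothetical "LTR" row) are excluded, so the t-test compares DNA-transposon transcripts against TE-free transcripts only.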
2.2 Does TE length have an effect on differential expression?
Prediction: If the hypothesis is true, transcripts that contain longer TEs will be more strongly differentially expressed.
# Use linear regression
lm_HA <- lm(formula = log(abs(log2FoldChange)) ~ log(TE_length), data = df_tranTE)
summary(lm_HA)
Call:
lm(formula = log(abs(log2FoldChange)) ~ log(TE_length), data = df_tranTE)
Residuals:
Min 1Q Median 3Q Max
-8.1856 -0.6526 0.1265 0.7574 3.8145
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.539360 0.120689 -4.469 8.11e-06 ***
log(TE_length) -0.004298 0.027203 -0.158 0.874
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.198 on 3499 degrees of freedom
(26332 observations deleted due to missingness)
Multiple R-squared: 7.136e-06, Adjusted R-squared: -0.0002787
F-statistic: 0.02497 on 1 and 3499 DF, p-value: 0.8745
In summary, we failed to reject the null hypothesis: we detected no significant relationship between TE length and the strength of differential expression, that is, between log(TE_length) and log(|log2FoldChange|) (F = 0.02497, df = 1 and 3499, p = 0.8745).
# Repeat the linear regression on the DNA-transposon subset (df_tranTE_DNA)
lm_HA_DNA <- lm(formula = log(abs(log2FoldChange)) ~ log(TE_length), data = df_tranTE_DNA)
summary(lm_HA_DNA)
Call:
lm(formula = log(abs(log2FoldChange)) ~ log(TE_length), data = df_tranTE_DNA)
Residuals:
Min 1Q Median 3Q Max
-7.3724 -0.6296 0.1058 0.7590 3.0876
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.72527 0.19490 -3.721 0.000206 ***
log(TE_length) 0.03261 0.04396 0.742 0.458389
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.159 on 1426 degrees of freedom
(26332 observations deleted due to missingness)
Multiple R-squared: 0.0003856, Adjusted R-squared: -0.0003154
F-statistic: 0.5501 on 1 and 1426 DF, p-value: 0.4584
In summary, we failed to reject the null hypothesis: we detected no significant relationship between the length of DNA transposons and the strength of differential expression (F = 0.5501, df = 1 and 1426, p = 0.4584).
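As a quick consistency check on the two regression summaries above: in a linear model with a single predictor, the F statistic equals the square of the slope's t value. The sketch below recomputes F from the printed t values; the variable names are illustrative only.

```r
# t values for the log(TE_length) slope, copied from the two summaries above
t_all <- -0.158   # all TE classes
t_dna <-  0.742   # DNA transposons only

# With a single predictor, F equals the squared t value of the slope
f_all <- t_all^2
f_dna <- t_dna^2

# Compare with the F statistics in the model output (0.02497 and 0.5501);
# small gaps come from rounding of t in the printed output
c(all = f_all, dna = f_dna)
```

Both recomputed values agree with the F statistics in the model output to within rounding.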
# Plot your data
ggplot(data = df_tranTE_DNA, aes(x = log(TE_length), y = log(abs(log2FoldChange)))) +
geom_point() +
geom_smooth(method = lm)
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 26332 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 26332 rows containing missing values (`geom_point()`).
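The warnings above presumably arise because `TE_length` is `NA` for transcripts without a TE; the count of 26332 matches the number of observations the regression dropped for missingness. One way to make that exclusion explicit is to drop those rows before plotting. The sketch below uses a small hypothetical data frame (`toy`) rather than the project data.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: TE_length is NA for transcripts without a TE
toy <- data.frame(
  log2FoldChange = c(1.5, -0.4, 2.2, -3.1, 0.8),
  TE_length      = c(120, NA, 450, 80, NA)
)

# Keep only rows that can be placed on the log(TE_length) axis
toy_te <- filter(toy, !is.na(TE_length))

# Stating the formula explicitly also silences the geom_smooth() message
p <- ggplot(toy_te, aes(x = log(TE_length), y = log(abs(log2FoldChange)))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

nrow(toy) - nrow(toy_te)  # number of TE-free rows dropped before plotting
```

Filtering first does not change the fitted line, since the plotting layers drop the same rows anyway; it only documents the exclusion in the code instead of leaving it to a warning.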
3 Conclusion
In summary, we found that transcripts that contain TEs are more strongly differentially expressed than transcripts without TEs. The same holds when the comparison is restricted to DNA transposons. TE length, however, showed no significant relationship with the strength of differential expression. These results support the hypothesis that TEs are contributing to the evolution of worker-queen differentiation.
Although the TE-length results did not support our prediction, the dichotomous classification of transcripts as merely differentially expressed or not may conceal important patterns in the strength of differential expression. Further analysis should consider using a continuous variable based on the fold change of differential expression.