library("tidyr")
library("plyr")   # load plyr before dplyr so that plyr does not mask dplyr verbs
library("dplyr")
library("ggpubr")
library("stringr")
Project Report
1 Introduction
See here for the background of the project.
Here, we hypothesize that TEs contribute to the differentiation between queens and workers in eusocial shrimps. First, we hypothesized that transcripts containing TEs are more strongly differentially expressed in queens. Second, we hypothesized that transcripts with longer TEs are more strongly differentially expressed. Finally, when investigating individual TE classes, we hypothesized that the TE class SINE has an effect on both of these relationships.
1.1
1.2 Are transcripts that have TE more strongly differentially expressed in queens?
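We performed a Welch two-sample t-test with the binary categorical predictor TE (levels Not and TE) and the continuous response variable log2FoldChange, analysed as log(abs(log2FoldChange)) so that the strength of differential expression is compared regardless of direction.
#Welch two-sample t-test on the log-transformed absolute fold change
t.test(log(abs(log2FoldChange)) ~ TE, data = df_tranTE)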
Welch Two Sample t-test
data: log(abs(log2FoldChange)) by TE
t = -6.6975, df = 4541.1, p-value = 2.379e-11
alternative hypothesis: true difference in means between group Not and group TE is not equal to 0
95 percent confidence interval:
-0.1872263 -0.1024362
sample estimates:
mean in group Not mean in group TE
-0.7029916 -0.5581603
The results show that transcripts with TEs have a higher mean log(abs(log2FoldChange)) (-0.558) than transcripts without TEs (-0.703). Since the p-value is 2.379e-11, which is < 0.05, we reject the null hypothesis that the two group means are equal.
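The decision rule used above can be made explicit in code. The following is a minimal sketch on simulated data, not on df_tranTE: the data frame `sim`, its values, and the names `alpha` and `decision` are assumptions introduced only for illustration.

```r
# Simulated stand-in for df_tranTE (hypothetical values, not the project data)
set.seed(1)
sim <- data.frame(
  TE = rep(c("Not", "TE"), times = c(200, 100)),
  log2FoldChange = c(rnorm(200, sd = 0.6), rnorm(100, sd = 0.9))
)

# Same call shape as in the report: Welch two-sample t-test by TE group
res <- t.test(log(abs(log2FoldChange)) ~ TE, data = sim)

# Decision rule: reject H0 (equal group means) only when p < alpha
alpha <- 0.05
decision <- if (res$p.value < alpha) "reject H0" else "fail to reject H0"

res$estimate   # the two group means on the log scale
decision
```

A p-value below 0.05 leads to rejecting the null hypothesis and a p-value above 0.05 to failing to reject it; the null hypothesis is never "accepted" on the basis of a small p-value.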
#Create a box plot
ggplot(df_tranTE , aes(TE , log(abs(log2FoldChange)))) +
geom_boxplot()
We further investigated whether the same difference is seen when only TEs of the class SINE are considered.
#Filter df_tranTE to keep SINE transcripts and transcripts without a TE (TE_class is NA)
df_tranTE_SINE <- filter(df_tranTE, TE_class == "SINE" | is.na(TE_class))
#Perform a t-test on the filtered dataframe
t.test(log(abs(log2FoldChange)) ~ TE, data = df_tranTE_SINE)
Welch Two Sample t-test
data: log(abs(log2FoldChange)) by TE
t = 0.18889, df = 30.1, p-value = 0.8514
alternative hypothesis: true difference in means between group Not and group TE is not equal to 0
95 percent confidence interval:
-0.3443377 0.4145348
sample estimates:
mean in group Not mean in group TE
-0.7029916 -0.7380902
The results give no evidence that transcripts with SINE elements differ from transcripts without TEs in the strength of differential expression. Since the p-value is 0.851, which is > 0.05, we fail to reject the null hypothesis.
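The SINE comparison rests on far fewer TE-containing transcripts than the full comparison (the Welch degrees of freedom drop from about 4541 to about 30), so it is worth checking the group sizes that the filter leaves behind. The sketch below uses a small toy data frame with made-up rows; `toy`, `keep`, `toy_sine` and `group_sizes` are hypothetical names, and base R indexing is used in place of filter() so that the example runs without extra packages.

```r
# Toy stand-in for df_tranTE (hypothetical rows, not the project data)
toy <- data.frame(
  TE       = c("Not", "Not", "Not", "TE", "TE", "TE", "TE"),
  TE_class = c(NA, NA, NA, "SINE", "LINE", "SINE", "DNA"),
  stringsAsFactors = FALSE
)

# Keep SINE transcripts and transcripts without any TE (TE_class is NA).
# NA == "SINE" is NA, but NA | TRUE is TRUE, so rows without a TE are kept.
keep <- toy$TE_class == "SINE" | is.na(toy$TE_class)
toy_sine <- toy[keep, ]

# Number of transcripts in each group that enters the t-test
group_sizes <- table(toy_sine$TE)
group_sizes
```

With a very small TE group, a non-significant result indicates that no difference was detected, which is weaker than showing that there is no difference.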
#Plot the data
ggplot(df_tranTE_SINE , aes(TE , log(abs(log2FoldChange)))) +
geom_boxplot()
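1.3 Are transcripts with longer TE more strongly differentially expressed?
We performed a linear regression relating the continuous variable TE_length to the continuous variable log2FoldChange, both log-transformed. The model is fitted with log(TE_length) on the left-hand side; in a simple linear regression the p-value of the slope is the same whichever of the two variables is treated as the response.
#Linear regression
lm_log <- lm(formula = log(TE_length) ~ log(abs(log2FoldChange)), data = df_tranTE)
#Summarise the model
summary(lm_log)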
Call:
lm(formula = log(TE_length) ~ log(abs(log2FoldChange)), data = df_tranTE)
Residuals:
Min 1Q Median 3Q Max
-1.9795 -0.4413 -0.1231 0.3103 3.5771
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.37279 0.01388 314.942 <2e-16 ***
log(abs(log2FoldChange)) -0.00166 0.01051 -0.158 0.874
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7447 on 3499 degrees of freedom
(26332 observations deleted due to missingness)
Multiple R-squared: 7.136e-06, Adjusted R-squared: -0.0002787
F-statistic: 0.02497 on 1 and 3499 DF, p-value: 0.8745
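Since the p-value is 0.8745, which is > 0.05, we fail to reject the null hypothesis that TE length has no effect on the strength of differential expression. The model also explains almost none of the variance (multiple R-squared = 7.136e-06).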
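The model above has log(TE_length) as the response, although the question treats TE length as the predictor. For a regression with a single predictor this does not change the test result, as the following sketch on simulated data suggests. The vectors `x` and `y` and all other names are assumptions for illustration and do not come from df_tranTE.

```r
# Simulated data (hypothetical): x stands in for log(TE_length),
# y for log(abs(log2FoldChange))
set.seed(42)
n <- 500
x <- rnorm(n)
y <- 0.1 * x + rnorm(n)

# Fit the regression in both directions
fit_yx <- lm(y ~ x)
fit_xy <- lm(x ~ y)

# Slope p-values and R-squared from each fit
p_yx  <- summary(fit_yx)$coefficients["x", "Pr(>|t|)"]
p_xy  <- summary(fit_xy)$coefficients["y", "Pr(>|t|)"]
r2_yx <- summary(fit_yx)$r.squared
r2_xy <- summary(fit_xy)$r.squared

# Both quantities depend only on the correlation between x and y
same_p  <- isTRUE(all.equal(p_yx, p_xy))
same_r2 <- isTRUE(all.equal(r2_yx, r2_xy))
c(same_p = same_p, same_r2 = same_r2)
```

The slope estimates themselves do differ between the two fits, so the coefficient reported in the summary should be read as the change in log(TE_length) per unit of log(abs(log2FoldChange)), not the other way round.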
#Plot the data with a dashed regression line
ggplot(data = df_tranTE, aes(x = log(TE_length), y = log(abs(log2FoldChange)))) +
geom_point() +
geom_smooth(method = lm, se = FALSE, linetype = "dashed")
We further investigated whether TE length is related to differential expression within the TE class SINE.
#Filter the dataframe to keep SINE transcripts and transcripts without a TE
df_tranTE_SINE <- filter(df_tranTE, TE_class == "SINE" | is.na(TE_class))
#Perform a linear regression on the filtered dataframe
lm_HA_SINE <- lm(formula = log(TE_length) ~ log(abs(log2FoldChange)), data = df_tranTE_SINE)
#Summary
summary(lm_HA_SINE)
Call:
lm(formula = log(TE_length) ~ log(abs(log2FoldChange)), data = df_tranTE_SINE)
Residuals:
Min 1Q Median 3Q Max
-1.24478 -0.34758 -0.06514 0.50506 0.76118
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.40603 0.11901 37.022 <2e-16 ***
log(abs(log2FoldChange)) 0.03007 0.09471 0.318 0.753
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5363 on 29 degrees of freedom
(26332 observations deleted due to missingness)
Multiple R-squared: 0.003465, Adjusted R-squared: -0.0309
F-statistic: 0.1008 on 1 and 29 DF, p-value: 0.7531
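Because the p-value is 0.7531, which is > 0.05, we fail to reject the null hypothesis. We found no evidence that SINE length is related to the strength of differential expression, although this model is based on only 31 SINE-containing transcripts (29 residual degrees of freedom).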
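With so few observations, a confidence interval for the slope shows how much uncertainty remains more clearly than the p-value alone. The sketch below uses simulated data with the same number of observations as the SINE model; `sim`, `log_fc`, `log_len` and the simulated values are assumptions for illustration only.

```r
# Simulated data (hypothetical) with n = 31, matching the size of the SINE subset
set.seed(7)
n <- 31
sim <- data.frame(
  log_fc  = rnorm(n),                        # stands in for log(abs(log2FoldChange))
  log_len = rnorm(n, mean = 4.4, sd = 0.5)   # stands in for log(TE_length)
)

# Same model shape as lm_HA_SINE
fit <- lm(log_len ~ log_fc, data = sim)

# 95% confidence interval for the slope, and the point estimate
ci    <- confint(fit, "log_fc", level = 0.95)
slope <- coef(fit)[["log_fc"]]
ci
slope
```

An interval that includes zero is consistent with a non-significant slope, and a wide interval signals that moderate positive or negative effects cannot be ruled out.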
#Plot the data with a dashed regression line
ggplot(data = df_tranTE_SINE, aes(x = log(TE_length), y = log(abs(log2FoldChange)))) +
geom_point() +
geom_smooth(method = lm, se = FALSE, linetype = "dashed")
2 Conclusion
In summary, the results support our hypotheses only in part. Transcripts with TEs had a significantly higher mean log(abs(log2FoldChange)) than transcripts without TEs (p = 2.379e-11), so the null hypothesis of equal means was rejected. When the comparison was restricted to the TE class SINE, there was no significant difference between transcripts with SINE elements and transcripts without TEs (p = 0.851), so we failed to reject the null hypothesis. When investigating whether TE length has an effect on expression, we found no significant relationship between TE length and the strength of differential expression (p = 0.8745), and again failed to reject the null hypothesis. The same was true for the length of SINE elements (p = 0.7531). Having a TE is therefore associated with stronger differential expression, but we found no evidence that TE length or the SINE class contributes to this effect.