Project Report

Author

Tyla Perry

Published

May 9, 2023

1 Introduction

See here for the background of the project.

Here, we hypothesize that TEs contribute to the differentiation between queens and workers in eusocial shrimps. We hypothesized that transcripts with TEs are more strongly differentially expressed in queens, and that transcripts with longer TEs are more strongly differentially expressed. When investigating different TE classes, we hypothesized that the TE class SINE has an effect on longer TEs.

1.1

We expect transcripts that have TEs to be more strongly differentially expressed. We also expect transcripts with longer TEs to be more strongly differentially expressed between workers and queens. When looking at the TE class SINE, we expect to see longer TEs.

Apart from the contribution of all TEs, we will also test whether SINE makes a significant contribution to worker-queen differentiation. Transposable elements (TEs) occupy close to half (46%) of the human genome, making the TE content of our genome one of the highest among mammals. The only mobile elements in the human genome with clear evidence of current retrotranspositional activity are the autonomous long interspersed element-1 (LINE-1 or L1) and its non-autonomous partners, such as “SINE-R, VNTR, and Alu” (SVA) and the short interspersed element (SINE) Alu (Belancio et al., 2009).

https://doi.org/10.1186/gm97
# plyr is loaded before dplyr so that it does not mask dplyr functions
library("plyr")
library("tidyr")
library("dplyr")
library("ggpubr")
library("stringr")

1.2 Are transcripts that have TE more strongly differentially expressed in queens?

We performed a Welch two-sample t-test with the binary categorical predictor TE (transcript with or without a TE) and the continuous response variable log2FoldChange, transformed as log(abs(log2FoldChange)).
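The response used in the tests is log(abs(log2FoldChange)) rather than the raw value. The short sketch below, using made-up fold changes (the vector `lfc` is hypothetical and not taken from the project data), illustrates what that transformation does: the sign, which encodes the direction of the change, is discarded, and only the strength of the change is kept.

```r
# Hypothetical log2 fold changes: the sign gives the direction of the change,
# the absolute value gives its strength.
lfc <- c(-2, -0.5, 0.5, 2)

strength <- abs(lfc)           # direction is discarded
log_strength <- log(strength)  # natural log of the strength

# A fold change of exactly zero would map to -Inf and would need handling
# before a t-test or regression.
log_zero <- log(abs(0))

print(log_strength)
print(log_zero)
```

Because of the absolute value, a transcript that is strongly up-regulated and one that is equally strongly down-regulated receive the same transformed value.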

#t-test

t.test(log(abs(log2FoldChange))~TE, df_tranTE)

    Welch Two Sample t-test

data:  log(abs(log2FoldChange)) by TE
t = -6.6975, df = 4541.1, p-value = 2.379e-11
alternative hypothesis: true difference in means between group Not and group TE is not equal to 0
95 percent confidence interval:
 -0.1872263 -0.1024362
sample estimates:
mean in group Not  mean in group TE 
       -0.7029916        -0.5581603 

Based on the results, transcripts with TEs have a higher mean log(abs(log2FoldChange)) than transcripts without TEs (-0.558 versus -0.703). Since the p-value is 2.379e-11, which is < 0.05, we reject the null hypothesis that the two group means are equal.
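As a minimal sketch of how the quantities used in this interpretation can be read directly from the test object, the example below runs the same kind of Welch t-test on simulated data. The data frame `sim` and all of its values are invented for illustration and are not the project data.

```r
# Simulated stand-in for df_tranTE; the values are made up for illustration only.
set.seed(1)
sim <- data.frame(
  TE = rep(c("Not", "TE"), times = c(200, 100)),
  log2FoldChange = c(rnorm(200, mean = 0, sd = 1), rnorm(100, mean = 0, sd = 1.5))
)

# Same model formula as in the report: Welch two-sample t-test by TE group.
res <- t.test(log(abs(log2FoldChange)) ~ TE, data = sim)

# Pull out the pieces needed for the interpretation.
p_value <- res$p.value
group_means <- res$estimate    # named vector with one mean per group
reject_null <- p_value < 0.05  # TRUE means the group means differ significantly

print(group_means)
print(reject_null)
```

A p-value below 0.05 leads to rejecting the null hypothesis of equal means; a p-value above 0.05 means the null hypothesis cannot be rejected.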

#Create a box plot

ggplot(df_tranTE , aes(TE , log(abs(log2FoldChange)))) + 

  geom_boxplot() 

We further investigated whether the TE class SINE has an effect on differential expression.

#Filter df_tranTE to keep transcripts with a SINE or with no TE (TE_class is NA)
df_tranTE_SINE=filter(df_tranTE,TE_class=="SINE"|is.na(TE_class))


#Perform a t-test based on the new dataframe
t.test(log(abs(log2FoldChange))~TE,df_tranTE_SINE )

    Welch Two Sample t-test

data:  log(abs(log2FoldChange)) by TE
t = 0.18889, df = 30.1, p-value = 0.8514
alternative hypothesis: true difference in means between group Not and group TE is not equal to 0
95 percent confidence interval:
 -0.3443377  0.4145348
sample estimates:
mean in group Not  mean in group TE 
       -0.7029916        -0.7380902 

Based on the results, there is no evidence that transcripts with a SINE differ in log(abs(log2FoldChange)) from transcripts without a TE. Since the p-value is 0.851, which is > 0.05, we fail to reject the null hypothesis.
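The SINE comparison depends on the filter keeping the transcripts whose TE_class is missing, since those form the comparison group. The sketch below shows why the is.na() term is needed, using a made-up five-row data frame (`toy` is hypothetical). Base R subset() is used here as a stand-in for dplyr's filter(); both drop rows where the condition evaluates to NA.

```r
# Made-up miniature of df_tranTE: TE_class is NA when the transcript has no TE.
toy <- data.frame(
  TE = c("Not", "Not", "TE", "TE", "TE"),
  TE_class = c(NA, NA, "SINE", "LINE", "SINE"),
  stringsAsFactors = FALSE
)

# Without the is.na() term, the rows with a missing TE_class are dropped,
# because NA == "SINE" evaluates to NA rather than FALSE.
sine_only <- subset(toy, TE_class == "SINE")

# With the is.na() term, the no-TE transcripts are kept as the comparison group.
sine_vs_none <- subset(toy, TE_class == "SINE" | is.na(TE_class))

# Group sizes are worth checking before a t-test, since a small group
# widens the confidence interval.
print(table(sine_vs_none$TE))
```

Checking the group sizes in this way also helps when reading the test output, because a small group leaves little power to detect a difference.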

#Plot the data
ggplot(df_tranTE_SINE , aes(TE , log(abs(log2FoldChange)))) + 
  geom_boxplot() 

1.3 Are transcripts with longer TE more strongly differentially expressed?

We performed a linear regression relating the continuous variable TE_length to the continuous variable log2FoldChange, with log(TE_length) and log(abs(log2FoldChange)) as the model terms.

#linear regression

lm_log=lm(formula =log(TE_length)~ log(abs(log2FoldChange)) , data =df_tranTE )

#Perform a summary
summary(lm_log)

Call:
lm(formula = log(TE_length) ~ log(abs(log2FoldChange)), data = df_tranTE)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9795 -0.4413 -0.1231  0.3103  3.5771 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)               4.37279    0.01388 314.942   <2e-16 ***
log(abs(log2FoldChange)) -0.00166    0.01051  -0.158    0.874    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7447 on 3499 degrees of freedom
  (26332 observations deleted due to missingness)
Multiple R-squared:  7.136e-06, Adjusted R-squared:  -0.0002787 
F-statistic: 0.02497 on 1 and 3499 DF,  p-value: 0.8745

Since the p-value is 0.8745, which is > 0.05, we fail to reject the null hypothesis that TE length has no effect on the strength of differential expression.
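The question treats TE length as the predictor, while the formula passed to lm() above has log(TE_length) on the left-hand side. In a regression with a single predictor this does not change the p-value of the slope, which is the same whichever variable is treated as the response; only the estimated coefficients differ. The sketch below checks this on simulated data (the data frame `sim` and its values are made up for illustration).

```r
# Simulated stand-in data; values are invented and unrelated to the project.
set.seed(42)
n <- 100
sim <- data.frame(
  TE_length = exp(rnorm(n, mean = 4.4, sd = 0.7)),  # positive lengths
  log2FoldChange = rnorm(n)                         # independent of length
)

# Formula as written in the report: log(TE_length) is the response.
fit_as_written <- lm(log(TE_length) ~ log(abs(log2FoldChange)), data = sim)

# Swapped formula: log(TE_length) is the predictor, matching the question.
fit_swapped <- lm(log(abs(log2FoldChange)) ~ log(TE_length), data = sim)

# The second row of the coefficient table holds the slope in both fits.
p_as_written <- summary(fit_as_written)$coefficients[2, "Pr(>|t|)"]
p_swapped <- summary(fit_swapped)$coefficients[2, "Pr(>|t|)"]

print(c(as_written = p_as_written, swapped = p_swapped))
print(all.equal(p_as_written, p_swapped))
```

The conclusion about significance is therefore unaffected by the direction of the formula, although the slope and intercept reported in the summary should be read with the formula's direction in mind.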

#plot the data 
ggplot(data =df_tranTE , aes(x =log(TE_length) , y =log(abs(log2FoldChange)))) +
  geom_point() +
   geom_smooth(method = lm , se = F, linetype="dashed") 

We further investigated whether the TE class SINE has an effect on longer TEs.

#filter dataframe
df_tranTE_SINE=filter(df_tranTE,TE_class=="SINE"|is.na(TE_class))

#Perform a linear regression based on the filtered dataframe
lm_HA_SINE =lm(formula =log(TE_length)~ log(abs(log2FoldChange)) ,data = df_tranTE_SINE)
#Summary
summary(lm_HA_SINE)

Call:
lm(formula = log(TE_length) ~ log(abs(log2FoldChange)), data = df_tranTE_SINE)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.24478 -0.34758 -0.06514  0.50506  0.76118 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)               4.40603    0.11901  37.022   <2e-16 ***
log(abs(log2FoldChange))  0.03007    0.09471   0.318    0.753    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5363 on 29 degrees of freedom
  (26332 observations deleted due to missingness)
Multiple R-squared:  0.003465,  Adjusted R-squared:  -0.0309 
F-statistic: 0.1008 on 1 and 29 DF,  p-value: 0.7531

Because the p-value is 0.7531, which is > 0.05, we fail to reject the null hypothesis. There is no evidence that SINE has an effect.

#Plot the data with a dashed regression line
 ggplot(data =df_tranTE_SINE , aes(x =log(TE_length) , y =log(abs(log2FoldChange)))) +
  geom_point() +
   geom_smooth(method = lm , se = F, linetype="dashed") 

2 Conclusion

In summary, our hypotheses were only partly supported by the results. Transcripts with TEs had a significantly higher mean log(abs(log2FoldChange)) than transcripts without TEs (p = 2.379e-11), so the null hypothesis of no difference was rejected. When the comparison was restricted to the TE class SINE, the difference was not significant (p = 0.8514), so we failed to reject the null hypothesis and found no evidence that SINE affects differential expression. When investigating whether TE length affects expression, the regression slope was not significant (p = 0.8745), so we failed to reject the null hypothesis that TE length has no effect on differential expression. The same was true when the regression was restricted to SINE (p = 0.7531): because the p-value was larger than 0.05, we again failed to reject the null hypothesis.