install.packages("dplyr")
install.packages("tidyr")
R dataframe
1 Aim
This is an introduction for Intro. Bioinformatics student to “play” with dataframes in R.
You will learn to do the following:
extract data from rows and columns in a dataframe
sort a dataframe
subset rows of a dataframe based on values in columns
create new columns based on existing columns
combine dataframes
2 Install & load packages
We will need to install two packages. Once this is done, you won’t need to install again.
But you will need to load the library everytime:
library("dplyr")
library("tidyr")
library("ggplot2")
3 What is a dataframe?
In R, dataframe is a table format. Most of the time, you would import a large spreadsheet, or delimited text file into R as a dataframe. For example, we have imported the .gff3 file as a dataframe in Assignment 8 and 9.
Rerun this if you’re not in the same Project as Assignment 8. If you haven’t, click this link to download the compressed file (.gff3.gz). Then either put it in your working directory or upload it to posit.cloud, so that Rstudio has access to it.
This is an annotation file (.gff3) of Drosophila melanogaster (BDGP6.32).
= read.delim(gzfile("Drosophila_melanogaster.BDGP6.32.109.chr.gff3.gz"), header=F, comment.char="#", col.names = c("seqid", "source", "type", "start" , "end", "score", "strand", "phase", "attributes"))
gff head(gff)
seqid source type start end score strand phase
1 2L BDGP6.32 region 1 23513712 . . .
2 2L FlyBase pseudogene 7529 9484 . + .
3 2L FlyBase pseudogenic_transcript 7529 9484 . + .
4 2L FlyBase exon 7529 8116 . + .
5 2L FlyBase exon 8193 9484 . + .
6 2L FlyBase transposable_element_gene 9726 9859 . + .
attributes
1 ID=region:2L;Alias=AE014134,NT_033779,AE014134.6,NT_033779.5,chr2L
2 ID=gene:FBgn0031208;Name=CR11023;biotype=pseudogene;gene_id=FBgn0031208;logic_name=refseq_import_visible
3 ID=transcript:FBtr0475186;Parent=gene:FBgn0031208;Name=CR11023-RE;biotype=pseudogene;tag=Ensembl_canonical;transcript_id=FBtr0475186
4 Parent=transcript:FBtr0475186;Name=FBtr0475186-E1;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=FBtr0475186-E1;rank=1
5 Parent=transcript:FBtr0475186;Name=FBtr0475186-E2;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=FBtr0475186-E2;rank=2
6 ID=gene:FBti0060580;biotype=transposable_element;gene_id=FBti0060580;logic_name=refseq_import_visible
We can check the class of data by:
class(gff)
[1] "data.frame"
We can summarize a dataframe to get some ideas of what it is:
summary(gff)
seqid source type start
Length:503280 Length:503280 Length:503280 Min. : 1
Class :character Class :character Class :character 1st Qu.: 7222717
Mode :character Mode :character Mode :character Median :12812830
Mean :13210764
3rd Qu.:19174339
Max. :32078883
end score strand phase
Min. : 65 Length:503280 Length:503280 Length:503280
1st Qu.: 7223898 Class :character Class :character Class :character
Median :12814880 Mode :character Mode :character Mode :character
Mean :13212331
3rd Qu.:19174783
Max. :32079331
attributes
Length:503280
Class :character
Mode :character
We can check the column names of a dataframe:
names(gff)
[1] "seqid" "source" "type" "start" "end"
[6] "score" "strand" "phase" "attributes"
We can also check the row names, but typically a dataframe only has numbers as the row names
row.names(gff) %>% head()
[1] "1" "2" "3" "4" "5" "6"
# the %>% head() shows the first few results, since there's a lot of rows here. It's another way to write:
# head(row.names(gff))
For character columns, it’s useful to see the unique characters. For example, if we can to see what are the different type
in gff
:
unique(gff$type)
[1] "region" "pseudogene"
[3] "pseudogenic_transcript" "exon"
[5] "transposable_element_gene" "transposable_element"
[7] "gene" "mRNA"
[9] "three_prime_UTR" "CDS"
[11] "five_prime_UTR" "ncRNA_gene"
[13] "ncRNA" "snoRNA"
[15] "pre_miRNA" "miRNA"
[17] "snRNA" "tRNA"
[19] "rRNA"
We see that the annotation file include many types of genomic elements.
Try to write a code to get the unique value of strand
:
Code
unique(gff$strand)
[1] "." "+" "-"
4 Call rows and columns
Data frames are table, and you can extract data from each rows and columns.
We can do this based on the row/col numbers, which starts from 1 in R.
The convention is dataframe[row, col]
Get the first row:
1,] gff[
seqid source type start end score strand phase
1 2L BDGP6.32 region 1 23513712 . . .
attributes
1 ID=region:2L;Alias=AE014134,NT_033779,AE014134.6,NT_033779.5,chr2L
Get the first column
1]
gff[,# the results are too long to be show here
Using the same format, we can extract, for example, a cell in the third row and the third column
3,3] gff[
[1] "pseudogenic_transcript"
We can even extract a few rows/columns at the same time using the vector c().
For example, to get the first two rows in the third column:
c(1,2), 3] gff[
[1] "region" "pseudogene"
Write a code to extract the cells in the 10th row and the 4th and 5th column:
Code
10, c(4,5)] gff[
start end
10 9839 18570
In many case, you know the column names and it’s more logical to call columns by the column names, following the same convention but with “colname.
Get the 100-105th row of the column “type””
100:105, "type"] gff[
[1] "exon" "CDS" "exon" "CDS" "exon" "CDS"
# Not that 100:105 is the same as c(100, 101, 102, 103, 104, 105)
We have seen this before, but we can use the $
sign if we are specifying just one column.
$seqid %>% head() gff
[1] "2L" "2L" "2L" "2L" "2L" "2L"
5 Sort
There are many ways to sort, but it is easier to use functions in tidyr
For example to sort gff by start
in ascending order:
arrange(gff, start) %>% head()
seqid source type start end score strand phase
1 2L BDGP6.32 region 1 23513712 . . .
2 2R BDGP6.32 region 1 25286936 . . .
3 3L BDGP6.32 region 1 28110227 . . .
4 3R BDGP6.32 region 1 32079331 . . .
5 4 BDGP6.32 region 1 1348131 . . .
6 X BDGP6.32 region 1 23542271 . . .
attributes
1 ID=region:2L;Alias=AE014134,NT_033779,AE014134.6,NT_033779.5,chr2L
2 ID=region:2R;Alias=AE013599,NT_033778,AE013599.5,NT_033778.4,chr2R
3 ID=region:3L;Alias=AE014296,NT_037436,AE014296.5,NT_037436.4,chr3L
4 ID=region:3R;Alias=AE014297,NT_033777,AE014297.3,NT_033777.3,chr3R
5 ID=region:4;Alias=AE014135,NC_004353,AE014135.4,NC_004353.4,chr4
6 ID=region:X;Alias=AE014298,NC_004354,AE014298.5,NC_004354.4,chrX
# for the descending order: arrange(gff, desc(start))
You can also sort by two columns by adding another argument in arrange()
.
To sort first by start
ascendingly, then by end
descendingly:
arrange(gff, start, desc(end)) %>% head()
seqid source type start end score strand phase
1 3R BDGP6.32 region 1 32079331 . . .
2 3L BDGP6.32 region 1 28110227 . . .
3 2R BDGP6.32 region 1 25286936 . . .
4 X BDGP6.32 region 1 23542271 . . .
5 2L BDGP6.32 region 1 23513712 . . .
6 Y BDGP6.32 region 1 3667352 . . .
attributes
1 ID=region:3R;Alias=AE014297,NT_033777,AE014297.3,NT_033777.3,chr3R
2 ID=region:3L;Alias=AE014296,NT_037436,AE014296.5,NT_037436.4,chr3L
3 ID=region:2R;Alias=AE013599,NT_033778,AE013599.5,NT_033778.4,chr2R
4 ID=region:X;Alias=AE014298,NC_004354,AE014298.5,NC_004354.4,chrX
5 ID=region:2L;Alias=AE014134,NT_033779,AE014134.6,NT_033779.5,chr2L
6 ID=region:Y;Alias=CP007106,NC_024512,CP007106.1,NC_024512.1,chrY
6 Subset
Often time, you only want to keep some rows of a dataframe only when a column has certain value.
For example, in gff
, there are many type
, but when we do only want to the data where type
is gene
, we will be subsetting the dataframe:
filter(gff, type=="gene") %>% head()
seqid source type start end score strand phase
1 2L FlyBase gene 9839 21376 . - .
2 2L FlyBase gene 21823 25155 . - .
3 2L FlyBase gene 25402 65404 . - .
4 2L FlyBase gene 66482 71390 . + .
5 2L FlyBase gene 71757 76211 . + .
6 2L FlyBase gene 76348 77783 . + .
attributes
1 ID=gene:FBgn0002121;Name=l(2)gl;biotype=protein_coding;description=lethal (2) giant larvae;gene_id=FBgn0002121;logic_name=refseq_import_visible
2 ID=gene:FBgn0031209;Name=Ir21a;biotype=protein_coding;description=Ionotropic receptor 21a;gene_id=FBgn0031209;logic_name=refseq_import_visible
3 ID=gene:FBgn0051973;Name=Cda5;biotype=protein_coding;description=Chitin deacetylase-like 5;gene_id=FBgn0051973;logic_name=refseq_import_visible
4 ID=gene:FBgn0067779;Name=dbr;biotype=protein_coding;description=debra;gene_id=FBgn0067779;logic_name=refseq_import_visible
5 ID=gene:FBgn0031213;Name=galectin;biotype=protein_coding;description=galectin;gene_id=FBgn0031213;logic_name=refseq_import_visible
6 ID=gene:FBgn0031214;Name=CG11374;biotype=protein_coding;gene_id=FBgn0031214;logic_name=refseq_import_visible
# The code below also works.
# subset(gff, type=="")
Write a code to subset gff
when source
is BDGP6.32
Code
filter(gff, source=="BDGP6.32")
seqid source type start end score strand phase
1 2L BDGP6.32 region 1 23513712 . . .
2 2R BDGP6.32 region 1 25286936 . . .
3 3L BDGP6.32 region 1 28110227 . . .
4 3R BDGP6.32 region 1 32079331 . . .
5 4 BDGP6.32 region 1 1348131 . . .
6 X BDGP6.32 region 1 23542271 . . .
7 Y BDGP6.32 region 1 3667352 . . .
8 mitochondrion_genome BDGP6.32 region 1 19524 . . .
attributes
1 ID=region:2L;Alias=AE014134,NT_033779,AE014134.6,NT_033779.5,chr2L
2 ID=region:2R;Alias=AE013599,NT_033778,AE013599.5,NT_033778.4,chr2R
3 ID=region:3L;Alias=AE014296,NT_037436,AE014296.5,NT_037436.4,chr3L
4 ID=region:3R;Alias=AE014297,NT_033777,AE014297.3,NT_033777.3,chr3R
5 ID=region:4;Alias=AE014135,NC_004353,AE014135.4,NC_004353.4,chr4
6 ID=region:X;Alias=AE014298,NC_004354,AE014298.5,NC_004354.4,chrX
7 ID=region:Y;Alias=CP007106,NC_024512,CP007106.1,NC_024512.1,chrY
8 ID=region:mitochondrion_genome;Alias=KJ947872,KJ947872.2,NC_024511.2,chrM;Is_circular=true
You can subset based on numeric column.
For example keep only when start <200 and type is mRNA
filter(gff, start<300, type=="mRNA")
seqid source type start end score strand phase
1 mitochondrion_genome FlyBase mRNA 240 1263 . + .
attributes
1 ID=transcript:FBtr0100857;Parent=gene:FBgn0013680;Name=mt:ND2-RA;biotype=protein_coding;tag=Ensembl_canonical;transcript_id=FBtr0100857
The above example subset when both conditions apply.
You can also subset when either conditions apply:
filter(gff, start==7529 | end ==9484 )
seqid source type start end score strand phase
1 2L FlyBase pseudogene 7529 9484 . + .
2 2L FlyBase pseudogenic_transcript 7529 9484 . + .
3 2L FlyBase exon 7529 8116 . + .
4 2L FlyBase exon 8193 9484 . + .
attributes
1 ID=gene:FBgn0031208;Name=CR11023;biotype=pseudogene;gene_id=FBgn0031208;logic_name=refseq_import_visible
2 ID=transcript:FBtr0475186;Parent=gene:FBgn0031208;Name=CR11023-RE;biotype=pseudogene;tag=Ensembl_canonical;transcript_id=FBtr0475186
3 Parent=transcript:FBtr0475186;Name=FBtr0475186-E1;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=FBtr0475186-E1;rank=1
4 Parent=transcript:FBtr0475186;Name=FBtr0475186-E2;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=FBtr0475186-E2;rank=2
7 New column
We can create new column based on existing column.
For example to get the length of all elements by the start and end:
mutate(gff, length = abs(end - start)+1) %>% head()
seqid source type start end score strand phase
1 2L BDGP6.32 region 1 23513712 . . .
2 2L FlyBase pseudogene 7529 9484 . + .
3 2L FlyBase pseudogenic_transcript 7529 9484 . + .
4 2L FlyBase exon 7529 8116 . + .
5 2L FlyBase exon 8193 9484 . + .
6 2L FlyBase transposable_element_gene 9726 9859 . + .
attributes
1 ID=region:2L;Alias=AE014134,NT_033779,AE014134.6,NT_033779.5,chr2L
2 ID=gene:FBgn0031208;Name=CR11023;biotype=pseudogene;gene_id=FBgn0031208;logic_name=refseq_import_visible
3 ID=transcript:FBtr0475186;Parent=gene:FBgn0031208;Name=CR11023-RE;biotype=pseudogene;tag=Ensembl_canonical;transcript_id=FBtr0475186
4 Parent=transcript:FBtr0475186;Name=FBtr0475186-E1;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=FBtr0475186-E1;rank=1
5 Parent=transcript:FBtr0475186;Name=FBtr0475186-E2;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=FBtr0475186-E2;rank=2
6 ID=gene:FBti0060580;biotype=transposable_element;gene_id=FBti0060580;logic_name=refseq_import_visible
length
1 23513712
2 1956
3 1956
4 588
5 1292
6 134
# Note that this function export a new dataframe.
# If you want to save this object you have to add
# gff =
# before the code.
8 Combine data frames
Sometimes, you may need to combine two sets of data. Each data may have columns that share some identifier.
We will make up some data to illustrate (and show how to make a dataframe):
# Let's say we have data from two classes of students and their grades
= data.frame(student = c( "Denzel", "Claire", "Jack","Quisha", "Carlos", "Julio", "Hiromi", "Jiang", "Isha" ), grade = sample(1:100, 9))
class1
= data.frame(student = c( "Denzel", "Jack","Quisha", "Carlos", "Julio"), grade = sample(1:100, 5)) class2
class1
student grade
1 Denzel 30
2 Claire 69
3 Jack 93
4 Quisha 3
5 Carlos 88
6 Julio 81
7 Hiromi 62
8 Jiang 87
9 Isha 70
class2
student grade
1 Denzel 35
2 Jack 93
3 Quisha 53
4 Carlos 9
5 Julio 56
We make want to get the average grade of students that are in both classes.
To do that, we can first combine the two dataframes:
= left_join(x = class2, y = class1, by = "student")
class1n2 # left_join keeps rows in x
# while right_join keeps rows in y
# so we can also do
# right_join(class1, class2, "student")
class1n2
student grade.x grade.y
1 Denzel 35 30
2 Jack 93 93
3 Quisha 53 3
4 Carlos 9 88
5 Julio 56 81
Then we can make a new column to calculate the average grade:
mutate(class1n2, avg_grade = (grade.x + grade.y)/2)
student grade.x grade.y avg_grade
1 Denzel 35 30 32.5
2 Jack 93 93 93.0
3 Quisha 53 3 28.0
4 Carlos 9 88 48.5
5 Julio 56 81 68.5
9 Assignment
To practice the skills above and last week, we will test whether the length of the 3’ and 5’ UTRs are different in D. melanogaster.
Write your codes with clear annotations using #
to do the following. Run your set of scripts in the R console and submit the codes and output as an assignment (you can copy and paste it over).
Your submission should look something like this:
# 1. Check col names
names(gff)
## [1] "seqid" "source" "type" "start" "end"
## [6] "score" "strand" "phase" "attributes"
# 2. Check unique values of type
unique(gff$type)
## [1] "region" "pseudogene"
## [3] "pseudogenic_transcript" "exon"
## [5] "transposable_element_gene" "transposable_element"
## [7] "gene" "mRNA"
## [9] "three_prime_UTR" "CDS"
## [11] "five_prime_UTR" "ncRNA_gene"
## [13] "ncRNA" "snoRNA"
## [15] "pre_miRNA" "miRNA"
## [17] "snRNA" "tRNA"
## [19] "rRNA"
# 3. do anova
summary(aov(start~type, gff))
## Df Sum Sq Mean Sq F value Pr(>F)
## type 18 1.588e+16 8.824e+14 15.29 <2e-16 ***
## Residuals 503261 2.905e+19 5.773e+13
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Subset
gff
into a new object calledgff_utr
based on eithertype=="five_prime_UTR"
ortype=="three_prime_UTR"
# Check your data nrow(gff_utr)==80246 # This should return TRUE
Overwrite
gff_utr
to add a new column calledlength
based on the absolute value ofend - start
+ 1. Note that your command may display the mutated dataframe in the console, but not saving it asgff_utr
. You may need to add something likegff_utr =
in front of your command.
# Check your data
sum(gff_utr$length)==27656385
# This should return TRUE
Overwrite
gff_utr
to add another new column calledlog_length
based onlog10(length)
.# Now your dataframe gff_utr has two unique values in the column type. # You can check this by: unique(gff_utr$type)
[1] "three_prime_UTR" "five_prime_UTR"
# You should also have two columns called "length" and log_length" # You can check this by: names(gff_utr)
[1] "seqid" "source" "type" "start" "end" [6] "score" "strand" "phase" "attributes" "length" [11] "log_length"
# If everything are good, proceed to the next steps.
Do a
t.test()
to test whether thelog_length
is different among the two categories oftype
(which hasfive_prime_UTR
andfive_prime_UTR
). Write a summary statement after your # in your code.Plot
log_length
againsttype
using a boxplot. DO NOT overlay points with geom_point().