R dataframe

Author

Solomon Chak

Published

March 6, 2023

1 Aim

This is an introduction for Intro. Bioinformatics student to “play” with dataframes in R.

You will learn to do the following:

extract data from rows and columns in a dataframe
sort a dataframe
subset rows of a dataframe based on values in columns
create new columns based on existing columns
combine dataframes

2 Install & load packages

We will need to install two packages. Once this is done, you won’t need to install again.

install.packages("dplyr")
install.packages("tidyr")

But you will need to load the library everytime:


library("dplyr")
library("tidyr")
library("ggplot2")

3 What is a dataframe?

In R, dataframe is a table format. Most of the time, you would import a large spreadsheet, or delimited text file into R as a dataframe. For example, we have imported the .gff3 file as a dataframe in Assignment 8 and 9.

Rerun this if you’re not in the same Project as Assignment 8. If you haven’t, click this link to download the compressed file (.gff3.gz). Then either put it in your working directory or upload it to posit.cloud, so that Rstudio has access to it.

This is an annotation file (.gff3) of Drosophila melanogaster (BDGP6.32).

gff = read.delim(gzfile("Drosophila_melanogaster.BDGP6.32.109.chr.gff3.gz"), header=F, comment.char="#", col.names = c("seqid", "source", "type", "start" , "end",  "score", "strand",  "phase", "attributes"))
head(gff)

  seqid   source                      type start      end score strand phase
1    2L BDGP6.32                    region     1 23513712     .      .     .
2    2L  FlyBase                pseudogene  7529     9484     .      +     .
3    2L  FlyBase    pseudogenic_transcript  7529     9484     .      +     .
4    2L  FlyBase                      exon  7529     8116     .      +     .
5    2L  FlyBase                      exon  8193     9484     .      +     .
6    2L  FlyBase transposable_element_gene  9726     9859     .      +     .
                                                                                                                            attributes
1                                                                   ID=region:2L;Alias=AE014134,NT_033779,AE014134.6,NT_033779.5,chr2L
2                             ID=gene:FBgn0031208;Name=CR11023;biotype=pseudogene;gene_id=FBgn0031208;logic_name=refseq_import_visible
3 ID=transcript:FBtr0475186;Parent=gene:FBgn0031208;Name=CR11023-RE;biotype=pseudogene;tag=Ensembl_canonical;transcript_id=FBtr0475186
4 Parent=transcript:FBtr0475186;Name=FBtr0475186-E1;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=FBtr0475186-E1;rank=1
5 Parent=transcript:FBtr0475186;Name=FBtr0475186-E2;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=FBtr0475186-E2;rank=2
6                                ID=gene:FBti0060580;biotype=transposable_element;gene_id=FBti0060580;logic_name=refseq_import_visible

We can check the class of data by:

class(gff)

[1] "data.frame"

We can summarize a dataframe to get some ideas of what it is:

summary(gff)

    seqid              source              type               start         
 Length:503280      Length:503280      Length:503280      Min.   :       1  
 Class :character   Class :character   Class :character   1st Qu.: 7222717  
 Mode  :character   Mode  :character   Mode  :character   Median :12812830  
                                                          Mean   :13210764  
                                                          3rd Qu.:19174339  
                                                          Max.   :32078883  
      end              score              strand             phase          
 Min.   :      65   Length:503280      Length:503280      Length:503280     
 1st Qu.: 7223898   Class :character   Class :character   Class :character  
 Median :12814880   Mode  :character   Mode  :character   Mode  :character  
 Mean   :13212331                                                           
 3rd Qu.:19174783                                                           
 Max.   :32079331                                                           
  attributes       
 Length:503280     
 Class :character  
 Mode  :character

We can check the column names of a dataframe:

names(gff)

[1] "seqid"      "source"     "type"       "start"      "end"       
[6] "score"      "strand"     "phase"      "attributes"

We can also check the row names, but typically a dataframe only has numbers as the row names

row.names(gff) %>% head()

[1] "1" "2" "3" "4" "5" "6"

# the %>% head() shows the first few results, since there's a lot of rows here. It's another way to write:
# head(row.names(gff))

For character columns, it’s useful to see the unique characters. For example, if we can to see what are the different type in gff:

unique(gff$type)

 [1] "region"                    "pseudogene"               
 [3] "pseudogenic_transcript"    "exon"                     
 [5] "transposable_element_gene" "transposable_element"     
 [7] "gene"                      "mRNA"                     
 [9] "three_prime_UTR"           "CDS"                      
[11] "five_prime_UTR"            "ncRNA_gene"               
[13] "ncRNA"                     "snoRNA"                   
[15] "pre_miRNA"                 "miRNA"                    
[17] "snRNA"                     "tRNA"                     
[19] "rRNA"

We see that the annotation file include many types of genomic elements.

Try to write a code to get the unique value of strand:

Code

unique(gff$strand)

[1] "." "+" "-"

4 Call rows and columns

Data frames are table, and you can extract data from each rows and columns.

We can do this based on the row/col numbers, which starts from 1 in R.

The convention is dataframe[row, col]

Get the first row:

gff[1,]

  seqid   source   type start      end score strand phase
1    2L BDGP6.32 region     1 23513712     .      .     .
                                                          attributes
1 ID=region:2L;Alias=AE014134,NT_033779,AE014134.6,NT_033779.5,chr2L

Get the first column

gff[,1]
# the results are too long to be show here

Using the same format, we can extract, for example, a cell in the third row and the third column

gff[3,3]

[1] "pseudogenic_transcript"

We can even extract a few rows/columns at the same time using the vector c().

For example, to get the first two rows in the third column:

gff[c(1,2), 3]

[1] "region"     "pseudogene"

Write a code to extract the cells in the 10th row and the 4th and 5th column:

Code

gff[10, c(4,5)]

   start   end
10  9839 18570

In many case, you know the column names and it’s more logical to call columns by the column names, following the same convention but with “colname.

Get the 100-105th row of the column “type””

gff[100:105, "type"]

[1] "exon" "CDS"  "exon" "CDS"  "exon" "CDS"

# Not that 100:105 is the same as c(100, 101, 102, 103, 104, 105)

We have seen this before, but we can use the $ sign if we are specifying just one column.

gff$seqid  %>% head()

[1] "2L" "2L" "2L" "2L" "2L" "2L"

5 Sort

There are many ways to sort, but it is easier to use functions in tidyr

For example to sort gff by start in ascending order:

arrange(gff, start)  %>% head()

  seqid   source   type start      end score strand phase
1    2L BDGP6.32 region     1 23513712     .      .     .
2    2R BDGP6.32 region     1 25286936     .      .     .
3    3L BDGP6.32 region     1 28110227     .      .     .
4    3R BDGP6.32 region     1 32079331     .      .     .
5     4 BDGP6.32 region     1  1348131     .      .     .
6     X BDGP6.32 region     1 23542271     .      .     .
                                                          attributes
1 ID=region:2L;Alias=AE014134,NT_033779,AE014134.6,NT_033779.5,chr2L
2 ID=region:2R;Alias=AE013599,NT_033778,AE013599.5,NT_033778.4,chr2R
3 ID=region:3L;Alias=AE014296,NT_037436,AE014296.5,NT_037436.4,chr3L
4 ID=region:3R;Alias=AE014297,NT_033777,AE014297.3,NT_033777.3,chr3R
5   ID=region:4;Alias=AE014135,NC_004353,AE014135.4,NC_004353.4,chr4
6   ID=region:X;Alias=AE014298,NC_004354,AE014298.5,NC_004354.4,chrX

# for the descending order: arrange(gff, desc(start))

You can also sort by two columns by adding another argument in arrange().

To sort first by start ascendingly, then by end descendingly:

arrange(gff, start, desc(end))  %>% head()

  seqid   source   type start      end score strand phase
1    3R BDGP6.32 region     1 32079331     .      .     .
2    3L BDGP6.32 region     1 28110227     .      .     .
3    2R BDGP6.32 region     1 25286936     .      .     .
4     X BDGP6.32 region     1 23542271     .      .     .
5    2L BDGP6.32 region     1 23513712     .      .     .
6     Y BDGP6.32 region     1  3667352     .      .     .
                                                          attributes
1 ID=region:3R;Alias=AE014297,NT_033777,AE014297.3,NT_033777.3,chr3R
2 ID=region:3L;Alias=AE014296,NT_037436,AE014296.5,NT_037436.4,chr3L
3 ID=region:2R;Alias=AE013599,NT_033778,AE013599.5,NT_033778.4,chr2R
4   ID=region:X;Alias=AE014298,NC_004354,AE014298.5,NC_004354.4,chrX
5 ID=region:2L;Alias=AE014134,NT_033779,AE014134.6,NT_033779.5,chr2L
6   ID=region:Y;Alias=CP007106,NC_024512,CP007106.1,NC_024512.1,chrY

6 Subset

Often time, you only want to keep some rows of a dataframe only when a column has certain value.

For example, in gff , there are many type, but when we do only want to the data where type is gene, we will be subsetting the dataframe:

filter(gff, type=="gene") %>% head()

  seqid  source type start   end score strand phase
1    2L FlyBase gene  9839 21376     .      -     .
2    2L FlyBase gene 21823 25155     .      -     .
3    2L FlyBase gene 25402 65404     .      -     .
4    2L FlyBase gene 66482 71390     .      +     .
5    2L FlyBase gene 71757 76211     .      +     .
6    2L FlyBase gene 76348 77783     .      +     .
                                                                                                                                       attributes
1 ID=gene:FBgn0002121;Name=l(2)gl;biotype=protein_coding;description=lethal (2) giant larvae;gene_id=FBgn0002121;logic_name=refseq_import_visible
2  ID=gene:FBgn0031209;Name=Ir21a;biotype=protein_coding;description=Ionotropic receptor 21a;gene_id=FBgn0031209;logic_name=refseq_import_visible
3 ID=gene:FBgn0051973;Name=Cda5;biotype=protein_coding;description=Chitin deacetylase-like 5;gene_id=FBgn0051973;logic_name=refseq_import_visible
4                      ID=gene:FBgn0067779;Name=dbr;biotype=protein_coding;description=debra;gene_id=FBgn0067779;logic_name=refseq_import_visible
5              ID=gene:FBgn0031213;Name=galectin;biotype=protein_coding;description=galectin;gene_id=FBgn0031213;logic_name=refseq_import_visible
6                                    ID=gene:FBgn0031214;Name=CG11374;biotype=protein_coding;gene_id=FBgn0031214;logic_name=refseq_import_visible

# The code below also works.
# subset(gff, type=="")

Write a code to subset gff when source is BDGP6.32

Code

filter(gff, source=="BDGP6.32")

                 seqid   source   type start      end score strand phase
1                   2L BDGP6.32 region     1 23513712     .      .     .
2                   2R BDGP6.32 region     1 25286936     .      .     .
3                   3L BDGP6.32 region     1 28110227     .      .     .
4                   3R BDGP6.32 region     1 32079331     .      .     .
5                    4 BDGP6.32 region     1  1348131     .      .     .
6                    X BDGP6.32 region     1 23542271     .      .     .
7                    Y BDGP6.32 region     1  3667352     .      .     .
8 mitochondrion_genome BDGP6.32 region     1    19524     .      .     .
                                                                                  attributes
1                         ID=region:2L;Alias=AE014134,NT_033779,AE014134.6,NT_033779.5,chr2L
2                         ID=region:2R;Alias=AE013599,NT_033778,AE013599.5,NT_033778.4,chr2R
3                         ID=region:3L;Alias=AE014296,NT_037436,AE014296.5,NT_037436.4,chr3L
4                         ID=region:3R;Alias=AE014297,NT_033777,AE014297.3,NT_033777.3,chr3R
5                           ID=region:4;Alias=AE014135,NC_004353,AE014135.4,NC_004353.4,chr4
6                           ID=region:X;Alias=AE014298,NC_004354,AE014298.5,NC_004354.4,chrX
7                           ID=region:Y;Alias=CP007106,NC_024512,CP007106.1,NC_024512.1,chrY
8 ID=region:mitochondrion_genome;Alias=KJ947872,KJ947872.2,NC_024511.2,chrM;Is_circular=true

You can subset based on numeric column.

For example keep only when start <200 and type is mRNA

filter(gff, start<300, type=="mRNA")

                 seqid  source type start  end score strand phase
1 mitochondrion_genome FlyBase mRNA   240 1263     .      +     .
                                                                                                                               attributes
1 ID=transcript:FBtr0100857;Parent=gene:FBgn0013680;Name=mt:ND2-RA;biotype=protein_coding;tag=Ensembl_canonical;transcript_id=FBtr0100857

The above example subset when both conditions apply.

You can also subset when either conditions apply:

filter(gff, start==7529 | end ==9484 )

  seqid  source                   type start  end score strand phase
1    2L FlyBase             pseudogene  7529 9484     .      +     .
2    2L FlyBase pseudogenic_transcript  7529 9484     .      +     .
3    2L FlyBase                   exon  7529 8116     .      +     .
4    2L FlyBase                   exon  8193 9484     .      +     .
                                                                                                                            attributes
1                             ID=gene:FBgn0031208;Name=CR11023;biotype=pseudogene;gene_id=FBgn0031208;logic_name=refseq_import_visible
2 ID=transcript:FBtr0475186;Parent=gene:FBgn0031208;Name=CR11023-RE;biotype=pseudogene;tag=Ensembl_canonical;transcript_id=FBtr0475186
3 Parent=transcript:FBtr0475186;Name=FBtr0475186-E1;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=FBtr0475186-E1;rank=1
4 Parent=transcript:FBtr0475186;Name=FBtr0475186-E2;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=FBtr0475186-E2;rank=2

7 New column

We can create new column based on existing column.

For example to get the length of all elements by the start and end:

mutate(gff, length = abs(end - start)+1) %>% head()

  seqid   source                      type start      end score strand phase
1    2L BDGP6.32                    region     1 23513712     .      .     .
2    2L  FlyBase                pseudogene  7529     9484     .      +     .
3    2L  FlyBase    pseudogenic_transcript  7529     9484     .      +     .
4    2L  FlyBase                      exon  7529     8116     .      +     .
5    2L  FlyBase                      exon  8193     9484     .      +     .
6    2L  FlyBase transposable_element_gene  9726     9859     .      +     .
                                                                                                                            attributes
1                                                                   ID=region:2L;Alias=AE014134,NT_033779,AE014134.6,NT_033779.5,chr2L
2                             ID=gene:FBgn0031208;Name=CR11023;biotype=pseudogene;gene_id=FBgn0031208;logic_name=refseq_import_visible
3 ID=transcript:FBtr0475186;Parent=gene:FBgn0031208;Name=CR11023-RE;biotype=pseudogene;tag=Ensembl_canonical;transcript_id=FBtr0475186
4 Parent=transcript:FBtr0475186;Name=FBtr0475186-E1;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=FBtr0475186-E1;rank=1
5 Parent=transcript:FBtr0475186;Name=FBtr0475186-E2;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=FBtr0475186-E2;rank=2
6                                ID=gene:FBti0060580;biotype=transposable_element;gene_id=FBti0060580;logic_name=refseq_import_visible
    length
1 23513712
2     1956
3     1956
4      588
5     1292
6      134

# Note that this function export a new dataframe.
# If you want to save this object you have to add 
# gff = 
# before the code.

8 Combine data frames

Sometimes, you may need to combine two sets of data. Each data may have columns that share some identifier.

We will make up some data to illustrate (and show how to make a dataframe):

# Let's say we have data from two classes of students and their grades
class1 = data.frame(student = c( "Denzel", "Claire", "Jack","Quisha", "Carlos", "Julio",  "Hiromi", "Jiang",  "Isha" ), grade = sample(1:100, 9))

class2 = data.frame(student = c( "Denzel", "Jack","Quisha", "Carlos", "Julio"), grade = sample(1:100, 5))

class1

  student grade
1  Denzel    30
2  Claire    69
3    Jack    93
4  Quisha     3
5  Carlos    88
6   Julio    81
7  Hiromi    62
8   Jiang    87
9    Isha    70

class2

  student grade
1  Denzel    35
2    Jack    93
3  Quisha    53
4  Carlos     9
5   Julio    56

We make want to get the average grade of students that are in both classes.

To do that, we can first combine the two dataframes:

class1n2 = left_join(x = class2, y = class1, by = "student")
# left_join keeps rows in x
# while right_join keeps rows in y
# so we can also do
# right_join(class1, class2, "student")

class1n2

  student grade.x grade.y
1  Denzel      35      30
2    Jack      93      93
3  Quisha      53       3
4  Carlos       9      88
5   Julio      56      81

Then we can make a new column to calculate the average grade:

mutate(class1n2, avg_grade = (grade.x + grade.y)/2)

  student grade.x grade.y avg_grade
1  Denzel      35      30      32.5
2    Jack      93      93      93.0
3  Quisha      53       3      28.0
4  Carlos       9      88      48.5
5   Julio      56      81      68.5

9 Assignment

To practice the skills above and last week, we will test whether the length of the 3’ and 5’ UTRs are different in D. melanogaster.

Write your codes with clear annotations using # to do the following. Run your set of scripts in the R console and submit the codes and output as an assignment (you can copy and paste it over).

Your submission should look something like this:


# 1. Check col names
names(gff)
## [1] "seqid"      "source"     "type"       "start"      "end"       
## [6] "score"      "strand"     "phase"      "attributes"

# 2. Check unique values of type
unique(gff$type)
##  [1] "region"                    "pseudogene"               
##  [3] "pseudogenic_transcript"    "exon"                     
##  [5] "transposable_element_gene" "transposable_element"     
##  [7] "gene"                      "mRNA"                     
##  [9] "three_prime_UTR"           "CDS"                      
## [11] "five_prime_UTR"            "ncRNA_gene"               
## [13] "ncRNA"                     "snoRNA"                   
## [15] "pre_miRNA"                 "miRNA"                    
## [17] "snRNA"                     "tRNA"                     
## [19] "rRNA"

# 3. do anova
summary(aov(start~type, gff))
##                 Df    Sum Sq   Mean Sq F value Pr(>F)    
## type            18 1.588e+16 8.824e+14   15.29 <2e-16 ***
## Residuals   503261 2.905e+19 5.773e+13                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Subset gff into a new object called gff_utr based on either type=="five_prime_UTR" or type=="three_prime_UTR"
```
# Check your data
nrow(gff_utr)==80246
# This should return TRUE
```
Overwrite gff_utr to add a new column called length based on the absolute value of end - start + 1. Note that your command may display the mutated dataframe in the console, but not saving it as gff_utr. You may need to add something like gff_utr =in front of your command.

# Check your data
sum(gff_utr$length)==27656385
# This should return TRUE

Overwrite gff_utr to add another new column called log_length based on log10(length).

# Now your dataframe gff_utr has two unique values in the column type. # You can check this by: 
unique(gff_utr$type)

[1] "three_prime_UTR" "five_prime_UTR"

# You should also have two columns called "length" and log_length"
# You can check this by:
names(gff_utr)

 [1] "seqid"      "source"     "type"       "start"      "end"       
 [6] "score"      "strand"     "phase"      "attributes" "length"    
[11] "log_length"

# If everything are good, proceed to the next steps.

Do a t.test() to test whether the log_length is different among the two categories of type (which has five_prime_UTR and five_prime_UTR). Write a summary statement after your # in your code.
Plot log_length against type using a boxplot. DO NOT overlay points with geom_point().