Abstract
MicroRNAs are derived from endogenous stem-loop precursors and play important roles in various biological processes. Next-generation sequencing data suggest that both 5p-arm and 3p-arm mature miRNAs can be generated from a single miRNA hairpin precursor; however, the current miRNA databases fail to provide comprehensive arm annotation features, which could result in ambiguous and incomplete analyses. In the present report, we have annotated over 99.7% of miRNAs with the correct 5p-arm and 3p-arm features. The length distribution of all annotated miRNAs peaks at around 22 nucleotides; however, 5p-arm miRNAs tend to be longer than 3p-arm miRNAs, which is most evident in the 23-nucleotide group. Our study generates a comprehensive miRNA arm-feature annotation that can be utilized for better interrogation of miRNAs. In further analysis of human gastric cancer tissues, we identified 38 down-regulated and 22 up-regulated arm-specific miRNAs using this new comprehensive miRNA list.
MicroRNAs (miRNAs) are endogenous small non-protein-coding RNAs of 21-23 nucleotides in length (1, 2). They were initially discovered in Caenorhabditis elegans, and thousands more have since been identified in many organisms, including humans, other mammals, invertebrates, insects, plants and viruses (3-5). It is essential to study miRNAs in many biological systems due to their biological significance as critical modulators of gene expression. miRNAs play important roles in cellular physiology, development, and disease by negatively regulating their target genes (6). It is well known that miRNAs regulate their target genes by binding the 3’-untranslated region (3’-UTR) of the target transcript, mainly through base pairing of the seed region of the mature miRNA (7). During miRNA biogenesis, miRNAs are transcribed from their parental gene loci, and the primary miRNA transcripts (pri-miRNAs) subsequently fold into hairpins (i.e. the signature secondary structure of miRNAs). These pri-miRNAs are then cleaved into precursor miRNAs (pre-miRNAs) by the Drosha enzyme before being exported to the cytoplasm through the exportin 5 pathway (8, 9). The pre-miRNAs are then cleaved by Dicer to generate the mature miRNA duplex (at the stem region of the hairpin structure) (10, 11). In the final stage of miRNA maturation, one strand (i.e. the guide strand) of the mature miRNA duplex is preferentially selected for Argonaute binding to form the ultimate RNA-induced silencing complex (RISC), which targets the mRNA transcripts for modulation (12, 13). The other strand of the duplex, however, is often degraded quickly and is present in low amounts inside cells (12). Such strands have therefore been called either the ‘star strand’ (miRNA*) or the ‘passenger strand’; however, it is now recognized that they can still be incorporated into the functional RISC complex and possess miRNA functionality (14).
With the increasing depth of Next Generation Sequencing (NGS) data, more and more experimental evidence is indicating that both strands of the mature miRNA duplex can be identified. Therefore, miRBase, the well-known repository for all reported miRNAs, has announced that it has stopped using the miR* annotation and has instead opted for the assignment of arm features miR-XXX-5p and miR-XXX-3p for mature miRNA sequences derived from the 5’ and 3’ arms of the mature miRNA hairpin duplex (15).
miRBase was first established in 2002 and is now the foremost database of miRNA sequences and annotations for researchers interrogating various aspects of miRNAs (16, 17). The current release of the miRBase registry (release 21, 2014) contains 28,645 reported miRNA loci in 223 species. While many of the miRNAs have been annotated with arm features (5p or 3p arms), not all miRNAs in the miRBase annotation are assigned their respective arm features. Some miRNAs may have sequence information for only one strand of the mature miRNA, so there has been no urgent need to assign an arm feature to them. Nonetheless, with the increasing depth of NGS data, it has become clear that both strands of the pre-miRNA hairpin duplex can be processed and selected as mature miRNAs (18, 19). It would be beneficial to have the correct arm feature assigned for all known miRNAs, because insufficient arm feature annotation can lead to ambiguous interpretation of miRNA data. For example, literature reports and technical bulletins often use miRNA nomenclature according to their own preferences, such as hsa-miR-21, hsa-miR-21-5p or hsa-miR-21-3p. As a result, miRNA expression array data and NGS data can be difficult to interpret, depending on the nomenclature used by different platforms and reports. Without detailed arm feature assignment, some biologists might confuse hsa-miR-21, hsa-miR-21-5p and hsa-miR-21-3p. This problem is more evident in NGS data analysis pipelines, since both strands of the mature miRNA duplex can be observed in the sequence reads. Users who rely only on the current miRBase assignments could fail to obtain comprehensive miRNA expression profiles, because the missing arm feature information on some mature miRNAs affects the subsequent annotation steps in some bioinformatic pipelines.
In this report, we provide a comprehensive arm feature annotation of all known miRNAs and perform subsequent analysis on the 5p-arm and 3p-arm miRNA populations in human gastric cancer tissues.
Materials and Methods
miRBase datasets and arm feature assignment. We downloaded all known miRNA information from miRBase (miRNA.dat, hairpin.fa and mature.fa). In brief, miRBase release 20 contains 24,521 miRNA precursors and 30,424 mature miRNA sequences (15). We also classified them by species prefix for subsequent analysis. There are 206 different species in this dataset. First, we set aside the miRNAs with already annotated arm features and then assigned the arm features of all other mature miRNAs by mapping them back to the hairpin precursor sequences using the bowtie program (20). We adopted a read mapping and classification strategy similar to that used by Zhou et al. (19). Starting from the 5’ end, each hairpin precursor was divided into a 5p-arm strand region (37.5% of the hairpin length), a loop region (25% of the hairpin length) and a 3p-arm strand region (37.5% of the hairpin length). Arm features were assigned according to the region to which each mature miRNA mapped. Python scripts were developed to process all data and analysis results. For whole-genome sequences (human hg19), we downloaded the repeat-masked genomic sequences from the UCSC Genome Bioinformatics website.
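The region-based assignment described above can be sketched in a few lines of Python. This is a minimal illustration rather than the scripts used in the study: the function name `assign_arm`, the 1-based inclusive coordinates, and the rule of classifying a mature miRNA by the position of its midpoint are all assumptions made for the example; only the 37.5%/25%/37.5% split comes from the text.

```python
def assign_arm(hairpin_length, mature_start, mature_end):
    """Classify a mature miRNA as '5p', 'loop' or '3p' on its hairpin.

    Coordinates are 1-based and inclusive, counted from the 5' end.
    Classifying by the midpoint of the mapped interval is an assumption
    of this sketch; the text only specifies the sizes of the regions.
    """
    if not (1 <= mature_start <= mature_end <= hairpin_length):
        raise ValueError("mature coordinates must lie within the hairpin")

    five_p_end = 0.375 * hairpin_length      # last position of the 5p-arm region
    three_p_start = 0.625 * hairpin_length   # 37.5% + 25%: loop region ends here

    midpoint = (mature_start + mature_end) / 2
    if midpoint <= five_p_end:
        return "5p"
    if midpoint > three_p_start:
        return "3p"
    return "loop"


# Toy example on a hypothetical 80-nt hairpin (not a real miRBase entry).
for start, end in [(5, 26), (30, 51), (55, 76)]:
    print(start, end, assign_arm(80, start, end))
```

In practice the start and end positions would come from the bowtie alignment of each mature sequence against its own precursor.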
Clinical samples and RNA extraction. Human gastric carcinoma tissues and paired normal adjacent tissues were obtained from patients who had undergone gastric resection at the Department of Surgery, Veterans General Hospital-Taipei, Taiwan (21-23). The study was approved by the Institutional Review Board (AS-IRB01-10123), and informed consent was obtained from all patients before surgery. Immediately after surgical resection, all tissues were snap-frozen in liquid nitrogen and stored at −80°C until use. In total, four pairs of normal gastric tissue and gastric tumor tissue were lysed with TissueLyser, followed by RNA extraction with TRIzol reagent (Invitrogen, Carlsbad, CA, USA) according to the manufacturer's protocols as described previously (21-23). The RNA samples were then processed and sequenced by a local company using Illumina Hi-Seq 2000 platform (Yourgene Bioscience, New Taipei City, Taiwan).
NGS data analysis with comprehensive arm-feature-annotated miRNAs. We applied the Illumina (Solexa) platform for small RNA sequencing. The generated sequence reads were processed to remove the adapter sequences, if applicable. Only the clean reads (i.e. reads with adapters detected and trimmed) were used for analysis. Considering the length distribution of mature miRNAs, we selected only the clean reads with lengths of 18 to 25 nucleotides for analysis (18). Identical clean reads were then collapsed into unique sequence groups, and the read count of each unique sequence was tabulated. For higher confidence, only the unique reads with read counts of two or more were mapped back to the human miRNA precursor and genome sequences. As previously reported, we used stringent criteria, trimming mismatched nucleotides from the 3’ end one by one until the read mapped as a perfect match of at least 18 nucleotides in length (18). Following the mapping, we assigned the miRNAs using our arm feature-annotated mature miRNA set, which contained 1,872 human miRNA precursor loci with 1,297 5p-arm and 1,279 3p-arm mature miRNAs (the complete list is available from our website: http://tdl.ibms.sinica.edu.tw/miRNA_arm/index.html). We then used the three most highly expressed isomiRs from the 5p-arm and 3p-arm miRNA regions to represent the expression levels of the 5p-arm and 3p-arm miRNAs, respectively. We filtered out the lowly expressed miRNAs (reads per million (RPM) value less than one) and selected only miRNAs expressed in over 50% of the NGS libraries. Analysis of variance (ANOVA) was performed to identify differentially expressed miRNAs or arm-specific miRNAs using Partek Genomics Suite software (St. Louis, MO, USA).
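The read-processing steps in this subsection (length filter, collapsing of identical reads, minimum read count, and 3’-end trimming) can be illustrated with the following Python sketch. It is not the pipeline used in the study: the function names are hypothetical, and a plain substring search stands in for the bowtie alignment purely to keep the example self-contained.

```python
from collections import Counter


def collapse_reads(clean_reads, min_len=18, max_len=25, min_count=2):
    """Collapse identical reads and keep those of 18-25 nt seen at least twice."""
    counts = Counter(r for r in clean_reads if min_len <= len(r) <= max_len)
    return {seq: n for seq, n in counts.items() if n >= min_count}


def trim_to_perfect_match(read, reference, min_len=18):
    """Trim the 3' end one nucleotide at a time until the read matches perfectly.

    Returns the longest 5'-anchored prefix of `read` (at least `min_len` nt)
    that occurs exactly in `reference`, or None if no such prefix exists.
    The substring test is a stand-in for a real aligner (an assumption).
    """
    while len(read) >= min_len:
        if read in reference:
            return read
        read = read[:-1]  # drop the last 3' nucleotide and try again
    return None


# Toy data only; these are not real precursor or read sequences.
reference = "ACGTTGCAAGGCTTAACCGGATCGATTACG"
reads = [reference[3:23] + "TT"] * 3 + ["ACGT"] * 5 + [reference[0:19]]
for seq, count in collapse_reads(reads).items():
    print(seq, count, trim_to_perfect_match(seq, reference))
```

Here the 22-nt toy read carries two non-templated 3’ nucleotides, so it is trimmed twice before its 20-nt prefix matches the reference exactly.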
Results
Comprehensive arm feature annotations. The annotation of miRNAs is not a simple or straightforward task. Continuous efforts are being made to improve the annotation of miRNAs (24), even in miRBase; however, the arm features have not been fully annotated for all currently reported miRNAs. With more and more miRNAs being identified through NGS platforms, it is now recognized that both strands of the mature miRNA duplex can be incorporated into the RISC complex. In this study, we aimed to interrogate and annotate the arm features of all known miRNAs.
There were 24,521 miRNA loci in 206 species and 30,424 mature miRNAs recorded in the miRBase dataset (release 20). We first extracted the miRNA records with arm features from these miRNAs. Only 15,398 miRNAs were annotated with arm features in this miRBase release (around 50%). Therefore, about half of the reported miRNAs lacked an arm feature assignment. We then mapped these unassigned mature miRNA sequences to their respective precursor miRNA sequences using the bowtie mapping program.
Our pipeline used the reported loci positions of the 24,521 miRNA precursors to determine the arm features of all mature miRNAs from the bowtie mapping results. In total, we were able to assign arm features to 30,360 mature miRNAs: 15,027 miRNAs from the 5p-arm and 15,333 miRNAs from the 3p-arm. The arm feature assignment rate was over 99.7% for these miRNAs. To our knowledge, this is the first comprehensive arm feature annotation of all known miRNAs. Among the precursors, 11,014 miRNA precursor loci contain both 5p-arm and 3p-arm mature miRNAs recorded in the miRBase dataset (44.9%). There are 64 miRNAs that could not be assigned to the 5p-arm or 3p-arm, mostly because they mapped to the loop region. In humans, there are 2,578 mature miRNAs, and we assigned 1,297 miRNAs to the 5p-arm and 1,279 miRNAs to the 3p-arm (2 unassigned).
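The counts in this paragraph can be cross-checked with simple arithmetic. The short Python snippet below uses only the numbers quoted in the text; the variable names are chosen here for illustration.

```python
# Counts quoted in the text for miRBase release 20.
total_mature = 30_424
assigned_5p = 15_027
assigned_3p = 15_333
total_precursors = 24_521
precursors_with_both_arms = 11_014

assigned = assigned_5p + assigned_3p
unassigned = total_mature - assigned
assignment_rate = 100 * assigned / total_mature
both_arm_rate = 100 * precursors_with_both_arms / total_precursors

print(f"assigned: {assigned}, unassigned: {unassigned}")
print(f"assignment rate: {assignment_rate:.2f}%")
print(f"precursors with both arms: {both_arm_rate:.1f}%")
```

With these inputs, 30,360 mature miRNAs are assigned and 64 remain unassigned, which corresponds to an assignment rate of about 99.79% and to about 44.9% of precursor loci carrying both arms.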
Length distribution of reported mature miRNAs. With our comprehensive arm feature annotation, we were able to further interrogate the length distribution of all reported miRNAs in greater detail, especially for the 5p-arm and 3p-arm groups. As shown in Figure 1A, the most dominant mature miRNA length is the 22-nucleotide group, followed by the 21-nucleotide and 23-nucleotide groups. We then compared the 5p-arm and 3p-arm miRNAs in the groups of different mature miRNA length. There were more 3p-arm miRNAs in the 21-nucleotide and 22-nucleotide groups, while there were more 5p-arm miRNAs in the 23-nucleotide group (Figure 1B). It seems that selection of the 5p-arm mature miRNA is preferred in the longer length groups (i.e. the 23- and 24-nucleotide groups). Because plant mature miRNAs are usually 21 nucleotides long, we checked the length distribution of miRNAs in Viridiplantae genomes. In the 23-nucleotide group, we did not see the prevailing 5p-arm mature miRNA pattern found in the Metazoan genomes. This could imply a difference between plant and animal miRNA biogenesis.
We then examined the length distribution in human miRNAs. The dominance of the 22-nucleotide group was even more striking in human miRNAs (Figure 2A). The 5p-arm and 3p-arm distribution pattern in this individual species is similar to that seen across all species. There are more human 3p-arm miRNAs in the 21- and 22-nucleotide groups, whereas there are more human 5p-arm miRNAs in the 23-nucleotide group (Figure 2B). This difference between the 5p-arm and 3p-arm miRNA distributions in the 21-, 22- and 23-nucleotide groups of the human genome was statistically significant by the Chi-square test (Figure 3). Furthermore, the result of the one-sided Mann–Whitney test shows that 5p-arm miRNAs are significantly longer than 3p-arm miRNAs (p-value=5.809e-05).
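As an illustration of the kind of comparison reported here, the sketch below computes a Pearson Chi-square statistic for a 2 × 3 table of arm (5p, 3p) by length group (21, 22 and 23 nucleotides). It is written under stated assumptions: whether the study tested one 2 × 3 table or each length group separately is not specified, the counts are placeholders rather than the values behind Figure 3, and the closed-form p-value exp(-x/2) holds only for exactly two degrees of freedom.

```python
import math


def chi_square_2xk(row_a, row_b):
    """Pearson Chi-square statistic and degrees of freedom for a 2 x k table."""
    total_a, total_b = sum(row_a), sum(row_b)
    grand_total = total_a + total_b
    statistic = 0.0
    for a, b in zip(row_a, row_b):
        column_total = a + b
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic, len(row_a) - 1


def p_value_two_df(statistic):
    """Upper-tail p-value for a Chi-square statistic with exactly 2 degrees of freedom."""
    return math.exp(-statistic / 2)


# Placeholder counts for the 21-, 22- and 23-nt groups; NOT the study's data.
counts_5p = [10, 20, 30]
counts_3p = [30, 20, 10]
statistic, dof = chi_square_2xk(counts_5p, counts_3p)
print(statistic, dof, p_value_two_df(statistic))
```

A statistics package would normally be used for this test and for the Mann–Whitney comparison; the hand-written version is shown only to make the calculation explicit.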
Human and mouse orthologous miRNAs. With the arm feature annotated in almost all miRNAs, we were better able to address the similarity issues between orthologous miRNA families. Between the 2,578 human mature miRNAs and 1,908 mouse mature miRNAs, there were 525 miRNAs with the same miR-ID number. We considered them as orthologous miRNAs. According to the typical miRNA nomenclature rules, we expected these human and mouse orthologous miRNAs to have highly similar mature sequences. Indeed, we observed 271 miRNAs with identical mature miRNA sequences and 90 miRNAs with slight variations at the end of the mature sequences [possibly due to isomiR (25)]. Excluding those end variation isomiRs, there were an additional 90 miRNAs with one nucleotide variation in the mature miRNA sequences and 34 miRNAs with two nucleotide variations. To our surprise, there were a further 40 miRNAs with variations in more than two nucleotides – 11 of these miRNAs contained variations in more than five nucleotides (Table I). This might create some issues, since the target genes would be different between some of these highly diverse orthologous miRNAs.
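The comparison of orthologous mature sequences can be sketched as follows. This is an illustrative reconstruction, not the authors' script: the function names are hypothetical, "end variation" is approximated as one sequence being contained in the other, and internal differences are counted position by position only for sequences of equal length.

```python
def shared_ids(human_names, mouse_names):
    """Return miRNA IDs present in both species after dropping the species prefix."""
    def strip_prefix(name):
        return name.split("-", 1)[1]  # 'hsa-miR-21-5p' -> 'miR-21-5p'
    return {strip_prefix(n) for n in human_names} & {strip_prefix(n) for n in mouse_names}


def classify_pair(human_seq, mouse_seq):
    """Classify a human/mouse mature miRNA pair by sequence similarity."""
    if human_seq == mouse_seq:
        return "identical"
    shorter, longer = sorted((human_seq, mouse_seq), key=len)
    if shorter in longer:
        return "end variation"  # differs only by extra terminal nucleotides
    if len(human_seq) == len(mouse_seq):
        mismatches = sum(h != m for h, m in zip(human_seq, mouse_seq))
        return f"{mismatches} mismatch(es)"
    return "unclassified"  # would need a proper alignment


# Toy sequences for illustration only.
print(shared_ids(["hsa-miR-21-5p", "hsa-miR-1-3p"], ["mmu-miR-21-5p", "mmu-miR-2-3p"]))
print(classify_pair("ACGUACGUACGUACGUACGUAC", "ACGUACGUACGUACGUACGUA"))
print(classify_pair("ACGUACGUACGUACGUACGUAC", "ACGUACGUAGGUACGUACGUAC"))
```

Pairs that differ in both length and internal sequence fall into the "unclassified" group in this sketch and would require a pairwise alignment to be counted properly.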
Expression levels of human 5p-arm and 3p-arm miRNAs in cancer tissues. Besides the length and sequence variation of 5p-arm and 3p-arm miRNAs discussed above, we were interested in the expression levels of 5p-arm and 3p-arm miRNAs generated from the same miRNA precursor loci. To interrogate the 5p-arm and 3p-arm miRNA expression levels in detail, we analyzed the Solexa NGS data from four pairs of human gastric cancer tissues with abundant depth of coverage for small RNAs (G1170N: 11,742,486 reads; G1170T: 11,729,328 reads; G1176N: 11,815,166 reads; G1176T: 11,844,941 reads; G1193N: 11,510,656 reads; G1193T: 11,114,691 reads; G1224N: 11,025,333 reads; G1224T: 11,036,771 reads). The Solexa reads were trimmed of adapter sequences and filtered to remove low-quality reads, according to our previous analysis pipeline (18). We then applied our newly established comprehensive arm feature-annotated miRNA list in the bioinformatic pipeline to provide better annotation for the NGS expression results. We observed a total of 1,855 miRNA genes among the four pairs of gastric cancer tissues. To investigate the expression levels of 5p-arm and 3p-arm miRNAs, we selected only miRNAs for which both the 5p-arm and 3p-arm miRNAs were expressed in all eight small RNA NGS libraries. There were 191 such miRNAs, and their 5p-arm/3p-arm expression ratios are illustrated in Figure 4. We found no significant difference between the numbers of miRNAs with 5p-arm expression dominance and those with 3p-arm expression dominance; however, it is notable that at most of the miRNA loci there is a dramatic difference in expression level (tens- or hundreds-fold) between the two arms, in favor of either the 5p-arm or the 3p-arm (Figure 4). This observation is consistent with the large difference in expression level between guide strand miRNAs and passenger strand miRNAs (miRNA*) that researchers have perceived previously.
These data illustrate the advantages of our arm-feature annotation efforts to provide clear and better miRNA analysis results for the NGS bioinformatic pipelines.
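The 5p-arm/3p-arm expression ratio discussed in this subsection can be expressed as a log2 ratio of the two RPM values. The sketch below is illustrative only: the function names and the 10-fold cut-off used to call an arm "dominant" are assumptions made for the example, not thresholds taken from the study.

```python
import math


def arm_log2_ratio(rpm_5p, rpm_3p):
    """log2 of the 5p-arm to 3p-arm expression ratio (both RPM values must be > 0)."""
    return math.log2(rpm_5p / rpm_3p)


def dominant_arm(rpm_5p, rpm_3p, min_fold=10.0):
    """Call the dominant arm when one arm exceeds the other by at least `min_fold`.

    The 10-fold default is a hypothetical cut-off chosen for this example.
    """
    ratio = arm_log2_ratio(rpm_5p, rpm_3p)
    threshold = math.log2(min_fold)
    if ratio >= threshold:
        return "5p"
    if ratio <= -threshold:
        return "3p"
    return "balanced"


# Toy RPM values for three hypothetical loci.
for rpm_5p, rpm_3p in [(800.0, 10.0), (1.0, 200.0), (5.0, 4.0)]:
    print(rpm_5p, rpm_3p, round(arm_log2_ratio(rpm_5p, rpm_3p), 2), dominant_arm(rpm_5p, rpm_3p))
```

Because only loci with both arms detected in every library are compared, neither RPM value is zero and the ratio is always defined.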
Dysregulated 5p-arm and 3p-arm miRNAs in gastric cancer tissues. We further examined the miRNAs that were significantly dysregulated in the gastric cancer tissues. We filtered out the poorly expressed miRNAs (RPM less than one) and selected only miRNAs expressed in over 50% of the NGS libraries. The subsequent normalization and ANOVA were performed with the Partek software. We identified 38 down-regulated miRNAs and 22 up-regulated miRNAs (Table II). The most significantly up-regulated miRNAs were miR-21 and miR-196, which were previously studied in our laboratory as putative gastric cancer biomarkers (22, 26-28). Interestingly, for some miRNAs, such as miR-136, miR-142, miR-150 and miR-93, both the 5p-arm and 3p-arm miRNAs were significantly dysregulated. The data presented herein demonstrate that it is beneficial to use the comprehensive 5p-arm- and 3p-arm-annotated miRNA list for in-depth NGS data analysis. In summary, the pipeline demonstrated here yields better analysis results, with well-defined 5p-arm and 3p-arm expression information.
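The expression filter applied before the differential analysis can be sketched as below. This is one possible reading of the criteria, stated as an assumption: a miRNA is treated as expressed in a library when its RPM is at least one, and it is retained when that holds in more than half of the libraries. The function names are hypothetical.

```python
def reads_per_million(count, total_reads):
    """Normalize a raw read count to reads per million (RPM)."""
    return count * 1_000_000 / total_reads


def passes_expression_filter(rpm_values, min_rpm=1.0, min_fraction=0.5):
    """Keep a miRNA if it reaches `min_rpm` in more than `min_fraction` of libraries."""
    expressed = sum(1 for value in rpm_values if value >= min_rpm)
    return expressed / len(rpm_values) > min_fraction


# Toy RPM profiles across eight libraries (placeholder values).
kept = [3.2, 1.5, 0.0, 2.1, 4.8, 0.4, 1.0, 0.2]     # expressed in 5 of 8 libraries
dropped = [3.2, 1.5, 0.0, 2.1, 0.8, 0.4, 1.0, 0.2]  # expressed in 4 of 8 libraries
print(passes_expression_filter(kept), passes_expression_filter(dropped))
```

Under this reading, a miRNA expressed in exactly four of the eight libraries is excluded, because four is not more than 50% of eight.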
Discussion
In the biogenesis of miRNAs, the formation of a suitable hairpin structure and the subsequent cleavage steps are essential (1). Recently, it has been reported that, in addition to the secondary hairpin structure, certain primary sequence motifs are also required for hairpin recognition and processing, including the downstream SRp20-binding motif, the basal UG motif in the stem, and the apical stem GUG motif (29). This suggests that more detailed studies on miRNA biogenesis are still required to learn more about the generation of mature miRNAs. In the final step, unwinding of the Dicer-processed duplex is necessary for the completion of the RISC complex. According to the previous understanding, only one of the two strands is selected for Argonaute binding to form the RISC-silencing complex, and the other strand is discarded (13). There is no doubt that there is a selection preference in the determination of the mature miRNA guide strand. It has been reported that thermodynamic properties and properties of the base pairs at the 5’ ends of the duplex strand could be important in the selection process (12). However, further studies are required to elucidate the detailed molecular mechanisms.
Although our data imply that there is strong bias in the expression levels of the mature 5p-arm and 3p-arm miRNAs, with the advance of the NGS platform and accumulation of small RNA sequencing data, we believe that both strands of the miRNA duplex could be incorporated into the RISC complex ubiquitously or under biological regulation. This statement is supported by the new official miRNA nomenclature set forth in the miRBase 17 release, which retires the miR/miR* nomenclature and uses 5p-arm/3p-arm mature miRNAs due to growing evidence that implicates the expression of both strands. While more and more miRNAs are being annotated with their arm features in recent miRBase releases, we do note that only about 50% of the currently reported miRNAs are annotated with their respective arm features. Since the current miRBase annotation did not provide all reported miRNAs with complete 5p-arm or 3p-arm annotation, we believe our effort here provides beneficial information for the miRNA data analysis pipeline.
During the literature survey, we noted that certain miRNAs were reported with the correct arm annotation in some publications, whereas the arm annotation was lacking in others. This generates confusion in the interpretation of data, since one cannot know which arm is being referred to. The mature 5p-arm miRNA is totally different from the 3p-arm miRNA in terms of sequence and target spectrum, not to mention expression level. A good example is the well-known oncomiR, miR-21. miR-21 is highly expressed in many human cancer tissues, including gastric cancer (30). In general, this name refers to miR-21-5p. The difference in expression level between miR-21-5p and miR-21-3p is around 80-fold (unpublished observation with TCGA gastric cancer datasets). Therefore, miR-21-3p is often neglected in analysis; however, recent reports demonstrate that this minor miRNA does perform significant functions in cardiomyocyte hypertrophy and in the growth of hepatocellular carcinoma cells (31-33). Unfortunately, miR-21-3p has been totally overlooked in many studies, in which only miR-21 (miR-21-5p) was investigated. This again supports the claim that comprehensive miRNA arm-feature annotation can be utilized for better interrogation of miRNA NGS data. With the increasing depth of NGS data, this situation will become more significant, since more miRNAs expressed at a low level from the opposite arm could be discovered. Therefore, comprehensive 5p-arm and 3p-arm assignment and annotation will be essential and beneficial for systematic interrogation in miRNA studies.
Acknowledgements
This work was supported by research grants from Academia Sinica and Ministry of Science and Technology, Taiwan, Republic of China.
Footnotes
- * These Authors contributed equally to this study.
- This article is freely accessible online.
- Received November 19, 2014.
- Revision received November 29, 2014.
- Accepted December 4, 2014.
- Copyright© 2015 International Institute of Anticancer Research (Dr. John G. Delinassios), All rights reserved