Genomics Big data: bigdatagenomics ADAM

A primary reason for using ADAM is that it lets you process your genomics data in parallel. BAM, SAM, and VCF files are not structured for parallel processing. Because of this, even the Broad Institute's GATK tools support ADAM Parquet as of version 4.

docker run -it heuermh/adam adam-submit --version

docker run -v `pwd`:/data -it heuermh/adam adam-shell

Spark 2.0:
scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> val alignmentRecords = sc.loadBam("/data/small.sam")

Spark 1.6:
scala> import org.bdgenomics.adam.rdd.ADAMContext

scala> val ac = new ADAMContext(sc)

# The fields have to be listed in a projection; fields left out of it can't be accessed.

scala> import org.bdgenomics.adam.projections.{AlignmentRecordField, Projection}

scala> val projection = Projection(
  AlignmentRecordField.readMapped,
  AlignmentRecordField.mateMapped,
  AlignmentRecordField.readPaired
)

scala> val adamFile = ac.loadAlignments(
  "/data/sample.rmdup.adam",
  projection = Some(projection)
)

scala> adamFile.map(p => p.getReadMapped).take(1).foreach(println)

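A contig is a set of overlapping sequence fragments. It is a concept from genome sequencing analysis.
After gene fragments containing STS (sequence-tagged site) markers are sequenced separately, overlap analysis of those fragments yields the complete chromosomal genome sequence. The contig is the concept used in that analysis.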
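The overlap analysis described above can be sketched in a few lines. This is a toy illustration, not how a real assembler works: it assumes the fragments are already in left-to-right order, error-free, and that each one overlaps the next. `ContigSketch`, `overlap`, and `merge` are hypothetical names introduced here.

```scala
object ContigSketch {
  // Length of the longest suffix of `a` that is also a prefix of `b`.
  def overlap(a: String, b: String): Int = {
    val max = math.min(a.length, b.length)
    (max to 1 by -1).find(k => a.endsWith(b.substring(0, k))).getOrElse(0)
  }

  // Merge ordered fragments into one contig by collapsing each overlap.
  // Assumption: fragments are given in order and contain no sequencing errors.
  def merge(fragments: Seq[String]): String =
    fragments.reduceLeftOption { (contig, next) =>
      contig + next.substring(overlap(contig, next))
    }.getOrElse("")
}
```

For example, `ContigSketch.merge(Seq("ATGCGT", "CGTTAC", "TACGGA"))` joins the three fragments on their shared `CGT` and `TAC` ends.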

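Whole-genome de novo sequencing: sequencing that does not rely on any existing sequence data. The genome of a species is sequenced directly, and the reads are then joined and assembled with bioinformatics methods to obtain the genome sequence map of that species.

Whole-genome resequencing: genome sequencing of different individuals of a species that already has a reference sequence, followed by analysis of genetic differences at the individual or population level. Whole-genome resequencing can find large numbers of variants, including single nucleotide polymorphisms (SNPs), copy number variations (CNVs), insertions/deletions (InDels), and structural variations (SVs), and so extends the information in a single reference genome to population-level genetic characteristics accurately and quickly.

Transcriptome: the set of all transcripts in a given tissue or cell at a specific growth stage; in the narrow sense, the set of all mRNAs.

Transcriptome sequencing: sequencing all the RNA that a tissue can transcribe in a given functional state, which yields nearly all transcript sequences of the species in that state. In practice, transcriptome sequencing usually means sequencing mRNA. Depending on whether the species under study has a reference genome, it is divided into de novo transcriptome sequencing (no reference genome) and transcriptome resequencing (reference genome available).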
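To make the idea of resequencing against a reference concrete, here is a minimal sketch that reports single-base differences between a reference and a sample. It assumes the two sequences are already aligned and of equal length, so it cannot see InDels, CNVs, or SVs, and it is not a real variant caller. `SnpSketch`, `Snp`, and `callSnps` are hypothetical names.

```scala
object SnpSketch {
  // One single-base difference; `position` is 1-based.
  final case class Snp(position: Int, ref: Char, alt: Char)

  // Compare a sample to the reference base by base.
  // Assumption: both strings are aligned and have the same length.
  def callSnps(reference: String, sample: String): Seq[Snp] = {
    require(reference.length == sample.length, "sequences must be aligned to equal length")
    reference.zip(sample).zipWithIndex.collect {
      case ((r, s), i) if r != s => Snp(i + 1, r, s)
    }
  }
}
```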

scala> import org.bdgenomics.adam.rdd.ADAMContext

import org.bdgenomics.adam.rdd.ADAMContext

scala> val ac = new ADAMContext(sc)

ac: org.bdgenomics.adam.rdd.ADAMContext = org.bdgenomics.adam.rdd.ADAMContext@4f61fc2e

scala> ac.loadAlignments("/data/sample.rmdup.bam")

16/10/31 09:36:57 INFO SparkContext: Created broadcast 0 from newAPIHadoopFile at ADAMContext.scala:284

res1: org.apache.spark.rdd.RDD[org.bdgenomics.formats.avro.AlignmentRecord] = MapPartitionsRDD[1] at map at ADAMContext.scala:289

scala>

adam-submit --master yarn -- transform data.sam data.adam

adam-submit --master yarn -- transform -sort_reads data.adam data_sort.adam

adam-submit --master yarn -- transform -mark_duplicate_reads data_sort.adam data_sort_dup.adam

adam-submit --master yarn -- transform -realign_indels data_sort_dup.adam data_sort_dup_rea.adam
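What the sort and mark-duplicates steps do can be sketched conceptually in plain Scala. This is not ADAM's implementation: it assumes single-end reads, treats reads with the same contig, start, and strand as duplicates of each other, and keeps the one with the highest quality sum. `TransformSketch`, `Read`, and its fields are hypothetical stand-ins, not ADAM types.

```scala
object TransformSketch {
  // Hypothetical, minimal stand-in for an aligned read.
  final case class Read(
    name: String,
    contig: String,
    start: Long,
    reverse: Boolean,
    qualitySum: Int,
    duplicate: Boolean = false
  )

  // Order reads by reference position: contig name first, then start.
  def sortReads(reads: Seq[Read]): Seq[Read] =
    reads.sortBy(r => (r.contig, r.start))

  // Within each (contig, start, strand) group, keep the read with the
  // highest quality sum and flag every other read as a duplicate.
  def markDuplicates(reads: Seq[Read]): Seq[Read] =
    reads.groupBy(r => (r.contig, r.start, r.reverse)).values.flatMap { group =>
      val best = group.maxBy(_.qualitySum)
      group.map(r => r.copy(duplicate = r ne best))
    }.toSeq
}
```

Duplicates are flagged rather than removed, so later steps can decide whether to skip them.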

Gene expression is achieved through DNA directing protein synthesis, and it comprises two processes: transcription and translation.
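The two processes can be illustrated with a small sketch. It is simplified on purpose: transcription is modelled as replacing T with U on the coding strand, and the codon table holds only the four codons used in the example rather than the full genetic code. `ExpressionSketch`, `transcribe`, and `translate` are hypothetical names.

```scala
object ExpressionSketch {
  // Partial codon table: only the codons used in the example.
  private val codonTable: Map[String, String] = Map(
    "AUG" -> "Met",
    "GCC" -> "Ala",
    "UGG" -> "Trp",
    "UAA" -> "Stop"
  )

  // Transcription, simplified: the mRNA matches the coding strand with T replaced by U.
  def transcribe(codingStrand: String): String =
    codingStrand.toUpperCase.replace('T', 'U')

  // Translation: read the mRNA three bases at a time until a stop codon.
  // Codons missing from the partial table are reported as "?".
  def translate(mrna: String): List[String] =
    mrna.grouped(3)
      .filter(_.length == 3)
      .map(codon => codonTable.getOrElse(codon, "?"))
      .takeWhile(_ != "Stop")
      .toList
}
```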

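Generally speaking, color blindness is an inherited condition that a man passes through his daughter to his grandson (criss-cross inheritance).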

Read mapping

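After obtaining the raw RNA-seq data, the first step is to map all sequencing reads to the reference genome. This is the basis for all subsequent processing and analysis. Before mapping, the data sometimes needs basic preprocessing, depending on its condition: for example, filtering out low-quality reads, or removing adapter sequences from miRNA sequencing reads.

The massive volume of high-throughput sequencing data places heavy demands on algorithm running time. Reads from platforms such as Illumina/Solexa are generally short and contain few insertion or deletion errors, and a number of short-read mapping algorithms have been developed around those properties. These algorithms mainly rely on spaced-seed indexing or the Burrows-Wheeler Transform (BWT).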
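As a small illustration of the Burrows-Wheeler Transform itself, here is the textbook construction: sort all rotations of the text and take the last column. It assumes the text ends with a unique `$` terminator. This is only a sketch of the transform; practical aligners build it from suffix arrays and query it through an FM-index rather than materializing every rotation. `BwtSketch` and `transform` are hypothetical names.

```scala
object BwtSketch {
  // Naive BWT: sort every rotation of the text and read off the last column.
  // Assumption: `text` ends with '$' and contains no other '$'.
  def transform(text: String): String = {
    require(
      text.nonEmpty && text.indexOf('$') == text.length - 1,
      "text must end with a unique '$' terminator"
    )
    val rotations = text.indices.map(i => text.substring(i) + text.substring(0, i))
    rotations.sorted.map(_.last).mkString
  }
}
```

The transform tends to group identical characters together, which is what makes the indexed text both compressible and fast to search.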

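Once mapped to the genome, reads are usually stored in SAM (Sequence Alignment/Map) format or its binary version, BAM [39]. The binary version saves a great deal of storage space but cannot be displayed directly in an ordinary text editor.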
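Because SAM is plain tab-separated text, a single alignment line is easy to take apart. The sketch below reads the first six mandatory fields and decodes three FLAG bits from the SAM specification (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). `SamSketch` and `SamRecord` are hypothetical names; the accessor names mirror the `readPaired`, `readMapped`, and `mateMapped` fields used in the projection earlier, but treating them as equivalent to ADAM's fields is an assumption.

```scala
object SamSketch {
  // First six mandatory SAM fields: QNAME, FLAG, RNAME, POS, MAPQ, CIGAR.
  final case class SamRecord(
    qname: String,
    flag: Int,
    rname: String,
    pos: Int,
    mapq: Int,
    cigar: String
  ) {
    def readPaired: Boolean = (flag & 0x1) != 0               // 0x1: template has multiple segments
    def readMapped: Boolean = (flag & 0x4) == 0               // 0x4: this read is unmapped
    def mateMapped: Boolean = readPaired && (flag & 0x8) == 0 // 0x8: the mate is unmapped
  }

  // Parse one alignment line; header lines starting with '@' are not handled.
  def parse(line: String): SamRecord = {
    val fields = line.split("\t")
    require(fields.length >= 11, "a SAM alignment line has at least 11 tab-separated fields")
    SamRecord(fields(0), fields(1).toInt, fields(2), fields(3).toInt, fields(4).toInt, fields(5))
  }
}
```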

Wu Fei @Fei-Guang Oct 14 10:25
hello,
is there a tool for converting bcl to adam, like bcl2fastq?
do i have to convert fastq to adam?

Andy Petrella @andypetrella Oct 14 11:32
@heuermh awesome, poke us on our Gitter if needed, we'll be happy to help you. We may publish new content in the meantime that I'll point you to

Michael L Heuer @heuermh Oct 14 23:06
@Fei-Guang No, we don't have support for BCL format in ADAM (pull requests accepted, of course). From here it looks like you could either use the Illumina tools to convert to FASTQ or Picard to convert to SAM, and then read either format into ADAM.

https://www.biostars.org/p/217346/#217366

From ADAM Software Development, on reading ADAM alignments with SQLContext.parquetFile:

When this read fails, it typically means that you are reading and writing data with different versions of the Parquet libraries. Specifically, Spark SQL tries to read the Parquet statistics and fails because their format doesn't match what it expects.


Wednesday, October 12, 2016
