Thursday, October 27, 2016

spark2 notes

http://stackoverflow.com/questions/40796818/how-to-append-a-resource-jar-for-spark-submit
Set the SPARK_PRINT_LAUNCH_COMMAND environment variable to have the complete
Spark command printed to the console, e.g.
$ SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-shell
Spark Command: /Library/Ja...
Refer to Print Launch Command of Spark Scripts (or
org.apache.spark.launcher.Main Standalone Application where this environment
variable is actually used).
The same variable can be passed to the ADAM Docker image:
$ docker run -v `pwd`:/data -e SPARK_PRINT_LAUNCH_COMMAND=1 -it heuermh/adam adam-shell

Tip: Avoid using the scala.App trait for a Spark application's main class in Scala, as
reported in SPARK-4170 (Closure problems when running Scala app that "extends
App"). Define an explicit main method instead.
Refer to Executing Main — runMain internal method in this document.
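
A minimal sketch of the explicit-main pattern, written in plain Scala so it runs without Spark on the classpath. LineStats and longest are hypothetical names used only for illustration; in a real job the SparkContext would be created inside main.

```scala
// Hypothetical example object; not taken from Spark or from SPARK-4170.
object LineStats {
  // A plain field on an ordinary object is initialised when the object is
  // first loaded. In Scala 2, the body of an object that extends App is
  // instead deferred until its main method runs (DelayedInit).
  val separator = "\\s+"

  // Largest number of whitespace-separated fields found on any line.
  def longest(lines: Seq[String]): Int =
    lines.map(_.split(separator).length).foldLeft(0)((a, b) => math.max(a, b))

  def main(args: Array[String]): Unit = {
    // A real Spark job would create its SparkContext / SparkSession here.
    println(longest(args.toSeq))
  }
}
```

Because separator is an ordinary field of an ordinary object, code that references LineStats from elsewhere does not depend on main having run first.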

Make sure to use the same version of Scala as the one used to build your distribution of Spark. Pre-built distributions of Spark 1.x use Scala 2.10, while pre-built distributions of Spark 2.0.x use Scala 2.11.
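
A small sketch for comparing Scala binary versions (2.10 vs 2.11), assuming the usual major.minor.patch version strings. ScalaVersionCheck and its methods are hypothetical helper names, not part of Spark.

```scala
// Hypothetical helper for comparing Scala binary versions.
object ScalaVersionCheck {
  // "2.11.8" -> "2.11": the binary version is the first two components.
  def binaryVersion(full: String): String =
    full.split("\\.").take(2).mkString(".")

  def sameBinaryVersion(a: String, b: String): Boolean =
    binaryVersion(a) == binaryVersion(b)

  def main(args: Array[String]): Unit = {
    // Version of the Scala library on the current classpath,
    // e.g. the one bundled with spark-shell when run from there.
    val running = scala.util.Properties.versionNumberString
    println(s"Running Scala $running (binary version ${binaryVersion(running)})")
  }
}
```

Running the same check inside spark-shell and in the application build shows quickly whether the two binary versions agree.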
Steps to install pre-built Spark (1.6.2-bin-hadoop2.6) in local mode on Windows:
  1. Install Java 7 or later. To test that the Java installation is complete, open a command prompt, type java and hit Enter. If you receive the message 'Java' is not recognized as an internal or external command, you need to configure the environment variables JAVA_HOME and PATH to point to the JDK path.
  2. Install Scala. Set SCALA_HOME under Control Panel\System and Security\System, go to "Adv System settings", and add %SCALA_HOME%\bin to the PATH environment variable.
  3. Install Python 2.6 or later from the Python download link.
  4. Download SBT. Install it and set SBT_HOME as an environment variable with the value <<SBT PATH>>.
  5. Download winutils.exe from the HortonWorks repo or the git repo. Since there is no local Hadoop installation on Windows, winutils.exe has to be placed in a bin directory under a newly created Hadoop home directory. Set HADOOP_HOME = <<Hadoop home directory>> as an environment variable and add it to PATH.
  6. Choose a Spark package pre-built for Hadoop from the Spark download page. Download and extract it.
    Set SPARK_HOME and add %SPARK_HOME%\bin to the PATH environment variable.
  7. Run the command: spark-shell
  8. Open http://localhost:4040/ in a browser to see the SparkContext web UI.
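
Before step 7 it can help to confirm that the variables from the steps above are really set. A rough sketch in plain Scala; SparkEnvCheck and its methods are hypothetical names, and the list of variables is simply the one used in these steps.

```scala
import java.io.File

// Hypothetical helper; checks only the variables named in the steps above.
object SparkEnvCheck {
  val required: Seq[String] =
    Seq("JAVA_HOME", "SCALA_HOME", "SBT_HOME", "HADOOP_HOME", "SPARK_HOME")

  // Names of required variables that are unset or blank in the given environment.
  def missing(env: Map[String, String]): Seq[String] =
    required.filterNot(name => env.get(name).exists(_.trim.nonEmpty))

  // Expected location of winutils.exe: <HADOOP_HOME>/bin/winutils.exe
  def winutilsPath(env: Map[String, String]): Option[File] =
    env.get("HADOOP_HOME").map(home => new File(new File(home, "bin"), "winutils.exe"))

  def main(args: Array[String]): Unit = {
    missing(sys.env).foreach(name => println(s"$name is not set"))
    winutilsPath(sys.env).filterNot(_.exists)
      .foreach(f => println(s"winutils.exe not found at $f"))
  }
}
```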
$ cat rdd1.txt
chr1    10016
chr1    10017
chr1    10018
chr1    20026
scala> val lines = sc.textFile("/data/rdd1.txt")

scala> case class Chrom(name: String, value: Long)
defined class Chrom

scala> val chroms = lines.map(_.split("\\s+")).map(r => Chrom(r(0), r(1).toLong))
chroms: org.apache.spark.rdd.RDD[Chrom] = MapPartitionsRDD[5] at map at <console>:28

scala> val df = chroms.toDF
16/10/28 16:17:42 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/10/28 16:17:43 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
df: org.apache.spark.sql.DataFrame = [name: string, value: bigint]

scala> df.show
+----+-----+
|name|value|
+----+-----+
|chr1|10016|
|chr1|10017|
|chr1|10018|
|chr1|20026|
|chr1|20036|
|chr1|30016|
|chr1|30026|
|chr2|40016|
|chr2|40116|
|chr2|50016|
|chr3|70016|
+----+-----+
scala> df.filter('value > 30000).show
+----+-----+
|name|value|
+----+-----+
|chr1|30016|
|chr1|30026|
|chr2|40016|
|chr2|40116|
|chr2|50016|
|chr3|70016|
+----+-----+

scala> case class Chrom2(name: String, min: Long, max: Long)

scala> // rdd2 is assumed to be an RDD[String] of whitespace-separated name/min/max lines, loaded like `lines` above
scala> val chroms2 = rdd2.map(_.split("\\s+")).map(r => Chrom2(r(0), r(1).toLong, r(2).toLong))
chroms2: org.apache.spark.rdd.RDD[Chrom2] = MapPartitionsRDD[35] at map at <console>:28

scala> val df2 = chroms2.toDF
df2: org.apache.spark.sql.DataFrame = [name: string, min: bigint ... 1 more field]

scala> df.join(df2, Seq("name")).where($"value".between($"min", $"max")).groupBy($"name").count().show()
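
To make the join semantics concrete, here is a plain-Scala sketch of the same logic on local collections: an inner join on name, an inclusive min <= value <= max filter, and a count per name. RangeJoinSketch and countInRanges are hypothetical names, and the ranges in the usage note are made up for illustration; this is not how Spark executes the query.

```scala
// Hypothetical local-collections analogue of the join/between/groupBy query.
object RangeJoinSketch {
  final case class Chrom(name: String, value: Long)
  final case class Chrom2(name: String, min: Long, max: Long)

  // For every (point, range) pair with the same name and min <= value <= max,
  // emit the name; then count the emitted names.
  def countInRanges(points: Seq[Chrom], ranges: Seq[Chrom2]): Map[String, Int] = {
    val matches = for {
      p <- points
      r <- ranges
      if p.name == r.name && p.value >= r.min && p.value <= r.max
    } yield p.name
    matches.groupBy(identity).map { case (name, hits) => name -> hits.size }
  }
}
```

With the points chr1/10016, chr1/10017, chr1/10018, chr1/20026, chr2/40016, chr2/40116 and the made-up ranges chr1 10000-20000 and chr2 40000-45000, countInRanges gives chr1 -> 3 and chr2 -> 2. Names with no matching range do not appear in the result, as with an inner join.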




$ ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.2.0



Your csv file does not have the same number of fields in each row - this cannot be parsed as is into a DataFrame
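
A quick way to find the offending rows before handing the file to Spark is to compare each row's field count with that of the first row. A naive sketch: it just counts separators, so separators inside quoted fields are not handled. CsvFieldCheck and raggedRows are hypothetical names.

```scala
// Hypothetical helper; naive separator counting, no support for quoted fields.
object CsvFieldCheck {
  // Returns (1-based line number, field count) for every row whose
  // field count differs from that of the first row.
  def raggedRows(lines: Seq[String], sep: Char = ','): Seq[(Int, Int)] =
    lines.headOption match {
      case None => Seq.empty
      case Some(header) =>
        val expected = header.count(_ == sep) + 1
        lines.zipWithIndex.collect {
          case (line, idx) if line.count(_ == sep) + 1 != expected =>
            (idx + 1, line.count(_ == sep) + 1)
        }
    }
}
```

For example, given the made-up rows "name,min,max", "chr1,10000,20000", "chr2,40000" and "chr3,1,2,3", the rows on lines 3 and 4 are reported with 2 and 4 fields respectively.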


As of Spark 2.0.0, DataFrame - the flagship data abstraction of previous
versions of Spark SQL - is a mere type alias for Dataset[Row]:

type DataFrame = Dataset[Row]


A Dataset is local if it was created from local collections using SparkSession.emptyDataset
or SparkSession.createDataset methods and their derivatives like toDF. If so, the queries on
the Dataset can be optimized and run locally, i.e. without using Spark executors.

Wednesday, October 12, 2016

bigdatagenomics ADAM

A primary reason for using ADAM is the ability to process genomics data in parallel. The BAM, SAM and VCF file formats are not designed for parallelism. Because of this, even the Broad Institute's GATK tools support ADAM Parquet as of version 4.




docker run -it heuermh/adam adam-submit --version

docker run -v `pwd`:/data -it heuermh/adam adam-shell

spark2.0
scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> val alignmentRecords = sc.loadBam("/data/small.sam")






spark1.6
scala> import org.bdgenomics.adam.rdd.ADAMContext
scala> import org.bdgenomics.adam.projections.{AlignmentRecordField, Projection}
scala> val ac = new ADAMContext(sc)
// The fields have to be listed in a projection; otherwise they cannot be accessed.
scala> val projection = Projection(
      AlignmentRecordField.readMapped,
      AlignmentRecordField.mateMapped,
      AlignmentRecordField.readPaired
    )
scala> val adamFile = ac.loadAlignments(
      "/data/sample.rmdup.adam",
      projection = Some(projection)
    )
scala> adamFile.map(p => p.getReadMapped).take(1).foreach(println)



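A contig is a set of overlapping sequence fragments; it is a concept from genome sequencing and analysis.
After the gene fragments that contain STS (sequence-tagged site) markers have each been sequenced, overlap analysis yields the complete chromosomal genome sequence. The contig is the concept used in that analysis.

Whole-genome de novo sequencing: also called from-scratch sequencing. It does not rely on any existing sequence data; the genome of a species is sequenced directly, and bioinformatics methods are then used to stitch and assemble the sequences, producing the genome sequence map of that species.

Whole-genome resequencing: sequencing the genomes of different individuals of a species that already has a reference sequence, and using this as the basis for analysing genetic differences at the individual or population level. Whole-genome resequencing can uncover large numbers of single-nucleotide polymorphisms (SNPs), copy number variations (CNVs), insertions/deletions (InDels), structural variations (SVs) and other variant types, quickly and accurately extending the information in a single reference genome to the genetic characteristics of a population.

Transcriptome: the set of all transcripts in a given tissue or cell at a specific growth stage; in the narrow sense, the set of all mRNAs.


Transcriptome sequencing: sequencing all the RNA that a tissue can transcribe in a given functional state, to obtain the sequences of nearly all transcripts of the species in that state. Transcriptome sequencing usually means sequencing mRNA. Depending on whether the species under study has a reference genome, it is divided into de novo transcriptome sequencing (no reference genome) and transcriptome resequencing (reference genome available).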






scala> import org.bdgenomics.adam.rdd.ADAMContext
import org.bdgenomics.adam.rdd.ADAMContext

scala> val ac = new ADAMContext(sc)
ac: org.bdgenomics.adam.rdd.ADAMContext = org.bdgenomics.adam.rdd.ADAMContext@4f61fc2e

scala> ac.loadAlignments("/data/sample.rmdup.bam")
16/10/31 09:36:57 INFO SparkContext: Created broadcast 0 from newAPIHadoopFile at ADAMContext.scala:284
res1: org.apache.spark.rdd.RDD[org.bdgenomics.formats.avro.AlignmentRecord] = MapPartitionsRDD[1] at map at ADAMContext.scala:289

scala>







adam-submit --master yarn -- transform data.sam data.adam
adam-submit --master yarn -- transform -sort_reads data.adam data_sort.adam
adam-submit --master yarn -- transform -mark_duplicate_reads data_sort.adam data_sort_dup.adam
adam-submit --master yarn -- transform -realign_indels data_sort_dup.adam data_sort_dup_rea.adam


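Gene expression is achieved through DNA directing protein synthesis, and comprises two processes: transcription and translation.


Generally speaking, colour blindness is a hereditary disorder that a man passes through his daughter to her son, i.e. his grandson (criss-cross inheritance).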

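Read mapping

After the raw RNA-seq data has been obtained, all sequencing reads must first be mapped to the reference genome; this is the basis of all subsequent processing and analysis. Before read mapping, some basic preprocessing is sometimes needed, depending on the sequencing data: for example, filtering out low-quality reads, or removing adapter sequences from miRNA sequencing reads.
The massive data volumes of high-throughput sequencing place high demands on algorithm running time. Because reads from sequencing platforms such as Illumina/Solexa are generally short and contain few insertion/deletion errors, a number of short-read mapping algorithms have been developed. These algorithms mainly rely on spaced-seed indexing or on the Burrows-Wheeler Transform (BWT).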
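
For reference, a naive sketch of the Burrows-Wheeler Transform itself: sort all rotations of the text plus a sentinel and take the last column. This is only the textbook definition, quadratic in memory. Real short-read mappers build the transform from a suffix array and pair it with an index for searching, none of which is shown here. BwtSketch and the '$' sentinel are illustrative choices.

```scala
// Hypothetical teaching sketch of the BWT; not how production aligners build it.
object BwtSketch {
  // Forward transform: last column of the sorted rotation matrix of text + sentinel.
  def bwt(text: String, sentinel: Char = '$'): String = {
    require(text.indexOf(sentinel) < 0, "text must not contain the sentinel")
    val s = text + sentinel
    val rotations = s.indices.map(i => s.substring(i) + s.substring(0, i))
    rotations.sorted.map(_.last).mkString
  }

  // Naive inverse: repeatedly prepend the transformed column and re-sort,
  // then pick the row that ends with the sentinel.
  def inverse(last: String, sentinel: Char = '$'): String = {
    val empty = Vector.fill(last.length)("")
    val table = (1 to last.length).foldLeft(empty) { (rows, _) =>
      last.indices.map(i => last(i).toString + rows(i)).sorted.toVector
    }
    table.find(_.endsWith(sentinel.toString)).map(_.init).getOrElse("")
  }
}
```

BwtSketch.bwt("banana") gives "annb$aa", and inverse recovers "banana" from it. The transform tends to group equal characters together, which is what makes the compressed indexes built on it compact.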



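After being mapped to the genome, reads are usually stored in SAM (Sequence
Alignment/Map) format or in its binary version, BAM [39].
The binary version saves a great deal of storage space, but it cannot be
displayed directly with ordinary text-editing tools.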
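
Since SAM is plain tab-separated text, a record can be picked apart by hand. A rough sketch that reads the first six of the eleven mandatory columns (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR) and decodes three FLAG bits, which correspond to the readPaired / readMapped / mateMapped fields used in the ADAM projection earlier. SamLineSketch and SamRecord are hypothetical names, and the sample line in the usage note is made up.

```scala
import scala.util.Try

// Hypothetical minimal SAM line parser; for real work use ADAM or htsjdk.
object SamLineSketch {
  final case class SamRecord(qname: String, flag: Int, rname: String,
                             pos: Int, mapq: Int, cigar: String) {
    def readPaired: Boolean = (flag & 0x1) != 0               // 0x1: template has multiple segments
    def readMapped: Boolean = (flag & 0x4) == 0               // 0x4: segment unmapped
    def mateMapped: Boolean = readPaired && (flag & 0x8) == 0 // 0x8: next segment unmapped
  }

  // Returns None for header lines (starting with '@') and malformed records.
  def parse(line: String): Option[SamRecord] = {
    val f = line.split("\t", -1)
    if (line.startsWith("@") || f.length < 11) None
    else Try(SamRecord(f(0), f(1).toInt, f(2), f(3).toInt, f(4).toInt, f(5))).toOption
  }
}
```

For the made-up line "read1\t99\tchr1\t10016\t60\t4M\t=\t10100\t88\tACGT\tIIII", FLAG 99 decodes to a paired read that is mapped and whose mate is mapped too.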


Wu Fei @Fei-Guang Oct 14 10:25
hello,
is there a tool of bcl to adam like bcl2fastq?
do i have to convert fastq to adam?

Andy Petrella @andypetrella Oct 14 11:32
@heuermh awesome, poke us on our Gitter if needed, we'll be happy helping you. We may publish new contents meanwhile that ill point you too

Michael L Heuer @heuermh Oct 14 23:06
@Fei-Guang No, we don't have support for BCL format in ADAM (pull requests accepted, of course). From here it looks like you could either use the Illumina tools to convert to FASTQ or Picard to convert to SAM, and then read either format into ADAM.


https://www.biostars.org/p/217346/#217366



Reading ADAM alignments with SQLContext.parquetFile
A failure at this point typically means that you are reading and writing data with different versions of the Parquet libraries. Specifically, Spark SQL tries to read the Parquet statistics and fails because their format does not match what it expects.