spark2 notes
Set SPARK_PRINT_LAUNCH_COMMAND environment variable to have the complete
Spark command printed out to the console, e.g.
$ SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-shell
Spark Command: /Library/Ja...
Refer to Print Launch Command of Spark Scripts (or
org.apache.spark.launcher.Main Standalone Application where this environment
variable is actually used).
 docker run -v `pwd`:/data -e SPARK_PRINT_LAUNCH_COMMAND=1 -it heuermh/adam adam-shell

Avoid using the scala.App trait for a Spark application’s main class in Scala, as
reported in SPARK-4170 Closure problems when running Scala app that "extends
App".
Refer to Executing Main — runMain internal method in this document.
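
A minimal sketch of the alternative, assuming Spark 2.x is on the classpath: the driver logic goes into an explicit main method rather than the body of an object that extends scala.App. The object name, the app name, the sumTo helper, and the local[*] master are hypothetical choices for illustration only.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical application object: note the explicit main method
// instead of `extends App`.
object LaunchSketch {

  // Sums 1..n as a Spark job; kept separate so it is easy to call and test.
  def sumTo(spark: SparkSession, n: Int): Long =
    spark.sparkContext.parallelize(1 to n).map(_.toLong).reduce(_ + _)

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for trying the sketch on one machine;
    // with spark-submit the master normally comes from the command line.
    val spark = SparkSession.builder()
      .appName("launch-sketch")
      .master("local[*]")
      .getOrCreate()
    try {
      println(sumTo(spark, 100)) // prints 5050
    } finally {
      spark.stop()
    }
  }
}
```

With an explicit main, the values used inside closures are ordinary locals or method parameters, so they do not depend on the delayed initialization that scala.App applies to the object body.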

Make sure to use the same version of Scala as the one used to build your distribution of Spark. Pre-built distributions of Spark 1.x use Scala 2.10, while pre-built distributions of Spark 2.0.x use Scala 2.11.
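
One way to check which Scala version you are actually running is to ask the Scala library itself. The sketch below is plain Scala with no Spark dependency; the object name and the binaryVersion helper are made up for illustration.

```scala
object ScalaVersionCheck {
  // Reduces a full Scala version such as "2.11.8" to its binary version "2.11".
  def binaryVersion(full: String): String =
    full.split('.').take(2).mkString(".")

  def main(args: Array[String]): Unit = {
    // Version of the Scala library this code is running on.
    val running = scala.util.Properties.versionNumberString
    println(s"Scala $running (binary ${binaryVersion(running)})")
  }
}
```

Inside spark-shell the same scala.util.Properties.versionNumberString call shows the Scala version of the distribution, and sc.version shows the Spark version; the binary version should match the suffix of your dependencies (for example the _2.11 in an artifact name).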
Steps to install the pre-built Spark 1.6.2-bin-hadoop2.6 package in local mode on Windows:
  1. Install Java 7 or later. To check the installation, open a command prompt, type java, and press Enter. If you get the message "'java' is not recognized as an internal or external command", set the JAVA_HOME and PATH environment variables to point to the JDK.
  2. Set SCALA_HOME under Control Panel\System and Security\System: go to "Advanced system settings" and add %SCALA_HOME%\bin to the PATH environment variable.
  3. Install Python 2.6 or later from the Python download page.
  4. Download and install SBT, then set the SBT_HOME environment variable to <<SBT PATH>>.
  5. Download winutils.exe from the HortonWorks repo or the git repo. Because there is no local Hadoop installation on Windows, place winutils.exe in a bin directory under a newly created Hadoop home directory. Set HADOOP_HOME = <<Hadoop home directory>> as an environment variable and add it to PATH.
  6. Choose a Spark package pre-built for Hadoop on the Spark download page, then download and extract it.
     Set SPARK_HOME and add %SPARK_HOME%\bin to the PATH environment variable.
  7. Run the command: spark-shell
  8. Open http://localhost:4040/ in a browser to see the SparkContext web UI.
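
Before running spark-shell, it can help to confirm that the environment variables from the steps above are visible to a new process. This is a minimal sketch in plain Scala; the object name and the missing helper are hypothetical, and the variable list simply mirrors the steps.

```scala
object EnvCheck {
  // Variables named in the installation steps.
  val required: Seq[String] =
    Seq("JAVA_HOME", "SCALA_HOME", "SBT_HOME", "HADOOP_HOME", "SPARK_HOME")

  // Returns the names from `names` that are unset or blank in `env`.
  def missing(env: Map[String, String], names: Seq[String]): Seq[String] =
    names.filter(n => env.get(n).forall(_.trim.isEmpty))

  def main(args: Array[String]): Unit = {
    val absent = missing(sys.env, required)
    if (absent.isEmpty) println("All variables are set")
    else println("Missing: " + absent.mkString(", "))
  }
}
```

The check only confirms that the variables exist; it does not verify that the paths they point to are valid.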
$ cat rdd1.txt
chr1    10016
chr1    10017
chr1    10018
chr1    20026
scala> val lines = sc.textFile("/data/rdd1.txt")

scala> case class Chrom(name: String, value: Long)
defined class Chrom

scala> val chroms = lines.map(_.split("\\s+")).map(r => Chrom(r(0), r(1).toLong))
chroms: org.apache.spark.rdd.RDD[Chrom] = MapPartitionsRDD[5] at map at <console>:28

scala> val df = chroms.toDF
16/10/28 16:17:42 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/10/28 16:17:43 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
df: org.apache.spark.sql.DataFrame = [name: string, value: bigint]

scala> df.filter('value > 30000).show


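scala> case class Chrom2(name: String, min: Long, max: Long)
scala> // lines2: an RDD[String] of "name min max" rows; how it was loaded is not shown in these notes
scala> val chroms2 = lines2.map(_.split("\\s+")).map(r => Chrom2(r(0), r(1).toLong, r(2).toLong))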

chroms2: org.apache.spark.rdd.RDD[Chrom2] = MapPartitionsRDD[35] at map at <console>:28

scala> val df2=chroms2.toDF

df2: org.apache.spark.sql.DataFrame = [name: string, min: bigint ... 1 more field]

scala> df.join(df2, Seq("name")).where($"value".between($"min", $"max")).groupBy($"name").count().show()
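
The join above can be tried outside the shell with a self-contained sketch, assuming Spark 2.x on the classpath. The positions are the rows of rdd1.txt; the single interval row, the Position and Interval class names, and the countInRange helper are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Case classes live at the top level so Spark can derive encoders for them.
case class Position(name: String, value: Long)
case class Interval(name: String, min: Long, max: Long)

object RangeJoinSketch {
  // Counts, per name, the positions that fall inside an interval
  // (between is inclusive on both bounds).
  def countInRange(spark: SparkSession,
                   positions: Seq[Position],
                   intervals: Seq[Interval]): Map[String, Long] = {
    import spark.implicits._
    val df = positions.toDF
    val df2 = intervals.toDF
    df.join(df2, Seq("name"))
      .where($"value".between($"min", $"max"))
      .groupBy($"name")
      .count()
      .collect()
      .map(r => r.getString(0) -> r.getLong(1))
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("range-join-sketch")
      .master("local[*]") // assumption: single-machine trial run
      .getOrCreate()
    try {
      // Positions copied from rdd1.txt; the interval row is hypothetical.
      val positions = Seq(
        Position("chr1", 10016L), Position("chr1", 10017L),
        Position("chr1", 10018L), Position("chr1", 20026L))
      val intervals = Seq(Interval("chr1", 10000L, 10020L))
      println(countInRange(spark, positions, intervals)) // Map(chr1 -> 3)
    } finally {
      spark.stop()
    }
  }
}
```

The join pairs every position with every interval of the same name before the range filter is applied, so on large inputs with many intervals per name the intermediate result can grow quickly.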

$ ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.2.0

Your csv file does not have the same number of fields in each row - this cannot be parsed as is into a DataFrame
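
To find the offending rows before handing a file to a CSV reader, the field count of each line can be compared with that of the first line. The sketch below is plain Scala over an in-memory sequence; the object and method names and the sample rows are hypothetical, and quoted fields that contain the delimiter are deliberately not handled.

```scala
import java.util.regex.Pattern

object FieldCountCheck {
  // Number of fields in one delimited line. The -1 limit keeps trailing
  // empty fields. Quoted fields containing the delimiter are not handled.
  def fieldCount(line: String, delimiter: Char = ','): Int =
    line.split(Pattern.quote(delimiter.toString), -1).length

  // Returns (1-based line number, field count) for every line whose
  // field count differs from that of the first line.
  def raggedLines(lines: Seq[String], delimiter: Char = ','): Seq[(Int, Int)] =
    lines.headOption match {
      case None => Seq.empty
      case Some(first) =>
        val expected = fieldCount(first, delimiter)
        lines.zipWithIndex.collect {
          case (line, i) if fieldCount(line, delimiter) != expected =>
            (i + 1, fieldCount(line, delimiter))
        }
    }

  def main(args: Array[String]): Unit = {
    // Made-up sample: the third line is one field short.
    val sample = Seq("name,min,max", "chr1,10000,10020", "chr1,20000")
    println(raggedLines(sample)) // List((3,2))
  }
}
```

In spark-shell the same fieldCount function could be applied to the lines of sc.textFile(...) to filter or inspect ragged rows before building a DataFrame.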

As of Spark 2.0.0, DataFrame, the flagship data abstraction of previous
versions of Spark SQL, is a mere type alias for Dataset[Row], declared in the
org.apache.spark.sql package object as type DataFrame = Dataset[Row].

A Dataset is local if it was created from local collections using SparkSession.emptyDataset
or SparkSession.createDataset methods and their derivatives like toDF. If so, the queries on
the Dataset can be optimized and run locally, i.e. without using Spark executors.
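
A small sketch of both points, assuming Spark 2.x on the classpath: Datasets built from local collections with createDataset and emptyDataset, and a DataFrame assigned to a Dataset[Row] without a cast. The object name, the localFrame helper, and the column name n are hypothetical. No expected values are shown for isLocal, since the result depends on the plan Spark builds for each Dataset.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object LocalDatasetSketch {
  // Builds a Dataset from a local collection and views it as a DataFrame.
  def localFrame(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val ds: Dataset[Int] = spark.createDataset(Seq(1, 2, 3))
    ds.toDF("n")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("local-dataset-sketch")
      .master("local[*]") // assumption: single-machine trial run
      .getOrCreate()
    import spark.implicits._
    try {
      val df: DataFrame = localFrame(spark)
      // Compiles without a cast because DataFrame is an alias for Dataset[Row].
      val rows: Dataset[Row] = df
      val empty: Dataset[String] = spark.emptyDataset[String]
      // isLocal reports whether collect and take can run without executors.
      println(s"rows.isLocal = ${rows.isLocal}, empty.isLocal = ${empty.isLocal}")
      println(rows.count()) // 3
    } finally {
      spark.stop()
    }
  }
}
```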

bigdatagenomics ADAM

A primary reason for using ADAM is the ability to process genomics data in parallel. The BAM, SAM, and VCF file formats were not designed for parallelism. Because of this, even the Broad Institute's GATK tools support ADAM Parquet as of version 4.

docker run -it heuermh/adam adam-submit --version

docker run -v `pwd`:/data -it heuermh/adam adam-shell

scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> val alignmentRecords = sc.loadBam("/data/small.sam")

scala> import org.bdgenomics.adam.rdd.ADAMContext
scala> val ac = new ADAMContext(sc)
scala> // The fields must be listed in the projection; otherwise they cannot be accessed.
scala> val projection = Projection(...)
scala> val adamFile = ac.loadAlignments(..., projection = Some(projection))


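Whole-genome de novo sequencing: also called de novo sequencing. It does not rely on any existing sequence data; the genome of a species is sequenced directly, and bioinformatics methods are then used to splice and assemble the reads into the genome sequence map of that species.

Whole-genome resequencing: sequencing the genomes of different individuals of a species that already has a reference sequence, and using the result to analyze genetic differences at the individual or population level. Whole-genome resequencing can uncover large numbers of single nucleotide polymorphisms (SNP), copy number variations (CNV), insertions/deletions (InDel), structural variations (SV), and other variant types, quickly and accurately raising the information in a single reference genome to the level of population genetic characteristics.


Transcriptome sequencing: sequencing all the RNA that a tissue can transcribe in a given functional state, which yields nearly all transcript sequences of the species in that state. Transcriptome sequencing usually means sequencing mRNA to obtain the relevant sequences. Depending on whether the species under study has a reference genome sequence, it is divided into transcriptome de novo sequencing (no reference genome sequence) and transcriptome resequencing (reference genome sequence available).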

scala> import org.bdgenomics.adam.rdd.ADAMContext
import org.bdgenomics.adam.rdd.ADAMContext

scala> val ac = new ADAMContext(sc)
ac: org.bdgenomics.adam.rdd.ADAMContext = org.bdgenomics.adam.rdd.ADAMContext@4f61fc2e

scala> ac.loadAlignments("/data/sample.rmdup.bam")
16/10/31 09:36:57 INFO SparkContext: Created broadcast 0 from newAPIHadoopFile at ADAMContext.scala:284
res1: org.apache.spark.rdd.RDD[org.bdgenomics.formats.avro.AlignmentRecord] = MapPartitionsRDD[1] at map at ADAMContext.scala:289


adam-submit --master yarn -- transform data.sam data.adam
adam-submit --master yarn -- transform -sort_reads data.adam data_sort.adam
adam-submit --master yarn -- transform -mark_duplicate_reads data_sort.adam data_sort_dup.adam
adam-submit --master yarn -- transform -realign_indels data_sort_dup.adam data_sort_dup_rea.adam




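After the raw RNA-seq data are obtained, the first step is to ...
... places very high demands on ... For ... such as Illumina/Solexa ...
... indexing) or the Burrows-Wheeler transform (Burrows-Wheeler ...

... Alignment/Map) format or its binary version, the BAM format [39]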

Reading ADAM alignments with SQLContext.parquetFile
An error here typically means that you are reading and writing data with different versions of the Parquet libraries. Specifically, Spark SQL tries to read the Parquet statistics and fails because their format does not match what it expects.
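
For reference, SQLContext.parquetFile has been deprecated since Spark 1.4 in favor of read.parquet. The sketch below, assuming Spark 2.x on the classpath, shows that call on a small stand-in table written to a temporary directory; it is not ADAM output. For ADAM data the path would instead be the directory produced by transform (such as data.adam above). The object name, the readParquet helper, and the sample rows are hypothetical. Because the sketch writes and reads with the same Parquet libraries, it does not reproduce the version mismatch described here.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object ParquetReadSketch {
  // Reads a Parquet directory with read.parquet (replaces parquetFile).
  def readParquet(spark: SparkSession, path: String): DataFrame =
    spark.read.parquet(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-read-sketch")
      .master("local[*]") // assumption: single-machine trial run
      .getOrCreate()
    import spark.implicits._
    try {
      // Stand-in data written to a fresh temporary location.
      val path = Files.createTempDirectory("parquet-sketch")
        .resolve("reads.parquet").toString
      Seq(("chr1", 10016L), ("chr1", 10017L))
        .toDF("name", "value")
        .write.parquet(path)

      val df = readParquet(spark, path)
      df.printSchema()
      println(df.count()) // 2
    } finally {
      spark.stop()
    }
  }
}
```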