Genomics Big data
Thursday, October 27, 2016
spark2 notes
http://stackoverflow.com/questions/40796818/how-to-append-a-resource-jar-for-spark-submit
Set SPARK_PRINT_LAUNCH_COMMAND environment variable to have the complete
Spark command printed out to the console, e.g.
$ SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-shell
Spark Command: /Library/Ja...
Refer to Print Launch Command of Spark Scripts (or
org.apache.spark.launcher.Main Standalone Application where this environment
variable is actually used).
docker run -v `pwd`:/data -e SPARK_PRINT_LAUNCH_COMMAND=1 -it heuermh/adam adam-shell
Tip
Avoid using the scala.App trait for a Spark application's main class in Scala, as reported in SPARK-4170 Closure problems when running Scala app that "extends App".
Refer to Executing Main — runMain internal method in this document.
Make sure to use the same version of Scala as the one used to build your distribution of Spark. Pre-built distributions of Spark 1.x use Scala 2.10, while pre-built distributions of Spark 2.0.x use Scala 2.11.
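A quick way to check both versions from a running spark-shell (a small sketch; the output values are examples only):
// print the Scala version the shell runs on and the Spark version
println(util.Properties.versionString)   // e.g. version 2.11.8
println(sc.version)                      // e.g. 2.0.1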
Steps to install Spark (1.6.2-bin-hadoop2.6, pre-built) in local mode on Windows:
- Install Java 7 or later. To test that the Java installation is complete, open a command prompt, type java and hit enter. If you receive the message "'java' is not recognized as an internal or external command", you need to configure the JAVA_HOME and PATH environment variables to point to the path of the JDK.
- Set SCALA_HOME in Control Panel\System and Security\System, go to "Advanced system settings" and add %SCALA_HOME%\bin to the PATH variable in environment variables.
- Install Python 2.6 or later from the Python download link.
- Download SBT, install it, and set SBT_HOME as an environment variable with value <<SBT PATH>>.
- Download winutils.exe from the HortonWorks repo or a git repo. Since we don't have a local Hadoop installation on Windows, we have to download winutils.exe and place it in a bin directory under a created Hadoop home directory. Set HADOOP_HOME = <<Hadoop home directory>> as an environment variable and add it to the PATH environment variable.
- We will be using a pre-built Spark package, so choose a Spark pre-built package for Hadoop from the Spark download page. Download and extract it. Set SPARK_HOME and add %SPARK_HOME%\bin to the PATH variable in environment variables.
- Run the command: spark-shell
- Open http://localhost:4040/ in a browser to see the SparkContext web UI.
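Once spark-shell is up, a quick sanity check (a minimal sketch; any small job will do):
// a trivial job to confirm the local installation works; 1+2+...+100 = 5050
val n = sc.parallelize(1 to 100).reduce(_ + _)
println("sum 1..100 = " + n)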
$ cat rdd1.txt
chr1 10016
chr1 10017
chr1 10018
chr1 20026
scala> val lines = sc.textFile("/data/rdd1.txt")
scala> case class Chrom(name: String, value: Long)
defined class Chrom
scala> val chroms = lines.map(_.split("\\s+")).map(r => Chrom(r(0), r(1).toLong))
chroms: org.apache.spark.rdd.RDD[Chrom] = MapPartitionsRDD[5] at map at <console>:28
scala> val df = chroms.toDF
16/10/28 16:17:42 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/10/28 16:17:43 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
df: org.apache.spark.sql.DataFrame = [name: string, value: bigint]
scala> df.show
+----+-----+
|name|value|
+----+-----+
|chr1|10016|
|chr1|10017|
|chr1|10018|
|chr1|20026|
|chr1|20036|
|chr1|30016|
|chr1|30026|
|chr2|40016|
|chr2|40116|
|chr2|50016|
|chr3|70016|
+----+-----+
scala> df.filter('value > 30000).show
+----+-----+
|name|value|
+----+-----+
|chr1|30016|
|chr1|30026|
|chr2|40016|
|chr2|40116|
|chr2|50016|
|chr3|70016|
+----+-----+
scala> case class Chrom2(name: String, min: Long, max: Long)
scala> val chroms2 = rdd2.map(_.split("\\s+")).map(r => Chrom2(r(0), r(1).toLong, r(2).toLong))
chroms2: org.apache.spark.rdd.RDD[Chrom2] = MapPartitionsRDD[35] at map at <console>:28
scala> val df2=chroms2.toDF
df2: org.apache.spark.sql.DataFrame = [name: string, min: bigint ... 1 more field]
scala> df.join(df2, Seq("name")).where($"value".between($"min", $"max")).groupBy($"name").count().show()
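The transcript never shows where rdd2 comes from. A minimal sketch of the missing step, assuming a whitespace-delimited file rdd2.txt with name, min and max columns (the file name and its layout are assumptions):
// assumed input /data/rdd2.txt, rows like:
//   chr1 10000 30000
//   chr2 40000 60000
scala> val rdd2 = sc.textFile("/data/rdd2.txt")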
$ ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.2.0
If your CSV file does not have the same number of fields in each row, it cannot be parsed as-is into a DataFrame.
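One common workaround is to have spark-csv drop malformed rows instead of failing; a sketch assuming the Spark 1.x SQLContext and a placeholder input path:
// drop rows whose field count does not match the header instead of failing
// "/data/input.csv" is a placeholder path
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .load("/data/input.csv")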
As of Spark 2.0.0, DataFrame - the flagship data abstraction of previous
versions of Spark SQL - is currently a mere type alias for Dataset[Row] :
A Dataset is local if it was created from local collections using SparkSession.emptyDataset
or SparkSession.createDataset methods and their derivatives like toDF. If so, the queries on
the Dataset can be optimized and run locally, i.e. without using Spark executors.
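A small illustration in the Spark 2.x shell (the sample values are arbitrary):
// DataFrame is just a type alias for Dataset[Row] in Spark 2.x
import org.apache.spark.sql.DataFrame
val ds = spark.createDataset(Seq(("chr1", 10016L), ("chr2", 40016L)))
val df: DataFrame = ds.toDF("name", "value")   // a Dataset[Row]
df.filter($"value" > 30000).show()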
Wednesday, October 12, 2016
bigdatagenomics ADAM
A primary reason for using ADAM is the ability to process your genomics data in parallel. The structures of BAM / SAM / VCF files are not designed for parallelism. Because of this, even the Broad Institute's GATK tools as of V4 support ADAM Parquet.
docker run -it heuermh/adam adam-submit --version
docker run -v `pwd`:/data -it heuermh/adam adam-shell
spark2.0
scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._
scala> val alignmentRecords = sc.loadBam("/data/small.sam")
spark1.6
scala> import org.bdgenomics.adam.rdd.ADAMContext
scala> val ac = new ADAMContext(sc)
# we have to define the fields in the projection, otherwise they can't be accessed
scala> import org.bdgenomics.adam.projections.{Projection, AlignmentRecordField}
scala> val projection = Projection(
AlignmentRecordField.readMapped,
AlignmentRecordField.mateMapped,
AlignmentRecordField.readPaired
)
scala> val adamFile = ac.loadAlignments(
"/data/sample.rmdup.adam",
projection = Some(projection)
)
scala> adamFile.map(p => p.getReadMapped).take(1).foreach(println)
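As a follow-up, a minimal sketch of using the projected field, counting mapped vs. unmapped reads in the same session:
// count reads by their mapped flag, using only the projected readMapped field
scala> val mappedCounts = adamFile.map(r => (r.getReadMapped, 1L)).reduceByKey(_ + _)
scala> mappedCounts.collect().foreach(println)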
A contig is a concept from genome sequencing and assembly: after the fragments containing STS (sequence-tagged site) markers are sequenced separately, overlap analysis yields the complete chromosomal genome sequence; the set of overlapping fragments used in this analysis is the contig.
Whole-genome de novo sequencing: also called sequencing from scratch, it does not rely on any existing sequence data. The genome of a species is sequenced directly, and bioinformatics methods are then used to assemble the reads into the genome sequence map of that species.
Whole-genome resequencing: sequencing different individuals of a species that already has a reference sequence, and using it to analyze genetic differences at the individual or population level. Whole-genome resequencing can detect large numbers of single-nucleotide polymorphisms (SNPs), copy-number variations (CNVs), insertions/deletions (InDels), structural variations (SVs) and other variant types, quickly and accurately lifting a single reference genome up to population-level genetic characteristics.
Transcriptome: the set of all transcripts in a given tissue or cell at a particular developmental stage; in the narrow sense, the set of all mRNAs.
Transcriptome sequencing: sequencing all RNA that a tissue can transcribe in a given functional state, to obtain nearly all transcript sequences of that species in that state. Usually transcriptome sequencing refers to sequencing mRNA. Depending on whether the species has a reference genome, it is divided into de novo transcriptome sequencing (no reference genome) and transcriptome resequencing (with a reference genome).
scala> import org.bdgenomics.adam.rdd.ADAMContext
import org.bdgenomics.adam.rdd.ADAMContext
scala> val ac = new ADAMContext(sc)
ac: org.bdgenomics.adam.rdd.ADAMContext = org.bdgenomics.adam.rdd.ADAMContext@4f61fc2e
scala> ac.loadAlignments("/data/sample.rmdup.bam")
16/10/31 09:36:57 INFO SparkContext: Created broadcast 0 from newAPIHadoopFile at ADAMContext.scala:284
res1: org.apache.spark.rdd.RDD[org.bdgenomics.formats.avro.AlignmentRecord] = MapPartitionsRDD[1] at map at ADAMContext.scala:289
scala>
adam-submit --master yarn -- transform data.sam data.adam
adam-submit --master yarn -- transform -sort_reads data.adam data_sort.adam
adam-submit --master yarn -- transform -mark_duplicate_reads data_sort.adam data_sort_dup.adam
adam-submit --master yarn -- transform -realign_indels data_sort_dup.adam data_sort_dup_rea.adam
Gene expression is realized through DNA directing protein synthesis, and it comprises two processes: transcription and translation.
Generally speaking, a hereditary disease such as color blindness is passed from a man through his daughter to his grandson (criss-cross inheritance).
Read mapping
After obtaining raw RNA-seq data, the first step is to map all sequencing reads onto the reference genome; this is the basis of all subsequent processing and analysis. Before read mapping, some basic preprocessing is sometimes needed depending on the data, for example filtering out low-quality reads, or trimming adapter sequences from miRNA sequencing reads.
The massive data volumes of high-throughput sequencing place high demands on algorithm running time. Because reads from platforms such as Illumina/Solexa are generally short and contain few insertion/deletion errors, a number of short-read alignment algorithms have been developed. These algorithms are mainly built on spaced-seed indexing or the Burrows-Wheeler Transform (BWT).
Once reads are mapped to the genome, they are usually stored in SAM (Sequence Alignment/Map) format or its binary version, BAM [39]. The binary version saves a great deal of storage space but cannot be displayed directly with an ordinary text editor.
Wu Fei @Fei-Guang Oct 14 10:25
hello,
is there a tool of bcl to adam like bcl2fastq?
do i have to convert fastq to adam?
Andy Petrella @andypetrella Oct 14 11:32
@heuermh awesome, poke us on our Gitter if needed, we'll be happy helping you. We may publish new contents meanwhile that ill point you too
Michael L Heuer @heuermh Oct 14 23:06
@Fei-Guang No, we don't have support for BCL format in ADAM (pull requests accepted, of course). From here it looks like you could either use the Illumina tools to convert to FASTQ or Picard to convert to SAM, and then read either format into ADAM.
https://www.biostars.org/p/217346/#217366
Reading ADAM alignments with SQLContext.parquetFile
If this fails, it typically means that you are reading and writing data with different versions of the Parquet libraries. Specifically, Spark SQL is trying to read the Parquet statistics and fails because the format of the statistics doesn't match what it expects.
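For reference, a minimal sketch of reading an ADAM Parquet directory through Spark SQL (Spark 1.6-style API; the path is the one used above, and the exact column names depend on the bdg-formats version):
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
// read.parquet is the current equivalent of the older parquetFile call
val reads = sqlContext.read.parquet("/data/sample.rmdup.adam")
reads.printSchema()   // inspect which AlignmentRecord columns are present
reads.show(5)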
Saturday, September 24, 2016
docker registry using oss as backend
Dockerfile:
FROM registry:2.3
ENV REGISTRY_STORAGE=oss
ENV REGISTRY_STORAGE_OSS_ACCESSKEYID=
ENV REGISTRY_STORAGE_OSS_ACCESSKEYSECRET=
ENV REGISTRY_STORAGE_OSS_REGION=oss-cn-beijing
ENV REGISTRY_STORAGE_OSS_BUCKET=docker-registry-ano
ENV REGISTRY_STORAGE_OSS_ENDPOINT=vpc100-oss-cn-beijing.aliyuncs.com
ENV REGISTRY_STORAGE_OSS_INTERNAL=true
ENV REGISTRY_STORAGE_OSS_SECURE=false
$docker build -t registry-oss-ano:0.1 .
#Not using localhost ensured that ipv4 was used
$docker -D run -p 127.0.0.1:80:5000 registry-oss-ano:0.1
With REGISTRY_STORAGE_OSS_INTERNAL=true the connection to the internal OSS endpoint times out:
panic: Get http://docker-registry-anno.oss-cn-beijing-internal.aliyuncs.com/?delimiter=&marker=&max-keys=1&prefix=: dial tcp 10.157.164.6:80: getsockopt: connection timed out
2016-09-24 16:49:18
Sunday, September 18, 2016
Swarm basic
#Set up a discovery backend
docker run -d -p 8500:8500 --name=consul progrium/consul -server -bootstrap
#Create Swarm cluster
#debug
docker run -d -p 4000:4000 swarm --debug manage -H :4000 --replication --advertise 192.168.13.159:4000 consul://192.168.13.159:8500
docker run -d -p 5000:4000 swarm manage -H :5000 --replication --advertise 192.168.13.159:5000 consul://192.168.13.159:8500
docker run -d -p 6000:4000 swarm manage -H :6000 --replication --advertise 192.168.13.159:6000 consul://172.30.0.161:8500
#add a node
docker run -d swarm --debug join --advertise=192.168.13.159:2375 consul://192.168.13.159:8500
# use the regular docker cli
$ docker -H tcp://<swarm_ip:swarm_port> info
$ docker -H tcp://<swarm_ip:swarm_port> run ...
$ docker -H tcp://<swarm_ip:swarm_port> ps
$ docker -H tcp://<swarm_ip:swarm_port> logs ...
$ docker -H tcp://<swarm_ip:swarm_port> network ls
...
# list nodes in your cluster
$ docker run --rm swarm list token://<cluster_id>
Thursday, September 8, 2016
dropwizard in docker
java -jar target/dropwiz.jar server application-config.yml
FROM java:8
MAINTAINER Victor Martinez
RUN mkdir -p /dw/project/src/app
WORKDIR /dw/project/src/app
EXPOSE 8080 8081
COPY . /dw/project/src/app
dropwizard_container:
  image: dropwizard_image
  command: java -jar target/dropwizard-1.0.0-SNAPSHOT.jar server src/main/resources/config.yml
  ports:
    - 8080:8080
    - 8081:8081
Tuesday, September 6, 2016
golang
Go does not have classes. However, you can define methods on types.
go f(x, y, z)
The evaluation of f, x, y, and z happens in the current goroutine; only the execution of f happens in the new goroutine.
- $GOPATH defaults to the same value as $GOROOT, but starting with Go 1.1 you must set it to a different path. It can contain multiple paths that hold Go source files, package files and executables, and each of those paths must contain three prescribed directories: src, pkg and bin, which hold source files, package files and executables respectively.
how to debug gofabric8
clone gofabric8 to $GOPATH/src/github.com/fabric8io/gofabric8
$git clone git@github.com:fabric8io/gofabric8.git $GOPATH/src/github.com/fabric8io/gofabric8
$make
$go get github.com/derekparker/delve/cmd/dlv
$export PATH=$GOPATH/bin:$PATH
$dlv debug
(dlv) break main.main
To pass command-line arguments for run/debug in LiteIDE: Build Config -> TARGETARGS (https://github.com/visualfc/liteide/issues/67)