Thursday, October 27, 2016

spark2 notes

http://stackoverflow.com/questions/40796818/how-to-append-a-resource-jar-for-spark-submit
Set SPARK_PRINT_LAUNCH_COMMAND environment variable to have the complete
Spark command printed out to the console, e.g.
$ SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-shell
Spark Command: /Library/Ja...
Refer to Print Launch Command of Spark Scripts (or
org.apache.spark.launcher.Main Standalone Application where this environment
variable is actually used).
Another example, running adam-shell via Docker with the same environment variable:
docker run -v `pwd`:/data -e SPARK_PRINT_LAUNCH_COMMAND=1 -it heuermh/adam adam-shell

Tip
Avoid using the scala.App trait for a Spark application's main class in Scala, as reported in SPARK-4170 "Closure problems when running Scala app that extends App".
Refer to "Executing Main — runMain internal method" in the Mastering Apache Spark notes.
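A minimal sketch of the recommended shape (the object name, input path, and job logic are placeholders): give the application an explicit main method instead of extending scala.App, so that everything is initialized before any closures referencing it are serialized.

import org.apache.spark.sql.SparkSession

object MySparkApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MySparkApp").getOrCreate()

    // Values defined inside main are initialized before the closures below are shipped out.
    val counts = spark.sparkContext
      .textFile("/data/rdd1.txt")
      .map(_.split("\\s+")(0))
      .countByValue()

    counts.foreach(println)
    spark.stop()
  }
}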

Make sure to use the same version of Scala as the one used to build your distribution of Spark. Pre-built distributions of Spark 1.x use Scala 2.10, while pre-built distributions of Spark 2.0.x use Scala 2.11.
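A quick way to check which Scala version a given spark-shell was built with (the exact output depends on the distribution; a 2.0.x pre-built package reports 2.11.x):

scala> util.Properties.versionString
res0: String = version 2.11.8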
Steps to install Spark (1.6.2-bin-hadoop2.6, pre-built) in local mode on Windows:
  1. Install Java 7 or later. To test that the Java installation is complete, open a command prompt, type java and hit enter. If you receive the message "'java' is not recognized as an internal or external command", you need to configure the JAVA_HOME and PATH environment variables to point to the path of the JDK.
  2. Set SCALA_HOME in Control Panel\System and Security\System, go to "Advanced system settings", and add %SCALA_HOME%\bin to the PATH variable in Environment Variables.
  3. Install Python 2.6 or later from the Python download page.
  4. Download SBT, install it, and set SBT_HOME as an environment variable with value <<SBT PATH>>.
  5. Download winutils.exe from the HortonWorks repo or a git repo. Since we don't have a local Hadoop installation on Windows, we have to download winutils.exe and place it in a bin directory under a created Hadoop home directory. Set HADOOP_HOME = <<Hadoop home directory>> as an environment variable and add %HADOOP_HOME%\bin to PATH.
  6. We will be using a pre-built Spark package, so choose a Spark pre-built package for Hadoop from the Spark download page. Download and extract it.
    Set SPARK_HOME and add %SPARK_HOME%\bin to the PATH variable in Environment Variables.
  7. Run the command: spark-shell (a quick sanity check is shown below this list).
  8. Open http://localhost:4040/ in a browser to see the SparkContext web UI.
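Once the shell is up, a short sanity check (the version string should match the package you downloaded, 1.6.2 here):

scala> sc.version
res0: String = 1.6.2

scala> sc.parallelize(1 to 100).sum()
res1: Double = 5050.0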
$ cat rdd1.txt
chr1    10016
chr1    10017
chr1    10018
chr1    20026
scala> val lines = sc.textFile("/data/rdd1.txt")

scala> case class Chrom(name: String, value: Long)
defined class Chrom

scala> val chroms = lines.map(_.split("\\s+")).map(r => Chrom(r(0), r(1).toLong))
chroms: org.apache.spark.rdd.RDD[Chrom] = MapPartitionsRDD[5] at map at <console>:28

scala> val df = chroms.toDF
16/10/28 16:17:42 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/10/28 16:17:43 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
df: org.apache.spark.sql.DataFrame = [name: string, value: bigint]

scala> df.show
+----+-----+
|name|value|
+----+-----+
|chr1|10016|
|chr1|10017|
|chr1|10018|
|chr1|20026|
|chr1|20036|
|chr1|30016|
|chr1|30026|
|chr2|40016|
|chr2|40116|
|chr2|50016|
|chr3|70016|
+----+-----+
scala> df.filter('value > 30000).show
+----+-----+
|name|value|
+----+-----+
|chr1|30016|
|chr1|30026|
|chr2|40016|
|chr2|40116|
|chr2|50016|
|chr3|70016|
+----+-----+

scala> val rdd2 = sc.textFile("/data/rdd2.txt")   // assumes a second file with columns: name, min, max

scala> case class Chrom2(name: String, min: Long, max: Long)
defined class Chrom2

scala> val chroms2 = rdd2.map(_.split("\\s+")).map(r => Chrom2(r(0), r(1).toLong, r(2).toLong))

chroms2: org.apache.spark.rdd.RDD[Chrom2] = MapPartitionsRDD[35] at map at <console>:28

scala> val df2 = chroms2.toDF

df2: org.apache.spark.sql.DataFrame = [name: string, min: bigint ... 1 more field]

scala> df.join(df2, Seq("name")).where($"value".between($"min", $"max")).groupBy($"name").count().show()
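The same query can also be expressed in SQL by registering the DataFrames as temporary views (a sketch for Spark 2.x; the view names chroms and ranges are made up here):

df.createOrReplaceTempView("chroms")
df2.createOrReplaceTempView("ranges")
spark.sql("""
  SELECT c.name, count(*) AS count
  FROM chroms c JOIN ranges r ON c.name = r.name
  WHERE c.value BETWEEN r.min AND r.max
  GROUP BY c.name
""").show()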




$ ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.2.0



If your CSV file does not have the same number of fields in each row, it cannot be parsed as-is into a DataFrame.
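With the spark-csv package you can ask the parser to drop the malformed rows instead of failing (a sketch for Spark 1.x; the file path and header option are assumptions about your data):

val csv = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")   // skip rows whose field count does not match the schema
  .load("/data/example.csv")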


As of Spark 2.0.0, DataFrame - the flagship data abstraction of previous
versions of Spark SQL - is currently a mere type alias for Dataset[Row] :
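In the Spark 2.0 sources the alias is simply (in the org.apache.spark.sql package object):

type DataFrame = Dataset[Row]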


A Dataset is local if it was created from local collections using SparkSession.emptyDataset
or SparkSession.createDataset methods and their derivatives like toDF. If so, the queries on
the Dataset can be optimized and run locally, i.e. without using Spark executors.
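A small illustration in the Spark 2.x shell (spark.implicits._ is imported there automatically):

scala> val ds = spark.createDataset(Seq(1, 2, 3))
ds: org.apache.spark.sql.Dataset[Int] = [value: int]

scala> ds.isLocal
res0: Boolean = true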
