http://stackoverflow.com/questions/40796818/how-to-append-a-resource-jar-for-spark-submit
Set SPARK_PRINT_LAUNCH_COMMAND environment variable to have the complete
Spark command printed out to the console, e.g.
$ SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-shell
Spark Command: /Library/Ja...
Refer to Print Launch Command of Spark Scripts (or
org.apache.spark.launcher.Main Standalone Application where this environment
variable is actually used).
The same environment variable works in other launchers too, e.g. the ADAM shell in Docker:
$ docker run -v `pwd`:/data -e SPARK_PRINT_LAUNCH_COMMAND=1 -it heuermh/adam adam-shell
Tip
Avoid using scala.App trait for a Spark application’s main class in Scala as
reported in SPARK-4170 Closure problems when running Scala app that "extends
App".
Refer to Executing Main — runMain internal method in this document.
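For illustration, a minimal sketch of the recommended structure, using an explicit main method instead of extends App (the object name and the word-count logic are illustrative):
import org.apache.spark.{SparkConf, SparkContext}

// An explicit main method (rather than `extends App`) ensures fields are
// initialized before Spark serializes the closures that reference them.
object MySparkApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MySparkApp")
    val sc = new SparkContext(conf)

    val counts = sc.textFile(args(0))
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}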
Make sure to use the same version of Scala as the one used to build your distribution of Spark. Pre-built distributions of Spark 1.x use Scala 2.10, while pre-built distributions of Spark 2.0.x use Scala 2.11.
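For example, a minimal build.sbt sketch for an application targeting a pre-built Spark 2.0.x distribution (the exact version numbers are illustrative and should match your distribution):
// build.sbt
scalaVersion := "2.11.8"  // must match the Scala version your Spark distribution was built with
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.0.2" % "provided"
)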
Steps to install Spark (1.6.2-bin-hadoop2.6, pre-built) in local mode on Windows:
- Install Java 7 or later. To test that the Java installation is complete, open a command prompt, type java, and hit Enter. If you receive the message "'java' is not recognized as an internal or external command", you need to configure the JAVA_HOME and PATH environment variables to point to the path of the JDK.
- Set SCALA_HOME in Control Panel\System and Security\System, go to "Advanced system settings", and add %SCALA_HOME%\bin to the PATH variable in the environment variables.
- Install Python 2.6 or later from the Python download link.
- Download SBT. Install it and set SBT_HOME as an environment variable with the value <<SBT PATH>>.
- Download winutils.exe from the HortonWorks repo or a git repo. Since we don't have a local Hadoop installation on Windows, we have to download winutils.exe and place it in a bin directory under a created Hadoop home directory. Set HADOOP_HOME = <<Hadoop home directory>> as an environment variable and add %HADOOP_HOME%\bin to the PATH variable.
- We will be using a pre-built Spark package, so choose a Spark pre-built package for Hadoop from the Spark download page. Download and extract it. Set SPARK_HOME and add %SPARK_HOME%\bin to the PATH variable in the environment variables.
- Run the command: spark-shell (a quick sanity check follows this list).
- Open http://localhost:4040/ in a browser to see the SparkContext web UI.
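Once spark-shell is up, a quick sanity check you can paste into the shell (the numbers are just an example; sc is created for you by the shell):
// sums the integers 1..100 using a small local RDD
val total = sc.parallelize(1 to 100).reduce(_ + _)
println(total)   // expect 5050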
$ cat rdd1.txt
chr1 10016
chr1 10017
chr1 10018
chr1 20026
scala> val lines = sc.textFile("/data/rdd1.txt")
scala> case class Chrom(name: String, value: Long)
defined class Chrom
scala> val chroms = lines.map(_.split("\\s+")).map(r => Chrom(r(0), r(1).toLong))
chroms: org.apache.spark.rdd.RDD[Chrom] = MapPartitionsRDD[5] at map at <console>:28
scala> val df = chroms.toDF
16/10/28 16:17:42 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/10/28 16:17:43 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
df: org.apache.spark.sql.DataFrame = [name: string, value: bigint]
scala> df.show
+----+-----+
|name|value|
+----+-----+
|chr1|10016|
|chr1|10017|
|chr1|10018|
|chr1|20026|
|chr1|20036|
|chr1|30016|
|chr1|30026|
|chr2|40016|
|chr2|40116|
|chr2|50016|
|chr3|70016|
+----+-----+
scala> df.filter('value > 30000).show
+----+-----+
|name|value|
+----+-----+
|chr1|30016|
|chr1|30026|
|chr2|40016|
|chr2|40116|
|chr2|50016|
|chr3|70016|
+----+-----+
scala> case class Chrom2(name: String, min: Long, max: Long)
scala> val chroms2 = rdd2.map(_.split("\\s+")).map(r => Chrom2(r(0), r(1).toLong, r(2).toLong))
chroms2: org.apache.spark.rdd.RDD[Chrom2] = MapPartitionsRDD[35] at map at <console>:28
scala> val df2=chroms2.toDF
df2: org.apache.spark.sql.DataFrame = [name: string, min: bigint ... 1 more field]
scala> df.join(df2, Seq("name")).where($"value".between($"min", $"max")).groupBy($"name").count().show()
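For reference, a self-contained sketch of the same join-and-range-filter pattern (the rdd2 input is not shown in the transcript above, so the sample rows below are purely illustrative):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("chrom-join").master("local[*]").getOrCreate()
import spark.implicits._

case class Chrom(name: String, value: Long)
case class Chrom2(name: String, min: Long, max: Long)

// illustrative stand-ins for rdd1.txt and the unshown rdd2 input
val df  = Seq(Chrom("chr1", 10016L), Chrom("chr1", 30016L), Chrom("chr2", 40016L)).toDF
val df2 = Seq(Chrom2("chr1", 20000L, 40000L), Chrom2("chr2", 30000L, 50000L)).toDF

// join on chromosome name, keep rows whose value lies in [min, max], count per chromosome
df.join(df2, Seq("name"))
  .where($"value".between($"min", $"max"))
  .groupBy($"name")
  .count()
  .show()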
$ ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.2.0
Your CSV file does not have the same number of fields in each row, so it cannot be parsed as-is into a DataFrame.
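Once the shell starts with that package, a minimal sketch of reading a CSV with spark-csv (the path and the header option are illustrative assumptions about your data):
// inside spark-shell started with --packages com.databricks:spark-csv_2.11:1.2.0
val csv = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")   // treat the first line as a header
  .load("/data/example.csv")
csv.printSchema()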
As of Spark 2.0.0, DataFrame - the flagship data abstraction of previous
versions of Spark SQL - is currently a mere type alias for Dataset[Row]:
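A short illustration of the alias (in Spark 2.0.x the org.apache.spark.sql package object defines type DataFrame = Dataset[Row]; the sample rows are illustrative):
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

val spark = SparkSession.builder.appName("alias-demo").master("local[*]").getOrCreate()
import spark.implicits._

// because DataFrame is only a type alias, any DataFrame is already a Dataset[Row]
val df: DataFrame = Seq(("chr1", 10016L), ("chr2", 40016L)).toDF("name", "value")
val ds: Dataset[Row] = df   // compiles with no conversion: same type
ds.show()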
A Dataset is local if it was created from local collections using SparkSession.emptyDataset
or SparkSession.createDataset methods and their derivatives like toDF. If so, the queries on
the Dataset can be optimized and run locally, i.e. without using Spark executors.
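A minimal sketch of a local Dataset created from a local collection with SparkSession.createDataset (the values are illustrative):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("local-ds").master("local[*]").getOrCreate()
import spark.implicits._

// created from a local collection, so this is a "local" Dataset
val ds = spark.createDataset(Seq(1, 2, 3, 4, 5))
val evens = ds.filter(_ % 2 == 0)
evens.show()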