Spark

http://spark.apache.org/

Apache Spark™ is a fast, general-purpose engine for large-scale data processing. Applications can be written in Java, Scala, Python, or R, and Spark provides more than 80 high-level operators for building parallel applications. It can also be used interactively from the Scala, Python, and R shells. Spark powers a stack of libraries, including SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming, and these libraries can be combined in the same application.

Links

Books

Advanced Analytics with Spark, code: https://github.com/sryza/aas

Learning Spark, code: https://github.com/databricks/learning-spark

Data Algorithms, code: https://github.com/mahmoudparsian/data-algorithms-book

Installation (on a Mac)

wget http://wwwftp.ciril.fr/pub/apache/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
tar xvf spark-1.6.0-bin-hadoop2.6.tgz
cd spark-1.6.0-bin-hadoop2.6
more README.md


Note: ./bin/spark-shell and ./bin/pyspark each start a Spark UI web console at http://localhost:4040/jobs/

Interactive programming in Scala

./bin/spark-shell

Browse the Spark UI http://localhost:4040/jobs/

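// Sanity check: count ten million integers distributed as an RDD
sc.parallelize(1 to 10000000).count()

// Monte Carlo estimate of Pi: sample random points in the unit square
val NUM_SAMPLES = 10000
val count = sc.parallelize(1 to NUM_SAMPLES).map{i =>
  val x = Math.random()
  val y = Math.random()
  if (x*x + y*y < 1) 1 else 0  // 1 if the point falls inside the quarter circle
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)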
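The Pi estimate above is a Monte Carlo computation: each sample draws a random point in the unit square, the map step emits 1 when the point lands inside the quarter circle of radius 1, and the reduce step sums those hits, so the fraction of hits approximates Pi/4. As a minimal sketch of that same map-then-reduce shape, here is a plain Python version that runs without Spark. The function name estimate_pi and its seed parameter are illustrative assumptions, not part of the Spark example.

```python
import random
from functools import reduce


def estimate_pi(num_samples, seed=None):
    """Estimate Pi by sampling points in the unit square (no Spark involved).

    num_samples must be a positive integer.
    """
    rng = random.Random(seed)  # seed is a hypothetical knob for reproducible runs

    def inside(_):
        # Map step: 1 if the random point falls inside the quarter circle, else 0
        x, y = rng.random(), rng.random()
        return 1 if x * x + y * y < 1 else 0

    # Reduce step: sum the hits, mirroring .map{...}.reduce(_ + _) in the Scala code
    hits = reduce(lambda a, b: a + b, map(inside, range(num_samples)), 0)
    return 4.0 * hits / num_samples


if __name__ == "__main__":
    print("Pi is roughly", estimate_pi(10000))
```

The error of such an estimate shrinks roughly with the square root of the sample count, so raising the number of samples by a factor of 100 buys about one more correct digit.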

Interactive programming in Python

./bin/pyspark
>>> sc.parallelize(range(1000)).count()

>>> exit()


./bin/run-example SparkPi

./bin/run-example mllib.LinearRegression --numIterations 1000 data/mllib/sample_linear_regression_data.txt

./bin/run-example streaming.MQTTWordCount tcp://test.mosquitto.org:1883 "#"


MASTER=spark://localhost:7077 ./bin/run-example SparkPi
./dev/run-tests

Deploying a Spark cluster on Amazon EC2

See http://spark.apache.org/docs/latest/ec2-scripts.html

./ec2/spark-ec2 --help

Retrieve and set the AWS credentials

export AWS_ACCESS_KEY_ID=XXXX
export AWS_SECRET_ACCESS_KEY=XXXX
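
The XXXX placeholders above must be replaced with real values before any spark-ec2 command can talk to AWS, so a small pre-flight check can catch a forgotten export. The sketch below is only an illustration in Python: the helper name missing_aws_credentials is hypothetical, and it looks only for the two variables named above, treating an empty value or the literal placeholder XXXX as missing.

```python
import os

# The two variable names exported above
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_credentials(env=None):
    """Return the required variable names that are unset, empty, or still 'XXXX'."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if env.get(name, "") in ("", "XXXX")]


if __name__ == "__main__":
    missing = missing_aws_credentials()
    if missing:
        print("Set these before running spark-ec2:", ", ".join(missing))
    else:
        print("AWS credentials found in the environment")
```

Passing a dictionary instead of relying on os.environ makes the check easy to try out without touching the real shell environment.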

Create and start the cluster

./ec2/spark-ec2 --key-pair=awskey --identity-file=awskey.pem --region=eu-west-1 --zone=eu-west-1a launch my-spark-cluster 

Stop

./ec2/spark-ec2 --region=eu-west-1 --zone=eu-west-1a stop my-spark-cluster

Restart

./ec2/spark-ec2 --identity-file=awskey.pem --region=eu-west-1 --zone=eu-west-1a start my-spark-cluster

Terminate and destroy the cluster

./ec2/spark-ec2 --region=eu-west-1 destroy my-spark-cluster

Packages

http://spark-packages.org/


spark-avro: Integration utilities for using Spark with Apache Avro data

kafka-spark-consumer: Receiver-based low-level Kafka-Spark consumer with built-in back-pressure controller

spark-perf: Performance tests for Spark

deep-spark: Connecting Apache Spark with different data stores

spark-mongodb: MongoDB data source for Spark SQL

spark-es: Elasticsearch integration for Apache Spark

elasticsearch-hadoop: Official integration between Apache Spark and Elasticsearch for real-time search and analytics

magellan: Geospatial data analytics on Spark

SparkTwitterAnalysis: An Apache Spark standalone application using the Spark API in Scala, built with the Simple Build Tool (SBT)

spark-druid-olap: Spark Druid package

SpatialSpark: Big spatial data processing using Spark

killrweather: A reference application (in progress) showing how to integrate Apache Spark, Apache Cassandra, and Apache Kafka for fast, streaming computations on time series data in asynchronous, event-driven Akka environments

spark-kafka: Low-level integration of Spark and Kafka

docker-spark: Docker container for a Spark standalone cluster

spark-streamingsql: Manipulate Apache Spark Streaming with SQL

twitter-stream-ml: Machine learning over Twitter's stream, using Apache Spark, a web server, and a Lightning graph server