Spark

http://spark.apache.org/

''Apache Spark™ is a fast and general engine for large-scale data processing. Write applications quickly in Java, Scala, Python, R.  Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python and R shells. Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application. ''

=Links=
 * http://spark-packages.org/

=Books=
 * Advanced Analytics with Spark: Patterns for Learning from Data at Scale, http://shop.oreilly.com/product/0636920035091.do (code: https://github.com/sryza/aas)
 * Learning Spark: Lightning-Fast Big Data Analysis, http://shop.oreilly.com/product/0636920028512.do (code: https://github.com/databricks/learning-spark)
 * Data Algorithms: Recipes for Scaling Up with Hadoop and Spark, http://shop.oreilly.com/product/0636920033950.do (code: https://github.com/mahmoudparsian/data-algorithms-book)

=Installation (from a Mac)=
wget http://apache.crihan.fr/dist/spark/spark-1.5.1/spark-1.5.1.tgz
tar xvf spark-1.5.1.tgz
cd spark-1.5.1
more README.md
build/mvn -DskipTests clean package

Note: ./bin/spark-shell and ./bin/pyspark also start the Spark UI web console at http://localhost:4040/jobs/

=Interactive programming in Scala=
./bin/spark-shell

Browse the Spark UI http://localhost:4040/jobs/



sc.parallelize(1 to 10000000).count

// Monte Carlo estimation of Pi: the fraction of random points in [0,1)x[0,1)
// that fall inside the unit circle approaches Pi/4.
val NUM_SAMPLES = 10000
val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
  val x = Math.random
  val y = Math.random
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)

=Interactive programming in Python=
./bin/pyspark
>>> sc.parallelize(range(1000)).count()

>>> exit()

=Running the bundled examples=
Spark ships with example programs (see examples/src/main); run them with ./bin/run-example:

./bin/run-example SparkPi

./bin/run-example mllib.LinearRegression --numIterations 1000 data/mllib/sample_linear_regression_data.txt
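A hedged sketch of roughly what this MLlib example does, runnable from ./bin/spark-shell (the bundled example itself may differ in its details; the MSE check at the end is added here for illustration):

import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// Load the LIBSVM-formatted sample data shipped with Spark.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_linear_regression_data.txt").cache()

// Train a linear model with SGD; 1000 mirrors the --numIterations flag above.
val model = LinearRegressionWithSGD.train(data, 1000)

// Mean squared error on the training set.
val mse = data.map { p => val e = model.predict(p.features) - p.label; e * e }.mean()
println("Training MSE = " + mse)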

./bin/run-example streaming.MQTTWordCount test.mosquitto.org "#"
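The MQTT example needs the external MQTT connector on the classpath. As a self-contained illustration of the same DStream API, here is a minimal streaming word count over a TCP socket, runnable from ./bin/spark-shell; port 9999 is arbitrary and can be fed with e.g. nc -lk 9999:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Reuse the shell's SparkContext; batches every 5 seconds.
// Note: a receiver needs its own core, so run the shell with at least local[2].
val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()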

MASTER=spark://localhost:7077 ./bin/run-example SparkPi
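This assumes a standalone master is already running at spark://localhost:7077 (e.g. started with ./sbin/start-master.sh and ./sbin/start-slave.sh spark://localhost:7077). In application code, the same master URL is set on the SparkConf; a minimal sketch for a standalone application (not the shell, where sc already exists; the application name is arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://localhost:7077")     // same master URL as above
  .setAppName("PiOnStandalone")            // hypothetical application name
val sc = new SparkContext(conf)
println(sc.parallelize(1 to 1000).sum())   // trivial job to exercise the cluster
sc.stop()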

Run Spark's own test suite (a developer tool, not needed for regular use):

./dev/run-tests

=Packages=
http://spark-packages.org/
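Most of these packages can be pulled into spark-shell or spark-submit with the --packages flag; the Maven coordinate below is illustrative (check spark-packages.org for the current one):

./bin/spark-shell --packages com.databricks:spark-avro_2.10:2.0.1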

 * spark-avro: Integration utilities for using Spark with Apache Avro data (usage sketch after this list)
 * kafka-spark-consumer: Receiver-based low-level Kafka-Spark consumer with a built-in back-pressure controller
 * spark-perf: Performance tests for Spark
 * deep-spark: Connecting Apache Spark with different data stores
 * spark-mongodb: MongoDB data source for Spark SQL
 * spark-es: Elasticsearch integration for Apache Spark
 * elasticsearch-hadoop: Official integration between Apache Spark and Elasticsearch for real-time search and analytics
 * magellan: Geospatial data analytics on Spark
 * SparkTwitterAnalysis: A standalone Apache Spark application using the Spark API in Scala, built with sbt (Simple Build Tool)
 * spark-druid-olap: Spark Druid package
 * SpatialSpark: Big spatial data processing using Spark
 * killrweather: A reference application (in progress) showing how to integrate Apache Spark, Apache Cassandra, and Apache Kafka for fast streaming computations on time-series data in asynchronous Akka event-driven environments
 * spark-kafka: Low-level integration of Spark and Kafka
 * docker-spark: Docker container for a Spark standalone cluster
 * spark-streamingsql: Manipulate Apache Spark Streaming with SQL
 * twitter-stream-ml: Machine learning over Twitter's stream, using Apache Spark, a web server, and the Lightning graph server
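For example, with the spark-avro package on the classpath (see the --packages line above), Avro files can be read and written through the DataFrame API. A minimal sketch for the Scala shell; the file paths are hypothetical:

// Assumes spark-shell was started with the spark-avro package (see above).
val df = sqlContext.read.format("com.databricks.spark.avro").load("data/episodes.avro")   // hypothetical input path
df.printSchema()
df.write.format("com.databricks.spark.avro").save("output/episodes-avro")                 // hypothetical output path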