Spark
Revision as of 21:50, 21 February 2016
Apache Spark™ is a fast and general engine for large-scale data processing. Write applications quickly in Java, Scala, Python, R. Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python and R shells. Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
Links
Books
- Advanced Analytics with Spark : Patterns for Learning from Data at Scale, http://shop.oreilly.com/product/0636920035091.do
code https://github.com/sryza/aas
- Learning Spark : Lightning-Fast Big Data Analysis, http://shop.oreilly.com/product/0636920028512.do
code https://github.com/databricks/learning-spark
- Data Algorithms : Recipes for Scaling Up with Hadoop and Spark, http://shop.oreilly.com/product/0636920033950.do
code https://github.com/mahmoudparsian/data-algorithms-book
Installation (from a Mac)
wget http://wwwftp.ciril.fr/pub/apache/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
tar xvf spark-1.6.0-bin-hadoop2.6.tgz
cd spark-1.6.0-bin-hadoop2.6
more README.md
Note: ./bin/spark-shell and ./bin/pyspark each start a Spark UI web console at http://localhost:4040/jobs/
Interactive programming in Scala
./bin/spark-shell
Browse the Spark UI http://localhost:4040/jobs/
sc.parallelize(1 to 10000000).count()
val NUM_SAMPLES = 10000
val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
  val x = Math.random()
  val y = Math.random()
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)
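The estimate above works because a point drawn uniformly in the unit square falls inside the quarter circle of radius 1 with probability π/4, so 4 × (hits / samples) approaches π. As a rough illustration only, the same map/reduce logic can be sketched in plain Python without Spark. This is not Spark code: the function name estimate_pi and the seed parameter are assumptions introduced here for reproducibility, and everything runs in a single local process.

```python
import random
from functools import reduce


def estimate_pi(num_samples, seed=None):
    """Estimate pi by sampling points in the unit square (local, no Spark)."""
    # seed is a hypothetical parameter added for repeatable runs
    rng = random.Random(seed)

    def hit(_):
        # 1 if the random point (x, y) lies inside the quarter circle, else 0
        x, y = rng.random(), rng.random()
        return 1 if x * x + y * y < 1 else 0

    # map yields one 0/1 per sample; reduce sums them, mirroring the Scala snippet
    count = reduce(lambda a, b: a + b, map(hit, range(1, num_samples + 1)))
    return 4.0 * count / num_samples


if __name__ == "__main__":
    print("Pi is roughly", estimate_pi(10000, seed=42))
```

In the Spark version the map step is distributed across partitions and reduce combines the partial sums; the arithmetic is the same.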
Interactive programming in Python
./bin/pyspark
>>> sc.parallelize(range(1000)).count()
>>> exit()
./bin/run-example SparkPi
./bin/run-example mllib.LinearRegression --numIterations 1000 data/mllib/sample_linear_regression_data.txt
./bin/run-example streaming.MQTTWordCount test.mosquitto.org "#"
MASTER=spark://localhost:7077 ./bin/run-example SparkPi
./dev/run-tests
Deploying a Spark cluster on Amazon EC2
./ec2/spark-ec2 --help
Retrieve the AWS credentials
Packages
spark-avro Integration utilities for using Spark with Apache Avro data
kafka-spark-consumer Receiver Based Low Level Kafka-Spark Consumer with builtin Back-Pressure Controller
spark-perf Performance tests for Spark
deep-spark Connecting Apache Spark with different data stores
spark-mongodb MongoDB data source for Spark SQL
spark-es ElasticSearch integration for Apache Spark
elasticsearch-hadoop Official integration between Apache Spark and Elasticsearch real-time search and analytics
magellan Geo Spatial Data Analytics on Spark
SparkTwitterAnalysis An Apache Spark standalone application using the Spark API in Scala. The application uses Simple Build Tool (SBT) to build the project.
spark-druid-olap Spark Druid Package
SpatialSpark Big Spatial Data Processing using Spark
killrweather KillrWeather is a reference application (in progress) showing how to easily leverage and integrate Apache Spark, Apache Cassandra, and Apache Kafka for fast, streaming computations on time series data in asynchronous Akka event-driven environments.
spark-kafka Low level integration of Spark and Kafka
docker-spark Docker container for spark standalone cluster.
spark-streamingsql Manipulate Apache Spark Streaming by SQL
twitter-stream-ml Machine Learning over Twitter's stream. Using Apache Spark, Web Server and Lightning Graph server.