Avro

From air
Revision as of 15:51, 6 April 2016 by FAURE.ADRIEN (talk | contribs)
Jump to navigation Jump to search

Apache Avro

Avro is a remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services. It is similar to Thrift, but does not require running a code-generation program when a schema changes (unless desired for statically-typed languages).

Languages with APIs

Though theoretically any language could use Avro, the following languages have APIs written for them.

  • Java
  • Scala
  • C Sharp
  • C
  • C++
  • Python
  • Ruby

Features

Avro provides:

  • Rich data structures.
  • A compact, fast, binary data format.
  • A container file, to store persistent data.
  • Remote procedure call (RPC).
  • Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation as an optional optimization, only worth implementing for statically typed languages.

Specification

Schema Declaration

A Schema is represented in JSON by one of:

  • A JSON string, naming a defined type.
  • A JSON object, of the form:

{"type": "typeName" ...attributes...} where typeName is either a primitive or derived type name, as defined below. Attributes not defined in this document are permitted as metadata, but must not affect the format of serialized data.

  • A JSON array, representing a union of embedded types.

Primitive Types

The set of primitive type names is:

null: no value boolean: a binary value int: 32-bit signed integer long: 64-bit signed integer float: single precision (32-bit) IEEE 754 floating-point number double: double precision (64-bit) IEEE 754 floating-point number bytes: sequence of 8-bit unsigned bytes string: unicode character sequence Primitive types have no specified attributes.

Primitive type names are also defined type names. Thus, for example, the schema "string" is equivalent to:

{"type": "string"}

schema example

Schema

{

 "type" : "record",
 "name" : "twitter_schema",
 "namespace" : "com.miguno.avro",
 "fields" : [ {
   "name" : "username",
   "type" : "string",
   "doc"  : "Name of the user account on Twitter.com"
 }, {
   "name" : "tweet",
   "type" : "string",
   "doc"  : "The content of the user's Twitter message"
 }, {
   "name" : "timestamp",
   "type" : "long",
   "doc"  : "Unix epoch time in seconds"
 } ],
 "doc:" : "A basic schema for storing Twitter messages"

}

Data

{"username":"miguno","tweet":"Rock: Nerf paper, scissors is fine.","timestamp": 1366150681 }

{"username":"BlizzardCS","tweet":"Works as intended. Terran is IMBA.","timestamp": 1366154481 }