Avro

Apache Avro

Avro is a remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services. It is similar to Thrift, but does not require running a code-generation program when a schema changes (unless desired for statically-typed languages).

Languages with APIs

Though theoretically any language could use Avro, the following languages have APIs written for them.

  • Java
  • Scala
  • C#
  • C
  • C++
  • Python
  • Ruby

Features

Avro provides:

  • Rich data structures.
  • A compact, fast, binary data format.
  • A container file, to store persistent data.
  • Remote procedure call (RPC).
  • Simple integration with dynamic languages. Code generation is not required to read or write data files, nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.


Resources

General

A great slide presentation to help you choose your serialization system: http://fr.slideshare.net/IgorAnishchenko/pb-vs-thrift-vs-avro

Java

There is a list of good articles on serializing and deserializing data with Java.


Specification

Schema Declaration

A Schema is represented in JSON by one of:

  • A JSON string, naming a defined type.
  • A JSON object, of the form:

{"type": "typeName" ...attributes...} where typeName is either a primitive or derived type name, as defined below. Attributes not defined in this document are permitted as metadata, but must not affect the format of serialized data.

  • A JSON array, representing a union of embedded types.

Primitive Types

The set of primitive type names is:

  • null: no value
  • boolean: a binary value
  • int: 32-bit signed integer
  • long: 64-bit signed integer
  • float: single precision (32-bit) IEEE 754 floating-point number
  • double: double precision (64-bit) IEEE 754 floating-point number
  • bytes: sequence of 8-bit unsigned bytes
  • string: unicode character sequence

Primitive types have no specified attributes.

Primitive type names are also defined type names. Thus, for example, the schema "string" is equivalent to:

{"type": "string"}
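The three JSON forms of a schema declaration, and the equivalence between a bare primitive name and its object form, can be illustrated with a small sketch. It uses only Python's standard json module; the names PRIMITIVE_TYPES, schema_form and expand_primitive are hypothetical helpers invented for this example and are not part of any Avro API.

```python
import json

# Primitive type names listed in this section.
PRIMITIVE_TYPES = {"null", "boolean", "int", "long",
                   "float", "double", "bytes", "string"}


def schema_form(schema_json):
    """Return which of the three JSON forms a schema declaration uses."""
    node = json.loads(schema_json)
    if isinstance(node, str):
        return "name"      # a JSON string naming a defined type
    if isinstance(node, dict):
        return "object"    # a JSON object with a "type" attribute
    if isinstance(node, list):
        return "union"     # a JSON array of embedded types
    raise ValueError("not a schema declaration: %r" % (node,))


def expand_primitive(node):
    """Rewrite a bare primitive name as the equivalent {"type": ...} object."""
    if isinstance(node, str) and node in PRIMITIVE_TYPES:
        return {"type": node}
    return node


print(schema_form('"string"'))
print(schema_form('{"type": "string"}'))
print(schema_form('["string", "null"]'))
print(expand_primitive("string"))
```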

Complex Data Types

Beyond the primitive data types described in the previous section, Avro also supports six complex data types: Records, Enums, Arrays, Maps, Unions, and Fixed. They are described in this section.

Record

A record encapsulates a set of attributes that together describe a single thing. An Avro record supports the following attributes:

  • name: The record's name, which is required. It identifies the thing that the record describes, for example PersonInformation, Automobiles, Hats or BankDeposit. Record names must begin with [A-Za-z_] and subsequently contain only [A-Za-z0-9_].
  • namespace: An optional attribute that qualifies the record's name. Use it when there is a chance that the record's name will collide with another record's name. For example, suppose you have a record that describes an employee, but you have several different types of employees: full-time, part-time, and contractors. You might create all three types of records with the name EmployeeInfo, using the namespaces FullTime, PartTime and Contractor. The fully qualified name of the record that describes full-time employees would then be FullTime.EmployeeInfo. Alternatively, if your store contains information for many different organizations, you might use a namespace that identifies the organization that uses the record, so as to avoid collisions in the record names. In this case, you could end up with fully qualified names such as My.Company.Manufacturing.EmployeeInfo and My.Company.Sales.EmployeeInfo.
  • doc: This optional attribute simply provides documentation about the record. It is parsed and stored with the schema, and is available from the Schema object using the Avro API, but it is not used during serialization.
  • aliases : This optional attribute provides a JSON array of strings that are alternative names for the record. Note that there is no such thing as a rename operation for JSON schema. So if you want to refer to a schema by a name other than what you initially defined in the name attribute, use an alias.
  • type : A required attribute that is either the keyword record, or an embedded JSON schema definition. If this attribute is for the top-level schema definition, record must be used.
  • fields : A required attribute that provides a JSON array which lists all of the fields in the schema. Each field must provide a name and a type attribute. Each field may provide doc, order, aliases and default attributes:
    • The name, type, doc and aliases attributes are used in the exact same way as described earlier in this section. As is the case with record names, field names must begin with [A-Za-z_], and subsequently contain only [A-Za-z0-9_].
    • The order attribute is optional, and it is ignored by Oracle NoSQL Database. For applications (other than Oracle NoSQL Database) that honor it, this attribute describes how this field impacts sort ordering of this record. Valid values are ascending, descending, or ignore. For more information on how this works, see http://avro.apache.org/docs/current/spec.html#order.
    • The default attribute is optional, but highly recommended in order to support schema evolution. It provides a default value for the field that is used only for the purposes of schema evolution. Use of the default attribute does not mean that you can fail to initialize the field when creating a new value object; all fields must be initialized regardless of whether the default attribute is present.

Schema evolution is described in Schema Evolution.

Permitted values for the default attribute depend on the field's type. The default value of a union field must match the first type listed in the union. Default values for bytes and fixed fields are JSON strings.
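The naming rule and the namespace examples above can be checked with a few lines of code. This is a minimal sketch, assuming that a namespace is a dot-separated sequence of names that each follow the same rule; is_valid_name and full_name are hypothetical helpers, not functions of an Avro library.

```python
import re

# Name rule quoted above: first character [A-Za-z_], then only [A-Za-z0-9_].
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_name(name):
    """True if the whole string satisfies the naming rule."""
    return NAME_PATTERN.fullmatch(name) is not None


def full_name(name, namespace=None):
    """Join an optional namespace and a name into a fully qualified name."""
    if not is_valid_name(name):
        raise ValueError("invalid name: %r" % (name,))
    if not namespace:
        return name
    # Assumption: every dot-separated part of the namespace is itself a name.
    for part in namespace.split("."):
        if not is_valid_name(part):
            raise ValueError("invalid namespace: %r" % (namespace,))
    return namespace + "." + name


print(full_name("EmployeeInfo", "FullTime"))
print(full_name("EmployeeInfo", "My.Company.Sales"))
print(is_valid_name("9lives"))
```

Two records can therefore share the name EmployeeInfo as long as their namespaces differ, because the fully qualified names are distinct.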

Enum

An enum is an enumerated type. It supports the following attributes:

  • name : A required attribute that provides the name for the enum. This name must begin with [A-Za-z_], and subsequently contain only [A-Za-z0-9_].
  • namespace : An optional attribute that qualifies the enum's name attribute.
  • aliases : An optional attribute that provides a JSON array of alternative names for the enum.
  • doc : An optional attribute that provides a comment string for the enum.
  • symbols : A required attribute that provides the enum's symbols as an array of names. These symbols must begin with [A-Za-z_], and subsequently contain only [A-Za-z0-9_].

For example:

{

 "type" : "enum",
 "name" : "Colors",
 "namespace" : "palette",
 "doc" : "Colors supported by the palette.",
 "symbols" : ["WHITE", "BLUE", "GREEN", "RED", "BLACK"]

}
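As an illustration, the required attributes and the symbol rule can be checked mechanically. The sketch below is an assumption-laden example rather than a library routine: enum_problems is a hypothetical helper, and the duplicate check reflects the Avro specification's requirement that enum symbols be unique, which this page does not state.

```python
import json
import re

# Same rule for enum names and symbols: [A-Za-z_] then [A-Za-z0-9_].
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def enum_problems(schema):
    """Return a list of problems found in an enum schema (empty if none)."""
    problems = []
    if schema.get("type") != "enum":
        problems.append('"type" must be "enum"')
    name = schema.get("name")
    if not isinstance(name, str) or not NAME_PATTERN.fullmatch(name):
        problems.append('missing or invalid "name"')
    symbols = schema.get("symbols")
    if not isinstance(symbols, list):
        problems.append('"symbols" must be an array')
        return problems
    seen = set()
    for symbol in symbols:
        if not isinstance(symbol, str) or not NAME_PATTERN.fullmatch(symbol):
            problems.append("invalid symbol: %r" % (symbol,))
        elif symbol in seen:
            problems.append("duplicate symbol: %r" % (symbol,))
        else:
            seen.add(symbol)
    return problems


colors = json.loads("""
{
  "type": "enum",
  "name": "Colors",
  "namespace": "palette",
  "doc": "Colors supported by the palette.",
  "symbols": ["WHITE", "BLUE", "GREEN", "RED", "BLACK"]
}
""")
print(enum_problems(colors))
```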

Arrays

An array type defines an array field. It supports only the items attribute, which is required. The items attribute identifies the type of the items in the array:

{"type" : "array", "items" : "string"}

Maps

A map is an associative array, or dictionary, that organizes data as key-value pairs. The key for an Avro map must be a string. Avro maps support only one attribute: values. This attribute is required, and it defines the type for the value portion of the map.

{"type" : "map", "values" : "int"}
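To make the two declarations above concrete, the following sketch checks Python values against an array schema and a map schema. It covers only the string and int element types used in the examples; conforms is a hypothetical helper and not a function of an Avro library.

```python
def is_int32(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return (isinstance(value, int) and not isinstance(value, bool)
            and -2**31 <= value < 2**31)


# Only the element types used in the examples above are covered.
ELEMENT_CHECKS = {
    "string": lambda value: isinstance(value, str),
    "int": is_int32,
}


def conforms(schema, datum):
    """Check a Python value against an array or map schema."""
    kind = schema["type"]
    if kind == "array":
        check = ELEMENT_CHECKS[schema["items"]]
        return isinstance(datum, list) and all(check(v) for v in datum)
    if kind == "map":
        check = ELEMENT_CHECKS[schema["values"]]
        # Map keys must be strings; only the value type is declared.
        return isinstance(datum, dict) and all(
            isinstance(k, str) and check(v) for k, v in datum.items())
    raise ValueError("schema type not covered by this sketch: %r" % (kind,))


array_schema = {"type": "array", "items": "string"}
map_schema = {"type": "map", "values": "int"}
print(conforms(array_schema, ["a", "b"]))
print(conforms(map_schema, {"apples": 3, "pears": 5}))
```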

Unions

A union is used to indicate that a field may have more than one type. Unions are represented as JSON arrays.

For example, suppose you had a field that could be either a string or null. Then the union is represented as:

["string", "null"]

You might use this in the following way:

{

    "type": "record",
    "namespace": "com.example",
    "name": "FullName",
    "fields": [
      { "name": "first", "type": ["string", "null"] },
      { "name": "last", "type": "string", "default" : "Doe" }
    ]

}
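A short sketch may help show how a value relates to the branches of a union, and why the order of the branches matters for default values: per the Avro specification, a default must match the first type in the union. The helpers type_matches, union_branch and default_is_valid are hypothetical and cover only the string and null types used above.

```python
def type_matches(type_name, value):
    """Check a Python value against one of the two types used above."""
    if type_name == "null":
        return value is None
    if type_name == "string":
        return isinstance(value, str)
    raise ValueError("type not covered by this sketch: %r" % (type_name,))


def union_branch(union, value):
    """Return the name of the first union branch that the value matches."""
    for type_name in union:
        if type_matches(type_name, value):
            return type_name
    raise ValueError("value %r matches no branch of %r" % (value, union))


def default_is_valid(union, default):
    """A default for a union field must match the first listed type."""
    return type_matches(union[0], default)


first_name_type = ["string", "null"]
print(union_branch(first_name_type, "Ada"))
print(union_branch(first_name_type, None))
print(default_is_valid(first_name_type, None))
print(default_is_valid(["null", "string"], None))
```

With ["string", "null"], a null default would not be accepted; a field that should default to null is usually declared as ["null", "string"] instead.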

Fixed

A fixed type is used to declare a fixed-sized field that can be used for storing binary data. It has two required attributes: the field's name, and the size in 1-byte quantities.

For example, to define a fixed field that is one megabyte (1,048,576 bytes) in size:

{"type" : "fixed" , "name" : "bdata", "size" : 1048576}
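Because the size attribute counts bytes, a value of a fixed type must have exactly that length; 1048576 is 1024 * 1024 bytes. A minimal sketch of the check, where fits_fixed is a hypothetical helper and not part of an Avro API:

```python
def fits_fixed(schema, datum):
    """True if datum is binary data of exactly the declared size."""
    return (isinstance(datum, (bytes, bytearray))
            and len(datum) == schema["size"])


bdata_schema = {"type": "fixed", "name": "bdata", "size": 1048576}

print(fits_fixed(bdata_schema, bytes(1024 * 1024)))  # exactly the declared size
print(fits_fixed(bdata_schema, bytes(1024)))         # too short
```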

Schema example

Schema

{

 "type" : "record",
 "name" : "twitter_schema",
 "namespace" : "com.miguno.avro",
 "fields" : [ {
   "name" : "username",
   "type" : "string",
   "doc"  : "Name of the user account on Twitter.com"
 }, {
   "name" : "tweet",
   "type" : "string",
   "doc"  : "The content of the user's Twitter message"
 }, {
   "name" : "timestamp",
   "type" : "long",
   "doc"  : "Unix epoch time in seconds"
 } ],
 "doc" : "A basic schema for storing Twitter messages"

}

Data

{"username":"miguno","tweet":"Rock: Nerf paper, scissors is fine.","timestamp": 1366150681 }

{"username":"BlizzardCS","tweet":"Works as intended. Terran is IMBA.","timestamp": 1366154481 }
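To show how the schema and the data relate, the sketch below loads both with Python's standard json module and checks each record field by field. It is only an illustration of the correspondence, not how an Avro library validates or serializes data; record_errors, is_long and TYPE_CHECKS are hypothetical names, and the doc attributes are omitted from the schema for brevity.

```python
import json

# The schema from this section, with the doc attributes omitted for brevity.
SCHEMA = json.loads("""
{
  "type": "record",
  "name": "twitter_schema",
  "namespace": "com.miguno.avro",
  "fields": [
    {"name": "username", "type": "string"},
    {"name": "tweet", "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}
""")

# The two data records from this section, one JSON object per line.
DATA = """
{"username":"miguno","tweet":"Rock: Nerf paper, scissors is fine.","timestamp": 1366150681 }
{"username":"BlizzardCS","tweet":"Works as intended. Terran is IMBA.","timestamp": 1366154481 }
"""


def is_long(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return (isinstance(value, int) and not isinstance(value, bool)
            and -2**63 <= value < 2**63)


# Only the two primitive types used by this schema are covered.
TYPE_CHECKS = {"string": lambda value: isinstance(value, str), "long": is_long}


def record_errors(schema, datum):
    """Return a list of mismatches between a record schema and a dict."""
    errors = []
    for field in schema["fields"]:
        name, type_name = field["name"], field["type"]
        if name not in datum:
            errors.append("missing field: " + name)
        elif not TYPE_CHECKS[type_name](datum[name]):
            errors.append("field %s is not a %s" % (name, type_name))
    return errors


records = [json.loads(line) for line in DATA.splitlines() if line.strip()]
for record in records:
    print(record["username"], record_errors(SCHEMA, record))
```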