Apache Pig

Let’s understand Apache Pig’s data model using the pictures above, going clockwise.

The first image is of the Atom, which is the smallest unit of data available in Apache Pig. An atom carries a single value and can be of any simple data type, such as int, long, float, double, chararray or bytearray. For example, ‘Prathamesh’ or 30 or ‘Medium22’.

Next, we have an ordered set of “fields” of any data type, enclosed in parentheses and separated by commas. Think of it as a single line in a CSV file. This data structure is known as a Tuple. For example, (‘Prathamesh’, 30, ‘Medium22’).
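To make the atom and tuple ideas concrete, here is a minimal sketch in plain Python. It is only an analogy: Python scalars stand in for atoms and a Python tuple stands in for a Pig tuple, and none of the variable names come from Pig itself.

```python
# Atoms: single values, modelled here as plain Python scalars.
name = "Prathamesh"   # plays the role of a chararray
age = 30              # plays the role of an int
tag = "Medium22"      # plays the role of a chararray

# A tuple is an ordered sequence of fields, so fields can be read by position.
record = (name, age, tag)

first_field = record[0]    # position 0, which Pig Latin writes as $0
field_count = len(record)  # number of fields in this tuple
```

Because the fields are ordered, swapping two of them gives a different tuple, just as reordering the columns changes a CSV line.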

Put simply, a Bag is an unordered collection of tuples, and it may contain duplicates. Think of it as one or more records in a CSV file, where the same record can appear more than once. Unlike a CSV, however, a bag has no fixed structure: its schema is flexible, so the first tuple can have 5 fields, the second 30 fields, and so on. For example, {(‘Prathamesh’, 30, ‘Medium22’), (‘Nimkar’, 700)}.
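The three properties of a bag described above (no ordering, duplicates allowed, tuples of different widths) can be sketched in plain Python. This is an analogy rather than Pig's own API: the bag is modelled as a `collections.Counter`, which behaves like a multiset, and the duplicated tuple is added purely for illustration.

```python
from collections import Counter

# A bag modelled as a multiset: Counter maps each tuple to how often it occurs.
bag = Counter([
    ("Prathamesh", 30, "Medium22"),
    ("Nimkar", 700),
    ("Nimkar", 700),   # duplicates are allowed in a bag
])

# The same tuples, listed in a different order.
same_bag_other_order = Counter([
    ("Nimkar", 700),
    ("Prathamesh", 30, "Medium22"),
    ("Nimkar", 700),
])

total_tuples = sum(bag.values())                # counts duplicates too
field_counts = sorted({len(t) for t in bag})    # distinct tuple widths in the bag
order_irrelevant = bag == same_bag_other_order  # ordering does not matter
```

Here `field_counts` contains two different widths, which is the flexible schema at work: nothing forces every tuple in the bag to have the same number of fields.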

A Bag comes in two types, outer and inner. An outer bag is a Relation, i.e. a bag of tuples. Think of it as a table in a relational database, except that there is no fixed schema. An inner bag is a bag that sits as a field inside a tuple of another bag: a nested bag structure, or bag-ception if you will. For example, {(‘Prathamesh’, {(30), (‘Medium22’)})}.
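The nesting of an inner bag inside an outer bag can be sketched in plain Python as follows. This is an illustrative analogy only: lists stand in for bags (their ordering is incidental), the helper `inner_bag_sizes` is hypothetical, and the value ‘Medium23’ is made up for the example.

```python
# Outer bag (relation): a collection of tuples.
# Each tuple has two fields: a name (an atom) and an inner bag of tuples.
relation = [
    ("Prathamesh", [("Medium22",), ("Medium23",)]),
    ("Nimkar", [(700,)]),
]

def inner_bag_sizes(rel):
    """Return {first field: number of tuples in that tuple's inner bag}."""
    return {name: len(inner) for name, inner in rel}

sizes = inner_bag_sizes(relation)
```

Each inner bag is itself a bag of tuples, so it follows the same rules as the outer one: no ordering, duplicates allowed, and no fixed schema.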

Pig also supports key-value pairs through a Map data structure, written as [chararray#element]. The key must be a chararray, while the element can be of any Pig data type. For example, [‘Name’#‘Prathamesh’, ‘Year’#2020], where the brackets delimit the map, commas separate the pairs, and a hash (#) separates each key from its value.
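As a rough sketch of the map notation, the plain Python below models a map as a dict with string keys and renders it in the bracket-and-hash form described above. The function `format_map` is a hypothetical helper written for this illustration, not part of Pig, and it is not meant to reproduce Pig's exact output formatting.

```python
def format_map(mapping):
    """Render a dict in the [key#value, ...] notation described in the text."""
    for key in mapping:
        # Map keys are character strings; reject anything else.
        if not isinstance(key, str):
            raise TypeError(f"map keys must be strings, got {type(key).__name__}")

    def render(value):
        # Quote strings, leave other values (such as numbers) as they are.
        return f"'{value}'" if isinstance(value, str) else str(value)

    pairs = [f"{render(k)}#{render(v)}" for k, v in mapping.items()]
    return "[" + ", ".join(pairs) + "]"

person = {"Name": "Prathamesh", "Year": 2020}
rendered = format_map(person)   # "['Name'#'Prathamesh', 'Year'#2020]"
year = person["Year"]           # look a value up by its key
```

The values in `person` have different types (a string and a number), which mirrors the point that only the key type is constrained.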