4.3 Entities and Data Structures
In the previous chapter ....... the 'Client View' and the 'Space/Node View' architectural layers were introduced.
In terms of data, the client view holds data as Java objects, and relationships are held as references to other Java objects.
The space/node view (which contains the IMDG) also holds the data as Java objects, but relationships are held as foreign key values.
This chapter looks at some of the concepts used in making the IMDG scalable and efficient.
4.3.1 Partitioning
Question: Why partition the data?
Answer : To allow you to scale up - one machine with all the data becomes useless when it is full. Partitioning allows th
In general, the data for an application will not fit into one node.
For example, a given node may hold 1GB of application data, but the complete application needs 8GB.
CloudTran uses the partition feature of GigaSpaces, which spreads the information across a number of nodes.
This information can be addressed as a complete unit, so searches can be done in one call across all nodes.
Information can also be directed to a specific node using Content-Based Routing:
the partition to look in is determined by part of the data (one of the fields/columns).
This is called the routingValue.
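As a rough illustration of content-based routing, the sketch below maps a routing value to a partition index. The class PartitionRouter and the hash-modulo scheme are assumptions made for this example; they are not taken from the CloudTran or GigaSpaces code, and the real mapping may differ.

```java
import java.util.Objects;

/** Hypothetical helper (not a CloudTran or GigaSpaces class): picks a partition from a routing value. */
final class PartitionRouter {
    private final int partitionCount;

    PartitionRouter(int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        this.partitionCount = partitionCount;
    }

    /** Maps a routing value to a zero-based partition index (assumed scheme: hash modulo partition count). */
    int partitionFor(Object routingValue) {
        Objects.requireNonNull(routingValue, "routingValue");
        // floorMod keeps the index non-negative even when hashCode() is negative.
        return Math.floorMod(routingValue.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        Integer customerId = 1001; // the field chosen as the routingValue
        System.out.println(new PartitionRouter(2).partitionFor(customerId));
        System.out.println(new PartitionRouter(8).partitionFor(customerId));
    }
}
```

The same routing value always maps to the same partition for a given partition count, which is what allows a request to be directed to a single node instead of being broadcast to all of them.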
In CloudTran, the number of partitions is determined at configuration time, not development time.
As long as the developer takes partitioning into account during the design and programming phases,
the application is scalable: arbitrarily large datasets can be accommodated simply by reconfiguring.
In depicting a partitioned grid with backups, we use the vertical axis for backups and the horizontal axis for partitions.
Typically, one or two backups are used, so the grid will be two or three units high.
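The sizing arithmetic behind such a grid can be sketched as follows, using the 1GB-per-node and 8GB-total figures from the example above. The class and method names are hypothetical and exist only for this illustration.

```java
/** Hypothetical sizing helper for a partitioned grid with backups. */
final class GridSizing {

    /** Partitions needed (grid width): total data divided by per-node capacity, rounded up. */
    static int partitionsNeeded(long totalDataGb, long perNodeGb) {
        if (totalDataGb < 0 || perNodeGb <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative / positive");
        }
        return (int) ((totalDataGb + perNodeGb - 1) / perNodeGb);
    }

    /** Total space instances: each partition has one primary plus its backups (grid height). */
    static int totalInstances(int partitions, int backupsPerPartition) {
        return partitions * (1 + backupsPerPartition);
    }

    public static void main(String[] args) {
        int partitions = partitionsNeeded(8, 1);            // 8GB of data, 1GB per node
        System.out.println(partitions);                     // grid width
        System.out.println(totalInstances(partitions, 1));  // one backup: grid is two units high
        System.out.println(totalInstances(partitions, 2));  // two backups: grid is three units high
    }
}
```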
4.3.2 Entity Groups
Question: What are entity groups and why are they useful?
Answer: Entity Groups are groups of related entities deployed into the same space instance. They are useful because they provide 'locality of reference'.
A common pattern in scalable systems is to group entities that are tightly coupled into entity groups.
(We started using this term internally in the CloudTran development, and then discovered it is well-known!)
The reason for using entity groups is to put instances of related entities together into a single space, so that references between the entities are local.
This supports the SBA approach, which is to co-locate data and processing, avoiding the overhead of serialization, network hops and tiers of functionality.
Local references are typically 10-100 times faster than inter-node references.
The grouping is application-specific, and up to the developer's judgement.
For example, in a Customer system, one approach ('Customers with Orders') is to group Customer, CustomerAddress, Order and OrderLine into one group.
An equally valid approach ('Orders separate from Customers') would be to split the group: Customer/Address into one entity group; Order/Line into another.
This small piece of business advice has massive benefits over multi-tier architectures, like J2EE or separate compute-grid/data-grid.
CloudTran therefore insists on it being modelled: with the current state of the art, it cannot be deduced.
For scalable systems, entity spaces will need to be partitioned.
To route to the correct partition, there must be a single routing value for all instances of an entity group, to ensure that they all end up in the same partition.
In CloudTran, this is done by using the primary key of the master entity for routing to a partition.
The diagram shows a space partitioned in two.
Customer is the master entity of the entity group; we refer to the non-master entities in the group as subordinate entities.
Customer John Smith and all subordinate entity instances are collected in one partition; John Adams and all subordinate entity instances are in the other.
The client, to save an order, must ensure that the request can be directed to the space where John Smith's live entities are.
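The co-location described above can be sketched in plain Java. In this sketch every entity in the 'Customers with Orders' group returns the primary key of its master Customer as its routing value, so the whole group maps to one partition. The Routable interface, the class layout and the hash-modulo partition function are assumptions made for illustration, not CloudTran-generated code.

```java
/** Illustrative sketch of an entity group sharing one routing value. */
final class EntityGroupRouting {

    /** Hypothetical contract: every entity exposes its entity group's routing value. */
    interface Routable {
        Integer routingValue();
    }

    /** Master entity: routes on its own primary key. */
    static final class Customer implements Routable {
        final int customerId;
        final String name;
        Customer(int customerId, String name) { this.customerId = customerId; this.name = name; }
        @Override public Integer routingValue() { return customerId; }
    }

    /** Subordinate entity: has its own key but routes on the master's key. */
    static final class Order implements Routable {
        final int orderId;
        final int customerId;
        Order(int orderId, int customerId) { this.orderId = orderId; this.customerId = customerId; }
        @Override public Integer routingValue() { return customerId; }
    }

    /** Subordinate of a subordinate: still carries the master's key for routing. */
    static final class OrderLine implements Routable {
        final int orderLineId;
        final int orderId;
        final int customerId;
        OrderLine(int orderLineId, int orderId, int customerId) {
            this.orderLineId = orderLineId; this.orderId = orderId; this.customerId = customerId;
        }
        @Override public Integer routingValue() { return customerId; }
    }

    /** Assumed partition function: hash of the routing value modulo the partition count. */
    static int partitionFor(Routable entity, int partitionCount) {
        return Math.floorMod(entity.routingValue().hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        Customer smith = new Customer(1, "John Smith");
        Customer adams = new Customer(2, "John Adams");
        Order smithOrder = new Order(500, smith.customerId);
        OrderLine smithLine = new OrderLine(9001, smithOrder.orderId, smith.customerId);

        // All of John Smith's entities resolve to the same partition.
        System.out.println(partitionFor(smith, 2));
        System.out.println(partitionFor(smithOrder, 2));
        System.out.println(partitionFor(smithLine, 2));
        // John Adams resolves to the other partition in this two-partition example.
        System.out.println(partitionFor(adams, 2));
    }
}
```

Because the Order and OrderLine carry the master's key, a client saving an order for John Smith has everything it needs to direct the request to the right space.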
The modeller / designer will need to give some thought to how the entity groups should be
NOTE: Entity Groups in CloudTran do not limit transactions
A common theme in other distributed data approaches (Microsoft Azure, Google App Engine) is to use entity groups to limit the scope of a transaction.
This makes life easy for the infrastructure developer but hard for the application developer ...
who is left in the position of constructing distributed transactions from the entity-group level transactions. For example:
Performing Entity Group Transactions [In Microsoft Azure]
An entity group transaction must meet the following requirements:
- All entities subject to operations as part of the transaction must have the same PartitionKey value [i.e. must be in the same partition]
- An entity can appear only once in the transaction, and only one operation may be performed against it.
- The transaction can include at most 100 entities, and its total payload may be no more than 4 MB in size.
SQL Azure Database does not support distributed transactions, which are transactions that affect several resources.
None of these restrictions apply in CloudTran.
Leaving the developer to construct distributed transactions out of limited-functionality transactions exposes
him to all the difficulties of high-performance distributed programming.
At best, this will introduce high-cost, home-grown solutions; at worst, projects will fail.
4.3.3 Master Entities
Question: What is a Master Entity?
Answer: A Master Entity is the head of an Entity Group.
The Master Entity is the head of its Entity Group, and all the subordinate entities are placed in the same space as the Master Entity.
In the example in the previous section, Customer is the Master Entity. The diagram shows two instances of the Entity Group, each with its Master Entity
instance (John Smith and John Adams). There can be only one Master Entity in an Entity Group.
The decision to make a particular Entity a Master Entity is taken by the modeller at the modelling stage. In creating Entity Groups, and hence the
Master Entities, the modeller needs to keep in mind the idea of locality of reference. Ideally, the modeller should arrange the Entity Groups to minimise the number of space hops
when amending and retrieving data records.
For example, if the modeller knows that a Customer's Address records will invariably be retrieved when retrieving the Customer record, it makes sense to make the Address
records part of the Customer Entity Group.
Modelling Entity Groups is described in detail in section XXX. The principle in the model is to use nested Entities, w
There are a few points to remember regarding Master Entities.
- Master Entities cannot be at the 'Many' end of a relationship within the Entity Group. So, using the example in the previous section, you could not make the OrderLine
a Master Entity without first putting it in a separate Entity Group, so...
- Master Entities are always at the 'to-One' end of a relationship within an Entity Group.
- Master Entities can be at the 'Many' end of a relationship provided the other end of the relation is not in the same Entity Group.
- All top-level entities become Master Entities, even if they do not contain any subordinate Entities.
4.3.4 Routing Entities
Entity routing is driven by the need to keep each Entity Group in a single space, so as to maintain locality of reference. The simplest solution is therefore to use the primary
key of the Master Entity (MasterEntityPK) as the routing value of the Entity Group.
GigaSpaces recommends using a Long or Integer for the routing value. However, primary keys in CloudTran applications can be any object, so the actual routing value is an Integer
representation of the MasterEntityPK.
All data objects provide a method to get the routing value for the Entity. This method, shown below, is annotated with @SpaceRouting to indicate that it provides the routing value.
@SpaceRouting
public Integer getCloudTran_RoutingValue()
NOTE: In CloudTran we refer to the routing of Entities as being determined by the MasterEntityPK. Remember that it is actually an Integer representation of the MasterEntityPK.
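A minimal sketch of how such a method might look in a subordinate data class is given below. It assumes that the Integer representation is simply the hashCode() of the MasterEntityPK; the source does not say how CloudTran derives it, so treat that as a placeholder. The OrderData class is hypothetical, and the SpaceRouting annotation declared here is a local stand-in so that the sketch compiles without GigaSpaces on the classpath.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Objects;

/** Local stand-in for the GigaSpaces annotation, declared only to keep this sketch self-contained. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface SpaceRouting {}

/** Hypothetical subordinate data class; not generated CloudTran code. */
final class OrderData {
    private final String orderNumber;     // this entity's own primary key
    private final Object masterEntityPK;  // primary key of the master entity (any object type)

    OrderData(String orderNumber, Object masterEntityPK) {
        this.orderNumber = Objects.requireNonNull(orderNumber, "orderNumber");
        this.masterEntityPK = Objects.requireNonNull(masterEntityPK, "masterEntityPK");
    }

    String getOrderNumber() {
        return orderNumber;
    }

    @SpaceRouting
    public Integer getCloudTran_RoutingValue() {
        // Assumption: hashCode() stands in for the Integer representation of the MasterEntityPK.
        return masterEntityPK.hashCode();
    }

    public static void main(String[] args) {
        OrderData first = new OrderData("ORD-1", "CUST-42");
        OrderData second = new OrderData("ORD-2", "CUST-42");
        // Same master key, so both orders report the same routing value.
        System.out.println(first.getCloudTran_RoutingValue().equals(second.getCloudTran_RoutingValue()));
    }
}
```

Two orders with different primary keys but the same master key yield the same routing value, so they are routed to the same partition as their master.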
4.3.5 Primary Key
Primary Keys give each entity its uniqueness. Every entity must have a primary key and this is achieved by designating one of the attributes of the entity as the primary key.
This designation is done at the modelling stage, where the modeller can choose whether the primary key will be provided by an external source (usually the end user) or whether
CloudTran should generate it automatically.
If the modeller wants to make use of CloudTran's Key Generation Service, they need to set the flag autokey="true"; otherwise they simply choose key="true".
An 'autokey' attribute must be of type Integer, int, Long or long, whereas a 'key' is not restricted to a particular type.
If the modeller does not provide a primary key for an entity then the CloudTran generator will add an additional attribute
(by default called 'OID'). The OID will be set with autokey="true", so the OID value will be automatically generated.
For GigaSpaces, the getter method for the primary key in the data class has the @SpaceId annotation.
@SpaceId(autoGenerate=false)
The autoGenerate=false setting means that the value has been provided (by CloudTran or by the user) and should not be generated by GigaSpaces.
NOTE: The primary key of an entity is not the same as the MasterEntityPK, except in the special case when the Entity is a Master Entity.
4.3.6 Relationships
Relationships are permitted between entities. At present the only limitation is that they must be One-to-Many or One-to-One. Many-to-Many relationships are not allowed.
The other restrictions apply when the relationship is with the Master Entity in the same Entity Group. These restrictions are described in detail
above, under Master Entities.
Relationships are allowed between Entity Groups and so, in a running system, can span spaces and partitions.
In terms of data structures, the Java objects in the spaces have a similar structure to that of the datasource: the relationship is represented
as a foreign key field, which holds the primary key of the other end of the relation. In the client view of the data, the relationship is represented as a reference to the object
representing the other end of the relation.
The client view and the space view of the data are handled by the ORM and are explained in further detail in the next section.
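The difference between the two views can be illustrated with a small hand-written mapping. All class and method names below are hypothetical; in CloudTran this translation is the job of the ORM, and the sketch only shows the shape of the data on each side.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch of the two representations of an Order-to-Customer relationship. */
final class RelationshipViews {

    static final class ClientCustomer {
        final int customerId;
        final String name;
        ClientCustomer(int customerId, String name) { this.customerId = customerId; this.name = name; }
    }

    /** Client view: the relationship is a reference to another Java object. */
    static final class ClientOrder {
        final int orderId;
        final ClientCustomer customer;
        ClientOrder(int orderId, ClientCustomer customer) { this.orderId = orderId; this.customer = customer; }
    }

    /** Space view: the relationship is a foreign key value, as in the datasource. */
    static final class SpaceOrder {
        final int orderId;
        final int customerId; // primary key of the other end of the relation
        SpaceOrder(int orderId, int customerId) { this.orderId = orderId; this.customerId = customerId; }
    }

    /** Replaces the object reference with the referenced object's primary key. */
    static SpaceOrder toSpaceView(ClientOrder order) {
        return new SpaceOrder(order.orderId, order.customer.customerId);
    }

    /** Resolves the foreign key back into an object reference using a lookup of known customers. */
    static ClientOrder toClientView(SpaceOrder order, Map<Integer, ClientCustomer> customersById) {
        ClientCustomer customer = customersById.get(order.customerId);
        if (customer == null) {
            throw new IllegalStateException("No customer with key " + order.customerId);
        }
        return new ClientOrder(order.orderId, customer);
    }

    public static void main(String[] args) {
        ClientCustomer smith = new ClientCustomer(1, "John Smith");
        Map<Integer, ClientCustomer> customersById = new HashMap<>();
        customersById.put(smith.customerId, smith);

        SpaceOrder stored = toSpaceView(new ClientOrder(500, smith));
        System.out.println(stored.customerId);                                // foreign key held in the space
        System.out.println(toClientView(stored, customersById).customer.name); // reference held by the client
    }
}
```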