
Storing data – row storage


As seen in the architectural diagram of the SAP HANA IMCE, there are two relational engines at the heart of the IMCE. These relational engines are in-memory, meaning that their primary data persistence is in RAM. The row store keeps the data in rows and, in this respect, behaves like a traditional database, except that the data always resides in RAM. The row store engine is highly optimized for write operations and is interfaced from the calculation/execution layer. All operations on row tables are processed by this row engine. When a query is fired against the SAP HANA database, the optimizer decides in which engine the query has to be executed. For example, there may be functions that the OLAP engine does not support but the row engine does. In that case, the optimizer sends all the data to the row engine and gets the task done. This can be more expensive, as the column data has to be converted to row data before it is processed by the row engine. One such example is a non-equi join: non-equi joins are executed only by the row engine, as they are not supported by the column engine.
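To make this concrete, the following sketch shows what a non-equi join looks like; the table and column names (sales_orders, discount_bands, and so on) are purely illustrative and not part of any standard schema. Because the join condition uses inequality operators instead of =, such a statement is handed to the row engine:

    -- Illustrative non-equi join: match each order to the discount band
    -- whose value range contains the order value. The inequality predicates
    -- make this a non-equi join, which is processed by the row engine.
    SELECT o.order_id,
           o.order_value,
           d.discount_pct
    FROM   sales_orders o
    INNER  JOIN discount_bands d
           ON  o.order_value >= d.min_value
           AND o.order_value <  d.max_value;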

Now, let us see the internal architecture of the SAP HANA row store engine in the following diagram:

The main functions of the different components are explained as follows:

  • Transactional Version Memory: This memory section holds the temporary versions of data. All the recent versions of changed records are maintained in this section. This data is required by MVCC (Multiversion Concurrency Control). For concurrency control, SAP HANA implements the classic MVCC principle to provide concurrent access to the database. Data reading and writing happen in parallel on the database. When data is being written while some users are reading the same data, there is a fair chance that the data read is inconsistent. To avoid this, techniques such as locking and MVCC are implemented.

    Locking is an effective way of handling the concurrency problem, but it takes a lot of time. MVCC, however, is very effective in handling the latest versions of data. When a query hits the database, the data as it existed at that instant of time is returned. Changes made by other transactions are not reflected in the results until those transactions are committed to the database.

    When there is a new set of data to be updated, MVCC does not update the old data set in place. Instead, it marks the old data as outdated and writes the new set of data elsewhere. In this process, many versions of the data are stored, only one of them being the latest. Hence, a considerable amount of memory is required to maintain these data versions. MVCC in combination with a time-travel mechanism allows temporal queries inside the relational engine. (A short SQL sketch after this list illustrates the read behavior under MVCC.)

  • Segments: Segments contain the actual data in the form of pages. All the data in row store tables is stored in segments, in the form of pages. The memory pages are organized using the concept of a linked list, one of the fundamental data structures: a row store table is a linked list of memory pages, and the pages are grouped into segments. The typical size of each page is 16 KB.

  • Page Manager: Page Manager is responsible for memory allocation. It also keeps track of the used pages and the free pages available.

  • Version Memory Consolidation: As discussed earlier, different versions of the data are stored in the transactional version memory, and MVCC takes care of data consistency. When a transaction is committed, it has to be stored in a database table, a row table in this case; Version Memory Consolidation takes care of this activity. The recent versions of the changed records are moved from the transactional version memory to the persisted segment on a commit-ID basis. After the recent version is moved to the persisted segment, all the temporary data and the different versions created by MVCC have to be cleared from the transactional version memory for effective utilization of memory. This activity is also taken care of by Version Memory Consolidation. Hence, Version Memory Consolidation can be considered the garbage collector for MVCC.

  • Persistence Layer: The Persistence Layer is used for write purposes. It is called for log write operations and checkpoints. All the database logs are maintained by the log replay/undo agent. After the data has been reloaded into the data area of the database, the logs are replayed from the log backups and the log area. The database comes back online only after these actions are completed.
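Before going into the persistence and recovery details, here is the SQL sketch promised in the Transactional Version Memory item. It is purely illustrative, assuming a hypothetical employees table with emp_id and salary columns, and shows the read consistency that MVCC provides across two concurrent sessions:

    -- Session 1 (writer): change a record but do not commit yet.
    UPDATE employees SET salary = 6000 WHERE emp_id = 100;

    -- Session 2 (reader): still sees the previously committed value (say, 5000),
    -- because MVCC serves the version that was current when the read started.
    SELECT salary FROM employees WHERE emp_id = 100;

    -- Session 1: commit. The new version becomes the latest committed version,
    -- and the superseded one becomes a candidate for version memory consolidation.
    COMMIT;

    -- Session 2: a fresh query now returns 6000.
    SELECT salary FROM employees WHERE emp_id = 100;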

The redo log information is located in the log backups and in the log area of the database. After the data area has been restored, the recovery process checks the log positions recorded in the data backup. In order to replay the logs, the log position must be available either in the log backups or in the log area, and the system must be able to find the corresponding offset in the log. If the backup being used for recovery is not the latest one, we must ensure that the offset needed for the backups is available in the log backups or in the log area. Unless the required offset is present, log replay cannot be performed.

During recovery, if the system cannot find the log offset in the log area, we see the error message "log and data must be compatible". In this error situation, we must use the clear log option during recovery to get the system online again. Any logs in the log area are then ignored during the log replay phase. Even if the replay of the log area is not performed, the system ends up in a consistent data state, because the data area holds all the undo log information and it is reloaded into the data area during recovery. The replication server will not have a restart point if the log replay has not taken place; when this situation occurs, it is essential to refer to the replication server documentation for information on how to solve this problem.

If we perform a recovery without implicit log replay, the log area is formatted. The log backups are replayed, but not the logs in the log area. In this situation, the .ini files can be recovered, although their recovery is not mandatory. If the .ini files are recovered, parameter changes made after the backup will not be recovered and are therefore lost.

When we use the clear log option, the following actions will be performed:

  • The data changes made after the backup will be lost; as the log entries get cleared from the system, there is no more information available to perform a redo

  • The transactions that are not yet committed in the backup area will be rolled back (undo)

The clear log option has to be used only as an exception, when the log replay of the log area cannot take place.

The following are examples of situations where the log replay may not be possible:

  • When the log area is corrupted and the log information is no longer available

  • When the log backup that links the latest available log backup to the log area is missing

  • While performing a disaster recovery, if the log available in the log backups and the log in the log area are not compatible
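For reference, a recovery command that uses the clear log option might look like the following sketch. This is only an illustration: the backup name and path are placeholders, and the exact syntax of the RECOVER DATA statement and the recovery procedure should be checked against the SAP HANA administration documentation for your release before use.

    -- Illustrative only: recover from a complete data backup and clear the
    -- log area, so no log replay is performed. The path is a placeholder.
    RECOVER DATA USING FILE ('/backup/data/COMPLETE_DATA_BACKUP') CLEAR LOG;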

Let us complete our learning about all the components of the row store engine:

  • Write Operations: Write operations mainly go to the Transactional Version Memory, where all the versions are maintained by MVCC and finally written to the Persisted Segment. The Insert operation also writes the data to the Persisted Segment.

  • Persisted Segment: Persisted Segment contains data that is used in ongoing active transactions and data that has been committed before any active transaction was started.

  • Index: Each row store table has a primary index. ROW ID is a number that specifies the memory segment and page for each record; it contains the segment address and the offset within that segment, so a record is located using Segment Address + Segment Offset. The primary index maps the primary key of the table to the ROW ID. With the ROW ID, the memory page for a table record can be obtained, and that page can then be searched for the record based on the primary key. As mentioned earlier, the ROW ID is part of the primary index of the table.

Indices are never persisted; they are stored in memory only and are never written to disk. When tables are loaded into memory on system startup, the indices for all the row tables are filled.
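Because these indexes exist only in memory, they can be observed at runtime through the monitoring views. The following query is a hedged sketch that assumes the M_RS_INDEXES monitoring view is available in your revision; it simply lists the in-memory indexes of the row store tables:

    -- List the row store indexes currently held in memory (read-only query).
    SELECT schema_name,
           table_name,
           index_name
    FROM   m_rs_indexes
    ORDER  BY schema_name, table_name;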

We can create secondary indices if required (a short SQL sketch follows the list below). It is better to go with row storage in the following situations:

  • It is recommended when the tables contain a low volume of data

  • It is used when the application request has to access the entire row

  • It is used when the data has to be processed record by record
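As mentioned above the list, a secondary index can be added to a row table when needed. The following sketch is illustrative only: the table name, column names, and index name are made up, and the row store is requested explicitly with CREATE ROW TABLE:

    -- Create a table explicitly in the row store; the primary key provides
    -- the primary index described earlier.
    CREATE ROW TABLE customer_contacts (
        contact_id   INTEGER       PRIMARY KEY,
        customer_id  INTEGER,
        phone_number NVARCHAR(20)
    );

    -- Optional secondary index on a column that is frequently filtered on.
    CREATE INDEX idx_contacts_customer ON customer_contacts (customer_id);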

For more information, refer to the following link:

http://scn.sap.com/community/developer-center/hana/blog/2012/08/16/in-a-relationship-with-hana--part-3