BMC Control-M 7: A Journey from Traditional Batch Scheduling to Workload Automation

By: Qiang Ding

Overview of this book

Control-M is one of the most widely used enterprise-class batch workload automation platforms. With a strong knowledge of Control-M, you will be able to use the tool to meet ever-growing batch needs. There has been no book that can guide you to implement and manage this powerful tool successfully... until now. With this book you will quickly master Control-M and be able to call yourself a "Control-M specialist"! "BMC Control-M 7: A Journey from Traditional Batch Scheduling to Workload Automation" will lead you into the world of Control-M and guide you to implement and maintain a Control-M environment successfully. By mastering this workload automation tool, you will see new opportunities opening up before you. With this book you will be able to take away and put into practice knowledge from every aspect of Control-M: implementation, administration, design and management of Control-M job flows, and, more importantly, how to move into workload automation and let batch processing utilize the cloud. You will start off with batch processing and workload automation, and then get an understanding of how Control-M meets these needs. Then we will look in more depth at the technical details of Control-M, and finally look at how to work with it to meet critical business needs. Throughout the book, you will learn important concepts and features, as well as learn from the author's experience, accumulated over many years. By the end of the book you will be set up to work efficiently with this tool and also understand how to utilize the latest features of Control-M.

Centralized enterprise scheduling


Looking back over the past 20 years, technology has grown beyond imagination, but the need for batch processing has not diminished. According to Gartner's Magic Quadrant for Job Scheduling 2009 report, 70 percent of business processes are performed in batch. In the same report, Gartner forecast 6.5 percent annual growth in the job scheduling market.

Challenges in today's batch processing

In recent years, IT has become more and more sophisticated to meet ever-growing business requirements. The amount of information to be processed in batch is increasing at an alarming rate, while batch windows are shrinking. As a consequence, a system's built-in scheduling functionality or a homegrown scheduling tool can no longer handle the dramatically increasing complexity. Let's first look at how the surrounding technologies have evolved and affected the way batch processing runs today:

  • The mixture of platforms within the IT environment

  • Different machines and applications are inter-related. They often have to work together to accomplish common business goals.

From running batch on a single machine and a single application, we have ended up with an IT environment of hundreds or thousands of machines. Some of these machines act as database servers or data warehouses running Oracle, Sybase, and MS SQL Server. Others may be used purely for running ETL jobs from Informatica or DataStage. There are also dedicated file servers that send and receive files in response to specific events. Then there are backup applications running data archiving tasks across the entire IT environment. Besides these, we still have mainframe computers running legacy applications that need to integrate with applications in the distributed environment.

To a greater or lesser extent, each of these machines has its own batch jobs running to serve a particular need. Not only that, applications that specialize in a particular area may also require batch processing. Some of these applications, such as PeopleSoft Finance and SAP R/3, had to come with a built-in batch scheduling feature to meet their own batch processing requirements.

These platforms and applications can rely on a built-in scheduling feature to handle basic batch processing requirements without a problem. Issues arise when business processes require cross-platform and cross-application batch flows. These islands of automation become silos of information. Without a proper methodology, a job on one platform simply does not know when its parent job on another platform will finish or with what status, and therefore does not know when it should start. There are different approaches to making each step of a cross-platform job flow execute in the correct order. The most common one is the time matching method.

With the time matching approach, we first need to know roughly how long a given job takes to run in order to allocate a reasonable time frame for it to finish before the next job starts. The time allocated for each job has to be longer than its normal execution time in case the processing takes longer than normal.

Let's revisit the imagamingpc.com example:

As the batch processing got broken down into individual tasks, the quality of customer service began to improve and the site became busier and busier. After six months, the average number of orders per day increased to 300! The business owner was happy, but the IT person was a bit worried; he summarized the following issues to the business owner:

  • Currently everything is running from one machine, which is presenting a performance bottleneck and some degree of security concern.

  • Sometimes if there are too many orders generated, the batch jobs cannot complete execution within the designed batch window. In an extreme case, it will finish the last step at 11:00am the next day. During this time, the CPU is constantly hitting 100 percent, thus the system cannot process new order requests coming from the web.

At the moment, the IT person only finds out that the batch flow has failed when he gets into the office in the morning. This was acceptable when the amount of data was small and he could simply re-run the failed step and then run the rest of the flow. But as the number of daily orders increases, re-running some of the stages can take a lot of time. Sometimes it takes the whole morning to re-run the PROCESS_ORDER step, so the technician cannot build any machines until the daily_build_list is finally generated. During this time, the rerun also takes up most of the CPU resources, which again affects the system's processing of real-time customer requests from the web.

After research and consulting with other similar businesses, the IT person came up with the following solution:

  • Move the inventory database onto a new machine (machine B), separate from the web server (machine A), to reduce the web server's resource utilization.

  • Instead of populating an individual build list into flat files on the web server, create a new database on a separate machine (machine C) dedicated to storing the individual build lists. In this case, PROCESS_ORDER can run more quickly and use less disk I/O. Therefore, hopefully it can complete within the designed batch window and not affect the online processing during business hours.

  • To keep the data secure, once all the processing is completed, back up all data onto tape.

The business owner agreed to the approach. During the implementation of the new environment, the IT person ran into a new problem. Now that the batch jobs are divided to run on different machines, there is a synchronization issue: when inter-related jobs are not on the same machine, how do the downstream jobs know that their parent job(s) have finished? The IT person took the time matching approach, that is, he defined a time frame for each step to run. The sequence of job execution is as follows:

  1. 12:00am to 1:00am: FTP the orders generated during the day from Machine A to Machine C.

  2. 1:00am to 1:30am: Machine C populates a build list into the database.

  3. 1:30am to 2:00am: Run "PROCESS_ORDER" on Machine C.

  4. 2:00am to 2:15am: MAIL_CONFIRMED_ORDER gets executed from Machine C.

  5. 2:15am to 2:30am: Machine C runs PRINT_DAILY_BUILD_LIST.

  6. 2:30am to 3:00am: UPDATE_INVENTORY gets triggered by Machine C.

  7. 3:00am to 3:30am: Machine B triggers GENERATE_BACKORDER_LIST.

  8. 3:30am to 4:00am: Machine B runs MAIL_BACKORDER.

  9. 4:00am to 5:00am: Machine D runs RUN_BACKUP to back up data on machines B and C.

In this example, the processing steps are spread across different machines and applications, rather than running off a standalone server (refer to the previous diagram). Each step depends on the previous one finishing before it can start, so that it can build on what the previous step has done. Obviously, there would be a problem if the confirmed order e-mails were sent out before the order data had been fully generated. Sometimes a job may take longer to run due to an increased amount of input data or insufficient processing power. Therefore, extra time needs to be allocated to each job, taking the worst-case scenario into consideration, to avoid overlap.

The time matching approach may allow a cross-platform, cross-application batch flow to run in its designed order, but challenges remain in the following areas:
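
The way such a time-matched schedule is put together can be sketched in a few lines of Python. This is only an illustration: the estimated run times and the padding factor are assumptions invented for the example, and FTP_ORDERS and POPULATE_BUILD_LIST are hypothetical names for the first two steps.

```python
from datetime import datetime, timedelta

# Hypothetical estimates of normal run time in minutes (not from the case study).
ESTIMATED_MINUTES = [
    ("FTP_ORDERS", 40),
    ("POPULATE_BUILD_LIST", 20),
    ("PROCESS_ORDER", 20),
]

def build_time_matched_schedule(jobs, first_start, padding=1.5):
    """Give every job a fixed slot: its estimate multiplied by a safety padding."""
    schedule = []
    start = first_start
    for name, minutes in jobs:
        slot = timedelta(minutes=minutes * padding)
        schedule.append((name, start, start + slot))
        start += slot  # the next job is scheduled for the end of this slot
    return schedule

schedule = build_time_matched_schedule(ESTIMATED_MINUTES, datetime(2024, 1, 1, 0, 0))
for name, start, end in schedule:
    print(f"{start:%H:%M} to {end:%H:%M}  {name}")
```

Note that nothing in this schedule looks at whether the previous job has actually finished; the clock alone decides when each step starts.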

  • Processing time

  • Batch window length

  • Batch monitoring and management

  • Cross-time zone scheduling

  • Resource utilization

  • Maintenance and troubleshooting

  • Reporting

  • Reacting to changes

Processing time

With the time matching approach, the entire batch flow will take longer to run due to the time gaps between job executions. Child job(s) will not be triggered until the scheduled time comes, even if the parent job(s) finished early or at the average finishing time. In extreme cases, a parent job may run over its allocated time, which means the child job(s) will be triggered at the predefined time while the parent is still running. This can cause a serious failure and may require rolling back data and resetting the overlapping job(s) to their initial state. As a consequence, the total duration of the batch flow increases, with the risk of running longer than the pre-agreed batch window. This is extremely unfavorable under the current trend, where processing time is increasing and the batch window is shrinking.
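
To make the two failure modes concrete, the following sketch compares allocated slots with actual run times. All figures are made up for illustration; only the job names come from the example above.

```python
# Each tuple: (job name, allocated slot in minutes, actual run time in minutes).
# All numbers are invented for illustration.
RUNS = [
    ("PROCESS_ORDER", 30, 18),
    ("PRINT_DAILY_BUILD_LIST", 15, 5),
    ("UPDATE_INVENTORY", 30, 42),
]

def analyse_time_matching(runs):
    """Return total idle minutes and the jobs that overran their slot."""
    idle_minutes = 0
    overruns = []
    for name, slot, actual in runs:
        if actual <= slot:
            idle_minutes += slot - actual  # the system waits for the clock
        else:
            overruns.append((name, actual - slot))  # the child starts too early
    return idle_minutes, overruns

idle, overruns = analyse_time_matching(RUNS)
print(f"idle minutes: {idle}")
print(f"overruns: {overruns}")
```

Jobs that finish early contribute idle time, while a job that overruns is still running when its child is triggered.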

Batch window length

In a traditional scenario, the batch window is allocated at night, when online activity is low. The system has plenty of time to run the batch jobs and recover from errors before online activity picks up again the next morning. As the Internet became popular, organizations were able to expand their businesses by offering products and services globally. This requires the computer system to be available almost 24 hours a day to process online requests from different time zones, and therefore leaves very little room for batch processing.

Batch monitoring and management

When jobs are running on different platforms, they can be monitored on a per-machine basis only. The user can see which jobs completed and which failed, but cannot see everything as a complete business process flow. Many business processes today require thousands of jobs to complete, and these jobs may be spread across hundreds of machines. It is not practical for operators to track each step of the batch flow by logging on to every machine involved in the processing, not only because this approach is labor intensive, but also because different skill sets are required of the people in charge of the batch jobs in each environment. It is also difficult to find out what the consequences would be if one job needs to be started late or if some jobs need to be disabled for a given day.

Cross-time zone scheduling

With globalization, it is common to see a business with operations set up in many different countries: sales offices in North America and Europe, manufacturing sites in Asia, and customer support centers in South America. These locations do not operate on their own, and more than likely they need to share large amounts of data. Consequently, there will be business processes that require batch processing in different regions to be executed one after another. Because of the different operating hours and geography, cross-time zone scheduling is extremely hard to achieve with the time-matching approach. It also increases the challenge of batch monitoring and troubleshooting.

Resource utilization

In the time-matching approach, if some batch jobs often exceed their allocated time frame, the application owners either have to resolve the long-running job problem or delay the next job's start time to allow more execution time for the problematic job. This ensures that the long-running job completes without overlapping the next one, but it also unnecessarily increases the overall execution time of the batch flow whenever the problematic job does not overrun. During this time gap the system sits idle, which makes batch processing even more difficult when the entire batch window is already small.

Maintenance and troubleshooting

In a multi-platform environment, each system or application is likely to be managed by an individual team that specializes in its own area. As the batch jobs reside on different machines across departments, it can take hours to track down the failure point. For example, at 2:00am, a reporting job gets triggered and fails immediately. The person in charge of reporting quickly checks the cause of the problem, and discovers that the parent job also failed, at 1:50am. The parent job was a database script that inserts data from CSV files, which were meant to arrive at 1:30am. So the DBA checks with the person in charge of creating the files. It goes on and on, and it may even turn out that the job failure was caused by someone on the other side of the world. By the time they find the original problem, it is already too late for the rerun to complete within the SLA.

Consider the maintenance side as well: suppose all these failures were caused by renaming the CSV files. Without seeing the whole picture, the person who made the modification did not know that there was a downstream reporting job, or that many other parties outside his department might rely on these files for further processing.
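
The missing "whole picture" is essentially a dependency graph. As a minimal sketch, assuming the job relationships were recorded in one place, the downstream impact of a change could be found with a simple breadth-first walk. The job names and the dependency map below are hypothetical.

```python
from collections import deque

# Hypothetical dependency map: job -> jobs that consume its output.
CHILDREN = {
    "CREATE_CSV": ["LOAD_CSV_TO_DB"],
    "LOAD_CSV_TO_DB": ["DAILY_REPORT", "ARCHIVE_CSV"],
    "DAILY_REPORT": [],
    "ARCHIVE_CSV": [],
}

def downstream_jobs(children, changed_job):
    """Breadth-first walk returning every job affected by a change."""
    affected = []
    queue = deque(children.get(changed_job, []))
    seen = set(queue)
    while queue:
        job = queue.popleft()
        affected.append(job)
        for child in children.get(job, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return affected

impacted = downstream_jobs(CHILDREN, "CREATE_CSV")
print(impacted)
```

When jobs are defined separately on each machine, no single system holds a map like CHILDREN, which is why the person renaming the files could not see who depended on them.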

Reporting

Batch run reports provide important information for analyzing the behavior of a batch flow. Job execution information collected on a single machine may not represent a cross-platform business process, because that machine runs only a portion of the entire process. To report on the job execution status of a cross-platform batch flow, we need to collect data from each machine involved and filter out any job information that is not related to the batch flow definition. This process can be complicated and time-consuming, and may need to be modified each time the batch flow changes.
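
The collect-and-filter step described here might look like the following sketch. The log layout, machine names, and the extra housekeeping jobs are assumptions made for the example.

```python
# Hypothetical per-machine job logs: (job name, status, duration in minutes).
MACHINE_LOGS = {
    "machine_b": [("UPDATE_INVENTORY", "OK", 25), ("DB_HOUSEKEEPING", "OK", 10)],
    "machine_c": [("PROCESS_ORDER", "OK", 28), ("TEMP_CLEANUP", "FAILED", 1)],
}

# Jobs that make up the business process being reported on.
FLOW_JOBS = {"PROCESS_ORDER", "UPDATE_INVENTORY"}

def flow_report(machine_logs, flow_jobs):
    """Merge the logs of all machines and keep only jobs in the flow definition."""
    rows = []
    for machine, entries in machine_logs.items():
        for job, status, minutes in entries:
            if job in flow_jobs:
                rows.append(
                    {"job": job, "machine": machine, "status": status, "minutes": minutes}
                )
    return sorted(rows, key=lambda row: row["job"])

report = flow_report(MACHINE_LOGS, FLOW_JOBS)
for row in report:
    print(row)
```

Every change to the batch flow means FLOW_JOBS has to be updated by hand, which is the maintenance burden the paragraph above refers to.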

Reacting to changes

The business environment does not stay the same all the time. Changes made to the business can dramatically affect how IT works. Think about situations such as company mergers. Without an overall view of the entire batch environment from a business process point of view, and with a lack of standardization and documentation, IT will become an obstacle to the business transition. Even with business events as small as a marketing campaign, batch jobs may take longer than normal to run in order to process the extra data. For example, when a national retail store stays open 24 hours a day during the Christmas period, its systems need more resources to handle the online transactions. With batch jobs residing across many platforms, a lot of manual modifications will be needed to cater for the temporary change.

Computer hardware costs are falling, but adding more machines and technical staff may not be enough to face the challenges discussed so far, and can even complicate the situation further. If the IT components do not work together well, the business will face serious problems. Just think about losses from exchange rate movements when an order is not processed on time, penalties for batch processing that misses its service-level agreement, and security risks. IT risks can cost the business a huge amount of profit and can even affect the company's share price and public image.

The solution

Computer networking technology allows machines to communicate with each other freely. Based on this technology, batch scheduling tools have been able to expand their abilities to schedule jobs on multiple platforms and provide users with a single point of control, instead of running a standalone batch scheduling tool on each individual machine and using the time-matching method to schedule cross-platform batch flows.

During runtime, the centralized scheduling platform examines each job's scheduling criteria to decide which job should run next. Each time, it sends a job execution request to the remote host that was predefined in the selected job's definition. A mechanism needs to be established on the remote host to communicate with the centralized scheduling platform, as well as to handle the job submission request by interacting with its own operating system. Once the job is submitted, the centralized scheduling platform waits for the response from the remote host and, in the meantime, submits other jobs that meet their scheduling criteria. Upon the completion of each job, the centralized scheduling tool gets an acknowledgment from the remote host and decides what to do next based on the execution outcome of the completed job (for example, rerun the current job or progress to the next job).

Let's revisit some of the challenges mentioned earlier that are related to cross-platform batch processing and analyze how to overcome them by using the centralized scheduling approach:
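
The control loop described above can be sketched as a small simulation. This is a simplified, sequential illustration under stated assumptions, not how any particular product is implemented: the job definitions are hypothetical, FTP_ORDERS is an invented name, and `fake_agent` stands in for the mechanism on the remote host. A real platform would submit ready jobs concurrently rather than one at a time.

```python
# Hypothetical job definitions: name -> (remote host, parent jobs).
JOBS = {
    "FTP_ORDERS": ("machine_a", []),
    "PROCESS_ORDER": ("machine_c", ["FTP_ORDERS"]),
    "UPDATE_INVENTORY": ("machine_b", ["PROCESS_ORDER"]),
    "RUN_BACKUP": ("machine_d", ["UPDATE_INVENTORY"]),
}

def run_flow(jobs, submit, max_attempts=2):
    """Dispatch each job once its parents have completed; rerun on failure."""
    completed, log = set(), []
    pending = dict(jobs)
    while pending:
        ready = [
            name for name, (_, parents) in pending.items()
            if set(parents) <= completed
        ]
        if not ready:
            break  # the remaining jobs are blocked by a failed parent
        for name in ready:
            host, _ = pending.pop(name)
            for attempt in range(1, max_attempts + 1):
                ok = submit(host, name)  # wait for the host's acknowledgment
                log.append((name, host, attempt, ok))
                if ok:
                    completed.add(name)
                    break
    return completed, log

# Stand-in for the remote host: PROCESS_ORDER fails once, then succeeds.
failures_left = {"PROCESS_ORDER": 1}

def fake_agent(host, job):
    if failures_left.get(job, 0) > 0:
        failures_left[job] -= 1
        return False
    return True

completed, log = run_flow(JOBS, fake_agent)
for entry in log:
    print(entry)
```

The important point is that a job becomes eligible because its parents have reported completion, not because a clock reached a predefined time.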

  • Processing time and resource utilization

  • Batch monitoring and management

  • Cross-time zone scheduling

  • Maintenance and troubleshooting

  • Reporting

  • Reacting to changes

Processing time and resource utilization

The centralized scheduling approach effectively minimizes the time gap between job executions, thus potentially shortening the batch flow's total execution time and reducing system idle time. Cross-platform jobs are built into a logical job flow according to predefined dependencies. The centralized scheduling platform controls the execution of each job by reviewing the status of the job's parent job(s) and the job's own scheduling criteria. In this case, a job can be triggered immediately when its parent job(s) have completed. If the parent job(s) take longer to run, the child job will start later. If the parent job(s) complete early, the child job will also start earlier. This effectively avoids job overlap, that is, the child job will not be triggered unless the centralized scheduling platform has received a completion acknowledgement from the parent job(s).
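
As a rough illustration of dependency-driven triggering, the sketch below computes when each job finishes if it starts the moment its last parent completes. The run times and the dependencies between jobs are assumptions chosen for the example.

```python
# (job, parents, actual run time in minutes); figures invented for illustration.
# Jobs are listed so that parents always appear before their children.
FLOW = [
    ("PROCESS_ORDER", [], 18),
    ("PRINT_DAILY_BUILD_LIST", ["PROCESS_ORDER"], 5),
    ("UPDATE_INVENTORY", ["PROCESS_ORDER"], 42),
    ("RUN_BACKUP", ["PRINT_DAILY_BUILD_LIST", "UPDATE_INVENTORY"], 30),
]

def dependency_driven_finish_times(flow):
    """Each job starts when its last parent finishes (minute 0 if no parents)."""
    finish = {}
    for name, parents, minutes in flow:
        start = max((finish[parent] for parent in parents), default=0)
        finish[name] = start + minutes
    return finish

finish = dependency_driven_finish_times(FLOW)
for name, minute in finish.items():
    print(f"{name} finishes at minute {minute}")
```

With these numbers, RUN_BACKUP starts at minute 60, as soon as the slower of its two parents is done: no idle gap after a fast parent and no overlap with a slow one.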

Batch monitoring and management

We often hear more senior IT people talk about their "good old days": batch operators had a list of all jobs expected to be executed on the day, together with a list of the machines where these jobs were located. They had to manually log on to each machine to check the previous job's completion state, then log on to another machine to execute the next job. Centralized scheduling allows users to monitor and manage cross-platform batch jobs from a single point of control. Users no longer need to work out which job should run next by looking at a spreadsheet, because jobs that belong to a single business process are grouped into a visualized batch flow. Users can see exactly where the execution is up to in the batch flow. Centralized scheduling also provides a uniform job management interface. The people in charge of managing the batch jobs no longer need in-depth knowledge of a job's running environment to perform simple tasks such as rerunning a job, delaying a job's execution, or deploying a new job.

Cross-time zone scheduling

The time-matching approach requires each job's schedule to be matched to the others'. For example, suppose a job is located in Sydney, Australia (GMT +10), and its child jobs are located in Hong Kong (GMT +8), Bangkok, Thailand (GMT +7), and Los Angeles, USA (GMT -8). If the parent job is set to run from 2pm to 3:30pm Sydney local time, the child jobs need to start at 1:30pm Hong Kong local time, 12:30pm Bangkok local time, and 9:30pm on the previous day Los Angeles local time. The schedule of each child job needs to be changed every time the parent job's schedule changes or when daylight saving time begins or ends. It is much easier to manage cross-time zone batch flows when job scheduling does not rely on the time-matching approach. Jobs without additional time requirements are defined to run immediately once their parent jobs are completed, regardless of which machine they are on and which time zone that machine is in.
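
The arithmetic that time matching forces on every schedule change can be sketched as follows. The example assumes fixed standard-time offsets of GMT +10 for Sydney, +8 for Hong Kong, +7 for Bangkok, and -8 for Los Angeles, and ignores daylight saving; the date is arbitrary.

```python
from datetime import datetime, timedelta, timezone

# Fixed standard-time offsets in hours, assumed for illustration only.
OFFSETS = {"Sydney": 10, "Hong Kong": 8, "Bangkok": 7, "Los Angeles": -8}

def local_times(parent_end_local, parent_city, offsets):
    """Convert the parent's local finish time into every city's local time."""
    parent_tz = timezone(timedelta(hours=offsets[parent_city]))
    instant = parent_end_local.replace(tzinfo=parent_tz)
    return {
        city: instant.astimezone(timezone(timedelta(hours=hours)))
        for city, hours in offsets.items()
    }

# The parent job's slot ends at 3:30pm Sydney local time.
times = local_times(datetime(2024, 6, 3, 15, 30), "Sydney", OFFSETS)
for city, moment in times.items():
    print(f"{city}: {moment:%Y-%m-%d %H:%M}")
```

Each of these derived start times has to be recalculated by hand whenever the parent's slot moves or an offset changes, which is exactly the work that dependency-based triggering removes.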

Maintenance and troubleshooting

When an exception occurs, a centralized batch scheduling platform allows users to clearly see where the problematic jobs are located in the business process. It is therefore easier for them to estimate the impact on the downstream jobs. Operators who manage the batch processing can easily take action against the problematic jobs from the central management console, without needing to log on to the job's machine as the job owner to perform tasks such as rerunning or killing the job. If a failure needs to be handled by the application owner, the operators can easily identify the job's owner and the escalation instructions by looking up the job's run book, which can also be recorded within the scheduling platform. From a maintenance point of view, before modifying a job's scheduling criteria, such as its execution time, the user can clearly see the job's child jobs from the central management console and assess the impact of such a change.

Reporting

As jobs are managed and scheduled from a central location, it is easier for the centralized scheduling platform to capture each job's scheduling details, such as its start time, end time, duration, and execution outcome. The user can extract this information into a report to analyze the batch execution from the business process point of view. This removes the need to collect data from each machine involved and filter it against the batch flow definition.

Reacting to changes

A centralized batch scheduling approach provides IT with the ability to react to business changes. In a centralized batch scheduling environment, batch jobs are managed from a single location and are more likely to follow the same procedures for deployment, monitoring, troubleshooting, maintenance, retirement, and documentation. During a company merger, it is much easier to consolidate two batch platforms into one than to deal with each machine on an application-by-application basis.

The centralized scheduling approach speeds up batch processing and improves computing resource utilization by overcoming the cross-platform and multi-time zone challenges in today's batch processing. From the user's point of view, batch jobs are managed according to business processes rather than by focusing on job execution within each individual machine or application. As a result, the cost and risk to the business are reduced. IT itself becomes more flexible and able to react to shifting business requirements, helping to improve the agility of the business.

A centralized enterprise scheduling platform typically has the following characteristics (a given commercial product may lack some of them, or offer others beyond this list):

  • It can schedule jobs on different system platforms and provide a centralized GUI monitoring and management console. Such platforms include (but are not limited to) mainframe, AS/400, Tandem, Unix, Linux, and Microsoft Windows.

  • It can schedule jobs based on their scheduling criteria, such as date and time.

  • It has the ability to execute job flows according to predefined inter-job logic regardless of the operating system of each job.

  • It is able to automatically carry out the next action according to the current job's execution status. The execution status can be the job's operating system return code or a particular part of its output (for example, an error message).

  • It has the ability to decide which job to schedule by reference to job priority and current resource utilization (that is, it limits the number of concurrently running jobs and allows jobs with higher priorities to be triggered first).

  • It can handle event-based real-time scheduling.

  • Some degree of integration with applications (for example, ERP, Finance applications).

  • Automated notification when a job fails or a pre-defined event occurs.

  • Security and auditing features.

  • Integration with ITSM (IT service management).
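
As one example of these characteristics, the priority and resource-utilization rule from the list above could be sketched like this. The priority values and the concurrency limit are hypothetical, and real products may weigh resources in far more elaborate ways.

```python
import heapq

def select_jobs(ready_jobs, running_count, max_concurrent):
    """Pick the highest-priority ready jobs that fit the free execution slots."""
    free_slots = max(0, max_concurrent - running_count)
    # heapq.nlargest returns items ordered from highest to lowest priority.
    chosen = heapq.nlargest(free_slots, ready_jobs, key=lambda job: job[1])
    return [name for name, _ in chosen]

# (job name, priority); higher numbers run first. Values are hypothetical.
READY = [
    ("MAIL_BACKORDER", 2),
    ("RUN_BACKUP", 1),
    ("PROCESS_ORDER", 9),
    ("UPDATE_INVENTORY", 5),
]

picked = select_jobs(READY, running_count=1, max_concurrent=3)
print(picked)
```

With one job already running and a limit of three, only the two highest-priority jobs are submitted; the rest wait until a slot is free.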