HPC Shelf
A Component-Oriented Platform for Cloud-Based HPC Services
HPC Shelf is a proposed platform for building and executing component-oriented parallel computing systems across multiple computing infrastructures, such as IaaS clouds, supercomputing/HPC centers, and on-premise clusters. Its current prototype supports the AWS EC2 and Google Cloud Platform IaaS computing infrastructures, as well as the local computer. HPC Shelf may be viewed as an HPC-as-a-Service (HPCaaS) cloud that aims to make it easy to exploit the performance of one or more clusters (multicluster), deployed in possibly distinct computing infrastructures (multicloud), to execute solutions to computationally challenging problems.
An application developer interested in using HPC resources may do so through the API offered by the Shelf Application Framework (SAFe). SAFe offers operations for building parallel computing systems, which are composed of components of different kinds taken from the component catalog offered by a Core service.
Core services are currently implemented as web services. The component catalog of a Core service may offer component frameworks comprising various components that serve different purposes (e.g., linear algebra, deep learning, and MapReduce computations) in order to meet the requirements of different applications. In such a context, one may become a component provider, offering useful components and component frameworks that address the needs of application developers.
The components of parallel computing systems built on top of HPC Shelf comply with the Hash component model, in which components are able to address parallelism concerns in a general sense, including the implementation of parallel algorithms and of communication/synchronization patterns among parallel processes.
Using SAFe, we have developed an HPC Shelf application called Swirls: a Jupyter kernel that makes it possible to build and execute parallel computing systems interactively by means of Jupyter notebooks. A command-line interface (CLI) is also supported. In particular, Swirls makes it possible to execute MPI programs written in C, C++, and Python (including Horovod support) over clusters deployed on the AWS EC2 and Google Cloud Platform computing infrastructures.
Swirls is intended to ease the first contact of potential users with the capabilities of HPC Shelf.
If you are interested in more details about HPC Shelf (and Swirls), or in becoming a contributor, please contact us at hpcshelf@dc.ufc.br.
Origins & Motivations
HPC Shelf is the result of academic projects by researchers from UFC, a federal university in Ceará, Brazil, led by Prof. Francisco Heron de Carvalho Junior. These projects, partially supported by Brazilian public research agencies (CNPq, CAPES, and FUNCAP), date back to the 2000s and aim at investigating the convergence of modern software architecture methods with the design and implementation of parallel programs in the context of HPC. At that time, component-based software engineering (CBSE) emerged as an alternative for dealing with the increasing scale and complexity of applications in the computational sciences, as well as the increasing interoperability requirements between legacy codes developed by different research/engineering teams. In this context, CCA (Common Component Architecture) and Fractal/GCM (Grid Component Model), with a number of compliant computational frameworks and platforms, were proposed by different research consortiums. The Hash component model, one of the main foundations of HPC Shelf, was proposed at that time.
In recent years, the following doctoral theses have been concluded with contributions to HPC Shelf, under the Graduate Program in Computer Science (MDCC) at UFC:
(the titles are translated from Portuguese; the DOI link of a related paper/article is given with each entry)
Jefferson de Carvalho Silva, "A Framework for the Construction of Component-Based Applications on a Cloud Computing Platform for HPC Services" (2016) - https://doi.org/10.1016/j.scico.2018.04.004;
Allberson Bruno de Oliveira Dantas, "Certification of Components in a Cloud Computing Platform for HPC Services" (2017) - https://doi.org/10.1016/j.scico.2019.102379;
Cenez Araújo Rezende, "A Component-Based Framework for Large-Scale Parallel Computing over Graphs" (2018) - https://doi.org/10.1109/WSCAD.2018.00026;
João Marcelo Uchôa de Alencar, "Elastic Reconfiguration of Parallel Components on a Cloud of HPC Services" (2018) - https://doi.org/10.5753/wscad.2019.8667;
Wagner Guimarães Al-Alam, "The Abstraction of Contextual Contracts for Resource Allocation in Component-Oriented Parallel Computing Systems over Clouds" (2019) - https://doi.org/10.1002/cpe.6225.
Component Kinds
HPC Shelf complies with the Hash component model. Thus, the building blocks of parallel computing systems deployed by HPC Shelf are parallel components of the following kinds:
Virtual Platforms, representing distributed-memory parallel computing platforms, such as clusters and MPPs (massively parallel processors), deployed on computing infrastructures such as IaaS providers, supercomputing/HPC centers, on-premise/local clusters, and so on.
Computations, representing implementations of parallel algorithms that attempt to exploit the architectural features of a class of virtual platforms in order to achieve peak performance. In a parallel computing system, each computation instance is always associated with a virtual platform.
Data Repositories, offering access to large repositories of data of interest to applications, which must be accessed by computations through service bindings to connectors.
Connectors, making it possible to use computations and repositories belonging to different virtual platforms in the same parallel computing system. To that end, each connector comprises a set of facets, each one placed on a virtual platform hosting a computation or data repository connected by the connector. Connectors may play two roles: orchestrating computations through action bindings and supporting choreographies among computations and repositories through service bindings.
Service Bindings, binding user ports to provider ports between computations/repositories and connectors. The component that owns the user port (the user component) may access, through the service binding, the services provided by the component that owns the provider port (the provider component). The two service interfaces may differ; in that case, the service binding may act as an adapter between them. Service bindings are direct bindings, i.e., they are placed inside a single virtual platform, where the bound components are placed.
Action Bindings, making it possible for connectors to orchestrate other connectors and computations. To that end, they bind action ports offered by the components involved in the orchestration, which must share a common interface comprised of a set of action names. When a component invokes an action name on one of its action ports, it is suspended until all the components involved have invoked the same action name on an action port bound through the action binding (a sketch of this rendezvous semantics follows this list).
Qualifiers and Quantifiers, specifying non-functional concerns that constitute assumptions in the implementation of components. They are chiefly responsible for the expressiveness of the contextual contracts that guide the implementation and selection of component implementations.
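The suspension semantics of action bindings resembles a barrier rendezvous among the bound action ports. As a minimal sketch (in Python, for illustration only; this is not the HPC Shelf API, whose prototype is written in C#), an action binding can be modeled as one barrier per action name: a component that invokes an action name suspends until every bound participant has invoked the same name.

    import threading

    class ActionBinding:
        """Models an action binding: one barrier per action name,
        shared by all action ports bound through the binding."""
        def __init__(self, action_names, participants):
            self.barriers = {name: threading.Barrier(participants)
                             for name in action_names}

        def invoke(self, action_name):
            # The caller suspends until every participant has
            # invoked the same action name.
            self.barriers[action_name].wait()

    # Two computations and one connector synchronize on two action names.
    binding = ActionBinding(["go", "finish"], participants=3)

    def component(label):
        binding.invoke("go")       # suspends until all three invoke "go"
        print(label, "proceeds after the 'go' rendezvous")
        binding.invoke("finish")

    threads = [threading.Thread(target=component, args=(name,))
               for name in ("computation_a", "computation_b", "connector")]
    for t in threads: t.start()
    for t in threads: t.join()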
The components of the above kinds are the solution components of a parallel computing system. Besides them, every parallel computing system has two intrinsic components, i.e., components present in any parallel computing system: the workflow component and the application component.
The Workflow Component is an intrinsic connector that enables the application to orchestrate the solution components of the parallel computing system. Besides being explicitly bound, through application-specific action bindings, to action ports of solution components, it is implicitly bound to their intrinsic lifecycle action ports. Through these, the application may activate the lifecycle action names of solution components, aimed at selecting appropriate component implementations, instantiating/preparing virtual platforms, compiling/installing component source code, instantiating/releasing components, running computations and connectors, and so on. The set of lifecycle action names, and their meanings, varies according to the component kind, as pictured in the sketch below.
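As a minimal sketch, assuming an invented set of lifecycle action names (the document fixes none; the actual names and their meanings depend on the component kind), the workflow component's lifecycle orchestration can be pictured as follows:

    class WorkflowComponent:
        """Hypothetical stand-in for the intrinsic workflow connector."""
        def activate(self, component, action_name):
            # In HPC Shelf, this would fire a lifecycle action name
            # through the implicit action binding; here we just log it.
            print(f"{component}: lifecycle action '{action_name}' activated")

    # Invented lifecycle action names, for illustration only.
    LIFECYCLE = ("resolve", "deploy", "instantiate", "run", "release")

    workflow = WorkflowComponent()
    for action in LIFECYCLE:
        for component in ("virtual_platform_1", "computation_matmul"):
            workflow.activate(component, action)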
The Application Component is an intrinsic connector that enables the application to communicate with solution components through service bindings.
Stakeholders
The following stakeholders interact around HPC Shelf:
Expert Users are the final users of HPC Shelf, concerned with solving problems in some domain of interest. It is assumed that the solutions to these problems have huge computational requirements, motivating the use of HPC Shelf. However, expert users need not be aware of the computational resources used to solve their problems through HPC Shelf. Intermediation between expert users and HPC Shelf services is performed by an application that provides a high-level interface through which expert users specify problems; the application then builds parallel computing systems to solve them.
It is noteworthy that HPC Shelf is not concerned with the concrete nature of applications. They can be presented to expert users as any kind of software artifact that can access the SAFe API (e.g., web portals, mobile apps, APIs, command-line interfaces, and Jupyter kernels).
Application Providers build applications. They are experts in the domain of problems addressed by the application and can write contextual contracts for selecting appropriate components, with the aim of building parallel computing systems to solve those problems.
Component Developers build computations and connectors, as well as the service/action bindings that bind them. They are experts in parallel programming and parallel architectures and can exploit the architectural features of parallel computing platforms, including accelerators, so that computation components may extract the maximum possible performance from the target virtual platform.
Data Managers offer data repositories, as well as connectors and bindings to enable computations to properly access them.
Platform Maintainers offer virtual platforms to applications, which may be instantiated in a computational infrastructure they own.
Architecture
HPC Shelf comprises three architectural elements: Frontend, Core, and Backend.
The Frontend is SAFe (Shelf Application Framework). It is presented as an Application Programming Interface (API) that provides operations for applications to build and deploy parallel computing systems.
The Core is a service that provides operations for applications to access the component catalog as well as control the lifecycle of the components of their parallel computing systems. It implements Alite, the contextual contract system that makes it possible to select computation component implementations based on application requirements and architectural features of their target virtual platforms (see the next section for more information).
The Backend is a service provided by each platform maintainer to instantiate virtual platforms over the computational infrastructure it owns. The meaning of instantiating a virtual platform may vary with the underlying infrastructure. For example, the infrastructure may be a local (on-premise) cluster, on top of which there is a single virtual platform representing the cluster itself. Alternatively, it may be the infrastructure of a commercial IaaS provider, over which virtually any number of virtual platforms can be instantiated, limited only by the application's budget.
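As a purely illustrative sketch (all class and method names below are hypothetical; none of them belongs to the actual SAFe, Core, or Backend APIs), the interplay among the three architectural elements can be pictured as follows: the application calls SAFe, SAFe submits contextual contracts to the Core, and the Core resolves them against its catalog, asking a Backend to instantiate virtual platforms when needed.

    # Hypothetical sketch of the Frontend/Core/Backend flow of control.

    class Backend:
        """Provided by a platform maintainer; instantiates virtual
        platforms over the computational infrastructure it owns."""
        def instantiate(self, platform_name):
            print("instantiating virtual platform:", platform_name)
            return {"platform": platform_name}

    class Core:
        """Resolves contextual contracts against the component catalog
        and drives Backend services."""
        def __init__(self, catalog, backend):
            self.catalog, self.backend = catalog, backend

        def resolve(self, contract):
            # Stub resolution: pick the first catalog entry of the
            # requested kind (Alite's real algorithm ranks candidates).
            impl = next(c for c in self.catalog if c["kind"] == contract["kind"])
            if impl["kind"] == "virtual_platform":
                return self.backend.instantiate(impl["name"])
            return impl

    class SAFe:
        """Frontend API through which applications build systems."""
        def __init__(self, core):
            self.core = core

        def add_component(self, contract):
            return self.core.resolve(contract)

    catalog = [{"kind": "virtual_platform", "name": "ec2_cluster"},
               {"kind": "computation", "name": "matmul_mpi"}]
    app = SAFe(Core(catalog, Backend()))
    platform = app.add_component({"kind": "virtual_platform"})
    computation = app.add_component({"kind": "computation"})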
Contextual Contracts
Contextual contracts may be viewed as high-level specifications of the functional and non-functional assumptions that guide the implementation of components. They address application requirements, i.e., assumptions of the application about what it expects to obtain from the component, as well as target platform characteristics, i.e., assumptions of the component about the parallel computing platform on which it will run.
The idea of contextual contracts comes from the premise that, under HPC requirements, efficient parallel software cannot be developed without taking into account the hardware on which it will execute. This is a distinctive feature of HPC software. Outside HPC, on the contrary, system designers commonly seek maximum abstraction from hardware architectural details, in an attempt to increase software development productivity and to improve portability across computer architectures and operating systems. This stance has directly shaped the software engineering/architecture methods and techniques developed over the past decades, and we believe it is one of the main reasons why these techniques struggle to reconcile expressiveness, efficiency, and a high level of abstraction in HPC software that must run on parallel computing platforms, particularly on today's heterogeneous, hierarchical, and large-scale systems.
In the construction of a parallel computing system, applications request components by specifying contextual contracts and submitting them through the Core's services. In turn, each component in the Core's catalog declares, in its implementation, the contextual contract requirements it meets. The Core is responsible for matching these contracts, enabling the application to implicitly select the component implementation that best meets the requirements of the submitted contract. This procedure of matching contracts and ranking the selected candidates is the so-called resolution algorithm, implemented by a Core module called Alite.
In order to select computation components that take advantage of the architectural characteristics of their target virtual platforms whenever possible, the contextual contract system defines the notion of system components. A system component comprises a virtual platform and the set of computation components placed on it. Resolving a system component means resolving the contracts of all the components it comprises simultaneously.
For computation components that must be resolved in the context of a system component, a contextual contract makes it possible to specify the following kinds of assumptions (illustrated in the sketch after this list):
application assumptions, which define the functional characteristics of the component;
platform assumptions, which define the features of the target virtual platform that the computation is expected to exploit in order to run as efficiently as possible;
QoS assumptions, which define the quality-of-service parameters that must be achieved by the selected component (e.g., execution time, efficiency, speedup, and energy efficiency);
cost assumptions, which constrain the budget available for executing the component over commercially available computational infrastructures.
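As an illustration of these four kinds of assumptions, the sketch below writes a hypothetical contract as a Python literal; the attribute names and values are invented and do not reproduce Alite's actual contract notation. It also shows how a system component groups a virtual platform contract with the computation contracts placed on it.

    # Illustrative only: invented attribute names, not Alite's notation.
    matmul_contract = {
        "application": {"operation": "matrix_multiply",   # functional characteristics
                        "precision": "float64"},
        "platform":    {"accelerator": "gpu",             # target platform features
                        "interconnect": "infiniband"},
        "qos":         {"max_execution_time_s": 3600},    # quality-of-service bounds
        "cost":        {"max_budget_usd": 50.0},          # budget constraint
    }

    # A system component groups a virtual platform contract with the
    # computation contracts placed on it; they are resolved simultaneously.
    system_component = {
        "virtual_platform": {"platform": {"provider": "aws_ec2", "nodes": 8}},
        "computations": [matmul_contract],
    }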
In a scenario where many commercial IaaS cloud providers are available as alternatives, the notion of contextual contracts proposed by HPC Shelf is potentially valuable for optimizing the balance between the performance and the cost of running parallel computing systems across multiple infrastructures. We recommend the article "Contextual Contracts for Component-Oriented Resource Abstraction in a Cloud of High Performance Computing Services" for details on specifying contextual contracts and for the results of a partial validation and performance evaluation study.
Implementation
The current prototype of HPC Shelf is implemented in C# on top of the Mono platform. The Core and Backend services are presented as web services. The Core web service is consumed by applications through SAFe, whereas the Backend services, as well as the web services deployed by virtual platforms, are consumed by the Core.
At present, we have implemented Backend services for the following maintainers, aimed at supporting experimental evaluations of HPC Shelf:
Amazon AWS EC2, for creating homogeneous clusters of EC2 instances of a given type, currently supporting all the instance types offered by AWS EC2;
Google Cloud Platform (GCP), for creating homogeneous clusters of GCP virtual machine instances, also currently supporting all GCP machine types;
OpenStack Cluster, which creates a virtual platform whose nodes comprise the processors of a single virtual machine;
Local Computer, for creating virtual platforms as MPI programs running on the local machine where the application itself runs, mainly intended for prototyping purposes.
Swirls
Swirls is a general-purpose HPC Shelf application for interactively building parallel computing systems, and for offloading their deployment and execution, employing any component available in the catalog of a Core service. It can be used either through Jupyter notebooks or, in a terminal, through a direct command-line interface (CLI).
One of the main features offered by Swirls is the execution of MPI programs written in C#, C, C++, and Python on virtual platforms, encapsulated in computation components (wrappers) of HPC Shelf. Indeed, using HPC Shelf's multicluster and multicloud capabilities, MPI programs deployed on distinct virtual platforms, possibly on different computing infrastructures, can communicate through connectors designed for this purpose.
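For example, a minimal Python MPI program such as the one below (a standard mpi4py program, with no Swirls-specific code) is the kind of program Swirls can encapsulate in a computation component and execute on a virtual platform:

    # hello_mpi.py - a standard mpi4py program; run with, e.g.:
    #   mpiexec -n 4 python hello_mpi.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Each process contributes its rank; the sum is reduced at rank 0.
    total = comm.reduce(rank, op=MPI.SUM, root=0)
    if rank == 0:
        print(f"{size} processes, sum of ranks = {total}")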
Currently, Swirls users can instantiate virtual platforms on Amazon's AWS EC2 and Google Cloud Platform IaaS infrastructures (offloading), as well as on the user's local computer (useful for prototyping).
In the "Swirls Installation & Usage" documents, there are instructions on how to install and using Swirls. This is a collaborative document where users are invited to include comments to suggest fixes and improvements.
If you are interested in using Swirls, or in becoming a contributor, please contact us at hpcshelf@dc.ufc.br.
Note: Swirls is an open-source prototype under continuous development. Its developers assume no liability for any consequence resulting from the use of Swirls. Indeed, the Swirls developers are looking for contributors who, in addition to using Swirls for their own purposes, can help improve it by pointing out fixes, testing existing features, and suggesting or implementing new features.
Publications
Journal Articles
HPC Shelf
de Carvalho Junior, F. H.; Al-Alam, W. G.; Dantas, A. B. O. Contextual Contracts for Component-Oriented Resource Abstraction in a Cloud of HPC Services. Concurrency and Computation: Practice & Experience, v. 33, n. 18, e6225, 2021.
Dantas, A. B. O.; de Carvalho Junior, F. H.; Barbosa, L. S. A Component-Based Framework for Certification of Components in a Cloud of HPC Services. Science of Computer Programming, v. 191, p. 102379, 2020.
de Carvalho Silva, J.; Dantas, A. B. O.; de Carvalho Junior, F. H. A Scientific Workflow Management System for Orchestration of Parallel Components in a Cloud of Large-Scale Parallel Processing Services. Science of Computer Programming, v. 173, p. 95-127, 2019.
Conference Papers
HPC Shelf
de Carvalho Junior, F. H.; Dantas, A. B. de O.; Sales, C. H. S. Swirls: A Platform for Enabling Multicluster and Multicloud Execution of Parallel Programs. In: Proceedings of the XXII Symposium on High Performance Computing Systems (WSCAD'2021), 2021, Belo Horizonte.
de Alencar, J. M. U.; de Carvalho Junior, F. H. On the Elasticity of Parallel Components in a Cloud of High Performance Computing Services. In: Proceedings of the XX Symposium on High Performance Computing Systems (WSCAD'2019), 2019, Campo Grande.
Al-Alam, W. G.; de Carvalho Junior, F. H. Contextual Contracts for Component-Based Resource Abstraction in a Cloud of HPC Services. In: Proceedings of the XX Symposium on High Performance Computing Systems (WSCAD'2019), 2019, Campo Grande.
Rezende, C. A.; de Carvalho Junior, F. H. MapReduce with Components for Processing Big Graphs. In: Proceedings of the XIX Symposium on High Performance Computing Systems (WSCAD'2018), 2018, São Paulo.
de Oliveira Dantas, A. B.; de Carvalho Junior, F. H.; Barbosa, L. S. A Framework for Certification of Large-Scale Component-Based Parallel Computing Systems in a Cloud Computing Platform for HPC Services. In: Proceedings of the 7th International Conference on Cloud Computing and Services Science (CLOSER'2017), p. 229-240, 2017, Porto.
de Oliveira Dantas, A. B.; de Carvalho Junior, F. H.; Barbosa, L. S. Certification of Workflows in a Component-Based Cloud of High Performance Computing Services. In: Lecture Notes in Computer Science, v. 10487 (Proceedings of the 14th International Conference on Formal Aspects of Component Software - FACS'2017 - Braga, Portugal), p. 198-215, Berlin: Springer, 2017.
de Carvalho Silva, J.; de Carvalho Junior, F. H. A Platform of Scientific Workflows for Orchestration of Parallel Components in a Cloud of High Performance Computing Applications. In: Lecture Notes in Computer Science, v. 9889 (Proceedings of the 20th Brazilian Symposium on Programming Languages - SBLP'2016 - Maringá, Brazil), p. 156-170, Geneva: Springer, 2016.
de Carvalho Junior, F. H.; Rezende, C. A.; de Carvalho Silva, J.; Al-Alam, W. G. Contextual Abstraction in a Type System for Component-Based High Performance Computing Platforms. In: Lecture Notes in Computer Science, v. 8129 (Proceedings of the XVII Brazilian Symposium on Programming Languages - SBLP'2013 - Brasília, Brazil), p. 90-104, Berlin/Heidelberg: Springer, 2013.