Job migration in HPC clusters by means of checkpoint/restart

dc.contributor.author: Rodríguez-Pascual, Manuel
dc.contributor.author: Cao, Jiajun
dc.contributor.author: Moríñigo, José A
dc.contributor.author: Cooperman, Gene
dc.contributor.author: Mayo-García, Rafael
dc.date.accessioned: 2020-12-02T10:43:35Z
dc.date.available: 2020-12-02T10:43:35Z
dc.date.issued: 2019
dc.description.abstract: Until now, jobs running on HPC clusters were tied to the node where their execution started. We have removed that limitation by integrating a user-level checkpoint/restart library into a resource manager, in a way that is fully transparent to both the user and the running application. This opens the door to a whole new set of tools and scheduling possibilities, since jobs can be migrated, checkpointed, and restarted in a different place or at a different time, while providing fault tolerance for every job running on the cluster. This is of utmost importance for the coming generation of exascale HPC clusters, where the increasing scale and complexity of scheduling make it challenging to obtain the degree of parallelism demanded by the applications. (es_ES)
dc.identifier.citation: M. Rodríguez-Pascual, J. Cao, J.A. Moríñigo, G. Cooperman, R. Mayo-García. Job migration in HPC clusters by means of checkpoint/restart. The Journal of Supercomputing 75, 6517-6541 (2019) (es_ES)
dc.identifier.doi: http://dx.doi.org/10.1007/s11227-019-02857-y
dc.identifier.uri: https://hdl.handle.net/20.500.14855/818
dc.language.iso: eng (es_ES)
dc.publisher: Springer (es_ES)
dc.rights.accessRights: open access (es_ES)
dc.subject: Checkpoint–restart (es_ES)
dc.subject: DMTCP (es_ES)
dc.subject: Dynamic job migration (es_ES)
dc.subject: Exascale clusters (es_ES)
dc.title: Job migration in HPC clusters by means of checkpoint/restart (es_ES)
dc.type: journal article (es_ES)

Files

Original bundle

Name: Artículo_Revisado_Abril2019.pdf
Size: 257.62 KB
Format: Adobe Portable Document Format