Job migration in HPC clusters by means of checkpoint/restart

Rodríguez-Pascual, Manuel; Cao, Jiajun; Moríñigo, José A; Cooperman, Gene; Mayo-García, Rafael

doi:http://dx.doi.org/10.1007/s11227-019-02857-y

dc.contributor.author	Rodríguez-Pascual, Manuel
dc.contributor.author	Cao, Jiajun
dc.contributor.author	Moríñigo, José A
dc.contributor.author	Cooperman, Gene
dc.contributor.author	Mayo-García, Rafael
dc.date.accessioned	2020-12-02T10:43:35Z
dc.date.available	2020-12-02T10:43:35Z
dc.date.issued	2019
dc.description.abstract	Until now, jobs running on HPC clusters were tied to the node where their execution started. We have removed that limitation by integrating a user-level checkpoint/restart library into a resource manager in a way that is fully transparent to both the user and the running application. This opens the door to a whole new set of tools and scheduling possibilities, because jobs can be checkpointed, migrated, and restarted in a different place or at a different time, while fault tolerance is provided for every job running on the cluster. This is of utmost importance for the coming generation of exascale HPC clusters, where the growing complexity of efficient scheduling makes it challenging to obtain the degree of parallelism demanded by the applications.	es_ES
dc.identifier.citation	M. Rodríguez-Pascual, J. Cao, J.A. Moríñigo, G. Cooperman, R. Mayo-García. Job migration in HPC clusters by means of checkpoint/restart. The Journal of Supercomputing 75, 6517-6541 (2019)	es_ES
dc.identifier.doi	http://dx.doi.org/10.1007/s11227-019-02857-y
dc.identifier.uri	https://hdl.handle.net/20.500.14855/818
dc.language.iso	eng	es_ES
dc.publisher	Springer	es_ES
dc.rights.accessRights	open access	es_ES
dc.subject	Checkpoint–restart	es_ES
dc.subject	DMTCP	es_ES
dc.subject	Dynamic job migration	es_ES
dc.subject	Exascale clusters	es_ES
dc.title	Job migration in HPC clusters by means of checkpoint/restart	es_ES
dc.type	journal article	es_ES

Files

Original bundle

Name: Artículo_Revisado_Abril2019.pdf
Size: 257.62 KB
Format: Adobe Portable Document Format

Collections

Artículos de Tecnología