Please use this identifier to cite or link to this item:
http://documenta.ciemat.es/handle/123456789/818
|
Title : | Job migration in HPC clusters by means of checkpoint/restart |
Authors : | Rodríguez-Pascual, Manuel; Cao, Jiajun; Moríñigo, José A.; Cooperman, Gene; Mayo-García, Rafael |
Keywords : | Checkpoint–restart · DMTCP · Dynamic job migration · Exascale clusters |
Publication date : | 2019 |
Publisher : | Springer |
Citation : | M. Rodríguez-Pascual, J. Cao, J.A. Moríñigo, G. Cooperman, R. Mayo-García. Job migration in HPC clusters by means of checkpoint/restart. The Journal of Supercomputing 75, 6517–6541 (2019) |
Abstract : | Until now, jobs running on HPC clusters were tied to the node where their execution
started. We have removed that limitation by integrating a user-level checkpoint/
restart library into a resource manager, fully transparent to both the user and running
application. This opens the door to a whole new set of tools and scheduling possibilities
based on the fact that jobs can be migrated, checkpointed, and restarted
in a different place or at a different time, while providing fault tolerance for
every job running on the cluster. This is of utmost importance for the next generation
of exascale HPC clusters, where the increasing complexity of efficient scheduling
makes it challenging to obtain the degree of parallelism demanded by the
applications. |
URI : | http://documenta.ciemat.es/handle/123456789/818 |
Appears in collections: | Artículos de Tecnología
|
Items in Docu-menta are protected by a Creative Commons License, with rights reserved.
|