Achieving Efficient Distributed Scheduling with Cloud Message Queuing for Multitasking and High-Performance Computing
The growth of data volumes and of the number of computational tasks makes it necessary to ensure a required level of system performance. Performance can be improved by scaling the system horizontally or vertically, but simply adding computing resources does not solve every problem. For example, a complex computational problem should be decomposed into smaller subtasks whose individual computation time is much shorter. However, the number of such subtasks may grow continuously, so that processing on the services is delayed or some messages are never processed at all. In many cases message processing must also be coordinated: for instance, message A should be processed only after messages B and C. Given these problems of processing a large number of subtasks, the aim of this work is to design a mechanism for efficient distributed scheduling through message queues. As services we choose Amazon Web Services cloud services such as Amazon EC2, SQS, and DynamoDB. Our FlexQueue solution can compete with state-of-the-art systems such as Sparrow and MATRIX. Distributed systems are complex and require sophisticated algorithms and control units, so solving this problem requires detailed research.
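The coordination requirement described above (process message A only after messages B and C) can be sketched as a queue-based worker loop. The snippet below is a minimal illustrative sketch, not the paper's FlexQueue implementation: Python's `queue.Queue` stands in for Amazon SQS, an in-memory set stands in for the completion state that would live in DynamoDB, and the names `run_scheduler`, `tasks`, and `deps` are hypothetical.

```python
# Illustrative sketch of dependency-aware scheduling over a message queue.
# queue.Queue stands in for Amazon SQS; `completed` stands in for the
# shared completion state a system like FlexQueue could keep in DynamoDB.
import queue

def run_scheduler(tasks, deps):
    """tasks: {name: callable}; deps: {name: set of prerequisite names}.
    Returns the order in which tasks were actually executed."""
    q = queue.Queue()
    for name in tasks:
        q.put(name)            # every subtask starts as a queued message
    completed = set()          # shared completion state (DynamoDB's role)
    order = []
    while not q.empty():
        name = q.get()
        if deps.get(name, set()) <= completed:
            tasks[name]()      # all prerequisites done: safe to process
            completed.add(name)
            order.append(name)
        else:
            q.put(name)        # not ready yet: requeue the message
    return order
```

Requeuing an unready task mirrors how a real SQS consumer can decline to delete a message and let it reappear after its visibility timeout, while the completion check corresponds to a read of shared state before processing.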
P. Kogge, et al., “Exascale computing study: Technology challenges in achieving exascale systems,” 2008.
M. A. Jette, et al., “SLURM: Simple Linux Utility for Resource Management,” in Lecture Notes in Computer Science: Proceedings of Job Scheduling Strategies for Parallel Processing (JSSPP) 2003, Springer-Verlag, pp. 44-60.
D. Thain, T. Tannenbaum, M. Livny, “Distributed Computing in Practice: The Condor Experience,” Concurrency and Computation: Practice and Experience, 17 (2-4), pp. 323-356, 2005.
J. Frey, T. Tannenbaum, M. Livny, I. Foster, S. Tuecke, “Condor-G: A Computation Management Agent for Multi-Institutional Grids,” Cluster Computing, 2002.
B. Bode, et al., “The Portable Batch Scheduler and the Maui Scheduler on Linux Clusters,” USENIX, 4th Annual Linux Showcase & Conference, 2000.
W. Gentzsch, et al., “Sun Grid Engine: Towards Creating a Compute Power Grid,” 1st International Symposium on Cluster Computing and the Grid (CCGRID’01), 2001.
C. Dumitrescu, I. Raicu, I. Foster, “Experiences in Running Workloads over Grid3,” The 4th International Conference on Grid and Cooperative Computing (GCC 2005), 2005.
I. Raicu, et al., “Toward Loosely Coupled Programming on Petascale Systems,” IEEE/ACM Supercomputing Conference (SC’08), 2008.
I. Raicu, et al., “Falkon: A Fast and Light-weight tasK executiON Framework,” IEEE/ACM SC 2007.
S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis, “Dremel: Interactive Analysis of Web-Scale Datasets,” Proc. VLDB Endow., 2010.
L. Ramakrishnan, et al., “Evaluating Interconnect and Virtualization Performance for High Performance Computing,” ACM Performance Evaluation Review, 40(2), 2012.
P. Mehrotra, et al., “Performance evaluation of Amazon EC2 for NASA HPC applications,” in Proceedings of the 3rd Workshop on Scientific Cloud Computing (ScienceCloud '12), ACM, New York, NY, USA, pp. 41-50, 2012.
Q. He, S. Zhou, B. Kobler, D. Duffy, and T. McGlynn, “Case study for running HPC applications in public clouds,” in Proc. of ACM Symposium on High Performance Distributed Computing, 2010.
G. Wang and T. S. E. Ng, “The Impact of Virtualization on Network Performance of Amazon EC2 Data Center,” in IEEE INFOCOM, 2010.
I. Raicu, Y. Zhao, I. Foster, “Many-Task Computing for Grids and Supercomputers,” 1st IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) 2008.
I. Raicu, “Many-Task Computing: Bridging the Gap between High Throughput Computing and High Performance Computing,” Computer Science Dept., University of Chicago, Doctorate Dissertation, March 2009.
Amazon Elastic Compute Cloud (Amazon EC2), Amazon Web Services, [online] 2013, http://aws.amazon.com/ec2/
Amazon Simple Queue Service (Amazon SQS), Amazon Web Services, [online] 2013, http://aws.amazon.com/sqs/
LSF: http://platform.com/Products/TheLSFSuite/Batch, 2012.
L. V. Kalé, et al., “Comparing the performance of two dynamic load distribution methods,” in Proceedings of the 1988 International Conference on Parallel Processing, pp. 8-11, August 1988.
W. W. Shu and L. V. Kalé, “A dynamic load balancing strategy for the Chare Kernel system,” in Proceedings of Supercomputing ’89, pp. 389-398, November 1989.
A. Sinha and L. V. Kalé, “A load balancing strategy for prioritized execution of tasks,” in International Parallel Processing Symposium, pp. 230-237, April 1993.
M. H. Willebeek-LeMair, A. P. Reeves, “Strategies for dynamic load balancing on highly parallel computers,” IEEE Transactions on Parallel and Distributed Systems, vol. 4, September 1993.
G. Zhang, et al., “Hierarchical Load Balancing for Charm++ Applications on Large Supercomputers,” in Proceedings of the 39th International Conference on Parallel Processing Workshops (ICPPW ’10), pp. 436-444, Washington, DC, USA, 2010.
K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica, “Sparrow: Distributed, Low Latency Scheduling,” in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13), ACM, New York, NY, USA, pp. 69-84, 2013.
M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes, “Omega: Flexible, scalable schedulers for large compute clusters,” in Proc. EuroSys, 2013.
M. Frigo, et al., “The implementation of the Cilk-5 multithreaded language,” in Proc. Conf. on Prog. Language Design and Implementation (PLDI), pp. 212-223, ACM SIGPLAN, 1998.
R. D. Blumofe, et al., “Scheduling multithreaded computations by work stealing,” in Proc. 35th FOCS, pp. 356-368, November 1994.
V. Kumar, et al., “Scalable load balancing techniques for parallel computers,” J. Parallel Distrib. Comput., 22(1), pp. 60-79, 1994.
J. Dinan, et al., “Scalable work stealing,” in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2009.
A. Rajendran, I. Raicu, “MATRIX: Many-Task Computing Execution Fabric for Extreme Scales,” Department of Computer Science, Illinois Institute of Technology, MS Thesis, 2013.
T. Li, et al., “ZHT: A light-weight reliable persistent dynamic scalable zero-hop distributed hash table,” in IEEE International Parallel & Distributed Processing Symposium (IPDPS ’13), 2013.
Amazon DynamoDB (beta), Amazon Web Services, [online] 2013, http://aws.amazon.com/dynamodb
P. Mell and T. Grance. “NIST definition of cloud computing.” National Institute of Standards and Technology. October 7, 2009.
M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster Computing with Working Sets,” in Proceedings of the 2nd USENIX Conference on Hot topics in Cloud Computing, Boston, MA, June 2010.
I. Sadooghi, et al. “Understanding the cost of cloud computing”. Illinois Institute of Technology, Technical report. 2013.
I. Raicu, et al. “The Quest for Scalable Support of Data Intensive Workloads in Distributed Systems,” ACM HPDC 2009.
I. Raicu, et al. "Middleware Support for Many-Task Computing", Cluster Computing, The Journal of Networks, Software Tools and Applications, 2010.
Y. Zhao, et al. "Realizing Fast, Scalable and Reliable Scientific Computations in Grid Environments", book chapter in Grid Computing Research Progress, Nova Publisher 2008.
I. Raicu, et al. "Towards Data Intensive Many-Task Computing", book chapter in "Data Intensive Distributed Computing: Challenges and Solutions for Large-Scale Information Management", IGI Global Publishers, 2009.
Y. Zhao, et al. "Opportunities and Challenges in Running Scientific Workflows on the Cloud", IEEE CyberC 2011.
M. Wilde, et al. "Extreme-scale scripting: Opportunities for large task-parallel applications on petascale computers", SciDAC 2009.
Copyright (c) 2020 The author
This work is licensed under a Creative Commons Attribution 4.0 International License.