Comparing Emulators and Virtual Machines for Simulating Big Data Pipelines on Cloud-Edge Infrastructures
Abstract
This thesis explores advancements in emulation techniques for big data pipelines within cloud-edge infrastructures, with a specific focus on evaluating and comparing the performance of virtual machines (VMs) and locally emulated nodes using QEMU. The objective of this research is to provide insights into key performance metrics, including CPU utilization, memory consumption, execution time, and compute cost, under different configurations. Achieving this required us to extend SIM-PIPE, a simulation tool originally designed for single-node environments, to support multi-node setups.
The extended version of SIM-PIPE was used to simulate and measure the performance of big data pipelines in both VM-based and QEMU-based environments. Experimental results demonstrate that VMs generally excel in CPU efficiency, execution speed, and cost-effectiveness across most configurations. QEMU, on the other hand, performs better in memory utilization and offers significant flexibility advantages, such as the ability to emulate multiple operating systems within a single platform. Moreover, specific scenarios involving larger, more complex configurations and smaller workloads reveal instances where QEMU's adaptability makes it more competitive with virtual machines, despite its lower performance on the other metrics.
By highlighting the strengths and weaknesses of VMs and QEMU across diverse hardware configurations, this research provides a detailed analysis of their suitability for emulating big data workflows in cloud-edge infrastructures. Our findings may support informed decision-making when selecting resource-efficient execution environments for big data pipeline simulations.