Comparative study of data transformation tools: An investigation of functionalities supported in common tools and case study of declarative and procedural data manipulation languages
MetadataShow full item record
Today, organizations are collecting and storing huge amounts of data that could potentially be very valuable. Finding trends and patterns in historic data can allow businesses to make more informed decision. Data scientists are therefore working to extract meaning from the massive amount of data. However, 80% of the time in data science projects is spent preparing the data for analysis. Selecting an efficient tool for the job can contribute to reducing the time spent on data transformation. Thus, this thesis will provide some insights into existing tools and their performance. A selection of common tools is made in Chapter 3. The tools are reviewed with regards to a framework to identify the support of common data preparation tasks and an evaluation of the tools are given at the end of the chapter. In Chapter 4, one declarative and one procedural Data Manipulation Language (DML) are selected from the common data transformation tools. Python pandas, a procedural language, and SQL, a declarative language, are evaluated and compared in a case study. The case study delves deeper into the tools through a use case and the comparative analysis at the end will provide some insights into the differences in the two DMLs. Thus, the first contribution of this thesis is a review of the support of common data preparation tasks provided by a selection of some prevalent data transformation tools. The second contribution is an analysis of the differences in a declarative vs procedural approach to data manipulation through a case study comparing two popular DMLs. The findings of the review of tools in Chapter 3, revealed that the most prevalent data transformation tools support the majority of the common data preparation tasks. This review gives some general insight into which tasks are supported, which tasks needs more effort to perform, and which are not supported at all. The review is exclusively based on information found in technical documentation of the tools, and no further experimentation is done to investigate the support. The case study in Chapter 4 revealed that the procedural DML, Python pandas, is better suited for data manipulation as it is less time-consuming and provides higher flexibility and usability. Python pandas is also considered to have high readability and expressiveness, although SQL seems to beat pandas in these areas.