Comparative study of data transformation tools: An investigation of functionalities supported in common tools and case study of declarative and procedural data manipulation languages

Storvoll, Tine-Lovise

dc.contributor.advisor	Soylu, Ahmet
dc.contributor.advisor	Martin-Recuerda, Francisco
dc.contributor.author	Storvoll, Tine-Lovise
dc.date.accessioned	2022-09-13T08:04:26Z
dc.date.available	2022-09-13T08:04:26Z
dc.date.issued	2022
dc.identifier.uri	https://hdl.handle.net/11250/3017395
dc.description.abstract	Today, organizations are collecting and storing huge amounts of data that could potentially be very valuable. Finding trends and patterns in historic data can allow businesses to make more informed decisions. Data scientists are therefore working to extract meaning from this massive amount of data. However, 80% of the time in data science projects is spent preparing the data for analysis. Selecting an efficient tool for the job can help reduce the time spent on data transformation. Thus, this thesis will provide some insights into existing tools and their performance. A selection of common tools is made in Chapter 3. The tools are reviewed against a framework to identify their support for common data preparation tasks, and an evaluation of the tools is given at the end of the chapter. In Chapter 4, one declarative and one procedural Data Manipulation Language (DML) are selected from the common data transformation tools. Python pandas, a procedural language, and SQL, a declarative language, are evaluated and compared in a case study. The case study delves deeper into the tools through a use case, and the comparative analysis at the end provides some insights into the differences between the two DMLs. Thus, the first contribution of this thesis is a review of the support for common data preparation tasks provided by a selection of prevalent data transformation tools. The second contribution is an analysis of the differences between a declarative and a procedural approach to data manipulation, through a case study comparing two popular DMLs. The review of tools in Chapter 3 revealed that the most prevalent data transformation tools support the majority of the common data preparation tasks. This review gives some general insight into which tasks are supported, which tasks need more effort to perform, and which are not supported at all. 
The review is based exclusively on information found in the technical documentation of the tools, and no further experimentation was done to investigate the support. The case study in Chapter 4 revealed that the procedural DML, Python pandas, is better suited for data manipulation as it is less time-consuming and provides higher flexibility and usability. Python pandas is also considered to have high readability and expressiveness, although SQL seems to beat pandas in these areas.	en_US
dc.language.iso	eng	en_US
dc.publisher	OsloMet - storbyuniversitetet	en_US
dc.relation.ispartofseries	ACIT;2022
dc.subject	Data transformation	en_US
dc.subject	Manipulation	en_US
dc.subject	Tools	en_US
dc.subject	Preparation	en_US
dc.title	Comparative study of data transformation tools: An investigation of functionalities supported in common tools and case study of declarative and procedural data manipulation languages	en_US
dc.type	Master thesis	en_US
dc.description.version	publishedVersion	en_US

Associated file(s)

Filename: storvoll-acit2022.pdf
Size: 2.245Mb
Format: PDF

This item appears in the following collection(s)

TKD - Master i Anvendt data- og informasjonsteknologi (ACIT) [237]
