Master of Science
Jin Soung Yoo
Date of Award
Radio Frequency Identification (RFID) technology is widely used to track moving objects. In supply chain management, most major retailers use RFID systems to track the movement of products from suppliers to warehouses, store backrooms, and eventually points of sale. The amount of information generated by such systems can be enormous, since each individual item (a pallet, a box, or a SKU) leaves a trail of data as it moves between locations. Data warehousing provides architectures and tools that let business executives systematically organize, understand, and use their data to make strategic decisions. Warehousing and analyzing massive RFID data sets is an important problem with great potential benefits for inventory management, object tracking, and product procurement processing.
Many industries that collect digital data have difficulty scaling up their systems because of the sheer size of the data. Because the data sets are so large and complex, they become difficult and expensive to process with traditional database management tools and data processing applications. Cloud computing services and big data platforms, such as Hadoop, can scale to handle much larger data sets.
In this thesis, I propose two RFID data warehouse designs, a normalized schema and a denormalized schema, that can handle massive amounts of RFID data and support a variety of OLAP queries as well as location- and path-related queries. This thesis implements the proposed schemas using a relational database system (PostgreSQL) and a big data platform (Hadoop/Hive), and then conducts performance tests on a cloud computing service. I closely studied how the schema design, database system, data storage format, and number of Hadoop nodes affected the performance of each type of query I implemented.
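The difference between the two schema styles can be illustrated with a minimal sketch. It is not the thesis's actual design: the table names (`item`, `location`, `stay`, `stay_flat`), the column names, and the sample rows are all hypothetical, and SQLite stands in for PostgreSQL or Hive only so that the example is self-contained. The normalized version answers a path query with a join, while the denormalized version reads a single wide table.

```python
import sqlite3

# Hypothetical schemas for illustration only; the thesis's real tables are not shown here.
NORMALIZED = """
CREATE TABLE item (epc TEXT PRIMARY KEY, sku TEXT);
CREATE TABLE location (loc_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE stay (
    epc TEXT REFERENCES item(epc),
    loc_id INTEGER REFERENCES location(loc_id),
    time_in INTEGER,
    time_out INTEGER
);
"""

# Denormalized: item and location attributes are repeated on every stay record.
DENORMALIZED = """
CREATE TABLE stay_flat (
    epc TEXT, sku TEXT, location_name TEXT, time_in INTEGER, time_out INTEGER
);
"""


def build(conn):
    """Create both schemas and load the same made-up trail of one tagged item."""
    conn.executescript(NORMALIZED + DENORMALIZED)
    conn.execute("INSERT INTO item VALUES ('E1', 'SKU-A')")
    conn.executemany(
        "INSERT INTO location VALUES (?, ?)",
        [(1, "supplier"), (2, "warehouse"), (3, "backroom"), (4, "point_of_sale")],
    )
    # Rows are deliberately inserted out of time order.
    conn.executemany(
        "INSERT INTO stay VALUES (?, ?, ?, ?)",
        [("E1", 2, 10, 20), ("E1", 1, 0, 10), ("E1", 3, 20, 30), ("E1", 4, 30, 31)],
    )
    # Populate the wide table by flattening the normalized tables once.
    conn.execute(
        """INSERT INTO stay_flat
           SELECT s.epc, i.sku, l.name, s.time_in, s.time_out
           FROM stay s
           JOIN item i ON i.epc = s.epc
           JOIN location l ON l.loc_id = s.loc_id"""
    )


def path_normalized(conn, epc):
    """Path query on the normalized schema: needs a join to resolve location names."""
    rows = conn.execute(
        """SELECT l.name FROM stay s
           JOIN location l ON l.loc_id = s.loc_id
           WHERE s.epc = ? ORDER BY s.time_in""",
        (epc,),
    )
    return [r[0] for r in rows]


def path_denormalized(conn, epc):
    """Same path query on the denormalized schema: a single-table scan."""
    rows = conn.execute(
        "SELECT location_name FROM stay_flat WHERE epc = ? ORDER BY time_in",
        (epc,),
    )
    return [r[0] for r in rows]
```

Both functions return the same ordered list of locations for an item. The trade-off sketched here is the usual one: the normalized form stores each attribute once but pays for joins at query time, while the denormalized form duplicates attributes to avoid them.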
Many businesses are interested in switching from relational databases to big data platforms in the expectation that this will improve query performance. This thesis shows that a big data platform does not always perform better than a relational database when there are fewer than a few billion records. Also, when the data set is not large enough, increasing the number of Hadoop nodes is not always effective, because wait time makes up a larger share of total execution time than query time does. Once the characteristics of the data and the database query optimizer are understood, there are extensive opportunities to improve query performance in both systems.
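One way to see why adding nodes can stop paying off is a toy cost model. This is an assumption made for illustration, not the thesis's measurement method: it treats wait time as a fixed overhead that does not shrink as nodes are added, and query work as perfectly divisible across nodes. The function names and the numbers are hypothetical.

```python
def total_time(work_seconds, nodes, overhead_seconds):
    """Toy model: fixed wait overhead plus query work split evenly across nodes."""
    return overhead_seconds + work_seconds / nodes


def wait_share(work_seconds, nodes, overhead_seconds):
    """Fraction of total time spent waiting rather than doing query work."""
    return overhead_seconds / total_time(work_seconds, nodes, overhead_seconds)


if __name__ == "__main__":
    # Made-up numbers: 60 s of query work and 30 s of fixed wait time.
    for n in (1, 2, 4, 8):
        t = total_time(60, n, 30)
        print(f"nodes={n} total={t:.1f}s wait_share={wait_share(60, n, 30):.2f}")
```

Under these assumed numbers, going from one node to four halves the total time (90 s to 45 s) instead of quartering it, and the wait share grows from one third to two thirds. With a small data set the divisible part is small to begin with, so the fixed part dominates sooner.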
Yei-Sol Woo (2015).
RFID Big Data Warehousing and Analytics in Cloud Computing Environment.