Reading all the data on a single drive takes a long time

The obvious solution is to read from multiple disks at once

e.g., if we have 100 drives, each holds 1/100 of the data, and all of them can be read in parallel
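
A rough back-of-the-envelope sketch of why this helps. The drive size (1 TB) and transfer speed (100 MB/s) are illustrative assumptions, not figures from these notes, and the function name `read_time_seconds` is hypothetical. The model also assumes the data is split evenly and ignores seek time and coordination overhead.

```python
def read_time_seconds(data_bytes, transfer_bytes_per_s, num_drives=1):
    """Time to read data split evenly across num_drives, all read in parallel."""
    per_drive_bytes = data_bytes / num_drives
    return per_drive_bytes / transfer_bytes_per_s


TB = 10**12  # assumed dataset size unit
MB = 10**6   # assumed transfer speed unit

# One drive reading the whole dataset (assumed 1 TB at 100 MB/s).
single = read_time_seconds(1 * TB, 100 * MB)

# 100 drives, each holding 1/100 of the data, read at the same time.
parallel = read_time_seconds(1 * TB, 100 * MB, num_drives=100)

print(single, parallel)
```

Under these assumptions the read time shrinks by the number of drives, which is the whole motivation for spreading the data out.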

Working with multiple disks raises several problems

When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines.
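
A minimal sketch of what partitioning can look like: the dataset is cut into fixed-size blocks and the blocks are dealt out to machines in round-robin order. This is only an illustration of the idea; it is not how any particular filesystem places its blocks, and the name `partition_blocks` and its parameters are hypothetical.

```python
def partition_blocks(total_size, block_size, num_machines):
    """Assign block indices to machines round-robin.

    Returns a dict mapping machine id -> list of block indices it stores.
    """
    # Ceiling division: the last block may be only partially filled.
    num_blocks = -(-total_size // block_size)
    placement = {machine: [] for machine in range(num_machines)}
    for block in range(num_blocks):
        placement[block % num_machines].append(block)
    return placement


# A 10-unit dataset, 3-unit blocks, 2 machines -> 4 blocks in total.
placement = partition_blocks(total_size=10, block_size=3, num_machines=2)
print(placement)
```

Round-robin keeps the machines evenly loaded, but it says nothing about what happens when a machine fails, which is one of the problems a real distributed filesystem has to solve.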

Filesystems that manage storage across a network of machines are called distributed filesystems

Since they are network-based, all the complications of network programming apply.

Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem

DATA TYPES

Structured data

Semi-structured data

Unstructured data