It takes long to read all data on a single drive
The obvious solution is to read from multiple disks at once
e.g.. if we have 100 drives, each hold 1/100 of the data
There are several problems to work with multiple disks
when a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines.
Filesystems that manage the storage across a network of machines are called distributed file-systems
Since they are network-based, all the complications of network programming occur.
Hadoop comes with a distributed file-system called HDFS, which stands for Hadoop Distributed Filesystem
정형 데이터
반정형 데이터
비정형 데이터