Unstructured Data

3.2.3. Unstructured Data

While many datasets encountered in practice are structured, data can also be unstructured. This includes data where records do not follow a fixed schema, for example, entries with missing or varying fields, or datasets containing free-form text.

In this section, we’ll do a quick overview of common methods for storing unstructured data.
While we won’t be working directly with unstructured data in the upcoming parts of this course, it’s still important to be aware of the different types and storage options available, especially since they are common in real-world data scenarios.

3.2.3.1. Document Storage

Document storage is one of the most commonly used approaches for storing unstructured or semi-structured data. Each record (a "document") can have different fields, which makes the store schema-flexible. A document typically resembles a set of key-value pairs, and JSON is a widely used format for this purpose.

Example:

[
  {
    "name": "Record 1",
    "age": 21
  },
  {
    "name": "Record 2",
    "birth_year": 2003
  }
]

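In the above example, both records share the name field, but the second field differs: the first record has age, while the second has birth_year. This flexibility allows you to store only the data that is available without enforcing a rigid schema. It avoids both the NULL placeholders a fixed schema would need for missing fields and the data loss that results from dropping fields that do not fit.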
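As a minimal sketch of what schema flexibility means for code that reads such records, the Python snippet below loads the two example records and handles whichever field happens to be present. The `describe` helper is a hypothetical name introduced only for this illustration.

```python
import json

# The two records from the example above, as a JSON string.
raw = """
[
  {"name": "Record 1", "age": 21},
  {"name": "Record 2", "birth_year": 2003}
]
"""

records = json.loads(raw)


def describe(record):
    """Return a short description, tolerating missing fields."""
    # dict.get returns None instead of raising KeyError when a field is absent.
    age = record.get("age")
    birth_year = record.get("birth_year")
    if age is not None:
        return f"{record['name']}: age {age}"
    if birth_year is not None:
        return f"{record['name']}: born {birth_year}"
    return f"{record['name']}: no age information"


descriptions = [describe(r) for r in records]
for line in descriptions:
    print(line)
```

The trade-off is that the reader, not the storage layer, is now responsible for checking which fields exist.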

Common Formats for Document Storage:

JSON: JavaScript Object Notation; can be stored as standalone files or passed between services.
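JSONL (JSON Lines): Each line is a separate, valid JSON value, usually an object. Because records are delimited by newlines, a file can be appended to and processed one record at a time without parsing the whole file, which makes it easier to stream. Files are also typically more compact than pretty-printed JSON, since each record sits on a single line without indentation.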
NoSQL Databases: Production-ready systems such as:
- MongoDB (document-based)
- Apache Cassandra (wide-column store with flexible schema)
- ScyllaDB (Cassandra-compatible, high-performance)
- Redis (key-value store with support for various data structures)

These systems are designed to handle flexible schemas, large-scale reads/writes, and high availability in distributed environments.
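To make the JSONL format listed above more concrete, here is a small sketch that writes records one per line and reads them back one at a time. It uses an in-memory `io.StringIO` buffer as a stand-in for a real file, and `read_jsonl` is a hypothetical helper name, not part of any library.

```python
import io
import json

records = [
    {"name": "Record 1", "age": 21},
    {"name": "Record 2", "birth_year": 2003},
]

# Write: one single-line JSON object per record, each terminated by a newline.
buffer = io.StringIO()  # stands in for a file opened with open(path, "w")
for record in records:
    buffer.write(json.dumps(record) + "\n")

jsonl_text = buffer.getvalue()


def read_jsonl(lines):
    """Yield one parsed record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# Read: iterate line by line instead of parsing the whole file at once.
loaded = list(read_jsonl(io.StringIO(jsonl_text)))
print(jsonl_text, end="")
```

Because `read_jsonl` is a generator, a consumer can stop early or process a very large file without holding all records in memory.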

3.2.3.2. Object / Blob Storage and File Systems

Another common category of unstructured data storage is object storage, which stores entire files as data “objects.” These objects can be of various types: audio files, images, videos, documents, and more.

You likely encounter this on your own computer (e.g., storing .mp3, .jpg, or .mp4 files). In production systems, object storage is widely used for storing things like log files, backups, and multimedia content.

Common Storage Options:

Local Disk / External Hard Drives: Traditional file systems used on personal or enterprise machines.
Cloud Object Storage:
- Amazon S3 (Simple Storage Service)
- Google Cloud Storage
- Azure Blob Storage
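The core idea behind all of these options, a key that maps to an opaque blob of bytes, can be sketched with a toy store backed by a local directory. `LocalObjectStore` and its methods are hypothetical names made up for this illustration. This is not the API of any cloud provider, although real services offer broadly similar put, get, and list-by-prefix operations.

```python
import tempfile
from pathlib import Path


class LocalObjectStore:
    """Toy object store: maps string keys to raw bytes under one root directory."""

    def __init__(self, root):
        self.root = Path(root)

    def put(self, key, data: bytes):
        """Store the bytes under the given key, creating folders as needed."""
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key) -> bytes:
        """Return the bytes stored under the given key."""
        return (self.root / key).read_bytes()

    def list_keys(self, prefix=""):
        """Return all stored keys that start with the prefix, sorted."""
        keys = (
            p.relative_to(self.root).as_posix()
            for p in self.root.rglob("*")
            if p.is_file()
        )
        return sorted(k for k in keys if k.startswith(prefix))


with tempfile.TemporaryDirectory() as tmp:
    store = LocalObjectStore(tmp)
    store.put("logs/2024-01-01.log", b"service started\n")
    store.put("images/cat.jpg", b"\xff\xd8\xff")  # placeholder bytes, not a real image
    log_bytes = store.get("logs/2024-01-01.log")
    log_keys = store.list_keys(prefix="logs/")
    all_keys = store.list_keys()
```

Note that the store never inspects the contents: a log file and an image are handled identically, which is what makes this kind of storage suitable for arbitrary file types.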

3.2.3.3. Graph Databases

In some use cases, data is best represented not as isolated records, but as a network of relationships. For example, in a social network, users (nodes) are connected by friendships or interactions (edges). In such cases, graph databases are a natural fit.

Graph databases store entities as nodes and relationships between them as edges. This structure enables efficient querying and traversal of complex, connected data.
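The node-and-edge model and the idea of traversal can be illustrated without a database at all. The sketch below stores a small, made-up friendship graph as an adjacency list and uses breadth-first search to find everyone within a given number of hops. The data and the `within_hops` function are assumptions for illustration only. A real graph database would express this as a query rather than hand-written traversal code.

```python
from collections import deque

# Hypothetical friendship graph: each key is a user (node),
# each value is the set of that user's friends (edges).
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice"},
    "dave": {"bob", "erin"},
    "erin": {"dave"},
}


def within_hops(graph, start, max_hops):
    """Return {user: hop distance} for users reachable within max_hops (BFS)."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        if distances[user] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(user, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[user] + 1
                queue.append(neighbor)
    del distances[start]  # the start user is not their own friend
    return distances


reachable = within_hops(friends, "alice", 2)
print(sorted(reachable.items()))
```

Here "friends of friends" is a two-hop traversal. In a relational table of friendships, the same question would require joining the table with itself once per hop.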

Common Graph Databases:

Neo4j – A popular open-source graph database with a powerful query language called Cypher
Amazon Neptune – A fully managed graph database service on AWS
ArangoDB – A multi-model database that supports graphs along with documents and key-value data