This repository has been archived by the owner on Aug 26, 2024. It is now read-only.


Selection of the appropriate Storage Technology

Various criteria to consider when choosing a storage technology

Business Aspects

Different storage systems serve different purposes; the choice is determined by the stage of the data lifecycle:

  • Ingestion : acquiring data & bringing it into the Cloud
  • Storage : persisting data for later use
  • Processing & analysis : transforming data into a usable format & describing it
  • Exploration & dataviz : deriving insights from tables / charts for others & for decision making
  1. Ingestion

    1. Application Data
      • Generated by Mobile / Web (backend services) / Software apps
      • Examples : transactions from an online retail app; clickstream from users reading articles on a website; logs from software; user registrations for an online service
      • The volume of data depends on the number of users (can vary widely)
    2. Streaming Data
      • Sent in small messages transmitted continuously from the source
      • Each event carries a time stamp (event time) : events need to be ordered for processing (inserting them in the correct position requires buffering) --> time-series analysis
      • Examples : VMs monitoring data, IoT devices / Sensors, Customers adding items to an online shopping cart...
    3. Batch Data
      • Uploaded in bulk, typically as files
      • Examples : transaction data, archives, migration from on premises
      • Cloud Storage is typically used for batch uploads, possibly in conjunction with Cloud Transfer Service and Transfer Appliance for large volumes.
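The buffering step mentioned for streaming data can be sketched with a small min-heap keyed on event time. This is an illustrative pattern only, not a GCP API; the watermark rule (emit events older than the newest event seen, minus a delay) is an assumption for the sketch:

```python
import heapq

class EventBuffer:
    """Buffers out-of-order events and releases them in event-time order.

    Events more than `watermark_delay` seconds behind the newest event
    seen are considered safe to emit (a simple watermark heuristic).
    """

    def __init__(self, watermark_delay=10):
        self.watermark_delay = watermark_delay
        self._heap = []       # min-heap of (event_time, payload)
        self._max_seen = 0

    def add(self, event_time, payload):
        heapq.heappush(self._heap, (event_time, payload))
        self._max_seen = max(self._max_seen, event_time)

    def drain_ready(self):
        """Return events behind the watermark, ordered by event time."""
        watermark = self._max_seen - self.watermark_delay
        ready = []
        while self._heap and self._heap[0][0] <= watermark:
            ready.append(heapq.heappop(self._heap))
        return ready

buf = EventBuffer(watermark_delay=5)
buf.add(100, "a")   # arrives first
buf.add(98, "b")    # late event
buf.add(110, "c")   # advances the watermark to 105
print(buf.drain_ready())  # [(98, 'b'), (100, 'a')] -- ordered by event time
```

A real pipeline (e.g. Cloud Dataflow) handles watermarks and late data for you; the point here is only why buffering is needed before time-series analysis.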
  2. Storage

    1. How data is accessed : for instance, online transactions (querying for specific records using a set of filters) vs a ML pipeline (training on thousands of rows, i.e. in batch mode)
    2. Access controls :
      • Fine grained : implemented at the app level or by creating views that limit the data available to some users
      • Coarse grained : based on bucket permissions & access control lists on objects stored in a bucket
    3. How long data will be stored :
      • if transient : can be stored on local SSDs (data is lost when the instance shuts down)
      • if frequently accessed : well suited to either a relational or a NoSQL db
      • policies can migrate older data to Nearline (accessed about once a month) or Coldline (about once a year); with age, data can be deleted (if compliant with regulations) or archived
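The age-based migration policies mentioned above can be expressed as a Cloud Storage lifecycle configuration. A minimal sketch — the age thresholds here are illustrative placeholders, not recommendations:

```json
{
  "rule": [
    {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
     "condition": {"age": 30}},
    {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
     "condition": {"age": 365}},
    {"action": {"type": "Delete"},
     "condition": {"age": 1825}}
  ]
}
```

Such a file could be applied with `gsutil lifecycle set lifecycle.json gs://my-bucket` (the bucket name is a placeholder).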
  3. Processing & analysis

    1. Transformation
      • Map data into forms that make it readily available for ad-hoc queries / analytics
      • Cleansing : correcting errors according to the data type or business logic (use-case dependent), normalizing / standardizing data --> Cloud Dataflow
    2. Analysis
      • Extract useful information with statistical techniques
      • Description, characteristics, correlations, predictions, cluster subsets... --> Cloud Dataflow, Cloud Dataproc, BigQuery & Cloud ML Engine
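The cleansing / standardizing step can be sketched in plain Python (Cloud Dataflow would express the same logic as pipeline transforms; the field names and correction rules below are invented for illustration):

```python
def cleanse(record):
    """Correct common errors and standardize one raw record.

    Illustrative rules only; real cleansing logic depends on the
    data type and the business use case.
    """
    cleaned = dict(record)
    # Standardize a text field (trim whitespace, fix case).
    cleaned["country"] = cleaned.get("country", "").strip().upper()
    # Correct an out-of-range value according to business logic.
    if cleaned.get("age") is not None and cleaned["age"] < 0:
        cleaned["age"] = None
    # Normalize amounts to a single unit (cents -> dollars).
    cleaned["amount"] = round(cleaned["amount_cents"] / 100, 2)
    del cleaned["amount_cents"]
    return cleaned

raw = {"country": " fr ", "age": -1, "amount_cents": 1999}
print(cleanse(raw))  # {'country': 'FR', 'age': None, 'amount': 19.99}
```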
  4. Exploration & dataviz (Cloud Datalab / Data Studio)

Technical Aspects

  1. Volume
  2. Velocity :
    • the rate at which it is produced, sent & processed
    • during ingestion and I/O, it is important to match the velocity of incoming data with the rate at which the data store can write/read data
    • Web & mobile apps (human entered data): typically low velocity
    • IoT & time series (machine generated data): high velocity
  3. Variation of structure
  4. Data access patterns
    • How much data is retrieved / written in a read / insert operation
    • How often data is read / written

Variation in data structure

  1. Structured = a fixed set of attributes -> columns = attributes / rows = records or entities

    • OLTP : many operations, each touching one row at a time (Cloud SQL & Cloud Spanner)
    • OLAP (data warehousing) : the analyst would query many rows and only some columns (BigQuery uses a column oriented approach)
  2. Semi-structured = doesn't follow a fixed tabular format; instead stores schema attributes along with the data (the set of attributes can vary from one instance to another). Allows users to add attributes without changing the schema. 2 ways of storing: as documents or as wide columns (the distinction lies in how data is retrieved)

    • Fully indexed (Cloud Datastore)

      • Indexes are created on each of the attributes as well as a combination of attributes.
      • A large number of indexes can significantly increase the amount of storage used (the total index storage can be greater than the amount of storage used to store documents).
      • Additional indexes can negatively impact performance for insert, update, and delete operations, because each index must be updated as well
    • Row-key access : wide-column DBs organize data so that rows with similar row keys are close together. Queries use a row key, which is analogous to a primary key in relational DBs, to retrieve data:

      • tables are designed to respond to particular queries (whereas relational DBs are designed according to forms of normalization that minimize the risk of data anomalies, wide-column DBs are designed for low-latency R/W at high volumes)
      • can lead to duplication of data
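The row-key-centric design can be illustrated by how a key might be composed for a time-series table in a wide-column store. The exact layout below (entity id, `#` separator, reversed timestamp so the newest reading sorts first) is one common pattern, sketched here with invented names:

```python
def row_key(device_id, timestamp):
    """Compose a row key so rows for one device sort together,
    newest reading first -- a common wide-column design pattern."""
    # Subtracting from a fixed constant keeps every reversed timestamp
    # the same width, so lexicographic order equals reverse time order.
    reversed_ts = 10**10 - timestamp
    return f"{device_id}#{reversed_ts}"

keys = sorted(row_key("sensor-42", ts) for ts in (1700000000, 1700000600))
# The later reading (1700000600) now sorts first for "sensor-42".
print(keys[0])  # sensor-42#8299999400
```

Because the table only answers queries that start from such a key, the schema must be designed around the queries — which is why ad hoc querying is limited and some data duplication is accepted.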
  3. Unstructured = doesn't have a defined schema or data model & may have an internal structure that is not relevant to the way it is stored.

Comparison

|                      | Structured | Semi-structured | Unstructured |
| -------------------- | ---------- | --------------- | ------------ |
| schema               | fixed      | attributes can vary | none defined |
| flexibility / search | low / easy | good compromise | high / difficult |
| scalability          | difficult  | feasible        | easy |
| examples             | tables, sheets, phone records, customer data, SQL, transaction history | JSON, XML, text with an apparent pattern | text (NLP), audio & video files, BLOBs |

Decision tree

Example

Schemas for relational & NoSQL db

Structured data is usually stored in relational DBs, whereas semi-structured data goes in NoSQL DBs. The schema influences how data is stored and accessed.

  1. Relational DB design
    • OLTP schemas in most cases follow the following 3 (out of 10) data normalization rules (to reduce the risk of anomalies):
      • each column has an atomic value; there are no repeating groups, and there is a primary key
      • separate tables are created for values that apply to multiple rows, linked using foreign keys
      • any column from a table that does not depend on a key is eliminated
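The 3 rules above can be made concrete with a small SQLite example (a hypothetical customer/orders schema; SQLite stands in for any relational OLTP store):

```python
import sqlite3

# Illustrative normalized schema: orders reference customers through a
# foreign key instead of repeating customer attributes on every order.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,   -- every row has a primary key
    name        TEXT NOT NULL          -- atomic values, no repeating groups
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    amount      REAL NOT NULL          -- depends only on the key (order_id)
);
""")
conn.execute("INSERT INTO customer VALUES (1, 'Alice')")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, 25.0), (11, 1, 40.0)])

# The customer's name is stored once, not duplicated on each order row.
row = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM orders o JOIN customer c USING (customer_id)
    GROUP BY c.name
""").fetchone()
print(row)  # ('Alice', 65.0)
```

If `name` were repeated on every order row, renaming the customer could leave inconsistent copies behind — the anomaly normalization is meant to prevent.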
    • OLAP schemas are used for data warehouse or data mart apps. Data is organized around several dimensions. They are designed to facilitate :
      • rolling up & aggregating data
      • drilling down from summary data to detailed data
      • pivoting & looking at data from different dimensions (slicing & dicing)
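Rolling up, drilling down, and slicing can be sketched with a tiny in-memory fact table (the dimensions and figures are invented; a warehouse like BigQuery would express the same operations as `GROUP BY` queries):

```python
from collections import defaultdict

# A tiny fact table: (region, month, product, sales). Invented data.
facts = [
    ("EU", "2024-01", "book", 100),
    ("EU", "2024-02", "book", 150),
    ("US", "2024-01", "book",  80),
    ("US", "2024-01", "pen",   20),
]

def roll_up(facts, dimension):
    """Aggregate sales along one dimension (0=region, 1=month, 2=product)."""
    totals = defaultdict(int)
    for row in facts:
        totals[row[dimension]] += row[3]
    return dict(totals)

def slice_by(facts, dimension, value):
    """'Slice': keep only the facts matching one value of a dimension."""
    return [row for row in facts if row[dimension] == value]

print(roll_up(facts, 0))                       # {'EU': 250, 'US': 100}
print(roll_up(slice_by(facts, 2, "book"), 1))  # drill into books by month
```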
  2. NoSQL DB design - each type is suited for different use cases (ingestion, entity relationships & query requirements):
    • Key-value (Cloud Memorystore / Redis):

      • use associative arrays or dictionaries as the basic datatype
      • keys are used to look up values
      • values can be complex data structures, such as a JSON object
      • if items should be searchable: a document database is a better option
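The key-value pattern — and its lookup limitation — in miniature, with a plain dict standing in for a store like Memorystore/Redis (the key scheme `user:<id>` is just a convention for the sketch):

```python
import json

# A dict stands in for a key-value store.
store = {}

# Values can be complex structures, serialized e.g. as JSON.
store["user:42"] = json.dumps({"name": "Ada", "cart": ["pen", "book"]})

# Lookup works only by key...
user = json.loads(store["user:42"])
print(user["name"])  # Ada

# ...there is no index on the value's attributes: finding every user
# with "pen" in their cart means scanning all values -- the reason a
# document database is the better fit when items must be searchable.
matches = [k for k, v in store.items() if "pen" in json.loads(v)["cart"]]
print(matches)  # ['user:42']
```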
    • Document (Cloud Datastore, MongoDB, CouchDB & OrientDB):

      • allow complex data structures called documents,
      • documents are used as values & accessed in more ways than simple key lookup
      • documents should be designed to group data that is read together
    • Wide column (Bigtable, HBase, Cassandra) :

      • use cases: high volumes, low-latency writes, more W than R
      • limited range of queries ie no ad hoc queries
      • lookup by a single key
      • model similar to the tabular structure of relational tables, with differences: wide-column DBs are often sparse (with the exception of IoT & time-series DBs)
    • Graph (no managed graph DB in GCP; e.g. Neo4j) :

      • based on modeling entities and relationships
      • use case : social networks
      • people = nodes in the graph/network & relationships = links/ edges
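A minimal adjacency-list sketch of the social-network use case (names and relationships are invented; a real graph DB adds indexing and a traversal query language on top of this idea):

```python
from collections import defaultdict

# People are nodes; "follows" relationships are directed edges.
edges = [("ana", "bob"), ("bob", "carl")]

graph = defaultdict(set)
for src, dst in edges:
    graph[src].add(dst)

def friends_of_friends(graph, person):
    """People reachable in exactly two hops, excluding direct links --
    the kind of traversal graph DBs are optimized for."""
    direct = graph[person]
    two_hop = set()
    for friend in direct:
        two_hop |= graph[friend]
    return two_hop - direct - {person}

print(friends_of_friends(graph, "ana"))  # {'carl'}
```

In a relational DB the same traversal needs a self-join per hop; modeling the edges directly keeps multi-hop queries cheap.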

SQL language - MySQL cheatsheet