Databases 101: A Simple Guide for Non-Technical Users
With 300+ databases out there, what do they all do?
Just like choosing the right flour for baking, selecting the optimal database for your project requires careful consideration. Different databases excel at different tasks, and using the wrong one can significantly impact your project’s outcome.
Imagine the frustration of using all-purpose flour for pizza dough, only to end up with a disappointing, limp crust. Similarly, relying on a generic database for complex log analysis might lead to performance issues and hinder your project’s progress. That’s why your engineering lead might be requesting extra time to create a dedicated data store for log analysis — they’re essentially choosing the perfect flour for the task. By taking the time to select the right tool for the job, they are ensuring that your project has the best possible foundation for success.
Finding the Right Data Flour: Why Your Choice of Database Matters
Just like choosing the wrong flour can ruin your pizza, choosing the wrong database can lead to inefficient and inaccurate data management.
Understanding Data and Its Usage
To choose the right database, ask yourself these two key questions:
1. What does the data look like?
While technically any data can be stored in any database, many are designed for specific data types or structures. For example, a graph database excels at handling interconnected data like social networks, while Snowflake shines with massive analytical datasets.
Remember, the data type isn’t just about storage; it significantly impacts the insertion and retrieval of data.
2. What do you need to do with the data?
Different types of data and use cases demand different database functionalities. Consider the distinction between transactional and analytical databases:
- OLTP (Online Transactional Processing): These are the core databases for applications, handling user data and frequent read/write operations. They prioritize data integrity and quick, small queries.
- OLAP (Online Analytical Processing): These databases are used for data science and analytics, working with large and complex queries on sporadically accessed data.
The way users interact with the data necessitates different storage structures. OLTP and OLAP are just two examples, but numerous use cases and specialized databases exist.
3 Main Types of Databases and Their Roles
Here’s a simplified breakdown of 3 major categories of databases:
- Databases for User-Facing Apps: These are the workhorses behind your application, storing all the data it needs to function properly. This includes user information, product details, transactions, and more. Think of them as the backbone of your app.
Examples: PostgreSQL, MySQL, MongoDB
2. Databases for Analytics: Want to analyze user behavior, predict trends, or train machine learning models? These databases are your allies. They’re designed for fast and efficient querying of massive datasets, empowering data teams to unlock valuable insights.
Examples: Snowflake, Redshift, BigQuery
3. Databases for Operations: Keeping your app running smoothly requires continuous monitoring, logging, and security measures. These databases handle this critical task, ensuring efficient operation and troubleshooting.
Examples: Elasticsearch, Prometheus, InfluxDB
Remember, some databases may overlap categories. For instance, Redis, a NoSQL database, can be used for both app development and in-memory operations.
Understanding these categories and examples will help you select the right database for your specific needs.
Databases that power a user-facing app
The databases that drive a user-facing app, commonly known as production databases or production data stores, are the places where developers store and retrieve the data necessary for their app to function. All the information you see on your screen is stored in these databases.
- If you’re Twitter: tweets, user profiles, trending topics, DMs
- If you’re Gmail: emails, settings, spam filters
- If you’re Amazon: orders, users, credit cards
Production databases are built to support small and quick queries. The most important thing is data integrity and reliability — you don’t want to lose an order or a credit card.
User-facing DBs / Relational Databases
Relational databases are the OGs, having been in existence since the early days of database technology. They remain the go-to option for storing essential data in modern applications. In these databases, data is organized in a highly structured format with well-defined tables, columns, and relationships between tables. This structured setup ensures that queries can be executed swiftly and dependably.
- Postgres is an open source relational database initially released back in 1997. It’s one of the standard “boilerplate” choices for teams getting started on a new app. You can run Postgres yourself on a server, or pay someone like Amazon to run it for you.
- MySQL is a free-to-use relational database that first came out in 1995. Many teams choose it as their go-to database. However, it’s worth noting that since Oracle acquired Sun (and thus MySQL), there are legal considerations if you want to change the MySQL code for commercial purposes. When comparing MySQL to Postgres, they are quite similar, but they do have some key differences in how they handle data types, storage engines, and a few other aspects.
- MSSQL is Microsoft’s own SQL database. It’s widely used by big companies but not as much by smaller startups.
- Oracle was the first-ever commercial relational database. Although it’s known for being a slow-moving company today, it still runs the apps for many Fortune 50 companies. Like MSSQL, Oracle is meant for large organizations, not startups.
- MariaDB is a free relational database created by the original makers of MySQL. They seem a bit unhappy with Oracle’s restrictions on MySQL contributions, so they highlight the open community and the fact that their software will never need a license.
- SQLite is a library that allows you to have a small database within your application. It’s used in many apps, typically as a middleman database rather than the main storage. According to the website, it’s on every Android, iPhone, and more.
User-facing DBs / NoSQL Databases
- NoSQL databases simplify things by getting rid of the strict structure of relational databases. You can just put your data in without worrying about the format right away. While there are various types of NoSQL databases (for analytics, operations), the ones mentioned here are specifically for user-facing and production use.
- MongoDB made NoSQL popular and is the go-to if you need a NoSQL database for your app. Their cloud service, Atlas, is like the AWS RDS but for basic NoSQL needs.
- Cassandra is a NoSQL database designed for big companies with massive data needs. It’s focused on storing and retrieving data quickly. Unlike MongoDB, which is a document database, Cassandra is columnar. This means it stores data in entire columns, similar to Snowflake. Using Cassandra is more similar to using a relational database.
- DynamoDB is AWS’s proprietary NoSQL database.
- Firebase is a set of tools (bought by Google a while ago) for making apps, especially for smaller teams who want simplicity. Their database is called Firestore, and it’s a popular choice when you need a quick and easy database to get started.
User-facing DBs / Graph Databases
Each company’s data is unique, and for some, it’s best represented as a graph of connected nodes.
- Facebook, for example, uses an in-house graph database called Tao to manage their social connections. Graph databases are mainly used for user-facing applications, but they can also be handy for analytical purposes at times.
- Neo4J is the go-to graph database, and it’s open source. If you plan to scale up, you might need to check out their enterprise license. In Neo4J, data is shown as nodes, and these nodes can be linked in various ways. They use a language called Cypher, kind of like SQL, for querying the graph.
Databases that power analytics
What a data team wants from their database is quite different from what app developers need. In data science and analytics, stored data is often repetitive, there’s a large volume of it, and queries involve combining information from various tables. Typically, someone at a computer doing research or a system creating a pipeline queries the data, and it gets added at regular intervals (like twice a day).
Analytical DBs / Data Warehouses
Data warehouses are like special databases designed for analyzing data, such as user activities or monthly revenue. They’re good at handling big, complex queries involving multiple tables. While they’re usually relational, the way they’re set up can vary.
- Snowflake is a cloud data warehouse for analytics. It organizes data column-wise, which makes big queries faster, making it a popular choice for analytical databases.
- BigQuery is another cloud data warehouse for analytics and is quite similar to Snowflake.
- Redshift is also a cloud data warehouse for analytics, and it’s very much like Snowflake and BigQuery.
- ClickHouse, on the other hand, is an open-source data warehouse for analytics that gained popularity after being developed at Yandex., on the other hand, is an open-source data warehouse for analytics that gained popularity after being developed at Yandex.
Why can’t I use Postgres as a data warehouse?
You could use traditional methods, and some people still do. However, modern cloud data warehouses are much faster for handling big queries on substantial amounts of data. When it comes to databases, you can technically use any tool for any task, but there’s a point where you need a specialized tool to get the job done efficiently.
Analytical DBs / Data Lakes
- Data Lakes are like huge safes for data. You can put in all sorts of data — organized, messy, big, small — and worry about organizing it when you want to use it. This is different from a data warehouse, which, like a relational database, has strict rules about tables, columns, and data types.
- S3 is Amazon’s storage solution, usually used for images and videos, but it’s also commonly used as a data lake. You can’t directly “query” S3 like a database, so you need another layer on top (like AWS’s Lake Formation) to help with that.
- Databricks has something called the “lakehouse platform,” which, underneath the marketing, is an open source data lake that acts like a data warehouse, blurring the line between the two.
While you probably won’t see it much these days, HDFS used to be a popular way to build a data lake before the era of cloud data warehouses. It’s powered by Hadoop, an original framework for handling big data, but it’s known for being really, really hard to set up and run.
Databases that power operations
You won’t come across this type much unless you’re an engineer. The third category of databases is for data stores that help developers run internal operations. These include monitoring how well an application is doing, storing logs and security info, making apps faster, and acting as go-betweens for other databases.
Operational DBs / Key Value Stores
While most of the databases we’ve talked about store data permanently on a hard drive, there’s a special class that only keeps data in memory. These are for temporary data that you need to store and use quickly, even if it’s okay if it disappears later. These are called key-value stores (or KV stores) because they store data like a dictionary, where each entry has a key.
- Redis is the original key-value store. It’s an in-memory database used for various things like building caches, managing authentication sessions, and real-time data retrieval for chat and messaging. Redis is open source, but you can also pay them to host it for you.
- SingleStore (formerly known as MemSQL) is another in-memory database like Redis, but it’s more focused on general-purpose workloads rather than the typical use cases of in-memory databases.
Operational DBs / Time Series Databases
- Time series databases are designed for storing data based on time, like daily financial records, second-by-second sensor readings, or hourly health checks on servers.
- Timescale is like an upgrade to Postgres, but it’s sold as a separate cloud-hosted database. It takes the familiar relational database and adds features specifically for handling time-series data, such as automatic partitioning and optimized queries.
- Prometheus is an open-source time series database that comes with built-in alerting and visualization features.
- InfluxDB is another open-source time series database.
Operational DBs / Logs and Search Databases
- Elastic is a NoSQL database made for storing and searching logs, which are detailed records of things like server performance, API requests, and internal information. Elasticsearch is often used with Kibana, its companion tool for visualizing data.
- Solr is another NoSQL database designed for storing and searching logs.
Conclusion
From user-facing databases powering applications to analytical databases unlocking insights, and operational databases ensuring smooth operations, each type serves a distinct purpose. Understanding these distinctions helps in making informed choices, ensuring that, just like selecting the perfect flour for baking, the foundation for success is laid in the world of databases.