System Design - Design a URL Shortener

Sunday, December 03, 2023 by Ujjwal Bhardwaj

Functional Requirements

As a user, I should be able to input a URL and retrieve a shortened URL.
As a user, when I enter the shortened URL in a browser, it should redirect me to the original URL location.
As a system, I should expire and clean up old URLs.
As a system, I should store the analytics usage information of the URLs, including clicks, locations, referrers, etc.
As a system, I should allow users to generate custom shortened URLs.

Non-Functional Requirements

Availability: High availability is crucial because users and automated systems expect short links to redirect to their long URL counterparts at all times. This often involves redundant infrastructure, failover mechanisms, and possibly multi-region deployment to withstand regional outages.
Scalability: Horizontal scalability is often more relevant, as you would need to handle large volumes of requests and data. This includes scaling the database to store the increasing number of URLs and scaling the web servers to handle the increasing request load.
Latency: Low latency is critical because users expect immediate redirection from the shortened URL to the original URL.
Robustness: Handling invalid inputs gracefully, maintaining service during network failures, server crashes, and managing unexpected surges in traffic.

Capacity Requirements

Assume there are 500 million URLs generated per month.
Assume that there is a 100:1 read to write ratio.
Assume that it takes around 500 bytes to store the content for a single URL.

This implies that it requires 250GB storage per month for URL storage. Additionally, to store each shorten URL key (6 bytes each), it requires a total of 3GB storage.

High Level Design

ShortenURLHLD

Components

Load Balancer - Used to balance load to the Web Server instances.
Web Server - It is a backend for front-end service that is the single point of contact of all the incoming request from the client. These requests after proper authentication/authorization are redirected to the different microservices.
URLShortenerService - Service that hosts APIs to shorten the URL as well as returning the original URL from a shorten URL.
KeyGenerationService - Service that is responsible for generating and returning a key being used by the URL shortener.
KeyDatabase - Database to store the generated keys
URL Database - Database to store the longURL to short URL Mapping
Analytics Service - Service to store and access analytics data

Design Dive Deep

Key Generation

Approach 1: Generate Hash value of each long URL

This approach requires generating md5 or sha-256 of long URL and split first 6 characters of the URL.

Consequences:

Pros:
- Easy to integrate a library to generate hash value
Cons:
- It can lead to collisions as multiple the fist 6 characters of multiple hash values can be same.

Note: To resolve these collisions, we can check in the datastore if the key already exists. If it does not, we can use the generated key and store it in the DB. Otherwise, we can recursively add some predefined string to the original URL as a prefix until a unique key is generated. However, this will lead to an increase in latency and could also put a load on the database.

Approach 2: Use base62 encoder with a sequence number

This approach involved using a base62 encoder with a sequence number. The unique sequence number will be the auto-increment id of the Database entry of the long-URL.

Consequences:

Pros:
- There will be no collisions as the auto-increments Id will be unique.
Cons:
- It will be easy to predict other URLs that can lead to some security concern.

Approach 3: Offline URL Generation

This approach requires generating a random string of length 6 and storing it in a database. Whenever there is a request to generate a new shortened URL, we can choose a key and mark the record as used once the key is returned to the upstream.

Consequences:

Pros:
- There will be no collisions as the id will be unique.
Cons:
- We need to handle concurrency issue when multiple upstream calls fetches same string.

Database Strategy:

Approach 1: Using Relational Database

Consequences:

Pros:
- There is already a defined schema for the record stored that may rarely get updated.
Cons:
- As we need to store billion of records, it will be difficult to store in a single relational DB.
- To scale the database, we would need to partition it which will add additional infrastructure overhead.

Approach 2: Using NoSQL Database

Consequences:

Pros:
- There is no relationship between records. Thus, there will be no requirements to perform complex queries and joins.
- We can store huge number of records with sharding. faster read heavy
Cons:
- We already have a defined schema.

Frequently Asked Questions

How will you extend the design to support analytics?

WIP