Database Indexing
Table of content
How database store Index and Table
Structure of Index (B-Trees)
Logarithmic scalability of B-Trees
Create Index Implementation
How database store Index and Table
[pic 1]
Row_id helps database to uniquely identified each rows. Some databases (like Postgresql) create their own system maintained id as row_id while some database like MySQL use the primary key as their row_id.
[pic 2]
Heap stores pages which are fixed sized memory location (disk storage). Each pages contains multiple table rows. Input/Output(IO) are read request to database, we want to reduce this as much as possible.
During a single IO, database reads one or more pages. Each pages has a size (find out Postgresql and MySql sizes)
So some important questions that affects query performance are:
How many pages are we pulling?
How many IO are we making?
[pic 3]
Indexes are pointers to rows stored in the heap. They are organized as a B-Tree data structure. Indexes makes our search faster.
For example, we want to search the marks for student id 005, we first go to the heap to search for the index with value 005, once we found it, the index will tell us where the row is located in the heap, that way we don’t have to check row by row in the heap which takes a long time.
Structure of Index (B-Trees)
[pic 4]
We start the traversal from the root node, once we reach an index larger than the target value, we traverse down one level until we find reach the leaf node.
Logarithmic scalability of B-Trees
Create Index Implementation
Create a PostgreSQL database inside a Docker container
docker run -e POSTGRES_PASSWORD=postgres --name pg postgres
Once we saw the expected output below we are good to proceed
Next, open a new terminal and we want to open the terminal inside the container
docker exec -it pg psql -U postgres
We create a table which contains the student's id, marks and age
CREATE TABLE grades (
student_id SERIAL PRIMARY KEY,
marks INT,
age INT
);
We populate our table with 10000 synthetic data
INSERT INTO grades (marks, age)
SELECT random()*100, random()*30 FROM generate_series(0,10000);
Check the first 10 rows of our table
SELECT * FROM grades LIMIT 10;
Check the total number of rows in our table
SELECT COUNT(*) FROM grades;
EXPLAIN ANALYZE SELECT id FROM grades WHERE student_id = 450;
Index Only Scan: We only search through the index for student_id=450, we do not go to the heap to fetch for pages hence Heap Fetches is 0.
Planning time: the database is deciding whether it should scan the index or do a full table scan to fetch student_id=450
Execution time: time taken to query the data
EXPLAIN ANALYZE SELECT marks FROM grades WHERE student_id = 450;
We can see that the execution time is now 2.52 ms which is much longer then the previous one. This is because marks is not an index so the database first have to traverse down the B-tree to find the index student_id=450, then it will go to the heap to find the row which contains the marks for student_id=450.
EXPLAIN ANALYZE SELECT id FROM grades WHERE age=15;
The execution time is 1.235ms which is significantly longer than the last 2 queries because the age column is not an index, so the database has to do a Sequential Scan (Seq Scan) and go to the heap and scan all the 10000 rows.
Now we create an index for age
CREATE INDEX student_age ON grades(age);
EXPLAIN ANALYZE SELECT id FROM grades WHERE age=15;
Now the query is way faster and it uses Bitmap Heap Scan.