Indexes

Andrey Aksenov

In traditional databases, indexes can significantly speed up data access. In distributed systems like Greengage DB (based on Greenplum), indexes should be used more sparingly. Greengage DB performs parallel sequential scans, with each segment scanning only its local data subset. Partitioning can further reduce scan size.

Since analytical queries often retrieve large datasets, using indexes is typically not efficient. Before creating indexes, test your queries without them to establish a performance baseline. Indexes might be helpful for OLTP workloads or highly selective queries, especially on compressed append-optimized tables where indexes limit decompression to relevant rows.

PostgreSQL index types

Greengage DB supports the following PostgreSQL index types:

B-tree

B-tree indexes can handle equality and range queries on data that can be sorted in some order. The query planner considers using a B-tree index whenever an indexed column is involved in a comparison using one of these operators: <, <=, =, >=, and >.
GiST

GiST (Generalized Search Tree) indexes are designed to work with complex data types such as geometric objects, text, and arrays. GiST indexes enable efficient searching for spatial, textual, and hierarchical data.
SP-GiST

SP-GiST indexes (space-partitioned GiST), like GiST indexes, offer an infrastructure that supports various kinds of searches. SP-GiST allows you to implement a range of different unbalanced disk-based data structures, such as quadtrees, k-d trees, and radix trees (radix tries).
GIN

GIN (Generalized Inverted Index) indexes are inverted indexes suitable for data values with multiple component values. They are used for full-text search and search in arrays, JSON, and trigrams.

See Index Types in the PostgreSQL documentation for descriptions of these types.

In addition to PostgreSQL index types, Greengage DB also supports Bitmap indexes, which are optimized for analytic workloads.

Bitmap indexes are best suited for data warehousing applications with large amounts of data, many ad hoc queries, and few data modification (DML) transactions. They can be much smaller than other index types while providing the same functionality.

Bitmap indexes are most effective for queries that contain multiple conditions in the WHERE clause. They allow filtering out rows that do not match all conditions before accessing the table.

Bitmap indexes perform best on columns with between 100 and 100000 distinct values, especially when the indexed column is frequently queried in conjunction with other indexed columns. However, they are not suitable for unique or highly selective columns (such as customer names or phone numbers), nor for OLTP systems with heavy concurrent data modifications.

To create an index on the specified column of the specified table or materialized view, use the CREATE INDEX command. To modify the definition of an existing index, use ALTER INDEX.

The following is a simplified syntax of the CREATE INDEX command:

CREATE [UNIQUE] INDEX [<name>] ON <table_name> [USING <method>]
    ( {<column_name> | (<expression>)} [, ...] )
    [ WHERE <predicate> ]

The main clauses and arguments for defining an index are the following:

UNIQUE

Whether to create a unique index. Checks for duplicate values in the table when the index is created (if data already exists) and whenever new data is inserted.

NOTE

Note that defining a unique constraint implicitly creates a unique index.
<name>

The name of the index to be created. The index is always created in the same schema as its parent table. If the name is omitted, Greengage DB chooses a suitable name based on the parent table’s name and the indexed column name.
ON <table_name>

The name (optionally schema-qualified) of the table or materialized view to be indexed.
USING <method>

The name of the index method to be used. Possible values: btree (default), bitmap, gist, spgist, and gin.
<column_name>

The name of the column in the table on which the index is created. Multiple columns can be specified if the index method supports multicolumn indexes. See the Create a multicolumn index section below.
<expression>

An expression based on one or more columns of the table. See the Use index expressions section below.
WHERE <predicate>

Whether to create a partial index. A partial index includes entries for only a portion of a table — typically the ones that benefit most from indexing. See the Create a partial index section below.

The table summarizes the limitations of the index types supported in Greengage DB.

Index method	Multicolumn indexes	Unique indexes	Supported by GPORCA
B-tree	yes	yes (heap tables only)	yes
Bitmap	no	no	yes
GiST	yes	no	yes
SP-GiST	no	no	no
GIN	yes	no	yes

Index method

Multicolumn indexes

Unique indexes

Supported by GPORCA

B-tree

yes

yes (heap tables only)

yes

Bitmap

yes

GiST

yes

SP-GiST

GIN

yes

GPORCA ignores indexes created with unsupported methods. The following features are unsupported when GPORCA is enabled:

Partial indexes
Index expressions

Consider the points below when you create indexes. See also: Examine index usage.

Your query workload.

Indexes improve performance for queries that return a single record or a small dataset.
Compressed tables.

Indexes can improve performance on compressed append-optimized tables for queries that return a targeted set of rows. For compressed data, an index access method means only the necessary rows are uncompressed.
Avoid indexes on frequently updated columns.

Creating an index on a frequently updated column increases the number of writes required when the column is updated.

Create selective B-tree indexes.

Index selectivity is the number of unique values in a column divided by the total number of rows in the table. For example, if a table has 1000 rows and a column has 800 distinct values, the selectivity of the index is 0.8, which is considered good. Unique indexes always have a selectivity ratio of 1.0, which is the best possible.
Use Bitmap indexes for low-selectivity columns.

See Bitmap indexes.
Avoid overlapping indexes.

Indexes with the same leading column are redundant. For example, if there is an index on (category, product), you do not need another one on (category).

Index columns used in joins.

An index on a column used for frequent joins (such as a foreign key column) can improve join performance by enabling more join methods for the query optimizer to use.
Index columns frequently used in predicates.

Columns that are frequently referenced in WHERE clauses are good candidates for indexes. This applies not only to SELECT queries but also to UPDATE and DELETE statements that use such predicates.

Index maintenance

Drop indexes for bulk loads.

For mass loads of data into a table, consider dropping the indexes and re-creating them after the load completes. This is often faster than updating the indexes. See the Rebuild indexes section below.
Consider a clustered index.

A clustered index physically orders records on disk based on the index. When data is stored sequentially (for example, by date), queries that scan a range can fetch data more efficiently using fast sequential reads. See the Cluster tables based on indexes section below.

Prerequisites

To execute commands described in the following sections, connect to the Greengage DB master host using psql as described in Connect to Greengage DB via psql. Create a new database and connect to it:

CREATE DATABASE media_library;
\c media_library

Then, create two tables and fill them with data:

Create tables

AO table

CREATE TABLE movies
(
    id     INTEGER,
    title  TEXT NOT NULL,
    genre  TEXT NOT NULL,
    year   INT,
    rating NUMERIC(3, 1)
)
    WITH (appendoptimized = true, orientation = row)
    DISTRIBUTED BY (id);

INSERT INTO movies (id, title, genre, year, rating)
SELECT gs                                        AS id,
       'Movie ' || gs                            AS title,
       genres[(random() * 4)::int + 1],
       (1925 + (random() * 100)::int),
       round((7.0 + random() * 3.0)::numeric, 1) AS rating
FROM generate_series(1, 100000) gs,
     LATERAL(VALUES(ARRAY ['Sci-Fi', 'Crime', 'Romance', 'Action', 'Drama'])) AS g(genres);

Heap table

CREATE TABLE actors
(
    id   INTEGER,
    name TEXT NOT NULL
)
    DISTRIBUTED BY (id);

INSERT INTO actors (id, name)
SELECT gs                                                                                 AS actor_id,
       first_names[(random() * 4)::int + 1] || ' ' || last_names[(random() * 4)::int + 1] AS name
FROM generate_series(1, 1000) gs,
     LATERAL(VALUES(ARRAY ['John','Emily','Michael','Sarah','David'])) AS f(first_names),
     LATERAL(VALUES(ARRAY ['Smith','Johnson','Williams','Brown','Davis'])) AS l(last_names);

The command below creates a B-tree index on the title column in the movies table:

CREATE INDEX movies_title_idx ON movies (title);

This SELECT query retrieves information about the specified movie by its title using the created index:

SELECT *
FROM movies
WHERE title = 'Movie 12345';

Query plan

                       QUERY PLAN
---------------------------------------------------------
 Gather Motion 4:1  (slice1; segments: 4)
   ->  Bitmap Heap Scan on movies
         Recheck Cond: (title = 'Movie 12345'::text)
         ->  Bitmap Index Scan on movies_title_idx
               Index Cond: (title = 'Movie 12345'::text)
 Optimizer: Pivotal Optimizer (GPORCA)
(6 rows)

See also: Examine index usage.

The query below creates a bitmap index on the genre column:

CREATE INDEX movies_genre_idx ON movies USING bitmap (genre);

The following SELECT query is expected to use the created index:

SELECT AVG(rating) AS avg_rating
FROM movies
WHERE genre = 'Sci-Fi';

Query plan

                           QUERY PLAN
----------------------------------------------------------------
 Aggregate
   ->  Gather Motion 4:1  (slice1; segments: 4)
         ->  Aggregate
               ->  Bitmap Heap Scan on movies
                     Recheck Cond: (genre = 'Sci-Fi'::text)
                     ->  Bitmap Index Scan on movies_genre_idx
                           Index Cond: (genre = 'Sci-Fi'::text)
 Optimizer: Pivotal Optimizer (GPORCA)
(8 rows)

The command below creates an index named movies_genre_year_idx on the genre and year columns:

CREATE INDEX movies_genre_year_idx ON movies (genre, year);

The created index should speed up queries that filter by genre and year, for example:

SELECT AVG(rating) AS avg_rating
FROM movies
WHERE genre = 'Sci-Fi'
  AND year = 2010;

Query plan

                                     QUERY PLAN
------------------------------------------------------------------------------------
 Aggregate
   ->  Gather Motion 4:1  (slice1; segments: 4)
         ->  Aggregate
               ->  Bitmap Heap Scan on movies
                     Recheck Cond: ((genre = 'Sci-Fi'::text) AND (year = 2010))
                     ->  Bitmap Index Scan on movies_genre_year_idx
                           Index Cond: ((genre = 'Sci-Fi'::text) AND (year = 2010))
 Optimizer: Pivotal Optimizer (GPORCA)
(8 rows)

The command below creates a unique B-tree index on the id column in the actors heap table:

CREATE UNIQUE INDEX actors_id_idx ON actors (id);

The example request that uses the index looks as follows:

SELECT *
FROM actors
WHERE id = 123;

Query plan

                   QUERY PLAN
------------------------------------------------
 Gather Motion 1:1  (slice1; segments: 1)
   ->  Index Scan using actors_id_idx on actors
         Index Cond: (id = 123)
 Optimizer: Pivotal Optimizer (GPORCA)
(4 rows)

Partial indexes are supported by the Postgres planner only. To turn off the GPORCA query optimizer at the session level, set the optimizer parameter to off in the following way:

SET optimizer = off;

This request creates a partial index on the year column to optimize queries that target movies released after 2000:

CREATE INDEX movies_year_post_2000_idx ON movies (year)
    WHERE year > 2000;

The example SELECT request might look like this:

SELECT AVG(rating) AS avg_rating
FROM movies
WHERE year = 2020;

Query plan

                               QUERY PLAN
------------------------------------------------------------------------
 Aggregate
   ->  Gather Motion 4:1  (slice1; segments: 4)
         ->  Aggregate
               ->  Bitmap Heap Scan on movies
                     Recheck Cond: (year = 2020)
                     ->  Bitmap Index Scan on movies_year_post_2000_idx
                           Index Cond: (year = 2020)
 Optimizer: Postgres query optimizer
(8 rows)

Index expressions are supported by the Postgres planner only.

The query below creates an index that can be used to perform case-insensitive comparisons with the lower() function:

CREATE INDEX movies_title_lower_idx ON movies (lower(title));

A sample query that uses this index might look like this:

SELECT *
FROM movies
WHERE lower(title) = lower('Movie 12345');

Query plan

                           QUERY PLAN
----------------------------------------------------------------
 Gather Motion 4:1  (slice1; segments: 4)
   ->  Bitmap Heap Scan on movies
         Recheck Cond: (lower(title) = 'movie 12345'::text)
         ->  Bitmap Index Scan on movies_title_lower_idx
               Index Cond: (lower(title) = 'movie 12345'::text)
 Optimizer: Postgres query optimizer
(6 rows)

To view the list of indexes along with their sizes, use the \di+ meta-command:

\di+

The result should look like this:

                                   List of relations
 Schema |           Name            | Type  |  Owner  | Table  |  Size   | Description
--------+---------------------------+-------+---------+--------+---------+-------------
 public | actors_id_idx             | index | gpadmin | actors | 288 kB  |
 public | movies_genre_idx          | index | gpadmin | movies | 960 kB  |
 public | movies_genre_year_idx     | index | gpadmin | movies | 3360 kB |
 public | movies_title_idx          | index | gpadmin | movies | 3360 kB |
 public | movies_title_lower_idx    | index | gpadmin | movies | 3360 kB |
 public | movies_year_post_2000_idx | index | gpadmin | movies | 928 kB  |
(6 rows)

You can also query the pg_indexes system view to get information about indexes:

SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'movies';

The indexdef column shows the definition of each index:

         indexname         |                                            indexdef
---------------------------+------------------------------------------------------------------------------------------------
 movies_title_idx          | CREATE INDEX movies_title_idx ON public.movies USING btree (title)
 movies_genre_idx          | CREATE INDEX movies_genre_idx ON public.movies USING bitmap (genre)
 movies_genre_year_idx     | CREATE INDEX movies_genre_year_idx ON public.movies USING btree (genre, year)
 movies_year_post_2000_idx | CREATE INDEX movies_year_post_2000_idx ON public.movies USING btree (year) WHERE (year > 2000)
 movies_title_lower_idx    | CREATE INDEX movies_title_lower_idx ON public.movies USING btree (lower(title))
(5 rows)

Maintain indexes

After creating an index, you need to collect statistics via ANALYZE:

ANALYZE;

This helps the query optimizer estimate row counts and choose efficient query plans. You can learn more about collecting statistics from Collect statistics via ANALYZE.

In Greengage DB, the pg_stat_*_indexes views display access statistics for indexes only from the master instance. To display index statistics that combine statistics from the master and the segment instances, create the following views:

Create a view providing access statistics for all indexes:

CREATE VIEW pg_stat_all_indexes_ggdb AS
SELECT s.relid,
       s.indexrelid,
       s.schemaname,
       s.relname,
       s.indexrelname,
       m.idx_scan,
       m.idx_tup_read,
       m.idx_tup_fetch
FROM (SELECT relid,
             indexrelid,
             schemaname,
             relname,
             indexrelname,
             sum(idx_scan)      as idx_scan,
             sum(idx_tup_read)  as idx_tup_read,
             sum(idx_tup_fetch) as idx_tup_fetch
      FROM gp_dist_random('pg_stat_all_indexes')
      WHERE relid >= 16384
      GROUP BY relid, indexrelid, schemaname, relname, indexrelname
      UNION ALL
      SELECT *
      FROM pg_stat_all_indexes
      WHERE relid < 16384) m,
     pg_stat_all_indexes s
WHERE m.relid = s.relid
  AND m.indexrelid = s.indexrelid;

Create a view providing statistics for system indexes:

CREATE VIEW pg_stat_sys_indexes_ggdb AS
SELECT *
FROM pg_stat_all_indexes_ggdb
WHERE schemaname IN ('pg_catalog', 'information_schema')
   OR schemaname ~ '^pg_toast';

Create a view providing access statistics for user indexes:

CREATE VIEW pg_stat_user_indexes_ggdb AS
SELECT *
FROM pg_stat_all_indexes_ggdb
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
  AND schemaname !~ '^pg_toast';

You can then query the pg_stat_user_indexes_ggdb view to get access statistics for indexes created in the Create indexes section:

SELECT *
FROM pg_stat_user_indexes_ggdb
WHERE relname IN ('movies', 'actors');

The result should look like this:

 relid  | indexrelid | schemaname | relname |       indexrelname        | idx_scan | idx_tup_read | idx_tup_fetch
--------+------------+------------+---------+---------------------------+----------+--------------+---------------
 450662 |     450686 | public     | movies  | movies_genre_year_idx     |        4 |          131 |             0
 450672 |     450687 | public     | actors  | actors_id_idx             |        1 |            1 |             1
 450662 |     450688 | public     | movies  | movies_year_post_2000_idx |        4 |          997 |             0
 450662 |     450689 | public     | movies  | movies_title_lower_idx    |        4 |            1 |             0
 450662 |     450682 | public     | movies  | movies_genre_idx          |        0 |            4 |             0
 450662 |     450678 | public     | movies  | movies_title_idx          |        4 |            1 |             0
(6 rows)

Use the REINDEX command to rebuild a poorly performing index. REINDEX rebuilds an index using the data stored in the index’s table, replacing the old copy of the index.

The query below recreates all indexes of the movies table:

REINDEX TABLE movies;

The next request recreates the specified index:

REINDEX INDEX movies_genre_idx;

After running REINDEX, the index statistics are reset to zero. To get actual statistics, run ANALYZE on the table after reindexing:

ANALYZE movies;

For bulk data loads, it’s often faster to drop indexes before loading and recreate them afterward rather than updating the indexes during the load:

DROP INDEX movies_genre_idx;
CREATE INDEX movies_genre_idx ON movies USING bitmap (genre);

Storing table data in the same order as an index improves query performance for range scans. When records are physically ordered on disk, range queries can take advantage of fast sequential reads. This can be achieved by creating an intermediate table and loading the data in the desired order:

Create a new table:

CREATE TABLE actors_new
(
    LIKE actors
)
    DISTRIBUTED BY (id);

Load the data from the original table into the new table in the desired order:
```
INSERT INTO actors_new
SELECT *
FROM actors
ORDER BY id;
```
Drop the original table:
```
DROP TABLE actors;
```
Rename the new table with the original table’s name:
```
ALTER TABLE actors_new
    RENAME TO actors;
```

Recreate the index:

CREATE UNIQUE INDEX actors_id_idx ON actors (id);

Run VACUUM ANALYZE for the new table:
```
VACUUM ANALYZE actors;
```

Use the EXPLAIN command to examine index usage for a query. The query below shows a detailed breakdown of the SELECT query from Create a bitmap index:

EXPLAIN (COSTS OFF)
SELECT AVG(rating) AS avg_rating
FROM movies
WHERE genre = 'Sci-Fi';

The result should include Bitmap Heap Scan on the movies table and Bitmap Index Scan on the movies_genre_idx index:

                           QUERY PLAN
----------------------------------------------------------------
 Aggregate
   ->  Gather Motion 4:1  (slice1; segments: 4)
         ->  Aggregate
               ->  Bitmap Heap Scan on movies
                     Recheck Cond: (genre = 'Sci-Fi'::text)
                     ->  Bitmap Index Scan on movies_genre_idx
                           Index Cond: (genre = 'Sci-Fi'::text)
 Optimizer: Pivotal Optimizer (GPORCA)
(8 rows)

Consider the following points when experimenting with indexes:

Test with real data whenever possible. Indexes that work well for test data may not be effective in production.
Avoid very small test sets — they can produce misleading results.
Design test data carefully. Uniform, random, or sorted values can distort statistics and do not reflect real-world usage.
Force index usage for testing by turning off certain plan types (for example, using the enable_seqscan and enable_nestloop configuration parameters).
Compare performance with and without indexes using EXPLAIN ANALYZE.

To drop indexes, use the DROP INDEX command:

DROP INDEX movies_genre_idx;

Note that DROP TABLE always removes any existing indexes for the target table.

Sequences

Views and materialized views

Indexes

Overview

Index types

PostgreSQL index types

Bitmap indexes

Syntax overview

Limitations

Best practices for creating indexes

General indexing strategy

Index types

Indexing for query patterns

Index maintenance

Create indexes

Create a B-tree index

Create a bitmap index

Create a multicolumn index

Create a unique index

Create a partial index

Use index expressions

View the list of indexes

Maintain indexes

Collect statistics

View index statistics

Rebuild indexes

Cluster tables based on indexes

Examine index usage

Drop indexes