The last time Hackerfall tried to access this page, it returned a not found error. A cached version of the page is below, or click here to continue anyway

Managing big enough data in postgres – arka abs

Postgres can work quite well for large databases without having to look at NoSQL solutions. But if you find your tables getting larger than the physical memory, you might want to look for ways to optimize your peformance. In my current consulting assignment, I have to deal with a table that is 104 gigabytes on Amazon RDS running Postgres. This table is at the heart of the system and any query that as much as glances at the table sideways gets incredibly slow. To make matters worse, we had to run a migration on that table that changed the data type of one of the columns from bigint to varchar.

That would just not work. The one way to solve such situations is to partition the table. PostgreSQL has a feature called as table inheritance that allows child tables to share the same structure of the parent table and the child tables can participate in queries to the parent table. What is cool about this is that postgres also supports the concept of table spaces that lets multiple tables be in separate directories or mount points. So technically, you can mount more often accessed data on fast SSDs or an EBS volume with higher provisioned IOPS and move older data to slower low cost storage devices. This also has the additional benefit of reducing the size of working set and being able to perform seq scans in memory rather than index scans on disk. The data I was dealing with was largely time series data and time seemed like the obvious way to partition it. The first order of business was to export existing data to CSV files using the \copy command using limit and offset. The table I was dealing with contained close to about a 100 million rows with multiple JSON fields so I exported them a million at a time.

The next step was to create a clone table that had the same structure as the large table.

create table large_table_clone as 
   select * from large table where 1=2;

The you can create child tables by specifying the check condition and the table you want to inherit from.

create table large_table_clone_201504 (
  check(start_time >= date '2015-04-01' 
        start_time < date '2015-05-01')
) inherits(large_table_clone)

We can then create a trigger on the parent table to send records to the child table based on the inserted record’s start_time. The trigger would look something like this.

CREATE OR REPLACE FUNCTION partition_function()
  IF ( NEW.start_time <= DATE '2015-04-01' 
       NEW.start_time > DATE '2015-05-01' ) THEN
    INSERT INTO large_table_clone_201504 VALUES (NEW.*);
  ELSIF ( NEW.start_time <= DATE '2015-03-01' 
          NEW.start_time > DATE '2015-04-01' ) THEN
    INSERT INTO large_table_clone_201503 VALUES (NEW.*);
    RAISE EXCEPTION 'Date out of range';
LANGUAGE plpgsql;

CREATE TRIGGER large_table_clone_partition
BEFORE INSERT ON large_table_clone
FOR EACH ROW EXECUTE PROCEDURE partition_function();

The only trouble was I was not sure of the valid date range in my dataset and there was no hope of getting to it from the existing table. So I had to create tables on the fly as data came in.

create or replace 
function partition_function() 
returns trigger as
  table_master varchar := 'large_table_clone';
  table_part varchar := '';
  start_date date := null;
  end_date date := null;
  table_count int;
  table_part := table_master || 
                '_' || 
  start_date := date_trunc('month',NEW.start_time)::date;
  end_date := (date_trunc('month',NEW.start_time) 
               + '1 month')::date;

  execute format('select tablename from pg_tables 
                  where tablename=%L limit 1',table_part);

  get diagnostics table_count = ROW_COUNT;

  if table_count < 1 then
    execute format(
     'create table if not exists %I (
        check( start_time <@ [%L,%L]::tsrange)
      ) inherits (%I)', 
     table_part, start_date, end_date, table_master);
  end if;

  execute format(
   'insert into %I values (($1).*)', 
   table_part) using NEW;

  return null;
language plpgsql volatile
cost 100;

This dynamically creates tables as fills in the data as they are inserted into large_table_clone. In case you were wondering the COPY command respects the insert trigger so you can bulk load the data exported via CSV files to the new tables.

Since we have table inheritance setup, we can now run queries against the large_table_clone table and it will transparently run the queries across all the child tables and get us the result. But this puts us back where we started. If you really want to get the benefits of partitioning, you will have to use constraint exclusion.

set constraint_exclusion=on;
select * from large_table_clone 
where start_time between '2015-01-01' 
                         and '2015-02-15';

This will ensure that it only performs the queries on two of the partitions rather than all the partitions. The next obvious step was to create an index on large_table_clone_yyyy_mm partitions on the start_time and cluster the tables using that index. This worked in our cases as the records are usually immutable and older partitions can hence be safely clustered. This made the large table a lot more manageable and we can feel safe archiving old data without impacting the performance of the production system.

Like this:

Like Loading...


Continue reading on