International Virtual Observatory Alliance

Difference: VOTableUnicode20141016 ( vs. 1)

Revision 12014-10-16 - DaveMorris

META TOPICPARENT	name="DaveMorris"

Note - for the purposes of discussion the following text uses a bytecount attribute instead of an arraysize attribute in the VOTable headers to make it clear that the attribute represents the number of bytes in each FIELD not the number of characters.

Unicode in PostgreSQL and VOTable

Experiments for generating VOTable FIELD headers from strings containing unicode code points in a PostgreSQL database.

Deploy a default PostgreSQL database instance using the official PostgreSQL Docker container.

    docker run --name 'postgres-test' -d postgres

Run the psql command line client in another container, and check the default database encoding.

    docker run -it --rm --link 'postgres-test:postgres' --rm postgres sh -c 'exec psql -h "$POSTGRES_PORT_5432_TCP_ADDR" -p "$POSTGRES_PORT_5432_TCP_PORT" -U postgres --list'

All the databases and templates are stored using UTF8 encoding.

                                         List of databases
           Name    |  Owner   | Encoding |  Collate   |   Ctype    |   Access privileges   
        -----------+----------+----------+------------+------------+-----------------------
         postgres  | postgres | UTF8     | en_US.utf8 | en_US.utf8 | 
         template0 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +
                   |          |          |            |            | postgres=CTc/postgres
         template1 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +
                   |          |          |            |            | postgres=CTc/postgres

Run the psql command line client in another container, and create a test database.

    docker run -it --rm --link 'postgres-test:postgres' --rm postgres sh -c 'exec psql -h "$POSTGRES_PORT_5432_TCP_ADDR" -p "$POSTGRES_PORT_5432_TCP_PORT" -U postgres'

    CREATE DATABASE testdata ;
    \c testdata ;
    CREATE TABLE testtable (
        ident SERIAL,
        name CHAR(4)
        );

Add a row containing only ASCII characters.

    INSERT INTO testtable (name) VALUES ('IVOA') ;

Add a row containing the Greek letter Omega.

    INSERT INTO testtable (name) VALUES (U&'IV\03A9A') ;

Check the table contents.

    SELECT * FROM testtable ;

     ident | name 
    -------+------
         1 | IVOA
         2 | IVΩA

Compare the character count and byte count.

    SELECT ident, name, char_length(name), octet_length(name) FROM testtable ;

     ident | name | char_length | octet_length 
    -------+------+-------------+--------------
         1 | IVOA |           4 |            4
         2 | IVΩA |           4 |            5
    (2 rows)

It is possible calculate the maximum byte count for a colum.

    SELECT max(char_length(name)), max(octet_length(name)) FROM testtable ;

     max | max 
    -----+-----
       4 |   5
    (1 row)

Note, this only works in this example because the PostgreSQL database storage encoding is UTF-8 and the stored size is the same as the UTF-8 encoded size, change the database encoding and the byte count may be different.

This will not work the same way on other database platforms.

If we use the calculated byte count in the VOTable header, then the array size of the column will depend on the contents of the database at the time when the byte count was calculated and/or the VOTable was generated.

With only the first row the maximum byte count is 4.

    INSERT INTO testtable (name) VALUES ('IVOA') ;
    SELECT max(octet_length(name)) FROM testtable ;
     max 
    -----
       4
    (1 row)

    <FIELD name='name' type='char' bytecount='4'/>

Adding the second row changes the maximum byte count to 5.

    INSERT INTO testtable (name) VALUES (U&'IV\03A9A') ;
    SELECT max(octet_length(name)) FROM testtable ;
     max 
    -----
       5
    (1 row)

    <FIELD name='name' type='char' bytecount='5'/>

Deleting the second row changes the maximum byte count back to 4.

    DELETE FROM testtable WHERE ident = 2 ;
    SELECT max(octet_length(name)) FROM testtable ;
     max 
    -----
       4
    (1 row)

    <FIELD name='name' type='char' bytecount='4'/>

So, in this particular example it is possible to calculate the maximum number of bytes for a column, but only by scanning all the data in the column before we generate the VOTable header.

For non-static data we would need to calculate the byte count for the VOTable header in the same transaction as the SELECT statement that generates the VOTable data to avoid a potential race condition.

This may not be possible on other database platforms.

If we do use the calculated byte count to set the array size in the VOTable header.

    SELECT max(octet_length(name)) FROM testtable ;
     max 
    -----
       5
    (1 row)

    <FIELD name='name' type='char' bytecount='5'/>

If that VOTable is uploaded to another TAP service, we can't reverse engineer the number of characters from the number of bytes, so the uploaded table has to be created with variable size.

    CREATE TABLE uploaded (
        ident SERIAL,
        name CHAR
        );

This works in this example because in PostgreSQL there is no real distinction between fixed and variable length strings.

This may cause performance problems on other database platforms.

View topic | History: r1 | More topic actions...