TWiki
>
IVOA Web
>
TWikiUsers
>
DaveMorris
>
VOTableUnicode20141016
(2014-10-16,
DaveMorris
)
(raw view)
E
dit
A
ttach
Note - for the purposes of discussion the following text uses a =bytecount= attribute instead of an =arraysize= attribute in the VOTable headers to make it clear that the attribute represents the number of *bytes* in each FIELD not the number of _characters_. ---++ Unicode in PostgreSQL and VOTable Experiments for generating VOTable FIELD headers from strings containing unicode code points in a [[http://www.postgresql.org/][PostgreSQL]] database. Deploy a default PostgreSQL database instance using the official [[https://registry.hub.docker.com/_/postgres/][PostgreSQL]] [[https://www.docker.com/whatisdocker/][Docker]] container. <pre> docker run --name 'postgres-test' -d postgres </pre> Run the [[http://www.postgresql.org/docs/9.3/static/app-psql.html][psql]] command line client in another container, and check the default database encoding. <pre> docker run -it --rm --link 'postgres-test:postgres' --rm postgres sh -c 'exec psql -h "$POSTGRES_PORT_5432_TCP_ADDR" -p "$POSTGRES_PORT_5432_TCP_PORT" -U postgres --list' </pre> All the databases and templates are stored using UTF8 encoding. <pre> List of databases Name | Owner | Encoding | Collate | Ctype | Access privileges -----------+----------+----------+------------+------------+----------------------- postgres | postgres | UTF8 | en_US.utf8 | en_US.utf8 | template0 | postgres | UTF8 | en_US.utf8 | en_US.utf8 | =c/postgres + | | | | | postgres=CTc/postgres template1 | postgres | UTF8 | en_US.utf8 | en_US.utf8 | =c/postgres + | | | | | postgres=CTc/postgres </pre> Run the psql command line client in another container, and create a test database. <pre> docker run -it --rm --link 'postgres-test:postgres' --rm postgres sh -c 'exec psql -h "$POSTGRES_PORT_5432_TCP_ADDR" -p "$POSTGRES_PORT_5432_TCP_PORT" -U postgres' </pre> <pre> CREATE DATABASE testdata ; \c testdata ; CREATE TABLE testtable ( ident SERIAL, name CHAR(4) ); </pre> Add a row containing only ASCII characters. <pre> INSERT INTO testtable (name) VALUES ('IVOA') ; </pre> Add a row containing the Greek letter Omega. <pre> INSERT INTO testtable (name) VALUES (U&'IV\03A9A') ; </pre> Check the table contents. <pre> SELECT * FROM testtable ; </pre> <pre> ident | name -------+------ 1 | IVOA 2 | IVΩA </pre> Compare the character count and byte count. <pre> SELECT ident, name, char_length(name), octet_length(name) FROM testtable ; ident | name | char_length | octet_length -------+------+-------------+-------------- 1 | IVOA | 4 | 4 2 | IVΩA | 4 | 5 (2 rows) </pre> It is possible calculate the maximum byte count for a colum. <pre> SELECT max(char_length(name)), max(octet_length(name)) FROM testtable ; max | max -----+----- 4 | 5 (1 row) </pre> Note, this only works in this example because the PostgreSQL database storage encoding is UTF-8 and the stored size is the same as the UTF-8 encoded size, change the database encoding and the byte count may be different. *This will not work the same way on other database platforms.* ---- If we use the calculated byte count in the VOTable header, then the array size of the column will depend on the contents of the database at the time when the byte count was calculated and/or the VOTable was generated. With only the first row the maximum byte count is 4. <pre> INSERT INTO testtable (name) VALUES ('IVOA') ; SELECT max(octet_length(name)) FROM testtable ; max ----- 4 (1 row) <FIELD name='name' type='char' bytecount='4'/> </pre> Adding the second row changes the maximum byte count to 5. <pre> INSERT INTO testtable (name) VALUES (U&'IV\03A9A') ; SELECT max(octet_length(name)) FROM testtable ; max ----- 5 (1 row) <FIELD name='name' type='char' bytecount='5'/> </pre> Deleting the second row changes the maximum byte count back to 4. <pre> DELETE FROM testtable WHERE ident = 2 ; SELECT max(octet_length(name)) FROM testtable ; max ----- 4 (1 row) <FIELD name='name' type='char' bytecount='4'/> </pre> So, in this particular example it is possible to calculate the maximum number of bytes for a column, but only by scanning all the data in the column before we generate the VOTable header. For non-static data we would need to calculate the byte count for the VOTable header in the same transaction as the SELECT statement that generates the VOTable data to avoid a potential race condition. *This may not be possible on other database platforms.* ---- If we do use the calculated byte count to set the array size in the VOTable header. <pre> SELECT max(octet_length(name)) FROM testtable ; max ----- 5 (1 row) <FIELD name='name' type='char' bytecount='5'/> </pre> If that VOTable is uploaded to another TAP service, we can't reverse engineer the number of characters from the number of bytes, so the uploaded table has to be created with variable size. <pre> CREATE TABLE uploaded ( ident SERIAL, name CHAR ); </pre> This works in this example because in PostgreSQL there is no real distinction between fixed and variable length strings. *This may cause performance problems on other database platforms.*
E
dit
|
A
ttach
|
Watch
|
P
rint version
|
H
istory
: r1
|
B
acklinks
|
V
iew topic
|
Ra
w
edit
|
M
ore topic actions
Topic revision: r1 - 2014-10-16
-
DaveMorris
IVOA
Log in
or
Register
IVOA.net
Wiki Home
WebChanges
WebTopicList
WebStatistics
Twiki Meta & Help
IVOA
Know
Main
Sandbox
TWiki
TWiki intro
TWiki tutorial
User registration
Notify me
Working Groups
Applications
Data Access Layer
Data Model
Distributed Services & Protocols
Registry
Semantics
Interest Groups
Data Curation
Education
Knowledge Discovery
High Energy
Operations
Radio Astronomy
Solar System
Time Domain
Committees
Stds&Procs
www.ivoa.net
Documents
Events
Members
XML Schema
Copyright © 2008-2025 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki?
Send feedback