[SQL] Help: Function for splitting VARCHAR column and migrating its data to 2 new tables

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

[SQL] Help: Function for splitting VARCHAR column and migrating its data to 2 new tables

ogjunk-pgjedan
I am restructuring my DB schema and need help migrating data from 1
column of an existing table to two new tables.  I have some Java code
that can do this for me, but it's very slow, and I am now hoping I can
migrate this data with some clever SQL instead.

Here are my 3 tables:

user_data (existing, old table)
---------
  id           (PK),
  user_id      (FK)
  keywords     VARCHAR(256)
    -- this contains comma separated keywords
    -- e.g. "new york,san francisco, dallas, food"
    -- also "keywords without strings are really just 1 keyword"
  add_date     TIMESTAMP


So now I'm trying to migrate this "keywords" VARCHAR column to a more
normalized schema:

user_data_keyword (new lookup table to populate)
-----------------
  id           (PK) -- I may change PK to PK(user_data_id, keyword_id)
  user_data_id (FK)
  keyword_id   (FK)


keyword (new table to populate)
-------
  id           (PK)
  name         VARCHAR(64) NOT NULL UNIQUE
  add_date     TIMEZONE


I just found
http://www.postgresql.org/docs/current/static/functions-string.html ,
but if anyone could lend me a hand by getting me started with writing a
function for this, I'd really appreciate it.

Thanks,
Otis


---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
Reply | Threaded
Open this post in threaded view
|

Re: [SQL] Help: Function for splitting VARCHAR column and migrating its data to 2 new tables

Tony Wasson
On 5/23/05, [hidden email] <[hidden email]> wrote:

> I am restructuring my DB schema and need help migrating data from 1
> column of an existing table to two new tables.  I have some Java code
> that can do this for me, but it's very slow, and I am now hoping I can
> migrate this data with some clever SQL instead.
>
> Here are my 3 tables:
>
> user_data (existing, old table)
> ---------
>   id           (PK),
>   user_id      (FK)
>   keywords     VARCHAR(256)
>     -- this contains comma separated keywords
>     -- e.g. "new york,san francisco, dallas, food"
>     -- also "keywords without strings are really just 1 keyword"
>   add_date     TIMESTAMP
>
>
> So now I'm trying to migrate this "keywords" VARCHAR column to a more
> normalized schema:
>
> user_data_keyword (new lookup table to populate)
> -----------------
>   id           (PK) -- I may change PK to PK(user_data_id, keyword_id)
>   user_data_id (FK)
>   keyword_id   (FK)
>
>
> keyword (new table to populate)
> -------
>   id           (PK)
>   name         VARCHAR(64) NOT NULL UNIQUE
>   add_date     TIMEZONE
>
>
> I just found
> http://www.postgresql.org/docs/current/static/functions-string.html ,
> but if anyone could lend me a hand by getting me started with writing a
> function for this, I'd really appreciate it.
>
> Thanks,
> Otis
>

I am not aware of any extremely clever SQL to make this ALL happen.
However you can do everything with pl/pgsql. My contribution below
will build the keyword table for you. Once you have this working and
you understand it, you could extend it to build your user_data_keyword
 table.

CREATE OR REPLACE FUNCTION split_on_commas(TEXT) RETURNS SETOF TEXT
LANGUAGE 'plpgsql' AS '
DECLARE
   mystring  ALIAS FOR $1;
   incomma   BOOLEAN := FALSE;       -- ## Catch the first word
   endpos    INTEGER; -- end character
   startpos  INTEGER := 0;
   pos       INTEGER;
   outstring TEXT;
BEGIN
   SELECT INTO endpos CHAR_LENGTH(mystring);
   FOR pos IN 1 .. endpos
   LOOP
       -- There are 2 single quotes, a space, and 2 single quotes below:
       IF SUBSTRING(mystring,pos,1) =  '','' THEN
           incomma := TRUE;
           outstring := SUBSTRING(mystring,startpos,pos-startpos);
           RETURN NEXT outstring;
       ELSE
           IF incomma IS TRUE THEN
               incomma := FALSE;
               startpos := pos;
           END IF;
       END IF;
   END LOOP;
   IF incomma IS FALSE THEN     -- ## Catch the last phrase
       outstring := SUBSTRING(mystring,startpos,endpos);
       RETURN NEXT outstring;
   END IF;
RETURN;
END;
';

-- try it! You will get a set of keyword all split out
SELECT * FROM split_on_commas('foo1,bar2,foo3');

 From here, we need a migration function that will loop through row(s)
in your table. I also noticed that some of your words may have leading
and trailing spaces. The function drop_first_and_last_space should
take care of that. Notice that I use this function below in the
build_keyword_table function.

CREATE OR REPLACE FUNCTION drop_first_and_last_space(TEXT) RETURNS
TEXT LANGUAGE 'plpgsql' AS '
DECLARE
   in_string ALIAS FOR $1;
   out_msg TEXT;
BEGIN
   out_msg := in_string;
   IF SUBSTRING(out_msg,1,1) =  '' '' THEN
       RAISE NOTICE ''dropping leading space'';
       out_msg := substring(out_msg,2,length(out_msg));
   END IF;
   --and the last space
   IF SUBSTRING(out_msg,length(out_msg),1) =  '' '' THEN
       RAISE NOTICE ''dropping trailing space'';
       out_msg := substring(out_msg,1,length(out_msg)-1);
   END IF;
RETURN out_msg;
END
';

-- This function uses the INSERT ((SELECT)EXCEPT(SELECT)); syntax
mentioned in http://www.varlena.com/varlena/GeneralBits/19.htm. Note
that I am unsure why I didn't need to use FOR r in EXECUTE ''SELECT...
syntax. But this did work in my small scale test.

CREATE OR REPLACE FUNCTION build_keyword_table(TEXT) RETURNS TEXT
LANGUAGE 'plpgsql' AS '
DECLARE
   match   ALIAS FOR $1;
   rec     RECORD;
   r       RECORD;
   out     TEXT := ''done'';
BEGIN
   FOR rec IN SELECT keywords
       FROM user_data
       WHERE id SIMILAR TO match
       ORDER BY id
   LOOP
       RAISE NOTICE '' working on %'',rec.keywords;
       FOR r IN SELECT drop_first_and_last_space(split_on_commas) AS
kw FROM split_on_commas(rec.keywords)
       LOOP
           RAISE NOTICE ''trying to insert - %'',r.kw;
           INSERT INTO keyword (name) ( (SELECT r.kw) EXCEPT (SELECT
name FROM keyword WHERE name=r.kw));
       END LOOP;
   END LOOP;
RETURN out;
END;
';

Then try it all together, like so:
SELECT build_keyword_table('%');
The input goes to a SIMILAR TO on id - and % means all elements. You
can match a single row by using something like SELECT
build_keyword_table('123');

I also used this as my test data... It worked for me!

CREATE TABLE user_data (
 id           SERIAL,
 user_id      INTEGER,
 keywords     VARCHAR(256) NOT NULL,
 add_date     TIMESTAMP,
 PRIMARY KEY(id)
);

INSERT INTO user_data (keywords) VALUES ('new york,san francisco,
dallas, food');
INSERT INTO user_data (keywords) VALUES ('phoenix, hot, summer, fun');

CREATE TABLE keyword (
 name         VARCHAR(64) NOT NULL,
 id           SERIAL,
 add_date     TIMESTAMP,
 PRIMARY KEY(name)
);
-- todo put a UNIQUE INDEX on keyword (id)

I hope this helps.
Tony Wasson

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to [hidden email]