Tony’s Oracle Tips

July 2, 2017

CSV parsing and tokenizing strings

Filed under: Uncategorized — tonyhasler @ 2:28 pm

Most of you will be familiar with the “Comma Separated Values” (CSV) data format. It is used in spreadsheets and other places to store ordered lists of character data. The delimiter is typically a comma, and if a comma appears in a list item, the item as a whole must be quoted, typically with double quotes.

So

a,"b,c",d

is actually just three items

a
"b,c"
d

Also, if we wish to include a double quote in an item, it must be doubled to ensure that it is not interpreted as a sentinel for the list item. So, for example:

item 1," this has not ""one"", but ""two"" quoted items",item 3

has just three items.

Frequently we need our Oracle SQL or PL/SQL code to parse CSV strings and convert them into ordered lists of items. External tables are often touted as the solution to this problem, but they come with the not insignificant complication that the data must be stored in flat files on the database server. Can we not just parse the CSV data in our code?

Parsing CSV data has been the topic of many a post, such as this Ask Tom article, but to my mind there are two key parts to this puzzle:

  1. Find a regular expression that will identify the items.
  2. Find a way to separate the string into items.

I struggled with the first part of this for a while but then came across the answer buried in XSLT code here. To demonstrate how this works, let us change the delimiters in a CSV text string to pipes:

WITH q1 AS (
    SELECT 'a,"b,""c""",d' comma_string FROM dual
)
SELECT
    rtrim(regexp_replace(comma_string || ',', '(("[^"]*")+|[^,]*),', '\1|'), '|')
FROM q1;

gives us

a|"b,""c"""|d

Now that we have that bit sorted out, how do we tokenize the data? The Ask Tom article above is one of many to discuss this topic, but my benchmarks suggest that APEX_UTIL.STRING_TO_TABLE offers the best prospects for performance. If you want to pass your items to PL/SQL then you are home and dry, but if you want the items returned to SQL then a little more work is needed.
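If PL/SQL is indeed your destination, a minimal sketch might look like this (assuming APEX is installed; note that a plain comma split like this one does not cope with quoted items, which is what the function below addresses):

DECLARE
    str_table   apex_application_global.vc_arr2;
BEGIN
    -- A plain comma split; quoted items with embedded commas
    -- would need the regular expression treatment shown above
    str_table := apex_util.string_to_table('a,b,c', ',');
    FOR i IN 1..str_table.count LOOP
        dbms_output.put_line(str_table(i));
    END LOOP;
END;
/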

Take a look at this PL/SQL function and its associated types.

CREATE OR REPLACE TYPE col_type AS OBJECT (
    col_number   INTEGER,
    col_value    VARCHAR2(32767)
);
/

CREATE OR REPLACE TYPE col_type_t AS
    TABLE OF col_type;
/

CREATE OR REPLACE FUNCTION tokenize (
    p_csv_row   VARCHAR2
) RETURN col_type_t
    PIPELINED
IS
    str_table   apex_application_global.vc_arr2;
    c_delim     CONSTANT CHAR(1) := chr(1); -- An unprintable character hopefully not in the string
BEGIN
    --
    -- Replace the separator commas with c_delim and then split on c_delim
    --
    str_table := apex_util.string_to_table(
        regexp_replace(
            p_csv_row || ',',
            '(("[^"]*")+|[^,]*),',
            '\1' || c_delim
        ),
        c_delim
    );
    --
    -- We get an extra item because we added a delimiter at the end
    --
    FOR i IN 1..str_table.count - 1 LOOP
        PIPE ROW ( col_type(i, rtrim(str_table(i), c_delim)) );
    END LOOP;

    RETURN; -- A pipelined function should end with a bare RETURN
END tokenize;
/
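Before benchmarking, a quick sanity check on a literal (a sketch):

SELECT * FROM TABLE ( tokenize('a,"b,c",d') );

This should return three rows, with col_value set to a, "b,c" and d respectively.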

The code replaces the separator commas (and only those commas that are separators, not the ones inside quotes) with an unprintable character and then uses APEX_UTIL.STRING_TO_TABLE to tokenize the string before returning the results to the caller. Let me create some test data for benchmarking before proceeding:

CREATE TABLE csv_tests AS
SELECT
    ROWNUM row_number,
    rpad('a,"b,""c,"",d",e', 1000, 'x') csv_row
FROM dual
CONNECT BY level <= 10000;
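A quick sanity check of the generated data (a sketch):

SELECT COUNT(*) row_count,
       MIN(LENGTH(csv_row)) min_len,
       MAX(LENGTH(csv_row)) max_len
FROM csv_tests;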

We have 10,000 rows, each approximately 1K in length with three items. Here is how you might use the pipelined function defined above:

WITH r AS (
    SELECT *
    FROM csv_tests
    ORDER BY row_number
)
SELECT
    r.row_number,
    c.col_number,
    c.col_value
FROM
    r,
    TABLE ( tokenize(r.csv_row) ) c;

Here are the first six of the 30,000 rows (values truncated):

ROW_NUMBER COL_NUMBER COL_VALUE
          1         1 a
          1         2 "b,""c,"",d"
          1         3 exxxxxxxxxxxx
          2         1 a
          2         2 "b,""c,"",d"
          2         3 exxxxxxxxxxxx

Notice that the table expansion preserves the order of the items within each row (the pipelined function emits them in column order), so I only needed to order the rows themselves in advance.

Using a benchmark that ignores the network (such as CREATE TABLE AS SELECT), the query runs in under 5 seconds on my cloud database.
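For reference, such a benchmark might look like this in SQL*Plus (a sketch; the table name csv_tests_expanded is my own):

SET TIMING ON

CREATE TABLE csv_tests_expanded AS
WITH r AS (
    SELECT *
    FROM csv_tests
    ORDER BY row_number
)
SELECT
    r.row_number,
    c.col_number,
    c.col_value
FROM
    r,
    TABLE ( tokenize(r.csv_row) ) c;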
