How to parse data between unknown strings

Question

I have a text column with data like this:

RawDataColumn
THANK 1000 1500 1740 1  YOU 1000 1740 1820 1  ABC 1000 1820 1960 1  XYZABC 1000 1960 2240 1  DFGS 1000 2240 2380 1  THINK 1000 2380 2480 1

I want to parse the text column into multiple columns, like this:

Word    A     B     C     D
THANK   1000  1500  1740  1
YOU     1000  1740  1820  1
ABC     1000  1820  1960  1
XYZABC  1000  1960  2240  1
DFGS    1000  2240  2380  1
THINK   1000  2380  2480  1

SQL Server Version : SQL Server 2016

@TimBiegeleisen What would be the best way to do this with good performance?
– Jay Nani, Commented Apr 8, 2020 at 10:41
Try using a scripting language such as Python or Perl. Then re-import the data once you have separate, well-defined rows.
– Tim Biegeleisen, Commented Apr 8, 2020 at 10:43
If [word] is always alphabetic and A-D are always numeric, you can create a UDF with SUBSTRING and PATINDEX.
– jigga, Commented Apr 8, 2020 at 11:48
@jigga - Yes, the word is always alphabetic and columns A-D are numeric. I couldn't work out the logic for handling it with SUBSTRING and PATINDEX.
– Jay Nani, Commented Apr 8, 2020 at 12:07

Tim Biegeleisen · Accepted Answer · 2020-04-08 10:46:54Z

SQL Server is not the best place to handle such text scrubbing requirements. I will give a Python script which can generate a text file with clearly defined lines:

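import re

inp = "THANK 1000 1500 1740 1  YOU 1000 1740 1820 1  ABC 1000 1820 1960 1  XYZABC 1000 1960 2240 1  DFGS 1000 2240 2380 1  THINK 1000 2380 2480 1"
# Each record is one non-space token followed by four integers.
lines = re.findall(r'\S+ \d+ \d+ \d+ \d+', inp)
# Write one record per line; the with block closes the file automatically.
with open('output.txt', 'w') as f:
    for line in lines:
        f.write(line + '\n')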

The output file output.txt should now contain one record per line, with the columns separated by single spaces. A similar approach works in most other languages; afterwards, import the file into SQL Server.
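If the target table has typed columns, it may help to convert the fields before writing the file. The following is a minimal sketch, assuming every record is a word made of letters followed by four integers. The names parse_records and to_csv are illustrative and not part of the original answer.

```python
import csv
import io
import re

# One record = a word followed by four integers (assumption based on the sample data).
# \s+ tolerates one or more spaces between fields.
RECORD = re.compile(r'([A-Za-z]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)')

def parse_records(raw):
    """Return a list of (word, a, b, c, d) tuples found in raw."""
    return [(w, int(a), int(b), int(c), int(d))
            for w, a, b, c, d in RECORD.findall(raw)]

def to_csv(rows):
    """Render rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator='\n')
    writer.writerow(['Word', 'A', 'B', 'C', 'D'])
    writer.writerows(rows)
    return buf.getvalue()

sample = "THANK 1000 1500 1740 1  YOU 1000 1740 1820 1  ABC 1000 1820 1960 1"
rows = parse_records(sample)
print(to_csv(rows))
```

Writing all records to a single delimited file, rather than one file per value, keeps the later import into SQL Server to one bulk load.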

1 Comment

Jay Nani Over a year ago

I tried using PowerShell to create files and import them back into SQL Server, but with 120,000 files (each column's data as one file) performance suffers: the import takes 15+ hours.

jigga · Accepted Answer · 2020-04-09 10:20:45Z

Following up on my comment, here is one way to do this (not my best work :D):

CREATE FUNCTION dbo.Split
(
    @string nvarchar(max)
)
RETURNS @result TABLE (Word nvarchar(max), A int, B int, C int, D int)
AS
 BEGIN

    DECLARE @sub nvarchar(max)
    DECLARE @Word nvarchar(max)
    DECLARE @A int
    DECLARE @B int
    DECLARE @C int
    DECLARE @D int

    IF @string IS NULL 
     BEGIN
        INSERT INTO @result VALUES(NULL, NULL, NULL, NULL, NULL)
     END

    ELSE
     BEGIN
        WHILE LEN(@string) > 0
         BEGIN
            IF @string LIKE '% [A-Z]%'
             BEGIN
                SET @sub = SUBSTRING(@string, 0, PATINDEX('% [A-Z]%',  @string))
             END
            ELSE
             BEGIN
                SET @sub = @string
             END

            SET @string = LTRIM(RTRIM(RIGHT(@string, LEN(@string) - LEN(@sub))))
            SET @Word = LEFT(@sub, CHARINDEX(' ', @sub) - 1)

            SET @sub = SUBSTRING(@sub, CHARINDEX(' ', @sub) + 1, LEN(@sub))
            SET @A = LEFT(@sub, CHARINDEX(' ', @sub))

            SET @sub = SUBSTRING(@sub, CHARINDEX(' ', @sub) + 1, LEN(@sub))
            SET @B = LEFT(@sub, CHARINDEX(' ', @sub))

            SET @sub = SUBSTRING(@sub, CHARINDEX(' ', @sub) + 1, LEN(@sub))
            SET @C = LEFT(@sub, CHARINDEX(' ', @sub))

            SET @D = SUBSTRING(@sub, CHARINDEX(' ', @sub) + 1, LEN(@sub))

            INSERT INTO @result VALUES(@Word, @A, @B, @C, @D)
         END
     END
    RETURN  
 END
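To check what the function should return for a given input, the same splitting rule (a new record starts wherever whitespace is followed by a letter) can be sketched outside SQL Server. This is only a rough cross-check under the assumption that each record has exactly four numbers. split_records is a hypothetical helper, not a translation of the T-SQL above.

```python
import re

def split_records(raw):
    """Split raw wherever whitespace is followed by a letter,
    then break each chunk into a word and four integers."""
    raw = raw.strip()
    if not raw:
        return []
    rows = []
    # The lookahead keeps the letter that starts the next record.
    for chunk in re.split(r'\s+(?=[A-Za-z])', raw):
        word, *numbers = chunk.split()
        if len(numbers) != 4:
            raise ValueError(f"expected 4 numbers after {word!r}, got {len(numbers)}")
        rows.append((word, *map(int, numbers)))
    return rows

print(split_records("XYZABC 1000 1960 2240 1  DFGS 1000 2240 2380 1"))
```

Comparing this output with SELECT * FROM dbo.Split(...) on a few sample values is a quick way to spot malformed records before running the function over the whole table.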

Jay Nani · Accepted Answer · 2020-04-09 13:19:07Z

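-- Sample data
CREATE TABLE test (RawDataColumn varchar(2000));
INSERT INTO test VALUES ('THANK 1000 1500 1740 1  YOU 1000 1740 1820 1  ABC 1000 1820 1960 1  XYZABC 1000 1960 2240 1  DFGS 1000 2240 2380 1  THINK 1000 2380 2480 1');

-- Step 1: records are separated by two spaces, so turn that separator into '|'
-- and split on it (assumes '|' never appears in the data).
WITH mycte AS (
    SELECT value AS val1
    FROM test
    CROSS APPLY STRING_SPLIT(REPLACE(RawDataColumn, '  ', '|'), '|')
)
-- Step 2: split each record on single spaces and pivot the pieces into columns.
-- Caveat: STRING_SPLIT does not guarantee output order, so rn is not guaranteed
-- to match a token's position; identical records also collapse in the GROUP BY.
SELECT MAX(CASE WHEN rn = 1 THEN value END) AS Word
     , MAX(CASE WHEN rn = 2 THEN value END) AS A
     , MAX(CASE WHEN rn = 3 THEN value END) AS B
     , MAX(CASE WHEN rn = 4 THEN value END) AS C
     , MAX(CASE WHEN rn = 5 THEN value END) AS D
FROM mycte s
CROSS APPLY (
    SELECT ss.[value], ROW_NUMBER() OVER (PARTITION BY s.val1 ORDER BY s.val1) AS rn
    FROM STRING_SPLIT(val1, ' ') AS ss
) AS d
GROUP BY s.val1;

DROP TABLE test;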
