SQL Server Regular expression extract pattern from DB column

Question

I have a question about SQL Server. I have a database column whose values follow this pattern:

1. up to 10 digits
2. then a comma
3. up to 10 digits
4. then a semicolon

e.g.

100000161, 100000031; 100000243, 100000021;
100000161, 100000031; 100000243, 100000021;

From each occurrence of the pattern I want to extract the first group of digits (up to 10) (1.) followed by the semicolon (4.)

(or, in other words, remove everything from the comma up to the next semicolon)

100000161; 100000243; 100000161; 100000243;

Can you please advise me how to achieve this in SQL Server? I'm not very familiar with regex and therefore have no clue how to approach this.

Thanks,

Alex

SQL Server is notorious among the enterprise databases for having fairly lousy regex replace support, which is probably what you would want to be using for this problem. Is there any chance you could scrub this data somewhere else?
– Tim Biegeleisen, Commented Sep 1, 2017 at 12:08
@TimBiegeleisen No matter how lousy the regex support is, something this simple will never be a problem in any regex engine. Regex is also something you would definitely NOT want to use for this task.
– Tomalak, Commented Sep 1, 2017 at 12:10
@Tomalak SUBSTRING_INDEX is not a SQL Server function, it's a MySQL function, and yes, regex is the sort of thing you would want to use here.
– Tim Biegeleisen, Commented Sep 1, 2017 at 12:11
@user3898488 Are you trying to return the first field from each pair? You can use STRING_SPLIT to split first by ;, then by ,. It would be better if you parsed the data before storing it in the database though. You can't take advantage of indexes if you need to apply functions on a column's values.
– Panagiotis Kanavos, Commented Sep 1, 2017 at 12:21
The sample data you show does not match the pattern that you wrote.
– Andrew Morton, Commented Sep 1, 2017 at 12:22
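
The STRING_SPLIT suggestion in the comments could be sketched roughly as follows. This is only an illustration under some assumptions: STRING_SPLIT needs SQL Server 2016 or later with database compatibility level 130 or higher, the table variable @t and its columns id and it are hypothetical names, and STRING_SPLIT does not promise to return the pieces in their original order. Only the outer split on ; is done with STRING_SPLIT; the first field of each pair is then cut off at the comma with LEFT and CHARINDEX.

```sql
-- Hypothetical sample table; the names @t, id and it are assumptions.
DECLARE @t TABLE (id int IDENTITY(1,1), it nvarchar(200));

INSERT INTO @t (it) VALUES
('100000161, 100000031; 100000243, 100000021;'),
('100000161, 100000031; 100000243, 100000021;');

-- Split each row on ';' to get the pairs, then keep the text before the comma.
-- Appending ',' inside CHARINDEX guarantees a match, so LEFT never gets a
-- negative length even for a piece that has no comma.
SELECT t.id,
       LTRIM(LEFT(p.value, CHARINDEX(',', p.value + ',') - 1)) AS FirstValue
FROM   @t AS t
       CROSS APPLY STRING_SPLIT(t.it, ';') AS p
WHERE  p.value <> '';   -- the trailing ';' produces one empty piece per row
```

This returns one row per pair rather than one concatenated string, so the pieces would still have to be joined again if a single value per input row is needed.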

user7715598 · Accepted Answer · 2017-09-01 12:24:49Z

1

Try this

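DECLARE @Sql TABLE (SqlCol nvarchar(max));

INSERT INTO @Sql
SELECT '100000161,100000031;100000243,100000021;100000161,100000031;100000243,100000021;';

-- Turn every ';' and ',' into an element boundary, then shred the XML into rows.
-- ROW_NUMBER() with ORDER BY (SELECT NULL) relies on nodes() returning the
-- elements in document order, which is not formally guaranteed.
WITH cte
     AS (SELECT ROW_NUMBER()
                  OVER(
                    ORDER BY (SELECT NULL))         AS Rno,
                Split.a.value('.', 'VARCHAR(1000)') AS Data
         FROM   (SELECT CAST('<S>'
                             + REPLACE(REPLACE(SqlCol, ';', ','), ',',
                             '</S><S>')
                             + '</S>' AS XML) AS Data
                 FROM   @Sql) AS A
                CROSS APPLY Data.nodes('/S') AS Split(a))
-- Odd-numbered elements are the first value of each pair;
-- the empty element comes from the trailing ';'.
SELECT STUFF((SELECT '; ' + Data
              FROM   cte
              WHERE  Rno % 2 <> 0
                     AND Data <> ''
              FOR XML PATH ('')), 1, 2, '') AS ExpectedData;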

ExpectedData
-------------
100000161; 100000243; 100000161; 100000243

3 Comments

Panagiotis Kanavos Over a year ago

You don't need all this to extract the first value. Use different inner and outer tags instead of a single <S> and select the one you want

Farbkreis Over a year ago

This looks good. The only issue I noticed when I checked it on real data is that the output of the select is written into a single line, while the source data is in different rows.

Panagiotis Kanavos Over a year ago

@user3898488 what do you want? All results in a single row? A pair per input row?

Jay Wheeler · Accepted Answer · 2017-09-01 12:26:17Z

1

I believe this will get you what you are after as long as that pattern truly holds. If not, it's fairly easy to ensure the data conforms to that pattern and then apply this

Select Substring(TargetCol, 1, 10) + ';' From TargetTable

3 Comments

Andrew Morton Over a year ago

OP changed the spec a little, so it would now be SELECT LEFT(TargetCol, CHARINDEX(',', TargetCol) - 1) + ';' WHERE CHARINDEX(',', TargetCol) BETWEEN 1 AND 11;.

Farbkreis Over a year ago

This looks pretty good, but where do I have to add the FROM TargetTable? I checked your first command and it runs fine, but I'm failing to merge your two commands.

Andrew Morton Over a year ago

@user3898488 Oops! I tested with a variable and changed the name without adding in the FROM, so...

SELECT LEFT(TargetCol, CHARINDEX(',', TargetCol) - 1) + ';' FROM SomeTable WHERE CHARINDEX(',', TargetCol) BETWEEN 1 AND 11;

But that won't help if you have more than one pair of data in a row.

Panagiotis Kanavos · Accepted Answer · 2017-09-01 13:17:19Z

You can take advantage of SQL Server's XML support to convert the input string into an XML value and query it with XQuery and XPath expressions.

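For example, the following query replaces each ; with </b><a> and each , with </a><b>, which turns each string into <a>100000161</a><b>100000031</b><a>100000243</a><b>100000021</b><a></a>. Calling .query('a') then keeps only the <a> elements, leaving <a>100000161</a><a>100000243</a><a />. After that, you can select individual <a> nodes with /a[1], /a[2]: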

declare @table table (it nvarchar(200))

insert into @table values
('100000161, 100000031; 100000243, 100000021;'),
('100000161, 100000031; 100000243, 100000021;')

-- Build the XML, keep only the <a> elements, then pick them by position
select 
    xCol.value('/a[1]','nvarchar(200)') as A1, 
    xCol.value('/a[2]','nvarchar(200)') as A2
from (
    select convert(xml, '<a>' 
                        + replace(replace(replace(it,';','</b><a>'),',','</a><b>'),' ','')
                        + '</a>')
                  .query('a') as xCol
    from @table) as tmp 

-------------------------
A1          A2
100000161   100000243
100000161   100000243

value extracts a single value from an XML field. nodes returns a table of nodes that match the XPath expression. The following query will return all "keys" :

select 
    a.value('.','nvarchar(200)')
from (
    select convert(xml, '<a>' 
                        + replace(replace(replace(it,';','</b><a>'),',','</a><b>'),' ','')
                        + '</a>')
                  .query('a') as xCol
    from @table) as tmp 
    cross apply xCol.nodes('a') as y(a)
where a.value('.','nvarchar(200)')<>''

------------
100000161
100000243
100000161
100000243

With 200K rows of data though, I'd seriously consider transforming the data when loading it and storing it in individual, indexable columns, or adding a separate, related table. Applying string manipulation functions on a column means that the server can't use any covering indexes to speed up queries.
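
As a rough illustration of the "separate, related table" idea, the pairs could be stored one per row, something like the sketch below. All table, column and index names are hypothetical, and bigint is an assumption made because a value of up to 10 digits can exceed the range of int.

```sql
-- Hypothetical parent table holding the original rows.
CREATE TABLE dbo.SourceRow (
    SourceRowId int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    RawValue    nvarchar(200)     NOT NULL   -- original string, kept for reference
);

-- Hypothetical child table: one row per "digits, digits;" pair.
CREATE TABLE dbo.SourceRowPair (
    SourceRowId int    NOT NULL REFERENCES dbo.SourceRow (SourceRowId),
    PairNo      int    NOT NULL,             -- position of the pair in the string
    KeyValue    bigint NOT NULL,             -- the digits before the comma
    OtherValue  bigint NOT NULL,             -- the digits after the comma
    PRIMARY KEY (SourceRowId, PairNo)
);

-- An ordinary index makes lookups by the first value cheap.
CREATE INDEX IX_SourceRowPair_KeyValue ON dbo.SourceRowPair (KeyValue);

-- Example lookup that needs no string manipulation at query time.
SELECT SourceRowId, PairNo
FROM   dbo.SourceRowPair
WHERE  KeyValue = 100000161;
```

With a layout like this the parsing happens once at load time, and the query from the question reduces to selecting KeyValue.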

If that's not possible (why?) I'd consider at least adding a separate XML-typed column that would contain the same data in XML form, to allow the creation of an XML index.
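
A minimal sketch of the XML-column variant might look like this. The table, column and index names are hypothetical, the XML is built with the same replace chain as above, and a primary XML index requires the table to have a clustered primary key.

```sql
-- Hypothetical table that keeps the raw string next to its XML form.
CREATE TABLE dbo.PairDataXml (
    Id       int IDENTITY(1,1) NOT NULL
             CONSTRAINT PK_PairDataXml PRIMARY KEY CLUSTERED,
    RawValue nvarchar(200) NOT NULL,
    PairsXml xml NULL
);

-- Primary XML index over the XML column.
CREATE PRIMARY XML INDEX PXML_PairDataXml_PairsXml
    ON dbo.PairDataXml (PairsXml);

-- Fill the XML column while loading the data.
INSERT INTO dbo.PairDataXml (RawValue, PairsXml)
SELECT v.RawValue,
       CONVERT(xml, '<a>'
                    + REPLACE(REPLACE(REPLACE(v.RawValue, ';', '</b><a>'), ',', '</a><b>'), ' ', '')
                    + '</a>')
FROM   (VALUES (N'100000161, 100000031; 100000243, 100000021;')) AS v (RawValue);

-- Find the rows that contain a given first value.
SELECT Id
FROM   dbo.PairDataXml
WHERE  PairsXml.exist('/a[. = "100000161"]') = 1;
```

Whether the XML index actually pays off depends on the data and the queries, so it would be worth comparing against the plain relational layout before committing to it.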
