MySQL Field Data Parsing

Question

Started here, I was asked to create a new post in order to flesh it out which I'll try to do here but I'm afraid it's rather wordy. I created a test table with what seems to be a good cross-section of the data that needs to be parsed. Using this query from here to which I added some additional fields that are needed, this is the query now.

SELECT ID, PartNo,
  GROUP_CONCAT(
    REPLACE(t.part, '-', CONCAT(' ', LEFT(t.part, 2)))
    ORDER BY t.ModelList
    SEPARATOR ' '                                      
  ) Models
FROM (
  SELECT t.ID, t.PartNo, t.Models, p.ModelList,
    REPLACE(REPLACE(
      SUBSTRING_INDEX(t.Models, ';', p.ModelList),
      SUBSTRING_INDEX(t.Models, ';', p.ModelList - 1),
      ''
    ), ';', '') part 
  FROM parts_listing t CROSS JOIN (
    SELECT 1 ModelList UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL
    SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL
    SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
  ) p 
  WHERE REPLACE(REPLACE(Models, '-', ''), ';', '') REGEXP '^[0-9]*$'
) t
WHERE t.part <> ''
GROUP BY t.ID, t.PartNo, t.Models
ORDER BY t.PartNo, t.Models

For the most basic parts, the Models data looks like this:

2206-13-26-33; 2302-06-13-32-33

. . . and this is the part that I am working on now. The Models column also contains ALL 23RD etc that will also need to be parsed somehow but for now I am concerned only with the data as shown above and I need to parse it to something like:

2206 2213 2226 2233 2302 2306 2313 2332 2333

The query above, however, seems to be repeating the first bit so it comes out as:

2206 2213 2226 2233 2202 2206 2213 2232 2233

It gives the result in one chunk but it would also be okay if each result was on it's own line as ultimately it will be dumped into another table and that is the way it will be by then.

It also gives too few results. The actual table contains over 15,000 rows but this is fetching only a small fraction and in the data sample below, it should fetch about 20 rows but actually fetches only 6 although the fiddle, which has a very limited data set, does indeed give the proper results. Running the the fiddle query against my own data fetches only 555 rows.

A cross-section sample of data can be created with this:

DROP TABLE IF EXISTS `parts_listing`;
CREATE TABLE IF NOT EXISTS `parts_listing` (
  `ID` smallint(5) unsigned NOT NULL AUTO_INCREMENT,
  `GroupNumber` varchar(20) DEFAULT NULL,
  `PartNo` varchar(30) DEFAULT NULL,
  `Models` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=34 DEFAULT CHARSET=utf8 ROW_FORMAT=COMPACT;

INSERT INTO `parts_listing` (`ID`, `GroupNumber`, `PartNo`, `Models`) VALUES
    (1, '0.000', '391906', '2201-02-11-20-22-32-40; 2301'),
    (2, '0.000', '391907', '2206-26-33'),
    (3, '0.000', '399391', '2213'),
    (4, '0.000', '415209', '2302-06-32-33'),
    (5, '0.000', '415245', '2313'),
    (6, '0.000', '437273', 'ALL 24TH; 25TH; 26TH'),
    (7, '0.000', '436803', '2631 (BODY 2678); 5431 (BODY 5478)'),
    (8, '0.000', '448611', 'ALL 54TH'),
    (9, '0.0015', '372174', 'ALL 22ND; 23RD'),
    (10, '0.0015', '399324', '2201-02-11-20-22-32-40; 2301'),
    (11, '0.0015', '399324', '2206-13-26-33; 2302-06-13-32-33'),
    (12, '0.0015', '372146', 'ALL 22ND; 23RD'),
    (13, '0.0015', 'G179359', '2201-02-11-20-22-32-40; 2301'),
    (14, '0.0015', 'G179559', '2206-13-26-33; 2302-06-13-32-33'),
    (15, '0.0015', '422024', '2201-02-11-20-22-32-40; 2301'),
    (16, '0.0015', '422024', '2206-13-26-33; 2302-06-13-32-33'),
    (17, '0.0016', '378664', 'ALL 22ND; 23RD'),
    (18, '0.0016', 'G178532', 'ALL 22ND; 23RD'),
    (19, '0.0016', 'G123197', 'ALL 22ND; 23RD'),
    (20, '30.793', '448421', 'ALL 54TH'),
    (21, '30.7932', '422762', '2462-65-92-95-98; 2562-65-92-95; 2662-65-92-95'),
    (22, '30.7932', '448902', '5462-65-67-82-85-92-95-97'),
    (23, '30.7938', '399941', 'ALL 22ND; 23RD'),
    (24, '30.7938', '425809', 'ALL 24TH; 25TH; 26TH'),
    (25, '30.7938', '447799', 'ALL 54TH'),
    (26, '30.79383', '399927', 'ALL 22ND; 23RD; 24TH; 25TH; 26TH'),
    (27, '30.79396', 'PA437550', '2562-65-77-79; 2677-79'),
    (28, '30.79398', '422470', '2452-72; 2552-72'),
    (29, '30.79398', '422471', '2452-72; 2552-72'),
    (30, '30.79398', '422553', 'ALL 24TH; 25TH'),
    (31, '30.8007', '425517', '2413-52-62-65-67-72-92-95-98; 2552-62-65-72-77-92-95; 2650-51-52-62-65-72-77-92-95-97; 5450-51-52-62-65-67-72-77-82-85-92-95-97'),
    (32, '30.8007', '425920', 'RHD, 2462-65-92-95; 2562-65-92-95; 2662-65-92-95-97; 5462-65-82-85-92-95-97'),
    (33, '30.8044', 'G129662', '2293');

Andrei Odegov · Accepted Answer · 2019-09-12 10:03:35Z

0

I think you need a more sophisticated way to verify that the Models column values are correct due to spaces after semicolons. In addition, if you need to process the Models column values that contain more than ten parts, you will need a longer sequence of numbers. I use the INFORMATION_SCHEMA COLUMNS table for this. You can also create an auxiliary table for this.

SELECT
  CONCAT(
    IF(N.n = 0, '', LEFT(t.part, 2)),
    SUBSTRING_INDEX(
      SUBSTRING_INDEX(t.part, '-', N.n + 1), '-', -1
    )
  ) AS part, t.isHD, t.ID, t.GroupNumber, t.PartNo, t.Models
FROM Numbers AS N CROSS JOIN (
  SELECT
    SUBSTRING_INDEX(
      SUBSTRING_INDEX(t.part, ' ', N.n + 1), ' ', -1
    ) AS part, t.isHD, t.ID, t.GroupNumber, t.PartNo, t.Models
  FROM Numbers AS N CROSS JOIN (
    SELECT
      SUBSTRING_INDEX(
        SUBSTRING_INDEX(t.part, ';', N.n + 1), ';', -1
      ) AS part, t.isHD, t.ID, t.GroupNumber, t.PartNo, t.Models
    FROM Numbers AS N CROSS JOIN (
      SELECT
        REPLACE(
          SUBSTRING_INDEX(
            SUBSTRING_INDEX(t.part, 'LHD', N.n + 1), 'LHD', -1
          ), ',', ''
        ) AS part, IF(N.n > 0, 1, t.isHD) AS isHD,
        t.ID, t.GroupNumber, t.PartNo, t.Models
      FROM Numbers AS N CROSS JOIN (
        SELECT
          SUBSTRING_INDEX(
            SUBSTRING_INDEX(p.Models, 'RHD', N.n + 1), 'RHD', -1
          ) AS part, IF(N.n > 0, 2, 0) AS isHD,
          p.ID, p.GroupNumber, p.PartNo, p.Models
        FROM Numbers AS N CROSS JOIN parts_listing AS p
        WHERE p.Models REGEXP '[[:<:]]([LR]HD,? +)?[0-9]+(-[0-9]+)*[[:>:]]' AND
              N.n <= (LENGTH(p.Models) - LENGTH(REPLACE(p.Models, 'RHD', ''))) / 3
      ) AS t WHERE N.n <= (LENGTH(t.part) - LENGTH(REPLACE(t.part, 'LHD', ''))) / 3
    ) AS t WHERE N.n <= LENGTH(t.part) - LENGTH(REPLACE(t.part, ';', ''))
  ) AS t WHERE N.n <= LENGTH(t.part) - LENGTH(REPLACE(t.part, ' ', ''))
) AS t WHERE t.part REGEXP '^[0-9]+(-[0-9]+)*$' AND
             N.n <= LENGTH(t.part) - LENGTH(REPLACE(t.part, '-', ''));

Partial output (29 rows):

+------+------+----+-------------+---------+--------------------------------------+
| part | isHD | ID | GroupNumber | PartNo  |                Models                |
+------+------+----+-------------+---------+--------------------------------------+
| 2201 |    0 |  1 | 0.000       | 391906  | 2201-02-11-20-22-32-40; 2301         |
| 2202 |    0 |  1 | 0.000       | 391906  | 2201-02-11-20-22-32-40; 2301         |
| 2211 |    0 |  1 | 0.000       | 391906  | 2201-02-11-20-22-32-40; 2301         |
| 2220 |    0 |  1 | 0.000       | 391906  | 2201-02-11-20-22-32-40; 2301         |
| 2222 |    0 |  1 | 0.000       | 391906  | 2201-02-11-20-22-32-40; 2301         |
| 2232 |    0 |  1 | 0.000       | 391906  | 2201-02-11-20-22-32-40; 2301         |
| 2240 |    0 |  1 | 0.000       | 391906  | 2201-02-11-20-22-32-40; 2301         |
| 2301 |    0 |  1 | 0.000       | 391906  | 2201-02-11-20-22-32-40; 2301         |
| 2213 |    1 |  3 | 0.000       | 399391  | LHD 2213                             |
| 2313 |    2 |  5 | 0.000       | 415245  | RHD, 2313                            |
| 2631 |    0 |  7 | 0.000       | 436803  | 2631 (BODY 2678); 5431 (BODY 5478)   |
| 5431 |    0 |  7 | 0.000       | 436803  | 2631 (BODY 2678); 5431 (BODY 5478)   |
| 2206 |    1 | 14 | 0.0015      | G179559 | LHD, 2206-13-26-33; 2302-06-13-32-33 |
| 2213 |    1 | 14 | 0.0015      | G179559 | LHD, 2206-13-26-33; 2302-06-13-32-33 |
| 2226 |    1 | 14 | 0.0015      | G179559 | LHD, 2206-13-26-33; 2302-06-13-32-33 |
| 2233 |    1 | 14 | 0.0015      | G179559 | LHD, 2206-13-26-33; 2302-06-13-32-33 |
| 2302 |    1 | 14 | 0.0015      | G179559 | LHD, 2206-13-26-33; 2302-06-13-32-33 |
| 2306 |    1 | 14 | 0.0015      | G179559 | LHD, 2206-13-26-33; 2302-06-13-32-33 |
| 2313 |    1 | 14 | 0.0015      | G179559 | LHD, 2206-13-26-33; 2302-06-13-32-33 |
| 2332 |    1 | 14 | 0.0015      | G179559 | LHD, 2206-13-26-33; 2302-06-13-32-33 |
| 2333 |    1 | 14 | 0.0015      | G179559 | LHD, 2206-13-26-33; 2302-06-13-32-33 |
| 2201 |    0 | 15 | 0.0015      | 422024  | 2201-02-11-20-22-32-40; RHD, 2301    |
| 2202 |    0 | 15 | 0.0015      | 422024  | 2201-02-11-20-22-32-40; RHD, 2301    |
| 2211 |    0 | 15 | 0.0015      | 422024  | 2201-02-11-20-22-32-40; RHD, 2301    |
| 2220 |    0 | 15 | 0.0015      | 422024  | 2201-02-11-20-22-32-40; RHD, 2301    |
| 2222 |    0 | 15 | 0.0015      | 422024  | 2201-02-11-20-22-32-40; RHD, 2301    |
| 2232 |    0 | 15 | 0.0015      | 422024  | 2201-02-11-20-22-32-40; RHD, 2301    |
| 2240 |    0 | 15 | 0.0015      | 422024  | 2201-02-11-20-22-32-40; RHD, 2301    |
| 2301 |    2 | 15 | 0.0015      | 422024  | 2201-02-11-20-22-32-40; RHD, 2301    |
+------+------+----+-------------+---------+--------------------------------------+

Demo on DB Fiddle.

edited Sep 12, 2019 at 10:03

answered Sep 9, 2019 at 7:10

Andrei Odegov

3,5722 gold badges18 silver badges26 bronze badges

Sign up to request clarification or add additional context in comments.

18 Comments

user5175034 Over a year ago

This is good but I'm not sure if it's pulling up everything. It appears that each section is on its own line in the results which is fine but, if I counted properly, there should be about 39 rows in the output, not 33. Maybe it is missing the entries in 2631 (BODY 2678); 5431 (BODY 5478) as it should, for now, ignore anything within parenthesis which I forgot to say in the original question. Also, doing error checking is both good and bad. If it filters any errors, that's good but it also keeps me from seeing and repairing the errors which is not good!

Andrei Odegov Over a year ago

I updated my answer by adding another layer to split the Models column values by spaces.

Andrei Odegov Over a year ago

My answer has been updated. In it I use an auxiliary table of numbers. But I do not understand what to do with the RHD (and the LHD). Can you give an example?

Andrei Odegov Over a year ago

This table is created in lines 3 through 10 of the SQL Schema panel. Using this table makes the query faster, concise, and readable. In the latest version of my answer, I also tried to process the LHD and the RHD.

Andrei Odegov Over a year ago

Is the RHD/LHD always at the beginning of a line?

|

nbk · Accepted Answer · 2019-09-10 08:58:12Z

0

You can try this. I changed the parts where You split the Models column. I split them into different rows, and let the code from forpas go over it and join them again together. rows with text in it aren't touched, because you didn't specify nothing.

select  Id,GroupNumber,PartNo,
  group_concat(
    replace(t.part, '-', concat(' ', left(t.part, 2)))
    order by t.partno, t.part ASC
    separator ' '                                      
  ) Models
from
(SELECT
  Id,GroupNumber,PartNo,
  TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(models, ';', n.digit+1), ';', -1)) part
FROM
  ( SElECT Id,GroupNumber,PartNo,
   REPLACE(REPLACE(REPLACE(Models, ')', ''), 'RHD,', ''), '(BODY', ';') Models
   FROM
   parts_listing ) p_l
  INNER JOIN
  (SELECT 0 digit UNION ALL select 1 union all select 2 union all select 3 union all
    select 4 union all select 5 union all select 6 union all
    select 7 union all select 8 union all select 9) n
  ON LENGTH(REPLACE(Models, ';' , '')) <= LENGTH(Models)-n.digit
  Where REPLACE(REPLACE(REPLACE(Models, '-', ''), ' ', ''), ';', '') REGEXP '^[0-9]+$'
ORDER BY
  Id,
  n.digit,part) t
   Group by  Id,GroupNumber,PartNo
   ;

Gets you

Id  GroupNumber     PartNo  Models
1   0.000   391906  2201 2202 2211 2220 2222 2232 2240 2301
2   0.000   391907  2206 2226 2233
3   0.000   399391  2213
4   0.000   415209  2302 2306 2332 2333
5   0.000   415245  2313
7   0.000   436803  2631 2678 5431 5478
10  0.0015  399324  2201 2202 2211 2220 2222 2232 2240 2301
11  0.0015  399324  2206 2213 2226 2233 2302 2306 2313 2332 2333
13  0.0015  G179359     2201 2202 2211 2220 2222 2232 2240 2301
14  0.0015  G179559     2206 2213 2226 2233 2302 2306 2313 2332 2333
15  0.0015  422024  2201 2202 2211 2220 2222 2232 2240 2301
16  0.0015  422024  2206 2213 2226 2233 2302 2306 2313 2332 2333
21  30.7932     422762  2462 2465 2492 2495 2498 2562 2565 2592 2595 2662 2665 2692 2695
22  30.7932     448902  5462 5465 5467 5482 5485 5492 5495 5497
27  30.79396    PA437550    2562 2565 2577 2579 2677 2679
28  30.79398    422470  2452 2472 2552 2572
29  30.79398    422471  2452 2472 2552 2572
31  30.8007     425517  2413 2452 2462 2465 2467 2472 2492 2495 2498 2552 2562 2565 2572 2577 2592 2595 2650 2651 2652 2662 2665 2672 2677 2692 2695 2697 5450 5451 5452 5462 5465 5467 5472 5477 5482 5485 5492 5495 5497
32  30.8007     425920  2462 2465 2492 2495 2562 2565 2592 2595 2662 2665 2692 2695 2697 5462 5465 5482 5485 5492 5495 5497
33  30.8044     G129662     2293

Dbfiddle example

edited Sep 10, 2019 at 8:58

answered Sep 8, 2019 at 19:42

nbk

49.6k8 gold badges32 silver badges53 bronze badges

7 Comments

user5175034 Over a year ago

Thank you. It seems to be working after a fashion but you'll note that some of the results have only three digit numbers when all must be four digit. Also, note that there are some rows that contain both ordinal numbers and dash-separated values and those dash-separated values need to be parsed, ignoring the other pieces. I have to somehow parse the ordinal numbers too but that will probably have to be for a different post as it requires a database lookup. For the current code, only the four digit numbers should be in the results and rows that have only ordinals shouldn't show up at all.

nbk Over a year ago

so i changed it. You other rows, as i said in the inner select you have all data in a singkle row i.e ALL 54TH thtat you can detect and replace it. I don'T know what

user5175034 Over a year ago

Thank you, that appears to be pulling out the proper information but it's coming out in reverse order with the higher numbers first. Any way to keep them in the order that they are as shown in my question above? Also, it should be returning only 20 rows (if I counted correctly) but it is pulling up all. It does not need to show any rows that do not contain the four-digit numbers and I'm not sure how to do that.

nbk Over a year ago

So ichanged ia one more time., now it is sorted and everything without numbers out

nbk Over a year ago

As you can see, i removed or changed(See replacement of (Body to ;) the unwanted parts, so far that works because there are very simple pattern to eliminate.or repplace

|

Collectives™ on Stack Overflow

MySQL Field Data Parsing

2 Answers 2

18 Comments

7 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

18 Comments

7 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related