Java Regex to extract specific words

Question

I am trying to extract all occurrences of 'and', 'a', 'the', 'an' and '&amp;' from a block of text, along with all occurrences of digits.

I tried to create different regexes for that purpose but failed to get an accurate result.

All the digits are extracted fine, but I am unable to fetch all the aforementioned strings through regex.

My basic regex was

 Pattern p = Pattern.compile("^[0-9]");

then I tried different combinations like

 Pattern p = Pattern.compile("^[0-9](&amp;)");
 Pattern p = Pattern.compile("^[0-9]+[&amp;]");

to get the aforementioned strings, but to no avail.

Example of the text:

System requirements: iOS 6.0 and Android (varies) &amp;
Version used in this guide: 2.2.4 (iPhone), 13.1.2 (Android)

Expected Result

 6.0,and,&amp;,2.2.4,13.1.2

@ohaal - the OP has shown us what he/she has tried: "My basic regex was..."
– TheLostMind, Commented Jun 8, 2015 at 7:16
It's unclear what you are trying to do. Please add the expected output from the input you've provided.
– Maroun, Commented Jun 8, 2015 at 7:18
"I tried to create different regex for that purpose but fail to get the accurate result." -- His current "attempt" is a regex which looks for a single digit at the start of a string... how on earth is that supposed to ever be able to capture 'and', 'a', 'the' and 'an'? Surely there must be a better attempt. Even just writing those words out literally without any use of regex would be a better attempt...
– ohaal, Commented Jun 8, 2015 at 7:18

ohaal · Accepted Answer · 2015-06-08 08:12:03Z

You are nowhere even close with your "attempts" and I almost feel bad for just handing you the solution, but if you really are "keen to learn new things" (as you say in your SO profile), have a look at a regex tutorial.

Basic use of alternation, grouping, quantifiers and anchors (here, word boundaries) will solve your problem.

(\b(?:a|an|and|the)\b|&amp;|\d+(?:\.\d+)*)

Explanation:

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
--------------------------------------------------------------------------------
    (?:                      group, but do not capture:
--------------------------------------------------------------------------------
      a                        'a'
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      an                       'an'
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      and                      'and'
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      the                      'the'
--------------------------------------------------------------------------------
    )                        end of grouping
--------------------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    &amp;                    '&amp;'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
--------------------------------------------------------------------------------
      \.                       '.'
--------------------------------------------------------------------------------
      \d+                      digits (0-9) (1 or more times
                               (matching the most amount possible))
--------------------------------------------------------------------------------
    )*                       end of grouping
--------------------------------------------------------------------------------
  )                        end of \1

For use in a Java string literal, you have to escape every backslash, i.e. write \\ for each \:

(\\b(?:a|an|and|the)\\b|&amp;|\\d+(?:\\.\\d+)*)
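A minimal sketch of how the escaped pattern could be used from Java, assuming the sample text from the question is held in a plain string. The class name TokenExtractor and the helper method extract are hypothetical names chosen for illustration; Pattern.compile, Matcher.find and Matcher.group are the standard java.util.regex calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class TokenExtractor {
    // The pattern from above, with each backslash doubled for the string literal.
    private static final Pattern TOKENS =
            Pattern.compile("(\\b(?:a|an|and|the)\\b|&amp;|\\d+(?:\\.\\d+)*)");

    // Collects every match in order of appearance.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = TOKENS.matcher(text);
        // find() advances to the next match; group(1) is the outer capture group.
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "System requirements: iOS 6.0 and Android (varies) &amp;\n"
                + "Version used in this guide: 2.2.4 (iPhone), 13.1.2 (Android)";
        System.out.println(String.join(",", extract(text)));
        // prints: 6.0,and,&amp;,2.2.4,13.1.2
    }
}
```

Note that the pattern is case-sensitive, so the "A" in "Android" is not picked up as the word 'a'. If capitalised words such as "The" should also match, passing Pattern.CASE_INSENSITIVE as a second argument to Pattern.compile is one option.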

I never said that I am good with regex, but yes, I know the basics of creating a regex. I tried lots of combinations and couldn't mention them all, so I mentioned the most basic one just to give visitors an idea. Anyway, I appreciate your time and explanation.
@HappyDev: No offence, but based on the attempts you provided, you do not know the basics of regex. The basics of regex (i.e. some of the first things you should learn) are exactly what is required to solve this problem (grouping, alternation, quantifiers and anchors). Anyway, thanks for the downvote. :)

karthik manchala · Accepted Answer · 2015-06-08 07:29:29Z

You can use the following regex:

(\\ban?d?\\b|\\bthe\\b|\\B&amp;\\B|[\\d.]+)

See DEMO
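This pattern gives the expected result on the sample text from the question, but it is looser than an explicit alternation: an?d? also accepts "ad", and [\d.]+ accepts a run of dots with no digits. A small sketch of the difference, assuming a made-up input string "ad 1. end." that is not from the question; the class name PatternComparison and the method findAll are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison harness, for illustration only.
class PatternComparison {
    // Explicit alternation: only the listed words, and numbers like 1 or 1.2.3.
    static final String STRICT = "(\\b(?:a|an|and|the)\\b|&amp;|\\d+(?:\\.\\d+)*)";
    // The pattern from this answer.
    static final String LOOSE = "(\\ban?d?\\b|\\bthe\\b|\\B&amp;\\B|[\\d.]+)";

    // Returns all matches of the given regex in the text, in order.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "ad 1. end.";  // made-up edge-case input
        System.out.println(findAll(STRICT, text)); // prints: [1]
        System.out.println(findAll(LOOSE, text));  // prints: [ad, 1., .]
    }
}
```

Whether the extra matches matter depends on the input; for text that contains sentence-ending periods, the stricter number pattern \d+(?:\.\d+)* is probably the safer choice.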
