
I am trying to extract every occurrence of 'and', 'a', 'the', 'an' and '&amp;' from a block of text, along with every number.

I tried several different regexes for this but failed to get an accurate result.

All the digits are extracted fine but I am unable to fetch all the aforementioned strings through regex.

My basic regex was

 Pattern p = Pattern.compile("^[0-9]");

then I tried different combinations like

 Pattern p = Pattern.compile("^[0-9](&)");
 Pattern p = Pattern.compile("^[0-9]+[&]");

to get the aforementioned strings, but to no avail.

Example of the text:

System requirements: iOS 6.0 and Android (varies) &
Version used in this guide: 2.2.4 (iPhone), 13.1.2 (Android)

Expected Result

 6.0,and,&,2.2.4,13.1.2
  • What is your expected result? Commented Jun 8, 2015 at 7:14
  • Could you show us your failed attempts? Commented Jun 8, 2015 at 7:15
  • @ohaal - the OP has shown us what he/she has tried: "My basic regex was...". Commented Jun 8, 2015 at 7:16
  • It's unclear what you are trying to do. Please add the expected output from the input you've provided. Commented Jun 8, 2015 at 7:18
  • "I tried to create different regex for that purpose but fail to get the accurate result." -- His current "attempt" is a regex which looks for a single digit at the start of a string... how on earth is that ever supposed to capture 'and', 'a', 'the' and 'an'? Surely there must be a better attempt. Even just writing those words out literally without any use of regex would be a better attempt... Commented Jun 8, 2015 at 7:18

2 Answers


You are nowhere even close with your "attempts" and I almost feel bad for just handing you the solution, but if you really are "keen to learn new things" (as you say in your SO profile), have a look at a regex tutorial.

A basic use of alternation, grouping, quantifiers and anchors (word boundaries) will solve your problem.

(\b(?:a|an|and|the)\b|&|\d+(?:\.\d+)*)

Explanation:

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
--------------------------------------------------------------------------------
    (?:                      group, but do not capture:
--------------------------------------------------------------------------------
      a                        'a'
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      an                       'an'
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      and                      'and'
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      the                      'the'
--------------------------------------------------------------------------------
    )                        end of grouping
--------------------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    &                        '&'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
--------------------------------------------------------------------------------
      \.                       '.'
--------------------------------------------------------------------------------
      \d+                      digits (0-9) (1 or more times
                               (matching the most amount possible))
--------------------------------------------------------------------------------
    )*                       end of grouping
--------------------------------------------------------------------------------
  )                        end of \1

For use in Java, you have to escape every backslash in the string literal (each \ becomes \\):

(\\b(?:a|an|and|the)\\b|&|\\d+(?:\\.\\d+)*)
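As a minimal sketch of how the escaped pattern could be applied, the following runs it against the sample text from the question. The class name `TokenExtractor` and the helper `extract` are hypothetical names chosen for illustration; only `Pattern`, `Matcher.find()` and `Matcher.group()` come from `java.util.regex`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenExtractor {
    // The pattern from this answer, with each backslash doubled for the Java string literal.
    private static final Pattern TOKENS =
            Pattern.compile("(\\b(?:a|an|and|the)\\b|&|\\d+(?:\\.\\d+)*)");

    // Collects every match in order of appearance (hypothetical helper, not a library method).
    public static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = TOKENS.matcher(text);
        // find() scans for the next match anywhere in the input;
        // matches() would instead require the whole input to match.
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "System requirements: iOS 6.0 and Android (varies) &\n"
                + "Version used in this guide: 2.2.4 (iPhone), 13.1.2 (Android)";
        System.out.println(String.join(",", extract(text)));
        // prints 6.0,and,&,2.2.4,13.1.2
    }
}
```

Note that there is no `^` anchor here: the attempts in the question start with `^`, which only allows a match at the very beginning of the input.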

2 Comments

I never said that I am good with regex, but I do know the basics of creating one. I tried lots of combinations and couldn't list them all, so I mentioned only the most basic ones to give visitors an idea. Anyway, I appreciate your time and explanation.
@HappyDev: No offence, but based on the attempts you provided, you do not know the basics of regex. The basics of regex (i.e. some of the first things you should learn) are exactly what would be required to solve this problem (grouping, alternation, quantifiers and anchors). Anyway, thanks for the downvote. :)

You can use the following regex:

(\\ban?d?\\b|\\bthe\\b|\\B&\\B|[\\d.]+)

See DEMO
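The two patterns on this page give the same result on the question's sample text, but they are not equivalent in general. The sketch below compares them on a made-up input (`"an ad. R&D 1.5"`, which is an assumption for illustration and not from the question); `PatternComparison` and `findAll` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternComparison {
    // Pattern using explicit alternation and a digits(.digits)* number rule.
    static final String EXPLICIT = "\\b(?:a|an|and|the)\\b|&|\\d+(?:\\.\\d+)*";
    // Pattern from this answer: optional letters and a character class for numbers.
    static final String COMPACT = "\\ban?d?\\b|\\bthe\\b|\\B&\\B|[\\d.]+";

    // Returns every match of regex in text, in order (hypothetical helper).
    public static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "an ad. R&D 1.5"; // illustrative input only
        System.out.println(findAll(EXPLICIT, text)); // prints [an, &, 1.5]
        System.out.println(findAll(COMPACT, text));  // prints [an, ad, ., 1.5]
    }
}
```

On this input, `an?d?` also accepts "ad", `[\d.]+` accepts a lone ".", and `\B&\B` skips an "&" that touches a word character. Whether any of that matters depends on the text being processed.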
