16

I know this question has been asked before but I can't get any of the answers I have looked at to work. I have a JSON file which has thousands of lines and want to simply extract the text between two strings every time they appear (which is a lot).

As a simple example my JSON would look like this:

    "customfield_11300": null,
    "customfield_11301": [
      {
        "self": "xxxxxxxx",
        "value": "xxxxxxxxx",
        "id": "10467"
      }
    ],
    "customfield_10730": null,
    "customfield_11302": null,
    "customfield_10720": 0.0,
    "customfield_11300": null,
    "customfield_11301": [
      {
        "self": "zzzzzzzzzzzzz",
        "value": "zzzzzzzzzzz",
        "id": "10467"
      }
    ],
    "customfield_10730": null,
    "customfield_11302": null,
    "customfield_10720": 0.0,

So I want to output everything between "customfield_11301" and "customfield_10730":

      {
        "self": "xxxxxxxx",
        "value": "xxxxxxxxx",
        "id": "10467"
      }
    ],
      {
        "self": "zzzzzzzzzzzzz",
        "value": "zzzzzzzzzzz",
        "id": "10467"
      }
    ],

I'm trying to keep it as simple as possible - so don't care about brackets being displayed in the output.

This is what I have (which outputs way more than what I want):

$importPath = "todays_changes.txt"
$pattern = "customfield_11301(.*)customfield_10730"

$string = Get-Content $importPath
$result = [regex]::match($string, $pattern).Groups[1].Value
$result
2
  • 1
    why don't you decode the JSON into an object and address the properties directly? Commented Apr 20, 2016 at 14:05
  • 1
    The quick answer is - change your greedy capture (.*) to non greedy - (.*?). That should do it. Commented Apr 20, 2016 at 14:13

5 Answers

14

Here is a PowerShell function which will find a string between two strings.

function GetStringBetweenTwoStrings($firstString, $secondString, $importPath){

    #Get content from file
    $file = Get-Content $importPath

    #Regex pattern to compare two strings
    $pattern = "$firstString(.*?)$secondString"

    #Perform the operation
    $result = [regex]::Match($file,$pattern).Groups[1].Value

    #Return result
    return $result

}

You can then run the function like this:

GetStringBetweenTwoStrings -firstString "Lorem" -secondString "is" -importPath "C:\Temp\test.txt"

My test.txt file has the following text within it:

Lorem Ipsum is simply dummy text of the printing and typesetting industry.

So my result:

Ipsum


3 Comments

There are several limitations to note: Only one match is returned, whereas the question asks for all. Because it uses Get-Content without -Raw, your function inadvertently modifies the file's content before matching, by first turning it into a space-separated, single-line string. $firstString and $secondString, despite what the parameter names suggest, must be regex patterns, not literal strings. For simple strings that contain no regex metacharacters, such as in this case, that won't be a problem, but if you pass a literal string such as 'foo(', your function breaks.
As an aside: it's better to observe PowerShell's Verb-Noun naming convention -> GetStringBetweenTwoStrings -> Get-StringBetweenTwoStrings
I love this answer. I used it to make my own solution. Answer upvoted!
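The limitations raised in the first comment could be addressed roughly as follows. This is only a sketch: the Verb-Noun function name, the parameter names and the temporary test file are illustrative assumptions, not part of the original answer.

```powershell
# Hypothetical helper; the name and parameters are illustrative only.
function Get-StringBetweenTwoStrings {
    param(
        [string] $FirstString,
        [string] $SecondString,
        [string] $Path
    )

    # -Raw reads the whole file as one multiline string instead of an array of lines.
    $content = Get-Content -Raw -LiteralPath $Path

    # Escape both delimiters so they are matched literally, not as regex syntax.
    # (?s) lets "." match newlines; (.*?) keeps the match non-greedy.
    $pattern = '(?s)' + [regex]::Escape($FirstString) + '(.*?)' + [regex]::Escape($SecondString)

    # .Matches() returns every match; output capture group 1 of each.
    foreach ($m in [regex]::Matches($content, $pattern)) {
        $m.Groups[1].Value
    }
}

# Usage with a throwaway file whose delimiters contain regex metacharacters
$tmp = [System.IO.Path]::GetTempFileName()
Set-Content -LiteralPath $tmp -Value 'foo(one)bar', 'foo(two)bar'
$found = @(Get-StringBetweenTwoStrings -FirstString 'foo(' -SecondString ')bar' -Path $tmp)
Remove-Item -LiteralPath $tmp
$found
```

With that test file, $found should hold the two strings one and two.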
11

The quick answer is - change your greedy capture (.*) to non greedy - (.*?). That should do it.

customfield_11301(.*?)customfield_10730

Otherwise the capture will eat as much as it can, resulting in it continuing 'til the last customfield_10730.

Regards

1 Comment

With this approach, if the same pattern appears multiple times on a single line, it only returns the first occurrence. Any idea how to apply this to multiple occurrences on the same line?
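One way to get every occurrence on a line is to call [regex]::Matches() (plural) rather than [regex]::Match(). A minimal sketch, where the sample line and the A/B delimiters are made up for illustration:

```powershell
# Made-up sample line with three delimited values
$line = 'A1B and A22B and A333B'

# Matches() returns all non-overlapping matches; take capture group 1 of each.
$values = [regex]::Matches($line, 'A(.*?)B') | ForEach-Object { $_.Groups[1].Value }

$values
```

Here $values should contain 1, 22 and 333. Note that the [regex] class is case-sensitive by default, so the lowercase "a" in "and" does not start a match.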
5

You need to make your RegEx Lazy:

customfield_11301(.*?)customfield_10730

Live Demo on Regex101

Your regex was greedy. This means it finds customfield_11301 and then carries on until it reaches the very last customfield_10730.

Here is a simpler example of Greedy vs Lazy Regex:

# Regex (Greedy): \[(.*)\]
# Input:          [foo]and[bar]
# Output:         foo]and[bar

# Regex (Lazy):   \[(.*?)\]
# Input:          [foo]and[bar]
# Output:         "foo" and "bar" separately

Your regex was very similar to the first one: it captured too much. The new one captures the least amount of data possible and will therefore work as you intended.
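The same comparison can be run directly in PowerShell. This is just a sketch; the variable names are arbitrary, and the literal square brackets are escaped so they are not read as a character class:

```powershell
$inputText = '[foo]and[bar]'

# Greedy: .* runs to the LAST closing bracket.
$greedy = [regex]::Match($inputText, '\[(.*)\]').Groups[1].Value

# Lazy: .*? stops at the FIRST closing bracket, so there are two separate matches.
$lazy = [regex]::Matches($inputText, '\[(.*?)\]') | ForEach-Object { $_.Groups[1].Value }

$greedy   # foo]and[bar
$lazy     # foo, then bar
```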

1 Comment

Thank you kindly for your help, @ClasG answered a few minutes before you so I'll accept his as the answer. But thank you especially for the regex101 demo link, that really helped me understand what was happening.
2

The first issue is that Get-Content emits the file line by line rather than as a single string. You can pipe Get-Content to Out-String to get the entire content as one string and run the regex against that.

A working solution for your problem is:

Get-Content .\todays_changes.txt | Out-String | % {[Regex]::Matches($_, "(?<=customfield_11301)((.|\n)*?)(?=customfield_10730)")} | % {$_.Value}

And the output will be:

": [
  {
    "self": "xxxxxxxx",
    "value": "xxxxxxxxx",
    "id": "10467"
  }
],
"

": [
  {
    "self": "zzzzzzzzzzzzz",
    "value": "zzzzzzzzzzz",
    "id": "10467"
  }
],
"

1 Comment

Use Get-Content -Raw to get a file's entire content as a single, multiline string.
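To see the difference between the two ways of reading a file, here is a small sketch using a throwaway temp file (the file and its contents are assumptions made for the demo):

```powershell
# Create a two-line scratch file
$tmp = [System.IO.Path]::GetTempFileName()
Set-Content -LiteralPath $tmp -Value 'line one', 'line two'

$lines = Get-Content -LiteralPath $tmp        # array with one string per line
$raw   = Get-Content -LiteralPath $tmp -Raw   # one string, newlines preserved

$lineCount    = $lines.Count
$rawIsString  = $raw -is [string]
$spansNewline = $raw -match 'one\r?\nline'    # the newline is still in the string

Remove-Item -LiteralPath $tmp
$lineCount, $rawIsString, $spansNewline
```

This should print 2, True and True: the default read yields two separate lines, while -Raw yields a single string in which a regex can match across the line break.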
1

As an aside: Since your input appears to be JSON, you're normally better off parsing it into an object graph with ConvertFrom-Json, which you can easily query; however, your JSON appears to be nonstandard in that it contains duplicate property names.
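For well-formed JSON without duplicate property names, the object approach might look like the sketch below. The JSON fragment is a trimmed, wrapped-in-braces version of the sample from the question, which is an assumption about what the full file looks like:

```powershell
# A single, well-formed object modeled on the question's sample (assumed shape)
$json = '{ "customfield_11300": null, "customfield_11301": [ { "self": "xxxxxxxx", "value": "xxxxxxxxx", "id": "10467" } ], "customfield_10730": null }'

$obj = $json | ConvertFrom-Json

# Address the properties directly instead of using a regex
$value = $obj.customfield_11301[0].value
$id    = $obj.customfield_11301[0].id

$value   # xxxxxxxxx
$id      # 10467
```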


There's good information in the existing answers, but let me try to cover all aspects in a single answer:

tl;dr

# * .Matches() (plural) is used to get *all* matches
# * Get-Content -Raw reads the file *as a whole*, into a single, multiline string
# * Inline regex option (?s) makes "." match newlines too, to match *across lines*
# * (.*?) rather than (.*) makes the matching *non-greedy*.
# * Look-around assertions - (?<=...) and (?=...) - to avoid the need for capture groups.
[regex]::Matches(
  (Get-Content -Raw todays_changes.txt),
  '(?s)(?<="customfield_11301":).*?(?="customfield_10730")'
).Value

Output with your sample input:

 [
      {
        "self": "xxxxxxxx",
        "value": "xxxxxxxxx",
        "id": "10467"
      }
    ],
    
 [
      {
        "self": "zzzzzzzzzzzzz",
        "value": "zzzzzzzzzzz",
        "id": "10467"
      }
    ],    

For an explanation of the regex and the ability to experiment with it, see this regex101.com page


As for what you tried:

$pattern = "customfield_11301(.*)customfield_10730"

As has been noted, the primary problem with this regex is that (.*) is greedy and will keep matching until the last occurrence of customfield_10730 has been found; making it non-greedy, (.*?), solves that problem.

Additionally, this regex will not match across multiple lines, because . by default does not match newline characters (\n). The easiest way to change that is to place inline regex option (?s) at the start of the pattern, as shown above.
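A quick way to check the effect of (?s), using a made-up three-line string:

```powershell
# "`n" is PowerShell's escape sequence for a newline inside double quotes
$text = "start`nmiddle`nend"

# Without (?s), "." stops at the first newline, so there is no match.
$withoutS = [regex]::Match($text, 'start(.*?)end').Success

# With (?s), "." also matches newlines, so the match spans all three lines.
$withS = [regex]::Match($text, '(?s)start(.*?)end').Groups[1].Value.Trim()

$withoutS   # False
$withS      # middle
```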

It was only a lucky accident that still caused cross-line matching in your attempt, as explained next:

$string = Get-Content $importPath

This stores an array of strings in $string, with each element representing a line from the input file.

To read a file's content as a whole into a single, multiline string, use Get-Content's -Raw switch: $string = Get-Content -Raw $importPath

$result = [regex]::match($string, $pattern).Groups[1].Value

Since your $string variable contained an array of strings, PowerShell implicitly stringified it when passing it to the [string] typed input parameter of the [regex]::Match() method, which effectively created a single-line representation, because the array elements are joined with spaces (by default; you can specify a different separator with $OFS, but that is rarely done in practice).

For instance, the following two calls are - surprisingly - equivalent:

[regex]::Match('one two', 'e t').Value # -> 'e t'

# !! Ditto, because array @('one', 'two') stringifies to 'one two'
[regex]::Match(@('one', 'two'), 'e t').Value # -> 'e t'
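The space separator and the effect of $OFS can be seen with string interpolation. A small sketch; the script block is only there to keep the $OFS change local to the demo:

```powershell
$lines = @('one', 'two')

# Default: array elements are joined with a single space.
$default = "$lines"

# With $OFS set in a child scope, the separator changes for that scope only.
$custom = & { $OFS = ','; "$lines" }

$default   # one two
$custom    # one,two
```

If a specific separator is needed, the -join operator, as in $lines -join "`n", is usually the more explicit choice.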
