PowerShell extract by pattern

Question

I have a folder full of text files, and each file looks something like this:

# Mainline
apple
orange
banana
onion #small#

# lineA
orange
banana
watermelon
raisins #packed#
raisins #unpacked#

# lineB
chocolate
nuts
sugar
coffee

# lineC
lemon
honey
carrots
broccoli

All files start with # Mainline, but the order of the other blocks varies. Some files are missing lineA, some are missing lineC, some have lineB before lineA, and so on.

I'm trying to see if I can extract the text between each of the lines beginning with # and make them their own file.

i.e., file1_mainline would have

# Mainline
apple
orange
banana
onion #small#

file1_lineA would have

# lineA
orange
banana
watermelon
raisins #packed#
raisins #unpacked#

and so on. I've tried using

$file = get-content "filename"
$Mainstring = "# Mainline"
$lineAString = "# lineA"
$lineBString = "# lineB"
$lineCString = "# lineC"

$MainExt = "$Mainstring(.*?)$lineAstring"
$lineAExt = "$lineAstring(.*?)$lineAstring"
$lineBExt = "$lineBstring(.*?)$lineCstring"
$lineCExt = "$lineCstring(.*)"
[regex]::Match($file,$MainExt).Groups[1].value | out-file file1_main.txt
[regex]::Match($file,$lineAExt).Groups[1].value | out-file file1_lineA.txt
[regex]::Match($file,$lineBstring).Groups[1].value | out-file file1_lineB.txt
[regex]::Match($file,$lineCstring).Groups[1].value | out-file file1_lineC.txt

Along with the fact that there might be a simpler approach to deal with this all, I'm running into the following problems:

The files come from a Unix subsystem. I'm not sure whether that is the cause, but line breaks are not preserved in the resulting files.
The script breaks on files where the blocks appear in a different order.

I've searched here quite a bit, but I can't seem to put together working code. Any help is appreciated.

If you are trying to match text that spans several lines, you should use Get-Content -Raw. Also, if you expect .*? to match text across several lines, you need to be in single-line mode --> [regex]::Match($file,$MainExt,'SingleLine').Groups[1].value
– AdminOfThings, Commented Jan 27, 2021 at 13:40
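
A minimal sketch of the point made in this comment, using an inline sample string instead of a real file (the sample text and variable names are made up for illustration). It also suggests a likely reason why the line breaks went missing: without -Raw, Get-Content returns an array of lines, and passing an array where a string is expected joins the elements with spaces.

```powershell
# Hypothetical sample; stands in for what Get-Content -Raw would return.
$raw = "# Mainline`napple`norange`n`n# lineA`nbanana`n"

# What Get-Content WITHOUT -Raw would return: an array of lines.
$lines = $raw -split "`n"

# Passing the array to a [string] parameter joins the lines with spaces,
# so the newlines are gone before the regex ever runs.
$fromArray = [regex]::Match($lines, '# Mainline(.*?)# lineA').Groups[1].Value

# With the single multiline string, SingleLine lets . match newlines too.
$fromRaw = [regex]::Match($raw, '# Mainline(.*?)# lineA', 'SingleLine').Groups[1].Value
```

Here $fromArray contains no newline at all, while $fromRaw keeps the original line breaks.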

Mark Elvers · Accepted Answer · 2021-01-27 14:05:22Z

Why not make it totally generic? Don't search for specific blocks; just deal with the blocks as they appear, regardless of order. If the input is as you describe, scan through the file line by line, pick out the lines starting with #, and use the rest of that line to create the file name. Then output all following lines to that file until you hit the next # line. Something like this:

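foreach ($file in (Get-ChildItem *.txt)) {
    $c = Get-Content $file.FullName
    $filename = $null
    foreach ($line in $c) {
        # A line starting with '# ' opens a new block; the rest of the line names the output file.
        if ($line -match '^# (?<name>.*)') {
            $filename = "$($file.FullName.Substring(0, $file.FullName.Length - $file.Extension.Length))_$($Matches.name)$($file.Extension)"
        }
        # Append every line, the header included, to the current block's file.
        if ($filename) {
            Add-Content $filename $line
        }
    }
}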
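
The same line-by-line idea can be tried without touching the disk. The following is only a sketch under assumptions: $sample is a made-up array of lines, and the blocks are collected in an ordered hashtable ($blocks, a name chosen for illustration) instead of being written to files. Unlike the code above, it skips the blank separator lines.

```powershell
# Hypothetical sample lines, as Get-Content (without -Raw) would return them.
$sample = '# Mainline', 'apple', 'onion #small#', '', '# lineB', 'nuts', '', '# lineA', 'orange'

$blocks = [ordered]@{}   # block name -> list of lines, in order of appearance
$name = $null
foreach ($line in $sample) {
    if ($line -match '^# (?<name>.*)') {
        $name = $Matches.name
        $blocks[$name] = [System.Collections.Generic.List[string]]::new()
    }
    # Skip blank separator lines so they don't end up in a block.
    if ($name -and $line -ne '') {
        $blocks[$name].Add($line)
    }
}

# Preview each block on one line.
foreach ($key in $blocks.Keys) {
    "[$key] " + ($blocks[$key] -join ' | ')
}
```

Each entry could then be written out with Set-Content, which overwrites rather than appends, so rerunning the script would not duplicate lines the way Add-Content does.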

mklement0 · Accepted Answer · 2021-01-27 15:33:45Z

The easiest, but not the most obvious solution:

Use Get-Content -Raw to read the entire file at once.
Use -split, the string splitting operator, to split the input file's content into the blocks of interest in a single operation.
- The solution below relies on the following -split behavior: the regex you pass specifies the separator (not the data to match), but if you enclose (part of) that separator in (...), a capture group, the captured text is included in the output array too; any resulting empty elements can easily be filtered out with -ne ''.
Loop over all blocks, use -split again to extract the file name from each block's first line, and save each block to a file with that name using Set-Content.
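
To illustrate the -split behavior described in the steps above, here is a small sketch on made-up strings (the variable names are for illustration only):

```powershell
# Without a capture group, the separator is consumed and dropped.
$plain = 'a1b2c' -split '\d'              # a, b, c

# With a capture group, the captured separator text is kept in the output.
$kept = 'a1b2c' -split '(\d)'             # a, 1, b, 2, c

# If the separator sits at the very start, the first element is empty;
# -ne '' with an array on the left acts as a filter and removes it.
$filtered = '1a2b' -split '(\d)' -ne ''   # 1, a, 2, b
```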

This approach relies solely on the format of the blocks, not on their content, so you don't have to depend on a specific ordering; the block format is:

A block starts with a #-prefixed line containing a (file) name, followed by any number of non-empty lines.
A new block starts after an empty line (effectively, two consecutive newlines, \n\n).

$file = 'in.txt'

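# (?s) lets . match across lines; (?m) lets ^ match at the start of every line,
# which is needed to find the blocks after the first one.
foreach (
  $block in (Get-Content -Raw $file) -split '(?sm)^(#.+?)(?:\n\n|\n?\z)' -ne ''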
) {

  # Extract the file name from the block's first line.
  $fileName = 'file_' + ($block -split '\n')[0] -replace '[#\s]'

  # Preview the result:
  Write-Host @"
Writing to [$fileName]:
[$block]

"@

  # Uncomment this to actually save to files.
  # Set-Content $fileName -Value $block

}

Note: To make the solution work with CRLF (Windows-style) newlines too, use \r?\n instead of \n.
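
A sketch of what that substitution could look like, run against a made-up in-memory string with CRLF newlines instead of a file. It uses both the s and m inline options so that ^ matches at the start of every line; $text and $blocks are names chosen for illustration.

```powershell
# Hypothetical sample with CRLF newlines; stands in for Get-Content -Raw output.
$text = "# Mainline`r`napple`r`nonion #small#`r`n`r`n# lineB`r`nnuts`r`n"

# m makes ^ match at every line start, s lets . span lines;
# \r?\n accepts both LF and CRLF newlines.
$blocks = $text -split '(?sm)^(#.+?)(?:\r?\n\r?\n|(?:\r?\n)?\z)' -ne ''

foreach ($block in $blocks) {
    # [#\s] also strips a trailing CR from the header line.
    $name = ($block -split '\r?\n')[0] -replace '[#\s]'
    "$name has $(@($block -split '\r?\n').Count) lines"
}
```

Because the quantifier is lazy, each captured block ends before the CR of its last line, so no stray carriage return is left at the end of a block.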

With your sample input, the above yields:

Writing to [file_Mainline]:
[# Mainline
apple
orange
banana
onion #small#]

Writing to [file_lineA]:
[# lineA
orange
banana
watermelon
raisins #packed#
raisins #unpacked#]

Writing to [file_lineB]:
[# lineB
chocolate
nuts
sugar
coffee]

Writing to [file_lineC]:
[# lineC
lemon
honey
carrots
broccoli]

As for what you tried:

As AdminOfThings points out, the SingleLine regex option is required in order for metacharacter . to match across lines; by default, . matches everything but a (LF-only) newline [1].

There are two ways to specify regex-matching options:

As an extra argument to a [regex] constructor or static method, specifying the option(s) to apply, as demonstrated by AdminOfThings with [regex]::Match(); in this case, the specified options invariably apply to the entire regex; e.g.:

# Thanks to SingleLine, .+ matches the entire multiline string.
PS> [regex]::new('.+', 'SingleLine').Match("foo`nbar").Value
foo
bar

Inline, inside a (?<option-letters>) construct, where s is the option letter representing SingleLine, as used in the solution above and demonstrated in the following example; while this construct is often placed at the start of the regex so as to apply to the entire regex, it can be applied to parts of a regex (it takes effect for the remainder of the enclosing (sub)expression or until explicitly deactivated with another construct in which the option is negated with -, e.g., (?-s)):

# Ditto
PS> [regex]::new('(?s).+').Match("foo`nbar").Value
foo
bar

Given that it is usually not necessary in PowerShell to work directly with the [regex] type, the inline syntax is a way to apply options in combination with PowerShell's regex-based operators, such as -match, -replace and -split.
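
For instance, a short sketch of the inline s option used with -match and -replace on a made-up two-line string (variable names are for illustration only):

```powershell
$text = "foo`nbar"

# Without (?s), . stops at the newline, so only the first line is captured.
$null = $text -match '^(.+)'
$firstLine = $Matches[1]                    # foo

# With (?s), . also matches the newline, so the capture spans both lines.
$null = $text -match '(?s)^(.+)'
$both = $Matches[1]                         # foo<LF>bar

# -replace with (?s): the whole multiline string is one match.
$replaced = $text -replace '(?s).+', 'X'    # X
```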

[1] A notable pitfall is that . does match CR (\r) by default, so in an input string that has Windows-style CRLF newlines, something like .* will match the first line with a trailing CR; e.g.:
[regex]::new('.*').Match("foo`r`nbar").Value -replace '\r', '<CR>' yields foo<CR>.
