PHP & Regular Expressions

Question

I'm trying to format a block of text using regular expressions. I'll explain what I have so far and then describe my problem.

I have a text file, with some narrative text.

VOLUME I



CHAPTER I


Lorem Ipsum is simply dummy text of the printing and typesetting industry. 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type 

It was popularised in the 1960s with the release of Letraset sheets containing 
Lorem Ipsum passages, and more recently with desktop publishing software like 
Aldus PageMaker including versions of Lorem Ipsum.


VOLUME II



CHAPTER II


Lorem Ipsum is simply dummy text of the printing and typesetting industry. 
It has survived not only five centuries, but also the leap into electronic 
typesetting, remaining essentially unchanged. 

It was popularised in the 1960s with the release of Letraset sheets 
containing Lorem Ipsum passages, and more recently with desktop 
publishing software like Aldus PageMaker including versions of Lorem Ipsum.

...
...

It has multiple VOLUMES and CHAPTERS, and needs to be formatted by PHP to look like it does in the text file, with appropriate spacing.

First, I call this formatting function to handle some whitespace cleanup.

<?php    
function formatting($AStr)
{
    return preg_split('/[\r\n]{2,}/', trim($AStr));        
}    
?>

Then, I call the file and continue attempting to format.

<!DOCTYPE html>
<html>
  <head>
    <title></title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <link rel="stylesheet" type="text/css" href="style.css" />
  </head>
<body>

<h1>Jane Austen</h1>

<h2>Emma</h2>

<?php

require_once 'format.inc.php';

$p = file_get_contents('emma.txt');

$p = formatting($p);

/*
foreach ($p as $l) {
    $l = trim($l);
    preg_replace('/(VOLUME +[IVX]+)/', "jjj", $l);
    $volumePattern = '/(VOLUME +[IVX]+)/';
    $chaperPattern = '/(CHAPTER +[IVX]+)/';
    $l = str_replace("\r\n", ' ', $l);

    if (preg_match('/(VOLUME +[IVX]+)/', $l, $m)) {
        echo '<h3>' . $m[1] . '</h3>';
    }
    if (preg_match('/(CHAPTER +[IVX]+)/', $l, $m)) {
        echo '<h3>' . $m[1] . '</h3>';
    }
    preg_replace('/(VOLUME +[IVX]+)/', "jjj", $l);
    echo $l . "\n";
}*/

foreach ($p as $l) {
    //$l = trim($l);
    //$l = str_replace("[\r\n]", '\n', $l);
    if (preg_match('/[\.\w]/', $l, $m)) {
        echo "\n";
    }
    if (preg_match('/(VOLUME +[IVX]+)/', $l, $m)) {
        echo '<h3>' . $m[1] . '</h3>';
    }
    $l = preg_replace('/(VOLUME +[IVX]+)/', '', $l);
    if (preg_match('/(CHAPTER +[IVX]+)/', $l, $m)) {
        echo '<h3>' . $m[1] . '</h3>';
    }
    $l = preg_replace('/(CHAPTER +[IVX]+)/', '', $l);
    echo $l . "\n";
}


?>

</body>
</html>

The issue is that I can't get the whitespace (a newline) between paragraphs to print. I tried using these lines, without success:

if (preg_match('/[\.\w]/', $l, $m)) {
    echo "\n";
}

Meaning it prints only VOLUME 1, CHAPTER 1 and the content of chapter 1, or does it keep repeating the same thing up to the final chapter?
– xkeshav, Commented Aug 24, 2012 at 10:22

DaveRandom · Accepted Answer · 2012-08-24 10:46:52Z

3

This might be massively over-simplified, but can't you just do this?

<!DOCTYPE html>
<html>
  <head>
    <title></title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <link rel="stylesheet" type="text/css" href="style.css" />
  </head>
<body>

<h1>AUTHOR NAME</h1>

<h2>TITLE</h2>

<?php

  $p = file_get_contents('emma.txt');
  echo preg_replace('/^\s*((?:VOLUME|CHAPTER)\s+[IVX]+)\s*$/im', '<h3>$1</h3>', $p); 

?>

</body>
</html>

EDIT

To also wrap the body paragraphs in <p></p> (assuming there are no new lines in a paragraph) try this:

<!DOCTYPE html>
<html>
  <head>
    <title></title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <link rel="stylesheet" type="text/css" href="style.css" />
  </head>
<body>

<h1>AUTHOR NAME</h1>

<h2>TITLE</h2>

<?php

  $p = file_get_contents('emma.txt');
  echo preg_replace_callback('/^\s*(?:(?P<header>(?:VOLUME|CHAPTER)\s+[IVX]+)|(?P<body>.+))\s*$/im', function($matches) {
    if (!empty($matches['body'])) {
      return '<p>'.htmlspecialchars($matches['body']).'</p>';
    } else {
      return '<h3>'.htmlspecialchars($matches['header']).'</h3>';
    }
  }, $p);

?>

</body>
</html>

See it working

edited Aug 24, 2012 at 10:46

answered Aug 24, 2012 at 10:22

DaveRandom


5 Comments

xkeshav Over a year ago

+1, neat solution. i is for case-insensitive, but what are s and m for in /ism?

Sammaye Over a year ago

@diEcho They are flags: xregexp.com/flags basically for doing global searches across all text

DaveRandom Over a year ago

i makes it case-insensitive, s makes . match newlines (which in retrospect is not required for this) and m means that the ^ and $ assertions match at the start and end of every line, not just the start/end of the subject string. php.net/manual/en/reference.pcre.pattern.modifiers.php
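A small sketch of the three modifiers described above. The sample string is made up for illustration and is not from emma.txt.

```php
<?php
// Hypothetical sample text, used only to show the effect of each modifier.
$text = "volume i\nFirst line.\nSecond line.";

// Without "m", ^ and $ anchor only at the start and end of the whole string,
// and "." does not cross newlines, so a full-line pattern finds nothing here.
$withoutM = preg_match_all('/^.+$/', $text, $a);

// With "m", ^ and $ also match around each line break: one match per line.
$withM = preg_match_all('/^.+$/m', $text, $b);

// With "s", "." also matches newlines, so .+ swallows the entire string.
$withS = preg_match_all('/^.+$/s', $text, $c);

// With "i", the lowercase "volume i" is accepted by an uppercase pattern.
$withI = preg_match('/^VOLUME\s+[IVX]+$/im', $text, $d);

var_dump($withoutM, $withM, $withS, $withI);
```

For the heading pattern in this answer, only i and m matter: each heading sits on its own line, so the dot never has to cross a newline.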

Edge Over a year ago

Is it possible to also separate the paragraphs? For example, the last paragraph ends with 'vex her.' Would it be possible to create new lines like the text file has?

DaveRandom Over a year ago

@Andrew You mean to wrap each \r\n\r\n separated block that is not a volume/chapter in <p></p>?
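Browsers collapse newlines in HTML source, so echoing "\n" between paragraphs has no visible effect; the separation has to come from markup such as <p>. Below is a rough sketch of wrapping each blank-line-separated block, assuming paragraphs are hard-wrapped as in the question's sample. The function name formatBook() and the sample string are hypothetical, not part of the original answer.

```php
<?php
// Sketch only: formatBook() and $sample are illustrative names, not from the answer.
function formatBook(string $text): string
{
    // Normalise CRLF and lone CR to LF so blank lines are easy to detect.
    $text = str_replace(["\r\n", "\r"], "\n", trim($text));

    // Split on blank lines: a newline followed by one or more (possibly
    // space-padded) empty lines.
    $blocks = preg_split('/\n(?:[ \t]*\n)+/', $text);

    $html = [];
    foreach ($blocks as $block) {
        // Collapse the hard line wraps inside a block into single spaces.
        $block = trim(preg_replace('/\s+/', ' ', $block));
        if ($block === '') {
            continue;
        }
        if (preg_match('/^(?:VOLUME|CHAPTER)\s+[IVX]+$/i', $block)) {
            $html[] = '<h3>' . htmlspecialchars($block) . '</h3>';
        } else {
            $html[] = '<p>' . htmlspecialchars($block) . '</p>';
        }
    }

    // One element per line keeps the generated source readable.
    return implode("\n", $html);
}

$sample = "VOLUME I\r\n\r\n\r\n\r\nCHAPTER I\r\n\r\n\r\nFirst line,\r\nsecond line.\r\n\r\nNext paragraph.";
echo formatBook($sample), "\n";
```

Splitting first and classifying each block afterwards keeps the regular expressions short, at the cost of holding the whole file in memory, which is already the case with file_get_contents().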

xavier Z · Accepted Answer · 2012-08-24 10:27:30Z

1

You have several errors. First, in the 'formatting' function the regexp must be:

function formatting($AStr)
{
    return preg_split('/[\r\n]{2,}/', trim($AStr));        
}

Next, note that preg_replace does not take the subject by reference, so you must assign the function's return value back to your variable:

foreach ($p as $l) {
    $l = trim($l);
    // No effect as written: preg_replace() returns the new string and the result is discarded here.
    preg_replace('#VOLUME\s+[A-z]+#Ui', "jjj", $l);
    $l = str_replace("\r\n", ' ', $l);
    if (preg_match('/(VOLUME +[IVX]+)/', $l, $m)) {
        echo '<h3>' . $m[1] . '</h3>';
    }
    $l = preg_replace('/(VOLUME +[IVX]+)/', '', $l);
    if (preg_match('/(CHAPTER +[IVX]+)/', $l, $m)) {
        echo '<h3>' . $m[1] . '</h3>';
    }
    $l = preg_replace('/(CHAPTER +[IVX]+)/', '', $l);
    echo $l . "\n";
}
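A minimal illustration of the point about return values, using a made-up one-line subject:

```php
<?php
// Hypothetical one-line sample.
$line = 'VOLUME I';

// The return value is discarded, so $line is left untouched.
preg_replace('/VOLUME +[IVX]+/', 'jjj', $line);
$unchanged = $line;

// Assigning the return value is what actually applies the replacement.
$changed = preg_replace('/VOLUME +[IVX]+/', 'jjj', $line);

var_dump($unchanged, $changed);
```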

answered Aug 24, 2012 at 10:27

xavier Z


4 Comments

Sammaye Over a year ago

[\r\n] will match \, r and n, not \r\n. (SO's comment handling fails here: I can't add a backslash in code formatting.)

xavier Z Over a year ago

Sammaye Over a year ago

Hmm, it actually doesn't match anything; using spaweditor.com/scripts/regex/index.php with /[\r\n]{2,}/ and the text \r\n, it returns an empty array. I also tried with regexpal.
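For what it's worth, here is a quick check of what /[\r\n]{2,}/ does in PHP itself when the subject contains real CR and LF characters. The sample string is hypothetical, and whether emma.txt actually uses Windows (CRLF) line endings is an assumption.

```php
<?php
// Hypothetical sample with CRLF line endings: two wrapped lines of one
// paragraph, a blank line, then a second paragraph.
$text = "line one\r\nline two\r\n\r\nparagraph two";

// In a single-quoted PHP string \r and \n reach PCRE unchanged, and PCRE reads
// them as CR and LF. [\r\n] is therefore "one CR or one LF", and a single CRLF
// already counts as two such characters, so {2,} splits on every line break.
$loose = preg_split('/[\r\n]{2,}/', $text);

// Treating CRLF (or a bare LF) as one unit splits only on blank lines.
$strict = preg_split('/(?:\r?\n){2,}/', $text);

print_r($loose);
print_r($strict);
```

If the file does use CRLF, the first pattern would turn every line into its own array element, which would make the paragraph boundaries impossible to tell apart later in the loop.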

Edge Over a year ago

Is there any way I can also keep the newlines in between each paragraph?
