2

I have extracted records from a database and stored them on a text-only HTML page. Each record sits in a <p> paragraph, its fields are separated by <br /> line breaks, and records are separated from each other by an <hr> rule. For example:

Company Name<br/>
555-555-555<br />
Address Line 1<br />
Address Line 2<br />
Website: www.example.com<br />

I just need to place these records into a CSV file. I used fputcsv in combination with array() and file_get_contents(), but it wrote the entire source code of the web page into the .csv file, and a lot of data was missing as well. There are multiple records stored in the same format: after each record block like the one above, there is an <hr> tag separating it from the next. I want to read the company name into the Name column, the phone number into the Phone column, the address lines into the Address column and the website into the Website column, as shown below.

https://i.sstatic.net/00Gxw.png
How can I do this?

Snippet of the HTML:

            1 Stop Signs<br />
            480-961-7446<br />
500 N. 56th Street<br />
        Chandler, AZ  85226<br />

<br />
                Website: www.1stopsigns.com<br />
            <br />
            </p><br /><hr><br />

It's spaced like this in the source of the HTML.

6
  • If you have the database, why don't you generate the csv from there, why parse the html? Commented Feb 16, 2012 at 23:37
  • I don't have the database; the credentials are lost, so I had to extract the records, and they must be placed into a .csv file. Commented Feb 16, 2012 at 23:38
  • @Brinard when you say the credentials are lost for the database, is this script still running live on a server? If so, can you gain access to the source code where the database connection strings are, and get the username and password from there? Commented Feb 16, 2012 at 23:52
  • @cb1: I get the feeling that this is pulling data off of someone else's webpage, not one that Brinard has access to, in order to create a marketing list. Commented Feb 16, 2012 at 23:55
  • Why would I do that? The data belongs to a client of the company I work for. @cb1 I already posted a question about this; the script's source code is written in ASP.NET and, strangely, the credentials weren't supplied in the script. Commented Feb 17, 2012 at 0:05

3 Answers

3

Assuming that your data follows a pattern where every record is separated by a <hr> tag and every field within is separated by a <br /> then you should be able to split out the data.

There are loads of ways to do this, but a naive approach using explode() might look something like this:

// open a file pointer to csv
$fp = fopen('records.csv', 'w');

// first, split each record into a separate array element
$records = explode('<hr>', $str);

// then iterate over this array
foreach ($records as $record) {

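    // strip tags and trim enclosing whitespace
    $stripped = trim(strip_tags($record));

    // skip empty chunks, e.g. whatever follows the last <hr>
    if ($stripped === '') {
        continue;
    }

    // split on any line ending (\n, \r\n or \r), not only PHP_EOL
    $fields = preg_split('/\R/', $stripped);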

    // array walk over each field and trim whitespace
    array_walk($fields, function(&$field) {
        $field = trim($field);
    });

    // create row
    $row = array(
        $fields[0], // name
        $fields[1], // phone
        sprintf('%s, %s', $fields[2], $fields[3]), // address
        $fields[6], // web
    );

    // write cleaned array of fields to csv
    fputcsv($fp, $row);
}

// done
fclose($fp);

Where $str is the page data you are parsing. Hope this helps.

EDIT

Didn't notice the specific field requirements originally. Updated the example.
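If the page does not put each field on its own line, splitting on line endings leaves the whole record in a single element. A hedged variant is to split on the <br /> tags themselves and drop empty pieces, so the result no longer depends on how the source is spaced. This is only a sketch: recordToRow() is a hypothetical helper name, and it assumes the field order from the question's snippet (name, phone, address lines, then a line starting with "Website:").

```php
<?php
// Hypothetical helper (not part of the original answer): turns the HTML of
// one record into array(name, phone, address, website).
// Assumes the field order shown in the question's snippet.
function recordToRow($recordHtml)
{
    // split on <br>, <br/> or <br />, regardless of case or line endings
    $parts = preg_split('~<br\s*/?>~i', $recordHtml);

    // strip leftover tags, trim whitespace and drop empty pieces
    $fields = array();
    foreach ($parts as $part) {
        $text = trim(strip_tags($part));
        if ($text !== '') {
            $fields[] = $text;
        }
    }

    // nothing usable, e.g. the chunk after the last <hr>
    if (count($fields) < 2) {
        return null;
    }

    $name  = array_shift($fields);
    $phone = array_shift($fields);

    // the website line is recognised by its "Website:" prefix;
    // everything else is treated as an address line
    $website = '';
    $address = array();
    foreach ($fields as $field) {
        if (stripos($field, 'Website:') === 0) {
            $website = trim(substr($field, strlen('Website:')));
        } else {
            $address[] = $field;
        }
    }

    return array($name, $phone, implode(', ', $address), $website);
}
```

Inside the foreach loop above, the row could then be built with $row = recordToRow($record); and written with fputcsv() whenever $row is not null.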


7 Comments

I probably should note that array_walk() uses an anonymous function in the above example, which assumes 5.3 :)
When I used this method, it placed a comma (,) in every row in the .csv file. What happened? $str = file_get_contents("records.html");
Hey there. I didn't actually test this :s and there was a typo. Plus it could have worked better. I modified my example above, so it should work better now.
OK, I'm trying it out again, thanks.
Your answer was very informative and helped me include snippets in my final output, though the fields did not split into the variables as you have them here, e.g. $field[2]. Only $field[0] was set, and it held the entire line with all the data. Thanks again though!
2

Assuming the HTML shown above is well formed, I would approach this problem in two phases. First, clean up the HTML text so the information is easier to export or manage: keep the items you want to save and remove the ones you know you won't need.

$html = preg_replace("|[ \t]{2,}|"," ",$html); // collapse runs of spaces and tabs, keeping line breaks
$html = preg_replace("|\n{2,}|","\n",$html); // collapse consecutive newlines into one
$html = preg_replace("|<br\s*/?>|i","##",$html); // replace every <br>, <br/> or <br /> with a ## marker

Then you'll have cleaner HTML to work with, similar to this:

1 Stop Signs##
480-961-7446##
500 N. 56th Street##
Chandler, AZ  85226##
Website: www.1stopsigns.com##
##
</p>##<hr>##

Second, you can now explode the text into fields, or implode those fields into comma-separated values to form a CSV line:

// the fields to work with end up in the array called $csv_parts
$csv_parts = explode("##",$html);

// imploding gives a comma-separated line similar to 1 Stop Signs,480-961-7446,..
$csv = implode(",",$csv_parts);

Now you have two ways to work with the HTML: extracting the individual fields or exporting the CSV.


Hope this helps or gives you an idea for developing what you need.
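One caveat with a plain implode(",") is that values which themselves contain a comma, such as "Chandler, AZ 85226", spill into extra columns. A minimal sketch of the difference, using fputcsv() to quote such fields, follows; rowToCsvLine() is a hypothetical helper name introduced only for illustration, and the sample values come from the question's snippet.

```php
<?php
// Hypothetical helper (not part of the original answer): formats one row as a
// CSV line, letting fputcsv() add quotes around fields that need them.
function rowToCsvLine(array $row)
{
    // write into an in-memory stream so the line can be read back as a string
    $handle = fopen('php://memory', 'w+');
    fputcsv($handle, $row, ',', '"', '\\');
    rewind($handle);
    $line = stream_get_contents($handle);
    fclose($handle);

    // fputcsv() appends a newline; remove it for single-line use
    return rtrim($line, "\n");
}

$parts = array('1 Stop Signs', '480-961-7446', 'Chandler, AZ 85226');

// naive join: the comma inside the address is read as a field separator
$naive = implode(',', $parts);

// quoted join: the address stays in one column when the line is parsed again
$quoted = rowToCsvLine($parts);
```

Parsing $naive back with str_getcsv() yields four columns instead of three, while $quoted round-trips to the original three values.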

1 Comment

This ultimately helped me come to a solution, though I had to do a lot of manual editing in Microsoft Excel. The preg_replace functions were the most important part. I replaced the tags with _ instead of ##.
2

By far the easiest way would be to simply take the block, drop everything from the <hr> tag forward then split the string as a string array on the <br /> tags.
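The suggestion above can be sketched roughly as follows. This is an illustration under assumptions, not a tested solution for the real page: splitBlock() is a hypothetical helper name, $block is assumed to hold the HTML of one record, and the tags are assumed to be written exactly as <hr> and <br />.

```php
<?php
// Hypothetical helper (not part of the original answer): drops everything
// from the first <hr> onward, then splits what is left on <br /> tags.
function splitBlock($block)
{
    // keep only the part before the first <hr>; if there is none, keep it all
    $head = strstr($block, '<hr>', true);
    if ($head === false) {
        $head = $block;
    }

    // split on the <br /> tags, then strip remaining tags and whitespace
    $pieces = explode('<br />', $head);
    $fields = array();
    foreach ($pieces as $piece) {
        $text = trim(strip_tags($piece));
        if ($text !== '') {
            $fields[] = $text;
        }
    }

    return $fields;
}
```

The returned array holds one trimmed string per non-empty field, in the order they appear in the block.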

2 Comments

This is a very good approach. I have tried all the other answers given here, but to no avail, even with my own tweaking. How could I go about using your suggestion?
@Brinard: Darragh's answer is pretty close to perfect and is what I was suggesting. If that isn't working then I'd bet the sample html you posted isn't exactly representative of what you are trying to run against. You might update your question with a link to the page you are trying to parse and let Darragh know.
