Creating a CSV file from an HTML page

Question

I have extracted records from a database and stored them on an HTML page that contains only text. Each record sits in a <p> paragraph, its fields are separated by <br /> line breaks, and the records themselves are separated by an <hr> rule. For example:

Company Name<br/>
555-555-555<br />
Address Line 1<br />
Address Line 2<br />
Website: www.example.com<br />

I just need to place these records into a CSV file. I used fputcsv in combination with array() and file_get_contents(), but it read the entire source code of the webpage into the .csv file, and a lot of data was missing as well. There are multiple records stored in the same format; each record block, as seen above, is separated from the next by an <hr> tag. I want to read the company name into the Name column, the phone number into the Phone column, the address lines into the Address column and the website into the Website column, as shown below.

https://i.sstatic.net/00Gxw.png
How can I do this?

Snippet of the HTML:

            1 Stop Signs<br />
            480-961-7446<br />
500 N. 56th Street<br />
        Chandler, AZ  85226<br />

<br />
                Website: www.1stopsigns.com<br />
            <br />
            </p><br /><hr><br />

It's spaced like this in the source of the HTML.

If you have the database, why don't you generate the CSV from there instead of parsing the HTML?
– Maerlyn, Commented Feb 16, 2012 at 23:37
I don't have the database; the credentials are lost, so I had to extract the records, and they must be placed into a .csv file.
– Tower, Commented Feb 16, 2012 at 23:38
@Brinard when you say the credentials are lost for the database, is this script still running live on a server? If so, can you gain access to the source code where the database connection strings are, and get the username and password from there?
– cb1, Commented Feb 16, 2012 at 23:52
@cb1: I get the feeling that this is pulling data off of someone else's webpage, and not one that Brinard has access to, in order to create a marketing list.
– ChrisLively, Commented Feb 16, 2012 at 23:55
Why would I do that? The data belongs to a client of the company I work for. @cb1 I posted a question on this already. The source code of the script is written in ASP.NET, but the credentials weren't supplied in the script, which was strange.
– Tower, Commented Feb 17, 2012 at 0:05

Darragh Enright · Accepted Answer · 2012-02-17 10:20:51Z

3

Assuming that your data follows a pattern where every record is separated by a <hr> tag and every field within is separated by a <br /> then you should be able to split out the data.

There are loads of ways to do this, but a naive approach using explode() might look something like this:

// open a file pointer to csv
$fp = fopen('records.csv', 'w');

// first, split each record into a separate array element
$records = explode('<hr>', $str);

// then iterate over this array
foreach ($records as $record) {

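    // strip tags and trim enclosing whitespace
    $stripped = trim(strip_tags($record));

    // skip empty chunks, such as the one after the final <hr>
    if ($stripped === '') {
        continue;
    }

    // split on any newline style (\r\n, \r or \n), not only this platform's PHP_EOL
    $fields = preg_split('/\r\n|\r|\n/', $stripped);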

    // array walk over each field and trim whitespace
    array_walk($fields, function(&$field) {
        $field = trim($field);
    });

    // create row
    $row = array(
        $fields[0], // name
        $fields[1], // phone
        sprintf('%s, %s', $fields[2], $fields[3]), // address
        $fields[6], // web
    );

    // write cleaned array of fields to csv
    fputcsv($fp, $row);
}

// done
fclose($fp);

Where $str is the page data you are parsing. Hope this helps.
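Fixed indexes such as $fields[6] only line up when every record has exactly the same number of blank lines, which may not hold for the whole page. As a rough sketch of a more forgiving variant (recordToRow() is a hypothetical helper name, the sample string is modelled on the snippet in the question, and I'm assuming every website line starts with "Website:"), you could drop the empty lines and recognise the website by its prefix:

```php
<?php
// Hypothetical helper (not part of the original answer):
// turn one <hr>-delimited chunk of HTML into a CSV row, or null if it holds no record.
function recordToRow($record)
{
    // Remove tags, then split on \r\n, \r or \n so the file's newline style does not matter.
    $lines = preg_split('/\r\n|\r|\n/', strip_tags($record));

    // Trim every line and drop the empty ones.
    $lines = array_values(array_filter(array_map('trim', $lines), 'strlen'));

    // Fewer than two lines cannot hold a name and a phone number.
    if (count($lines) < 2) {
        return null;
    }

    $name  = array_shift($lines);
    $phone = array_shift($lines);

    // The remaining lines are either the website line or part of the address.
    $website = '';
    $address = array();
    foreach ($lines as $line) {
        if (stripos($line, 'Website:') === 0) {
            $website = trim(substr($line, strlen('Website:')));
        } else {
            $address[] = $line;
        }
    }

    return array($name, $phone, implode(', ', $address), $website);
}

// Sample input modelled on the snippet in the question.
$str = "            1 Stop Signs<br />\n"
     . "            480-961-7446<br />\n"
     . "500 N. 56th Street<br />\n"
     . "        Chandler, AZ  85226<br />\n"
     . "\n"
     . "<br />\n"
     . "                Website: www.1stopsigns.com<br />\n"
     . "            <br />\n"
     . "            </p><br /><hr><br />\n";

$rows = array();
foreach (explode('<hr>', $str) as $record) {
    $row = recordToRow($record);
    if ($row !== null) {
        $rows[] = $row;
    }
}

// php://memory keeps the sketch off the disk; use fopen('records.csv', 'w') for a real file.
$fp = fopen('php://memory', 'w+');
// Delimiter, enclosure and escape character are passed explicitly.
fputcsv($fp, array('Name', 'Phone', 'Address', 'Website'), ',', '"', '\\');
foreach ($rows as $row) {
    fputcsv($fp, $row, ',', '"', '\\');
}
rewind($fp);
echo stream_get_contents($fp);
fclose($fp);
```

Records that are missing a name or phone line would still end up in the wrong columns, so it is worth spot-checking the output against the page.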

EDIT

Didn't notice the specific field requirements originally. Updated the example.

edited Feb 17, 2012 at 10:20

answered Feb 17, 2012 at 0:03

Darragh Enright


7 Comments

Darragh Enright Over a year ago

I probably should note that array_walk() uses an anonymous function in the above example, which assumes 5.3 :)
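If the anonymous function is a problem, the same trimming can be done without one. A minimal sketch, assuming the fields are already split into an array (the sample values here are made up for illustration):

```php
<?php
// Sample fields with stray whitespace, as they might look after splitting a record.
$fields = array("   1 Stop Signs ", "\t480-961-7446", "500 N. 56th Street  ");

// array_map() accepts the name of the built-in trim() as a callback,
// so no anonymous function is needed.
$fields = array_map('trim', $fields);

print_r($fields);
```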

Tower Over a year ago

When I used this method, it placed a comma (,) in every row in the .csv file. What happened? $str = file_get_contents("records.html");

Darragh Enright Over a year ago

Hey there. I didn't actually test this :s and there was a typo. Plus it could have worked better. I modified my example above, so it should work better now.

Tower Over a year ago

OK, I'm trying it out again. Thanks!

Tower Over a year ago

Your answer was very informative and helped me include snippets in my final output, though the fields did not split into separate elements as you have them here (e.g. $fields[2]). Only $fields[0] was set, and it held the entire line of data. Thanks again though!


Tony · Accepted Answer · 2012-02-17 00:03:04Z

2

Assuming the HTML shown above is well formed, I would approach this problem in two phases. First, clean up the HTML text so the information is easier to export or manage: keep the items you want to save and delete those you know you won't need.

$html = preg_replace("|\s{2,}|si"," ",$html); // collapse runs of unnecessary whitespace
$html = preg_replace("|\n{2,}|si","\n",$html); // convert multiple line breaks into one
$html = preg_replace("|<br />|si","##",$html); // replace each <br /> tag with a ## marker

Then you'll have cleaner HTML to work with, similar to this:

1 Stop Signs##
480-961-7446##
500 N. 56th Street##
Chandler, AZ  85226##
Website: www.1stopsigns.com##
##
</p>##<hr>##

Second, you can now explode the fields into an array, or implode them into a comma-separated string to form a CSV line:

// here you'll have the fields to work with into the array called $csv_parts
$csv_parts = explode("##",$html);

// imploding, so there you have the formatted csv similar to 1 Stop Signs,480-961-7446,..
$csv = implode(",",$csv_parts);

Now you have two ways to work with the HTML: extracting the fields or exporting the CSV.

Hope this helps or gives you an idea for developing what you need.
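One thing to watch with a plain implode(): a field that itself contains a comma, such as "Chandler, AZ  85226", gets read back as two columns. A small sketch of the difference, assuming $csv_parts already holds one record's cleaned fields (the values are taken from the question's sample):

```php
<?php
// Fields as they might come out of the explode("##", ...) step for one record.
$csv_parts = array('1 Stop Signs', '480-961-7446', '500 N. 56th Street', 'Chandler, AZ  85226');

// Plain implode(): the comma inside the last field looks like a delimiter.
$naive = implode(',', $csv_parts);

// fputcsv() wraps such fields in quotes; php://memory avoids touching the disk.
$fp = fopen('php://memory', 'w+');
fputcsv($fp, $csv_parts, ',', '"', '\\');
rewind($fp);
$quoted = stream_get_contents($fp);
fclose($fp);

// Parse both lines back and compare how many columns a CSV reader sees.
$naiveCount  = count(str_getcsv($naive, ',', '"', '\\'));
$quotedCount = count(str_getcsv(trim($quoted), ',', '"', '\\'));

echo "implode: $naiveCount columns, fputcsv: $quotedCount columns\n";
```

With four input fields, only the fputcsv() line reads back as four columns.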

answered Feb 17, 2012 at 0:03

Tony


1 Comment

Tower Over a year ago

This ultimately helped me come to a solution, though I had to do a lot of manual editing in Microsoft Excel. The preg_replace calls were the most useful part; I used _ as the replacement instead of ##.

ChrisLively · Accepted Answer · 2012-02-17 00:02:30Z

2

By far the easiest way would be to simply take the block, drop everything from the <hr> tag forward then split the string as a string array on the <br /> tags.
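As a rough sketch of that idea for a single block (the sample string is the question's snippet with the indentation shortened, and the tolerant <br> pattern is my own assumption, since the question shows both <br/> and <br />):

```php
<?php
// One record block from the question (indentation shortened), ending with the separator.
$block = "1 Stop Signs<br />\n480-961-7446<br />\n500 N. 56th Street<br />\n"
       . "Chandler, AZ  85226<br />\n\n<br />\nWebsite: www.1stopsigns.com<br />\n<br />\n"
       . "</p><br /><hr><br />";

// Drop everything from the <hr> tag forward.
$pos = strpos($block, '<hr>');
if ($pos !== false) {
    $block = substr($block, 0, $pos);
}

// Split on <br> tags, tolerating <br>, <br/> and <br />.
$parts = preg_split('#<br\s*/?>#i', $block);

// Strip leftover tags and whitespace, keeping only non-empty pieces.
$fields = array();
foreach ($parts as $part) {
    $part = trim(strip_tags($part));
    if ($part !== '') {
        $fields[] = $part;
    }
}

print_r($fields);
```

For a whole page you would first explode() on <hr> and run the same steps on each chunk.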

answered Feb 17, 2012 at 0:02

ChrisLively


2 Comments

Tower Over a year ago

This is a very good approach. I have tried all the other answers given here, but to no avail, even with my own tweaking. How could I go about using your suggestion?

ChrisLively Over a year ago

@Brinard: Darragh's answer is pretty close to perfect and is what I was suggesting. If that isn't working then I'd bet the sample html you posted isn't exactly representative of what you are trying to run against. You might update your question with a link to the page you are trying to parse and let Darragh know.
