0

I'm working on a project where I need to store web pages in a database. I'm using crawler4j to crawl, and Proxool along with the MySQL Java Connector to connect to my database.

When I tested the application I got: com.mysql.jdbc.MysqlDataTruncation: Data truncation: Data too long for column 'HTMLData'.

The HTMLData column was TEXT.

When I changed the HTMLData column to LONGTEXT the error went away, but I'm afraid it might come back in the future.

Any idea how to set this up properly so that I don't have to worry about that error (or any similar error) in the future?

Thanks :)

1
  • I could be wrong, but I'm inclined to think that if you're afraid you're going to overshoot the size of a TEXT (or especially a LONGTEXT) column, you might be better off saving these items as static files and just storing the path in the DB. Even if I'm wrong and it's still better to keep them in the database, I'd agree with duffymo that you should re-examine your design. Commented Jun 7, 2010 at 23:12

3 Answers

5

In principle, a LONGTEXT field can hold 4GB of data. However, other, smaller restrictions probably apply. For example, the MySQL documentation says: "The largest possible packet that can be transmitted to or from a MySQL 5.1 server or client is 1GB." I think this effectively means you'll get up to about 1GB into a LONGTEXT (and even then, I think you'll have to raise the maximum packet size from its default).

Irrespective of this limit, HTML generally compresses well, so if your frameworks allow it, I would suggest you consider a LONGBLOB and run the data through a Deflater before storage (and through an Inflater on retrieval).
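As a rough illustration of the compress-before-storage idea, here is a minimal sketch using the JDK's `java.util.zip.Deflater` and `Inflater`. The class name `HtmlCompressor` and its method names are made up for this example, and it assumes the page text should be encoded as UTF-8; adjust to whatever encoding your crawler actually hands you.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper (not part of crawler4j or the MySQL connector).
final class HtmlCompressor {

    private HtmlCompressor() {}

    /** Encodes the HTML as UTF-8 (an assumption) and deflates it. */
    static byte[] compress(String html) {
        byte[] input = html.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            // Drain the deflater until it reports the stream is complete.
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native resources
        }
    }

    /** Inflates bytes produced by compress() and decodes them as UTF-8. */
    static String decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n > 0) {
                    out.write(buffer, 0, n);
                } else if (!inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    // No progress is possible: the data is truncated or not ours.
                    throw new IllegalArgumentException("Truncated or invalid compressed data");
                }
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Not valid deflate data", e);
        } finally {
            inflater.end(); // release native resources
        }
    }
}
```

With plain JDBC, the compressed bytes could then be written with `PreparedStatement.setBytes` and read back with `ResultSet.getBytes`. The trade-off is that the column is no longer searchable or readable as text from SQL.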


3

LONGTEXT can hold 4,294,967,295 bytes, see http://dev.mysql.com/doc/refman/5.1/en/storage-requirements.html

I'd say you don't want to store HTML documents bigger than 4GB, do you?

(Edit: I overshot the byte count by 1 byte; it's 2^32 - 1, of course.)
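For context, the MySQL TEXT family is capped by byte length: TINYTEXT at 255, TEXT at 65,535, MEDIUMTEXT at 16,777,215 and LONGTEXT at 4,294,967,295 bytes, which is why a TEXT column truncates any page over roughly 64KB. The sketch below picks the smallest type that fits a given byte length. The class `TextColumnLimits` and its methods are hypothetical helpers for illustration, and measuring with Java's UTF-8 encoder is only an estimate, since the bytes actually stored depend on the column's character set.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of any MySQL API.
final class TextColumnLimits {

    static final long TINYTEXT_MAX = 255L;             // 2^8  - 1 bytes
    static final long TEXT_MAX = 65_535L;              // 2^16 - 1 bytes
    static final long MEDIUMTEXT_MAX = 16_777_215L;    // 2^24 - 1 bytes
    static final long LONGTEXT_MAX = 4_294_967_295L;   // 2^32 - 1 bytes

    private TextColumnLimits() {}

    /** Returns the smallest TEXT-family type whose byte limit covers byteLength. */
    static String smallestTypeFor(long byteLength) {
        if (byteLength < 0) {
            throw new IllegalArgumentException("Length must not be negative");
        }
        if (byteLength <= TINYTEXT_MAX) return "TINYTEXT";
        if (byteLength <= TEXT_MAX) return "TEXT";
        if (byteLength <= MEDIUMTEXT_MAX) return "MEDIUMTEXT";
        if (byteLength <= LONGTEXT_MAX) return "LONGTEXT";
        throw new IllegalArgumentException("Too large for any TEXT type");
    }

    /** Estimates the stored size, assuming the data is encoded as UTF-8. */
    static long utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

Calling `TextColumnLimits.smallestTypeFor(TextColumnLimits.utf8Length(html))` before an insert is one way to log or reject oversized pages in the crawler instead of waiting for the driver to throw a `MysqlDataTruncation`.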

2 Comments

But see my answer below -- you may not actually get 4GB into one through the JDBC connector.
My point was more that it is more than enough for HTML; even 1GB is way overdone for any reasonable HTML document. Hitting the 65K limit, OK, but MEDIUMTEXT should be more than enough; 16MB for the standard max_allowed_packet is already pushing it very far for plain HTML.
1

This doesn't sound like a good design to me. Why do you have to store HTML in a database? It feels like it couples every tier from view to persistence through and through.

JSPs are dynamic templates for HTML pages; why not just use JSPs?

This is a design worth re-thinking.

4 Comments

They may not be his own pages. Even so, even for crawling/searchbot one could more easily save them as files & only store parsed/relevant data needed in a database.
As Wrikken said it is for crawling. :)
If it is, please don't store the HTML, a very hefty percentage of which is garbage. I don't know what you're looking for exactly, but parse out the text, title, and whatever else you need and store that. If you like, you can keep file backups of the actual HTML downloaded.
If you're doing any kind of serious crawling, storing the whole web page will be very costly. The Internet Archive requires ~2 petabytes (archive.org/about/faqs.php) to store everything it archives, and that's expensive. You should process what you crawl and strip out everything you don't need, to minimize the disk space required. You can also look into something like Lucene (lucene.apache.org) to build indexes of the data you're crawling, which will do a lot of that work for you.
