
I want to check if a file is a plain-text file. I tried the code below:

uses
  Classes, SysUtils;  // TFileStream, TFileName, FreeAndNil

function IsTextFile(const sFile: TFileName): Boolean;
//Created By Marcelo Castro - from Brazil
var
  oIn: TFileStream;
  iRead: Integer;
  iMaxRead: Integer;
  iData: Byte;
begin
  Result := True;
  oIn := TFileStream.Create(sFile, fmOpenRead or fmShareDenyNone);
  try
    iMaxRead := 1000;  // only test the first 1000 bytes
    if iMaxRead > oIn.Size then
      iMaxRead := oIn.Size;
    for iRead := 1 to iMaxRead do
    begin
      oIn.Read(iData, 1);
      if iData > 127 then
        Result := False;  // any byte outside the 7-bit ASCII range fails the test
    end;
  finally
    FreeAndNil(oIn);
  end;
end;

This function works pretty well for text files that contain only ASCII characters, but text files can also contain non-English characters, and for those this function returns False.
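For example (hypothetical file names, and assuming the second file is saved as UTF-8), a file containing nothing but the word 'café' is rejected, because 'é' is stored as the two bytes $C3 $A9, both of which are above 127:

IsTextFile('readme_ascii.txt');  // True  - every byte is in the 0..127 range
IsTextFile('cafe_utf8.txt');     // False - 'é' is encoded as bytes $C3 $A9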

Is there any way to check if a file is a text file or a binary file?

  • (Off-topic, but still rather important:) You really should replace your result:=false with Exit(False). If you find that the file is not a text file at char 2, there is no real need to keep investigating the remaining 998 chars... Commented Jun 12, 2020 at 9:46
  • "Is there any way to check if a file is a text file or a binary file?" In general, no. It is possible for the same file to be a valid text file and a valid binary file when interpreted in different ways. Commented Jun 12, 2020 at 10:24
  • @AndreRuebel: Except that UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE text files often have plenty of nulls in them. Commented Jun 12, 2020 at 10:55
  • With some effort, it is possible to create an algorithm that makes the right guess in most cases. For instance, you can check whether the file would be invalid in a particular encoding (then you know it is not a text file in that encoding). You can see if every second byte is null; then it is likely UTF-16. You can try to search for English words. And so on (a rough sketch along these lines follows this list). Commented Jun 12, 2020 at 10:57
  • I'm sure what @AndreasRejbrand and DavidH say is correct. Personally I would try a simple statistical analysis based on the frequency of occurrence of carriage return (#13) and linefeed (#10) characters. If they always appear together, I think it would be a good sign that the file contains text. Commented Jun 12, 2020 at 11:23
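A minimal sketch of that kind of heuristic, assuming Delphi 2009 or later (for TBytes and Exit(value)); the name LooksLikeTextFile, the 4096-byte sample size, and the thresholds are mine, not anything taken from the discussion above. It rejects control bytes that rarely occur in text and treats a large share of zero bytes as a hint of UTF-16 rather than of binary data; the CR/LF-pairing idea from the last comment could be layered on top:

uses
  SysUtils, Classes;

function LooksLikeTextFile(const FileName: string): Boolean;
const
  SampleSize = 4096;
var
  Stream: TFileStream;
  Buffer: TBytes;
  BytesRead, i, Zeros: Integer;
  B: Byte;
begin
  // Read a sample from the start of the file
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyNone);
  try
    SetLength(Buffer, SampleSize);
    BytesRead := Stream.Read(Buffer[0], SampleSize);
  finally
    Stream.Free;
  end;

  if BytesRead = 0 then
    Exit(True);  // an empty file is trivially "text"

  Zeros := 0;
  for i := 0 to BytesRead - 1 do
  begin
    B := Buffer[i];
    if B = 0 then
      Inc(Zeros)
    // control bytes other than TAB, LF, FF and CR almost never occur in text
    else if (B < 32) and not (B in [9, 10, 12, 13]) then
      Exit(False);
  end;

  // no zero bytes at all: treat it as 8-bit text
  if Zeros = 0 then
    Exit(True);

  // roughly every second byte being zero suggests UTF-16 text;
  // a handful of stray zeros suggests binary data
  Result := Zeros >= BytesRead div 3;
end;

Remember that this is only a guess: it will happily accept binary formats that happen to look text-like, and, as the answer below points out, it still tells you nothing about which encoding the text is in.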

1 Answer


You can't detect the codepage, you need to be told it. You can analyse the bytes and guess it, but that can give some bizarre (sometimes amusing) results. I can't find it now, but I'm sure Notepad can be tricked into displaying English text in Chinese.

It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII. There Ain't No Such Thing As Plain Text. If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

That's the first answer from here: How can I detect the encoding/codepage of a text file

You should also bear in mind that any binary file can pass for text in some uncommon encoding. And binary data encoded in Base64 will bypass any test you can think of, since Base64 is by definition a text representation of a binary stream.
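About the only thing that can be positively detected is a byte order mark. As a hedged illustration (this assumes Delphi 2009 or later, and the helper name DetectBOMEncoding is made up for this sketch), TEncoding.GetBufferEncoding recognises the UTF-8 and UTF-16 LE/BE preambles; when there is no BOM it simply falls back to the default encoding, which is exactly the "you need to be told it" situation quoted above:

uses
  SysUtils, Classes;

// Returns the encoding indicated by a BOM at the start of the file,
// or nil when there is no BOM and the encoding cannot be known.
function DetectBOMEncoding(const FileName: string): TEncoding;
var
  Stream: TFileStream;
  Buffer: TBytes;
  BytesRead: Integer;
  Detected: TEncoding;
begin
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyNone);
  try
    SetLength(Buffer, 4);  // the longest BOM checked here is 3 bytes, 4 is plenty
    BytesRead := Stream.Read(Buffer[0], Length(Buffer));
    SetLength(Buffer, BytesRead);
  finally
    Stream.Free;
  end;

  Detected := nil;
  // GetBufferEncoding returns the length of a recognised BOM (UTF-8, UTF-16 LE/BE)
  // and fills Detected with the matching encoding; a result of 0 means no BOM was
  // found and Detected merely holds the default encoding.
  if TEncoding.GetBufferEncoding(Buffer, Detected) > 0 then
    Result := Detected
  else
    Result := nil;
end;

A BOM-less UTF-8 or ANSI file comes back as nil here: nothing in the bytes themselves says which codepage it was written in.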
