5

I want to get the first #include statement from a .cpp file using Python regex as fast as possible.

For example,

/* Copyright: 
This file is 
protected 
#include <bad.h>
*/

// Include files:
#undef A_MACRO
#include <stddef.h>  // defines NULL
#include "logger.h"

// Global static pointer used to ensure a single instance of the class.
Logger* Logger::m_pInstance = NULL; 

should return #include <stddef.h>

I know one way is to remove all comments and then take the first line of the remaining text. But this does not seem fast enough, since it has to go through the whole file. If I only need the first #include statement, is there an efficient way to do this with a Python regex?

[Update 1] Several people mentioned that regex is not a good fit here. I understand this is not a typical use case for regex. But is there a better way than regex to get rid of the leading comments? Any suggestions would be appreciated.

[Update 2] Thanks for the answers, but none of them fully satisfies me yet. My requirements are straightforward: (1) avoid going through the whole file just to get the first #include line; (2) handle the leading comments correctly.

  • Using regexes to parse C++ is an even worse idea than using regexes to parse HTML... Commented Aug 7, 2015 at 21:48
  • @lenz That fails their example and would grab <bad.h> Commented Aug 7, 2015 at 21:50
  • A solution here would not need to parse C++. Includes are handled by the preprocessor. It would need to strip comments, expand macros (on the off-chance that a macro expansion produces an include statement), and then do a simple pattern-match. Commented Aug 7, 2015 at 21:56
  • What about #include wrapped in #if 0/#endif? What about one wrapped in #ifdef linux/#endif? What about #define foo <stdio.h>/#include foo? Commented Aug 7, 2015 at 22:05
  • @JoséTomásTocino This looks relevant: eli.thegreenplace.net/2011/07/03/parsing-c-in-python-with-clang Commented Aug 7, 2015 at 23:36

3 Answers

4

You can use a library called CppHeaderParser like this:

import CppHeaderParser

# Parse the file and collect the preprocessor includes it contains.
cppHeader = CppHeaderParser.CppHeader("test.cpp")

print("List of includes:")
for incl in cppHeader.includes:
    print(" %s" % incl)

For this to work, install the library first:

pip install cppheaderparser

It outputs:

List of includes:
 <stddef.h>  // defines NULL
 "logger.h"

Certainly not the best result, but it's a start.

3 Comments

Today I learned there's a Python library for literally everything
Nice, but it doesn't handle nested comments, i.e. /* foo /* bar */ baz */ - in case you have to handle those.
@José Tomás Tocino: Thanks for mentioning this useful library. But it looks like it has to go through the whole cpp file to get the first line. That is what I was trying to avoid, since it's not fast enough.
1

What about using the C-preprocessor itself?

If you run gcc -E foo.cpp (where foo.cpp is your sample input file) you will get:

# 1 "foo.cpp"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 326 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "foo.cpp" 2








# 1 "/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../lib/clang/6.1.0/include/stddef.h" 1 3 4

The lines before # 1 "foo.cpp" 2 are boilerplate and can be ignored. (See what your C-preprocessor generates here.)

When you get to # 1 some-other-file ... you know you've hit a #include.

You will get a complete path name (not the way it appears in the #include statement), but you can also deduce where the #include appeared by looking backwards for the last line marker.

In this case the last line marker is # 1 "foo.cpp" 2 and it appears 9 lines back, so the #include for stddef.h was on line 9 of foo.cpp.

So now you can go back to the original file and grab line 9.
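
The line-marker walk described above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the names first_include_line and SAMPLE are made up for this example, the marker layout (# <line> "<file>" <flags>, where flag 1 means a new file is entered) is the GCC/Clang convention, pseudo-files such as <built-in> are skipped simply by checking for a leading <, and source_name has to match the name exactly as it appears in the markers.

```python
import re

# Line marker as emitted by GCC/Clang: # <line> "<file>" [flags...]
MARKER_RE = re.compile(r'^# (\d+) "([^"]*)"((?: \d+)*)\s*$')


def first_include_line(preprocessed, source_name):
    """Return the 1-based line in source_name of the first #include that
    pulled in another file, or None if no such marker is found."""
    cur_file = None   # file that the following output lines belong to
    cur_line = 0      # line number in cur_file of the next output line
    for line in preprocessed.splitlines():
        m = MARKER_RE.match(line)
        if m is None:
            cur_line += 1          # ordinary output line
            continue
        lineno, fname, flags = int(m.group(1)), m.group(2), m.group(3).split()
        # Flag 1 = "entering a new file". Pseudo-files like <built-in> are
        # not real includes, so skip names that start with '<'.
        if "1" in flags and cur_file == source_name and not fname.startswith("<"):
            return cur_line
        cur_file, cur_line = fname, lineno
    return None


# Shortened version of the output shown above (hypothetical stddef.h path).
SAMPLE = "\n".join(
    ['# 1 "foo.cpp"', '# 1 "<built-in>" 1', '# 1 "<command line>" 1',
     '# 1 "<built-in>" 2', '# 1 "foo.cpp" 2']
    + [""] * 8
    + ['# 1 "/usr/include/stddef.h" 1 3 4']
)
print(first_include_line(SAMPLE, "foo.cpp"))  # 9
```

With a real file, the text could come from subprocess.run(["gcc", "-E", "foo.cpp"], capture_output=True, text=True).stdout, and linecache.getline("foo.cpp", n) would then fetch the original line. Note that the preprocessor still processes the whole file, so this trades speed for correctness.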

0

Does it have to be regex? The code below stops at the first #include line, skips /* ... */ block comments, and doesn't break on the // /*This is a comment case.

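incomment = False  # True while inside a /* ... */ block comment

with open(r'myheader.h') as f:
    for line in f:
        if not incomment:
            # Drop any trailing // comment and surrounding whitespace.
            code = line.split('//')[0].strip()
            if code.startswith('#include'):
                print(code)
                break
            if '/*' in code:
                incomment = True
        # Check the full line so "/* a // b */" still closes the comment.
        if '*/' in line:
            incomment = False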
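
A variant of the same idea that also copes with comments in the middle of a line, such as /*...*/ #include "aaa.h" /*...*/, might look like the sketch below. The function name first_include is hypothetical, and the sketch assumes that no "//" or "/*" appears inside a string literal or include path before the first directive; it also ignores #if blocks, macros, and backslash line continuations.

```python
import re

INCLUDE_RE = re.compile(r'#\s*include\b')


def first_include(lines):
    """Return the first #include directive found outside comments, or None.

    `lines` is any iterable of lines, e.g. an open file, so reading stops
    as soon as a directive is found.
    """
    in_block = False                      # inside an unclosed /* ... */
    for line in lines:
        code = []                         # pieces of this line outside comments
        i = 0
        while i < len(line):
            if in_block:
                end = line.find('*/', i)
                if end == -1:
                    break                 # comment continues on the next line
                in_block = False
                i = end + 2
                continue
            block = line.find('/*', i)
            slash = line.find('//', i)
            if slash != -1 and (block == -1 or slash < block):
                code.append(line[i:slash])    # rest of the line is a // comment
                break
            if block == -1:
                code.append(line[i:])         # no more comments on this line
                break
            code.append(line[i:block])        # keep the text before the /*
            in_block = True
            i = block + 2
        text = ' '.join(code).strip()
        if INCLUDE_RE.match(text):
            return text
    return None
```

Passing an open file, for example with open('myheader.h') as f: print(first_include(f)), keeps the early exit, since the file is only read up to the first match (apart from normal I/O buffering).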

5 Comments

a line like: // /*This is a comment will break it.
Thanks. Looks like it's a good point to start with -- easy and fast. It may not cover 100% cases but should be enough for me. Maybe you want to add strip() and protection for ....*/ #include "aaa.h". I can also edit it after I try more test cases.
@stanleyli Glad to help. Ha! The lowest score is the one you're using.
@stanleyli If you're using this response to solve your problem, albeit modified, please mark it as the accepted answer. Thanks!
Thanks. But I finally realized I still have to rely on regex heavily when I code. Situations like /*...*/ #include... /*...*/ #include ... /* ... */ caused lots of headache. Regex may still be the best solution. Upvoted for the initiative.
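
For the /*...*/ #include ... /*...*/ situations mentioned in the last comment, one possible regex-based approach is a single pattern that consumes comments and string literals as alternatives, so that only an #include outside of them reaches the capturing group. This is a sketch, not a tested solution for all C++ input: TOKEN_RE and first_include_regex are names invented here, the text has to be in memory (reading only a leading chunk of the file is an option, at the risk of cutting a comment in half), and directives such as #include foo that rely on a macro are not matched.

```python
import re

# Comments and string literals are matched first so that any "#include"
# text inside them is consumed and never reaches the last alternative.
TOKEN_RE = re.compile(
    r'''
      /\*.*?(?:\*/|\Z)                                       # block comment
    | //[^\n]*                                               # line comment
    | "(?:\\.|[^"\\\n])*"                                    # string literal
    | (?P<inc>\#[ \t]*include[ \t]*(?:<[^>\n]*>|"[^"\n]*"))  # directive
    ''',
    re.VERBOSE | re.DOTALL,
)


def first_include_regex(text):
    """Return the first #include outside comments and strings, or None."""
    # finditer is lazy, so scanning stops at the first directive found.
    for m in TOKEN_RE.finditer(text):
        if m.group('inc'):
            return m.group('inc')
    return None
```

Because the directive alternative only matches up to the closing > or ", a trailing comment such as // defines NULL is not part of the result.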
