Python Regex in commented code

Question

I am trying to match open source license types in the commented-out header at the beginning of most source files. However, I am having difficulty when the desired string (e.g. Lesser General Public License) spans two lines. See the license header below for an example.

 * Copyright (c) Codice Foundation
 * <p/>
 * This is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser
 * General Public License as published by the Free Software Foundation, either version 3 of the
 * License, or any later version.
 * <p/>
 * This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without
 * even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
 * Lesser General Public License for more details. A copy of the GNU Lesser General Public License
 * is distributed along with this program and can be found at
 * <http://www.gnu.org/licenses/lgpl.html>.
 */

Using a regex lookbehind is not possible because the number of spaces in the commented code is unknown and the comment characters differ between languages. Examples of my current regular expressions are included below:

self._cr_license_re['GNU']                            = re.compile(r'\sGNU\D')
self._cr_license_re['MIT License']                    = re.compile(r'MIT License|Licensed MIT|\sMIT\D')
self._cr_license_re['OpenSceneGraph Public License']  = re.compile(r'OpenSceneGraph Public License', re.IGNORECASE)
self._cr_license_re['Artistic License']               = re.compile(r'Artistic License', re.IGNORECASE)
self._cr_license_re['LGPL']                           = re.compile(r'\sLGPL\s|Lesser General Public License', re.IGNORECASE)
self._cr_license_re['BSD']                            = re.compile(r'\sBSD\D')
self._cr_license_re['Unspecified OS']                 = re.compile(r'free of charge', re.IGNORECASE)
self._cr_license_re['GPL']                            = re.compile(r'\sGPL\D|(?<!Lesser)\sGeneral Public License', re.IGNORECASE)
self._cr_license_re['Apache License']                 = re.compile(r'Apache License', re.IGNORECASE)
self._cr_license_re['Creative Commons']               = re.compile(r'\sCC\D')

I welcome any suggestions on how to tackle this problem using regex in Python.

"If there only was a way to glue lines together into a single long string"?
– Jongware, Commented Nov 17, 2016 at 21:20
What is the problem? Replace all literal spaces in your 'OpenSceneGraph Public License' (and anywhere) with \s+, that is all.
– Wiktor Stribiżew, Commented Nov 17, 2016 at 21:42
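
The \s+ suggestion in the comment above can be sketched roughly as follows. On its own, \s+ would still stop at the leading * of the sample header, so this sketch widens the gap to a character class that also accepts a few comment-leader characters. That widening, the name GAP, and the helper phrase_pattern are assumptions for illustration, not something stated in the thread.

```python
import re

# Assumed gap between words: whitespace plus common comment leaders
# ('*', '#', '/'). Extend this class for other comment styles as needed.
GAP = r"[\s*#/]+"


def phrase_pattern(phrase, flags=re.IGNORECASE):
    """Compile a regex for `phrase` that tolerates line breaks and
    comment leaders between its words (hypothetical helper)."""
    words = phrase.split()
    return re.compile(GAP.join(re.escape(word) for word in words), flags)


lgpl_re = phrase_pattern("Lesser General Public License")

sample = "under the terms of the GNU Lesser\n * General Public License as"
print(bool(lgpl_re.search(sample)))
```

Because the gap is built into each pattern, the file text does not need to be modified before searching.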

Nicolas · Accepted Answer · 2016-11-17 21:26:14Z


You could match the comment markers with this regex and replace each match with a single space:

\s*\*\s*\/?

This should collapse the multi-line comment onto one line, so you can then search it for the license name.
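
A minimal sketch of this approach in Python, assuming the header uses the `*`-style block comment shown in the question. The names COMMENT_GLUE_RE, LGPL_RE, and flatten_block_comment are made up for illustration; the LGPL pattern is the one from the question.

```python
import re

# Pattern from the answer: optional whitespace (including newlines),
# a literal '*', optional whitespace, and an optional '/' so that the
# closing '*/' is consumed as well.
COMMENT_GLUE_RE = re.compile(r"\s*\*\s*/?")

# LGPL pattern taken from the question.
LGPL_RE = re.compile(r"\sLGPL\s|Lesser General Public License", re.IGNORECASE)


def flatten_block_comment(text):
    """Replace each comment-star run, with its surrounding whitespace,
    by a single space (hypothetical helper)."""
    return COMMENT_GLUE_RE.sub(" ", text)


header = (
    " * under the terms of the GNU Lesser\n"
    " * General Public License as published by\n"
    " */\n"
)

flat = flatten_block_comment(header)
print(bool(LGPL_RE.search(header)))  # the phrase is split across two lines
print(bool(LGPL_RE.search(flat)))    # the phrase is now on one line
```

Note that the pattern removes every `*` it finds, so it is best applied only to the header comment rather than to the whole source file.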


1 Comment

lmum27 Over a year ago

Good suggestion. However, the regex above did not remove the newline (\n) characters. What eventually worked was:

text = fid.read().replace('\n','')
fin_text = re.sub('s*\*\s*\/?','',text)
