I am trying to match open source license types in the commented-out header at the beginning of most source files. However, I am having difficulty when the desired string (e.g. Lesser General Public License) spans two lines. See the license header below for an example.

 * Copyright (c) Codice Foundation
 * <p/>
 * This is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser
 * General Public License as published by the Free Software Foundation, either version 3 of the
 * License, or any later version.
 * <p/>
 * This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without
 * even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
 * Lesser General Public License for more details. A copy of the GNU Lesser General Public License
 * is distributed along with this program and can be found at
 * <http://www.gnu.org/licenses/lgpl.html>.
 */

Using a regex lookbehind is not possible because of the unknown number of spaces in the commented code and because comment characters differ between languages. Examples of my current regular expressions are included below:

self._cr_license_re['GNU']                            = re.compile(r'\sGNU\D')
self._cr_license_re['MIT License']                    = re.compile(r'MIT License|Licensed MIT|\sMIT\D')
self._cr_license_re['OpenSceneGraph Public License']  = re.compile(r'OpenSceneGraph Public License', re.IGNORECASE)
self._cr_license_re['Artistic License']               = re.compile(r'Artistic License', re.IGNORECASE)
self._cr_license_re['LGPL']                           = re.compile(r'\sLGPL\s|Lesser General Public License', re.IGNORECASE)
self._cr_license_re['BSD']                            = re.compile(r'\sBSD\D')
self._cr_license_re['Unspecified OS']                 = re.compile(r'free of charge', re.IGNORECASE)
self._cr_license_re['GPL']                            = re.compile(r'\sGPL\D|(?<!Lesser)\sGeneral Public License', re.IGNORECASE)
self._cr_license_re['Apache License']                 = re.compile(r'Apache License', re.IGNORECASE)
self._cr_license_re['Creative Commons']               = re.compile(r'\sCC\D')

I welcome any suggestions on how to tackle this problem using regex in python.

  • "If there only was a way to glue lines together into a single long string"? Commented Nov 17, 2016 at 21:20
  • What is the problem? Replace all literal spaces in your 'OpenSceneGraph Public License' (and anywhere) with \s+, that is all. Commented Nov 17, 2016 at 21:42
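
Note that `\s+` alone would still stop at the comment marker that begins the continuation line (the ` * ` in the example header). A minimal sketch of a variant, assuming the separator between words may also contain common comment characters: the `SEPARATOR` character set and the `phrase_pattern` helper are hypothetical names introduced here for illustration, not part of the original code.

```python
import re

# Assumption: characters that may sit between two words of a license name
# when the name wraps onto a new comment line (whitespace plus a few common
# comment markers). Adjust this set for the languages being scanned.
SEPARATOR = r"[\s*#/;!-]+"


def phrase_pattern(phrase, flags=re.IGNORECASE):
    """Compile a pattern matching the words of `phrase` joined by SEPARATOR."""
    words = [re.escape(word) for word in phrase.split()]
    return re.compile(SEPARATOR.join(words), flags)


# Two lines from the header in the question, left unflattened.
sample = (
    " * This is free software: you can redistribute it and/or modify it "
    "under the terms of the GNU Lesser\n"
    " * General Public License as published by the Free Software Foundation\n"
)

lgpl_re = phrase_pattern("Lesser General Public License")
match = lgpl_re.search(sample)
print(match is not None)
```

The trade-off is a looser match: with this separator set, a hyphenated or slash-separated spelling of the same words would also be accepted.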

1 Answer

You could use this regex and replace each match with a space:

\s*\*\s*\/?

This should put the multiline comment on one line, so you can then search it for the license.
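
As a rough sketch of that idea in Python, assuming C-style `*` comment prefixes as in the question's header: the `flatten_comment` helper and the shortened `header` sample are illustrative names, and the two license patterns are copied from the question.

```python
import re

# Shortened copy of the header from the question.
header = (
    " * This is free software: you can redistribute it and/or modify it "
    "under the terms of the GNU Lesser\n"
    " * General Public License as published by the Free Software Foundation, "
    "either version 3 of the\n"
    " * License, or any later version.\n"
    " */\n"
)

# Optional whitespace (newlines included), a literal '*', more optional
# whitespace, and an optional '/' that closes the comment.
COMMENT_PREFIX = re.compile(r"\s*\*\s*/?")


def flatten_comment(text):
    """Replace every comment-prefix run with a single space."""
    return COMMENT_PREFIX.sub(" ", text)


flat = flatten_comment(header)

lgpl_re = re.compile(r"\sLGPL\s|Lesser General Public License", re.IGNORECASE)
gpl_re = re.compile(r"\sGPL\D|(?<!Lesser)\sGeneral Public License", re.IGNORECASE)

is_lgpl = lgpl_re.search(flat) is not None
is_gpl = gpl_re.search(flat) is not None
print(is_lgpl, is_gpl)
```

Because the leading `\s*` swallows the newline and any indentation before each `*`, exactly one space is left between "Lesser" and "General", which is why the fixed-width lookbehind in the GPL pattern can still reject this text. One caveat: the pattern removes every asterisk, so it is only suitable for the comment header, not for arbitrary source code, and other comment styles (`#`, `//`, `--`) would need their own prefix pattern.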


1 Comment

Good suggestion. However, the regex above did not remove the newline (\n) characters. What eventually worked was:

text = fid.read().replace('\n','')
fin_text = re.sub('s*\*\s*\/?','',text)
