10

I am researching ways, tools and techniques to parse code files in order to support syntax highlighting and IntelliSense in an editor written in C#.

Does anyone have any ideas, patterns and practices, tools or techniques for that?

EDIT: A nice source of info for anyone interested:

Parsing Beyond Context-Free Grammars, ISBN 978-3-642-14845-3

14
  • 1
    Are you trying to parse C# or write a parser in C#? Commented Oct 24, 2010 at 15:38
  • 1
    @Gabe, both. I am trying to write a parser in C# which will parse XML, C# and hopefully something else :) Commented Oct 24, 2010 at 15:42
  • 1
    If you want to parse multiple languages, have you looked at ANTLR? Commented Oct 24, 2010 at 15:50
  • 4
    This rather depends on how sophisticated you want it to be. If you want the full Visual Studio experience you'll need a full parser, but if you just want simple keyword/string highlighting (like StackOverflow provides) then you don't want a parser. All you need is a simple tokenizer that can distinguish between strings and identifiers, and a list of keywords. Commented Oct 24, 2010 at 15:58
  • 1
    @sTodorov: Anyways, what I am trying to say is that you need some kind of resilient parser that knows how to backtrack with the least effort. Most parser generators, like yacc, can be modified for this behavior, albeit with varying efficiency. Commented Oct 24, 2010 at 16:30
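
The "simple tokenizer plus a list of keywords" idea from the comments above can be sketched roughly as follows. This is a minimal illustration in Python rather than C#, and the keyword set, token kinds and the `tokenize` function name are all assumptions made for the example; a real editor would load a keyword list per language and port the same loop to C#.

```python
import re

# Assumed, deliberately tiny keyword list; a real highlighter would have one per language.
KEYWORDS = {"class", "public", "static", "void", "string", "int", "return", "if", "else", "new"}

# One alternation with named groups. Order matters: comments and strings are
# tried before identifiers so that keywords inside them are not highlighted.
TOKEN_RE = re.compile(r"""
    (?P<comment>//[^\n]*)               # line comment up to end of line
  | (?P<string>"(?:\\.|[^"\\\n])*"?)    # string literal; closing quote optional while typing
  | (?P<number>\d+(?:\.\d+)?)           # integer or simple decimal
  | (?P<ident>[A-Za-z_]\w*)             # identifier or keyword
  | (?P<space>\s+)                      # whitespace, kept so offsets stay intact
  | (?P<other>.)                        # any other single character
""", re.VERBOSE)


def tokenize(source):
    """Yield (kind, text) pairs that together cover every character of source."""
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "ident" and text in KEYWORDS:
            kind = "keyword"
        yield kind, text


if __name__ == "__main__":
    for kind, text in tokenize('return "hi"; // done'):
        if kind != "space":
            print(kind, repr(text))
```

Because every character lands in some token, the editor can map each `(kind, text)` pair straight onto a colour without losing positions, and an unterminated string simply stays a string token instead of breaking the scan.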

3 Answers 3

6

My favourite parser for C# is Irony: http://irony.codeplex.com/ - I have used it a couple of times with great success.

Here is a wikipedia page listing many more: http://en.wikipedia.org/wiki/Compiler-compiler


2 Comments

Does Irony support multiple language parsing?
Irony is for creating parsers, so yes: it parses anything you can build a grammar for.
3

There are two basic approaches:
1) Parse the entire solution and everything it references, so you understand all the types involved in the code.
2) Parse locally and do your best to guess what the types etc. are.

The trouble with (2) is that you have to guess, and in some circumstances you just can't tell from a code snippet exactly what everything is. But if you're happy with the sort of syntax highlighting shown on (e.g.) Stack Overflow, then this approach is easy and quite effective.

To do (1), you need to do one of the following (in decreasing order of difficulty):

  • Parse all the source code. Not possible if you reference 3rd party assemblies.
  • Use reflection on the compiled code to garner type information you can use when parsing the source.
  • Use the host IDE's code element interfaces (if available, so not applicable in your case!) to provide the information you need.
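
As a rough analogy for the reflection option above, here is a minimal Python sketch that uses runtime introspection to list the public members of a type for a completion popup. It is not the .NET API: in C# the equivalent information would come from `System.Reflection` (for example `Type.GetMembers()`), and the `completions` helper and its `prefix` parameter are hypothetical names chosen for this illustration.

```python
import inspect


def completions(obj, prefix=""):
    """Return the sorted public member names of obj that start with prefix."""
    names = []
    for name, _member in inspect.getmembers(obj):
        if name.startswith("_"):
            continue  # hide private and dunder members from the popup
        if name.startswith(prefix):
            names.append(name)
    return sorted(names)


if __name__ == "__main__":
    # The user has typed `someString.sp` and the editor already knows the type is str.
    print(completions(str, "sp"))
```

The hard part that this sketch skips is exactly what the answer describes: working out that the expression before the dot has a particular type, which is what requires parsing the source in the first place.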

9 Comments

OP wants to parse multiple languages. There's the "small" problem of actually getting working grammars for the languages you want to process. Legacy languages are hard to do this for, because the standards committees have been decorating them with goo; check out IBM Enterprise COBOL or Fortran 2005. Modern languages are a little easier but even they have pressure to add stuff; try parsing modern VB.NET. I've got 15 years into building parsers using unified infrastructure for a wide range of languages (including those I mentioned) and I'm hardly done yet :-{
@Ira: OP doesn't make it very clear what languages are required, but most of my answer stands equally well for any language. But you're right, it's a very nontrivial problem. Visual Studio IntelliSense has been developed for many years by an experienced team, and only really works well in .NET languages. Beyond basic syntax highlighting, the support is pretty poor in most other languages, which is a good indicator of the difficulty of the problem the OP is attempting to address.
@Ira the feat you are trying to accomplish sounds very serious. I wish you all the success with it. However, what I am researching is mostly support for C#, Ruby, Python, VB.NET and Java. I can only imagine the difficulties involved with parsing legacy languages.
@Jason, I think for now I will concentrate on researching parsing C# and Python because of the difference in structure, e.g. curly brackets versus indentation.
@sTodorov: I've done all the languages you've mentioned except for Ruby, and that's in progress. If you want to parse these languages fully, you need pretty much all the machinery that I've used, in some form or another. If all you want is syntax highlighting, you can do a good-enough job with just regular expression matching, because syntax highlighting doesn't have to be always right to be useful.
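
On the curly brackets versus indentation point raised in the comments above: one common way to handle indentation-based languages is to have the tokenizer turn changes in indentation into explicit block-open and block-close tokens, so later stages can treat them much like `{` and `}`. The sketch below only illustrates that idea using Python's own standard-library `tokenize` module on Python source; the `block_events` helper is a hypothetical name, and an editor written in C# would need its own implementation of the same bookkeeping.

```python
import io
import tokenize


def block_events(source):
    """Return the names of the INDENT/DEDENT tokens found in source, in order."""
    events = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.INDENT, tokenize.DEDENT):
            events.append(tokenize.tok_name[tok.type])
    return events


if __name__ == "__main__":
    src = "if x:\n    y = 1\nz = 2\n"
    print(block_events(src))
```

Note that `tokenize` raises an error on some incomplete input, which is the normal state of a file being edited, so an editor would still need the kind of error tolerance mentioned earlier in the thread.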
1

You could take a look at how http://www.icsharpcode.net/ did it. They wrote a book about doing just that, Dissecting a C# Application: Inside SharpDevelop. It even has a chapter called:

Implement a parser to provide syntax highlighting and auto-completion as users type
