I can do this in Python, and it gives me the available attributes within the function.

In the interpreter, I can do this:

>>> from nltk import pos_tag
>>> dir(pos_tag)
['__call__', '__class__', '__closure__', '__code__', '__defaults__', '__delattr__', '__dict__', '__doc__', '__format__', '__get__', '__getattribute__', '__globals__', '__hash__', '__init__', '__module__', '__name__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'func_closure', 'func_code', 'func_defaults', 'func_dict', 'func_doc', 'func_globals', 'func_name']

BTW, what does dir(function) actually call?

How do I know which parameters are necessary to call the function? E.g. in the case of pos_tag, the source code says it needs tokens; see https://github.com/nltk/nltk/blob/develop/nltk/tag/__init__.py

def pos_tag(tokens):
    """
    Use NLTK's currently recommended part of speech tagger to
    tag the given list of tokens.
        >>> from nltk.tag import pos_tag # doctest: +SKIP
        >>> from nltk.tokenize import word_tokenize # doctest: +SKIP
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad.")) # doctest: +SKIP
        [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is',
        'VBZ'), ("n't", 'RB'), ('all', 'DT'), ('that', 'DT'), ('bad', 'JJ'),
        ('.', '.')]
    :param tokens: Sequence of tokens to be tagged
    :type tokens: list(str)
    :return: The tagged tokens
    :rtype: list(tuple(str, str))
    """
    tagger = load(_POS_TAGGER)
    return tagger.tag(tokens)

If a docstring is available for the function, is there a way to know what parameter type the function expects for a specific parameter? E.g. in the pos_tag case above, it's :param tokens: Sequence of tokens to be tagged and :type tokens: list(str). Can this information be obtained in the interpreter without reading the code?

And lastly, is there a way to know the return type?

Just to be clear, I'm not expecting a printout of the docstring; the questions above are so that I can later do some sort of type checking with isinstance(output_object, type)

1 Answer

Below are the answers to your four questions. I'm afraid some of the things you want to do are not possible in the standard library, unless you want to parse the docstrings yourself.

(1) What does dir(function) call?

If I understand this question correctly, I believe the docs answer that question here:

If the object has a method named __dir__(), this method will be called and must return the list of attributes. This allows objects that implement a custom __getattr__() or __getattribute__() function to customize the way dir() reports their attributes.

If the object does not provide __dir__(), the function tries its best to gather information from the object’s __dict__ attribute, if defined, and from its type object.
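To see that delegation in action, here's a minimal sketch (Python 3 syntax) using a hypothetical class that overrides __dir__:

```python
class CustomDir:
    """Hypothetical class demonstrating that dir() delegates to __dir__."""
    def __dir__(self):
        # dir() calls this method and returns a sorted copy of the result
        return ['beta', 'alpha']

print(dir(CustomDir()))  # ['alpha', 'beta'] -- note that dir() sorted the list
```

Without a custom __dir__, dir() falls back to walking __dict__ and the type, which is why you saw all those dunder attributes on pos_tag.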

(2) How do I know which parameters are necessary to call the function?

The best way is to use inspect:

>>> from nltk import pos_tag
>>> from inspect import getargspec
>>> getargspec(pos_tag)
ArgSpec(args=['tokens'], varargs=None, keywords=None, defaults=None)  # a named tuple
>>> getargspec(pos_tag).args
['tokens']
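Note that getargspec was deprecated for years and removed in Python 3.11; the modern replacement is inspect.signature. A sketch using a stand-in function with the same signature as pos_tag, so it runs without NLTK installed:

```python
import inspect

def pos_tag(tokens):
    """Stand-in with the same signature as nltk.pos_tag."""
    return [(t, 'NN') for t in tokens]

sig = inspect.signature(pos_tag)
print(list(sig.parameters))  # ['tokens']

# Parameters without a default value are required at call time
required = [name for name, p in sig.parameters.items()
            if p.default is inspect.Parameter.empty]
print(required)              # ['tokens']
```

Unlike getargspec, signature also handles keyword-only parameters and works on callables other than plain functions.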

(3) If a docstring is available for the function is there a way to know what is the parameter type that the function is expecting for a specific parameter?

Not in the standard library, unless you want to parse the docstring on your own. You probably already know that you can access the docstrings like this:

>>> from inspect import getdoc
>>> print getdoc(pos_tag)
Use NLTK's currently recommended part of speech tagger to
tag the given list of tokens.

    >>> from nltk.tag import pos_tag
    >>> from nltk.tokenize import word_tokenize
    >>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
    [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is',
    'VBZ'), ("n't", 'RB'), ('all', 'DT'), ('that', 'DT'), ('bad', 'JJ'),
    ('.', '.')]

:param tokens: Sequence of tokens to be tagged
:type tokens: list(str)
:return: The tagged tokens
:rtype: list(tuple(str, str))

or this:

>>> print pos_tag.func_code.co_consts[0]

    Use NLTK's currently recommended part of speech tagger to
    tag the given list of tokens.

        >>> from nltk.tag import pos_tag
        >>> from nltk.tokenize import word_tokenize
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
        [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is',
        'VBZ'), ("n't", 'RB'), ('all', 'DT'), ('that', 'DT'), ('bad', 'JJ'),
        ('.', '.')]

    :param tokens: Sequence of tokens to be tagged
    :type tokens: list(str)
    :return: The tagged tokens
    :rtype: list(tuple(str, str))
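In Python 3, func_code was renamed __code__, but the portable way to read a docstring is simply __doc__ (or inspect.getdoc, which also cleans up the indentation). A quick sketch with a stand-in function:

```python
import inspect

def pos_tag(tokens):
    """Tag the given list of tokens.

    :type tokens: list(str)
    """
    return tokens

# __doc__ keeps the raw indentation; getdoc() dedents it for you
print(pos_tag.__doc__)
print(inspect.getdoc(pos_tag))

# The docstring also lives in the code object, as shown above
# (a CPython implementation detail; may vary across versions):
print(pos_tag.__code__.co_consts[0] == pos_tag.__doc__)
```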

If you want to try to parse the params and "types" by yourself, you could start with a regex. Clearly, though, I am using the word "type" loosely. Moreover, this approach will only work for docstrings that list their parameters and types in this specific way:

>>> import re
>>> params = re.findall(r'(?<=:)type\s+([\w]+):\s*(.*?)(?=\n|$)', getdoc(pos_tag))
>>> for param, type_ in params:
    print param, '=>', type_

tokens => list(str)

This approach gives you each parameter and its corresponding description. You could also check each word in the description by splitting the string and keeping only the words that satisfy the following requirement:

>>> isinstance(eval(word), type)
True
>>> isinstance(eval('list'), type)
True

But this approach could quickly get complicated, especially when trying to parse the last parameter of pos_tag. Moreover, docstrings will often not have this format at all, so this would likely only work with NLTK, and even then not all the time.
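Putting the regex and the isinstance(eval(word), type) check together, here's a hedged sketch (Python 3) that only handles docstrings in this exact NLTK-style format, using an inline sample docstring rather than the real pos_tag:

```python
import re

doc = """:param tokens: Sequence of tokens to be tagged
:type tokens: list(str)
:return: The tagged tokens
:rtype: list(tuple(str, str))"""

for name, desc in re.findall(r'(?<=:)type\s+(\w+):\s*(.*?)(?=\n|$)', doc):
    # Split the description into identifier-like words and keep those that
    # eval to actual types. Note: eval is unsafe on untrusted docstrings.
    types = []
    for word in re.findall(r'\w+', desc):
        try:
            obj = eval(word)
        except Exception:
            continue
        if isinstance(obj, type):
            types.append(obj)
    print(name, '=>', types)  # tokens => [<class 'list'>, <class 'str'>]
```

This recovers the built-in types mentioned in the annotation, but it loses the nesting (list-of-str vs. tuple-of-lists), which is exactly where the "complicated" part begins.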

(4) And lastly, is there a way to know what is the return type?

Again, I'm afraid not, unless you want to use the regex example above to comb through the docstring. The return type might very well vary depending on the arg(s) type(s). (Consider any function that will work with any iterable.) If you want to try to extract this information from a docstring (again, in the exact format of the pos_tag docstring), you can try another regex:

>>> return_ = re.search(r'(?<=:)rtype:\s*(.*?)(?=\n|$)', getdoc(pos_tag))
>>> if return_:
    print 'return "type" =', return_.group()

return "type" = rtype: list(tuple(str, str))

Otherwise, the best we can do here is to get the source code (which again, is explicitly what you do not want):

>>> import inspect
>>> print inspect.getsource(pos_tag)
def pos_tag(tokens):
    """
    Use NLTK's currently recommended part of speech tagger to
    tag the given list of tokens.

        >>> from nltk.tag import pos_tag
        >>> from nltk.tokenize import word_tokenize
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
        [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is',
        'VBZ'), ("n't", 'RB'), ('all', 'DT'), ('that', 'DT'), ('bad', 'JJ'),
        ('.', '.')]

    :param tokens: Sequence of tokens to be tagged
    :type tokens: list(str)
    :return: The tagged tokens
    :rtype: list(tuple(str, str))
    """
    tagger = load(_POS_TAGGER)
    return tagger.tag(tokens)

