I have been searching for a tool that can scan a paragraph and extract the grammar tenses and features (past simple, present continuous, passive voice, indirect question), as this is a recurring question from our students. We have tools that tell us the approximate level, suggested vocabulary, and word count, but does this even exist (yet)? Thank you in advance.
You will find that most Natural Language Processing (NLP) tools conceptualize linguistic categories differently from how teachers do (language teaching isn't linguistics: simplifications are common, and schoolbooks get updated more slowly than linguistics evolves).
Examples:
* English verbs have only two tenses: PAST or NONPAST. They can have PERFECTIVE aspect or not, and PROGRESSIVE aspect or not. Since these are 3 binary choices, there are 2 × 2 × 2 = 8 different ways an English verb can be realized. I think there'd be less confusion in school if a more linguistically correct version were taught that separates tense from aspect.
* "Future" or "Present Perfect" (something I still got taught in school) don't exist for a proper linguist.
To build what you suggest, existing tools could be combined, but there would have to be a mapping layer on top of syntactic parsers like the Charniak parser, the Collins parser, or MaltParser. Another mismatch between school grammar and linguistics is single versus multiple theories: in school, people usually teach constituent trees, whereas in linguistics phrase-structure (constituent) grammar is one theory among many. One alternative, valency/dependency grammar, which does not rely on trees but focuses on the relations between words, has recently gained a lot of traction in linguistic circles.
I suggest you check out Spacy [0], a quick, easy-to-use Python library providing the above features. The software produced by the Stanford NLP Group is also great [1].
If you do not want to get your hands dirty with code, there are a number of API providers that will offer you the same as the above libraries (TextRazor, Rosette Text Analytics, ...).
After facing a similar learning curve, I put what I know into a lengthy document[0], written in 2018 and based on explorations over 2016-17. Following just the final section will get you deployed and operational quickly. The first section explains key concepts, using conventional ideas as the means of introducing NLP jargon. The sections in between cover theory and practice for getting the most out of whatever tool you end up using.
More general tools are probably available today, such as add-ons for Elasticsearch; I'd start looking there. Interesting items came up when searching DuckDuckGo for: NLP elasticsearch.
[0] http://play.org/articles/introduction-to-natural-language-pr...
> copulae verbs, linking verbs, terms that are often filtered (i.e. stop terms), question terms, time sensitive nouns, amplifiers, clauses, coordinating conjunctions, negations, conditionals (ORs), and contractions
('Barack', '4', 'nsubj:pass')
('Obama', '1', 'flat')
('was', '4', 'aux:pass')
('born', '0', 'root')
('in', '6', 'case')
('Hawaii', '4', 'obl')
('.', '4', 'punct')
It should be very easy to deploy Stanza's pipeline as an API endpoint. Here is an example of such an NLP-library-as-API endpoint, albeit with Hugging Face's Transformers, deployed via Cortex: https://github.com/cortexlabs/cortex/blob/master/examples/py...
Maybe you can give GPT-3 a try.
If you want to go the custom route, the easy way, though it consumes a lot of processing power, is to use a neural network; it does require a tedious dataset-construction phase.
You build a dataset corresponding to your problem, then train the neural network on it.
For inspiration you can look at my colorify browser extension, which uses a neural network that learns, at the same time, to split sentences, predict POS tags, predict the root of the sentence, and predict the parse tree, all of which are then used to decorate the webpage.
What I did was programmatically build a dataset from the spaCy parser, then use it to build a custom JavaScript parser that does what I want. If I wanted to add information that spaCy doesn't provide, like grammar tenses and features, I could complete my dataset manually and have the network predict all the decorations at the same time; because the layers are shared, this doesn't need many extra samples.
You can probably build your dataset faster by interacting with your neural network (active learning).
For the model you can start with something simple like a convolutional residual network architecture, and later move to transformers when you want to reach the state of the art.
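The dataset step above can be as simple as dumping parallel (tokens, labels) sequences to JSON lines; a plain-Python sketch where the labels are invented for illustration, standing in for whatever spaCy-derived or hand-added features you want the network to predict:

```python
import json

# Hypothetical training records: one per sentence, with token and label
# sequences kept parallel. In practice the labels would come from the spaCy
# parser, plus manual additions (e.g. tense/aspect tags spaCy lacks).
examples = [
    {"tokens": ["She", "has", "been", "waiting", "."],
     "labels": ["PRON", "AUX", "AUX", "VERB", "PUNCT"]},
    {"tokens": ["It", "was", "signed", "."],
     "labels": ["PRON", "AUX", "VERB", "PUNCT"]},
]

with open("dataset.jsonl", "w") as f:
    for ex in examples:
        assert len(ex["tokens"]) == len(ex["labels"])  # parallel sequences
        f.write(json.dumps(ex) + "\n")
```

A network trained on such records can predict several label columns at once, which is where the shared-layers saving comes from.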
http://moin.delph-in.net/ErgTop
Online demo:
It has all that information in the generated feature structure -- even more than you can view in the web interface. There's a development environment you can download, as well as a headless Linux tool called ACE that you can use on a server. The ERG is complex, but far and away the most sophisticated tool in this space.
It may not be the exact thing you're looking for but it can probably be helpful to your students.
I would also look at Python NLTK. I've only dabbled in the toolkit, so I'm not sure if it has what you're looking for exactly, but it's worth a look.
I've done extensive work in this area, including developing my own statistical parser from scratch. I'd be happy to chat more about this project, my email is daniel dot burfoot at gmail.com.
$ sudo apt install -y apertium-eng
$ echo "I have been searching for a tool that can scan a paragraph" |apertium eng-disam|grep -v '^;'
""
"prpers" prn subj p1 mf sg
""
"have" vbhaver inf
"have" vbhaver pres
""
"be" vbser pp
""
"search# for" vblex ger SELECT:177
""
"a" det ind sg
""
"tool" n sg
""
"that" cnjsub
"that" prn dem mf sg
"that" prn rel an mf sp
""
"can" vbmod pres SELECT:281
""
"scan" vblex inf SELECT:140
""
"a" det ind sg
""
"paragraph" n sg
"<.>"
"." sent
(We grep out lines starting with ; since they just show what was removed by the disambiguator, whereas the SELECT/REMOVE annotations are trace info saying which rules applied. If a wordform still has multiple indented lines, the disambiguator didn't manage to fully disambiguate the analysis.)

If you want to e.g. mark passives, it's easy to write a Constraint Grammar rule to do this. Put the following into rules.cg3:
DELIMITERS = sent ;
ADD (&PASSIVE) ("be") # Add the tag "&PASSIVE" to the word with lemma "be"
IF
(1* (pp) # There is a participle to the right
BARRIER (*) - (adv) # with nothing in between except perhaps adverbs
);
and pipe it in after the above pipeline:

$ echo "The paper is not signed by me" |apertium eng-disam |grep -v '^;'|vislcg3 -g rules.cg3
""
"the" det def sp
""
"paper" n sg
""
"be" vbser pres p3 sg &PASSIVE
""
"not" adv
""
"sign" vblex pp
"sign" vblex past
"signed" adj
""
"by" pr SELECT:470
""
"prpers" prn obj p1 mf sg
"<.>"
"." sent
( https://wiki.apertium.org/wiki/Constraint_Grammar for more info on CG )
There will always be a gap between "your judgement" and the "judgement baked into a model" -- worse yet, if the model is very general and oriented towards cheap computation and away from expensive people, it will have vague and contradictory judgements inside it that make the results meaningless.
That is the language of failure: the structure of success looks like the following.
(1) The system works like a "magic marker": you mark up a lot of text (say 20,000 sentences) the way you think it should be marked up. This might be character-at-a-time or word-at-a-time. Character-at-a-time is real and eternal; word-at-a-time is not real, because there is not really such a thing as a "word" (e.g. "red ball" can fill slots that take "ball"; you can smash together subwords to make words; people violate punctuation rules, as in "Amazon.com announced that..."; people call themselves n3pg34r; ...). So if you segment the text up front and segment it the wrong way, you may throw out essential information and choose to fail.
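Character-at-a-time markup as described in (1) can be represented very simply; a sketch with an invented BIO-style labeling helper and a hypothetical passive-span annotation:

```python
def char_labels(text, spans):
    """One label per character. spans: list of (start, end, label),
    end exclusive; B- marks the first character of a span, I- the rest."""
    labels = ["O"] * len(text)
    for start, end, label in spans:
        labels[start] = "B-" + label
        for i in range(start + 1, end):
            labels[i] = "I-" + label
    return labels

sent = "The paper is not signed by me"
# Hypothetical annotation: mark the verb group "is not signed" as passive.
labels = char_labels(sent, [(10, 23, "PASSIVE")])
print(labels[10:13])  # ['B-PASSIVE', 'I-PASSIVE', 'I-PASSIVE']
```

Because the labels sit on characters rather than tokens, no up-front segmentation decision can throw information away.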
(2) You need some system to mark up the text manually and efficiently. It is a lot of work. A typical person can make about 2000 or so up/down judgements a day; if a sentence counts for 10 decisions then maybe you can annotate 200 sentences a day. If you can get students to do it and get teachers to review it you might make short work of it.
This annotator ticks the requirements, but most people find it terribly hard to use and wind up building "easy to use" systems that don't align things right at level (1) and... fail.
Assuming you do (1) and (2) the odds are in your favor, but you have to now
(3) Build models. It does not matter whether the model is a bunch of rules you cobbled together, a hidden Markov model, an LSTM, or a convolutional network. Off the top of my head, I would train an LSTM to "predict the next character" on maybe 100M characters of text, then stick a simple model on top that takes the LSTM state as input and labels characters at the output (could be an SVM, random forest, logit, or a 3-layer NN).
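A minimal PyTorch sketch of the wiring in (3); the names and layer sizes are invented, and the "predict the next character" pretraining on 100M characters is elided. It only shows an LSTM's per-character states feeding a simple classifier:

```python
import torch
import torch.nn as nn

class CharTagger(nn.Module):
    def __init__(self, n_chars=256, emb=32, hidden=128, n_labels=5):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb)
        # In practice this LSTM would first be pretrained as a
        # next-character language model on a large corpus.
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        # The simple model on top: a linear layer here, though an SVM,
        # random forest, or small MLP over the states would also do.
        self.classify = nn.Linear(hidden, n_labels)

    def forward(self, char_ids):          # (batch, seq_len)
        states, _ = self.lstm(self.embed(char_ids))
        return self.classify(states)      # (batch, seq_len, n_labels)

x = torch.randint(0, 256, (1, 40))        # one 40-character "sentence"
print(CharTagger()(x).shape)              # torch.Size([1, 40, 5])
```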
(4) Accept that the system is not going to be perfect, but have the ability to manually patch wrong results and to improve the training data over time. I'd say this practice is more important than any particular approach to (3).
Some tool could give you (1-4) tied up in a bow, or at least claims to. But (2) involves elbow grease that 90% of people aren't going to put in. Some of the 10% who do will succeed; the rest will fail.