A Ruby-based web service for speech recognition, using the PocketSphinx GStreamer module.
- Ruby 1.8
- Sinatra
- Rack
- Unicorn
- PocketSphinx (NOTE: some features of the server require a patched PocketSphinx, see below)
- Some acoustic and language models for PocketSphinx
Install sphinxbase from SVN (make, make install) 
In the cmusphinx/pocketsphinx directory:

wget http://www.phon.ioc.ee/~tanela/ps_gst.patch
patch -p0 -i ps_gst.patch
Make sure you have the GStreamer development packages installed. On Debian Squeeze:
apt-get install libgstreamer0.10-dev libgstreamer-plugins-base0.10-dev
Then run configure, make, and make install as usual.
This assumes you have Ruby and RubyGems installed.
You might want to do this as root:
gem install unicorn
gem install sinatra
gem install uuidtools
gem install json
gem install locale
Install the ruby-gstreamer package (the name may vary depending on your distribution):
apt-get install libgst-ruby1.8
The English GF-based recognizer also needs:

- libtext-unidecode-perl
- Phonetisaurus, and the prebuilt Phonetisaurus model for English (code.google.com/p/phonetisaurus/downloads/detail?name=g014b2b.tgz)
- Python
Clone the git repository:
git clone git://github.com/alumae/ruby-pocketsphinx-server.git
Before executing, add /usr/local/lib to the path where GStreamer plugins are looked for:
export GST_PLUGIN_PATH=/usr/local/lib
unicorn -c unicorn.conf.rb config.ru
If you installed Unicorn as a Ruby gem, you might need to execute:
/var/lib/gems/1.8/bin/unicorn -c unicorn.conf.rb config.ru
Test the default configuration (English “turtle” LM) using a raw audio file from the PocketSphinx test directory:
curl -T ${POCKETSPHINX_DIR}/test/data/goforward.raw -H "Content-Type: audio/x-raw-int; rate=16000" "http://localhost:8080/recognize"
Response should be:
{
  "status": 0,
  "hypotheses": [
    {
      "utterance": "go forward ten meters"
    }
  ],
  "id": "15c7a538d0d0c8d7f59e3cc791320953"
}
Unicorn configuration is in the file unicorn.conf.rb. See unicorn.bogomips.org/examples/unicorn.conf.rb for more info.
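For orientation, a minimal unicorn.conf.rb could look like the sketch below; the values are illustrative assumptions, not the repository's actual settings:

worker_processes 2            # number of worker processes serving requests
listen 8080                   # port assumed in the curl examples in this README
timeout 120                   # generous worker timeout, since decoding can take a while
preload_app true              # load the Sinatra app once, before forking workers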
For server configuration, see conf.yaml.
Some of the more advanced examples below are specific to the Estonian configuration.
Record a sentence to a wav file, in mono (hit Ctrl-C when done speaking):
rec -c 1 sentence.wav
Send it to the web service:
curl -X POST --data-binary @sentence.wav -H "Content-Type: audio/x-wav" http://localhost:8080/recognize
Output (encoded as JSON; the example uses Estonian models):
{
  "status": 0,
  "hypotheses": [
    {
      "utterance": [
        "t\u00e4na on v\u00e4ljas \u00fcsna ilus ilm"
      ]
    }
  ],
  "id": "e30f54561135d681599915562d77d240"
}
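If you prefer calling the service from Ruby rather than curl, a minimal client sketch using the standard net/http and json libraries might look like this (sentence.wav is the file recorded above; error handling is omitted):

require 'net/http'
require 'uri'
require 'json'

uri = URI.parse('http://localhost:8080/recognize')
request = Net::HTTP::Post.new(uri.request_uri)
request['Content-Type'] = 'audio/x-wav'
request.body = File.open('sentence.wav', 'rb') { |f| f.read }

response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
result = JSON.parse(response.body)
# status 0 means success; print the best hypothesis
puts result['hypotheses'][0]['utterance'].inspect if result['status'] == 0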
Record a raw file using arecord:
arecord --format=S16_LE --file-type raw --channels 1 --rate 16000 > sentence2.raw
Send it to the web service:
curl -X POST --data-binary @sentence2.raw -H "Content-Type: audio/x-raw-int; rate=16000" http://localhost:8080/recognize
Record 5 seconds of audio and pipe it to curl, which streams it directly to the web service using PUT (and gets an almost instant response):
arecord --format=S16_LE --file-type raw --channels 1 --rate 16000 --duration 5 | curl -vv -T - -H "Content-Type: audio/x-raw-int; rate=16000" http://localhost:8080/recognize
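The same streaming upload can be done from Ruby with a chunked PUT via net/http; a sketch that reads raw audio from standard input, so it can replace curl in the pipeline above (the script name stream_client.rb is a hypothetical example):

require 'net/http'
require 'uri'

# e.g.: arecord --format=S16_LE --file-type raw --channels 1 --rate 16000 --duration 5 | ruby stream_client.rb
uri = URI.parse('http://localhost:8080/recognize')
request = Net::HTTP::Put.new(uri.request_uri)
request['Content-Type'] = 'audio/x-raw-int; rate=16000'
request['Transfer-Encoding'] = 'chunked'   # stream the body instead of buffering it in memory
request.body_stream = $stdin

response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
puts response.body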
Users can use their own grammars to recognize certain sentences. The grammars should be in JSGF format.
Example JSGF (let’s call it robot.jsgf):

#JSGF V1.0;
grammar robot;
public <command> = ( liigu | mine ) [ ( üks | kaks | kolm | neli | viis ) meetrit ] ( edasi | tagasi );
NB! Grammars should be in the same charset that the server uses for its dictionary, which is currently Latin-1 (sorry for that).
You need to upload the JSGF file somewhere the server can fetch it from, let’s say www.example.com/robot.jsgf
Now, let the server download and compile it:
curl -vv http://localhost:8080/fetch-lm?url=http://www.example.com/robot.jsgf
This should result in HTTP/1.1 200 OK.
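The same call can be made from Ruby, for completeness (the grammar URL is the example one from above; CGI.escape keeps the nested URL intact as a query parameter):

require 'net/http'
require 'uri'
require 'cgi'

grammar_url = 'http://www.example.com/robot.jsgf'
uri = URI.parse("http://localhost:8080/fetch-lm?url=#{CGI.escape(grammar_url)}")
response = Net::HTTP.get_response(uri)
puts response.code   # "200" means the grammar was fetched and compiled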
Now you can use the grammar to recognize a sentence that is accepted by the grammar:
arecord --format=S16_LE --file-type raw --channels 1 --rate 16000 --duration 5 | \
curl -vv -T - -H "Content-Type: audio/x-raw-int; rate=16000" "http://localhost:8080/recognize?lm=http://www.example.com/robot.jsgf"
Result:
{
  "status": 0,
  "hypotheses": [
    {
      "utterance": "mine viis meetrit tagasi"
    }
  ],
  "id": "9e3895e9ee0b5138e73c6fca30f51a58"
}
If you update the grammar, you need to make the /fetch-lm request again, as the server does not check for changes every time a recognition request is made (for efficiency reasons).
GF (Grammatical Framework) grammars are supported.
A GF grammar must be compiled into a .pgf file. To upload it to the server, use the fetch-lm API call, e.g.:
curl "http://bark.phon.ioc.ee/speech-api/v1/fetch-lm?url=http://kaljurand.github.com/Grammars/grammars/pgf/Calc.pgf&lang=Est"
The 'lang' parameter (defaults to 'Est') specifies the input language(s) of the grammar. Multiple comma-separated languages can be specified, e.g. lang=Est,Est2
To recognize with a GF grammar, use a similar request as with JSGF, e.g.:

arecord --format=S16_LE --file-type raw --channels 1 --rate 16000 --duration 5 | curl -vv -T - -H "Content-Type: audio/x-raw-int; rate=16000" "http://localhost:8080/recognize?lm=http://kaljurand.github.com/Grammars/grammars/pgf/Calc.pgf"
You can also specify output language(s) that will be used to linearize the raw recognition result, e.g.:
arecord --format=S16_LE --file-type raw --channels 1 --rate 16000 --duration 5 | curl -vv -T - -H "Content-Type: audio/x-raw-int; rate=16000" "http://localhost:8080/recognize?lm=http://kaljurand.github.com/Grammars/grammars/pgf/Calc.pgf&output-lang=App"
Output:
{
  "status": 0,
  "hypotheses": [
    {
      "utterance": "viis minutit sekundites",
      "linearizations": [
        {
          "lang": "App",
          "output": "5 ' IN \""
        },
        {
          "lang": "App",
          "output": "5 min IN s"
        }
      ]
    }
  ],
  "id": "83486feaca30995401ed4a66951a3f23"
}
Multiple output languages can be specified using comma-separated values: "..&output-lang=App,App2"
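Once the JSON response is parsed, walking the linearizations from Ruby is straightforward; a short sketch, where response_body is assumed to hold the raw JSON string returned by the server:

require 'json'

result = JSON.parse(response_body)
result['hypotheses'].each do |hyp|
  puts hyp['utterance']
  (hyp['linearizations'] || []).each do |lin|   # linearizations are present only for GF grammars
    puts "  #{lin['lang']}: #{lin['output']}"
  end
end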