Archive for March 2012
O’Reilly Radar – Insight, analysis, and research about emerging technologies.
…the session with this title was one of the best sessions at this year's StrataConf.
Yesterday my Google Reader feed gave me a very inspiring use case for the tech cocktail of data mining, language processing & image recognition: Startup Helps Small E-Businesses Stand Even With Amazon, Provides Pricing as a Service.
This could be the next version of the earlier API mashups: these services connect information in a much more relevant way, and the nice thing is that in many cases the business model is part of the package.
For Googlemail you could do it like this:
0) Think of the kind of content you want to be notified about and write down terms that are likely to accompany this type of content in a text or attachment. (For example, a “flight confirmation” might also contain fields such as booking ID, departure date etc.)
1) If you need immediate user attention, you might
1a) use Google context-sensitive gadgets ( https://developers.google.com/go… ) to identify content related to the type of content you are interested in (you can use a regular expression to match mails/attachments), or
1b) use the Google data API if you are comfortable with handling this in a backend process ( http://code.google.com/intl/de-D… ).
2) You can forward/post the mails/attachments to your web application and notify the user that you have processed content of that kind.
In the context gadgets, your processing is constrained to steps that you can do inside a JS script or an HTML page, so regex evaluation is the most convenient solution, though it is not very flexible (think of changing terms etc.).
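The regex matching described above can be sketched as follows. This is only a minimal illustration in Python rather than in gadget JavaScript; the patterns, the function name `looks_like_flight_confirmation` and the `min_hits` threshold are assumptions made for the example and are not part of any Google API.

```python
import re

# Hypothetical term patterns for a "flight confirmation" mail.
# They are assumptions for this sketch; adjust them to your own content type.
FLIGHT_PATTERNS = [
    re.compile(r"flight\s+confirmation", re.IGNORECASE),
    re.compile(r"booking\s+id\s*[:#]?\s*\w+", re.IGNORECASE),
    re.compile(r"departure\s+date", re.IGNORECASE),
]


def looks_like_flight_confirmation(text, min_hits=2):
    """Return True if at least `min_hits` of the patterns occur in `text`."""
    hits = sum(1 for pattern in FLIGHT_PATTERNS if pattern.search(text))
    return hits >= min_hits
```

Requiring more than one matching term makes the check a bit less brittle than a single keyword, but the term list still has to be maintained by hand, which is the inflexibility mentioned above.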
When you need a learning model, you might want to use more sophisticated language processing toolkits, but they need backend processing capabilities, which usually means running a backend server. (For Python, have a look at www.nltk.org )
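To give an idea of what such a learning model does in a backend, here is a small sketch of a naive Bayes text classifier written with the Python standard library only. It is a stand-in for what a toolkit like NLTK provides, not NLTK's own API; the class name `TinyNaiveBayes`, the tokenizer and the labels are assumptions for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class TinyNaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training mails
        self.vocabulary = set()

    def train(self, labeled_texts):
        """Learn word counts from an iterable of (text, label) pairs."""
        for text, label in labeled_texts:
            words = tokenize(text)
            self.word_counts[label].update(words)
            self.doc_counts[label] += 1
            self.vocabulary.update(words)

    def classify(self, text):
        """Return the label with the highest log-probability for `text`."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

You would train it once with a handful of labeled mails, for example `model.train([("Your flight confirmation, booking ID attached", "flight"), ("Lunch on Friday", "other")])`, and then call `model.classify(mail_text)` for each incoming mail. Unlike the regex approach, the terms are learned from examples instead of being listed by hand.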
If you are targeting a news content ontology, a book like the (very good) "Semantic Web for the working ontologist" mentioned above ( http://www.amazon.de/Semantic-We… ) is only part of the story: another crucial part is managing the (typically team-based) process of putting the ontology together.
There are not many solutions in this area yet (especially when you don't want to train everybody on the team in Semantic Web details). One notable tool is http://poolparty.biz/ , which focuses on ontology & vocabulary creation for subject matter experts without requiring them to drop down to editing text files.
If you already have a big bag of quality news content, you might also try to "fish" the relevant & specific terms out of the existing content using language processing tools and put them into your ontology. …this can help you build a critical mass of terms for your content very fast. (For term finding, you might want to look at the Python-based NLTK.)
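The "fishing" of terms can start very simply. The sketch below counts frequent non-stopword tokens across a set of documents, using only the Python standard library as a stand-in for a toolkit like NLTK; the stop list, the function name `candidate_terms` and the length filter are assumptions for illustration, and the output is only a candidate list for a human editor to review.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption); a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "on", "for", "is", "with"}


def candidate_terms(documents, top_n=5):
    """Return the `top_n` most frequent non-stopword tokens across `documents`."""
    counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z]+", doc.lower())
        # Skip stopwords and very short tokens, which rarely make useful terms.
        counts.update(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [term for term, _ in counts.most_common(top_n)]
```

Plain frequency favors generic words, so in practice you would probably compare the counts against a general-language corpus or look at multi-word phrases before adding anything to the ontology.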