WSEAS

Plenary Lecture

Natural Language Processing: Key for Next Generation Big Data and Data Science

Professor Emdad Khan
Maharishi University of Management (MUM)
USA
E-mail: ekhan@mum.edu

Abstract: Solving Big Data and Data Science problems is very important for today’s data and information processing applications. Such applications mainly address structured data, which represents about 20% of total data. Businesses are getting huge benefits by successfully using structured data, especially in Business Analytics, and many tools and software packages exist to help process it. Structured data is basically numbers at its core: for example, how many times a page was visited, how long someone was on a website, or the Click Through Rate (CTR) and conversions from an advertising campaign. If it can be counted, it can be analyzed. If it can be analyzed, it can be interpreted. This is great. But it is not going to give us a view into what is actually being said in conversations or running text. It will not provide the meaning of the conversation or text. In order to delve into the dialog and make good sense of it, we have to get into the unstructured side of things. Unstructured data are things like text (say, from a survey or from tweets), video, or a voice recording of a customer’s comments.
Today, unstructured (including some semi-structured) data represent about 80% of the data. This 80/20 split is changing fast, with unstructured data growing the fastest. Thus, it is more important to address the problems with unstructured data, as doing so will not only help businesses in a more significant way but will also help all other users, including consumers. In fact, it is becoming a necessity to address the problems associated with unstructured and mixed / semi-structured data. Counting and the associated analytics used in structured Big Data and Data Science are not well suited for unstructured data. For example, what type of count or interpretation can be made from customer feedback in running text or from a voice recording of a customer service transaction? How are tweets to be interpreted and analyzed? What type of information can be gleaned from customer product reviews? What happens when those reviews are videos?
Clearly, we need not only to address the problems associated with unstructured and mixed data but also to integrate the solutions in an appropriate way to provide a next generation Analytics system. The good news is that many existing analytics tools have already started adding part of this, called “text mining”, i.e. finding some key information in running text. Existing Natural Language Processing (NLP) tools are good enough to process and mine unstructured text data and find some key information. Such tools use common NLP algorithms / techniques like Sentence Segmentation, Tokenization, Stemming, POS (Part of Speech) Tagging, NER (Named Entity Recognition), Parsing and some basic semantics.
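To make the first few preprocessing steps named above concrete, here is a minimal sketch in Python of sentence segmentation, tokenization and stemming. It is a toy illustration using only the standard library, not the tools or algorithms the talk refers to: the function names, the regular expressions and the suffix list are all assumptions made for this example, and the stemmer is a crude suffix stripper rather than an established algorithm such as Porter's.

```python
import re

def split_sentences(text):
    """Naive sentence segmentation: split on whitespace that follows ., ! or ?."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Lowercase word tokenization that drops punctuation."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", sentence.lower())

def stem(token):
    """Toy stemming: strip one common suffix if at least 3 characters remain."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Return one list of stemmed tokens per sentence."""
    return [[stem(t) for t in tokenize(s)] for s in split_sentences(text)]

# Hypothetical customer feedback, invented for this illustration.
feedback = "The agent resolved my issue quickly. Waiting times were too long!"
result = preprocess(feedback)
print(result)
```

Even on this tiny input the output is only a normalized bag of word stems per sentence. Nothing in it says whether the customer was satisfied, which is the gap between such basic text mining and the semantics discussed in the rest of the abstract.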
However, text mining is at an early stage. Today’s text mining cannot handle complex tasks in NLP. Besides, the types of problems in text mining are different from those in mining structured data. The semantics of structured data can be well defined and processed. An SQL query can find the desired data in regular transaction-oriented databases. OLAP (Online Analytical Processing) and data mining can be done on data warehouses and data marts. Structured data processing is a mature field. But in text mining, defining semantics is much more difficult. The query (or search process) is different, and getting the desired results can be very complex. This is mainly because of the natural questions that users may ask. For example, users may like to ask general questions which may be very subjective, like “what is the key message from most customers about the customer support experience of company xyz?” or “what can we infer from the key news about company xyz during the last 3 months?”. This is very different from analytics on structured data, which uses relatively more deterministic questions like “How did the stock of company xyz perform during the last 3 months?”.
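The deterministic stock question above maps directly onto a query over structured data. The following sketch uses Python's built-in sqlite3 module with an in-memory database. The table name, column names, dates and prices are all hypothetical, invented purely for illustration, and "performance" is assumed here to mean the percentage change between the first and last closing price in the period.

```python
import sqlite3

# Hypothetical schema and data, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock_prices (company TEXT, trade_date TEXT, close REAL)"
)
conn.executemany(
    "INSERT INTO stock_prices VALUES (?, ?, ?)",
    [
        ("xyz", "2024-01-02", 100.0),
        ("xyz", "2024-02-01", 104.0),
        ("xyz", "2024-03-28", 110.0),
        ("abc", "2024-01-02", 50.0),
    ],
)

def price_change(conn, company, start, end):
    """Percentage change from the first to the last close within [start, end]."""
    cur = conn.execute(
        "SELECT close FROM stock_prices "
        "WHERE company = ? AND trade_date BETWEEN ? AND ? "
        "ORDER BY trade_date",
        (company, start, end),
    )
    closes = [row[0] for row in cur.fetchall()]
    if len(closes) < 2:
        return None  # not enough data points to compute a change
    return (closes[-1] - closes[0]) * 100.0 / closes[0]

change = price_change(conn, "xyz", "2024-01-01", "2024-03-31")
print(change)
```

No comparably direct query exists for the subjective questions quoted above: "the key message from most customers" cannot be expressed as a filter and an aggregate over columns, which is why the abstract argues that semantics is the harder problem.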
Thus, we have much more complex problems to solve to handle future Big Data and Data Science needs. Moreover, the semantics issue is more important for such systems, as users would like to ask more natural questions or make requests that require producing a summary or drawing some inference. And this becomes even more complex with more unstructured multimedia data.
To handle such advanced issues related to future Big Data and Data Science, we would need to address quite a few new things. The key is to use NLP based computing with a highly capable and efficient Semantic Engine. Such a Semantic Engine would address the needs of more capable text mining and natural language processing / understanding. The other key areas we would need are AI, Machine Learning, related advanced algorithms and Natural Language based UI (User Interface) with effective and efficient integration of all these – a complex multidisciplinary area.
This talk will focus on the need to use NLP based computing to effectively address the needs of future Big Data and Data Science. Special attention will be given to natural language semantics, as it is the key to NLP based computing. A Semantic Engine using Brain-Like Approach (SEBLA) will be discussed as a way to address the semantics issue (abstraction, representation, real meaning, and computational complexity), and the talk will show how to use SEBLA to effectively address the key problems mentioned above with some good examples, especially in the Analytics area. SEBLA uses more human-like semantics, as opposed to the more “mechanical semantics” that existing systems use with Predicate Logic, Ontology or the like.
It is important to note that NLP based computing is needed not only for future Analytics but also for many other applications. NLP based computing will address most needs of the next generation Internet, such as requesting specific information (e.g. “show me the pictures of last Saturday’s party”), getting a specific answer to a question, getting much more focused search results (like under 50 results versus today’s millions of hits), completing transactions naturally, getting a summary of one or more articles, and drawing inferences.
NLP based computing will help many more people to enjoy the benefits of the Internet & Information Age in a much more effective and efficient way, thus enabling many more people around the world, especially in underdeveloped and developing countries, to effectively bridge the Digital and Language Divides in a practical way, and helping sustainable global development by focusing on Education, Innovation and Entrepreneurship.

Brief Biography of the Speaker: Dr. Emdad Khan is the Founder of InternetSpeech. He founded the company in 1998 with the vision to develop innovative technology for accessing information on the Internet anytime, anywhere, using just an ordinary telephone and the human voice. As a pioneer in the Internet voice space, Khan is a frequent speaker at academic & industry conferences and trade shows on Natural Language, Voice Recognition, Internet applications, bridging the Digital and Language Divides, and other topics. He holds 23 patents and has published more than 60 journal & conference papers on the advent of the Intelligent Internet, content rendering, Natural Language Processing/Understanding, Big Data, Bioinformatics, Software Engineering, Neural Nets, Fuzzy Logic, Intelligent Systems, VLSI and optics. Khan’s acute technical knowledge and keen understanding of emerging markets have played an important role in the development of InternetSpeech’s key products/services, including netECHO (Voice Internet that delivers complete Internet access via voice and any phone) and SEBLA (Semantic Engine using Brain-Like Approach).
During his career, Khan invented, defined, developed and deployed worldwide new intelligent software products for micro-controller-based home appliances. He has also created and deployed speech recognition Internet applications. He has 20 years of experience with large semiconductor companies, including Intel and National. He also has over 8 years of experience in academia.
Khan is active in research. His current major interest is to use brain-like and brain-inspired algorithms to solve some open problems, especially NLU (Natural Language Understanding). This is very well aligned with InternetSpeech’s next generation products & services, which allow users (especially bottom of the pyramid people) to interact with the Internet using their natural language, and thus help their economic, social and other development. Solving the NLU problem using a Semantic Engine has numerous applications, including Intelligent Search, Intelligent Information Retrieval, Question & Answer Systems, Summarization, Big Data, Analytics and more. These can be used in Biology, Economics, Finance, Manufacturing, Agriculture and the like, and are critical for Economic, Social, Cultural and other development with increased World Peace, with special focus on Education, Innovation and Entrepreneurship.
Khan’s recent interests also include exploring how the human brain uses vital communications, biological alphabets, and associated language to gain an understanding of the meaning of those communications, a basic necessity for clearly understanding how biological systems work. This includes the use of the Science of Creative Intelligence & Vedic Science.
He is the author of the book “Internet for Everyone: Reshaping the Global Economy by Bridging the Digital Divide”.
He holds a doctorate in computer science, master of science degrees in electrical engineering and engineering management, and a bachelor of science degree in electrical engineering.
Khan is a faculty member at Maharishi University of Management (MUM), USA. He is also a visiting Research Professor at Southern University in Baton Rouge, Louisiana, USA.
