Applied NLP
Applied: we work with real language data
What is a natural language?
- A means of communication that has evolved naturally in humans through use and repetition, without conscious planning, e.g., English, Korean, etc.
What is not a natural language?
- Programming languages: Python
- Formal languages: first-order logic
Processing: how to program computers to analyze large amounts of natural language data.
- Natural language understanding ex) Amazon reviews -> a retailer can read the reviews to improve the product.
- Natural language generation ex) Google Translate.
- Speech understanding
Applications
- Keyword search ex) Google Search
- Spell checking ex) Grammarly
- Chatbot
- Machine Translation
- Dialogue Systems
What is the difference between Linguistics and NLP?
Linguistics | NLP |
Phonetics, Phonology | Speech recognition, synthesis |
Syntax | Part-of-speech tagging, parsing |
Lexical Semantics | Entity recognition |
Compositional Semantics | Role labeling, reference resolution, text classification |
Production | Language generation, summarization, machine translation |
Evolution of NLP
[Rule-based(1950) -> Statistical Learning(1990) -> Deep Learning(2010)]
1. Rule-based NLP
- ex) Who is the prime minister of India?
- Questions are matched against hand-written patterns keyed on words such as who/when/is/does. (The system returns a stored answer for a given question.)
- A hand-crafted system of rules that parses text and matches patterns is used to imitate the human way of building language structures.
- It can achieve high performance in specific use cases ex) Q&A, but often suffers performance degradation when generalized.
- Requires domain knowledge about the language -> answers can only be stored by someone with knowledge of that domain.
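A minimal sketch of such a rule-based Q&A system: hand-written regex patterns map question shapes to a stored-answer table. Both the patterns and the fact table below are invented for illustration.

```python
import re

# Hand-crafted rules: each pattern maps a question shape to a relation.
RULES = [
    (re.compile(r"who is the prime minister of (\w+)", re.I), "prime_minister"),
    (re.compile(r"when was (\w+) founded", re.I), "founding_year"),
]

# Hypothetical stored answers, keyed by (relation, entity).
FACTS = {
    ("prime_minister", "india"): "Narendra Modi",
    ("founding_year", "google"): "1998",
}

def answer(question: str) -> str:
    # Try each hand-written pattern; on a match, look up the stored answer.
    for pattern, relation in RULES:
        match = pattern.search(question)
        if match:
            entity = match.group(1).lower()
            return FACTS.get((relation, entity), "Unknown")
    return "Cannot parse question"

print(answer("Who is the prime minister of India?"))  # Narendra Modi
```

Questions that fit no pattern fail outright ("Cannot parse question"), which is exactly the generalization problem noted above.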
2. NLP based on statistical learning: learning from extracted features.
- Learning is performed based on probabilistic modeling, likelihood maximization, and training classifiers.
- Requires an annotated training dataset along with a suitable method of feature engineering, potentially based on domain knowledge about the language.
- A parametric model is trained, followed by evaluation on a test dataset similar to, yet different from, the training dataset.
- More generalizable than rule-based NLP and applicable to a broader range of applications like machine translation.
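A minimal sketch of the probabilistic-modeling idea: a Naive-Bayes-style text classifier trained by maximum likelihood (with add-one smoothing) on a toy annotated dataset. The training examples and labels are invented for illustration.

```python
import math
from collections import Counter

# Toy annotated training set; the "feature engineering" is just bag-of-words.
train = [
    ("great product loved it", "pos"),
    ("terrible quality broke fast", "neg"),
    ("loved the quality", "pos"),
    ("terrible product", "neg"),
]

# Estimate P(word | label) by maximum likelihood with add-one smoothing.
counts = {"pos": Counter(), "neg": Counter()}
for text, label in train:
    counts[label].update(text.split())
vocab = {w for c in counts.values() for w in c}

def log_likelihood(text: str, label: str) -> float:
    total = sum(counts[label].values()) + len(vocab)
    return sum(math.log((counts[label][w] + 1) / total) for w in text.split())

def classify(text: str) -> str:
    # Pick the label under which the text is most likely.
    return max(("pos", "neg"), key=lambda lab: log_likelihood(text, lab))

print(classify("loved this product"))  # pos
```

Note the ingredients listed above are all present: an annotated dataset, hand-chosen features (word counts), and a model fit by likelihood maximization.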
3. Deep learning: learning directly from large textual datasets.
- Similar to ML, with key differences:
- Feature engineering is automated, as deep networks can learn important features from text
-> domain knowledge about the language is minimized
- Raw text with minimal pre-processing is fed into the models
- It requires very large datasets to compensate for not using human domain knowledge
- Applicable to more challenging tasks ex) dialogue system
Why do we still use rule-based NLP and traditional ML:
- Still good for sequence labeling
- Some ideas in deep learning are extended versions of earlier methods
- Can help to improve deep learning-based algorithms
NLP pipeline
- Data Acquisition
- Text Cleaning
- Pre-Processing
- Features Extraction
- Machine Learning
- Evaluation
- Deployment
- Monitoring
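The first few pipeline stages (cleaning, pre-processing, feature extraction) can be sketched in plain Python; the function names here are illustrative, not from any library.

```python
import re

def clean(text: str) -> str:
    # Text cleaning: drop unhelpful content such as external URL links.
    return re.sub(r"https?://\S+", "", text)

def preprocess(text: str) -> list[str]:
    # Pre-processing: lowercase, strip non-alphabetical characters, tokenize.
    text = re.sub(r"[^a-z\s]", "", text.lower())
    return text.split()

def extract_features(tokens: list[str]) -> dict[str, int]:
    # Feature extraction: bag-of-words counts.
    feats = {}
    for tok in tokens:
        feats[tok] = feats.get(tok, 0) + 1
    return feats

raw = "Great phone! See http://example.com for details."
print(extract_features(preprocess(clean(raw))))
```

The later stages (training, evaluation, deployment, monitoring) consume these feature dictionaries rather than raw text.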
Ambiguity in Natural Language
Ambiguity
- The shortstop caught the fly (lexical ambiguity: "fly" can mean the insect or the fly ball)
- Flying planes can be dangerous (structural/syntactic ambiguity)
The sentence above has two readings: 1) planes that are flying can be dangerous (the planes themselves are dangerous), or 2) the act of flying planes can be dangerous. Because both interpretations exist, the sentence is structurally ambiguous.
Local ambiguity VS global ambiguity : Context
- Fat people eat accumulates.
- The man who hunts ducks out on weekends.
Psycholinguistics: the study of human sentence processing
Preprocessing NLP
- Two words or one word? ex) hot dog, New York; hand-writing -> handwriting; He is -> He's
Tokenization : splitting the text into units for processing
- anywhere on the scale from character to word
- removing extra spaces
- removing stop words
- removing unhelpful words ex) external URL links
- removing unhelpful characters ex) non-alphabetical characters
- ex) "Tokenization is the first step in NLP" (7 tokens)
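A minimal word-level tokenization of the example sentence, using Python's built-in whitespace split (a real tokenizer would also handle punctuation, stop words, and the other cleanup steps listed above):

```python
sentence = "Tokenization is the first step in NLP"

# Word-level: the simplest scheme is to split on whitespace.
tokens = sentence.split()
print(tokens, len(tokens))  # 7 tokens

# Character-level tokenization sits at the other end of the scale.
chars = list(sentence)
print(len(chars))
```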
Stemming: a wordform stripped of some characters (stem extraction)
- ex) tokenization -> tokeniz
Lemmatization: the base (or citation) form of a word (this also requires determining which part of speech (POS) the word has in the sentence)
- ex) Tokenization, tokenize -> token
- Went, gone, goes -> go
* The difference between stemming and lemmatization: given the word "flies", stemming only strips it down to a stem, while lemmatization decides from the sentence whether "flies" is the verb "to fly" or the noun "fly" (the insect).
Word | Stemming | Lemmatization |
am | am | be |
the going | the go | the going |
having | hav | have |
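The contrast can be illustrated with a toy suffix-stripping stemmer versus a tiny dictionary-based lemmatizer. Both are hand-made stand-ins for illustration; a real system would use something like NLTK's PorterStemmer and WordNetLemmatizer.

```python
def toy_stem(word: str) -> str:
    # Stemming strips suffixes blindly, without consulting context or a dictionary.
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# A real lemmatizer consults a dictionary plus the word's part of speech;
# here a tiny hand-made lookup table stands in.
LEMMAS = {("flies", "VERB"): "fly", ("flies", "NOUN"): "fly",
          ("went", "VERB"): "go", ("having", "VERB"): "have"}

def toy_lemmatize(word: str, pos: str) -> str:
    return LEMMAS.get((word, pos), word)

print(toy_stem("flies"))               # "fl"  -- blind suffix stripping
print(toy_stem("having"))              # "hav" -- matches the table above
print(toy_lemmatize("flies", "VERB"))  # "fly"
```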
Machine Learning
- Preprocessing + feature extraction converts input text into feature vectors.
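As a sketch, bag-of-words feature extraction over a small illustrative vocabulary turns a token list into a fixed-length vector a classifier can consume (the vocabulary here is invented):

```python
# Fixed vocabulary; each position of the feature vector counts one word.
vocab = ["good", "bad", "movie", "great"]

def to_vector(tokens: list[str]) -> list[int]:
    return [tokens.count(word) for word in vocab]

print(to_vector("great great movie".split()))  # [0, 0, 1, 2]
```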