
What is NLP

by 박휴지 (Park Tissue), 2021. 10. 23.

Applied NLP

Applied : we work with real language data

 

What is a natural language?

- A means of communication that has evolved naturally in humans through use and repetition without conscious planning, e.g., English, Korean, etc.

- In other words, a means of communication that evolves as people speak and repeat it.

 

What is not a natural language?

- Programming language : python

- Formal languages : first order logic

 

Processing : How to program computers to analyze large amounts of natural language data.

- Natural language understanding ~ e.g., Amazon reviews: retailers can consult product reviews in order to improve their products.

- Natural language generation ~ e.g., Google Translate.

- Speech understanding

 

Applications

- Keyword search ex. Google search

- Spell checking ex. Grammarly

- Chatbot

- Machine Translation

- Dialogue Systems

 

What is the difference between Linguistics and NLP?

Linguistics             | NLP
Phonetics, Phonology    | Speech recognition, speech synthesis
Syntax                  | Part-of-speech tagging, parsing
Lexical Semantics       | Entity recognition
Compositional Semantics | Role labeling, reference resolution, text classification
Production              | Language generation, summarization, machine translation

 

Evolution of NLP

[Rule-based(1950) -> Statistical Learning(1990) -> Deep Learning(2010)]

 

1. Rule based NLP?

- ex) Who is the prime minister of India?

- Questions must be phrased using fixed cue words (who/when/is/does) as the primary pattern. (the system returns a stored answer to a given question)

- A hand-crafted system of rules that parse text and match patterns is used to imitate the human way of building language structures.

- It can achieve high performance in specific use cases, ex) Q&A, but often suffers performance degradation when generalized.

- Requires domain knowledge about the language -> you need knowledge of the domain in order to store the answers in advance.
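The rule-based idea above can be sketched as a hand-written pattern matcher over a stored answer table. A minimal sketch, assuming a toy `FACTS` table and one regex rule (both made up for illustration, not a real system):

```python
# Rule-based Q&A sketch: hand-crafted patterns matched against the question,
# answers looked up in a small hand-built knowledge table.
import re

FACTS = {"india": "Narendra Modi", "canada": "Justin Trudeau"}  # toy table

RULES = [
    # Pattern for "Who is the prime minister of X?" questions.
    (re.compile(r"who is the prime minister of (\w+)\??", re.I),
     lambda m: FACTS.get(m.group(1).lower(), "unknown")),
]

def answer(question):
    for pattern, respond in RULES:
        m = pattern.search(question)
        if m:
            return respond(m)
    return "no rule matched"

print(answer("Who is the prime minister of India?"))  # Narendra Modi
```

Note how the system only works for questions it has rules and stored answers for: any question outside the patterns falls through, which is exactly the generalization problem mentioned above.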

 

2. NLP based on statistical learning : learning from extracted features.

- Learning is performed based on probabilistic modeling, likelihood maximization, and training classifiers.

- Requires an annotated training dataset along with a suitable method of feature engineering, potentially based on domain knowledge about the language.

- A parametric model is trained, followed by evaluation on a test dataset similar to, yet different from, the training dataset.

- More generalizable than rule-based NLP and applicable to a broader range of applications like machine translation.
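A minimal sketch of this style, assuming a toy Naive Bayes classifier over bag-of-words features (the labeled training sentences below are made up for illustration):

```python
# Statistical NLP sketch: feature extraction (bag-of-words counts) plus
# likelihood maximization (Naive Bayes with add-one smoothing).
from collections import Counter
import math

train = [  # toy annotated dataset
    ("great product works well", "pos"),
    ("love this great quality", "pos"),
    ("terrible broke fast", "neg"),
    ("bad quality terrible", "neg"),
]

word_counts = {"pos": Counter(), "neg": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    # Pick the class with the highest log-probability under the model.
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / sum(class_counts.values()))
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("great quality"))  # pos
```

The `.split()` calls stand in for the feature-engineering step; a real system would use the pre-processing and feature-extraction stages described later in the pipeline.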

 

 

 

3. Deep learning: learning directly from large textual datasets.

- Similar to statistical ML, with key differences:

- Feature engineering is automated, as deep networks can learn important features from text.

-> Domain knowledge about the language is minimized (only a minimal amount is needed).

- Raw text with minimal pre-processing is fed into the models.

- It requires very large datasets to compensate for the lack of human domain knowledge.

- Applicable to more challenging tasks ex) dialogue system

 

Why do we still use rule-based NLP and traditional ML:

- Still good for sequence labeling

- Some ideas in deep learning are extended versions of earlier methods

- Can help to improve deep learning-based algorithms.

 

NLP pipeline

- Data Acquisition

- Text Cleaning

- Pre-Processing

- Features Extraction

- Machine Learning

- Evaluation

- Deployment

- Monitoring
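The first few stages of the pipeline above can be sketched as composable functions (function names, the sample documents, and the cleaning rules are illustrative assumptions, not a real framework):

```python
# Pipeline sketch: acquisition -> cleaning -> pre-processing -> feature extraction.
import re

def acquire():                 # Data Acquisition: toy in-memory corpus
    return ["Great product!!  Visit http://example.com", "Terrible, broke fast..."]

def clean(text):               # Text Cleaning: drop URLs and extra spaces
    text = re.sub(r"http\S+", "", text)
    return re.sub(r"\s+", " ", text).strip()

def preprocess(text):          # Pre-Processing: lowercase and tokenize
    return re.findall(r"[a-z]+", text.lower())

def extract_features(tokens):  # Feature Extraction: bag-of-words counts
    feats = {}
    for tok in tokens:
        feats[tok] = feats.get(tok, 0) + 1
    return feats

for doc in acquire():
    print(extract_features(preprocess(clean(doc))))
```

The remaining stages (machine learning, evaluation, deployment, monitoring) would consume these feature dictionaries downstream.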

 

Ambiguity in Natural Language

Ambiguity

- The shortstop caught the fly (lexical ambiguity)

- Flying planes can be dangerous (structural/syntactic ambiguity)

The sentence above can be read in two ways: 1) planes that are flying can be dangerous (the planes themselves are the danger), or 2) the act of flying planes can be dangerous. Because both readings are possible, the sentence is structurally ambiguous.

 

Local ambiguity VS global ambiguity : Context

- Fat people eat accumulates.

- The man who hunts ducks out on weekends.

 

Psycholinguistics : the study of human sentence processing

 

Preprocessing NLP

- Two words or one word? ex) hot dog, New York, hand-writing -> handwriting, He is -> He's

 

Tokenization : splitting the text into units for processing

- units anywhere on the scale from character to word

- removing extra spaces

- removing stop words

- removing unhelpful words ex) external URL links

- removing unhelpful characters ex) non-alphabetical characters

- EX. Tokenization is the first step in NLP (7 tokens)
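The steps above can be sketched as a tiny whitespace tokenizer with optional stop-word removal (the stop-word list and cleaning rules are toy assumptions, not a real tokenizer like those in NLTK or spaCy):

```python
# Tokenization sketch: strip unhelpful characters, collapse extra spaces,
# split on whitespace, optionally drop stop words.
import re

STOP_WORDS = {"is", "the", "in"}  # toy stop-word list

def tokenize(text, remove_stop_words=False):
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # remove non-alphabetical characters
    tokens = text.split()                     # split() also collapses extra spaces
    if remove_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return tokens

print(tokenize("Tokenization is the first step in NLP"))
# ['Tokenization', 'is', 'the', 'first', 'step', 'in', 'NLP'] -> 7 tokens
```

With `remove_stop_words=True`, the same sentence yields only the content words, illustrating why stop-word removal is a separate, optional step.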

 

Stemming : a wordform stripped of some characters (stem extraction)

- ex) tokenization -> tokeniz

 

Lemmatization : finding the base (or citation) form of a word (it also determines which part of speech (POS) the word is used as in the sentence)

- ex) Tokenization, tokenize -> token

- Went, gone, goes -> go

 

* The difference between stemming and lemmatization: given the English word "flies", stemming only gives you the stem, while lemmatization determines from the sentence whether "flies" is the verb "fly" or the noun "fly".

 

Word      | Stemming | Lemmatization
am        | am       | be
the going | the go   | the going
having    | hav      | have
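Stemming can be sketched as crude suffix stripping, in the spirit of the examples above (the suffix list below is made up; a real system would use something like the Porter stemmer):

```python
# Toy suffix-stripping stemmer: chop off the first matching suffix.
SUFFIXES = ["ation", "ing", "ly", "es", "s", "e"]  # illustrative, not Porter's rules

def stem(word):
    for suf in SUFFIXES:
        # Only strip if enough of the word remains to look like a stem.
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

print(stem("tokenization"))  # tokeniz
print(stem("having"))        # hav
print(stem("flies"))         # fli
```

Note that the output need not be a dictionary word ("tokeniz", "fli"); producing the actual base form ("tokenize", "fly") and choosing between readings is the job of lemmatization, which needs POS information rather than character rules.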

 

Machine Learning

- Preprocessing + feature extraction converts input text into feature vectors.
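A minimal sketch of that conversion, assuming a tiny fixed vocabulary (made up here): each text becomes a count vector with one position per vocabulary word.

```python
# Feature vector sketch: bag-of-words counts over a fixed vocabulary.
vocab = ["the", "cat", "dog", "sat", "mat"]  # toy vocabulary

def to_vector(text):
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocab]

print(to_vector("The cat sat on the mat"))  # [2, 1, 0, 1, 1]
```

Words outside the vocabulary (like "on" here) are simply dropped; the fixed length is what lets a downstream classifier treat every text uniformly.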

 

 

 
