While summaries are helpful, keywords serve a different purpose: they capture the most essential aspects that potential renters might be looking for. To extract keywords, we can use NLP techniques such as Named Entity Recognition (NER). This process goes beyond identifying frequent words: by considering factors like word co-occurrence and relevance to the rental-listings domain, we can extract critical information. This information can be a single word, such as ‘luxurious’ (adjective) or ‘Ginza’ (location), or a phrase, like ‘quiet environment’ (noun phrase) or ‘near to Shinjuku’ (proximity).
3a. Level: Easy — Regex
The ‘find’ function in string operations, along with regular expressions, can do the job of finding keywords. However, this approach requires an exhaustive list of words and patterns, which is often impractical. If such an exhaustive list is available (like stock exchange abbreviations in a finance-related project), regex might be the simplest way to do it.
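As a minimal sketch of this approach, the listing text and the keyword list below are illustrative; in practice, the word list would be curated for the domain:

```python
import re

description = "Luxurious 2LDK in Ginza, near to Shinjuku station, quiet environment."

# Hand-maintained patterns: workable only when the vocabulary is small and known up front
keyword_pattern = re.compile(r"\b(luxurious|spacious|renovated|quiet environment)\b",
                             re.IGNORECASE)
proximity_pattern = re.compile(r"\bnear to (\w+)", re.IGNORECASE)

print(keyword_pattern.findall(description))    # ['Luxurious', 'quiet environment']
print(proximity_pattern.findall(description))  # ['Shinjuku']
```

Every new phrasing (say, ‘a short walk to Shinjuku’) would require a new pattern, which is exactly why this approach breaks down for open-ended text.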
3b. Level: Intermediate — The Matcher
While regular expressions can handle simple keyword extraction, the need for extensive lists of rules makes it hard to cover all bases. Fortunately, most NLP tools offer NER capability out of the box. For example, the Natural Language Toolkit (NLTK) has named entity chunkers, and spaCy has the Matcher.
The Matcher allows you to define patterns based on linguistic features such as part-of-speech tags or specific keywords. These patterns are matched against the rental descriptions to identify relevant keywords and phrases. This approach captures both single words (like ‘Tokyo’) and meaningful phrases (like ‘beautiful house’) that better represent the selling points of a property.
# Noun phrases
noun_phrases_patterns = [
    [{'POS': 'NUM'}, {'POS': 'NOUN'}],             # example: 2 bedrooms
    [{'POS': 'ADJ', 'OP': '*'}, {'POS': 'NOUN'}],  # example: beautiful house
    [{'POS': 'NOUN', 'OP': '+'}],                  # example: house
]
# Geo-political entity
gpe_patterns = [
    [{'ENT_TYPE': 'GPE'}],                         # example: Tokyo
]
# Proximity
proximity_patterns = [
    # example: near airport
    [{'POS': 'ADJ'}, {'POS': 'ADP'}, {'POS': 'NOUN', 'ENT_TYPE': 'FAC', 'OP': '?'}],
    # example: near to Narita
    [{'POS': 'ADJ'}, {'POS': 'ADP'}, {'POS': 'PROPN', 'ENT_TYPE': 'FAC', 'OP': '?'}],
]
3c. Level: Advanced — Deep Learning-Based Matcher
Even with the Matcher, some terms may escape rule-based matching because of the context of the words in the sentence. For example, the Matcher might miss a term like ‘a stone’s throw away from Ueno Park’, since it doesn’t fit any predefined pattern, or mistake ‘Shinjuku Kabukicho’ for a person (it’s a neighborhood, i.e. a LOC).
In such cases, deep-learning-based approaches can be more effective. By training on a large corpus of rental listings with associated keywords, these models learn the semantic relationships between words. This makes the method more adaptable to evolving language use and able to uncover hidden insights.
Using spaCy, performing deep-learning-based NER is straightforward. However, the major building block for this method is the availability of labeled training data, as is also the case for this exercise. A label is a pair of the target term and the entity name (for example, ‘a stone’s throw away’ is a noun phrase, or, as shown in the picture, ‘Shinjuku Kabukicho’ is a LOC, not a person), formatted in a certain way. Unlike the rule-based approach, where we describe the terms as nouns, locations, and so on using built-in functionality, data exploration or domain experts are needed to discover the target terms we want to identify.
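A minimal sketch of what this labeled data and a spaCy training loop can look like. The single example sentence, its character offsets, and the LOC annotation are illustrative assumptions; a real project would need many annotated listings:

```python
import spacy
from spacy.training import Example

# Each label pairs a character-offset span with an entity name
TRAIN_DATA = [
    ("Cozy flat in Shinjuku Kabukicho near the station.",
     {"entities": [(13, 31, "LOC")]}),  # 'Shinjuku Kabukicho' is a LOC, not a person
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _, annotations in TRAIN_DATA:
    for _start, _end, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()
losses = {}
for _ in range(30):
    for text, annotations in TRAIN_DATA:
        # Example aligns the raw text with its gold-standard entity spans
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, losses=losses)
```

In practice, spaCy projects typically train from a config file with `spacy train` rather than a hand-rolled loop, but the data format above is the core of the work.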
Part 2 of the article will discuss this technique of discovering themes or labels from the data for topic modeling using clustering, bootstrapping, and other methods.