Identifying and Ranking Talented Candidates Using AI | by Chris Turner | Apr, 2025


Apziva AI Residency Program

As an AI Resident at Apziva, I’ve had the opportunity to work on a range of real-world machine learning applications, collaborating with industry partners to tackle complex challenges. The program emphasizes hands-on experience in applying AI to solve practical problems, from predictive analytics to automation. This project focuses on optimizing the talent sourcing process using machine learning to improve candidate ranking and selection.

Project Background

Finding top talent for technology companies is a complex challenge that goes beyond simply matching keywords on a resume. It requires a deep understanding of job roles, identifying standout candidates, and efficiently filtering through large pools of potential hires. Traditionally, this process involves significant manual effort, making it both time-consuming and prone to human bias.

To address these challenges, I aimed to develop a machine learning-powered pipeline that could automate and refine candidate selection. While the client currently sources candidates semi-automatically, the focus here is on evaluating and ranking candidates based on their suitability for a given role as well as past hiring data. This approach allows recruiters to fine-tune results by selecting standout candidates and re-ranking the list accordingly.

Goals

  • Predict Candidate Fit: Develop an AI model to assess how well a candidate matches a given role based on available data.
  • Automated Ranking System: Rank candidates based on a fitness score, reducing the need for manual sorting.
  • Re-Ranking with Feedback: Implement a mechanism where human reviewers can “star” a candidate, triggering a re-ranking of the list to prioritize similar high-fit candidates.
  • Filter Out Irrelevant Candidates: Identify and remove individuals who should not be in the selection pool in the first place.
  • Set a Flexible Cut-Off Point: Determine a reliable threshold for candidate selection while ensuring high-potential candidates are not overlooked.

Data Overview & Exploration

This dataset consists of 104 rows and 5 columns, with a mix of numerical and categorical data. The structure of the dataset is as follows.

  • id (int64): Unique identifier for each record.
  • job_title (object): The job title associated with each record.
  • location (object): The geographical location of the individual.
  • connection (object): The number of professional connections the candidate has on social media, stored as a string, with “500+” used for candidates who have more than 500 connections.
  • fit (float64): The candidate fitness score to be predicted; this column currently contains only missing values.

The dataset contains no missing values except for the fit feature column, which has 100% missing values (104 out of 104).

Looking at unique values and missing values:

+---+------------+-----------+--------------+----------------+----------------------------------------------------------------------------------------------------------+
| | Column | Data Type | Unique Count | Missing Values | Most Frequent Value |
+---+------------+-----------+--------------+----------------+----------------------------------------------------------------------------------------------------------+
| 0 | id | int64 | 104 | 0 | 1 |
| 1 | job_title | object | 52 | 0 | 2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional |
| 2 | location | object | 41 | 0 | Kanada |
| 3 | connection | object | 33 | 0 | 500+ |
| 4 | fit | float64 | 0 | 104 | |
+---+------------+-----------+--------------+----------------+----------------------------------------------------------------------------------------------------------+
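A summary table like this one can be assembled directly with pandas; below is a sketch of one way to do it (the exact formatting in the notebook may differ):

import pandas as pd

# Build a per-column summary of dtype, unique count, missing values, and mode
summary = pd.DataFrame({
    "Column": df.columns,
    "Data Type": [str(df[c].dtype) for c in df.columns],
    "Unique Count": [df[c].nunique() for c in df.columns],
    "Missing Values": [df[c].isna().sum() for c in df.columns],
    "Most Frequent Value": [df[c].mode().iloc[0] if not df[c].mode().empty else "" for c in df.columns],
})
print(summary.to_string(index=False))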

Looking at the job_title feature, we can see that out of 104 candidate rows in the dataframe there are only 52 unique values, meaning that many candidates share duplicate job_title values. We can also see that the fit feature is completely empty, with 104 missing values.

Looking further into the job_title feature, we can see that many of the values are shared among several candidates:

+----+-------------------------------------------------------------------------------------------------------------+---------+-----------------------------+
| | job_title | Count | Indexes |
|----+-------------------------------------------------------------------------------------------------------------+---------+-----------------------------|
| 0 | 2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional | 7 | [0, 13, 14, 18, 30, 43, 56] |
| 1 | Advisory Board Member at Celal Bayar University | 4 | [4, 22, 34, 47] |
| 2 | Aspiring Human Resources Management student seeking an internship | 2 | [26, 28] |
| 3 | Aspiring Human Resources Professional | 7 | [2, 16, 20, 32, 45, 57, 96] |
| 4 | Aspiring Human Resources Specialist | 5 | [5, 23, 35, 48, 59] |
| 5 | HR Senior Specialist | 5 | [7, 25, 37, 50, 60] |
| 6 | Human Resources Coordinator at InterContinental Buckhead Atlanta | 4 | [12, 42, 55, 64] |
| 7 | Native English Teacher at EPIK (English Program in Korea) | 5 | [1, 15, 19, 31, 44] |
| 8 | People Development Coordinator at Ryan | 6 | [3, 17, 21, 33, 46, 58] |
| 9 | SVP, CHRO, Marketing & Communications, CSR Officer | ENGIE | Houston | The Woodlands | Energy | GPHR | SPHR | 4 | [11, 41, 54, 63] |
| 10 | Seeking Human Resources HRIS and Generalist Positions | 4 | [9, 39, 52, 61] |
| 11 | Seeking Human Resources Opportunities | 2 | [27, 29] |
| 12 | Student at Chapman University | 4 | [10, 40, 53, 62] |
| 13 | Student at Humber College and Aspiring Human Resources Generalist | 7 | [6, 8, 24, 36, 38, 49, 51] |
+----+-------------------------------------------------------------------------------------------------------------+---------+-----------------------------+

Using a Matplotlib word cloud we can get a visual representation of how often words occur across all the job_title entries:
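A word cloud like this can be generated in a few lines; the sketch below assumes the wordcloud package and the candidate DataFrame df loaded above (the styling options are illustrative):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Join all job titles into one text blob and generate the word cloud
text = " ".join(df["job_title"].astype(str))
wordcloud = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()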

To develop an effective ranking system for job candidates, I followed a structured approach:

Data Cleaning & Transformation

  • Converted the connection feature into integers, setting all “500+” values to 600.
  • Standardized the location feature to follow a consistent format. Entries were normalized to either “city, state”, “city”, “state”, or “country”, depending on user input.

Data Augmentation

  • Expanded the dataset to introduce more diversity in job titles, ensuring better model generalization and allowing for better evaluation of models.

Supervisory Signal: Keyword and Starred Candidate History

  • Keyword Selection: The keyword variable represents the search term or job title used to guide the candidate ranking process. It acts as the supervisory signal, helping the model prioritize candidates whose job titles are most relevant to the specified keyword or phrase.
  • Creating Starred Candidate History: The starred_candidates_history dictionary stores the IDs of candidates previously marked as relevant (starred) for different keywords. The starred_ids variable retrieves the list of starred candidate IDs associated with the current keyword, and the num_starred variable counts how many candidates have been starred for this search term, providing insight into the historical supervision data available for ranking.

Feature Engineering

  • Computed cosine similarity between each job title and the predefined keyword across five different model/vectorizer types, creating a fit feature for each model type and candidate.
  • Created a similarity_to_starred feature, measuring job title cosine similarity to the average embedding of previously selected (starred) candidates.
  • Introduced an is_starred feature, labeling starred candidates as 1 and non-starred as 0 to be used as the target variable on various model testing runs.

Model Testing & Evaluation

  • Experimented with various learning-to-rank models, ranging from pointwise, pairwise, and listwise approaches to tree-based models and neural networks.
  • Evaluated models based on ranking metrics to determine the most effective configuration.

The project repository is organized as follows:

potential_talents
├── data
│ ├── potential-talents - Aspiring human resources - seeking human resources.csv
│ ├── potential-talents - Aspiring human resources - seeking human resources - appended.csv
├── notebooks
│ ├── data_exploration.ipynb
│ └── rank_candidates.ipynb
├── src
│ ├── __init__.py
│ ├── utils.py
│ ├── feature_engineering_utils.py
│ └── prediction_evaluation.py
├── requirements.txt
└── README.md

The first feature that needs transformation is connection, which represents the number of connections a given candidate has on social media. Some values were recorded as “500+” for candidates with more than 500 connections. To make the feature entirely numeric, the “500+” entries were set to 600, since the true value for these candidates is unknown beyond being greater than 500. The feature was then normalized to the range 0 to 1.

from sklearn.preprocessing import MinMaxScaler

# transform connection feature from number strings to actual ints
df["connection"] = df["connection"].str.strip().replace("500+", "600").astype(int)

# Normalize connections to range [0,1]
scaler = MinMaxScaler()
df["connection"] = scaler.fit_transform(df[["connection"]])

The location feature also shows inconsistencies, so I developed a clean_location() function to transform the incomplete entries into a “city, state, country” format where the information is available, or into a “state/country” or “country” format when that was all the information provided.

# Transform the location feature to city, state, country where possible

import re
import pandas as pd

def clean_location(location):
    # Handle NaN or empty values
    if pd.isna(location) or not location.strip():
        return "Unknown"

    # Dictionary to standardize country names and full location replacements
    replacements = {
        "Kanada": "Canada",
        "Amerika Birleşik Devletleri": "United States",
        "İzmir, Türkiye": "Izmir, Turkey",
        "USA": "United States",
    }

    # Apply replacements first
    location = replacements.get(location, location)

    # Replace '/' with ',' for consistency
    location = location.replace("/", ", ")

    # Remove 'Area' suffix and handle "Greater [City] Area"
    location = re.sub(r"\s*Area$", "", location)    # e.g., "Houston, Texas Area" → "Houston, Texas"
    location = re.sub(r"^Greater\s+", "", location) # e.g., "Greater New York City" → "New York City"

    # Dictionary for adding missing state or country
    city_to_state = {
        # Cities without state/country
        "New York": "New York, New York, United States",
        "Houston": "Houston, Texas, United States",
        "Denton": "Denton, Texas, United States",
        "Atlanta": "Atlanta, Georgia, United States",
        "Chicago": "Chicago, Illinois, United States",
        "Austin": "Austin, Texas, United States",
        "San Francisco": "San Francisco, California, United States",
        "San Jose": "San Jose, California, United States",
        "Los Angeles": "Los Angeles, California, United States",
        "Lake Forest": "Lake Forest, California, United States",
        "Virginia Beach": "Virginia Beach, Virginia, United States",
        "Baltimore": "Baltimore, Maryland, United States",
        "Gaithersburg": "Gaithersburg, Maryland, United States",
        "Highland": "Highland, California, United States",
        "Milpitas": "Milpitas, California, United States",
        "Torrance": "Torrance, California, United States",
        "Long Beach": "Long Beach, California, United States",
        "Bridgewater": "Bridgewater, Massachusetts, United States",
        "Lafayette": "Lafayette, Indiana, United States",
        "Cape Girardeau": "Cape Girardeau, Missouri, United States",
        "Katy": "Katy, Texas, United States",
        "Izmir": "Izmir, Turkey",

        # Regions or metro areas
        "New York City": "New York, New York, United States",
        "San Francisco Bay": "San Francisco, California, United States",
        "Philadelphia": "Philadelphia, Pennsylvania, United States",
        "Boston": "Boston, Massachusetts, United States",
        "Grand Rapids, Michigan": "Grand Rapids, Michigan, United States",
        "Dallas/Fort Worth": "Dallas, Texas, United States",
        "Raleigh-Durham, North Carolina": "Raleigh, North Carolina, United States",
        "Jackson, Mississippi": "Jackson, Mississippi, United States",
        "Monroe, Louisiana": "Monroe, Louisiana, United States",
        "Baton Rouge, Louisiana": "Baton Rouge, Louisiana, United States",
        "Myrtle Beach, South Carolina": "Myrtle Beach, South Carolina, United States",
        "Chattanooga, Tennessee": "Chattanooga, Tennessee, United States",
        "Kokomo, Indiana": "Kokomo, Indiana, United States",
        "Las Vegas, Nevada": "Las Vegas, Nevada, United States",
    }

    # Split the location into parts
    parts = [part.strip() for part in location.split(",")]

    # Case 1: Single value (could be city, state, or country)
    if len(parts) == 1:
        loc = parts[0]
        if loc in city_to_state:
            return city_to_state[loc]
        elif loc in replacements:
            return replacements[loc]
        else:
            # Assume it's a country or ambiguous place if not found
            return f"{loc}, Unknown" if loc not in ["United States", "Canada", "Turkey"] else loc

    # Case 2: Two parts (e.g., "Houston, Texas" or "Izmir, Turkey")
    elif len(parts) == 2:
        city, second = parts
        if city in city_to_state:
            return city_to_state[city]
        # If second part is a state or country, format accordingly
        if second in ["Texas", "California", "Georgia", "Illinois", "Virginia", "Maryland",
                      "Massachusetts", "Indiana", "Missouri", "Nevada", "New York"]:
            return f"{city}, {second}, United States"
        elif second in ["Turkey", "Canada", "United States"]:
            return f"{city}, {second}"
        else:
            return f"{city}, {second}, United States"  # Default to US if unclear

    # Case 3: Three or more parts (e.g., "New York, New York, United States")
    else:
        if "United States" in parts or "Canada" in parts or "Turkey" in parts:
            return ", ".join(parts[:3])  # Keep first three parts if country is present
        else:
            return f"{', '.join(parts[:2])}, United States"  # Assume US if no country

    return location  # Fallback
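The function can then be applied across the whole column in one line (assuming the DataFrame is named df as in the earlier snippets):

# Standardize every location entry in place
df["location"] = df["location"].apply(clean_location)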

To address the limitations of our dataset, which was relatively small and contained numerous similar job titles, I introduced additional data featuring individuals working in zoo-related fields. This augmentation served as a test to evaluate how well our models could distinguish between HR professionals, computer technology candidates, and those in animal-related professions. Notably, the word “Human” in “Human Resources” creates some semantic overlap with zoo and animal-related job titles, potentially leading to misclassification. The most effective model, therefore, would be one that ranks all the augmented zoo-related data at the top while ensuring that HR professionals are placed afterward, without intermixing them with animal-related job titles for the keyword “zookeeper”. This approach ensures that our ranking system effectively differentiates between unrelated fields and properly organizes candidates based on their professional relevance.

The following candidates, mostly with animal- and zoo-related job titles (plus a few unrelated roles), were appended to the dataset CSV:

id,job_title,location,connection,fit
105,zookeeper,Chicago,150,
106,zookeeper,Denver,200,
107,zoo attendant,Miami,120,
108,animal husbandry,Seattle,300,
109,vet technician,Houston,250,
110,animal trainer,Portland,180,
111,accountant,New York,400,
112,graphic designer,Los Angeles,320,
113,truck driver,Atlanta,280,
114,zookeeper,San Francisco,220,
115,zookeeper,Boston,190,
116,zoo attendant,Phoenix,260,
117,software engineer,Austin,350,
118,zookeeper,Orlando,170,
119,vet technician,Minneapolis,230,
120,wildlife rehabilitator,Denver,210,
121,conservation biologist,Seattle,310,
122,marine mammal trainer,San Diego,290,
123,exotic animal caretaker,Miami,240,
124,wildlife educator,Tampa,200,
125,herpetologist,New Orleans,275,
126,aquarist,Baltimore,225,
127,aviary specialist,Philadelphia,195,
128,park ranger,Yellowstone,280,
129,wildlife researcher,Anchorage,330,
130,primate specialist,Houston,260,
131,senior zookeeper,San Jose,210,
132,canine behaviorist,Dallas,220,
133,animal welfare officer,Sacramento,250,
134,wildlife photographer,Boise,180,
135,botanist,San Antonio,310,
136,entomologist,Las Vegas,270,
137,habitat restoration specialist,Portland,290,
138,game warden,Jacksonville,300,
139,fauna conservationist,Denver,315,
140,large animal caretaker,Indianapolis,240,
141,marine biologist,Honolulu,325,
142,exotic pet specialist,Los Angeles,255,
143,wildlife ecologist,Boston,295,
144,animal behaviorist,Chicago,265,
145,ornithologist,Atlanta,285,
146,naturalist,Minneapolis,235,
147,fishery biologist,Seattle,320,
148,forest ranger,Denver,275,
149,endangered species specialist,San Francisco,310,
150,veterinary assistant,Phoenix,230,
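The appended file (see the data folder in the repository tree above) can then be loaded in place of the original CSV; a minimal sketch:

import pandas as pd

# Load the augmented dataset with the zoo-related and unrelated candidates included
df = pd.read_csv("data/potential-talents - Aspiring human resources - seeking human resources - appended.csv")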

Evaluating Model Performance with Augmented Data

A well-performing model should correctly rank candidates based on their relevance to the keyword in a structured manner. For the testing we will use the word “zookeeper” as the keyword/job title to search the database for candidates. The ideal ranking order would be:

  1. Starred Candidates — These are candidates that have been previously marked as relevant in the starred_candidates_history. Because they have already been identified as fitting selections, they should be ranked at the very top. (IDs: 106, 114, 115)
  2. Unstarred Exact Matches — Candidates who hold the exact job title “zookeeper” but were not previously starred. While they may not have been selected before, their job titles are the most relevant to the search keyword and match the job_title values of the starred candidates, so they should appear near the top of the list. (IDs 105 and 118 are exact matches.)
  3. Related Roles — These candidates work in roles associated with animal care but do not hold the exact title “zookeeper.” They should rank below exact matches but above completely unrelated professions. Their job_title entries vary widely, for instance exotic animal caretaker, wildlife rehabilitator, animal husbandry, and animal behaviorist. (IDs 107, 108, 109, 110, 116, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150)
  4. Unrelated Roles — Candidates whose job titles are entirely unrelated to animal care should be ranked at the bottom, as they are not relevant to the “zookeeper” keyword (IDs 111, 112, 113, 117), along with the HR and technical roles that existed in the data frame before augmentation.

To utilize our newly added data featuring individuals working in zoo-related fields, we can now set the keyword and create a sample starred_candidates_history dictionary with starred candidates for our chosen keyword, “zookeeper.”

# Supervisory keyword for the job title/search
keyword = "zookeeper"

# Define the starred candidates history (can grow over time)
starred_candidates_history = {
    "Aspiring human resources": [3, 21, 29, 27],
    "Full-stack software engineer": [10],
    "zookeeper": [114, 115, 116, 123],
}

# Get the IDs of starred candidates for the current keyword
# (empty list if the keyword is not found)
starred_ids = starred_candidates_history.get(keyword, [])

# Get the number of starred candidates for the keyword
num_starred = len(starred_ids)

Generating Fit Features for Candidate Ranking

To build an effective ranking system, I computed the cosine similarity scores between each candidate’s job title and the keyword supervisory signal using five different embedding models: TF-IDF, Word2Vec, FastText, GloVe, and BERT. Each of these models captures semantic similarity in different ways:

  • TF-IDF (Term Frequency-Inverse Document Frequency) is a simple statistical approach based on word frequency, making it effective for exact keyword matching but lacking deep semantic understanding.
  • Word2Vec and FastText generate word embeddings based on context, with FastText handling subword information better, making it more robust to misspellings and rare words.
  • GloVe (Global Vectors for Word Representation) is trained on word co-occurrence statistics, capturing global context relationships.
  • BERT (Bidirectional Encoder Representations from Transformers), a transformer-based model, understands words in context, making it the most powerful but also the most computationally expensive.

Cosine similarity measures how similar two vectors are by calculating the cosine of the angle between them. A score of 1 means they are identical, 0 means no similarity, and -1 means they are completely opposite.

The process starts by using a trained model to create word embeddings. Language processing models are first trained on huge amounts of text like books, articles, or websites. If two words often appear in similar places, they get similar vectors. A common example is the words “king” and “queen.” They often show up in similar types of sentences and are surrounded by similar words, so their vectors end up being similar.

When using a trained model to get an embedding for a word, what are we actually getting? What is an embedding? Think of word embeddings as turning words into math. For example, the word “developer” can be turned into a list of numbers like: “developer” → [0.23, 0.11, 0.55, -0.34, …, 0.89]. Each word gets its own list of numbers called a vector. The number of values in the vector depends on how many dimensions the model was trained with, or a number you choose ahead of time. You can think of these numbers as coordinates in a multi-dimensional space, where each word has a specific location. Words that are located near each other in this space tend to have similar meanings.

Below is an illustration of how cosine similarity works. It’s based on the angle between the positions (vector coordinates/embeddings) of two different words or phrases. If the cosine similarity is close to 1, it means the two words or phrases are very similar in meaning, and land in a similar area in the multidimensional space.

Below is an illustration of the cosine similarity between the word embeddings for “developer” and “engineer.” Since these words have similar meanings, especially in the context of job titles, we would expect their vector coordinates to land in similar regions of the multi-dimensional space. As a result, the cosine similarity between them would be closer to 1 than to 0 or -1.
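To make this concrete, here is a small sketch using the all-MiniLM-L6-v2 sentence-transformer (the same model used for the BERT-based features later). The exact score depends on the model, but related titles should score noticeably higher than unrelated ones:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load a pretrained sentence-embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Turn the two words into embedding vectors
emb_developer = model.encode("developer")
emb_engineer = model.encode("engineer")

# cosine_similarity expects 2D arrays: one row per vector
score = cosine_similarity([emb_developer], [emb_engineer])[0][0]
print(f"cosine similarity between 'developer' and 'engineer': {score:.3f}")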

For each model, I calculated the cosine similarity score between each candidate’s job_title value and the keyword, and stored the scores as a fit feature in a separate data frame for each model type. To ensure all similarity scores were available for training, I combined the fit features for each model into a single data frame called df_combined.

Here’s a Python-style pseudocode representation of the process for all 5 model types: TF-IDF, Word2Vec, FastText, GloVe and BERT.

from a_model import the_model
from sklearn.metrics.pairwise import cosine_similarity

# Copy DataFrame
df_model = df.copy()

# Define the model
model = the_model()

# Create an empty list to store the job title embeddings
job_title_embeddings = []

# Go through each job title in the DataFrame
# and append each job_title embedding to the list as you go
for job_title in df_model["job_title"]:
    # Get the model's embedding for the current job title
    embedding = model.encode(job_title)
    # Add the embedding to the list
    job_title_embeddings.append(embedding)

# Add the list of embeddings as a new "job_title_embedding" column in the DataFrame
df_model["job_title_embedding"] = job_title_embeddings

# Get the keyword embedding ("zookeeper" in our case)
keyword_embedding = model.encode(keyword)

# Create an empty list to store the fit scores
fit_scores = []
# Go through each job title embedding in the DataFrame
for job_embedding in df_model["job_title_embedding"]:
    # Put the job embedding into a list to make it a 2D array
    job_embedding_2d = [job_embedding]
    # Put the keyword embedding into a list to make it a 2D array
    keyword_embedding_2d = [keyword_embedding]
    # Calculate the similarity between the job embedding and the keyword embedding
    similarity_array = cosine_similarity(job_embedding_2d, keyword_embedding_2d)
    # Get the single similarity score from the array
    fit_score = similarity_array[0][0]
    # Add the score to the list
    fit_scores.append(fit_score)

# Add the list of fit scores as a new "fit" column in the DataFrame
df_model["fit"] = fit_scores

The final df_combined data frame combines all of the newly created fit features with the original features: id, job_title, location, and connection. The fit scores from the different models are renamed fit_TFIDF, fit_W2V, fit_GloVe, fit_FT, and fit_BERT.
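A minimal sketch of how the fit columns can be gathered into df_combined, assuming one per-model data frame built with the pseudocode above (the names df_tfidf, df_w2v, df_glove, df_ft, and df_bert are illustrative):

# Start from the original candidate columns
df_combined = df[["id", "job_title", "location", "connection"]].copy()

# Attach each model's fit scores under a model-specific column name
df_combined["fit_TFIDF"] = df_tfidf["fit"].values
df_combined["fit_W2V"] = df_w2v["fit"].values
df_combined["fit_GloVe"] = df_glove["fit"].values
df_combined["fit_FT"] = df_ft["fit"].values
df_combined["fit_BERT"] = df_bert["fit"].values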

Visualizing the results held in df_combined

To better understand the relationship between the fit features, I created several visualizations of the fit scores for the different models: pair plots, a correlation heatmap, boxplots, and histograms.

import seaborn as sns
import matplotlib.pyplot as plt

# Set style
sns.set_style("whitegrid")

# 1. Pairplot to visualize pairwise relationships
sns.pairplot(df_combined[["fit_TFIDF", "fit_W2V", "fit_GloVe", "fit_FT", "fit_BERT"]], corner=True)
plt.suptitle("Pairwise Comparison of Fit Scores", y=1.02)
plt.show()

# 2. Correlation heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(
df_combined[["fit_TFIDF", "fit_W2V", "fit_GloVe", "fit_FT", "fit_BERT"]].corr(),
annot=True, cmap="coolwarm", fmt=".2f"
)
plt.title("Correlation Heatmap of Fit Scores")
plt.show()

# 3. Boxplot to compare distributions
plt.figure(figsize=(10, 5))
sns.boxplot(data=df_combined[["fit_TFIDF", "fit_W2V", "fit_GloVe", "fit_FT", "fit_BERT"]])
plt.title("Boxplot of Fit Scores by Model")
plt.xticks(rotation=45)
plt.show()

# 4. Histograms to see score distributions
df_combined[["fit_TFIDF", "fit_W2V", "fit_GloVe", "fit_FT", "fit_BERT"]].hist(bins=20, figsize=(12, 8))
plt.suptitle("Histogram of Fit Scores Across Models")
plt.show()

The Pair Plots:

Some observations from the pair plots are that fit_TFIDF consistently missed semantic context, as expected. The fit_W2V (Word2Vec) and fit_FT (FastText) features were perhaps the most similar, but seemed to capture different semantic meanings for different job titles, so both could be useful features to retain in prospective models. The GloVe feature, fit_GloVe, seemed to extract a little more semantic information from the job titles than Word2Vec and FastText. The fit_BERT feature from the BERT model extracted the most semantic meaning from job titles, as expected.

The correlation heatmap shows that fit_TFIDF and fit_W2V are highly correlated, as are fit_TFIDF and fit_FT, and fit_GloVe and fit_FT. It may be possible to remove some of these features without affecting model results.

The boxplots:

The boxplots show the interquartile range (the middle 50% of values) for each fit feature. We can see that fit_TFIDF has a very small range of values. fit_W2V, fit_GloVe, and fit_FT all have somewhat similar ranges and similar average values. The fit_BERT feature has a larger range and generally higher values, suggesting it extracted some semantic meaning from the job titles that was lost in the other models.

The Histograms:

The histograms give some information about the distribution of fit values for each model. We can see that fit_TFIDF has most values at 0 and very few at 1. fit_BERT showed the most diverse distribution, with more values in the middle range of 0.4 to 0.6, indicating that BERT was able to extract more semantic meaning from the different job_title values, while the rest of the models fall somewhere in between.

Creating a similarity_to_starred feature

To create the similarity_to_starred feature, which measures how closely a candidate’s job title matches the average job title of starred candidates, I wrote a function called compute_similarity_to_starred(). It first retrieves the job_title embeddings of all starred candidates using a BERT model, chosen because it seemed to extract the most semantic value from the job_title values. If there are starred candidates, their embeddings are averaged to create an ideal embedding. If no starred candidates exist, the function falls back to the embedding of the keyword. Then, for each candidate in the dataset, the function computes the cosine similarity between their job_title embedding and the ideal embedding, assigning the resulting similarity score to the similarity_to_starred column. This score helps rank candidates based on how well their job titles align with the ideal starred candidate profiles, i.e., previous hires for the keyword.

The compute_similarity_to_starred() function:

import numpy as np
from itertools import chain
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def compute_similarity_to_starred(df, starred_ids, keyword, column_name="similarity_to_starred", inplace=False):
    """
    Compute similarity of each candidate's job title to the ideal embedding of starred candidates.

    Parameters:
    - df (pd.DataFrame): DataFrame containing 'id' and 'job_title' columns.
    - starred_ids (list): List of candidate IDs that are starred.
    - keyword (str): Keyword to use as fallback if no starred candidates exist.
    - column_name (str): Name of the column to store similarity scores (default: "similarity_to_starred").
    - inplace (bool): If True, modify the input DataFrame; if False, return a new DataFrame (default: False).

    Returns:
    - pd.DataFrame: DataFrame with the new similarity column (if inplace=False).
    """
    # Flatten starred_ids to ensure scalars
    starred_ids = list(chain.from_iterable([x] if not isinstance(x, (list, tuple)) else x for x in starred_ids))

    # Load BERT model once
    BERT_model = SentenceTransformer('all-MiniLM-L6-v2')

    # Step 1: Fetch embeddings of starred candidates
    starred_embeddings = []
    for candidate_id in starred_ids:
        # Find job_title by matching ID (using pandas filtering for efficiency)
        match = df[df["id"] == candidate_id]
        if not match.empty:
            job_title = match["job_title"].iloc[0]
            embedding = BERT_model.encode(job_title)
            starred_embeddings.append(embedding)

    # Step 2: Compute the "ideal" embedding
    if starred_embeddings:
        # Average the embeddings efficiently using numpy
        ideal_embedding = np.mean(starred_embeddings, axis=0)
    else:
        # Fallback to keyword embedding
        ideal_embedding = BERT_model.encode(keyword)

    # Step 3: Compute similarity scores for all candidates
    new_fit_scores = []
    for job_title in df["job_title"]:
        job_embedding = BERT_model.encode(job_title)
        # Compute cosine similarity (2D arrays for sklearn)
        similarity = cosine_similarity([job_embedding], [ideal_embedding])[0][0]
        new_fit_scores.append(similarity)

    # Step 4: Handle output based on inplace parameter
    if inplace:
        df[column_name] = new_fit_scores
        return None  # No return value when modifying in place
    else:
        # Return a copy of the DataFrame with the new column
        df_new = df.copy()
        df_new[column_name] = new_fit_scores
        return df_new

Creating is_starred feature

In this step, we create a new feature called is_starred, which serves as the target variable for training some of the ranking models. This column is a binary indicator that assigns a value of 1 to candidates who are starred (i.e., relevant for the keyword) and 0 to all others. Unlike the similarity-based features, this target is purely independent: it does not rely on embedding comparisons or ranking scores but simply reflects whether a candidate was manually marked as relevant. This allows models to learn how to rank candidates based on their features while using human-labeled relevance as ground truth.

# Compute is_starred feature:
# 1 if the candidate is in starred_candidates_history for the keyword
# 0 if the candidate is not

# Define Target (Purely Independent)
df_combined["is_starred"] = np.where(
    df_combined["id"].isin(starred_ids), 1, 0  # Binary relevance without item similarity
)

To identify the best approach for ranking, I experimented with various methods, ranging from manual calculations to machine learning models.

First, I implemented a heuristic method I called Feedback-Enhanced Semantic Re-ranking (FESR), which combined fit_BERT scores with the similarity_to_starred feature using a weighted geometric mean. I then introduced a small weight adjustment based on the number of connections to account for network effects. This approach aimed to model ranking without using a trained algorithm.

I also tested various model configurations that rank the candidate list according to the supervisory signal of the keyword and the starred candidate history, and labeled these approaches collectively Starred-Guided Fit Prediction (SGFP). The models fell into three main categories, with some hybrid approaches that combined methods to create custom target variables for reranking.

The three categories of model types:

  • Pointwise: Treating ranking as a regression problem by predicting relevance scores for individual items. I used Random Forest Regressor and other regression techniques.
  • Pairwise: Comparing pairs of items and learning their relative order. I trained models like RankNet and LambdaMART.
  • Listwise: Directly optimizing the ranking of an entire list, incorporating objectives like NDCG and MAP. I tested LambdaRank and neural networks for this.

For additional benchmarking, I tested sorting the candidate data frame by the fit scores from TF-IDF, Word2Vec, GloVe, FastText, BERT, and the similarity_to_starred feature to evaluate how well each method ranked the candidate list.
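The simplest of these baselines is just a descending sort on a single fit column, for example:

# Baseline ranking: sort candidates by one model's fit score, highest first
baseline_bert = df_combined.sort_values(by="fit_BERT", ascending=False).reset_index(drop=True)
print(baseline_bert[["id", "job_title", "fit_BERT"]].head(10))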

Implement: Feedback-Enhanced Semantic Re-ranking (FESR)

Using a custom function, I created the fit_BERT_FESR feature by computing a weighted geometric mean of the fit_BERT and similarity_to_starred scores, assigning 40% weight to fit_BERT and 60% to similarity_to_starred.

import pandas as pd
import numpy as np

def geometric_mean(df, col1, col2, w1, w2,
                   output_col="fit", inplace=False, check_weights=True):
    """
    Compute the weighted geometric mean of two columns and
    add it as a new feature.

    Parameters:
    - df (pd.DataFrame): DataFrame containing the input columns.
    - col1 (str): Name of the first column (e.g., "fit_BERT").
    - col2 (str): Name of the second column (e.g., "similarity_to_starred").
    - w1 (float): Weight for the first column (e.g., 0.4).
    - w2 (float): Weight for the second column (e.g., 0.6).
    - output_col (str): Name of the output column (default: "fit").
    - inplace (bool): If True, modify the input DataFrame; if False,
      return a new DataFrame (default: False).
    - check_weights (bool): If True, warn if weights don't sum to 1
      (default: True).

    Returns:
    - pd.DataFrame or None: New DataFrame with the combined feature
      (if inplace=False), None otherwise.
    """
    # Input validation
    if not isinstance(df, pd.DataFrame):
        raise TypeError("df must be a pandas DataFrame")
    if col1 not in df.columns or col2 not in df.columns:
        raise KeyError(f"Columns '{col1}' and/or '{col2}' not found in DataFrame")
    if not isinstance(w1, (int, float)) or not isinstance(w2, (int, float)):
        raise TypeError("Weights w1 and w2 must be numeric")

    # Optional weight check
    if check_weights and abs(w1 + w2 - 1.0) > 1e-6:
        print(f"Warning: Weights w1 ({w1}) + w2 ({w2}) = {w1 + w2} do not sum to 1")

    # Compute weighted geometric mean: (col1^w1) * (col2^w2)
    fit_combined_scores = (df[col1] ** w1) * (df[col2] ** w2)

    # Handle output
    if inplace:
        df[output_col] = fit_combined_scores
        return None
    else:
        df_new = df.copy()
        df_new[output_col] = fit_combined_scores
        return df_new

Then, I adjusted the score by incorporating a 10% weight from the connection count, ensuring that highly connected profiles received a slight boost. This approach aimed to refine rankings without relying on a trained model.

# Compute weighted geometric mean of "fit_BERT" and "similarity_to_starred"
# columns

geometric_mean(
df_combined,
col1="fit_BERT",
col2="similarity_to_starred",
w1=0.4,
w2=0.6,
output_col="fit_BERT_FESR",
inplace=True
)

alpha = .9
beta = .1
# Compute final fitness score using weights for amount of connections
df_combined["fit_BERT_FESR"] = (alpha * df_combined["fit_BERT_FESR"] +
beta * df_combined["connection"])

Implement: Starred-Guided Fit Prediction (SGFP)

SGFP covers all the predictions made from different model types using the fit features created earlier, as well as different variations of target features, to find which configurations show the most promise for ranking the candidate list.

RandomForestRegressor: Regression approach

Using the fit feature values from TF-IDF, Word2Vec, GloVe, FastText, and BERT, along with the connection and similarity_to_starred features, I trained a regression model using RandomForestRegressor(), with the fit_BERT values as the target. This approach combines multiple semantic representations into a practical, predictive ranking system.

# Step 1: Prepare features
features = [
"fit_TFIDF",
"fit_W2V",
"fit_GloVe",
"fit_FT",
"connection",
"similarity_to_starred"
]

# Step 2: Split data into training and target variables
X = df_combined[features]
y = df_combined["fit_BERT"]

# Step 3: Train a RandomForest model (pure regression)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)
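To turn this model into a ranking, its predictions would then be written back as a fit column; a minimal sketch (the column name here is illustrative, not necessarily the one used in the notebook):

# Predict continuous relevance scores and store them for ranking
df_combined["rfr_regression_pred"] = model.predict(X)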

RandomForestRegressor: Pointwise with Continuous Prediction

Using BERT, TF-IDF, Word2Vec, GloVe, and FastText fit features, combined with connection and similarity_to_starred features, I trained a RandomForestRegressor() with is_starred as the target. This pure pointwise approach generates continuous relevance scores, leveraging multiple semantic representations to rank candidates.

# Step 1: Prepare features
features = [
"fit_TFIDF",
"fit_W2V",
"fit_GloVe",
"fit_FT",
"fit_BERT",
"connection",
"similarity_to_starred"
]

# Step 2: Split data into training and target variables
X = df_combined[features]
y = df_combined["is_starred"] # 1 for starred 0 for not starred

# Step 3: Train a RandomForest model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# Step 4: Predict continuous relevance scores
df_combined["rfr_pointwise_continuous_pred"] = model.predict(X)

RandomForestClassifier: Pointwise with Categorical Probability Prediction

Using BERT, TF-IDF, Word2Vec, GloVe, and FastText fit features, alongside connection and similarity_to_starred features, I trained a RandomForestClassifier() with is_starred as the target. This pointwise approach predicts categorical probabilities of relevance (the probability that a candidate belongs to class 1, a starred/ideal candidate), harnessing multiple semantic representations to rank candidates.

# Step 1: Prepare features
features = [
"fit_TFIDF",
"fit_W2V",
"fit_GloVe",
"fit_FT",
"fit_BERT",
"connection",
"similarity_to_starred"
]

# Step 2: Split data into training and target variables
X = df_combined[features]
y = df_combined["is_starred"] # 1 for starred 0 for not starred

# Step 3: Train a RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Step 4: Predict probabilities (probability of being relevant, class 1)
df_combined["rfr_pointwise_categorical_prob_pred"] = model.predict_proba(X)[:, 1]

RandomForestRegressor: Pointwise Continuous Prediction (Soft Target Regression) Type 1

In this approach, I used TF-IDF, Word2Vec, GloVe, and FastText fit features along with connection counts to train a RandomForestRegressor() model. The target for starred candidates was set to 1.0, while non-starred candidates were assigned a blended score, calculated as the average of fit_BERT and similarity_to_starred. This method combines pointwise regression with continuous prediction, where each candidate is independently assigned a relevance score based on the provided features.


# Step 1: Prepare features
features = ["fit_TFIDF", "fit_W2V", "fit_GloVe", "fit_FT", "connection"]

# Step 2: prepare custom target
# Target: Starred = 1.0, Non-starred = (fit_BERT + similarity_to_starred) / 2
df_combined["rfr_pointwise_relevance_target"] = np.where(
df_combined["id"].isin(starred_ids),
1.0,
(df_combined["fit_BERT"] + df_combined["similarity_to_starred"]) / 2
)

# Step 3: Split data into training and target variables
X = df_combined[features]
y = df_combined["rfr_pointwise_relevance_target"]

# Step 4: Train a RandomForest model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# Step 5: Predict continuous relevance scores
df_combined["rfr_pointwise_continuous_pred_soft_target_regression_1"] = model.predict(X)

RandomForestRegressor: Pointwise Continuous Prediction (Soft Target Regression) Type 2

In this approach, I used the fit_BERT values as the base relevance target for all candidates. For the starred candidates, I overrode this target with a value of 1.0, indicating their higher relevance. This ensures that starred candidates are given top priority in the training process. For non-starred candidates, the fit_BERT value remains as is, reflecting their relative relevance.

I then prepared the features for training, which included fit features from TF-IDF, Word2Vec, GloVe, and FastText, along with connection counts and similarity to starred candidates. These features were used to predict the modified relevance target. Using this dataset, I trained a RandomForestRegressor() model on the full data, without splitting for simplicity, to predict continuous relevance scores.

This pointwise regression approach effectively combines feature representations from multiple models with a continuous prediction target, blending semantic and relational features to rank candidates according to their relevance, with starred candidates prioritized as the most relevant.

# Step 1: Prepare features
features = [
"fit_TFIDF",
"fit_W2V",
"fit_GloVe",
"fit_FT",
"connection",
"similarity_to_starred"
]

# Step 2: Prepare target (fit_BERT as base, starred candidates overridden to 1.0)
df_combined["rfr_pointwise_relevance_target"] = df_combined["fit_BERT"]
for candidate_id in starred_ids:
    for row_number in range(len(df_combined)):
        if df_combined["id"][row_number] == candidate_id:
            df_combined.at[row_number, "rfr_pointwise_relevance_target"] = 1.0  # starred to 1
            break

# Step 3: Split data into training and target variables
X = df_combined[features]
y = df_combined["rfr_pointwise_relevance_target"]

# Step 4: Train a RandomForest model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y) # Train on full dataset (no split for simplicity)

# Step 5: Predict fit scores using RandomForest
predictions = model.predict(X) # Predict continuous scores
df_combined["rfr_pointwise_continuous_pred_soft_target_regression_2"] = predictions

LightGBM LambdaRank: Listwise Ranking with NDCG Optimization

Using the BERT, TF-IDF, Word2Vec, GloVe, FastText, connection, and similarity_to_starred features, I trained a LightGBM LambdaRank model. In this approach, the relevance target was set to is_starred, and all candidates were treated as part of a single query group. This listwise approach focuses on optimizing NDCG (Normalized Discounted Cumulative Gain) by considering the rank of all candidates within the group. The model was trained with parameters tuned for listwise ranking, including optimization of NDCG at the level of starred candidates. The trained model predicts continuous ranking scores, effectively ordering candidates based on their relevance to starred profiles.

import lightgbm as lgb

# Step 1: Prepare features
features = [
"fit_TFIDF",
"fit_W2V",
"fit_GloVe",
"fit_FT",
"fit_BERT",
"connection",
"similarity_to_starred"
]

# Step 2: Split data into training and target variables
X = df_combined[features]
y = df_combined["is_starred"]

# Step 3: Define group info (all candidates as one query)
group = [len(df_combined)]  # Single group containing all candidates

# Step 4: Convert to LightGBM Dataset with group info
dtrain = lgb.Dataset(X, label=y, group=group)

# Step 5: Set LambdaRank parameters (listwise approach)
params = {
    "objective": "lambdarank",      # Listwise ranking objective
    "metric": "ndcg",               # Optimize NDCG
    "ndcg_at": [num_starred],       # Focus on NDCG@<num_starred>
    "learning_rate": 0.1,           # Learning rate
    "max_depth": 6,                 # Tree depth
    "num_leaves": 31,               # Number of leaves in trees
    "min_data_in_leaf": 20,         # Minimum data per leaf
    "feature_fraction": 0.8,        # Feature sampling
    "bagging_fraction": 0.8,        # Data sampling
    "bagging_freq": 5,              # Frequency of bagging
    "random_state": 42,
    "verbose": -1,                  # Suppress training output
}

# Step 6: Train the LambdaRank model
model = lgb.train(params, dtrain, num_boost_round=100)

# Step 7: Predict ranking scores
df_combined["fit_lambdarank_listwise_pred"] = model.predict(X)

XGB-LambdaMART “rank:pairwise”, “rank:ndcg”, “rank:map”

Using features from BERT, TF-IDF, Word2Vec, GloVe, FastText, connection counts, and similarity_to_starred, I trained three XGB-LambdaMART models with “is_starred” as the relevance target. All candidates were grouped as one query. Three different ranking objectives were applied:

  1. rank:pairwise — Optimizes pairwise order between candidates.
  2. rank:ndcg — Maximizes NDCG (Normalized Discounted Cumulative Gain).
  3. rank:map — Enhances Mean Average Precision (MAP).

Each objective predicts continuous ranking scores, leveraging specific ranking strategies to order candidates based on their relevance to starred profiles.

import xgboost as xgb

# Step 1: Prepare features
features = [
"fit_TFIDF",
"fit_W2V",
"fit_GloVe",
"fit_FT",
"fit_BERT",
"connection",
"similarity_to_starred"
]

# Step 2: Split data into training and target variables
X = df_combined[features]
y = df_combined["is_starred"]

# Step 3: Define group info (all candidates as one query)
group = [len(df_combined)] # Single group of all candidates

# Step 4: Convert to DMatrix with group info
dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group(group)

# Step 5: Define multiple ranking objectives
ranking_objectives = ["rank:pairwise", "rank:ndcg", "rank:map"]

# Step 6: Train and evaluate models for each ranking objective
for objective in ranking_objectives:
    print(f"\n🚀 Training LambdaMART with Objective: {objective}")

    params = {
        "objective": objective,       # Set ranking objective
        "eval_metric": "ndcg@5",      # Optimize NDCG@5
        "eta": 0.1,                   # Learning rate
        "max_depth": 6,               # Tree depth
        "subsample": 0.8,             # Subsample ratio
        "colsample_bytree": 0.8,      # Feature sampling
        "random_state": 42,
    }

    # Train the model
    model = xgb.train(params, dtrain, num_boost_round=100)

    # Predict ranking scores
    df_combined[f"fit_lambdamart_{objective}_pred"] = model.predict(dtrain)

XGB-LambdaMART: “rank:pairwise”, “rank:ndcg” objectives with hybrid target

Using TF-IDF, Word2Vec, GloVe, FastText, connection counts, and similarity_to_starred features, I trained XGB-LambdaMART models with a hybrid target. The target was derived by quantizing fit_BERT into five ranking levels (0–4), treating all candidates as one query. Two ranking objectives were used:

  1. rank:pairwise for optimizing pairwise comparisons between candidates.
  2. rank:ndcg to maximize NDCG (Normalized Discounted Cumulative Gain).

This hybrid approach predicts continuous scores, combining the relevance derived from BERT with traditional ranking strategies to effectively order candidates based on their relevance to starred profiles.

# Step 1: Prepare features
features = [
"fit_TFIDF",
"fit_W2V",
"fit_GloVe",
"fit_FT",
"fit_BERT",
"connection",
"similarity_to_starred"
]

# Step 2: Create target: convert fit_BERT into 5 ranking levels (0–4)
fit_target_values = pd.qcut(df_combined["fit_BERT"], q=5,
labels=[0, 1, 2, 3, 4])
df_combined["fit_target"] = fit_target_values
df_combined["fit_target"] = df_combined["fit_target"].astype(int)

# Step 3: Split data into training and target variables
X = df_combined[features]
y = df_combined["fit_target"]

# Step 4: Define group info (all candidates as one query)
group = [len(df_combined)] # Single group of all candidates

# Step 5: Convert to DMatrix with group info
dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group(group)

# Step 6: Define multiple ranking objectives
ranking_objectives = ["rank:pairwise", "rank:ndcg"]

# Step 7: Train and evaluate models for each ranking objective
for objective in ranking_objectives:
    print(f"\n🚀 Training LambdaMART with Objective: {objective}")

    params = {
        "objective": objective,       # Set ranking objective
        "eval_metric": "ndcg@5",      # Optimize NDCG@5 (can be changed)
        "eta": 0.1,                   # Learning rate
        "max_depth": 6,               # Tree depth
        "subsample": 0.8,             # Subsample ratio
        "colsample_bytree": 0.8,      # Feature sampling
        "random_state": 42,
    }

    # Train the model
    model = xgb.train(params, dtrain, num_boost_round=100)

    # Predict ranking scores
    df_combined[f"fit_lambdamart_{objective}_hybridTarget_pred"] = model.predict(dtrain)

RankNet: Pairwise Ranking with a Neural Network

Using scaled BERT, TF-IDF, Word2Vec, GloVe, FastText, connection, and similarity_to_starred features, I trained a RankNet neural network with is_starred as the binary target. This pairwise ranking approach leverages a three-layer feedforward network to predict continuous relevance scores, optimizing a custom loss function. The model ranks candidates by ensuring that starred candidates consistently outrank others in the predicted scores.

import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split

# Step 1: Prepare features
features = [
"fit_TFIDF",
"fit_W2V",
"fit_GloVe",
"fit_FT",
"fit_BERT",
"connection",
"similarity_to_starred"
]

# Step 2: Create target
y = df_combined["is_starred"].values # 1 for starred, 0 for others

# Step 3: Split data training and testing
X = df_combined[features].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32)

# Step 5: Convert full dataset to tensor for final predictions
X_full_tensor = torch.tensor(X, dtype=torch.float32)

# Define RankNet model
class RankNet(nn.Module):
    def __init__(self, input_dim):
        super(RankNet, self).__init__()
        self.fc1 = nn.Linear(input_dim, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.fc3(x)

# Define pairwise ranking loss
def pairwise_ranking_loss(outputs, targets, margin=1.0):
    starred = outputs[targets == 1]
    non_starred = outputs[targets == 0]
    if len(starred) == 0 or len(non_starred) == 0:
        return torch.tensor(0.0, requires_grad=True)
    diff = starred.view(-1, 1) - non_starred.view(1, -1)
    loss = torch.relu(margin - diff).mean()
    return loss

# Step 6: Initialize model and optimizer
model = RankNet(input_dim=X_train.shape[1])
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Define train_model
num_epochs = 100
def train_model():
    for epoch in range(num_epochs):
        model.train()
        optimizer.zero_grad()
        outputs = model(X_train_tensor).squeeze()
        loss = pairwise_ranking_loss(outputs, y_train_tensor)
        loss.backward()
        optimizer.step()
        if epoch % 10 == 0:
            print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

# Step 7: Train model
train_model()

# Define evaluate_model
def evaluate_model():
    model.eval()
    with torch.no_grad():
        y_pred = model(X_test_tensor).squeeze().numpy()
    return y_pred

# Step 8: Get predictions on the test set
y_pred = evaluate_model()

# Define predict_full_dataset to generate predictions for the full dataset
def predict_full_dataset():
    model.eval()
    with torch.no_grad():
        y_pred_full = model(X_full_tensor).squeeze().numpy()
    return y_pred_full

# Step 9: Add predictions to df_combined
y_pred_full = predict_full_dataset()
df_combined["RankNET_pairwise_neural"] = y_pred_full

LambdaRank: Listwise Ranking with a Neural Network

Using BERT, TF-IDF, Word2Vec, GloVe, FastText, connection, and similarity_to_starred features, I trained a LambdaRank neural network with is_starred as the binary target. This listwise approach optimizes a differentiable approximation of NDCG through a three-layer network. The model predicts continuous scores, ranking candidates in a single list while directly optimizing for ranking quality by considering the full list in a differentiable manner.

# Step 1: Prepare features
features = [
"fit_TFIDF",
"fit_W2V",
"fit_GloVe",
"fit_FT",
"fit_BERT",
"connection",
"similarity_to_starred"
]

# Step 2: prepare target
y = df_combined["is_starred"].values # 1 for starred, 0 for others

# Step 3: Split data training and target
X = df_combined[features].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32)
X_full_tensor = torch.tensor(X, dtype=torch.float32)

# Define LambdaRank model (same architecture as RankNet)
class LambdaRank(nn.Module):
    def __init__(self, input_dim):
        super(LambdaRank, self).__init__()
        self.fc1 = nn.Linear(input_dim, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.fc3(x)

# Define listwise ranking loss function
def listwise_ranking_loss(outputs, targets, k=None):
    """
    Compute a differentiable listwise loss approximating NDCG.
    - outputs: Predicted scores [n_samples], requires gradients
    - targets: True relevance labels [n_samples], 0 or 1
    - k: Optional cutoff for NDCG@k (e.g., 8); if None, uses full list
    Returns: -DCG (negative to minimize)
    """
    if outputs is None or targets is None:
        raise ValueError("Outputs or targets is None")
    if outputs.size(0) != targets.size(0):
        raise ValueError(f"Size mismatch: outputs {outputs.size(0)}, targets {targets.size(0)}")
    if outputs.size(0) == 0:
        return torch.tensor(0.0, requires_grad=True)

    outputs = outputs.to(targets.device)
    n = outputs.size(0)
    if k is None:
        k = n
    k = min(k, n)

    # Compute discounts
    ranks = torch.arange(1, k + 1, dtype=torch.float32, device=outputs.device)
    discounts = 1.0 / torch.log2(ranks + 1)

    # Soft ranking: Use outputs directly with a sigmoid to weight relevance
    # Higher outputs contribute more to DCG
    weights = torch.sigmoid(outputs)  # Shape: [n_samples], differentiable
    sorted_weights, indices = torch.sort(weights, descending=True)
    sorted_targets = targets[indices][:k]  # Still use the sort order to truncate

    # Compute DCG with predicted weights
    dcg = (sorted_targets * sorted_weights[:k] * discounts).sum()

    return -dcg  # Minimize -DCG to maximize ranking quality

# Step 5: Initialize model and optimizer
model = LambdaRank(input_dim=X_train.shape[1])
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Define train_model
num_epochs = 100
def train_model():
    for epoch in range(num_epochs):
        model.train()
        optimizer.zero_grad()
        outputs = model(X_train_tensor).squeeze()  # Predicted scores
        loss = listwise_ranking_loss(outputs, y_train_tensor)
        loss.backward()
        optimizer.step()
        if epoch % 10 == 0:
            print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

# Step 6: Train model with listwise loss
train_model()

# Define evaluate_model function
def evaluate_model():
    model.eval()
    with torch.no_grad():
        y_pred_test = model(X_test_tensor).squeeze().numpy()
    return y_pred_test

# Step 7: Evaluate model on test set
y_pred_test = evaluate_model()

# Define predict_full_dataset
def predict_full_dataset():
    model.eval()
    with torch.no_grad():
        y_pred_full = model(X_full_tensor).squeeze().numpy()
    return y_pred_full

# Step 8: Generate predictions for the full dataset
y_pred_full = predict_full_dataset()
df_combined["lambdaRank_listwise_neural"] = y_pred_full

To measure how effectively each model ranks candidates, I used several ranking metrics that assess both the placement and prioritization of relevant candidates.

Precision@k (P@k) evaluates how many starred (relevant) candidates appear in the top k positions, helping determine whether the model surfaces the best candidates early. A high P@k means the most relevant candidates are consistently ranked near the top.

Mean Rank of Starred Candidates calculates the average position of starred candidates in the ranked list. A lower mean rank indicates that relevant candidates are placed higher, making it easier for users to find them.

Mean Reciprocal Rank (MRR) focuses on the highest-ranked starred candidate by taking the reciprocal of its position. This metric is useful for assessing whether at least one relevant candidate is placed at the very top.

Normalized Discounted Cumulative Gain (NDCG@k) considers both the rank and relevance of candidates, rewarding models that rank starred candidates higher while penalizing those that push them further down. This metric balances overall ranking quality and precision.
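To make the arithmetic concrete, here is a toy calculation of these metrics (illustrative numbers, not results from the project):

import numpy as np

# Toy example: 3 starred candidates end up at ranks 1, 2, and 5 of the ranked list
starred_ranks = [1, 2, 5]
k = 3

precision_at_k = sum(r <= k for r in starred_ranks) / k   # 2 of the top 3 are starred -> 0.67
mean_rank = np.mean(starred_ranks)                        # (1 + 2 + 5) / 3 = 2.67
mrr = 1 / min(starred_ranks)                              # best starred rank is 1 -> 1.0

# NDCG@3: relevance 1 at ranks 1 and 2; the ideal list has starred candidates at ranks 1-3
dcg = sum(1 / np.log2(r + 1) for r in starred_ranks if r <= k)
idcg = sum(1 / np.log2(r + 1) for r in range(1, k + 1))
ndcg_at_k = dcg / idcg                                    # ≈ 0.77

print(precision_at_k, mean_rank, mrr, round(ndcg_at_k, 2))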

By analyzing these metrics, I can compare different models and determine which approach most effectively surfaces the right candidates while maintaining an optimal ranking order.

The function to analyze a given model's ranking results is:

def evaluate_reranking(df, starred_ids, rank_col="fit_BERT_reRanked", k=None):
    """
    Evaluate reranking quality using starred candidates and return ranked unstarred list.

    Parameters:
    - df (pd.DataFrame): DataFrame with 'id' and ranking column.
    - starred_ids (list): List of starred candidate IDs (previously hired).
    - rank_col (str): Column to sort by for ranking (default: "fit_BERT_reRanked").
    - k (int or None): Top k positions to evaluate; if None, uses number of starred candidates (default: None).

    Returns:
    - dict: Evaluation metrics (Precision@k, Mean Rank, MRR, NDCG@k).
    - pd.DataFrame: Ranked DataFrame of unstarred candidates.
    """
    # Flatten starred_ids if nested
    starred_ids = list(chain.from_iterable([x] if not isinstance(x, (list, tuple)) else x for x in starred_ids))

    # Convert starred_ids to match df["id"] dtype
    id_dtype = df["id"].dtype
    if np.issubdtype(id_dtype, np.integer):
        starred_ids = [int(x) for x in starred_ids]
    elif np.issubdtype(id_dtype, np.object_) or str(id_dtype) == "string":
        starred_ids = [str(x) for x in starred_ids]

    # Sort by rank_col in descending order
    df_sorted = df.sort_values(by=rank_col, ascending=False).reset_index(drop=True)
    df_sorted["rank"] = np.arange(1, len(df_sorted) + 1)

    # Identify starred candidates
    starred_mask = df_sorted["id"].isin(starred_ids)
    num_starred = len(set(starred_ids) & set(df_sorted["id"]))  # Actual matches in df

    # Default k to the number of starred candidates present in df
    if k is None:
        k = num_starred

    # Metrics
    top_k_starred = df_sorted["id"].head(k).isin(starred_ids).sum()
    precision_at_k = top_k_starred / k if k > 0 else 0

    starred_ranks = df_sorted[starred_mask]["rank"]
    mean_rank_starred = starred_ranks.mean() if not starred_ranks.empty else float("inf")

    first_starred_rank = starred_ranks.min() if not starred_ranks.empty else float("inf")
    mrr = 1 / first_starred_rank if first_starred_rank != float("inf") else 0

    relevance = np.where(starred_mask, 1, 0)
    dcg = sum(rel / np.log2(idx + 2) for idx, rel in enumerate(relevance[:k]))
    ideal_relevance = [1] * min(num_starred, k) + [0] * max(0, k - num_starred)
    idcg = sum(rel / np.log2(idx + 2) for idx, rel in enumerate(ideal_relevance))
    ndcg_at_k = dcg / idcg if idcg > 0 else 0

    metrics = {
        f"precision_at_{k}": precision_at_k,
        "mean_rank_starred": mean_rank_starred,
        "mrr": mrr,
        f"ndcg_at_{k}": ndcg_at_k
    }

    # Filter out starred candidates
    df_unstarred = df_sorted[~starred_mask].drop(columns=["rank"]).reset_index(drop=True)

    return metrics, df_unstarred

Top Performing Models for “Zookeeper”

To assess how well our models rank candidates, we must evaluate two key groups: unstarred candidates with an exact job title match (“zookeeper”) and candidates with related job titles. These evaluations help determine whether a model effectively prioritizes the most relevant candidates at the top of the list.

First, we run evaluate_reranking() on a copy of the ranked data frame that includes the starred candidates. Then, we remove the starred candidates and reapply the function to test whether the model ranks unstarred but exact-matching candidates (“zookeeper”) at the top. This step is crucial because a model might not rank all starred candidates highest but could still correctly prioritize all “zookeeper” candidates. In such cases, the model performs well, but its metrics may appear lower.

Next, we further refine the test by removing both the starred candidates and the exact-match unstarred candidates. This allows us to evaluate how well the model ranks candidates with related job titles, such as those in biology, animal care, or adjacent fields. A strong model will consistently push these candidates to the top, distinguishing them from less relevant ones (e.g., HR professionals or tech roles). This final test is the most important for determining whether a model effectively identifies the best candidates for the keyword “zookeeper” because it shows the model’s ability to understand the semantic meaning shared by job_title values that are not exact matches and to promote semantically similar job_title values to the top of the list.

# Test metrics for original starred candidates, 
# Exact match unstarred candidates
# And candidates with related job_title features

ranking_features = [
"fit_BERT_FESR",
"rfr_pointwise_continuous_pred",
"rfr_pointwise_categorical_prob_pred",
"rfr_pointwise_continuous_pred_soft_target_regression_1",
"rfr_pointwise_continuous_pred_soft_target_regression_2",
"fit_lambdarank_listwise_pred",
"fit_lambdamart_rank:pairwise_pred",
"fit_lambdamart_rank:ndcg_pred",
"fit_lambdamart_rank:map_pred",
"fit_lambdamart_rank:pairwise_hybridTarget_pred",
"fit_lambdamart_rank:ndcg_hybridTarget_pred",
"RankNET_pairwise_neural",
"lambdaRank_listwise_neural"
]

# Loop through ranking features
for rank_col in ranking_features:
    print("-------------------------------------------------------------")

    print(f"Metrics for feature {rank_col}")

    df_ranking_test = df_combined.copy()

    # Evaluate the original starred candidates
    metrics_starred, df_remaining = evaluate_reranking(
        df_ranking_test,
        starred_ids=starred_ids,
        rank_col=rank_col,
        k=len(starred_ids)
    )

    print("\nOriginal Starred candidates metrics:")
    for metric, value in metrics_starred.items():
        print(f"{metric}: {value:.3f}")

    df_ranking_test = df_ranking_test[~df_ranking_test['id'].isin(starred_ids)]

    unstarred_exact_matches = [105, 118]

    # Evaluate exact-match unstarred candidates
    metrics_exact_matches, df_remaining = evaluate_reranking(
        df_ranking_test,
        starred_ids=unstarred_exact_matches,
        rank_col=rank_col,
        k=len(unstarred_exact_matches)
    )

    print("\nExact Match unstarred metrics:")
    for metric, value in metrics_exact_matches.items():
        print(f"{metric}: {value:.3f}")

    df_ranking_test = df_ranking_test[~df_ranking_test['id']
                                      .isin(unstarred_exact_matches)]

    related_matches = [107, 108, 109, 110, 116, 119, 120, 121,
                       122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134,
                       136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148,
                       149, 150, 135]

    # Evaluate related-match unstarred candidates
    metrics_related_matches, df_remaining = evaluate_reranking(
        df_ranking_test,
        starred_ids=related_matches,
        rank_col=rank_col,
        k=len(related_matches)
    )

    print("\nRelated Match unstarred metrics:")
    for metric, value in metrics_related_matches.items():
        print(f"{metric}: {value:.3f}")

    print("\n-------------------------------------------------------------\n\n")

-------------------------------------------------------------
Metrics for feature fit_BERT_FESR

Original Starred candidates metrics:
precision_at_3: 1.000
mean_rank_starred: 2.000
mrr: 1.000
ndcg_at_3: 1.000

Exact Match unstarred metrics:
precision_at_2: 1.000
mean_rank_starred: 1.500
mrr: 1.000
ndcg_at_2: 1.000

Related Match unstarred metrics:
precision_at_37: 0.973
mean_rank_starred: 19.108
mrr: 1.000
ndcg_at_37: 0.982

-------------------------------------------------------------

-------------------------------------------------------------
Metrics for feature rfr_pointwise_continuous_pred

Original Starred candidates metrics:
precision_at_3: 1.000
mean_rank_starred: 2.000
mrr: 1.000
ndcg_at_3: 1.000

Exact Match unstarred metrics:
precision_at_2: 0.500
mean_rank_starred: 2.500
mrr: 0.500
ndcg_at_2: 0.387

Related Match unstarred metrics:
precision_at_37: 0.027
mean_rank_starred: 98.919
mrr: 1.000
ndcg_at_37: 0.095

-------------------------------------------------------------

-------------------------------------------------------------
Metrics for feature rfr_pointwise_categorical_prob_pred

Original Starred candidates metrics:
precision_at_3: 1.000
mean_rank_starred: 2.000
mrr: 1.000
ndcg_at_3: 1.000

Exact Match unstarred metrics:
precision_at_2: 0.500
mean_rank_starred: 2.500
mrr: 0.500
ndcg_at_2: 0.387

Related Match unstarred metrics:
precision_at_37: 0.027
mean_rank_starred: 98.919
mrr: 1.000
ndcg_at_37: 0.095

-------------------------------------------------------------

-------------------------------------------------------------
Metrics for feature rfr_pointwise_continuous_pred_soft_target_regression_1

Original Starred candidates metrics:
precision_at_3: 0.333
mean_rank_starred: 3.333
mrr: 1.000
ndcg_at_3: 0.469

Exact Match unstarred metrics:
precision_at_2: 1.000
mean_rank_starred: 1.500
mrr: 1.000
ndcg_at_2: 1.000

Related Match unstarred metrics:
precision_at_37: 1.000
mean_rank_starred: 19.000
mrr: 1.000
ndcg_at_37: 1.000

-------------------------------------------------------------

-------------------------------------------------------------
Metrics for feature rfr_pointwise_continuous_pred_soft_target_regression_2

Original Starred candidates metrics:
precision_at_3: 0.333
mean_rank_starred: 4.000
mrr: 0.333
ndcg_at_3: 0.235

Exact Match unstarred metrics:
precision_at_2: 1.000
mean_rank_starred: 1.500
mrr: 1.000
ndcg_at_2: 1.000

Related Match unstarred metrics:
precision_at_37: 1.000
mean_rank_starred: 19.000
mrr: 1.000
ndcg_at_37: 1.000

-------------------------------------------------------------

-------------------------------------------------------------
Metrics for feature fit_lambdarank_listwise_pred

Original Starred candidates metrics:
precision_at_3: 1.000
mean_rank_starred: 2.000
mrr: 1.000
ndcg_at_3: 1.000

Exact Match unstarred metrics:
precision_at_2: 1.000
mean_rank_starred: 1.500
mrr: 1.000
ndcg_at_2: 1.000

Related Match unstarred metrics:
precision_at_37: 0.243
mean_rank_starred: 62.378
mrr: 1.000
ndcg_at_37: 0.397

-------------------------------------------------------------

-------------------------------------------------------------
Metrics for feature fit_lambdamart_pairwise_pred

Original Starred candidates metrics:
precision_at_3: 1.000
mean_rank_starred: 2.000
mrr: 1.000
ndcg_at_3: 1.000

Exact Match unstarred metrics:
precision_at_2: 1.000
mean_rank_starred: 1.500
mrr: 1.000
ndcg_at_2: 1.000

Related Match unstarred metrics:
precision_at_37: 0.243
mean_rank_starred: 56.811
mrr: 1.000
ndcg_at_37: 0.353

-------------------------------------------------------------

-------------------------------------------------------------
Metrics for feature fit_lambdamart_ndcg_pred

Original Starred candidates metrics:
precision_at_3: 1.000
mean_rank_starred: 2.000
mrr: 1.000
ndcg_at_3: 1.000

Exact Match unstarred metrics:
precision_at_2: 1.000
mean_rank_starred: 1.500
mrr: 1.000
ndcg_at_2: 1.000

Related Match unstarred metrics:
precision_at_37: 0.243
mean_rank_starred: 57.892
mrr: 1.000
ndcg_at_37: 0.336

-------------------------------------------------------------

-------------------------------------------------------------
Metrics for feature fit_lambdamart_map_pred

Original Starred candidates metrics:
precision_at_3: 1.000
mean_rank_starred: 2.000
mrr: 1.000
ndcg_at_3: 1.000

Exact Match unstarred metrics:
precision_at_2: 1.000
mean_rank_starred: 1.500
mrr: 1.000
ndcg_at_2: 1.000

Related Match unstarred metrics:
precision_at_37: 0.919
mean_rank_starred: 24.027
mrr: 1.000
ndcg_at_37: 0.942

-------------------------------------------------------------

-------------------------------------------------------------
Metrics for feature fit_lambdamart_rank:pairwise_hybridTarget_pred

Original Starred candidates metrics:
precision_at_3: 0.000
mean_rank_starred: 18.000
mrr: 0.062
ndcg_at_3: 0.000

Exact Match unstarred metrics:
precision_at_2: 0.000
mean_rank_starred: 14.500
mrr: 0.071
ndcg_at_2: 0.000

Related Match unstarred metrics:
precision_at_37: 0.973
mean_rank_starred: 19.027
mrr: 1.000
ndcg_at_37: 0.982

-------------------------------------------------------------

-------------------------------------------------------------
Metrics for feature fit_lambdamart_rank:ndcg_hybridTarget_pred

Original Starred candidates metrics:
precision_at_3: 0.333
mean_rank_starred: 6.333
mrr: 1.000
ndcg_at_3: 0.469

Exact Match unstarred metrics:
precision_at_2: 0.500
mean_rank_starred: 2.000
mrr: 1.000
ndcg_at_2: 0.613

Related Match unstarred metrics:
precision_at_37: 0.973
mean_rank_starred: 19.081
mrr: 1.000
ndcg_at_37: 0.982

-------------------------------------------------------------

-------------------------------------------------------------
Metrics for feature RankNET_pairwise_neural

Original Starred candidates metrics:
precision_at_3: 0.333
mean_rank_starred: 4.000
mrr: 0.333
ndcg_at_3: 0.235

Exact Match unstarred metrics:
precision_at_2: 1.000
mean_rank_starred: 1.500
mrr: 1.000
ndcg_at_2: 1.000

Related Match unstarred metrics:
precision_at_37: 1.000
mean_rank_starred: 19.000
mrr: 1.000
ndcg_at_37: 1.000

-------------------------------------------------------------

-------------------------------------------------------------
Metrics for feature lambdaRank_listwise_neural

Original Starred candidates metrics:
precision_at_3: 1.000
mean_rank_starred: 2.000
mrr: 1.000
ndcg_at_3: 1.000

Exact Match unstarred metrics:
precision_at_2: 1.000
mean_rank_starred: 1.500
mrr: 1.000
ndcg_at_2: 1.000

Related Match unstarred metrics:
precision_at_37: 0.703
mean_rank_starred: 28.162
mrr: 1.000
ndcg_at_37: 0.777

-------------------------------------------------------------

The evaluation clearly highlights three standout models for ranking related matches: rfr_pointwise_continuous_pred_soft_target_regression_1, rfr_pointwise_continuous_pred_soft_target_regression_2, and RankNET_pairwise_neural. These models achieved perfect scores, with precision_at_37 and ndcg_at_37 both hitting 1.000. This indicates that they consistently ranked the most relevant candidates at the top, making them the strongest options for prioritizing related matches.

Another strong performer was fit_BERT_FESR, which delivered impressive metrics with precision_at_37 at 0.973 and ndcg_at_37 at 0.982. What makes this approach particularly interesting is that it isn’t a trained model in the traditional sense. Instead, it’s a weighted combination of existing metrics: 40% fit_BERT, 60% similarity_to_starred, with an additional 10% weight assigned to connection count. This slight boost for highly connected profiles helps improve ranking quality while avoiding the need for additional training.

The advantage of fit_BERT_FESR is its simplicity. Because it relies on a clear weighting formula rather than a trained model, it’s easier to implement and maintain, requires no additional training data, and avoids the computational cost of model training and inference. Despite these strengths, it still fell just short of the performance achieved by the soft target regression models.
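
For readers who want to see that weighting as code, here is a minimal sketch of the blend, assuming the data frame already holds fit_BERT, similarity_to_starred, and a numeric connection count column (the column name connection_count and the min-max scaling step are illustrative assumptions, not the project's exact implementation).

from sklearn.preprocessing import MinMaxScaler

def compute_fit_bert_fesr(df):
    """Blend precomputed score columns into a single fit_BERT_FESR ranking score."""
    # Normalize connection counts so the 10% boost stays on a 0-1 scale
    connection_scaled = MinMaxScaler().fit_transform(df[["connection_count"]]).ravel()

    # Weighted combination described above: 40% BERT fit, 60% similarity to
    # starred candidates, plus a 10% boost for well-connected profiles
    return (0.4 * df["fit_BERT"]
            + 0.6 * df["similarity_to_starred"]
            + 0.1 * connection_scaled)

# Hypothetical usage:
# df_combined["fit_BERT_FESR"] = compute_fit_bert_fesr(df_combined)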

Meanwhile, other models such as fit_lambdamart_map_pred also performed well, with precision_at_37 reaching 0.919 and ndcg_at_37 at 0.942, making it a competitive alternative. However, models like fit_lambdarank_listwise_pred and fit_lambdamart_ndcg_pred lagged behind, with precision_at_37 around 0.243 and ndcg_at_37 under 0.400, making them less effective for ranking related matches.

Ultimately, the results suggest that if the goal is to ensure the highest accuracy in ranking related candidates, rfr_pointwise_continuous_pred_soft_target_regression_1 and 2, along with RankNET_pairwise_neural, are the best choices. Their perfect scores indicate they can reliably prioritize relevant matches. For a more interpretable, low-maintenance alternative, fit_BERT_FESR is an excellent option, balancing strong performance with practical implementation benefits.

Testing the models with keyword “Aspiring Human Resources”

Next, I ran evaluate_reranking() using the keyword “Aspiring Human Resources,” along with the computed fit features, the similarity_to_starred feature, and is_starred, to get metrics for all the model types with this different keyword. With a few lines of code, I was able to get all of the ids in the data frame that are related to “Human Resources” or “HR.”

keywords = ["Aspiring", "aspiring", "Human", "human", "HR", "hr"]
unstarred_matches = df[df["job_title"].str.contains("|".join(keywords),
case=False, na=False)]["id"].tolist()

Now we can run the evaluate_reranking() function first on the starred candidates for the “Aspiring Human Resources” keyword, then remove those starred candidates and run the function again to see how high it ranked the candidates with HR-related job titles, of which there were 71.

-------------------------------------------------------------
Metrics for feature fit_BERT_FESR

Original Starred candidates metrics:
precision_at_4: 0.750
mean_rank_starred: 3.750
mrr: 0.500
ndcg_at_4: 0.610

Match but unstarred metrics:
precision_at_71: 0.887
mean_rank_starred: 34.687
mrr: 1.000
ndcg_at_71: 0.958

-------------------------------------------------------------

-------------------------------------------------------------
Metrics for feature rfr_pointwise_continuous_pred

Original Starred candidates metrics:
precision_at_4: 0.750
mean_rank_starred: 2.750
mrr: 1.000
ndcg_at_4: 0.832

Match but unstarred metrics:
precision_at_71: 0.197
mean_rank_starred: 89.104
mrr: 1.000
ndcg_at_71: 0.341

-------------------------------------------------------------

-------------------------------------------------------------
Metrics for feature rfr_pointwise_categorical_prob_pred

Original Starred candidates metrics:
precision_at_4: 0.750
mean_rank_starred: 2.750
mrr: 1.000
ndcg_at_4: 0.832

Match but unstarred metrics:
precision_at_71: 0.282
mean_rank_starred: 80.612
mrr: 1.000
ndcg_at_71: 0.443

-------------------------------------------------------------

-------------------------------------------------------------
Metrics for feature rfr_pointwise_continuous_pred_soft_target_regression_1

Original Starred candidates metrics:
precision_at_4: 0.750
mean_rank_starred: 2.750
mrr: 1.000
ndcg_at_4: 0.832

Match but unstarred metrics:
precision_at_71: 0.887
mean_rank_starred: 35.313
mrr: 1.000
ndcg_at_71: 0.958

-------------------------------------------------------------

-------------------------------------------------------------
Metrics for feature rfr_pointwise_continuous_pred_soft_target_regression_2

Original Starred candidates metrics:
precision_at_4: 0.500
mean_rank_starred: 19.250
mrr: 0.500
ndcg_at_4: 0.414

Match but unstarred metrics:
precision_at_71: 0.873
mean_rank_starred: 35.448
mrr: 1.000
ndcg_at_71: 0.948

-------------------------------------------------------------

-------------------------------------------------------------
Metrics for feature fit_lambdarank_listwise_pred

Original Starred candidates metrics:
precision_at_4: 0.750
mean_rank_starred: 2.750
mrr: 1.000
ndcg_at_4: 0.832

Match but unstarred metrics:
precision_at_71: 0.634
mean_rank_starred: 56.537
mrr: 1.000
ndcg_at_71: 0.740

-------------------------------------------------------------

-------------------------------------------------------------
Metrics for feature fit_lambdamart_pairwise_pred

Original Starred candidates metrics:
precision_at_4: 0.750
mean_rank_starred: 3.250
mrr: 1.000
ndcg_at_4: 0.805

Match but unstarred metrics:
precision_at_71: 0.577
mean_rank_starred: 68.612
mrr: 1.000
ndcg_at_71: 0.651

-------------------------------------------------------------

-------------------------------------------------------------
Metrics for feature fit_lambdamart_ndcg_pred

Original Starred candidates metrics:
precision_at_4: 0.750
mean_rank_starred: 2.750
mrr: 1.000
ndcg_at_4: 0.832

Match but unstarred metrics:
precision_at_71: 0.577
mean_rank_starred: 68.149
mrr: 1.000
ndcg_at_71: 0.670

-------------------------------------------------------------

-------------------------------------------------------------
Metrics for feature fit_lambdamart_map_pred

Original Starred candidates metrics:
precision_at_4: 0.750
mean_rank_starred: 2.750
mrr: 1.000
ndcg_at_4: 0.832

Match but unstarred metrics:
precision_at_71: 0.592
mean_rank_starred: 67.358
mrr: 1.000
ndcg_at_71: 0.669

-------------------------------------------------------------

-------------------------------------------------------------
Metrics for feature fit_lambdamart_rank:pairwise_hybridTarget_pred

Original Starred candidates metrics:
precision_at_4: 0.000
mean_rank_starred: 23.250
mrr: 0.067
ndcg_at_4: 0.000

Match but unstarred metrics:
precision_at_71: 0.831
mean_rank_starred: 41.910
mrr: 1.000
ndcg_at_71: 0.915

-------------------------------------------------------------

-------------------------------------------------------------
Metrics for feature fit_lambdamart_rank:ndcg_hybridTarget_pred

Original Starred candidates metrics:
precision_at_4: 0.500
mean_rank_starred: 13.250
mrr: 1.000
ndcg_at_4: 0.586

Match but unstarred metrics:
precision_at_71: 0.746
mean_rank_starred: 47.239
mrr: 1.000
ndcg_at_71: 0.851

-------------------------------------------------------------

-------------------------------------------------------------
Metrics for feature RankNET_pairwise_neural

Original Starred candidates metrics:
precision_at_4: 0.500
mean_rank_starred: 8.250
mrr: 0.500
ndcg_at_4: 0.414

Match but unstarred metrics:
precision_at_71: 0.873
mean_rank_starred: 35.507
mrr: 1.000
ndcg_at_71: 0.946

-------------------------------------------------------------

-------------------------------------------------------------
Metrics for feature lambdaRank_listwise_neural

Original Starred candidates metrics:
precision_at_4: 0.250
mean_rank_starred: 8.750
mrr: 0.333
ndcg_at_4: 0.195

Match but unstarred metrics:
precision_at_71: 0.873
mean_rank_starred: 35.687
mrr: 1.000
ndcg_at_71: 0.946

-------------------------------------------------------------

The evaluation highlights a few standout models for ranking related matches. Notably, rfr_pointwise_continuous_pred_soft_target_regression_1 performed especially well, with precision at 4 of 0.750 and nDCG at 4 of 0.832 for starred candidates, and precision at 71 of 0.887 and nDCG at 71 of 0.958 for matching unstarred results. This balance makes it one of the strongest models overall.

Another strong performer was fit_BERT_FESR, which achieved a solid precision at 71 of 0.887 and nDCG at 71 of 0.958, along with a precision at 4 of 0.750. While its nDCG at 4 was a bit lower at 0.610, its appeal lies in its design. Rather than being a trained model, fit_BERT_FESR is a weighted blend: 40% from fit_BERT, 60% from similarity_to_starred, with an extra 10% weight for the connection feature. This approach avoids the cost and complexity of training, while still providing competitive rankings.

The model rfr_pointwise_continuous_pred also posted strong metrics for starred candidates, with a perfect MRR of 1.000 and a high nDCG at 4 of 0.832. However, it underperformed in identifying relevant unstarred candidates, with precision at 71 dropping to 0.197 and nDCG at 71 to 0.341, making it less well-rounded.

Similarly, rfr_pointwise_categorical_prob_pred mirrored the high performance on starred data (nDCG at 4: 0.832) but only improved slightly on unstarred results (nDCG at 71: 0.443).

A few models like fit_lambdamart_map_pred, fit_lambdamart_ndcg_pred, and fit_lambdarank_listwise_pred landed in the middle tier. These models consistently hit nDCG at 4 scores of 0.805–0.832 for starred data and hovered between 0.651–0.740 for nDCG at 71, indicating reasonable overall performance, though not exceptional.

Meanwhile, fit_lambdamart_rank:pairwise_hybridTarget_pred failed to recover relevant starred candidates, with precision at 4 of 0.000 and nDCG at 4 of 0.000, despite ranking unstarred matches fairly well (nDCG at 71: 0.915).

RankNET_pairwise_neural stood out among neural models, with strong unstarred results (nDCG at 71: 0.946) and decent starred rankings. In contrast, lambdaRank_listwise_neural struggled to rank starred items effectively, with a low nDCG at 4 of 0.195, though it matched RankNET’s performance on unstarred items.

Finally, rfr_pointwise_continuous_pred_soft_target_regression_2 posted excellent scores for unstarred data (nDCG at 71: 0.948) but dropped in performance for starred candidates (nDCG at 4: 0.414), suggesting a potential tradeoff in model tuning.

In conclusion, for the best all-around performance, especially in surfacing both starred and unstarred relevant candidates, rfr_pointwise_continuous_pred_soft_target_regression_1 and fit_BERT_FESR emerge as top choices. While fit_BERT_FESR excels in simplicity and maintainability, rfr_pointwise_continuous_pred_soft_target_regression_1 combines consistent strength across all metrics for both keywords.

Finding the Right Cutoff for Candidate Filtering

One of the biggest challenges I faced while building an AI-driven talent sourcing pipeline was figuring out how to filter out irrelevant candidates before ranking even began.

The goal was to determine the point at which job titles started to become less relevant for a given role and use that as the cutoff score. That way, I could systematically remove unqualified candidates after ranking, ensuring a high-quality edited list.

I analyzed high-performing models across different keywords to identify patterns in cutoff values using a normalized data frame for all of the successful model features. My goal was to find an empirical method to determine the optimal cutoff for each model or ranking approach.
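
The exact rule isn't prescribed here, but one simple empirical approach, sketched below under the assumption that scores are already normalized to a common scale, is to take the midpoint of the gap between the lowest-scoring known-relevant candidate and the highest-scoring remaining candidate. The id values, column names, and df_normalized are placeholders for illustration.

import numpy as np

def empirical_cutoff(df, score_col, relevant_ids):
    """Midpoint between the weakest relevant score and the strongest non-relevant score."""
    relevant_scores = df.loc[df["id"].isin(relevant_ids), score_col]
    other_scores = df.loc[~df["id"].isin(relevant_ids), score_col]
    if relevant_scores.empty or other_scores.empty:
        return np.nan
    return (relevant_scores.min() + other_scores.max()) / 2

# Hypothetical usage, one cutoff per normalized model feature:
# for col in ["fit_BERT_FESR", "RankNET_pairwise_neural"]:
#     print(col, empirical_cutoff(df_normalized, col, relevant_ids))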

For Keyword “Zookeeper”

+----+--------------------------------------------------------+----------------+
| | Feature | Cutoff Value |
+====+========================================================+================+
| 0 | fit_BERT_FESR | 0.298055 |
+----+--------------------------------------------------------+----------------+
| 1 | rfr_pointwise_continuous_pred_soft_target_regression_1 | 0.308075 |
+----+--------------------------------------------------------+----------------+
| 2 | rfr_pointwise_continuous_pred_soft_target_regression_2 | 0.284051 |
+----+--------------------------------------------------------+----------------+
| 3 | fit_lambdamart_rank:map_pred | 0.062927 |
+----+--------------------------------------------------------+----------------+
| 4 | fit_lambdamart_rank:pairwise_hybridTarget_pred | 0.218030 |
+----+--------------------------------------------------------+----------------+
| 5 | fit_lambdamart_rank:ndcg_hybridTarget_pred | 0.322340 |
+----+--------------------------------------------------------+----------------+
| 6 | RankNET_pairwise_neural | 0.145078 |
+----+--------------------------------------------------------+----------------+
| 7 | lambdaRank_listwise_neural | 0.196076 |
+----+--------------------------------------------------------+----------------+

For Keyword “Aspiring Human Resources”

With the exception of RankNET_pairwise_neural, the cutoff values for job_title relevance to the keyword “Aspiring Human Resources” fall between roughly 0.37 and 0.47; the RankNET-based neural approach sits much lower, at about 0.20.

+----+--------------------------------------------------------+----------------+
| | Feature | Cutoff Value |
+====+========================================================+================+
| 0 | fit_BERT_FESR | 0.466085 |
+----+--------------------------------------------------------+----------------+
| 1 | rfr_pointwise_continuous_pred_soft_target_regression_1 | 0.367256 |
+----+--------------------------------------------------------+----------------+
| 2 | rfr_pointwise_continuous_pred_soft_target_regression_2 | 0.374726 |
+----+--------------------------------------------------------+----------------+
| 3 | RankNET_pairwise_neural | 0.196174 |
+----+--------------------------------------------------------+----------------+
| 4 | lambdaRank_listwise_neural | 0.404464 |
+----+--------------------------------------------------------+----------------+

When analyzing different ranking models for keyword-based candidate selection, it became clear that the dataset structure significantly influenced the cutoff values. By comparing two distinct keyword datasets, “Zookeeper” and “Aspiring Human Resources”, I observed notable differences in how models determined relevance. These differences stem from the nature of the job titles associated with each keyword, ultimately affecting how strict or flexible the cutoffs needed to be.

Zookeeper Data: High Variability

The Zookeeper dataset was artificially generated, leading to highly diverse job titles that were semantically related to the field but didn’t always include “Zookeeper” explicitly (e.g., Wildlife Care Specialist, Animal Trainer). This forced ranking models to rely on context rather than direct keyword matching, making cutoff thresholds less strict.

The high variance in job titles within the Zookeeper dataset made it harder for models to define strict boundaries between relevant and non-relevant candidates. As a result, some models assigned very low cutoffs, meaning they were more flexible in considering semantically related job titles rather than requiring an exact keyword match.

For instance, fit_lambdamart_rank:map_pred had a particularly low cutoff of 0.062 for Zookeeper, indicating that even weakly related candidates were included in the ranked list. Similarly, fit_lambdamart_rank:ndcg_hybridTarget_pred had a cutoff of 0.322, which, while higher, still reflects the need for a broader inclusion of job titles due to the dataset’s variability.

Human Resources Data: Structured and Redundant

The HR dataset, sourced from real-world candidates, had consistent and repetitive job titles, often explicitly mentioning “HR” or “Human Resources.” With clear keyword matches and duplicate job titles, ranking models easily identified relevant candidates, resulting in higher and more defined cutoff values.

Because job titles in the HR dataset were clearly labeled and often repetitive, models were able to more easily identify strong matches. This led to higher cutoff values across most models, as the separation between relevant and irrelevant candidates was more distinct.

For example, the fit_BERT_FESR model had a cutoff of 0.466 for HR-related candidates, compared to just 0.298 for Zookeeper candidates. Similarly, lambdaRank_listwise_neural had a cutoff of 0.404 for HR but only 0.196 for Zookeeper.

The evaluation identified three top-performing models/approaches — rfr_pointwise_continuous_pred_soft_target_regression_1, RankNET_pairwise_neural, and fit_BERT_FESR — which consistently achieved good scores across different keywords. These models reliably placed the most relevant candidates at the top of the ranked list, making them ideal for prioritizing related matches in a sourcing pipeline.

Model Types and Their Inputs

  • The rfr_pointwise_continuous_pred_soft_target_regression models are regression models (using Random Forest Regressors) trained on pointwise features such as cosine similarity, connection count, and title match scores. Their targets are soft target scores, derived from smoothed relevance labels reflecting graded similarity rather than binary matches (see the sketch after this list).
  • RankNET_pairwise_neural is a neural network trained with pairwise ranking loss, using feature vectors for candidate pairs. Its objective is to predict which of the two is more relevant, based on manually or heuristically defined pairwise preferences.
  • fit_BERT_FESR is not a trained model but a weighted feature ensemble, combining 40% cosine similarity between BERT embeddings of the query and job titles, 60% similarity to starred candidates, and a 10% boost for connection count. This makes it highly interpretable, low-maintenance, and very effective (Precision@37 = 0.973, NDCG@37 = 0.982) without requiring any training.
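
As a rough illustration of the first bullet above, a pointwise Random Forest regressor with soft targets can be set up as sketched below; the feature values and smoothed labels are made-up placeholders rather than the project's actual training data.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical pointwise features: [cosine_similarity, connection_scaled, title_match]
X = np.array([
    [0.91, 0.80, 1.0],
    [0.62, 0.10, 0.0],
    [0.35, 0.50, 0.0],
])

# Soft targets: smoothed relevance in [0, 1] instead of hard 0/1 labels
y_soft = np.array([0.95, 0.40, 0.15])

rfr = RandomForestRegressor(n_estimators=200, random_state=42)
rfr.fit(X, y_soft)

# The regressor's continuous predictions become the ranking feature
scores = rfr.predict(X)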

Further keyword-specific evaluation using “Aspiring Human Resources” confirmed the strong generalization of the top models. While the same three models (plus lambdaRank_listwise_neural) stood out, their relative performance shifted slightly depending on the dataset characteristics.

A major challenge in ranking pipelines is filtering out irrelevant candidates before or after ranking. To address this, empirical cutoff scores were determined from the distribution of model scores, tailored to each keyword and model type.

  • For the keyword “Zookeeper” (with high variance and synthetic titles), cutoffs ranged from 0.06 to 0.32, with models needing more flexibility to include semantically related but non-explicit titles.
  • For “Aspiring Human Resources”, where real-world data showed redundancy and clearer labeling, cutoffs were much higher — typically between 0.36 and 0.46 — indicating stronger separability between relevant and irrelevant candidates.

These variations underscore the impact of dataset structure on model scoring. Models performed more confidently on structured, repetitive datasets (like HR) and required more lenient thresholds on diverse or ambiguous ones (like Zookeeper).

Candidate resume to job posting similarity

The resume is one of the richest sources of candidate data we have. It tells a story, not just about where someone has worked, but how they’ve progressed, what they’re good at, what they care about, and how they talk about their own value.

So why limit ourselves to just job titles?

With some smart parsing and NLP, we can unlock a much deeper understanding of candidate fit and build models that reflect how real hiring decisions are made. Most resumes follow a common structure — and that’s a good thing. It gives us an entry point to extract and quantify much more nuanced features.

We can break resumes into sections such as Experience, Education, Skills, Certifications, and About Me / Summary. Each of these can be analyzed individually to create targeted feature scores.
For example:

  • experience_match_score: How well do the candidate’s job descriptions align with the responsibilities in the job posting?
  • skills_match_score: Are the required (or nice-to-have) skills present in the resume, and in what context?
  • relevant_certification_score: Does the resume mention certifications that are relevant to the role?
  • summary_alignment_score: Does the candidate’s self-description match the values, culture, or mission of the company?

Each of these features can be distilled down to a score that gets fed into the ranking model, and together, they paint a much more complete picture of candidate fit than a single job title ever could.
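
As a concrete sketch, each section-level score can be a cosine similarity between an embedding of the resume section and an embedding of the matching part of the job posting. The snippet below assumes the sentence-transformers library and the all-MiniLM-L6-v2 model purely for illustration; any sentence-level encoder would do, and the section/posting field names are hypothetical.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder, not the project's model

def section_match_score(resume_section_text, job_posting_text):
    """Cosine similarity between a resume section and the relevant job posting text."""
    emb_resume = model.encode(resume_section_text, convert_to_tensor=True)
    emb_posting = model.encode(job_posting_text, convert_to_tensor=True)
    return float(util.cos_sim(emb_resume, emb_posting))

# Hypothetical usage for the features listed above:
# experience_match_score = section_match_score(resume["experience"], posting["responsibilities"])
# skills_match_score = section_match_score(resume["skills"], posting["required_skills"])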

Resume to resume similarity: candidate to starred candidates

Here’s another powerful idea: if you already have starred candidates in your pipeline, say they’re labeled as “starred”, you can use their resumes as benchmarks.

For every new resume, compute a similarity score to starred resumes for all of the resume sections. This is resume-to-resume matching, using embedding vectors or document-level semantic similarity techniques.

This allows your model to pick up on patterns that might not be obvious in keywords alone. If your top candidates all have certain phrasing or experiences in common, the model can start to learn that signal and prioritize similar candidates in future rankings.
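
One simple way to realize this, again assuming a sentence-transformers encoder as in the previous sketch, is to score each new resume by its maximum (or mean) cosine similarity to the starred resumes and feed that value in as a ranking feature.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def similarity_to_starred_resumes(new_resume_text, starred_resume_texts, reduce="max"):
    """Max (or mean) cosine similarity between a new resume and the starred resumes."""
    new_emb = model.encode(new_resume_text, convert_to_tensor=True)
    starred_embs = model.encode(starred_resume_texts, convert_to_tensor=True)
    sims = util.cos_sim(new_emb, starred_embs).cpu().numpy().ravel()
    return float(sims.max() if reduce == "max" else sims.mean())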

By turning resume content into structured feature scores, we can add several important layers to the model:

  • Semantic depth: Not just what someone did, but how closely it aligns with what you’re looking for.
  • Career context: Promotions, job changes, and patterns over time.
  • Cultural fit: Is the way someone presents themselves in sync with the company’s voice or mission?
  • Comparative ranking: How does this person stack up against our best candidates so far?

All of this helps to reduce noise and create a model that feels a lot more like a seasoned recruiter, one who’s seen thousands of resumes and knows what a great one looks like.

Addressing Bias in Starred Candidate Models and Solutions for Fairer Hiring

However, it’s important to recognize a potential pitfall in this approach. If bias exists in the hiring decisions behind the candidates introduced into the starred_candidates_history variable, that bias will inevitably be carried forward in the resume-to-resume NLP model. Since the model relies on comparing new resumes to those already labeled as “starred,” any biases in those starred candidates, whether unconscious preferences for certain demographics, educational backgrounds, or career trajectories, will likely influence the rankings of future candidates. This could perpetuate systemic biases rather than reduce them.

One way to counteract this would be to work alongside HR departments and DEI (Diversity, Equity, and Inclusion) teams to build an ideal candidate profile. The ideal candidate resume could be created as a collaborative effort, considering key competencies, skills, and experiences that are truly relevant to the role, while also reflecting a commitment to diversity and inclusion. By comparing all incoming resumes to this ideal profile rather than to previously starred candidates, we would help ensure a more objective and equitable model, reducing the risk of reinforcing historical biases. This approach could create a hiring process that’s more focused on what really matters in a candidate’s potential, rather than on patterns that might not be representative or inclusive.

This project offered an exciting opportunity to explore how NLP and data science can be used to support smarter, fairer hiring decisions. By diving deep into resume data and automating candidate ranking, we were able to surface meaningful patterns, reduce bias, and suggest more holistic ways to evaluate talent. With targeted feature engineering and scalable modeling, this approach can help hiring teams save time, make better decisions, and focus their energy where it counts most, on people.

As an AI Resident at Apziva, this project pushed me to think critically about the intersection of language, bias, and decision-making. It reinforced the value of aligning technical models with real-world needs, especially in high-impact domains like hiring.

Curious to learn more or explore a collaboration? I’d love to connect. Feel free to reach out on LinkedIn or check out this project on GitHub. Let’s build something meaningful together. Looking forward to connecting!
