This is one of the most important steps while working with data. Prediction accuracy depends heavily on how clean the data is: cleaning removes entries that are pointless or that can drastically skew the results. So let's start cleaning the data.
1. Removing the numeric score and making it bipolar (positive and negative)
Since we want to predict whether a review is positive or negative, we replace the Score column with a polarity label: reviews with a score above 3 become positive and the rest become negative, while reviews with a score of exactly 3 are dropped as neutral. (If we kept the raw score, we could predict the sentiment with a simple if-else condition, without any Machine Learning.)
import sqlite3
import pandas as pd

#removing data with score 3 (to simplify the prediction)
#the query needs an SQLite connection; adjust the path to wherever your database file is stored
con = sqlite3.connect('database.sqlite')
file = pd.read_sql_query("""select * from Reviews where Score != 3""", con)

##converting the score to a polarity label
def conv(x):
    if x < 3:
        return 'negative'
    else:
        return 'positive'

score = file['Score']
solution = score.map(conv)
file['Score'] = solution
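As a quick sanity check (not part of the original snippet, just a sketch assuming the DataFrame is still called file), we can look at how many positive and negative reviews remain after the mapping:
##sanity check: class balance and shape after converting scores to polarity
print(file['Score'].value_counts())
print(file.shape)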
2. Removing impractical data
Rows where HelpfulnessNumerator is greater than HelpfulnessDenominator are impractical and are most likely manual errors. Similarly, several reviews from the same user at the same timestamp are not plausible, so we keep one of them and discard the rest.
##removing rows where total helpfulness votes are less than the helpful (positive) votes
file = file[file.HelpfulnessNumerator <= file.HelpfulnessDenominator]

##dropping duplicates w.r.t. product id and timestamp, keeping the first occurrence
file = file.drop_duplicates(subset=['ProductId', 'TimeStamp'], keep='first', inplace=False)
3. Sorting the values
Sorting the values according to the ProductId
##sorting according to product id
file = file.sort_values('ProductId', axis=0, ascending=True)
You can find many more ways to clean your data for further usage; the more you analyze the data, the more ways you will find to clean it.
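For instance, one extra cleaning step you might try is dropping reviews whose text is missing or exactly duplicated. This is only a sketch; it assumes the review text lives in the Text column of the Reviews table and that the DataFrame is still called file:
##example of a further cleaning step: drop rows with missing text and exact duplicate review texts
file = file.dropna(subset=['Text'])
file = file.drop_duplicates(subset=['Text'], keep='first')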
As we revised earlier, Machine Learning is often described as mathematics that enables computer applications to learn without being explicitly programmed. In a nutshell, ML is all about maths: numbers and formulas. Every algorithm is built on a mathematical or physical concept.
But to build these algorithms we need numerical data, right? Yet here we are dealing with reviews written in a human language (English). So what do we do?
Converting all the reviews into vectors lets us treat them mathematically, using planes, vectors, magnitudes, relationships and much more.
Once the reviews are vectors, similar reviews end up plotted close to each other and dissimilar ones far apart. We can then place those vectors in an n-dimensional space and look for a plane that separates all the positive points from the negative ones.
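One common way to get such vectors is the Bag of Words representation from scikit-learn; the sketch below only illustrates the idea (the vectorizer actually used later in the project may differ) and assumes the cleaned reviews sit in the Text column of file:
from sklearn.feature_extraction.text import CountVectorizer

##Bag of Words: one row per review, one column per distinct word
count_vect = CountVectorizer()
bow_vectors = count_vect.fit_transform(file['Text'])
print(bow_vectors.shape)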