Hello World!
# Data cleaning and preparation
#importing the libraries
%matplotlib inline
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.dates import DateFormatter
#Handle date time conversions between pandas and matplotlib
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
import seaborn as sns
#use white grid plot background from seaborn
#sns.set(font_scale = 1.5, style= "whitegrid")
import random
#Read the CSV into a dataframe named 'lottery'
lottery = pd.read_csv('new649.csv',
                      parse_dates = ['DRAW DATE'],
                      index_col = ['DRAW DATE'],
                      na_values = [999.9])
lottery.head()
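To see what those read_csv options do without the real file, here is a small sketch on made-up rows. The values are invented for illustration; only the column names come from the dataset.

```python
import io
import pandas as pd

# Made-up sample standing in for new649.csv; the values are invented.
sample_csv = io.StringIO(
    "DRAW DATE,NUMBER DRAWN 1,BONUS NUMBER\n"
    "1984-02-25,3,999.9\n"
    "1984-03-03,11,24\n"
)

sample = pd.read_csv(
    sample_csv,
    parse_dates=["DRAW DATE"],  # convert the column to datetimes
    index_col=["DRAW DATE"],    # use it as the row index
    na_values=[999.9],          # treat 999.9 as a missing value
)

print(type(sample.index).__name__)
print(sample["BONUS NUMBER"].isnull().tolist())
```

With a date index in place, rows can later be selected by date range.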
#let's check to see if there are any null values
lottery.isnull().any().any()
Thankfully, the answer is False.
lottery.count()
#let's look at the shape of the data
lottery.shape
#having a look at the 11 columns in the dataset
lottery.columns
lottery.info()
Looking at the data info, it seems only the draw date is of object type, while the rest are integers.
Next, remove the unneeded columns from the dataframe: “PRODUCT”, “DRAW NUMBER” and “SEQUENCE NUMBER”.
#using the DataFrame.drop method
drop_cols = ['PRODUCT',
'DRAW NUMBER', 'SEQUENCE NUMBER']
#assign the reduced dataframe to a new name
clean_lottery = lottery.drop(drop_cols, axis = 1)
clean_lottery.head()
The PRODUCT, DRAW NUMBER and SEQUENCE NUMBER columns are dropped.
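A quick sketch of how drop behaves, on a tiny made-up frame (only the column names mirror the dataset). It shows that `axis = 1` and the `columns=` keyword give the same result, and that the original frame is left untouched.

```python
import pandas as pd

# Tiny made-up frame; the values are invented.
df = pd.DataFrame({
    "PRODUCT": [649, 649],
    "DRAW NUMBER": [1, 2],
    "NUMBER DRAWN 1": [3, 11],
})

by_axis = df.drop(["PRODUCT", "DRAW NUMBER"], axis=1)
by_keyword = df.drop(columns=["PRODUCT", "DRAW NUMBER"])

# drop returns a new frame; df still has all three columns
print(list(by_axis.columns), list(df.columns))
```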
Looking at the new dataset, only the drawn numbers and the bonus number remain, so I thought there should be a column holding the sum of all the drawn numbers.
Here, a sum column will be added to the dataframe, containing the total of drawn numbers 1 to 6 plus the bonus number.
Now, we could do this the old-school way, typing out a long line of code like the one below:
#the parentheses let the sum continue across several lines
clean_lottery["TOTAL NUMBER"] = (clean_lottery['NUMBER DRAWN 1']
    + clean_lottery['NUMBER DRAWN 2'] + clean_lottery['NUMBER DRAWN 3']
    + clean_lottery['NUMBER DRAWN 4'] + clean_lottery['NUMBER DRAWN 5']
    + clean_lottery['NUMBER DRAWN 6'] + clean_lottery['BONUS NUMBER'])
clean_lottery.head()
But before checking out the other way of getting the sum total of the numbers drawn, we first have to remove the ‘TOTAL NUMBER’ column from the dataframe.
TOTAL = clean_lottery.pop('TOTAL NUMBER')
TOTAL.head(10)
#Now to find out if the column has really been popped out
clean_lottery.head(10)
It sure has!
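For anyone unsure what pop does, a minimal sketch on a made-up two-column frame: it hands back the column as a Series and removes it from the dataframe in place.

```python
import pandas as pd

# Made-up frame for illustration
df = pd.DataFrame({"A": [1, 2], "TOTAL NUMBER": [10, 20]})

popped = df.pop("TOTAL NUMBER")  # returns the column as a Series

print(popped.tolist())
print(list(df.columns))  # the column is gone from df itself
```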
Now, shall we proceed?
First, we put the column names of clean_lottery into a list, column_list, naming the columns to sum up.
#list(clean_lottery) gives the column names of the dataframe
column_list = list(clean_lottery)
And then add the new column to the dataframe, using the ‘sum’ function over the column list to total the drawn numbers for each draw.
clean_lottery["TOTAL NUMBER"] = clean_lottery[column_list].sum(axis = 1)
clean_lottery.head(20)
Now you see: if you compare these figures with the earlier TOTAL NUMBER column above, you will find they are the same. So we have a more comfortable method.
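The comparison can also be done in code rather than by eye. A small sketch, using a made-up frame with three of the columns to keep it short:

```python
import pandas as pd

# Made-up draws; the values are invented.
draws = pd.DataFrame({
    "NUMBER DRAWN 1": [3, 1],
    "NUMBER DRAWN 2": [11, 5],
    "BONUS NUMBER": [24, 9],
})

# Long way: add the columns one by one (parentheses allow line breaks)
long_way = (
    draws["NUMBER DRAWN 1"]
    + draws["NUMBER DRAWN 2"]
    + draws["BONUS NUMBER"]
)

# Short way: sum across the columns of each row
short_way = draws[list(draws)].sum(axis=1)

print((long_way == short_way).all())
```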
So we have the TOTAL NUMBER of the drawn numbers, but the figures are not in any order, so let's put them in order using the sort_values function.
clean_lottery = clean_lottery.sort_values(by = 'TOTAL NUMBER', ascending = False)
clean_lottery.head(20)
VOILA! Now we have them ordered from the biggest to the smallest. Speaking of smallest, how about we have a look?
clean_lottery['TOTAL NUMBER'].max()
clean_lottery['TOTAL NUMBER'].min()
48!! unbelievable…
Let's have a look at the statistics summary of the dataset.
clean_lottery.describe().transpose()
clean_lottery['TOTAL NUMBER'].mode()
#let's see how many times 170 occurred
clean_lottery['TOTAL NUMBER'].value_counts().to_frame().head(10)
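How mode and value_counts relate is easier to see on a handful of made-up totals: value_counts lists each value with its frequency, most common first, and mode returns the most frequent value or values.

```python
import pandas as pd

# Made-up totals for illustration
totals = pd.Series([170, 150, 170, 200, 150, 170], name="TOTAL NUMBER")

counts = totals.value_counts()  # sorted by frequency, most common first

print(counts.loc[170])        # how many times 170 occurs
print(totals.mode().tolist())
```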
Looking at the output above, some totals occurred more than once. Could it be that some draws had exactly the same numbers from 1 to 6 plus the bonus number, and so gave the same total?
Let's look at the columns from NUMBER DRAWN 1 to BONUS NUMBER for any duplicates.
#checking for duplicate
clean_lottery[['NUMBER DRAWN 1',
'NUMBER DRAWN 2', 'NUMBER DRAWN 3',
'NUMBER DRAWN 4', 'NUMBER DRAWN 5',
'NUMBER DRAWN 6', 'BONUS NUMBER']].duplicated().head(20)
Surprisingly, it all came out as False.
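One caveat: head(20) only shows the first 20 rows of the result. To check every row at once, the boolean result can be reduced with any() or sum(). A sketch on a made-up frame that deliberately contains one repeated row:

```python
import pandas as pd

# Made-up draws; the third row repeats the first on purpose.
draws = pd.DataFrame({
    "NUMBER DRAWN 1": [3, 1, 3],
    "NUMBER DRAWN 2": [11, 5, 11],
})

dup_mask = draws.duplicated()  # True where a row repeats an earlier row

print(dup_mask.any())       # is there any duplicate at all?
print(int(dup_mask.sum()))  # how many rows are duplicates
```

On the real data, the same `.any()` call on the seven-column selection would confirm the result for all draws, not just the first 20.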
So what could have made the TOTAL NUMBER figures repeat? Let's have a look at the rows with a duplicated TOTAL NUMBER.
#select the rows whose TOTAL NUMBER repeats an earlier row, keeping every column from the date to the bonus number
dep_lottery = clean_lottery[clean_lottery.duplicated(['TOTAL NUMBER'])]
dep_lottery.head(15)
It turns out the duplicated totals are made up of different combinations of numbers from NUMBER DRAWN 1 to the BONUS NUMBER that happen to add up to the same TOTAL NUMBER.
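Repeated totals are to be expected. Assuming a 6/49 game in which the six drawn numbers and the bonus are seven distinct values from 1 to 49 (my assumption, the dataset itself does not state the rules), only a limited number of totals is possible, so a long enough history has to repeat some. The sketch below works out that range, then uses `keep=False` on a made-up frame to show every row that shares a total, not only the later repeats.

```python
import pandas as pd

# Assumption: 7 distinct numbers (6 drawn + bonus) from 1 to 49
lowest = sum(range(1, 8))      # 1 + 2 + ... + 7
highest = sum(range(43, 50))   # 43 + 44 + ... + 49
possible_totals = highest - lowest + 1
print(lowest, highest, possible_totals)

# Made-up draws: different numbers, same total in the first two rows
draws = pd.DataFrame({"A": [1, 2, 3], "B": [9, 8, 5]})
draws["TOTAL"] = draws.sum(axis=1)

# keep=False marks every member of a duplicated group
shared = draws[draws.duplicated("TOTAL", keep=False)]
print(shared)
```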
# Data Visualization
So far so good; it's time to do some visualization.
#lottery dates and time
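#sort by date first: clean_lottery was sorted by TOTAL NUMBER, and date slicing needs a sorted index
lotteryDate = clean_lottery.sort_index().loc['1984-02-25':'1984-11-12']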
lotteryDate.head()
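Selecting a date range by label is most dependable when the date index is in chronological order. A minimal sketch with made-up dates and totals, where the frame starts out in a shuffled order:

```python
import pandas as pd

# Made-up dates, deliberately out of order
idx = pd.to_datetime(["1984-11-12", "1984-02-25", "1985-01-05", "1984-06-30"])
df = pd.DataFrame({"TOTAL NUMBER": [200, 180, 150, 120]}, index=idx)

# Sort by date, then slice; both end dates are included
window = df.sort_index().loc["1984-02-25":"1984-11-12"]

print(len(window))
print(window.index.min(), window.index.max())
```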
fig, ax = plt.subplots(figsize=(12, 12))
#add x-axis and y-axis
ax.bar(lotteryDate.index.values, lotteryDate['NUMBER DRAWN 1'],
       color = 'red')
#set title and labels for axes
ax.set(xlabel = 'Date',
       ylabel = 'NUMBER DRAWN 1', title = 'Lotto')
plt.show()
fig, ax = plt.subplots(figsize=(12, 12))
ax.bar(lotteryDate.index.values, lotteryDate['NUMBER DRAWN 2'],
       color = 'purple')
#set title and labels for axes
ax.set(xlabel = 'Date',
       ylabel = 'NUMBER DRAWN 2',
       title = 'Lotto')
plt.tight_layout()
plt.show()
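Since the two charts differ only in the column and the colour, one option is to wrap the plotting in a small function so the axis label always matches the column being plotted. This is only a sketch: `plot_column` is a hypothetical helper of my own, the data is made up, and the Agg backend line is there just so it runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up draws standing in for lotteryDate
idx = pd.to_datetime(["1984-02-25", "1984-03-03", "1984-03-10"])
sample = pd.DataFrame(
    {"NUMBER DRAWN 1": [3, 11, 7], "NUMBER DRAWN 2": [14, 20, 9]},
    index=idx,
)

def plot_column(frame, column, color):
    """Bar chart of one column against the date index (hypothetical helper)."""
    fig, ax = plt.subplots(figsize=(12, 12))
    ax.bar(frame.index.values, frame[column], color=color)
    # the y label is taken from the column name, so it cannot drift
    ax.set(xlabel="Date", ylabel=column, title="Lotto")
    fig.tight_layout()
    return fig, ax

for column, color in [("NUMBER DRAWN 1", "red"), ("NUMBER DRAWN 2", "purple")]:
    fig, ax = plot_column(sample, column, color)
    plt.close(fig)  # in a notebook, call plt.show() instead
```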