Hello World!

# Data Cleaning and Preparation

#importing the libraries

%matplotlib inline 
import os 
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt 
import matplotlib.dates as mdates
from matplotlib.dates import DateFormatter

#Handle date time conversions between pandas and matplotlib
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
import seaborn as sns


#use white grid plot background from seaborn 
#sns.set(font_scale = 1.5, style= "whitegrid")

import random

#Reading the CSV file into a dataframe named 'lottery'

lottery = pd.read_csv('new649.csv',
       parse_dates = ['DRAW DATE'],
       index_col = ['DRAW DATE'],
       na_values = [999.9])

lottery.head()
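To see what those three read_csv arguments do, here is a minimal sketch on a tiny in-memory CSV. The two rows are made up for illustration and are not taken from new649.csv; only the column names follow the post.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for new649.csv (values are made up).
sample_csv = io.StringIO(
    "PRODUCT,DRAW NUMBER,SEQUENCE NUMBER,DRAW DATE,NUMBER DRAWN 1,BONUS NUMBER\n"
    "649,1,0,1982-06-12,3,43\n"
    "649,2,0,1982-06-19,8,999.9\n"
)

sample = pd.read_csv(
    sample_csv,
    parse_dates=["DRAW DATE"],  # convert the column to datetimes
    index_col=["DRAW DATE"],    # use it as the row index
    na_values=[999.9],          # read 999.9 as a missing value
)

print(sample.index.dtype)                   # a datetime64 dtype
print(sample["BONUS NUMBER"].isna().sum())  # how many values were read as NaN
```

With the dates in the index, rows can later be selected by date range.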

#let's check to see if there are any null values

lottery.isnull().any().any()
Thankfully, the answer is False.
lottery.count()
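If the double .any() looks odd, this small sketch breaks it into steps. The toy dataframe is an assumption for illustration; unlike the lottery data, it deliberately contains one missing value.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in column B (made-up data).
toy = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, np.nan, 6.0]})

per_column = toy.isnull().any()      # one True/False per column
overall = toy.isnull().any().any()   # a single True/False for the whole frame
missing_counts = toy.isnull().sum()  # number of missing values per column

print(per_column)
print(overall)
print(missing_counts)
```

The first .any() collapses each column to one boolean, and the second collapses those booleans to a single answer.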

#let's look at the shape of the data

lottery.shape

#having a look at the 11 columns in the dataset

lottery.columns

lottery.info()
Looking at the data info, it seems only the draw date falls under the object category, while the rest are integers.
Now let's remove the unneeded columns from the dataframe, namely "PRODUCT", "DRAW NUMBER" and "SEQUENCE NUMBER".

#using the dataframe's drop method

drop_cols = ['PRODUCT',
'DRAW NUMBER', 'SEQUENCE NUMBER']

#assigning the trimmed dataframe to a new variable

clean_lottery = lottery.drop(drop_cols, axis = 1)
clean_lottery.head()
The PRODUCT, DRAW NUMBER and SEQUENCE NUMBER columns are dropped.
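As a side note, drop returns a new dataframe and leaves the original untouched, and pandas also accepts a columns= keyword that does the same job as axis = 1. A small sketch on a made-up frame:

```python
import pandas as pd

# Toy frame reusing a few of the post's column names (values are made up).
toy = pd.DataFrame({
    "PRODUCT": [649, 649],
    "DRAW NUMBER": [1, 2],
    "NUMBER DRAWN 1": [3, 8],
})

by_axis = toy.drop(["PRODUCT", "DRAW NUMBER"], axis=1)     # style used in this post
by_keyword = toy.drop(columns=["PRODUCT", "DRAW NUMBER"])  # equivalent spelling

print(by_axis.columns.tolist())
print(toy.columns.tolist())  # the original frame still has all its columns
```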
Looking at the new dataset, only the drawn numbers and the bonus number remain, so I thought there should be a sum of all the numbers drawn.
Here, a sum column will be added to the dataframe, containing the total of drawn numbers 1 to 6 plus the bonus number.
Now, we could do this the old-school way, by typing out a long line of code like the one below:
#the parentheses let the expression continue across several lines
clean_lottery["TOTAL NUMBER"] = (
    clean_lottery['NUMBER DRAWN 1']
    + clean_lottery['NUMBER DRAWN 2'] + clean_lottery['NUMBER DRAWN 3']
    + clean_lottery['NUMBER DRAWN 4'] + clean_lottery['NUMBER DRAWN 5']
    + clean_lottery['NUMBER DRAWN 6'] + clean_lottery['BONUS NUMBER']
)
clean_lottery.head()

But before checking out the other way of getting the sum total of the numbers drawn, we first have to remove the 'TOTAL NUMBER' column from the dataframe.

TOTAL = clean_lottery.pop('TOTAL NUMBER')
TOTAL.head(10)

#Now to find out if the column has really been popped out

clean_lottery.head(10)

It sure has!
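What pop does here, sketched on a made-up frame: it removes the column from the dataframe in place and hands it back as a Series, so nothing needs to be reassigned.

```python
import pandas as pd

# Toy frame with a hypothetical 'total' column (made-up data).
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "total": [4, 6]})

popped = toy.pop("total")  # removes the column in place and returns it

print(popped.tolist())
print(toy.columns.tolist())
```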

Now, shall we proceed?

First, we store the column names of clean_lottery in a list called column_list, so we know which columns to sum up.

#storing the column names of clean_lottery in column_list

column_list = list(clean_lottery)

Then we add the new column to the dataframe, using the column list with the sum method to get the totals of the numbers drawn.

clean_lottery["TOTAL NUMBER"] = clean_lottery[column_list].sum(axis = 1)
clean_lottery.head(20)

Now you see: if you compare these figures with the earlier TOTAL NUMBER above, you will find they are the same. So we have a more comfortable method.
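Here is a quick way to confirm that the two methods agree, sketched on a made-up frame with fewer columns than the real one. One detail worth noting: list() on a dataframe returns every column name, so the list should be captured before the total column exists, otherwise the total would be added into itself.

```python
import pandas as pd

# Toy frame with three of the post's column names (values are made up).
toy = pd.DataFrame({
    "NUMBER DRAWN 1": [3, 8],
    "NUMBER DRAWN 2": [11, 10],
    "BONUS NUMBER": [43, 7],
})

column_list = list(toy)  # column names, captured before the total is added

# Old-school version: add the columns one by one.
manual = toy["NUMBER DRAWN 1"] + toy["NUMBER DRAWN 2"] + toy["BONUS NUMBER"]

# Shorter version: sum across each row.
toy["TOTAL NUMBER"] = toy[column_list].sum(axis=1)

print((toy["TOTAL NUMBER"] == manual).all())
```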

So we have the TOTAL NUMBER of the drawn numbers, but the figures are not in any order, so let's put them in order using the sort_values method.

clean_lottery = clean_lottery.sort_values(by = 'TOTAL NUMBER', ascending = False)
clean_lottery.head(20)

VOILA!!! Now we have them in order, from the biggest to the smallest. Speaking of smallest, how about we have a look at both ends?

clean_lottery['TOTAL NUMBER'].max()

clean_lottery['TOTAL NUMBER'].min()

48!! Unbelievable…
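Since the dataframe is already sorted in descending order, the first and last rows carry the same information as max() and min(). A small sketch with made-up totals:

```python
import pandas as pd

# Made-up totals, not the real lottery figures.
totals = pd.DataFrame({"TOTAL NUMBER": [120, 60, 200, 150]})

ordered = totals.sort_values(by="TOTAL NUMBER", ascending=False)

largest = ordered["TOTAL NUMBER"].iloc[0]    # first row after a descending sort
smallest = ordered["TOTAL NUMBER"].iloc[-1]  # last row after a descending sort

print(largest, smallest)
```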

Let's have a look at the statistical summary of the dataset.

clean_lottery.describe().transpose()

clean_lottery['TOTAL NUMBER'].mode()

#shall we see how many times 170 occurred

clean_lottery['TOTAL NUMBER'].value_counts().to_frame().head(10)
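How mode and value_counts relate, sketched on a made-up Series: value_counts lists each distinct total with its frequency, most frequent first, and the mode is the value at the top of that list.

```python
import pandas as pd

# Made-up totals for illustration only.
totals = pd.Series([170, 150, 170, 181, 150, 170], name="TOTAL NUMBER")

modes = totals.mode()           # most frequent value(s), returned as a Series
counts = totals.value_counts()  # frequency of each value, most frequent first

print(modes.tolist())
print(counts.to_frame().head(10))
```

mode() returns a Series rather than a single number because several values can tie for most frequent.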

Looking at the output above, some totals occurred more than once. Could it be that some draws had exactly the same numbers, from number drawn 1 to the bonus number, and so gave the same total?

Let's check the columns from NUMBER DRAWN 1 to BONUS NUMBER for any duplicates.

#checking for duplicates

clean_lottery[['NUMBER DRAWN 1', 
'NUMBER DRAWN 2', 'NUMBER DRAWN 3',
'NUMBER DRAWN 4', 'NUMBER DRAWN 5', 
'NUMBER DRAWN 6', 'BONUS NUMBER']].duplicated().head(20)

Surprisingly, they all came out as False.
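Since head(20) only shows the first 20 rows, a check over the whole dataframe is a useful complement. A minimal sketch on a made-up frame, which (unlike the result above) deliberately contains one repeated row:

```python
import pandas as pd

# Toy frame where the third row repeats the first (made-up data).
toy = pd.DataFrame({
    "NUMBER DRAWN 1": [1, 2, 1],
    "NUMBER DRAWN 2": [5, 6, 5],
    "BONUS NUMBER": [7, 9, 7],
})

flags = toy.duplicated()      # True for rows that repeat an earlier row
has_repeat = flags.any()      # one answer for the whole frame
repeat_count = flags.sum()    # how many rows are repeats

print(has_repeat, repeat_count)
```

Note that duplicated compares rows column by column, so the same numbers stored in a different column order would not count as a repeat.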

So what could have made the TOTAL NUMBER figures repeat? Let's have a look at the rows with duplicated values in TOTAL NUMBER.

#selecting the rows whose TOTAL NUMBER is a duplicate and showing them in full, from the date to the bonus number

dep_lottery = clean_lottery[clean_lottery.duplicated(['TOTAL NUMBER'])]
dep_lottery.head(15)

It turns out the duplicated totals are made up of different combinations of numbers from NUMBER DRAWN 1 to BONUS NUMBER that happen to add up to the same TOTAL NUMBER.
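A small sketch of the same effect on made-up data: two different draws that add up to the same total. By default duplicated flags only the later occurrences, so the first draw with a given total is left out; passing keep=False flags every row that shares its total with another.

```python
import pandas as pd

# Made-up draws: the first two rows differ but share the same total.
toy = pd.DataFrame({
    "NUMBER DRAWN 1": [1, 3, 5],
    "NUMBER DRAWN 2": [2, 4, 6],
    "BONUS NUMBER": [10, 6, 7],
})
number_cols = list(toy)
toy["TOTAL NUMBER"] = toy[number_cols].sum(axis=1)

later_only = toy[toy.duplicated(["TOTAL NUMBER"])]               # default keep='first'
all_sharing = toy[toy.duplicated(["TOTAL NUMBER"], keep=False)]  # every row in a tie
same_draw = toy.duplicated(number_cols).any()                    # identical draws?

print(all_sharing)
print(same_draw)
```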

# Data Visualization

So far so good, it's time to do some visualization.

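#selecting the draws between two dates (sorting the index first, because clean_lottery is currently ordered by TOTAL NUMBER)

lotteryDate = clean_lottery.sort_index().loc['1984-02-25':'1984-11-12']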
lotteryDate.head()
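Slicing by date strings relies on the index being in date order, which is no longer the case once the dataframe has been sorted by TOTAL NUMBER. Here is a minimal sketch with made-up dates and totals that restores the date order before slicing:

```python
import pandas as pd

# Made-up dates and totals for illustration only.
dates = pd.to_datetime(["1984-03-03", "1984-12-01", "1984-02-25", "1984-07-14"])
toy = pd.DataFrame({"TOTAL NUMBER": [120, 200, 95, 150]}, index=dates)

by_total = toy.sort_values(by="TOTAL NUMBER", ascending=False)  # dates now out of order
by_date = by_total.sort_index()                                 # back to date order

window = by_date.loc["1984-02-25":"1984-11-12"]  # both end dates are inclusive

print(window)
```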

fig, ax = plt.subplots(figsize=(12, 12))

#plot the dates on the x-axis and the drawn numbers on the y-axis

ax.bar(lotteryDate.index.values, lotteryDate['NUMBER DRAWN 1'],
       color = 'red')

#set title and labels for axes

ax.set(xlabel = 'Date',
       ylabel = 'NUMBER DRAWN 1',
       title = 'Lotto')

plt.show()

fig, ax = plt.subplots(figsize=(12, 12))

ax.bar(lotteryDate.index.values, lotteryDate['NUMBER DRAWN 2'],
       color = 'purple')

#set title and labels for axes

ax.set(xlabel = 'Date',
       ylabel = 'NUMBER DRAWN 2',
       title = 'Lotto')

plt.tight_layout()

plt.show()
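The two charts above differ only in the column and the colour, so they could also be drawn in a loop. This is just a sketch under assumptions: plot_draws is a hypothetical helper (not part of matplotlib or of the original notebook), and the toy data is made up.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data standing in for lotteryDate.
dates = pd.to_datetime(["1984-02-25", "1984-03-03", "1984-03-10"])
toy = pd.DataFrame(
    {"NUMBER DRAWN 1": [3, 8, 1], "NUMBER DRAWN 2": [11, 10, 14]},
    index=dates,
)

def plot_draws(frame, column_colors):
    """Hypothetical helper: one bar chart per column, sharing the date axis."""
    fig, axes = plt.subplots(len(column_colors), 1, figsize=(12, 12),
                             sharex=True, squeeze=False)
    for ax, (column, color) in zip(axes[:, 0], column_colors.items()):
        ax.bar(frame.index, frame[column], color=color)
        ax.set(ylabel=column, title="Lotto")
    axes[-1, 0].set_xlabel("Date")  # label only the bottom chart
    fig.tight_layout()
    return fig, axes[:, 0]

fig, axes = plot_draws(toy, {"NUMBER DRAWN 1": "red", "NUMBER DRAWN 2": "purple"})
```

Calling plt.show() afterwards displays the figure, and passing lotteryDate instead of the toy frame would plot the real draws.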