Page 67 - ISMCON souvenir 2021
“Research”, a way of systematically searching and studying materials and sources to establish facts and
reach new conclusions, is not complete without data collection, statistical analysis and interpretation of the
collected data; this includes steps to organize, summarize and communicate the collected information in
a meaningful way. This is a vital skill for researchers and professionals in many domains, such as
Economics, Machine Learning, Data Mining and Health-care. Biostatistics, an applied branch of Statistics,
is the science that helps (i) in managing medical uncertainties and (ii) researchers to decide on treatments
or to identify factors contributing to diseases. It is used extensively in Epidemiology, the basic
science of Public Health, which uses Statistics and Research Methodologies to arrive at conclusions about
diseases within a given population and to identify the causes or risks of those diseases. Biostatistics
also helps in designing clinical trials and drawing conclusions from them. It is important for the investigator
and the interpreting clinician to understand the basics of Biostatistics for two reasons: (1) to choose the right
statistical test based on the nature of the data and (2) to judge whether an analysis has been carried out
correctly. Biostatistical analysis is key to conducting new clinical research; hence, it has become one
of the foundations of evidence-based clinical practice. Biostatistics involves a thorough understanding
of: (a) variable types, (b) distribution of data, (c) hypothesis testing, (d) statistical tests, (e) measures of
association, (f) regression analysis and (g) diagnostic tests. The work of Biostatisticians focuses
on Epidemiology, Clinical Trials, Systematic Review and Meta-Analysis, Observational and Complex
Interventional Studies, Population Genetics, Statistical Genetics and Systems Biology, in which they
are involved in designing, conducting and analyzing studies, calculating sample sizes, measuring random
errors and interpreting the statistical significance of the results. Biostatistics serves as a boon to medical
research by preventing fraud in clinical trials; investigating proposed medical treatments; assessing the
relative benefits of competing therapies; establishing optimal treatment combinations; reducing
misclassifications; improving knowledge of diseases; and helping to identify new treatments and medical devices.
Keywords: Statistical analysis, Biostatistics, Epidemiology, Public health, Clinical trials.
OS33: APPLICATIONS OF MACHINE LEARNING MODELS
USING SEER DATABASE: A REVIEW
Kiruthika G a, Vasna Joshua b
Affiliation:
a PhD Scholar, Madras University, Chennai.
b Scientist-C, ICMR-National Institute of Epidemiology, Chennai.
kiruthikabiostat@gmail.com
Keywords: machine learning, SEER program, review, cancer
The Surveillance, Epidemiology, and End Results (SEER) Program is an authoritative source for cancer
statistics in the United States. It aims at reducing the burden of cancer on the U.S. population. There
are a number of SEER registries starting from the year 1975. The latest registry contains information
about 11,865,152 cancer cases from 2000 to 2018. There is information about the demographic profile,
behavior of the patient, cancer stage, cancer type, cause of death, insurance details and other
site-specific details for the cases. Such a large cancer dataset can be used to solve various public health
problems related to cancer.
Given the size of the dataset, machine learning is an effective tool for analyzing it. Supervised
machine learning models can be used for prediction, whereas unsupervised models can be used
to identify unknown patterns in the data. More specifically, unsupervised models are used for feature selection,
which means selecting the most important features in the data to serve as predictors in prediction
modelling. Feature engineering involves tasks such as feature transformation and aggregation, which
are essential for handling problems like multicollinearity and for improving the performance of the models.
When the data volume is large in terms of both the number of records and the number of features, feature
engineering becomes complex. Machine learning techniques are used for feature engineering to overcome
this complexity.
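As a rough illustration of the multicollinearity problem mentioned above, the following minimal sketch filters out features that are almost perfectly correlated with a feature already kept. It is only one simple approach and is not taken from the studies under review. The data are synthetic, and the column names (`age_years`, `age_months`, `tumor_size_mm`), the helper `drop_correlated` and the 0.9 threshold are all hypothetical choices made for this example; they are not actual SEER variable names.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.9):
    """Greedy filter: keep a feature only if its absolute correlation with
    every feature kept so far does not exceed the threshold."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Synthetic data (NOT SEER data); all column names are hypothetical.
rng = random.Random(0)
n = 200
age_years = [rng.uniform(30, 90) for _ in range(n)]
# A near-duplicate of age, expressed in months with a little noise.
age_months = [12 * a + rng.gauss(0, 1) for a in age_years]
tumor_size_mm = [rng.uniform(1, 100) for _ in range(n)]

features = {
    "age_years": age_years,
    "age_months": age_months,
    "tumor_size_mm": tumor_size_mm,
}

selected = drop_correlated(features)
print(selected)
```

Because the filter is greedy, the result depends on the order in which features are listed. On a dataset with many features, a variance inflation factor or a regularized model may be a more suitable choice.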

