【SAS Series】 The First Step for SAS Beginners: Core SAS Rules and the Clinical Trial Dataset Used in This Series
"Before learning SAS, understand the rules and the data—otherwise every step afterward will feel like stepping on LEGO bricks."
This article guides you through SAS from the ground up. We start by understanding the essential syntax rules, then introduce the simulated clinical trial dataset used throughout this series. This isn’t a textbook—it’s a beginner‑friendly guide shaped from my own trial‑and‑error journey. Here’s what we’ll cover:
- What SAS’s four major functions actually do:
  - A plain‑language explanation of data access, data management, data analysis, and data presentation
  - Helping you see SAS as a data‑processing factory
- The fundamentals of SAS syntax rules:
  - How to name variables without making SAS angry
  - What missing values look like and why semicolons matter
  - A simple explanation of the difference between RUN and QUIT
- The division of labor between DATA and PROC steps:
  - DATA: organizing and preparing data
  - PROC: analyzing data
- The two faces of a dataset:
  - Descriptor portion: the dataset’s ID card
  - Data portion: the actual observations
- An introduction to the simulated clinical trial dataset used in this series:
  - The purpose of the DM, EX, CM, VS, and AE tables
  - The logic behind each table’s variables and common mistakes
  - Helping you feel more grounded when practicing data cleaning later
Table of Contents
Introduction
The Four Core Functions of SAS: What Does SAS Actually Do?
Basic Rules of SAS Syntax
Dataset Structure: Descriptor Portion vs. Data Portion
Clinical Trial Dataset Used in This Series
Conclusion & What’s Next
FAQ
Introduction
In the previous SAS article, we completed the SAS Studio account setup and login. Since it’s been a while, here’s the link again in case you want a quick refresher XD.
【SAS Series】 Introduction to SAS Studio and Account Setup
Once you finish creating your SAS Studio account, that moment marks your official entry into the world of SAS.
In this article, I want to walk with you through two things:
The first is building a solid foundation of SAS’s basic rules. If you overlook these rules, you’ll encounter small but annoying issues later when writing code. Understanding them early gives you a clearer sense of direction.
The second is spending some time introducing the simulated clinical trial dataset used throughout the upcoming articles. Starting from the next article, you’ll grow alongside this dataset.
By the way, this entire series will use SAS Studio as the main working environment. All demonstrations, screenshots, and data processing will be done inside SAS Studio.
Some users may be using Base SAS or SAS Enterprise Guide (EG). These mainly differ in interface appearance, but the underlying concepts are the same and transferable.
So everything you learn in SAS Studio (concepts, knowledge, and analysis methods) can also be applied in SAS EG and Base SAS.
Just follow the rhythm of the articles, step by step.
The Four Core Functions of SAS: What Does SAS Actually Do?
What brought you to SAS? For me, it started in a biostatistics class where we had to “run analyses” using SAS based on what we learned from the textbook.
I remember feeling completely unfamiliar with the interface and coding style. It felt distant, confusing, and honestly a bit overwhelming.
After a long period of exploration, I finally became comfortable with SAS—comfortable enough to make a living with it.
That early frustration is exactly why I wanted to write this article.
I fully understand the sense of distance you might feel when first encountering SAS.
But SAS’s core functions are actually very intuitive. You can understand them with four simple ideas:
Data Access
To do anything, you need data.
Whether the data comes from clinical trials, environmental monitoring, or any observed phenomenon, SAS can analyze it.
SAS supports many data formats—CSV, Excel, databases, and more—allowing you to import data into the system in various ways.
So “data access” simply means bringing data into SAS.
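As a rough sketch of what "bringing data into SAS" can look like, the example below writes a tiny CSV to a temporary file and reads it back with PROC IMPORT. The fileref demo_csv, the dataset name work.demo_import, and the sample rows are all made up for illustration; with a real file you would point DATAFILE= at its path instead.

```sas
/* Hypothetical example: create a small CSV in a temporary file */
filename demo_csv temp;

data _null_;
    file demo_csv;
    put 'subject_id,sex,age';
    put 'S001,F,54';
    put 'S002,M,61';
run;

/* Read the CSV into a SAS dataset; the first row supplies the variable names */
proc import datafile=demo_csv
            out=work.demo_import
            dbms=csv
            replace;
    getnames=yes;
run;

/* Look at what was imported */
proc print data=work.demo_import;
run;
```

The next article covers importing in detail, so treat this only as a preview of the idea.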
Data Management
You may receive multiple datasets, low‑quality data, or inconsistent values. Before performing statistical analysis, you must clean, transform, merge, split, categorize, impute, and validate the data.
Some cleaning steps rely on statistical theory (e.g., values beyond several standard deviations), while others rely on domain knowledge (e.g., values below detection limits).
Data management depends not only on SAS coding skills but also on your sensitivity to different types of data.
For me, data management is one of SAS’s strongest—and most interesting—areas.
Cleaning a dataset well enough for analysis is a true test of a SAS programmer’s knowledge and skill.
Data Analysis
The reason we do the first two steps is ultimately to perform statistical analysis.
This is the part textbooks focus on most—descriptive statistics (mean, SD, CI) and inferential statistics (hypothesis testing, regression, multivariate analysis, time series), all of which SAS can handle.
But remember: good analysis depends on high‑quality data.
So data analysis isn’t more important than data management—they’re equally essential.
Data Presentation
After analysis, we often need to present the results.
Sometimes the analysis isn’t the final step—you may need to use the results for further data processing.
A common example is calculating a mean and then using that mean in additional analysis.
So beyond tables and charts, you may need to export Excel files, CSV files, or even structured datasets for downstream use.
This isn’t SAS’s most emphasized strength, but mastering data export is a powerful skill.
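The "calculate a mean, then reuse it" idea can be sketched as follows. This is one possible approach, not the only one: the dataset names (work.after_stats, work.test_centered), the variable after_centered, and the fileref out_csv are assumptions chosen for this example.

```sas
/* Small demo dataset (same toy data used later in this article) */
data work.test;
    input name $ student $ before after;
    datalines;
voice S3 4 7
rice S1 3 12
;
run;

/* Store the mean of AFTER in a dataset instead of only printing it */
proc means data=work.test noprint;
    var after;
    output out=work.after_stats mean=mean_after;
run;

/* Attach the stored mean to every row and compute each row's deviation from it */
data work.test_centered;
    if _n_ = 1 then set work.after_stats(keep=mean_after);
    set work.test;
    after_centered = after - mean_after;
run;

/* Export the result as CSV for downstream use (temporary file as a placeholder) */
filename out_csv temp;

proc export data=work.test_centered
            outfile=out_csv
            dbms=csv
            replace;
run;
```

Here the output of one PROC step becomes the input of a later DATA step, which is what "presentation" means when the audience is another program rather than a person.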
In short, SAS is a complete data‑processing factory: data comes in → cleaned → analyzed → output.
Basic Rules of SAS Syntax
After reading the above, you’re probably eager to start coding—but not so fast. Before writing code, let’s clarify the essential syntax rules and important points that will accompany you throughout your SAS journey.
Variable Naming Rules in SAS
Variables in SAS store the values of each observation. SAS variable names may include:
- Letters
- Numbers
- Underscores (_)
And they must follow these rules:
- Must begin with a letter or an underscore (never a number)
- Cannot contain spaces or special characters such as $ or -
- Maximum length of 32 characters
Valid examples:
Age
Visit_Date
_AEcount
Height_cm
Invalid examples:
$Amount
Drug-Name
123Value
Note: SAS itself uses underscore‑prefixed names, such as the automatic variables _N_ and _ERROR_ in the DATA step. Avoid naming conflicts.
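If you want SAS to check names for you, one option is the NVALID function. The sketch below feeds it the valid and invalid examples from above; the dataset name work.name_check and the variables candidate and is_valid are made up for this example, and 'V7' refers to the default naming rules described in this section.

```sas
/* Check each candidate name against the default (V7) naming rules */
data work.name_check;
    length candidate $32;
    input candidate $;
    /* NVALID returns 1 if the string is a valid variable name, 0 otherwise */
    is_valid = nvalid(strip(candidate), 'V7');
    datalines;
Age
Visit_Date
_AEcount
Height_cm
$Amount
Drug-Name
123Value
;
run;

proc print data=work.name_check;
run;
```

The first four names should come back with is_valid = 1 and the last three with is_valid = 0.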
Missing Values in SAS
Missing values in character variables appear as blank spaces, while missing numeric values appear as a dot (“.”).
This distinction is extremely helpful during data cleaning and validation.
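Here is a small sketch of how the two kinds of missing values behave. The dataset and variable names are hypothetical. In list input like this, a single period in the data lines is read as a missing value for both character and numeric variables.

```sas
/* Toy data: the second row is missing both values, the third only the dose */
data work.missing_demo;
    input name $ dose;
    /* MISSING() returns 1 for a missing value of either type, 0 otherwise */
    name_missing = missing(name);
    dose_missing = missing(dose);
    datalines;
Amy 250
. .
Bob .
;
run;

/* The blank name and the dot for dose are visible in the printed output */
proc print data=work.missing_demo;
run;
```

Using MISSING() rather than comparing against a blank or a dot means the same check works for both variable types.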
The Importance of Semicolons
Each SAS statement must end with a semicolon. Without it, SAS will “eat” the next line and produce errors in the log.
Here’s an example where missing semicolons cause issues:
DATA test;
INPUT name $ student $ before after;
CARDS;
voice S3 4 7
rice S1 3 12
;
RUN;
PROC SORT DATA=test /* Missing semicolon */
BY name student /* Missing semicolon */
RUN;
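Because nothing separates them, SAS reads the last three lines as one long PROC SORT statement, fails to make sense of it, and reports syntax errors in red in the log.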
RUN vs. QUIT
RUN starts the execution of a DATA or PROC step. SAS interprets RUN as “this block is complete—execute it.”
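Often, even if you forget RUN, the step still executes, because SAS treats the next DATA or PROC statement as the end of the previous step.
But if the step is the last one in the program, there is no following statement to close it. In the interactive windowing environment, that step stays pending until you submit RUN.
SAS Studio and Enterprise Guide typically add their own wrapper code around whatever you submit, which usually closes the final step for you, but it is safer to end every step with an explicit RUN.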
DATA test;
INPUT name $ student $ before after;
CARDS;
voice S3 4 7
rice S1 3 12
;
PROC SORT DATA=test;
BY name student;
RUN;
Here, the DATA step runs even without RUN: the semicolon that closes the CARDS data lines and the PROC statement that follows both mark the end of the step.
QUIT, on the other hand, is used to end certain PROC steps—such as PROC SQL—that remain open waiting for more commands.
Not all PROCs require QUIT, only those that stay active.
In short: RUN executes, QUIT ends.
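To see QUIT in action, here is a minimal PROC SQL sketch using the same toy data. The column aliases change and n_rows are made up for the example.

```sas
/* Toy dataset */
data work.test;
    input name $ student $ before after;
    datalines;
voice S3 4 7
rice S1 3 12
;
run;

/* PROC SQL runs each statement as soon as it reaches the semicolon,
   so no RUN is needed; the procedure stays open until QUIT */
proc sql;
    select name, after - before as change
    from work.test;

    select count(*) as n_rows
    from work.test;
quit;
```

Both SELECT statements run inside a single PROC SQL session, and QUIT closes that session.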
The Two Main Characters in SAS: DATA and PROC
SAS programs are built from two main blocks: DATA steps and PROC steps.
DATA steps handle data preparation—reading raw data, creating variables, filtering observations, and producing new datasets.
PROC steps process and analyze the prepared data: sorting, statistics, reports, and more.
Understanding their roles is the first step to mastering SAS.
/* DATA step */
DATA test;
INPUT name $ student $ before after;
CARDS;
voice S3 4 7
rice S1 3 12
;
RUN;
/* PROC step */
PROC SORT DATA=test;
BY name student;
RUN;
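To make the division of labor a bit more concrete, here is a sketch in which the DATA step creates a new variable and filters rows, and a PROC step then summarizes the result. The names work.test_change and change are assumptions for this example.

```sas
/* DATA step: read the raw values */
data work.test;
    input name $ student $ before after;
    datalines;
voice S3 4 7
rice S1 3 12
;
run;

/* DATA step: derive a new variable and keep only rows that improved */
data work.test_change;
    set work.test;
    change = after - before;
    if change > 0;
run;

/* PROC step: summarize the prepared data */
proc means data=work.test_change n mean min max;
    var change;
run;
```

The DATA steps decide what the data looks like; PROC MEANS only reads it and reports on it.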
Dataset Structure: Descriptor Portion vs. Data Portion
In SAS, the dataset is the core of analysis. It contains variables and the values of each observation.
Setting aside any statistics computed from it, a SAS dataset holds two kinds of information:
Descriptor Portion
The descriptor portion describes what the dataset contains, giving you a quick overview of its structure.
It includes:
- Dataset name
- Creation date
- Number of variables
- Number of observations (rows)
- Variable types, lengths, and formats
You can view this information using PROC CONTENTS.
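A minimal way to try PROC CONTENTS yourself is to run it on SASHELP.CLASS, a small sample table that ships with SAS. It is used here only as a stand‑in, not as part of this series’ dataset.

```sas
/* Show the descriptor portion: dataset name, creation date,
   number of observations and variables, and each variable's
   type, length, and format */
proc contents data=sashelp.class;
run;
```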
Data Portion
The data portion contains the actual observations. When displayed as a table, the header row lists the variable names, and each row below represents one observation.
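To look at the data portion of a dataset, PROC PRINT is the simplest tool. The sketch below again uses the SASHELP.CLASS sample table as a stand‑in and limits the output to the first five observations.

```sas
/* Show the data portion: one row per observation, one column per variable */
proc print data=sashelp.class(obs=5);
run;
```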
Clinical Trial Dataset Used in This Series
Throughout this SAS learning series, we’ll cover data import, data cleaning, and statistical analysis.
To help you practice directly, we need to introduce the dataset used in the upcoming articles.
I previously worked as a SAS programmer in the clinical trial industry. The EDC systems I used most were IBMCD and later Zelta.
For convenience, this article simulates a clinical trial dataset based on Zelta’s structure and workflow.
This dataset was generated entirely using Copilot with a predefined trial background. It contains no real‑world data.
This allows you to learn SAS while also getting a glimpse of what clinical trial data may look like.
However, since I’ve been away from the clinical trial industry for about two years, Zelta may have changed. Real‑world datasets vary depending on trial complexity and EDC systems.
Our focus remains on learning SAS.
Here is the dataset for download:
Simulated Clinical Trial Dataset
The simulated dataset is intentionally simpler than real clinical trial data.
Overview of the Trial
This is a simulated real‑world study of an oral medication for Herpes Zoster (shingles), referred to as Drug A.
There are 60 subjects, divided into Drug A and placebo groups.
The dataset includes five major components:
- DM (Demographics)
- EX (Exposure)
- CM (Concomitant Medication)
- VS (Visit‑based efficacy)
- AE (Adverse Events)
This simulated clinical trial focuses on patients diagnosed with shingles (Herpes Zoster). It observes symptom improvement after taking Drug A, tracks adverse events, and compares outcomes between the treatment and placebo groups.
Knowing this background will make the upcoming SAS analysis exercises feel more meaningful.
To help you practice data cleaning, the dataset intentionally includes small errors such as:
- Dates recorded in the wrong order
- Missing or negative doses
- Overlapping or duplicated adverse event records
These issues are lightweight, designed for practice in later articles.
Next, let’s briefly introduce each dataset and its logic.
DM (Demographics)
Demographics data records each subject’s basic information at trial entry. It includes subject ID, sex, age, race, and enrollment date.
Common comorbidities such as hypertension and diabetes are also recorded, along with height and weight, which may affect drug metabolism or dosing.
EX (Exposure)
Exposure data records the actual treatment received. Each subject is assigned to Drug A or placebo, indicated by the variable Drug_Name.
Doses for Drug A range from 250 to 1000 mg, while placebo doses are 0 mg. All treatments are oral.
Start and end dates of dosing are also included. Some records intentionally contain missing or negative doses for practice in quality checks.
In real trials, such errors often arise from data entry and must be resolved through queries.
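As a preview of what such a quality check might look like, here is a sketch on a few made‑up exposure records. Only Drug_Name is a variable name taken from this article; Subject_ID, Dose_mg, issue, and the dataset names are hypothetical and may not match the downloadable dataset.

```sas
/* Hypothetical EX-like records, including one negative and one missing dose */
data work.ex_demo;
    input Subject_ID $ Drug_Name $ Dose_mg;
    datalines;
S001 DrugA 500
S002 Placebo 0
S003 DrugA -250
S004 DrugA .
;
run;

/* Keep only the problem records and label the problem */
data work.ex_issues;
    set work.ex_demo;
    length issue $20;
    /* Test for missing first: SAS sorts a missing numeric value below
       every number, so "Dose_mg < 0" alone would also catch missing doses */
    if missing(Dose_mg) then issue = 'Missing dose';
    else if Dose_mg < 0 then issue = 'Negative dose';
    if not missing(issue);
run;

proc print data=work.ex_issues;
run;
```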
CM (Concomitant Medication)
Concomitant medication data records drugs taken during the trial alongside the study treatment. For shingles, these may include painkillers, antivirals, anti‑inflammatories, or vaccination history.
Start and end dates are included. Such records are important because they may affect efficacy or safety outcomes.
Some entries intentionally contain errors, such as end dates earlier than start dates, reflecting common real‑world issues.
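A check for reversed dates could be sketched like this. The variable names (Subject_ID, CM_Name, Start_Date, End_Date), the dataset names, and the records are invented for the example and are not taken from the downloadable dataset.

```sas
/* Hypothetical CM-like records; the second one ends before it starts */
data work.cm_demo;
    input Subject_ID $ CM_Name :$20. Start_Date :date9. End_Date :date9.;
    format Start_Date End_Date date9.;
    datalines;
S001 Acetaminophen 01MAR2024 05MAR2024
S002 Ibuprofen 10MAR2024 08MAR2024
S003 Acyclovir 12MAR2024 20MAR2024
;
run;

/* Keep records whose end date is earlier than the start date.
   Requiring both dates to be present avoids flagging a missing
   end date, because missing compares as smaller than any date */
data work.cm_date_issues;
    set work.cm_demo;
    if not missing(Start_Date) and not missing(End_Date)
       and End_Date < Start_Date;
run;

proc print data=work.cm_date_issues;
run;
```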
VS (Visit‑based efficacy)
Visit data (VS) records outcomes at scheduled visits. In this trial, each subject has four visits: Day 3, Day 7, Day 14, and Day 28.
At each visit, efficacy indicators are recorded: skin lesion severity, pain score, and quality‑of‑life questionnaire results.
This allows observation of symptom improvement over time and comparison between Drug A and placebo groups.
AE (Adverse Events)
Adverse event (AE) data records all events occurring during the trial. AE data is central to safety analysis, and serious events may require reporting.
Each AE record includes the event term, start and end dates, severity, and relationship to treatment.
For teaching purposes, the dataset includes overlapping and duplicate events, as well as date errors—common issues in quality checks.
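One common way to separate exact duplicate records is PROC SORT with the NODUPKEY and DUPOUT= options, sketched below on made‑up data. The variable names and dataset names are hypothetical, and this only catches records that match on every BY variable; overlapping events need a different check.

```sas
/* Hypothetical AE-like records; the first two are identical */
data work.ae_demo;
    input Subject_ID $ AE_Term :$20. Start_Date :date9. End_Date :date9.;
    format Start_Date End_Date date9.;
    datalines;
S001 Headache 02MAR2024 03MAR2024
S001 Headache 02MAR2024 03MAR2024
S002 Nausea 05MAR2024 06MAR2024
;
run;

/* Keep the first record per BY group in ae_unique
   and send the removed duplicates to ae_duplicates for review */
proc sort data=work.ae_demo
          out=work.ae_unique
          dupout=work.ae_duplicates
          nodupkey;
    by Subject_ID AE_Term Start_Date End_Date;
run;

proc print data=work.ae_duplicates;
run;
```

Writing the duplicates to their own dataset, rather than silently dropping them, leaves a record of what was removed.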
Conclusion & What’s Next
By now, you’ve accomplished two things:
- Learned the basic rules of SAS syntax
- Got familiar with the simulated clinical trial dataset used in this series
In the next article, we’ll learn how to import data into SAS and see the simulated dataset inside SAS Studio.
That’s when you’ll truly start “hands‑on” coding.
Step by step, you’ll discover SAS isn’t as intimidating as it seems—and you’ll gain a growing sense of achievement.
See you in the next article!
FAQ
I’m just starting SAS—what should I learn first?
Focus on syntax rules (variable naming, semicolons, RUN/QUIT) before diving into coding.
Why introduce the clinical trial dataset first?
Because all exercises use it. Knowing the dataset upfront prevents you from coding in the dark.
What’s the difference between DATA and PROC steps?
A DATA step prepares the dataset; a PROC step analyzes it. Most programs need both working together.
What’s the purpose of the descriptor portion?
It’s like the dataset’s ID card—quickly showing variable types, lengths, and counts.
After reading this article, can I start coding SAS?
Yes—next article we’ll teach you how to import data into SAS, giving you clear direction.





