In Stata, comparing data across rows based on specific conditions is a common task, especially when dealing with panel data or repeated measures. This article addresses a frequent challenge in hospital data analysis: how to compare rows within patient groups (mrn
) and drop observations based on time windows relative to specific events. We will explore how to effectively “Compare 2 Rows Based On Condition Stata” to solve this problem.
Understanding the Problem: Time-Window Based Observation Filtering
Imagine you are working with hospital encounter data where each row represents a patient visit. You need to identify and potentially remove certain types of encounters (let’s say, routine visits marked by id3==0
) if they do not fall within a specific time window around a “special hospitalization” (marked by id3==1
) for the same patient (mrn
).
Specifically, the requirement is:
For each patient (
mrn
), identify encounters whereid3==0
. Then, check if the discharge date (dcdate
) of theseid3==0
encounters falls within a window defined by:
- 3 days before the admission date of any encounter with
id3==1
for the same patient.- 30 days after the discharge date of any encounter with
id3==1
for the same patient.If the
dcdate
of anid3==0
encounter is outside this time window (defined by anyid3==1
encounter for the samemrn
), then thatid3==0
encounter should be dropped.
This task involves comparing rows based on conditions within groups (mrn
), considering dates and flags (id3
). Stata’s by
prefix and conditional statements are essential tools for this type of analysis.
Stata Code Solution: Step-by-Step Implementation
To solve this problem in Stata, we can use a combination of sorting, by
prefix, and conditional logic. Here’s a step-by-step Stata code solution:
First, let’s input the sample data provided:
clear input mrn date dcdate minus3 plus30 id3
1633188 19200 19205 19197 19235 1
1633188 19308 19311 19305 19341 1
1633188 19378 19381 19375 19411 1
1633188 19423 19424 19420 19454 1
1633188 19289 19297 . . 0
1633188 19434 19435 19431 19465 0
1633188 19323 19323 . . 0
1633188 19361 19361 . . 0
1633188 19246 19246 . . 0
1633188 19253 19256 . . 0
1633188 19324 19325 19321 19355 0
1633188 19348 19354 . . 0
1633188 19375 19376 . . 0
1635367 19390 19393 19387 19423 1
1635367 19432 19434 19429 19464 1
1635367 19466 19470 19463 19500 1
1635367 19522 19522 . . 0
1635367 19430 19431 . . 0
1635367 19551 19551 . . 0
1635367 19604 19604 . . 0
1635367 19229 19233 . . 0
1635367 19185 19185 . . 0
1635367 19200 19200 . . 0
1635367 19585 19589 . . 0
1635367 19228 19228 . . 0
1635367 19633 19633 . . 0
1635367 19515 19515 . . 0
1635367 19592 19592 . . 0
1635367 19257 19257 . . 0
1635367 19618 19618 . . 0
1635367 19208 19208 . . 0
1635367 19508 19508 . . 0
1635367 19218 19218 . . 0
1635367 19458 19458 . . 0
1635367 19253 19253 . . 0
1635367 19479 19479 . . 0
1635367 19647 19650 . . 0
1635367 19640 19640 . . 0
1635367 19577 19577 . . 0
1635367 19222 19222 . . 0
1635367 19578 19578 . . 0
1635367 19528 19528 . . 0
1635367 19638 19638 . . 0
1635367 19600 19600 . . 0
1635367 19250 19250 . . 0
1635367 19590 19590 . . 0
1635367 19561 19561 . . 0
1635367 19688 19688 . . 0
1635367 19624 19624 . . 0
1635367 19204 19204 . . 0
1635367 19269 19269 . . 0
1635367 19613 19613 . . 0
1635367 19542 19542 . . 0
1635367 19519 19519 . . 0
1635367 19526 19526 . . 0
1635367 19556 19556 . . 0
1635367 19673 19673 . . 0
1635367 19424 19424 . . 0
1635367 19564 19564 . . 0
1635367 19463 19463 . . 0
1635367 19631 19631 . . 0
1635367 19417 19417 . . 0
1635367 19682 19682 . . 0
1635367 19675 19675 . . 0
1635367 19500 19500 . . 0
1635367 19625 19625 . . 0
1635367 19485 19485 . . 0
1635367 19192 19192 . . 0
1635367 19215 19215 . . 0
1635367 19662 19662 . . 0
1635367 19501 19505 . . 0
1635367 19493 19493 . . 0
1642544 20063 20075 20060 20105 1
1642544 19634 19634 . . 0
1642544 19837 19837 . . 0
1642544 19830 19830 . . 0
1642544 20003 20003 . . 0
1642544 19855 19855 . . 0
1642544 20124 20124 . . 0
1642544 19697 19697 . . 0
1642544 19998 19998 . . 0
1642544 19816 19816 . . 0
1642544 19642 19642 . . 0
1642544 19828 19828 . . 0
1642544 19824 19824 . . 0
1642544 19640 19640 . . 0
1642544 19813 19813 . . 0
1642544 19823 19823 . . 0
1642544 19607 19612 . . 0
1642544 20105 20110 . . 0
1642544 20013 20013 . . 0
1642544 20111 20111 . . 0
1642544 19830 19830 . . 0
1642544 19977 19977 . . 0
1642544 19823 19823 . . 0
1642544 20100 20103 . . 0
1642544 19760 19760 . . 0
1642544 20045 20045 . . 0
1642544 19824 19824 . . 0
1642544 19797 19802 . . 0
1642544 19893 19893 . . 0
end
destring, replace format date %td label variable date "Date of admission" format dcdate %td label variable dcdate "Date of discharge" format minus3 %td label variable minus3 "Three days less date of admission" format plus30 %td label variable plus30 "Thirty days after date of discharge"
Now, let’s implement the logic to drop observations.
Step 1: Sort the data
First, sort your data by mrn
and date
. This is crucial for using the by mrn:
prefix effectively.
sort mrn date
Step 2: Identify time windows based on id3==1
For each patient (mrn
), we need to find the earliest date minus 3 days (minus3
) and the latest date plus 30 days (plus30
) associated with any id3==1
encounter. We can use egen
with min()
and max()
functions along with the if
condition.
by mrn: egen min_minus3 = min(minus3) if id3 == 1
by mrn: egen max_plus30 = max(plus30) if id3 == 1
This code does the following:
by mrn:
: Performs the following operation separately for each uniquemrn
.egen min_minus3 = min(minus3) if id3 == 1
: Creates a new variablemin_minus3
. For eachmrn
, it finds the minimum value ofminus3
but only considers observations whereid3
is equal to 1. If there are multipleid3==1
encounters, it takes the minimum of theirminus3
dates. If there are noid3==1
encounters for a givenmrn
,min_minus3
will be system missing (.
).egen max_plus30 = max(plus30) if id3 == 1
: Similarly, createsmax_plus30
, taking the maximum value ofplus30
for observations withid3==1
within eachmrn
.
Step 3: Drop observations based on the time window condition
Now, we can drop the observations where id3==0
and their discharge date (dcdate
) is not within the range defined by min_minus3
and max_plus30
. We will use the inrange()
function for this comparison.
drop if id3 == 0 & !inrange(dcdate, min_minus3, max_plus30)
This command does:
drop if id3 == 0 & ...
: Drops observations only ifid3
is equal to 0 AND the following condition is true.!inrange(dcdate, min_minus3, max_plus30)
: This is the core condition.inrange(dcdate, min_minus3, max_plus30)
: Checks ifdcdate
is within the range frommin_minus3
tomax_plus30
(inclusive).!
: Negates the result ofinrange()
. So, it becomes true ifdcdate
is outside the range.
Therefore, the entire drop if
command drops observations where id3==0
and their dcdate
is not within the calculated time window for their respective mrn
.
Complete Stata Code:
Putting it all together, the complete Stata code to achieve the desired row comparison and dropping is:
clear input mrn date dcdate minus3 plus30 id3
1633188 19200 19205 19197 19235 1
1633188 19308 19311 19305 19341 1
1633188 19378 19381 19375 19411 1
1633188 19423 19424 19420 19454 1
1633188 19289 19297 . . 0
1633188 19434 19435 19431 19465 0
1633188 19323 19323 . . 0
1633188 19361 19361 . . 0
1633188 19246 19246 . . 0
1633188 19253 19256 . . 0
1633188 19324 19325 19321 19355 0
1633188 19348 19354 . . 0
1633188 19375 19376 . . 0
1635367 19390 19393 19387 19423 1
1635367 19432 19434 19429 19464 1
1635367 19466 19470 19463 19500 1
1635367 19522 19522 . . 0
1635367 19430 19431 . . 0
1635367 19551 19551 . . 0
1635367 19604 19604 . . 0
1635367 19229 19233 . . 0
1635367 19185 19185 . . 0
1635367 19200 19200 . . 0
1635367 19585 19589 . . 0
1635367 19228 19228 . . 0
1635367 19633 19633 . . 0
1635367 19515 19515 . . 0
1635367 19592 19592 . . 0
1635367 19257 19257 . . 0
1635367 19618 19618 . . 0
1635367 19208 19208 . . 0
1635367 19508 19508 . . 0
1635367 19218 19218 . . 0
1635367 19458 19458 . . 0
1635367 19253 19253 . . 0
1635367 19479 19479 . . 0
1635367 19647 19650 . . 0
1635367 19640 19640 . . 0
1635367 19577 19577 . . 0
1635367 19222 19222 . . 0
1635367 19578 19578 . . 0
1635367 19528 19528 . . 0
1635367 19638 19638 . . 0
1635367 19600 19600 . . 0
1635367 19250 19250 . . 0
1635367 19590 19590 . . 0
1635367 19561 19561 . . 0
1635367 19688 19688 . . 0
1635367 19624 19624 . . 0
1635367 19204 19204 . . 0
1635367 19269 19269 . . 0
1635367 19613 19613 . . 0
1635367 19542 19542 . . 0
1635367 19519 19519 . . 0
1635367 19526 19526 . . 0
1635367 19556 19556 . . 0
1635367 19673 19673 . . 0
1635367 19424 19424 . . 0
1635367 19564 19564 . . 0
1635367 19463 19463 . . 0
1635367 19631 19631 . . 0
1635367 19417 19417 . . 0
1635367 19682 19682 . . 0
1635367 19675 19675 . . 0
1635367 19500 19500 . . 0
1635367 19625 19625 . . 0
1635367 19485 19485 . . 0
1635367 19192 19192 . . 0
1635367 19215 19215 . . 0
1635367 19662 19662 . . 0
1635367 19501 19505 . . 0
1635367 19493 19493 . . 0
1642544 20063 20075 20060 20105 1
1642544 19634 19634 . . 0
1642544 19837 19837 . . 0
1642544 19830 19830 . . 0
1642544 20003 20003 . . 0
1642544 19855 19855 . . 0
1642544 20124 20124 . . 0
1642544 19697 19697 . . 0
1642544 19998 19998 . . 0
1642544 19816 19816 . . 0
1642544 19642 19642 . . 0
1642544 19828 19828 . . 0
1642544 19824 19824 . . 0
1642544 19640 19640 . . 0
1642544 19813 19813 . . 0
1642544 19823 19823 . . 0
1642544 19607 19612 . . 0
1642544 20105 20110 . . 0
1642544 20013 20013 . . 0
1642544 20111 20111 . . 0
1642544 19830 19830 . . 0
1642544 19977 19977 . . 0
1642544 19823 19823 . . 0
1642544 20100 20103 . . 0
1642544 19760 19760 . . 0
1642544 20045 20045 . . 0
1642544 19824 19824 . . 0
1642544 19797 19802 . . 0
1642544 19893 19893 . . 0
end
destring, replace format date %td label variable date "Date of admission" format dcdate %td label variable dcdate "Date of discharge" format minus3 %td label variable minus3 "Three days less date of admission" format plus30 %td label variable plus30 "Thirty days after date of discharge"
sort mrn date
by mrn: egen min_minus3 = min(minus3) if id3 == 1
by mrn: egen max_plus30 = max(plus30) if id3 == 1
drop if id3 == 0 & !inrange(dcdate, min_minus3, max_plus30)
After running this code, Stata will have dropped the id3==0
observations that fall outside the time windows defined by the id3==1
observations for each patient, effectively comparing rows based on the specified condition.
Addressing the User’s Concerns: _n
and _N
The original user expressed concern about using _n
(observation number within group) and _N
(total observations in group) when group sizes vary or when the position of id3==1
instances is not standard.
In this specific solution, we do not directly use _n
or _N
. The by mrn:
prefix handles the grouping automatically, regardless of group size or the number of id3==1
observations within each group. egen min()
and egen max()
functions, used with the by
prefix and the if id3==1
condition, efficiently find the minimum minus3
and maximum plus30
dates for each patient, even if the number of observations and the position of id3==1
rows vary across patients.
The inrange()
function then performs the comparison for each id3==0
row against the calculated min_minus3
and max_plus30
for its corresponding mrn
. This approach is robust and adapts to different group structures without needing to explicitly manage _n
or _N
.
Conclusion: Efficient Conditional Row Comparison in Stata
This article demonstrated how to effectively “compare 2 rows based on condition stata” to filter hospital encounter data based on time windows related to specific hospitalization events. By using by mrn:
, egen with min()
and max()
, and the inrange()
function, we can efficiently and accurately implement complex row comparisons based on conditions within groups, without the complexities of manual indexing or worrying about varying group sizes. This approach provides a powerful and flexible solution for data manipulation tasks in Stata, especially in healthcare and longitudinal data analysis.