andrew brooks


This is mostly a running list of data.table tricks that took me a while to figure out, either by digging into the official documentation, adapting StackOverflow posts, or, more often than not, experimenting for hours. I’d like to persist these discoveries somewhere with more memory than my head (hello internet) so I can reuse them after my mental memory forgets them. Consider it a less organized, less concise addition to DataCamp’s sweet cheat sheet for the basics.

Most, if not all of these techniques were developed for real data science projects and provided some value to my data engineering. I’ve generalized everything to the mtcars dataset which might not make this value immediately clear in this slightly contrived context. This list is not intended to be comprehensive as DataCamp’s data.table cheatsheet is. OK, enough disclaimers!

Some more advanced functionality from data.table creator Matt Dowle here.

1. DATA STRUCTURES & ASSIGNMENT

Columns of lists, summary table (long and narrow).

This could be useful, but is easily achievable using traditional methods.

summary table (short and narrow)

Add all categories of gear for each cyl to original data.table as a list.

This is more nifty. It’s so simple, I find myself using this trick to quickly explore data ad hoc at the command line. Can also be useful for more serious data engineering.

Update 10/29/2015: Per these comments on StackOverflow referencing my post, t[,gearsL:=list(list(unique(gear))), by=cyl] can be more elegantly written as t[,gearsL:=.(list(unique(gear))), by=cyl]. Thanks for pointing out my unnecessarily verbose and unusual syntax! I think I wrote the first thing that worked when I posted this, not realizing the normal .( syntax was equivalent to the outer list.
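A minimal sketch of the trick on mtcars (the column name gearsL follows the post; t is just a data.table copy of mtcars):

```r
library(data.table)

t <- as.data.table(mtcars)

# For each cyl group, collect the unique gear values into a single list,
# stored in every row of that group (a column of lists).
t[, gearsL := .(list(unique(gear))), by = cyl]
```

Every row of a given cyl group now carries the same list of that group's gears.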

Extract the second element of each list in gearsL and create the column gearL1. This isn’t that groundbreaking, but it explores how to access elements of columns which are constructed of lists of lists. lapply is your friend.
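A sketch of both variants, assuming the gearsL list column from above (gearL1 and gearL1b are the new column names):

```r
library(data.table)

t <- as.data.table(mtcars)
t[, gearsL := .(list(unique(gear))), by = cyl]

# gearsL holds one list per row; lapply walks it element-wise.
# Pull out the second element of each row's list:
t[, gearL1 := lapply(gearsL, function(x) x[2])]

# The more succinct version: pass the indexing function `[` straight
# to lapply, with the index as an extra argument.
t[, gearL1b := lapply(gearsL, `[`, 2)]
```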

Update 9/24/2015: Per Matt Dowle’s comments, a slightly more syntactically succinct way of doing this:

Calculate all the gears for all cars of each cyl (excluding the current row). This can be useful for comparing observations to the mean of a group, where the group mean is not biased by the observation of interest.

Update 9/24/2015: Per Matt Dowle’s comments, this achieves the same as above.

This is actually a base R trick that I didn’t discover until working with data.table. See ?`{` for some documentation and examples. I’ve only used it within the J slot of data.table, it might be more generalizable. I find it pretty useful for generating columns on the fly when I need to perform some multi-step vectorized operation. It can clean up code by allowing you to reference the same temporary variable by a concise name rather than rewriting the code to re-compute it.

Defaults to just returning the last object defined in the braces unnamed.

We can be more explicit by passing a named list of what we want to keep.

Can also write it like this without semicolons.
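A sketch of the pattern: the temporary variable lives only inside the braces, and only the named list at the end comes back (pwr, max_pwr, min_pwr are names of my choosing):

```r
library(data.table)

dt <- as.data.table(mtcars)

# pwr exists only inside the braces; only the named list at the end
# is returned, so no throwaway column is materialized.
res <- dt[, {
  pwr <- hp / wt
  list(max_pwr = max(pwr), min_pwr = min(pwr))
}, by = cyl]
```

Without `{}` you would have to write `hp / wt` twice, once per summary.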

This is trickier with := assignments… I don’t think := is intended to work when wrapped in {. Assigning multiple columns with := at once does not allow you to use the first columns you create while building the ones after, as we did with = inside the { above. Chaining and then dropping unwanted variables is a messy workaround… still exploring this one.

I still haven’t worked much with the loop + set framework. I’ve been able to achieve pretty much everything with := which is more flexible and powerful. However, if you must loop, set is orders of magnitude faster than native R assignments within loops. Here’s a snippet from data.table news a while back:

data.table creators do favor set for some things, like this task, which can also be done with lapply and .SD. I was actually directed to this solution after I posed this question on StackOverflow. I was also pleased to learn that the functionality I was looking for – applying a function to a subset of columns with .SDcols while preserving the untouched columns – was added as a feature request.

Note this feature is only available in version 1.9.5 (currently on GitHub, not CRAN). Base R surprisingly does not have great tools for dealing with leads/lags of vectors, which most social science statistical software (Stata, SAS, even FAME, which I used in my formative data years) comes equipped with out of the box.

shift with by

This is useful, but note that the columns operated on must be atomic vectors or lists; that is, they must exist before running the computation. Columns that reference other columns in the same set need to be built individually or chained.
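A sketch of shift with by (the lag/lead column names are mine):

```r
library(data.table)

dt <- as.data.table(mtcars)

# One-period lag of mpg within each cyl group; the first row of each
# group becomes NA.
dt[, mpg_lag1 := shift(mpg, n = 1, type = "lag"), by = cyl]

# shift() also takes a vector of offsets and returns a list, so several
# lead/lag columns can be created in one := call:
dt[, c("mpg_lead1", "mpg_lead2") := shift(mpg, 1:2, type = "lead"), by = cyl]
```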

This is the advised way to assign a new column whose name you already have determined and saved as a character. Simply surround the character object in parentheses.
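For instance (newcol and kpl are illustrative names):

```r
library(data.table)

dt <- as.data.table(mtcars)

newcol <- "kpl"   # desired column name held in a character object
# Wrapping the object in parentheses makes data.table evaluate it,
# so the new column is named "kpl", not "newcol":
dt[, (newcol) := mpg * 0.4251]
```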

This is the old (now deprecated) way, which still works for now. Not advised.

Calculate a function over a group (using by ) excluding each entity in a second category.

This title probably doesn’t immediately make much sense. Let me explain what I’m going to calculate and why with an example. We want to compare the mpg of each car to the average mpg of cars in the same class (the same # of cylinders). However, we don’t want to bias the group mean by including the car we want to compare to the average in that average.

This assumption doesn’t appear useful in this example, but assume that gear + cyl uniquely identify the cars. In the real project where I faced this problem, I was calculating an indicator for an appraiser relative to the average of all other appraisers in their zip3 (cyl was really the zipcode and gear was the appraiser’s ID).

0.a Biased mean: simple mean by cyl

However, we want to know, for each row, the mean among all the other cars with the same # of cyls, excluding that car.
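As a quick sketch of the target quantity — a row-wise leave-one-out mean (this is just the arithmetic, not the .GRP methods below; mpg_loo is my name):

```r
library(data.table)

dt <- as.data.table(mtcars)

# Leave-one-out mean: subtract each row's mpg from its group sum,
# then divide by the group size minus one.
dt[, mpg_loo := (sum(mpg) - mpg) / (.N - 1), by = cyl]
```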

1.a .GRP without setting key

Update 9/24/2015: Per Matt Dowle’s comments, this also works with slightly less code. For my simple example, there was also a marginal speed gain. Time savings relative to the .GRP method will likely increase with the complexity of the problem.

1.b Same as 1.a, but a little faster

Why does this work? 1.b sets the key first.

{} is used to suppress intermediate output.

Building up

No surprises here.

Nested data.tables and by statements

This chunk shows what happens with two by statements nested within two different data.tables. Explanatory purposes only - not necessary for our task. n counts the # of cars in that cyl . N counts the number of cars by cyl and gear .

Calculating “unbiased mean”

This is in a summary table. This would need to be merged back onto dt if that is desired.

Non-function direct way

Using a vectorized approach to calculate the unbiased mean for each combination of gear and cyl . Mechanically, it calculates the “biased average” for all cars by cyl . Then subtract off the share of cars with the combination of gear and cyl that we want to exclude from the average and add that share. Then extrapolate out this pared down mean.
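The arithmetic just described can be sketched like this (the helper columns cyl_sum, cyl_n and the result mpg_excl_gear are my names, not the post's):

```r
library(data.table)

dt <- as.data.table(mtcars)

# cyl-level totals first...
dt[, `:=`(cyl_sum = sum(mpg), cyl_n = .N), by = cyl]

# ...then, for each (cyl, gear) combination, remove that combination's
# own contribution from the cyl total before averaging: the mean mpg of
# all same-cyl cars with a DIFFERENT gear.
dt[, mpg_excl_gear := (cyl_sum[1] - sum(mpg)) / (cyl_n[1] - .N),
   by = .(cyl, gear)]
```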

Wrapping up code below into a function

Method 3 is roughly 100x faster than the other two. Great for this narrow task with the vectorization built in, but less generalizable; the other two methods allow any function to be passed.

Without keyby

Categories are not sorted

take the highest value of column A when column B is highest, by group

Max of qsec for each category of cyl (this is easy)

value of qsec when mpg is the highest per category of cyl

(this is trickier)

value of qsec when mpg is the lowest per category of cyl

Value of qsec when mpg is the median per category of cyl, subset rows within by statement.

V1 is the standard deviation of mpg by cyl . V2 is the standard deviation of mpg for just the first half of mpg .

3. FUNCTIONS

Passing data.table column names as function arguments, method 1: no quotes, and deparse + substitute.

This way seems more data.table-ish because it maintains the practice of not using quotes on variable names in most cases.

However I tend to pass through column names as characters (quoted) and use get each time I reference that column. That can be annoying if you have a long function repeatedly reference column names, but I often need to write such few lines of code with data.table, it hasn’t struck me as terribly unslick, yet.
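A sketch of the quoted-names-plus-get approach (group_mean, measure, and grp are illustrative names):

```r
library(data.table)

dt <- as.data.table(mtcars)

# Column names arrive as quoted strings; get() resolves them inside j,
# and by= accepts a character vector of column names directly.
group_mean <- function(d, measure, grp) {
  d[, .(avg = mean(get(measure))), by = grp]
}

res <- group_mean(dt, "mpg", "cyl")
```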

Beware of scoping within data.table

When you add something to a data.frame within a function that exists in the global environment, it does not affect that object in the global environment unless you return and reassign it as such, or you use the <<- operator.

When we call the function, we see addcol1 in the output. But not addcol2 . That’s because it’s been added to the df in the global environment one level up.

Here is addcol2, but not addcol1.

Unlike data.frame, the := operator adds a column to both the object living in the global environment and used in the function. I think this is because these objects are actually the same object. data.table shaves computation time by not making copies unless explicitly directed to.

So something like this renaming the local version using copy bypasses this behavior, but is likely somewhat less efficient (and elegant). I suspect there’s a cleaner and/or faster way to do this: keep some variables local to the function while persisting and returning other columns.

4. PRINTING

Nothing groundbreaking here, but a small miscellaneous piece of functionality. In data.frame world, wrapping an expression in () prints the output to the console. This also works with data.table, but there is another way. In data.table this is achieved by appending [] to the end of the expression. I find this useful because when I’m exploring at the console, I don’t usually decide to print the output until I’m almost done and I’m already at the end of the expression I’ve written.
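Both printing tricks, side by side (disp_l is an illustrative column name):

```r
library(data.table)

dt <- as.data.table(mtcars)

# := returns its result invisibly; an empty [] at the end forces a print.
dt[, disp_l := disp * 0.0163871][]

# The data.frame-style trick of wrapping the expression in () works too:
(dt[1:3, .(mpg, disp_l)])
```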

It used to be that assignments using the := operator printed the object to the console when knitting documents with knitr and rmarkdown. This is actually fixed in data.table v1.9.5. However, at the time of my writing, that version is not available on CRAN… only GitHub. For 1.9.4 users, this StackOverflow post has some hacky solutions. The least-impedance approach I found was simply wrapping the expression in invisible. Other solutions alter the way you use data.table, which I didn’t like.


data.table in R – The Complete Beginners Guide

  • November 13, 2019
  • Selva Prabhakaran

data.table is a package used for working with tabular data in R. It provides the efficient data.table object, which is a much improved version of the default data.frame. It is super fast and has an intuitive and terse syntax. If you know the R language and haven’t picked up the data.table package yet, then this tutorial guide is a great place to start.


  • Introduction: Why data.table?
  • How to install data.table package
  • Importing Data
  • How to convert data.frame to data.table
  • How to convert data.table to data.frame
  • Filtering rows based on conditions
  • How to select given columns
  • How to select multiple columns using a character vector
  • How to drop columns
  • How to rename columns
  • Creating a new column from existing columns
  • How to create new columns using character vector
  • What does .N and .I do
  • What is .SD and How to write functions inside data.table
  • How to join two or more datatables
  • How to merge multiple data.tables in one shot
  • Pivot Table operations
  • set() – A magic function for fast assignment operations

1. Introduction: Why data.table?

The data.table is an alternative to R’s default data.frame to handle tabular data. The reason it’s so popular is because of the speed of execution on larger data and the terse syntax.

So, effectively you type less code and get much faster speed. It is one of the most downloaded packages in R and is preferred by data scientists. It is probably one of the best things that have happened to the R programming language as far as speed is concerned.

Though data.table provides a slightly different syntax from the regular R data.frame, it is quite intuitive. So once you get it, it feels obvious and natural, and you wouldn’t want to go back to the base R data.frame syntax.

By the end of this guide you will understand the fundamental syntax of data.table and the structure behind it, plus all the core data manipulation functions of data.table, the scenarios they are used in, and how to use them, with some advanced tricks and tips as well. data.table is authored by Matt Dowle with significant contributions from Arun Srinivasan and many others.

Related Post: 101 R data.table Exercises

2. How to install data.table package

Installing the data.table package is no different from other R packages. It’s recommended to run install.packages() to get the latest version from the CRAN repository. However, if you want to use the latest development version, you can get it from GitHub as well.

If you want to revert back to the CRAN version do:

3. Importing Data

The way you work with data.tables is quite different from how you’d work with data.frames. Let’s understand these differences first while you gain mastery over this fantastic package.

fread(), short for fast read, is data.table’s version of read.csv(). Like read.csv(), it works for a file on your local computer as well as a file hosted on the internet. Plus it is at least 20 times faster. Let’s import the mtcars dataset stored as a csv file.

The imported data is stored directly as a data.table. As you see from the above output, data.table inherits from the data.frame class and is therefore a data.frame itself.

So, functions that accept a data.frame will work just fine on data.table as well. Because the dataset we imported was small, the read.csv() ‘s speed was good enough.

However, the speed gain becomes evident when you import a large dataset (millions of rows). To get a flavor of how fast fread() is, run the below code. It creates a 1M rows csv file. Then reads it back again. The time taken by fread() and read.csv() functions gets printed in console.


That’s about 20x faster. The time difference gets wider when the filesize increases.

4. How to convert data.frame to data.table

You can convert any data.frame into a data.table using one of these approaches:

  • data.table(df) or as.data.table(df)
  • setDT(df)

The difference between the two approaches is: data.table(df) will create a copy of df and convert it to a data.table, whereas setDT(df) converts it to a data.table in place.

That means, the df itself gets converted to a data.table and you don’t have to assign it to a different object.

As a result, there is no copy made and no duplication of the same data. Let’s reload the mtcars dataframe from R’s default datasets package.

Important: A data.table does not have rownames. So if the data.frame has any rownames, you need to store them as a separate column before converting to data.table.

Alternately, use setDT() to convert it to data.table in place.

5. How to convert data.table to data.frame

Conversely, use as.data.frame(dt) or setDF(dt) to convert a data.table to a data.frame .

6. Filtering rows based on conditions

The main difference with data.frame is: data.table is aware of its column names. So while filtering, passing only the column names inside the square brackets is sufficient.
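A sketch of the contrast (the condition itself is illustrative):

```r
library(data.table)

dt <- as.data.table(mtcars)

# data.frame style: repeat the object name for every column
df_way <- mtcars[mtcars$cyl == 6 & mtcars$am == 0, ]

# data.table style: bare column names inside the brackets
dt_way <- dt[cyl == 6 & am == 0]
```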


7. How to select given columns

Now, let’s see how to subset columns. The most unexpected thing you will notice with data.table is that you can’t select a column by its numbered position alone. For example, you can expect the following to work in a data.frame.

But this would just return ‘1’ in a data.table

If you want to get that column by position alone, you should add an additional argument, with=FALSE.

The returned output is a 1-column data.table. An alternate way and a better practice is to pass in the actual column name.

Notice here that the ‘mpg’ is not a string as it’s not written within quotes.

8. How to select multiple columns using a character vector

What if the column name is present as a string in another variable (vector)? In that case, you can’t use mpg directly. You need to additionally pass with=FALSE .

The same principle applies if you have multiple columns to be selected.

To make the above command work, you need to pass with=FALSE inside the square brackets

Clear? If you want to select multiple columns directly, then enclose all the required column names within list.

9. How to drop columns

How to drop the mpg , cyl and gear columns alone? Place them in a vector and use the ! in front to drop them. This effectively returns all columns except those present in the vector.

10. How to rename columns

The setnames() function is used for renaming columns. It takes the data.table (or data.frame), current name and new name as arguments and changes the column names in place without any copying of data. Always recommended!

Before moving on, let’s try out a mini challenge in the R console. Q: Convert the in-built airquality dataset to a data.table. Then select “Solar.R”, “Wind” and “Temp” for those rows where “Ozone” is not missing.

11. Creating a new column from existing columns

You can always create a new column as you do with a data.frame, but data.table also lets you create columns from within the square brackets. This saves keystrokes.

To create multiple new columns at once, use the special assignment symbol as a function.

To select only specific columns, use the list or dot symbol instead.

Now let’s see a special but frequently used case. Suppose you have the column names in a character vector and want to select those columns alone from the data.table. Passing it inside the square brackets doesn’t work.

12. How to create new columns using character vector

Suppose you want to create a new column but you have the name of that new column in another character vector. How to create the new column without using the actual column name?

For example, you have the new column name in the myvar vector, and you want to assign some value, say 1, to this column. Doing this will create a new column named ‘myvar’, and not ‘var1’ as intended.

To create a column named ‘var1’ instead, keep myvar inside a vector.
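A sketch of both the pitfall and the fix:

```r
library(data.table)

dt <- as.data.table(mtcars)

myvar <- "var1"

# dt[, myvar := 1] would create a column literally named "myvar".
# Wrapping the object in c() (or parentheses) makes it evaluate:
dt[, c(myvar) := 1]
made <- "var1" %in% names(dt)   # the intended column now exists

# Deleting uses the same pattern, assigning NULL:
dt[, c(myvar) := NULL]
```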

Finally, if you want to delete a column, assign it NULL.

All the above column names are now deleted. Before moving on, try solving this exercise in your R console. Question: Create a new column called ‘mileage_type’ that has the value ‘high’ if mpg > 20, else the value ‘low’.

13. Grouping

Now, let’s move on to the second major and awesome feature of R data.table: grouping using by . In base R, grouping is accomplished using the aggregate() function.

It’s a bit cumbersome and hard to remember the syntax.

All the functionalities can be accomplished easily using the ‘by’ argument within square brackets. For example, in mtcars data, how to get the mean mileage for each cylinder type? Answer: Since you want to see the mileage by cyl column, set by = 'cyl' inside the square brackets.

That’s really useful, isn’t it? You can even add multiple columns to the ‘by’ argument.
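Both forms in one sketch (the result column name mean_mpg is mine):

```r
library(data.table)

dt <- as.data.table(mtcars)

# Mean mileage per cylinder count:
by_cyl <- dt[, .(mean_mpg = mean(mpg)), by = cyl]

# Multiple grouping columns:
by_cyl_gear <- dt[, .(mean_mpg = mean(mpg)), by = .(cyl, gear)]
```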

14. A slightly complex groupby problem

Now, let’s see some really special cases.

How to select the first occurring value of mpg for each unique cyl value? That is, instead of taking the mean of mileage for every cylinder, you want to select the first occurring value of mileage. How to do that?

What to do if you want the second value? Just replace the 1 with 2.

And what if you want the last value? You can either use length(mpg) or .N :

15. What do .N and .I do

.N contains the number of rows present. So the following will get the number of rows for each unique value of cyl .

Now, how to create row numbers of items? It can be done using the .I variable, short for ‘index’. Let’s first understand what .I returns.

It returns all the row numbers. Now, how to return the row numbers where cyl == 6? This can get confusing in the beginning, so pay close attention. If you want to get the row numbers of items that satisfy a given condition, you might tend to write it like this:

But this returns the wrong answer, because data.table has already filtered the rows to those with a cyl value of 6, so .I numbers only the rows of that subset. What you want to do instead is to write that condition to subset .I alone, rather than the whole data.table.

The result is the same as using the which() function, which we used with data.frames.
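The two versions side by side:

```r
library(data.table)

dt <- as.data.table(mtcars)

# Tempting but wrong: .I here is computed after the subset, so it just
# numbers the filtered rows.
wrong <- dt[cyl == 6, .I]

# Right: subset .I itself in j, against the full table.
right <- dt[, .I[cyl == 6]]
```

`right` matches `which(mtcars$cyl == 6)`.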

Great! Let’s solve another challenge before moving on. Q: Compute the number of cars and the mean mileage for each gear type.

16. Chaining

data.table offers unique features that make it even more powerful and truly a Swiss Army knife for data manipulation. First, let’s understand what chaining is.

Using chaining, you can do multiple data.table operations one after the other without having to store intermediate results. For example, instead of writing two statements you can do it in one. The below code sorts after grouping by cyl:

With chaining, that is, by attaching the square brackets at the end, it’s done in one step.
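A sketch of both the two-step and the chained version:

```r
library(data.table)

dt <- as.data.table(mtcars)

# Two statements, intermediate result stored:
agg <- dt[, .(mean_mpg = mean(mpg)), by = cyl]
agg <- agg[order(cyl)]

# Chained into one, attaching the second [] at the end:
agg2 <- dt[, .(mean_mpg = mean(mpg)), by = cyl][order(cyl)]
```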

Actually, chaining is available in data.frames as well, but with features like by, it becomes especially convenient to use on a data.table.

17. What is .SD and How to write functions inside data.table

Next, let’s see how to write functions within the data.table square brackets. Suppose you want to compute the mean of all the variables, grouped by ‘cyl’.

How to do that?

You can create the columns one by one by writing them by hand. Or, you can use the lapply() function to do it all in one go. But lapply() takes a data.frame as its first argument. Then how to use lapply() inside a data.table? You can use the .SD object as the first argument for lapply(). But what is the .SD object?

It is nothing but a data.table that contains all the columns of the original data.table except the column specified in the ‘by’ argument. So here is what it looks like.

So now you can pass this as the first argument in lapply(). The 11th column in .SD is the rownames, so let’s include only the first 10.

Optionally, instead of subsetting .SD like this, you can specify the columns that should be part of .SD using the .SDcols argument.

The output now contains only the specified columns.
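A compact sketch of both forms:

```r
library(data.table)

dt <- as.data.table(mtcars)

# .SD is the per-group slice of the table; lapply maps over its columns.
all_means <- dt[, lapply(.SD, mean), by = cyl]

# Restrict which columns enter .SD with .SDcols:
some_means <- dt[, lapply(.SD, mean), by = cyl, .SDcols = c("mpg", "hp")]
```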

18. Keys

Now we have come to the ‘key’ concept for data.tables: keys.

Let’s understand why keys can be useful and how to set them. Setting one or more keys on a data.table enables it to perform binary search, which is many orders of magnitude faster than linear search, especially for large data.

As a result, the filtering operations are super fast after setting the keys. There is a side effect though: by setting a key, the data.table gets sorted by that key. So how to set a key? Just use the setkey() function.

It’s so fast it looks like nothing happened. But it has internally sorted the data.table with ‘carname’ as the key.

If you notice, this table is sorted by ‘carname’ variable. To check the keys for a data table, you can use the key() function.

Once the key is set, merging data.tables is very direct. I have distributed a few columns of mtcars into the following data.tables.

You can join these two datatables:

This returns dt1’s rows matched against dt2, based on the keys of these data.tables. You can also set multiple keys if you wish.

Now, how to remove the keys? Use setkey() and set it to NULL. Note the data.table will not go back to its original row arrangement.

Another aspect of setting keys is the ‘keyby’ argument. Using keyby, you can do the grouping and set the by column as a key in one go. For example, in the example we saw earlier, you can skip the chaining by using keyby instead of just by.

As a result, the output has the key as cyl.

19. How to join two or more datatables

The data.table package provides a faster implementation of the merge() function. The syntax is pretty much the same as base R’s merge() .

20. How to merge multiple data.tables in one shot

This is a bit of a hack: use the Reduce() function to repeatedly merge multiple data.tables stored in a list. Reduce() takes a function that is applied consecutively (merge_func in this case) and a list that stores the arguments for that function.
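A sketch of the pattern (the three small tables and merge_func mirror the description; column names are mine):

```r
library(data.table)

dt <- as.data.table(mtcars, keep.rownames = TRUE)  # rownames -> "rn"

d1 <- dt[, .(rn, mpg)]
d2 <- dt[, .(rn, cyl)]
d3 <- dt[, .(rn, gear)]

# Reduce() applies merge_func pairwise, left to right, across the list:
merge_func <- function(x, y) merge(x, y, by = "rn")
merged <- Reduce(merge_func, list(d1, d2, d3))
```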

21. Pivot Table operations

The dcast.data.table() function is used for pivot-table-like operations, as seen in spreadsheet software like Microsoft Excel or Google Sheets.

The good thing is dcast.data.table() works equally well on data.frame objects. Let’s create a pivot table showing the mean mileage (mpg) for cylinders (cyl) vs carburetors (carb).

So how to understand the syntax? There are 4 primary arguments:

  • data: the data.table (or data.frame) to operate on
  • formula: rows of the pivot table on the left of ‘~’ and columns of the pivot on the right
  • value.var: the column whose values should be used to fill up the pivot table
  • fun.aggregate: the function used to aggregate the value.var column

dcast.data.table() is versatile in allowing multiple columns to be passed to value.var, and allows multiple functions in fun.aggregate as well. Let’s solve a quick exercise based on pivot tables. Question 1: Create a pivot table to show the maximum and minimum mileage observed for each carburetor vs cylinder combination. Question 2: Which carb value has the highest difference between max and min?
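The mean-mpg pivot described above can be sketched as:

```r
library(data.table)

dt <- as.data.table(mtcars)

# Rows = cyl, columns = carb, cells = mean mpg for that combination.
# Combinations with no cars are filled with NaN (mean of an empty set).
pivot <- dcast.data.table(dt, cyl ~ carb,
                          value.var = "mpg",
                          fun.aggregate = mean)
```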

22. set() – A magic function for fast assignment operations

The set() command is an incredibly fast way to assign values to a new column. The syntax is: set(dt, i, j, value) , where i is the row number and j is the column number.

As a best practice, always explicitly use integers for i and j, that is, use 10L instead of 10. It is usually used in for-loops and is literally thousands of times faster.

Yes, it is fast even when used within a for-loop, which shows that the for-loop itself is not really the bottleneck for speed. It is the data-structure-related overhead on each assignment that makes loops slow, which is exactly what set() avoids. Besides, it works on a data.frame object as well.
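A small sketch of set() inside a loop (the derived column mileage_band is my invention for the demo):

```r
library(data.table)

dt <- as.data.table(mtcars)

# set() bypasses the [.data.table overhead, so per-iteration cost is
# tiny. Fill a new column one row at a time:
dt[, mileage_band := NA_character_]
for (i in seq_len(nrow(dt))) {
  set(dt, i = i, j = "mileage_band",
      value = if (dt$mpg[i] > 20) "high" else "low")
}
```

The same loop with `dt[i, mileage_band := ...]` would be dramatically slower, since each `[` call carries far more overhead than `set()`.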

It works like magic! Below is an example to illustrate the power of set() taken from official documentation itself. The speed benchmark may be outdated, but, run and check the speed by yourself to believe it.

23. Conclusion

We have covered all the core concepts needed to work with the data.table package. To gain more practice, try the 101 R data.table Exercises. I’d be interested to know your comments as well, so please share your thoughts in the comments section below.


Using .SD for Data Analysis

  • 1.1 Loading and previewing Lahman data
  • 2.1 Column subsetting: .SDcols
  • 2.2 Column type conversion
  • 2.3 Controlling a model’s right-hand side
  • 2.4 Conditional joins
  • 3.1 Group subsetting
  • 3.2 Group optima
  • 3.3 Grouped regression

This vignette will explain the most common ways to use the .SD variable in your data.table analyses. It is an adaptation of this answer given on StackOverflow.

1. What is .SD ?

In the broadest sense, .SD is just shorthand for capturing a variable that comes up frequently in the context of data analysis. It can be understood to stand for Subset, Selfsame, or Self-reference of the Data. That is, .SD is in its most basic guise a reflexive reference to the data.table itself – as we’ll see in examples below, this is particularly helpful for chaining together “queries” (extractions/subsets/etc. using [ ). In particular, this also means that .SD is itself a data.table (with the caveat that it does not allow assignment with := ).

The simpler usage of .SD is for column subsetting (i.e., when .SDcols is specified); as this version is much more straightforward to understand, we’ll cover that first below. The interpretation of .SD in its second usage, grouping scenarios (i.e., when by = or keyby = is specified), is slightly different, conceptually (though at core it’s the same, since, after all, a non-grouped operation is an edge case of grouping with just one group).

To give this a more real-world feel, rather than making up data, let’s load some data sets about baseball from the Lahman database . In typical R usage, we’d simply load these data sets from the Lahman R package; in this vignette, we’ve pre-downloaded them directly from the package’s GitHub page instead.

Readers up on baseball lingo should find the tables’ contents familiar; Teams records some statistics for a given team in a given year, while Pitching records statistics for a given pitcher in a given year. Please do check out the documentation and explore the data yourself a bit before proceeding to familiarize yourself with their structure.

2. .SD on Ungrouped Data

To illustrate what I mean about the reflexive nature of .SD , consider its most banal usage:

That is, Pitching[ , .SD] has simply returned the whole table, i.e., this was an overly verbose way of writing Pitching or Pitching[] :

In terms of subsetting, .SD is still a subset of the data, it’s just a trivial one (the set itself).

The first way to impact what .SD is is to limit the columns contained in .SD using the .SDcols argument to [ :

This is just for illustration and was pretty boring. In addition to accepting a character vector, .SDcols also accepts:

  • any function such as is.character to filter columns
  • the function* patterns() to filter column names by regular expression
  • integer and logical vectors

*see ?patterns for more details

This simple usage lends itself to a wide variety of highly beneficial / ubiquitous data manipulation operations:

Column type conversion is a fact of life for data munging. Though fwrite recently gained the ability to declare the class of each column up front , not all data sets come from fread (e.g. in this vignette) and conversions back and forth among character / factor / numeric types are common. We can use .SD and .SDcols to batch-convert groups of columns to a common type.

We notice that the following columns are stored as character in the Teams data set, but might more logically be stored as factor s:

The syntax to now convert these columns to factor is simple:

  • The := is an assignment operator to update the data.table in place without making a copy. See reference semantics for more.
  • The LHS, names(.SD) , indicates which columns we are updating - in this case we update the entire .SD .
  • The RHS, lapply() , loops through each column of the .SD and converts the column to a factor.
  • We use the .SDcols to only select columns that have pattern of teamID .

Again, the .SDcols argument is quite flexible; above, we supplied patterns but we could have also supplied fkt or any character vector of column names. In other situations, it is more convenient to supply an integer vector of column positions or a logical vector dictating include/exclude for each column. Finally, the use of a function to filter columns is very helpful.

For example, we could do the following to convert all factor columns to character :

Lastly, we can do pattern-based matching of columns in .SDcols to select all columns which contain team and convert them back to factor :

** A proviso to the above: explicitly using column numbers (like DT[ , (1) := rnorm(.N)] ) is bad practice and can lead to silently corrupted code over time if column positions change. Even implicitly using numbers can be dangerous if we don’t keep smart/strict control over the ordering of when we create the numbered index and when we use it.

Varying model specification is a core feature of robust statistical analysis. Let’s try and predict a pitcher’s ERA (Earned Runs Average, a measure of performance) using the small set of covariates available in the Pitching table. How does the (linear) relationship between W (wins) and ERA vary depending on which other covariates are included in the specification?

Here’s a short script leveraging the power of .SD which explores this question:

Fit OLS coefficient on W, various specifications, depicted as bars with distinct colors.

The coefficient always has the expected sign (better pitchers tend to have more wins and fewer runs allowed), but the magnitude can vary substantially depending on what else we control for.

data.table syntax is beautiful for its simplicity and robustness. The syntax x[i] flexibly handles three common approaches to subsetting – when i is a logical vector, x[i] will return those rows of x corresponding to where i is TRUE ; when i is another data.table (or a list ), a (right) join is performed (in the plain form, using the key s of x and i , otherwise, when on = is specified, using matches of those columns); and when i is a character, it is interpreted as shorthand for x[list(i)] , i.e., as a join.

This is great in general, but falls short when we wish to perform a conditional join , wherein the exact nature of the relationship among tables depends on some characteristics of the rows in one or more columns.

This example is admittedly a tad contrived, but illustrates the idea; see here ( 1 , 2 ) for more.

The goal is to add a column team_performance to the Pitching table that records the team’s performance (rank) of the best pitcher on each team (as measured by the lowest ERA, among pitchers with at least 6 recorded games).
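A sketch of the two-step approach: frank() ranks ERA within each team-year, then a join against Teams fills team_performance for each team's best pitcher only.

```r
# rank pitchers within each team-year by ERA, among those with at least 6 games
Pitching[G > 5, rank_in_team := frank(ERA), by = .(teamID, yearID)]

# conditional join: only rows with rank_in_team == 1 look up the team's Rank
Pitching[rank_in_team == 1,
         team_performance := Teams[.SD, Rank, on = c('teamID', 'yearID')]]
```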

Note that the x[y] syntax returns nrow(y) values (i.e., it’s a right join), which is why .SD is on the right in Teams[.SD] (since the RHS of := in this case requires nrow(Pitching[rank_in_team == 1]) values).

3. Grouped .SD operations

Often, we’d like to perform some operation on our data at the group level . When we specify by = (or keyby = ), the mental model for what happens when data.table processes j is to think of your data.table as being split into many component sub- data.table s, each of which corresponds to a single value of your by variable(s):

In the case of grouping, .SD is multiple in nature – it refers to each of these sub- data.table s, one-at-a-time (slightly more accurately, the scope of .SD is a single sub- data.table ). This allows us to concisely express an operation that we’d like to perform on each sub- data.table before the re-assembled result is returned to us.

This is useful in a variety of settings, the most common of which are presented here:

Let’s get the most recent season of data for each team in the Lahman data. This can be done quite simply with:

Recall that .SD is itself a data.table , and that .N refers to the total number of rows in a group (it’s equal to nrow(.SD) within each group), so .SD[.N] returns the entirety of .SD for the final row associated with each teamID .

Another common version of this is to use .SD[1L] instead to get the first observation for each group, or .SD[sample(.N, 1L)] to return a random row for each group.

Suppose we wanted to return the best year for each team, as measured by their total number of runs scored ( R ; we could easily adjust this to refer to other metrics, of course). Instead of taking a fixed element from each sub- data.table , we now define the desired index dynamically as follows:
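Instead of a fixed index like .N , we compute the index per group:

```r
# row of each team's maximum runs scored
Teams[ , .SD[which.max(R)], by = teamID]
```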

Note that this approach can of course be combined with .SDcols to return only portions of the data.table for each .SD (with the caveat that .SDcols should be fixed across the various subsets).

NB : .SD[1L] is currently optimized by GForce ( see also ), data.table internals which massively speed up the most common grouped operations like sum or mean – see ?GForce for more details and keep an eye on/voice support for feature improvement requests for updates on this front: 1 , 2 , 3 , 4 , 5 , 6

Returning to the inquiry above regarding the relationship between ERA and W , suppose we expect this relationship to differ by team (i.e., there’s a different slope for each team). We can easily re-run this regression to explore the heterogeneity in this relationship as follows (noting that the standard errors from this approach are generally incorrect – the specification ERA ~ W*teamID will be better – this approach is easier to read and the coefficients are OK):
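A sketch of the grouped fit (lm_coef is an assumed name; the plotting step is omitted, and in practice tiny groups or missing ERA values may need filtering first):

```r
# one OLS slope on W per team; coef(...)['W'] extracts the W coefficient
lm_coef = Pitching[ , .(w_coef = coef(lm(ERA ~ W))['W']), by = teamID]
```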

A histogram depicting the distribution of fitted coefficients. It is vaguely bell-shaped and concentrated around -.2

While there is indeed a fair amount of heterogeneity, there’s a distinct concentration around the observed overall value.

The above is just a short introduction of the power of .SD in facilitating beautiful, efficient code in data.table !

A gentle introduction to data.table

Operations on rows, select columns, computation on columns, modify / add / delete columns, operations on both rows and columns, aggregation by group, code indentation and reordering, modify a data.table by group, chaining commands, more details about dt[, j], iterate over several columns, selecting columns with .sdcols, to infinity and beyond.

data.table is one of the greatest R packages. It provides an enhanced version of base R’s data.frame with syntax and feature enhancements, making data manipulation concise, consistent, efficient, and fun!

This post gives a quick introduction to data.table. The main objective is to present the data.table syntax, showing how to perform basic, but essential, data wrangling tasks.

Let’s start by creating a simple data.table.
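The post's original table isn't reproduced here, so the values below are illustrative; any small data.table will do for the commands that follow:

```r
library(data.table)

DT = data.table(id  = 1:6,
                grp = c('a', 'a', 'b', 'b', 'c', 'c'),
                val = c(10, 25, 15, 30, 20, 35))
```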

How can we select the first two rows and the first two columns of a data.table? Well, that’s simple:

In base R, we can access elements of a matrix or a data.frame using the square brackets indexing method. It’s the same with data.table.

To start, think DT[rows, columns] , also presented in a short way: DT[i, j] . With data.table, most of the things take place within these square brackets. The magic is that you can do much more with the simple command DT[i, j] . In particular, rows and columns can be referred to without using any quotes or $ symbol, making the code very clear.

For example, to select rows using a condition:

When performing operations on rows only, the j element is left empty ( DT[i,] ), as in the first command, or simply ignored ( DT[i] ) as in the second command.

Likewise, we can order the data:

Or sample rows:

The command above can be abbreviated as follows:

.N is an alias for the number of rows. data.table offers several syntactic sugars like this, as we will see below.

Operations on columns

To select a column, we just specify the name in DT[, j] .

The previous command returns a vector, if you want the result to be a data.table, you have to use list() , or more simply the . alias.

To select columns using a vector of column names, one should use the .. prefix. It indicates to search the corresponding vector ‘one level up’ ( i.e. in the global environment).

To apply a function on a column:

By default, the result is returned as a vector. Just like above with column selection, to obtain the result as a data.table, it is necessary to use .() . Doing so, we can also assign colnames to the result. When no column names are provided, they are automatically generated as V1, V2, …

To apply a function on several columns:

Assigning column names and indenting the code:

Note that the previous commands create a new data.table. To modify an existing column, or create a new one, use the := operator.
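For example (column names follow the illustrative DT used in this post):

```r
# add (or overwrite) a column by reference; no '<-' assignment needed
DT[ , val2 := val * 2]
```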

As you can see below, DT has been modified even if we did not assign the result:

Using the data.table := operator modifies the existing object ‘in place’, which has the benefit of being memory-efficient. Memory management is an important aspect of data.table. There is a dedicated vignette to learn more about it.

The principle is the same to modify or add several columns:

It is also possible to use the functional form, which, combined with indentation, offers a more readable alternative:

With a predefined vector of column names, the corresponding object must be put in parentheses.

And finally, to remove columns, we assign them a NULL value:

Obviously, the operations on rows and columns can be combined in DT[i, j] . Operations on j are then performed after the condition in i has been applied.

Combining i and j in the same expression is particularly useful because it allows us to modify some values only for rows matching the condition in i , or to create a new column, assigning a given value for matching rows, other rows being left as NA .

Now that you are familiar with DT[i, j] , let’s introduce DT[i, j, by] . by can somewhat be viewed as a “third virtual dimension”. The data can be aggregated by group using a single additional argument: by . That’s it. How could it be more simple?
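For example, with a grouping column grp and a numeric column val (names assumed from the illustrative DT):

```r
# one aggregated row per group
DT[ , .(mean_val = mean(val)), by = grp]
```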

A condition or a function call can also be used in by .

Aggregating on several columns is just as simple, with a character vector in by :

Or using .(...) in by:

And here is a full DT[i, j, by] command:

Once again, this is just one single argument: by = ... . Much simpler and more practical than base::tapply() in my opinion. This is one of the key features that got me hooked on data.table.

Because data.table offers a concise syntax, commands easily fit on a single line. But it is possible to indent the code for more readability and also to reorder the elements ( DT[i, by, j] ).

In the previous commands, by has been used to aggregate data, returning a new data.table as output. It is of course possible to use by when modifying an existing data.table and to return the output for each observation of the groups. For example, to add a column with the number of observations for each group (the .N alias mentioned earlier can also be used in j !):

Here is another example:

Commands can be chained together using DT[ ... ][ ... ] “horizontally”:

Or “vertically”:

Adding an empty [] at the end of a command will print the result. This is useful for example when modifying columns by reference using := .

Let’s recap the main commands we have seen so far:

That’s the beauty of data.table: simplicity and consistency. DT[rows, columns, by] . No quoted column names, no $ symbol, and no new function. The only new thing is := , used to assign column(s) by reference. .() is just an alias for list() and .N is an alias for the number of rows.

When we look at the commands above, it appears that data.table is so expressive that very little code is needed. In fact, with so little text and such regular alignment, brackets, commas, and symbols somehow stand out. Removing the “unnecessary” stuff makes the structure more visible. This structure is a guide to reading data.table code. I think data.table is more about understanding than memorizing.

So far DT[, j] has been used to select, modify, summarize, or compute new columns. But it is even more flexible: the j element can be any arbitrary expression, or set of expressions written within curly braces. For example:


Note that passing several expressions within curly braces is valid base R code to evaluate several commands, but only return the last result:

Also, as long as the j expression returns a list of equal-length elements (or elements of length one), each element of the list will be converted to a column in the resulting data.table. This is important! Keep that in mind, we’ll see the implication in the next section. But note that it also explains why we used the list() alias .() earlier for operations on columns.

Just like .N is an alias referring to the number of rows, .SD is an alias referring to the Subset of Data for each group, excluding the column(s) used in by . Said differently, .SD corresponds to “the current data for the current group (excluding grouping variables)”. It offers a convenient way to iterate over the columns of a data.table.

To better see it, just print it!

If there is no by , then .SD is DT itself.

To run a function over all the columns of a data.table, we can use the following expression:
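Assuming a data.table DT, the expression looks like:

```r
# apply min() to every column of DT; the result has a single row
DT[ , lapply(.SD, min)]
```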

Let’s take some time to explain it step by step:

  • iterating over the columns of a data.frame using lapply() is a valid base R expression ( e.g. lapply(mtcars, min) )
  • DT[,j] can take any arbitrary expression as mentioned earlier, so lapply(.SD, min) is used as the j expression
  • .SD is, once again, just an alias for the subset of data for each group (no group is used in this first example, so .SD contains all the DT columns)
  • iterating over the columns of .SD using lapply() obviously returns a list
  • as described in the previous section, as long as the j expression returns a list (of equal-length or length-one elements), each element of the list will be converted to a column in the resulting data.table
  • so finally, the command above returns a data.table with the minimum value for each column of .SD

Let’s run the same expression, this time by group:

And of course, we can also select the rows using a DT[i, j, by] command:

It is possible to append the result of the aggregation to the current data.table, the values will then be recycled for each observation of a given group:

By default, .SD contains all the columns that are not provided in by . To run a function on specific columns, use .SDcols to pass a vector of colnames.

A regular expression can also be passed using patterns() :

Alternatively, a function can be provided in .SDcols . This function must return a boolean signalling inclusion/exclusion of the column:

In this post, we have introduced the data.table syntax to perform common data wrangling operations. Nevertheless, we’ve only scratched the surface. data.table offers tens of other impressive features. Here are some more reasons why it deserves to be in your data science toolbox.

data.table was created in 2006 and is still actively maintained on GitHub, with responsive, welcoming, and insightful support.

data.table is reliable. A lot of care is given to maintain compatibility in the long run. R dependency is as old as possible and there are no dependencies on other packages, for simpler production and maintenance.

data.table is very reliable. It is a masterpiece of continuous integration and contains more than 9000 tests that you can run locally using test.data.table() to make sure everything works fine with your own settings. There is more test code in data.table than there is package code.

while data.table performance is impressive, it is not only for ‘large data’. Why should analyzing ‘small data’ be less convenient than analyzing ‘large data’? data.table is great, even for a six-row dataset like the one used in this post.

data.table offers the fread and fwrite functions, the fastest, most robust and full-featured functions to read or write text files.

data.table offers keys and indices , which are mechanisms that make row subsetting (and joins) blazingly fast.

data.table can perform the most common data joins as well as advanced joins like non-equi joins or overlap joins.

data.table also includes dcast() , melt() , as well as a bunch of utility functions, for efficient and versatile data reshaping .

all these data manipulations are fast (see benchmark ), memory-efficient, and just as easy to perform as the few commands presented in this document.

and last but not least, data.table has a nice and fun logo ! R! R! R!

Happy data.tabling!


Dataframe Conditional Selection in R

Using square brackets with conditional selection, you can subset a data frame in R and extract the observations you would like to keep or exclude.

You can also subset a data frame using the subset() function in R.

In this tutorial, we will discuss how to subset rows of a data frame using logical expressions in R.

Subset Rows with Equal To condition

Using the equal to (==) operator, you can subset the data frame.

Let’s consider an example to create a data frame in r to subset rows with == operator.

The output of the above data frame is:

Using the above data frame, extract the data for male candidates only using == operator.
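A sketch of the idea (the tutorial's actual student_info values are not shown, so this example data is illustrative):

```r
# illustrative data; column names follow the tutorial
student_info <- data.frame(
  name   = c("Ana", "Ben", "Cara", "Dan"),
  gender = c("F", "M", "F", "M"),
  marks  = c(85, 78, 92, 66),
  age    = c(20, 21, 22, 19)
)

# keep rows where gender equals "M"
student_info[student_info$gender == "M", ]
```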

In the above R code, the student_info data frame’s $gender variable is compared to the value “M” using the == operator.

It subsets the data frame and displays only the male candidates’ data.

Subset data frame using multiple conditions

You can specify multiple conditions to subset a data frame to extract the data.

Let’s consider the student_info data frame to extract data for female candidates whose marks are greater than 80.
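One way to write it (using the illustrative student_info sketched earlier):

```r
# AND-combine two conditions with &
student_info[student_info$gender == "F" & student_info$marks > 80, ]
```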

In the above R code, we have specified multiple conditions: gender is equal to female and marks are greater than 80.

Data frame conditional selection using Relation operator

Using the relation operator in the condition, you can subset a data frame.

Let’s consider the above student_info data frame to get students’ data where age is between the specified age.
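For example (using the illustrative student_info from earlier, with the age bounds taken from the text):

```r
# ages between 20 and 21, inclusive
student_info[student_info$age >= 20 & student_info$age <= 21, ]
```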

In the above R code to subset the data frame by rows, we use relational operators in the condition to display students having an age between 20 and 21.

Subset of Data frame Rows with %in%

Using the %in% operator, you can extract and filter data where the column value equals one of the specified values.
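For example (the values to match are chosen for illustration):

```r
# rows whose age matches any value in the vector
student_info[student_info$age %in% c(20, 21), ]
```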

The output of the above R code is:

I hope the above article on subset data frame in r using the logical condition is helpful to you.



Set Pandas Conditional Column Based on Values of Another Column

  • August 9, 2021 February 22, 2022


There are many times when you may need to set a Pandas column value based on the condition of another column. In this post, you’ll learn all the different ways in which you can create Pandas conditional columns.

Table of Contents

Video Tutorial

If you prefer to follow along with a video tutorial, check out my video below:

Loading a Sample Dataframe

Let’s begin by loading a sample Pandas dataframe that we can use throughout this tutorial.

We’ll begin by importing pandas and loading a dataframe using the .from_dict() method:
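The tutorial's exact data isn't reproduced here, so the values below are illustrative:

```python
import pandas as pd

# build a small dataframe from a dictionary of columns
df = pd.DataFrame.from_dict({
    'Name': ['Jane', 'Nik', 'Kate'],
    'Age': [25, 32, 29],
    'City': ['Toronto', 'Paris', 'Atlanta'],
})
```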

This returns the following dataframe:

Using Pandas loc to Set Pandas Conditional Column

Pandas loc is incredibly powerful! If you need a refresher on loc (or iloc), check out my tutorial here . Pandas’ loc creates a boolean mask, based on a condition. Sometimes, that condition can just be selecting rows and columns, but it can also be used to filter dataframes. These filtered dataframes can then have values applied to them.

Let’s explore the syntax a little bit:

With the syntax above, we filter the dataframe using .loc and then assign a value to any row in the column (or columns) where the condition is met.

Let’s try this out by assigning the string ‘Under 30’ to anyone with an age less than 30, and ‘Over 30’ to anyone 30 or older.
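A minimal version of that logic (dataframe values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jane', 'Nik', 'Kate'], 'Age': [25, 32, 29]})

# start everyone at 'Over 30', then overwrite rows where the mask is True
df['Age Category'] = 'Over 30'
df.loc[df['Age'] < 30, 'Age Category'] = 'Under 30'
```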

Let's take a look at what we did here:

  • We assigned the string 'Over 30' to every record in the dataframe. To learn more about this, check out my post here or creating new columns.
  • We then use .loc to create a boolean mask on the Age column to filter down to rows where the age is less than 30. When this condition is met, the Age Category column is assigned the new value 'Under 30'

But what happens when you have multiple conditions? You could, of course, use .loc multiple times, but this is difficult to read and fairly unpleasant to write. Let's see how we can accomplish this using numpy's .select() method.

Using Numpy Select to Set Values using Multiple Conditions

Similar to the method above to use .loc to create a conditional column in Pandas, we can use the numpy .select() method.

Let's begin by importing numpy and we'll give it the conventional alias np :

Now, say we wanted to apply a number of different age groups, as below:

  • <20 years old,
  • 20-39 years old,
  • 40-59 years old,
  • 60+ years old

In order to do this, we'll create a list of conditions and corresponding values to fill:
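A sketch with illustrative ages:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [15, 25, 45, 70]})

# conditions are checked in order; the first match wins
conditions = [
    df['Age'] < 20,
    df['Age'].between(20, 39),
    df['Age'].between(40, 59),
]
values = ['<20 years old', '20-39 years old', '40-59 years old']

# anything not matched falls through to the default (the 60+ group)
df['Age Group'] = np.select(conditions, values, default='60+ years old')
```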

Running this returns the following dataframe:

Let's break down what happens here:

  • We first define a list of conditions in which the criteria are specified. Recall that lists are ordered meaning that they should be in the order in which you would like the corresponding values to appear.
  • We then define a list of values to use , which corresponds to the values you'd like applied in your new column.

Something to consider here is that this can be a bit counterintuitive to write. You can similarly define a function to apply different values. We'll cover this off in the section of using the Pandas .apply() method below .

One of the key benefits is that using numpy's .select() is very fast, especially when compared to using the .apply() method.

Using Pandas Map to Set Values in Another Column

The Pandas .map() method is very helpful when you're applying labels to another column. In order to use this method, you define a dictionary to apply to the column.

For our sample dataframe, let's imagine that we have offices in America, Canada, and France. We want to map the cities to their corresponding countries and apply an "Other" value for any other city.

When we print this out, we get the following dataframe returned:

What we can see here, is that there is a NaN value associated with any City that doesn't have a corresponding country. If we want to apply "Other" to any missing values, we can chain the .fillna() method:
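The full chain might look like this (city and country values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'City': ['Toronto', 'Paris', 'Atlanta', 'Oslo']})

country_map = {'Toronto': 'Canada', 'Paris': 'France', 'Atlanta': 'USA'}

# .map() leaves unmatched cities as NaN; .fillna() replaces those with 'Other'
df['Country'] = df['City'].map(country_map).fillna('Other')
```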

Using Pandas Apply to Apply a function to a column

Finally, you can apply built-in or custom functions to a dataframe using the Pandas .apply() method.

Let's take a look at both applying built-in functions such as len() and even applying custom functions.

Applying Python Built-in Functions to a Column

We can easily apply a built-in function using the .apply() method. Let's see how we can use the len() function to count the length of a string in a given column.

Take note of a few things here:

  • We apply the .apply() method to a particular column,
  • We omit the parentheses "()"

Using Third-Party Packages in Pandas Apply

Similarly, you can use functions from third-party packages. Let's use numpy to apply the .sqrt() method to find the square root of a person's age.

Using Custom Functions with Pandas Apply

Something that makes the .apply() method extremely powerful is the ability to define and apply your own functions.

Let's revisit how we could use an if-else statement to create age categories as in our earlier example:
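A hypothetical helper mirroring the earlier Under/Over 30 example:

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 32, 29]})

def age_category(age):
    # custom function applied value-by-value to the Age column
    if age < 30:
        return 'Under 30'
    return 'Over 30'

df['Age Category'] = df['Age'].apply(age_category)
```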

In this post, you learned a number of ways in which you can apply values to a dataframe column to create a Pandas conditional column, including using .loc , .np.select() , Pandas .map() and Pandas .apply() . Each of these methods has a different use case that we explored throughout this post.

Learn more about Pandas methods covered here by checking out their official documentation:

  • Pandas Apply
  • Numpy Select

Nik Piepenbreier

Nik is the author of datagy.io and has over a decade of experience working with data analytics, data science, and Python. He specializes in teaching developers how to use Python for data science using hands-on tutorials. View Author posts


data.table Extension of `data.frame`


assign: Assignment by reference

Description

Fast add, remove and update subsets of columns, by reference. The := operator can be used in two ways: the LHS := RHS form, and the Functional form . See Usage .

set is a low-overhead, loop-able version of := . It is particularly useful for repetitively updating rows of certain columns by reference (using a for-loop). See Examples . It cannot perform grouping operations.

:= is defined for use in j only. It adds or updates or removes column(s) by reference. It makes no copies of any part of memory at all. Please read vignette("datatable-reference-semantics") and follow with examples. Some typical usages are:

The .Last.updated variable contains the number of rows updated by the most recent := or set calls, which may be useful, for example, in production settings for testing assumptions about the number of rows affected by a statement; see .Last.updated for details.

Note that for efficiency no check is performed for duplicate assignments, i.e. if multiple values are passed for assignment to the same index, assignment to this index will occur repeatedly and sequentially; for a given use case, consider whether it makes sense to create your own test for duplicates, e.g. in production code.

All of the following result in a friendly error (by design) :

For additional resources, please read vignette("datatable-faq") . Also have a look at StackOverflow's data.table tag .

:= in j can be combined with all types of i (such as binary search), and all types of by . This is one reason why := has been implemented in j . Please see vignette("datatable-reference-semantics") and also FAQ 2.16 for analogies to SQL.

When LHS is a factor column and RHS is a character vector with items missing from the factor levels, the new level(s) are automatically added (by reference, efficiently), unlike base methods.

Unlike <- for data.frame , the (potentially large) LHS is not coerced to match the type of the (often small) RHS. Instead the RHS is coerced to match the type of the LHS, if necessary. Where this involves double precision values being coerced to an integer column, a warning is given (whether or not fractional data is truncated). The motivation for this is efficiency. It is best to get the column types correct up front and stick to them. Changing a column type is possible but deliberately harder: provide a whole column as the RHS. This RHS is then plonked into that column slot and we call this plonk syntax , or replace column syntax if you prefer. By needing to construct a full length vector of a new type, you as the user are more aware of what is happening, and it is clearer to readers of your code that you really do intend to change the column type.

data.table s are not copied-on-change by := , setkey or any of the other set* functions. See copy .

DT is modified by reference and returned invisibly. If you require a copy, take a copy first (using DT2 = copy(DT) ).

Advanced (internals):

It is easy to see how sub-assigning to existing columns is done internally. Removing columns by reference is also straightforward by modifying the vector of column pointers only (using memmove in C). However adding (new) columns is more tricky as to how the data.table can be grown by reference : the list vector of column pointers is over-allocated , see truelength . By defining := in j we believe update syntax is natural, and scales, but it also bypasses [<- dispatch and allows := to update by reference with no copies of any part of memory at all.

Since [.data.table incurs overhead to check the existence and type of arguments (for example), set() provides direct (but less flexible) assignment by reference with low overhead, appropriate for use inside a for loop. See examples. := is more powerful and flexible than set() because := is intended to be combined with i and by in single queries on large datasets.

DT[a > 4, b := c] is different from DT[a > 4][, b := c] . The first expression updates (or adds) column b with the value c on those rows where a > 4 evaluates to TRUE . DT is updated by reference , therefore no assignment is needed.

The second expression on the other hand updates a new data.table that's returned by the subset operation. Since the subsetted data.table is ephemeral (it is not assigned to a symbol), the result would be lost; unless the result is assigned, for example, as follows: ans <- DT[a > 4][, b := c] .

See also: data.table, copy, setalloccol, truelength, set, .Last.updated


DataFrames with Conditionals

The use of conditionals allows us to select a subset of rows based on the value in each row. Writing a conditional to select rows based on the data in a single column is straightforward and was used when we selected all of the courses taught by the Statistics department with the following code:

The subset of rows where the Subject is exactly equal to STAT (57 rows).
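The selection described above can be sketched in plain pandas as follows. The DataFrame here is a tiny hypothetical stand-in for the Course Catalog dataset (only the Subject and Number column names come from the text):

```python
import pandas as pd

# Tiny hypothetical stand-in for the Course Catalog dataset
df = pd.DataFrame({
    "Subject": ["STAT", "CS", "STAT", "PHYS"],
    "Number":  [100, 225, 400, 211],
})

# Keep only the rows where Subject is exactly equal to "STAT"
stat_courses = df[df.Subject == "STAT"]
print(stat_courses)
```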

Complex Conditionals with Multiple Parts

As we want to answer more complex questions, we need increasingly complex conditionals. To help understand how a computer works, you may be familiar with the idea that computers ultimately only think in zeros and ones:

  • When a computer stores a zero, we consider that to be False .
  • When a computer stores a one, we consider that to be True .

When we use conditionals, we are assigning a truth value to every single row in the DataFrame.

  • With our conditional df[df.Subject == "STAT"] , all rows where the Subject data was "STAT" were assigned a truth value of True and kept in the final result; all other rows were labeled False and discarded.

All programming languages allow us to combine conditionals together in two key ways: with an AND ( & ) or with an OR ( | ).

Multiple Conditionals Joined with AND ( & )

When we combine two conditionals, we can ask Python to keep only the result where the first conditional AND the second conditional are both True .

Writing a conditional with multiple parts requires parentheses around each individual conditional and an operator joining the two conditionals together. For example, using the Course Catalog dataset, we want all of the courses that are taught by Computer Science ( CS ) with a course number less than 300:

Both the first ( Subject is exactly equal to "CS" ) and second ( Number is less than 300 ) conditionals are checked independently. Since an AND ( & ) is used to join these two conditionals, the final truth value is True only when both conditionals are True :

All CS courses with course numbers less than 300 (17 rows).
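With a small made-up catalog (not the real 17-row result), the AND conditional can be sketched as:

```python
import pandas as pd

# Made-up mini catalog; the real dataset yields 17 matching rows
df = pd.DataFrame({
    "Subject": ["CS", "CS", "CS", "STAT"],
    "Number":  [125, 225, 374, 107],
})

# Each conditional gets its own parentheses, joined with &
cs_under_300 = df[(df.Subject == "CS") & (df.Number < 300)]
print(cs_under_300)
```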

Python allows us to chain conditionals together indefinitely -- so it's no problem to have three conditionals:

All CS courses with course numbers less than 300 and exactly 3 credit hours (6 rows).
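A three-conditional version might look like the sketch below; the "Credits" column name is an assumption for illustration (the real dataset's column name may differ):

```python
import pandas as pd

# Mini catalog with an assumed "Credits" column (hypothetical column name)
df = pd.DataFrame({
    "Subject": ["CS", "CS", "CS"],
    "Number":  [125, 225, 498],
    "Credits": [4, 3, 3],
})

# Three conditionals chained with &, each in its own parentheses
result = df[(df.Subject == "CS") & (df.Number < 300) & (df.Credits == 3)]
print(result)
```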

Multiple Conditionals Joined with OR ( | )

Alternatively, Python can combine two conditionals together and keep the result when either the first conditional OR the second conditional is True (this includes when they're both True as well!). There are two major applications when this is useful:

  • Selecting multiple values of data from the same column (ex: all courses in "ARTD" OR "ARTE" OR "ARTF" ).
  • Selecting multiple values from different columns and keeping all matches (ex: all courses in "PSYC" OR courses that are only 1 credit hour).

Selecting Multiple Values of Data from the Same Column

Looking at the first example above, the University of Illinois has a lot of courses in art across many different sub-areas of art including: Art Design ( "ARTD" ), Art Education ( "ARTE" ), Art Foundation ( "ARTF" ), Art History ( "ARTH" ), and Art Studio ( "ARTS" ).

To include ALL courses from all five sub-areas of art listed above, we must join them together with an OR ( | ). Notice that it is necessary to specify each conditional completely each time even though we are always comparing the subject since Python has to evaluate each conditional independently and then combine the results together:

All courses in any subjects ARTD, ARTE, ARTF, ARTH, OR ARTS (221 rows).
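Spelled out on a hypothetical sample of courses, the five fully written conditionals joined with OR look like this:

```python
import pandas as pd

# Hypothetical sample of courses
df = pd.DataFrame({
    "Subject": ["ARTD", "ARTE", "ARTH", "CS"],
    "Number":  [101, 201, 111, 125],
})

# Each subject comparison is written out in full and joined with |
art = df[(df.Subject == "ARTD") | (df.Subject == "ARTE") | (df.Subject == "ARTF")
         | (df.Subject == "ARTH") | (df.Subject == "ARTS")]
print(art)
```

As an aside, pandas also offers `df[df.Subject.isin(["ARTD", "ARTE", "ARTF", "ARTH", "ARTS"])]` as a more compact equivalent for matching multiple values of the same column.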

Selecting Multiple Values from Different Columns and Keeping All Matches

To be considered a "full-time student" at most universities, you must be enrolled in at least 12 credit hours . If you are only enrolled in 11 credit hours, you may be interested in any course that will bump you up to exactly 12 credit hours (ex: a course worth exactly one credit hour) or a course you may be interested in (ex: something from the psychology ( "PSYC" ) department).

To include ALL of the results of all courses that are either one credit hour OR in the psychology department, we need an OR :

All courses that are exactly one credit hour OR in the psychology department (490 rows).
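A sketch of the OR across two different columns, again on toy rows with an assumed "Credits" column name:

```python
import pandas as pd

# Toy rows; "Credits" is a hypothetical column name for credit hours
df = pd.DataFrame({
    "Subject": ["PSYC", "CS", "MUS"],
    "Credits": [3, 3, 1],
})

# Keep a row if EITHER conditional is True
options = df[(df.Credits == 1) | (df.Subject == "PSYC")]
print(options)
```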

Combining ANDs and ORs

The most complex conditionals will require a combination of both AND and OR statements. These can get incredibly tricky, but we can remember that Python will always process conditionals by only combining two conditionals together at a time.

Since Python combines only two conditionals together at any given time, it is critical we use parentheses to ensure we specify the order in which we want these conditionals combined. For example, let's explore only junior-level (300-399) courses in Chemistry or Physics . To do so:

  • The subject of the course must be CHEM or PHYS .
  • The course number must be greater than or equal to 300 .
  • The course number must also be less than 400 .

Naively writing this conditional results in the following code:
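The exact naive code is not preserved in the text, but it follows from the step-by-step breakdown of the default evaluation order; a sketch on three toy rows shows the 500-level CHEM course slipping through:

```python
import pandas as pd

# Demo rows: a 500-level CHEM course, a 300-level PHYS course, a 500-level PHYS course
df = pd.DataFrame({
    "Subject": ["CHEM", "PHYS", "PHYS"],
    "Number":  [512, 325, 560],
})

# Naive conditional: & binds before |, so the number limits apply only to PHYS,
# and every CHEM course passes regardless of its number
wrong = df[(df.Subject == "CHEM") | (df.Subject == "PHYS") & (df.Number >= 300) & (df.Number < 400)]
print(wrong)
```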

Default Order of Evaluation: AND before OR

If we do not use additional parenthesis, Python will always combine the ANDs first and then the ORs and will do so in left-to-right order. This means that:

The first set of two conditionals combined will be the first AND conditional: (df.Subject == "PHYS") & (df.Number >= 300) . The result contains all courses in PHYS with a number of 300 or greater.

The second set of two conditionals will be the result from #1 with the second AND : (Result of Step #1) & (df.Number < 400) . The result contains all courses in PHYS with a number from 300-399.

The final set of conditionals will be combined using OR : (df.Subject == "CHEM") | (Result of Step #2) . Since this is an OR , the result is ALL CHEM courses and then only the PHYS courses in the number 300-399.

We can verify our result by running the code:

The output of the incorrect logic that does not use parentheses, which includes 500-level CHEM courses (92 rows).

Notice that the output looks plausible at first glance, but the number range was never applied to Chemistry: the result includes CHEM courses at every level, including 500-level courses -- yikes!

Order of Evaluation: Using Parenthesis to Specify Order

Python uses parentheses in a similar way to basic mathematics, where the innermost operations are done first. In our example, we want to make sure that all Chemistry and Physics courses are combined first, and only then can we limit the range of course numbers to the junior level.

By grouping both of these logical operations together, our new conditional can be thought of as a combination of two complex conditionals:

(df.Subject == "CHEM") | (df.Subject == "PHYS") , selecting only courses that are Chemistry OR Physics

(df.Number >= 300) & (df.Number < 400) , selecting only courses numbered between 300 AND 399.

Joining these two conditionals together with an AND results in the exact output we expect:

All 300-level courses in chemistry or physics (11 rows).
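The corrected, fully parenthesized conditional can be sketched on the same toy rows used for the naive version; this time only the 300-level PHYS course survives:

```python
import pandas as pd

# Same demo rows as the naive example
df = pd.DataFrame({
    "Subject": ["CHEM", "PHYS", "PHYS"],
    "Number":  [512, 325, 560],
})

# Parentheses group the subject OR first, then AND with the number range
right = df[((df.Subject == "CHEM") | (df.Subject == "PHYS"))
           & (df.Number >= 300) & (df.Number < 400)]
print(right)
```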


data.table conditional assignment

IMAGES

  1. Excel Conditional Formatting tutorial with examples

    data.table conditional assignment

  2. Conditional Formatting with 3 Color Scale in Excel Formula

    data.table conditional assignment

  3. Conditional formatting in Excel

    data.table conditional assignment

  4. Excel Conditional Formatting tutorial with examples

    data.table conditional assignment

  5. Conditional Formatting in Excel

    data.table conditional assignment

  6. Excel Conditional Formatting Tutorial

    data.table conditional assignment

VIDEO

  1. Excel Conditional Formatting I Advance Excel I Excel Tricks & Tips

  2. Data Analytics |Excel |CONDITIONAL FORMATTING |IF|COUNTIF

  3. Excel tip data bars in conditional formatting

  4. Data Analysis with R Data.Table with example & code

  5. VHDL Course: session 9 (Chapter 4: Selected signal assignment)

  6. How to Create Pivot Table with multiple Tables ?

COMMENTS

  1. Using conditional statements in r data.table

    23. I am trying to use data.table to recode a variable based on certain conditions. My original dataset has around 30M records and after all variable creation around 130 variables. I used the methods suggested here: conditional statements in data.table (M1) and also here data.table: Proper way to do create a conditional variable when column ...

  2. r

    I am trying to use conditional statements to obtain some variables in a data table. Here's some simplified data, the code and the results: > dt id trial bet outcome 1: 11 1 1 ...

  3. Advanced tips and tricks with data.table

    Fast looping with set. Using shift for to lead/lag vectors and lists. Create multiple columns with := in one statement. Assign a column with := named with a character object. 2. BY. Calculate a function over a group (using by) excluding each entity in a second category. METHOD 1: in-line. METHOD 2: using {} and .SD.

  4. Assignment by reference

    Since [.data.table incurs overhead to check the existence and type of arguments (for example), set() provides direct (but less flexible) assignment by reference with low overhead, appropriate for use inside a for loop. See examples. := is more powerful and flexible than set() because := is intended to be combined with i and by in single queries ...

  5. data.table in R

    setDT(df) The difference between the two approaches is: data.table(df) function will create a copy of df and convert it to a data.table. Whereas, setDT(df) converts it to a data.table inplace. That means, the df itself gets converted to a data.table and you don't have to assign it to a different object.

  6. Using .SD for Data Analysis

    The := is an assignment operator to update the data.table in place without making a copy. See reference semantics for more. The LHS, names(.SD), indicates which columns we are updating - in this case we update the entire .SD. The RHS, lapply(), loops through each column of the .SD and converts the column to a factor.

  7. A gentle introduction to data.table · Home

    data.table is one of the greatest R package. It provides an enhanced version of base R's data.frame with syntax and feature enhancements, making data manipulation concise, consistent, efficient, and fun! This post gives a quick introduction to data.table. The main objective is to present the data.table syntax, showing how to perform basic ...

  8. User Guide

    Apply to different items in a group. Sort groups by aggregated data. Create a value counts column and reassign back to the DataFrame. Shift groups of the values in a column based on the index. Frequency table like plyr in R. Missing functionality. Comparison with R's data.table. Subsetting Rows. Selecting columns.

  9. assign: Assignment by reference in data.table: Extension of data.frame

    A data.table. Or, set() accepts data.frame, too. Optional. In set (), integer row numbers to be assigned value. NULL represents all rows more efficiently than creating a vector such as 1:nrow(x). In set (), integer column number to be assigned value. Value to assign by reference to x[i,j].

  10. Dataframe Conditional Selection in R

    Dataframe Conditional Selection in R. Using the square bracket with conditional selection in the data frame, you can subset the data frame in r and extract the observations you would like to keep or exclude. You can also subsetting the data frame in r using the subset () function in R. In this tutorial, we will discuss how to subset rows of ...

  11. Set Pandas Conditional Column Based on Values of Another Column

    With the syntax above, we filter the dataframe using .loc and then assign a value to any row in the column (or columns) where the condition is met. Let's try this out by assigning the string 'Under 30' to anyone with an age less than 30, and 'Over 30' to anyone 30 or older. df[ 'Age Category'] = 'Over 30'.

  12. assign: Assignment by reference in data.table: Extension of `data.frame`

    Since [.data.table incurs overhead to check the existence and type of arguments (for example), set() provides direct (but less flexible) assignment by reference with low overhead, appropriate for use inside a for loop. See examples. := is more powerful and flexible than set() because := is intended to be combined with i and by in single queries ...

  13. DataFrames with Conditionals

    The first set of two conditionals combined will be the first AND conditional: (df.Subject == "PHYS") & (df.Number >= 300). The result contains all courses in PHYS with a number larger than 300. The second set of two conditionals will be the result from #1 with the second AND: (Result of Step #1) & (df.Number < 400).

  14. Conditional formatting in datatable R

    I would like to format the table so it looks like this with the values where the rows in the Rider column match the column names are highlighted (change background colour ideally): I have tried this and a few other options but to no avail.

  15. R

    I need to create a column in a data frame with a string of yrs that will identify each yr as "leap" or "reg" (regular) automatically. This is what I have thus far: Delimit the time period. year<-(2009:2017) Create a data frame with a single column for that time period. prd_df<-data.frame(year)

  16. Apply a condition on specific column data

    Firstly, I have the following table: The column which enclosed by red color display 2 types of account, the value 1 = Free and value 2 = paid (free, paid accounts). I want before rendering the data, apply a condition to change the 1 to free and 2 to paid. that's it. Table initialization: var dataTableY = $('#table').DataTable({. serverSide: true,

  17. Python dash dataTable style_cell_conditional

    I want to set 2 conditions in style_cell_conditional. The code below doesn't work. The code below doesn't work. Error:SyntaxError: expression cannot contain assignment, perhaps you meant "=="?

  18. jQuery DataTables columnDefs based on if condition

    In DataTables, in the columnDefs, can we do if condition statement? Based on variable value, we decide which column to show / hide below. How do I solve the below issue? The below code throws this ...

  19. Python Dash DataTable with data bars conditional formating

    3. You can add multiple conditions based on the day. For example (in this case the bar will be colored in blue in case the day is not Saturday, and in case the day is Saturday then it will be colored in red): 'if': {. 'filter_query': (. '{{{column}}} >= {min_bound}' +.