Common Core State Standards Initiative

  • Standards for Mathematical Practice

The Standards for Mathematical Practice describe varieties of expertise that mathematics educators at all levels should seek to develop in their students. These practices rest on important “processes and proficiencies” with longstanding importance in mathematics education. The first of these are the NCTM process standards of problem solving, reasoning and proof, communication, representation, and connections. The second are the strands of mathematical proficiency specified in the National Research Council’s report Adding It Up: adaptive reasoning, strategic competence, conceptual understanding (comprehension of mathematical concepts, operations and relations), procedural fluency (skill in carrying out procedures flexibly, accurately, efficiently and appropriately), and productive disposition (habitual inclination to see mathematics as sensible, useful, and worthwhile, coupled with a belief in diligence and one’s own efficacy).

Standards in this domain:

CCSS.Math.Practice.MP1 Make sense of problems and persevere in solving them.

Mathematically proficient students start by explaining to themselves the meaning of a problem and looking for entry points to its solution. They analyze givens, constraints, relationships, and goals. They make conjectures about the form and meaning of the solution and plan a solution pathway rather than simply jumping into a solution attempt. They consider analogous problems, and try special cases and simpler forms of the original problem in order to gain insight into its solution. They monitor and evaluate their progress and change course if necessary. Older students might, depending on the context of the problem, transform algebraic expressions or change the viewing window on their graphing calculator to get the information they need. Mathematically proficient students can explain correspondences between equations, verbal descriptions, tables, and graphs or draw diagrams of important features and relationships, graph data, and search for regularity or trends. Younger students might rely on using concrete objects or pictures to help conceptualize and solve a problem. Mathematically proficient students check their answers to problems using a different method, and they continually ask themselves, "Does this make sense?" They can understand the approaches of others to solving complex problems and identify correspondences between different approaches.

CCSS.Math.Practice.MP2 Reason abstractly and quantitatively.

Mathematically proficient students make sense of quantities and their relationships in problem situations. They bring two complementary abilities to bear on problems involving quantitative relationships: the ability to decontextualize —to abstract a given situation and represent it symbolically and manipulate the representing symbols as if they have a life of their own, without necessarily attending to their referents—and the ability to contextualize , to pause as needed during the manipulation process in order to probe into the referents for the symbols involved. Quantitative reasoning entails habits of creating a coherent representation of the problem at hand; considering the units involved; attending to the meaning of quantities, not just how to compute them; and knowing and flexibly using different properties of operations and objects.

CCSS.Math.Practice.MP3 Construct viable arguments and critique the reasoning of others.

Mathematically proficient students understand and use stated assumptions, definitions, and previously established results in constructing arguments. They make conjectures and build a logical progression of statements to explore the truth of their conjectures. They are able to analyze situations by breaking them into cases, and can recognize and use counterexamples. They justify their conclusions, communicate them to others, and respond to the arguments of others. They reason inductively about data, making plausible arguments that take into account the context from which the data arose. Mathematically proficient students are also able to compare the effectiveness of two plausible arguments, distinguish correct logic or reasoning from that which is flawed, and—if there is a flaw in an argument—explain what it is. Elementary students can construct arguments using concrete referents such as objects, drawings, diagrams, and actions. Such arguments can make sense and be correct, even though they are not generalized or made formal until later grades. Later, students learn to determine domains to which an argument applies. Students at all grades can listen or read the arguments of others, decide whether they make sense, and ask useful questions to clarify or improve the arguments.

CCSS.Math.Practice.MP4 Model with mathematics.

Mathematically proficient students can apply the mathematics they know to solve problems arising in everyday life, society, and the workplace. In early grades, this might be as simple as writing an addition equation to describe a situation. In middle grades, a student might apply proportional reasoning to plan a school event or analyze a problem in the community. By high school, a student might use geometry to solve a design problem or use a function to describe how one quantity of interest depends on another. Mathematically proficient students who can apply what they know are comfortable making assumptions and approximations to simplify a complicated situation, realizing that these may need revision later. They are able to identify important quantities in a practical situation and map their relationships using such tools as diagrams, two-way tables, graphs, flowcharts and formulas. They can analyze those relationships mathematically to draw conclusions. They routinely interpret their mathematical results in the context of the situation and reflect on whether the results make sense, possibly improving the model if it has not served its purpose.

CCSS.Math.Practice.MP5 Use appropriate tools strategically.

Mathematically proficient students consider the available tools when solving a mathematical problem. These tools might include pencil and paper, concrete models, a ruler, a protractor, a calculator, a spreadsheet, a computer algebra system, a statistical package, or dynamic geometry software. Proficient students are sufficiently familiar with tools appropriate for their grade or course to make sound decisions about when each of these tools might be helpful, recognizing both the insight to be gained and their limitations. For example, mathematically proficient high school students analyze graphs of functions and solutions generated using a graphing calculator. They detect possible errors by strategically using estimation and other mathematical knowledge. When making mathematical models, they know that technology can enable them to visualize the results of varying assumptions, explore consequences, and compare predictions with data. Mathematically proficient students at various grade levels are able to identify relevant external mathematical resources, such as digital content located on a website, and use them to pose or solve problems. They are able to use technological tools to explore and deepen their understanding of concepts.

CCSS.Math.Practice.MP6 Attend to precision.

Mathematically proficient students try to communicate precisely to others. They try to use clear definitions in discussion with others and in their own reasoning. They state the meaning of the symbols they choose, including using the equal sign consistently and appropriately. They are careful about specifying units of measure, and labeling axes to clarify the correspondence with quantities in a problem. They calculate accurately and efficiently, express numerical answers with a degree of precision appropriate for the problem context. In the elementary grades, students give carefully formulated explanations to each other. By the time they reach high school they have learned to examine claims and make explicit use of definitions.

CCSS.Math.Practice.MP7 Look for and make use of structure.

Mathematically proficient students look closely to discern a pattern or structure. Young students, for example, might notice that three and seven more is the same amount as seven and three more, or they may sort a collection of shapes according to how many sides the shapes have. Later, students will see 7 × 8 equals the well remembered 7 × 5 + 7 × 3, in preparation for learning about the distributive property. In the expression x² + 9x + 14, older students can see the 14 as 2 × 7 and the 9 as 2 + 7. They recognize the significance of an existing line in a geometric figure and can use the strategy of drawing an auxiliary line for solving problems. They also can step back for an overview and shift perspective. They can see complicated things, such as some algebraic expressions, as single objects or as being composed of several objects. For example, they can see 5 − 3(x − y)² as 5 minus a positive number times a square and use that to realize that its value cannot be more than 5 for any real numbers x and y.
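
A brief worked illustration (added here for concreteness; it is not part of the standard's text), written in LaTeX notation:

    \[
      x^2 + 9x + 14 = x^2 + (2 + 7)x + 2 \cdot 7 = (x + 2)(x + 7),
      \qquad
      5 - 3(x - y)^2 \le 5 \ \text{for all real } x, y .
    \]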

CCSS.Math.Practice.MP8 Look for and express regularity in repeated reasoning.

Mathematically proficient students notice if calculations are repeated, and look both for general methods and for shortcuts. Upper elementary students might notice when dividing 25 by 11 that they are repeating the same calculations over and over again, and conclude they have a repeating decimal. By paying attention to the calculation of slope as they repeatedly check whether points are on the line through (1, 2) with slope 3, middle school students might abstract the equation (y − 2)/(x − 1) = 3. Noticing the regularity in the way terms cancel when expanding (x − 1)(x + 1), (x − 1)(x² + x + 1), and (x − 1)(x³ + x² + x + 1) might lead them to the general formula for the sum of a geometric series. As they work to solve a problem, mathematically proficient students maintain oversight of the process, while attending to the details. They continually evaluate the reasonableness of their intermediate results.
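
A brief worked derivation (added for illustration) of the regularity described above, in LaTeX notation:

    \[
      (x - 1)\left(x^{n} + x^{n-1} + \cdots + x + 1\right) = x^{n+1} - 1
      \quad\Longrightarrow\quad
      1 + x + x^2 + \cdots + x^{n} = \frac{x^{n+1} - 1}{x - 1} \qquad (x \neq 1).
    \]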

Connecting the Standards for Mathematical Practice to the Standards for Mathematical Content

The Standards for Mathematical Practice describe ways in which developing student practitioners of the discipline of mathematics increasingly ought to engage with the subject matter as they grow in mathematical maturity and expertise throughout the elementary, middle and high school years. Designers of curricula, assessments, and professional development should all attend to the need to connect the mathematical practices to mathematical content in mathematics instruction.

The Standards for Mathematical Content are a balanced combination of procedure and understanding. Expectations that begin with the word "understand" are often especially good opportunities to connect the practices to the content. Students who lack understanding of a topic may rely on procedures too heavily. Without a flexible base from which to work, they may be less likely to consider analogous problems, represent problems coherently, justify conclusions, apply the mathematics to practical situations, use technology mindfully to work with the mathematics, explain the mathematics accurately to other students, step back for an overview, or deviate from a known procedure to find a shortcut. In short, a lack of understanding effectively prevents a student from engaging in the mathematical practices.

In this respect, those content standards which set an expectation of understanding are potential "points of intersection" between the Standards for Mathematical Content and the Standards for Mathematical Practice. These points of intersection are intended to be weighted toward central and generative concepts in the school mathematics curriculum that most merit the time, resources, innovative energies, and focus necessary to qualitatively improve the curriculum, instruction, assessment, professional development, and student achievement in mathematics.




NCTM Process Standards and Exemplars

Special thanks to Deb Armitage, Exemplars consultant and task contributor/editor, for her contribution.

Title I teachers are often challenged with assessing students' mathematical understanding. Traditional worksheets, chapter/unit tests, and norm-referenced tests typically do not provide enough support in discovering what a student truly understands about mathematical concepts. Teachers have found much needed support from the research of the National Council of Teachers of Mathematics (NCTM) and Exemplars.

Like the NCTM standards, Exemplars material places a greater emphasis on the "process standards." By focusing on mathematical problem solving and communication, Title I teachers are able to look more closely at what students' work shows about their mathematical understanding. A stronger emphasis on process encourages teachers to demonstrate and support students in learning a variety of strategies for solving a problem. Exemplars encourages students to show different methods for solving problems. A variety of approaches allows teachers and students to discuss the merits of each strategy. The individual student can then embrace an approach that s/he is comfortable using to determine a correct outcome. By encouraging communication in mathematics, Title I teachers are better able to pinpoint where problems in students' thinking arise and can provide more meaningful feedback as a result. This feedback then encourages students to focus on areas that need improvement.

The process of problem solving and communication helps students gain confidence in their ability to solve problems. Shifting the major emphasis from finding the correct answer (only) and placing it instead on the process and communication of "thinking" improves the mathematical disposition of many students. Increasingly, Title I students become more comfortable with mathematical problems when they have a "tool kit" of possible strategies to consider, an understanding of mathematical language, and a belief that their way of solving a problem is just as valid as a neighbor's strategy.


Writing About the Problem Solving Process to Improve Problem Solving Performance


Problem solving is generally recognized as one of the most important components of mathematics. In Principles and Standards for School Mathematics, the National Council of Teachers of Mathematics emphasized that instructional programs should enable all students in all grades to “build new mathematical knowledge through problem solving, solve problems that arise in mathematics and in other contexts, apply and adapt a variety of appropriate strategies to solve problems, and monitor and reflect on the process of mathematical problem solving” (NCTM 2000, p. 52). But how do students become competent and confident mathematical problem solvers?

Kenneth Williams is interested in problem solving and writing in the mathematics classroom.



Mathematics for Teaching

This site is NOT about making mathematics easy because it isn't. It is about making it make sense because it does.

NCTM Process Standards vs CCSS Mathematical Practices

The NCTM process standards, the Adding It Up mathematical proficiency strands, and the Common Core State Standards for Mathematical Practice are all saying the same thing, but why do I get the feeling that the Mathematical Practice Standards are out to get the math teachers?

The NCTM’s process standards of problem solving, reasoning and proof, communication, representation, and connections describe for me the nature of mathematics. They are not easy to understand, especially when you think that school mathematics is about stuffing students with knowledge of the content of mathematics. But over time you find yourself slowly shifting toward structuring your teaching in a way that students will understand and appreciate the nature of mathematics.

The five strands of proficiency were also a great help to me as a teacher and teacher-trainer because they gave me the vocabulary to describe what is important to focus on in teaching mathematics.


4 thoughts on “NCTM Process Standards vs CCSS Mathematical Practices”

Does your “exposure” mean you personally watched Japanese classrooms or were you told/did you read how classes are taught?

Yes, including development of materials with Japanese math educators. We had a project with them. Also, the APEC (Asia Pacific…) project about Lesson Study for math, which concluded last year, devoted each year of the project to one of the process standards. I think the last one was about representation and communication. If you are familiar with the Teaching Gap, the TIMSS video study by Stigler and Hiebert, you’ll get a picture of the Japanese math class. Of course, it doesn’t mean they all teach that way.

We teachers are under tremendous pressure to be accountable for what students attain, which means ensuring students do well on tests. It’s all about ‘data’ and Grade Level Content Expectations now. I fear both teachers and students will blow a fuse.

Really a nice addition in Mathematics Education



1.3: Mathematics in Preschool


  • Janet Stramel
  • Fort Hays State University



Preschool children are very active, and any program and teacher must take that into account. Emotionally, preschoolers are inquisitive and explorative. Cognitively, they are in the preoperational stage of development in which they begin to engage in symbolic play and learn to manipulate symbols. However, Piaget noted that children at this stage do not yet understand concrete logic. Children at this stage learn through pretend play.

During the first half of the preoperational stage, children are in the “symbolic function substage.” Children at this stage are generally two- to four-years old. They let one object stand in for another and use symbols and signs, such as numbers. They do this through pretend play; therefore, give your preschool children as much time as possible for imaginative play. This then leads to the “intuitive thought substage” in which children are not logical, but think intuitively. Children at this stage ask many questions and are very curious.

Douglas Clements (2001) suggests that we need preschool mathematics for four reasons: 1) preschoolers experience mathematics at a basic level, and that needs to be improved; 2) many children, especially those from minority backgrounds or underrepresented groups, have difficulty in school mathematics, and therefore preschool teachers should address those equity issues; 3) preschoolers do possess informal mathematical abilities and use mathematical ideas in real life, and preschool teachers should capitalize on their interests; and 4) brain research has shown that preschoolers’ brains undergo significant development, that their experiences and learning affect the structure and organization of their brains, and that preschoolers’ brains grow most as a result of complex activities (Clements, 2001).

Preschool mathematics can be divided into two groups: numerical and measurement. Numerical activities are discrete while measurement activities are continuous. Numerical concepts ask the question, “How many?” and are referred to as discrete quantities because they can be counted.

Mathematics during the preschool years should focus on number, geometry, measurement, algebra and patterns, and problem-solving. At age three, children can hold up a number of fingers to indicate a quantity.

Rote counting is the ability to say the numbers in order and involves the memorization of numbers; meaningful or rational counting is the ability to assign a number to the objects counted. Children at age three can hold up fingers to indicate a quantity and by age four can count to five or ten, and can tell you what comes next. Cardinal numbers say how many of something there are, such as one, two, three, four, five; and they answer the question “How many?” Ordinal numbers tell the position of something in the list, such as first, second, third, fourth, fifth, etc.

There are “rules” for writing and saying number words; it is the base-ten number system that we use (10 digits, 0-9) and we place the greater value on the left. For example, three hundred sixty-two is written as 362. According to Seo and Ginsburg (2004), a child’s ability to write and say numbers does not guarantee their application. Just because a child can say numbers does not mean that they know the quantity associated with that number. Therefore, preschool children begin to put their understanding of “one” to use as they “count up” and develop the meaning of adding one more.

Preschool children must develop the concepts of order and seriation. Order is the ability to count a number of objects once and only once. Seriation is the process of putting objects in a series, for example from smallest to largest. Additionally, young children begin to group objects by their characteristics, such as yellow and blue.

Preschool children can use directional words such as “up and down” and “over and under,” as well as comparing words such as “bigger and smaller” or “longer and shorter.” Additionally, children at ages three to four recognize and name shapes. Naming a shape is not mathematics; it is language arts. Mathematics comes in as students recognize and classify the attributes of those shapes. Additionally, students are beginning to compose and decompose shapes. For example, they may be able to make a square with two triangles.

Measurement

Three-year olds can lay two objects side-by-side and tell which one is longer. By age four, children begin to use non-standard units to measure things. For example, they can tell you how many shoes long a desk is, although they need to use many shoes. They are not yet ready to use one shoe repeatedly.

Children in preschool do not learn to tell time, but they are learning the concept of time. They talk about yesterday, today, and tomorrow.

Algebra and Patterns

Preschool children do algebra by recreating patterns and making their own patterns. Children can recognize, describe, extend, and create patterns from a simple repeating pattern such as “red, blue, red, blue” to a more complex pattern such as “red, red, blue, red, red, blue.” They also notice growing patterns such as “1, 2, 3, 4” or “2, 4, 6, 8.”

Problem-Solving

Problem solving is critical at all levels. Allow students to solve a problem without stepping in too quickly. Children begin to link words and concepts; therefore, teachers can begin to use story problems for teaching mathematics. In kindergarten, the words should be simple and short and by first grade, students begin to write their strategy in the problem-solving process.

Each of these content areas will be further developed in subsequent chapters.


Critical Mathematics Concepts for Preschool Children

Activities in the preschool classroom must incorporate the use of manipulatives and hands-on learning, and the main emphasis should be on number sense. Do not use worksheets or independent practice in the preschool classroom; instead, plan activities that will develop a strong sense of number and patterns. The following domains are critical concepts as you teach preschool children to be mathematically proficient:

Counting and Cardinality

Number sense is the foundation for success in mathematics and is the first vital skill for preschool children (Resilient Educator, 2021). The ability to count accurately is part of number sense, but so is seeing the relationships between numbers, such as in addition and subtraction. Children should be able to demonstrate simple counting skills before kindergarten. This includes counting to 20, ordering numbers, identifying how many are in a set without counting (subitizing), and understanding that the quantity does not change regardless of the arrangement of the items. Additionally, preschool children should understand cardinality, in which the last number said is the number of items in the set.

Operations and Algebraic Thinking

Mathematical ideas become “real” when teachers and students use words, pictures, symbols, and objects. Young children are naturally visual and can build those relationships between numbers and the item represented; therefore, teachers must use pictures and objects to clarify that relationship. As children learn to count, they will learn that the number symbol represents the number of items shown.

Patterns are things that repeat in a logical way. Manipulatives can help children sort, count, and see patterns. An AB pattern means that two items alternate, such as red, blue, red, blue, red, etc. An ABC pattern means that three items repeat, such as bear, cow, giraffe, bear, cow, giraffe, etc. Students will learn to make predictions about what comes next in a pattern.

Number and Operations in Base 10

Preschool children begin to understand that the number “ten” is made up of “ten ones,” although this is a difficult concept. Teachers should allow children to count on their fingers one to ten.

Showing students the meaning of the words more, less, bigger, smaller, more than, and less than can help young children understand estimation.

Measurement and Data

Finding length, height, and weight using inches, feet, pounds, or non-standard units is measurement. Also in this skill area is measurement of time. Teachers should ask their students to notice objects in their world and compare them, for example, “The stepstool is bigger than the chair. Do you think it will fit under the chair?”

Sorting is a skill that preschool children should do often. One way to sort is by color; another way is by another attribute. Teachers can ask students to count the toys in a basket, and then sort them based on size, color, or their purpose. Check out this excerpt from the book Exploring Math and Science in Preschool by the National Association for the Education of Young Children – Sorting Activities for Preschoolers by William C. Ritz.

Spatial sense is geometry, but at the preschool level it is the ability to recognize shape, size, space, position, direction, and movement. Teachers can talk with children about shapes – count the sides or describe the shape. Furthermore, talk with children about shapes in their world, such as “The pizza is round,” or “The sandwich is a rectangle.”

Calendar Time

Morning calendar time is a daily part of many preschool classrooms. There is a ritual when children sit on the floor and talk about today, look at yesterday, find out about tomorrow, and write out the date. Understanding that time is sequential is critical for young children. They think about before and after, later and earlier, and future and past events. According to Beneke, Ostrosky, and Katz (2008), preschool children generally cannot judge distances or lengths of time. For example, they do not understand how a field trip that is five days away differs from one that is eight days away, and judging units of time is difficult for young children. And although a true understanding of calendar dates comes with maturity, using the calendar to teach other concepts is also valuable time spent in the classroom. For example, vocabulary (month, year, weekend), sequencing (yesterday, today, and tomorrow), and patterns (Monday, Tuesday, Wednesday). They also begin to recognize numbers. Additionally, teachers can use calendar time to teach social skills, colors, letters, and integrate science as they talk about the weather (Beneke, Ostrosky, & Katz, 2008).


Manipulatives

Manipulatives are the mainstay of a preschool mathematics classroom (Geist, 2009). Math manipulatives are physical objects that are designed to represent mathematical ideas explicitly and concretely (Moyer, 2001). Students need time to explore and manipulate materials in order to learn the mathematics concept. According to Carol Copple (2004), children should be given many opportunities to manipulate a wide variety of things, and teachers should provide children opportunities to “mess about.”

One productive belief from the NCTM publication Principles to Actions (2014) states, “Students at all grade levels can benefit from the use of physical and virtual manipulative materials to provide visual models of a range of mathematical ideas.” Students at all grade levels can benefit from manipulatives, but especially at the elementary level. Using manipulatives can

  • provide your students a bridge between the concrete and abstract.
  • serve as models that support students’ thinking.
  • provide another representation.
  • support student engagement.
  • give students ownership of their own learning.

Adapted from “ The Top 5 Reasons for Using Manipulatives in the Classroom .”

Everyday activities can be used to promote mathematics. For example, during snack time children divide up snacks, count plates, and notice the one-to-one correspondence between the number of children and the number of napkins needed.

Tackling Decision Processes with Non-Cumulative Objectives using Reinforcement Learning

Markov decision processes (MDPs) are used to model a wide variety of applications ranging from game playing over robotics to finance. Their optimal policy typically maximizes the expected sum of rewards given at each step of the decision process. However, a large class of problems does not fit straightforwardly into this framework: Non-cumulative Markov decision processes (NCMDPs), where instead of the expected sum of rewards, the expected value of an arbitrary function of the rewards is maximized. Example functions include the maximum of the rewards or their mean divided by their standard deviation. In this work, we introduce a general mapping of NCMDPs to standard MDPs. This allows all techniques developed to find optimal policies for MDPs, such as reinforcement learning or dynamic programming, to be directly applied to the larger class of NCMDPs. Focusing on reinforcement learning, we show applications in a diverse set of tasks, including classical control, portfolio optimization in finance, and discrete optimization problems. Given our approach, we can improve both final performance and training time compared to relying on standard MDPs.

1 Introduction

Markov decision processes (MDPs) are used to model a wide range of applications where an agent iteratively interacts with an environment during a trajectory. Important examples include robotic control [Andrychowicz et al. (2020a)], game playing [Mnih et al. (2015)], or discovering algorithms [Mankowitz et al. (2023)]. At each time step $t$ of the MDP, the agent chooses an action based on the state of the environment and receives a reward $r_t$. The agent's goal is to follow an ideal policy that maximizes

\mathbb{E}_{\pi}\left[ \sum_{t=0}^{T-1} r_t \right] , \qquad (1)

where $T$ is the length of the trajectory and the expectation value is taken over both the agent's probabilistic policy $\pi$ and the probabilistic environment. There exist a multitude of established strategies for finding (approximately) ideal policies of MDPs, such as dynamic programming and reinforcement learning [Sutton & Barto (2018)].

A limitation of the framework of MDPs is the restriction to ideal policies that maximize Equation 1, while a large class of problems cannot straightforwardly be formulated this way. For example, in weakest-link problems, the goal is to maximize the minimum rather than the sum of rewards, e.g. in network routing one wants to maximize the minimum bandwidth along a path [Cui & Yu (2023)]. In finance, the Sharpe ratio, i.e. the mean divided by the standard deviation of portfolio gains, is an important figure of merit of an investment strategy, since maximizing it will yield more risk-averse strategies than maximizing the sum of portfolio gains [Sharpe (1966)]. We therefore require a method to tackle non-cumulative Markov decision processes (NCMDPs), where instead of the expected sum of rewards, the expectation value of an arbitrary function of the rewards is maximized. First steps in this direction have already been taken, but are currently limited to specific settings, e.g. by being restricted to only certain MDP solvers, deterministic environments, and restricted classes of non-cumulative objectives (for more details on these approaches see Section 4). The main contribution of this paper is twofold:

First, we provide a theoretical framework for a general and easy-to-implement mapping of NCMDPs to corresponding standard MDPs. This allows the direct application of advanced MDP solvers such as reinforcement learning or dynamic programming to tackle also NCMDPs (see Figure 1).

Next, we perform numerical experiments focusing on reinforcement learning. We show applications in classical control problems, portfolio optimization, and discrete optimization problems using e.g. the non-cumulative $\max$ objective and the Sharpe ratio. Using our framework, we find improvements in both training time and final performance as compared to relying on standard MDPs.

2 Theoretical Analysis

Preliminaries: Non-cumulative Markov decision processes.

In the following, we denote states and rewards related to NCMDPs by $\tilde{s}_t$ and $\tilde{r}_t$, respectively, to distinguish them from MDPs. We define NCMDPs equivalently to MDPs except for their ideal policies maximizing the expectation value of an arbitrary scalar function $f$ of the immediate rewards $\tilde{r}_t$ instead of their sum, i.e.

\mathbb{E}_{\pi}\left[ f(\tilde{r}_0, \dots, \tilde{r}_{T-1}) \right] . \qquad (2)

NCMDPs still have a Markovian transition probability distribution, but their return, i.e. Equation 2, depends on the rewards in a non-cumulative and therefore non-Markovian way. Thus, NCMDPs are a generalization of MDPs. Due to this distinction, MDP solvers cannot straightforwardly be used for NCMDPs. To solve this problem, we describe a general mapping from an NCMDP $\tilde{M}$ with rewards $\tilde{r}_t$ and states $\tilde{s}_t$ to a corresponding MDP $M$ with adapted states $s_t$ and adapted rewards $r_t$ but the same actions. The mapping is chosen such that the optimal policy of $\tilde{M}$ is equivalent to the optimal policy of $M$. Therefore, a solution for $\tilde{M}$ can readily be obtained by solving $M$ with existing MDP solvers, as shown in Figure 1.

As a first step, we define the adapted rewards as the temporal differences of the non-cumulative return,

r_0 = f(\tilde{r}_0) , \qquad r_t = f(\tilde{r}_0, \dots, \tilde{r}_t) - f(\tilde{r}_0, \dots, \tilde{r}_{t-1}) \quad \text{for } t \ge 1 .

As $\mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1} r_t\right] = \mathbb{E}_{\pi}\left[f(\tilde{r}_0, \dots, \tilde{r}_{T-1})\right]$, the ideal policy of $M$ is also the ideal policy of $\tilde{M}$. However, the $r_t$ are now non-Markovian, since they depend on rewards in the past of the trajectory. Recovering theoretical convergence guarantees available for MDP solvers requires adapting our state space to recover the Markov property of $M$. In principle, methods developed based on past [Bacchus et al. (1996, 1997)] or future [Thiébaux et al. (2006)] linear temporal logic could be employed to achieve this. However, we argue that for a wide range of functions $f$, a more straightforward approach is possible: The primary challenge is that we need to access the non-cumulative return in a purely Markovian manner. This can be achieved by extending the state in a way that preserves all the information necessary to compute this return. The additional state information $x_t$ needs to be sufficient to compute $r_t$, i.e. the temporal difference of the non-cumulative returns $f$. Concretely, we need to define a vector $x_t$ and two functions $y$ and $z$, such that the following relations hold:

x_{t+1} = y(x_t, \tilde{r}_t) , \qquad r_t = z(x_t, \tilde{r}_t) .

With the Dirac delta function $\delta$, the transition probabilities $p$ of $M$ are then given in terms of the transition probabilities $\tilde{p}$ of $\tilde{M}$ by

There are several possible choices for $x_t$, $y$, and $z$. For example, it is always possible to choose $x_t = (r_0, \dots, r_{t-1})$. However, this is undesirable because it leads to a state size that grows with the trajectory length. Fortunately, it is possible to find $x_t$ of a small, constant size for many functions $f$. For example, for $f(\tilde{r}_0, \dots, \tilde{r}_t) = \min(\tilde{r}_0, \dots, \tilde{r}_t)$, we can choose $r_0 = x_1 = \tilde{r}_0$, followed by $x_{t+1} = y(x_t, \tilde{r}_t) = \min(x_t, \tilde{r}_t)$ and $r_t = z(x_t, \tilde{r}_t) = \min(0, \tilde{r}_t - x_t)$. More examples are shown in Table 1. By mapping the NCMDP to an MDP, all guarantees available for MDP solvers, e.g. convergence proofs for Q-learning [Watkins & Dayan (1992)] or the policy gradient theorem for policy-based methods [Sutton et al. (1999)], directly apply also to NCMDPs.
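
To make the min example concrete, the following minimal Python sketch (our own illustration, not code from the paper) computes the adapted rewards from the raw rewards using the running statistic $x_t$ and checks that they sum to the non-cumulative return $\min_t \tilde{r}_t$:

    import random

    def adapted_rewards_min(raw_rewards):
        """Adapted MDP rewards for the non-cumulative objective f = min (illustrative sketch).
        The returned rewards telescope: their sum equals min(raw_rewards)."""
        adapted = []
        x = None  # running minimum of past raw rewards, i.e. the extra state information x_t
        for r_raw in raw_rewards:
            if x is None:
                r = r_raw                      # r_0 = f(r~_0) = r~_0
            else:
                r = min(0.0, r_raw - x)        # r_t = z(x_t, r~_t) = min(0, r~_t - x_t)
            x = r_raw if x is None else min(x, r_raw)  # x_{t+1} = y(x_t, r~_t) = min(x_t, r~_t)
            adapted.append(r)
        return adapted

    raw = [random.uniform(-1.0, 1.0) for _ in range(10)]
    assert abs(sum(adapted_rewards_min(raw)) - min(raw)) < 1e-12

The same pattern, a small running statistic plus a temporal-difference reward, carries over to the other functions $f$ listed in Table 1.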

We demonstrate the need for the additional state information $x_t$ in the two-step NCMDP depicted in Figure 2. Here, the goal is to maximize the expected minimum of the rewards, i.e. $\min(\tilde{r}_0, \tilde{r}_1)$. If $\tilde{r}_0 = 1$ in the first step, the ideal policy is to choose $a_1$ in the second step, while if $\tilde{r}_0 = -1$ the agent should choose $a_0$. Therefore, the choice of the ideal action depends on information contained in the past rewards, which is captured by $x_1 = \tilde{r}_0$. We show the MDP constructed from the NCMDP as described above in the Appendix in Figure A1 and its state-action value function found by value iteration in Table A1.

3 Experiments

From an implementation perspective, our scheme of mapping NCMDPs to MDPs requires minimal effort, since we can treat both the NCMDP and the used MDP solver as black boxes by simply adding a layer between them, as shown in Figure 1. In addition, our treatment facilitates online learning and does not require any computationally expensive preprocessing of the NCMDP. This opens the door for researchers with specific domain knowledge, who are not necessarily experts in reinforcement learning, to use standard libraries such as stable-baselines3 [Raffin et al. (2021)] to solve their non-cumulative problems. Conversely, it allows reinforcement learning experts to quickly tackle existing environments using our method.
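
As a concrete illustration of such a layer, here is a hedged sketch (our own, not the authors' released code) of a gymnasium wrapper for the non-cumulative objective $f = \max$: it appends the running statistic $x_t$ to the observation and replaces each raw reward with its temporal-difference counterpart, assuming the wrapped environment has a Box observation space.

    import numpy as np
    import gymnasium as gym

    class MaxObjectiveWrapper(gym.Wrapper):
        """Illustrative sketch: turns an environment whose desired return is max_t r~_t
        into a standard MDP by (i) appending the running maximum x_t of past raw rewards
        to the observation and (ii) emitting temporal-difference rewards r_t."""

        def __init__(self, env):
            super().__init__(env)
            # Assumes a Box observation space; extend it by one entry for x_t.
            low = np.append(env.observation_space.low, -np.inf)
            high = np.append(env.observation_space.high, np.inf)
            self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float64)
            self._x = None  # running maximum of past raw rewards

        def reset(self, **kwargs):
            obs, info = self.env.reset(**kwargs)
            self._x = None  # no rewards seen yet; x_t is reported as 0 until the first step
            return self._augment(obs), info

        def step(self, action):
            obs, raw_reward, terminated, truncated, info = self.env.step(action)
            if self._x is None:
                reward = float(raw_reward)                      # r_0 = r~_0
            else:
                reward = max(0.0, float(raw_reward) - self._x)  # max(0, r~_t - x_t)
            self._x = float(raw_reward) if self._x is None else max(self._x, float(raw_reward))
            return self._augment(obs), reward, terminated, truncated, info

        def _augment(self, obs):
            x = 0.0 if self._x is None else self._x
            return np.append(np.asarray(obs, dtype=np.float64), x)

A standard solver can then be trained on the wrapped environment without modification, e.g. PPO("MlpPolicy", MaxObjectiveWrapper(base_env)).learn(...) with stable-baselines3, where base_env is whatever gymnasium environment one wants to optimize a max-of-rewards objective for.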

In the following, we focus on reinforcement learning (RL). We provide details on hyperparameters, compute resources, and training curves for all experiments in Appendix A.

3.1 Classical Control

As a first use case of our method, we train a reinforcement learning agent in the Lunar Lander environment of the gymnasium library [Towers et al. (2023)]. The agent controls a spacecraft with four discrete actions corresponding to different engines while being pushed by a stochastic wind. Immediate positive rewards $r_t$ are given for landing the spacecraft safely, with small negative rewards given for using the engines. A realistic goal when landing a spacecraft is to not let the spacecraft get too fast, e.g. to avoid excessive frictional heating. Therefore, we define an NCMDP where the agent is penalized for its maximum speed during a trajectory, i.e. we try to maximize

\mathbb{E}_{\pi}\left[ \sum_{t=0}^{T-1} r_t - c \max_t v_t^2 \right] ,

where $v_t$ is the speed of the agent at time $t$ and $c$ defines a trade-off between minimizing the maximum speed and the other goals of the agent. We train RL agents using Proximal Policy Optimization (PPO) [Schulman et al. (2017)] on the MDP constructed from the NCMDP as described above (MAXVELPPO). To show the trade-off between the sum of the agents' original cumulative rewards $\sum_{t=0}^{T-1} r_t$ and their maximum speed increase during a trajectory, we train agents for different $c$. Each value of $c$ corresponds to a marker in Figure 3. We compare our results to an RL agent with a similar but cumulative objective maximizing $\mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1} (r_t - c v_t^2)\right]$ (SUMVELPPO). As shown in Figure 3, the non-cumulative MAXVELPPO agent is consistently able to find a better trade-off than the cumulative SUMVELPPO agent. All experiments were performed with 5 agents using different seeds. We plot average results and standard deviations as error bars for both the cumulative reward and the speed increase (mostly hidden behind markers).
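
For illustration, a minimal sketch (our own; the function name and the squared-speed penalty are assumptions based on the cumulative baseline above) of how this mixed sum-plus-max objective decomposes into adapted per-step rewards: the extra state tracks the largest squared speed seen so far, and only increases of that running maximum are penalized.

    def adapted_reward_maxvel(raw_reward, speed, x_max_sq, c):
        """One step of the adapted reward for sum_t r_t - c * max_t v_t^2 (illustrative sketch).
        x_max_sq is the running maximum of squared speeds before this step (None at t = 0).
        Returns the adapted reward and the updated running maximum."""
        v_sq = speed ** 2
        if x_max_sq is None:
            penalty = c * v_sq                        # first step: current speed sets the maximum
        else:
            penalty = c * max(0.0, v_sq - x_max_sq)   # penalize only increases of the running max
        new_x = v_sq if x_max_sq is None else max(x_max_sq, v_sq)
        return raw_reward - penalty, new_x

Summed over a trajectory, the penalties telescope to $c$ times the maximum squared speed, so the cumulative return of the adapted MDP matches the non-cumulative objective.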

Our method presented here could be applied to similar use cases in other classical control problems, such as teaching a robot to reach a goal while minimizing the maximum impact forces on its joints or the forces its motors need to apply.

3.2 Portfolio Optimization with Sharpe Ratio as Objective

Next, we consider the task of portfolio optimization, where an agent decides how to best invest its assets across different possibilities. A common measure for a successful investment strategy is its Sharpe ratio [Sharpe (1966)],

S = \operatorname{mean}(\tilde{r}_0, \dots, \tilde{r}_{T-1}) \,/\, \operatorname{std}(\tilde{r}_0, \dots, \tilde{r}_{T-1}) ,

where $\tilde{r}_t = (P_{t+1} - P_t)/P_t$ are the simple returns and $P_t$ is the portfolio value at time $t$. By dividing through the standard deviation of the simple returns, the agent is discouraged from risky strategies with high volatility. As the Sharpe ratio is non-cumulative, reinforcement learning strategies so far needed to fall back on the approximate differential Sharpe ratio as a reward [Moody et al. (1998); Moody & Saffell (2001)]. However, using the methods developed in this paper we can directly maximize the exact Sharpe ratio. We perform experiments on an environment as described by Sood et al. (2023), where an agent trades on the 11 different S&P 500 sector indices between 2006 and 2021. The states contain a history of returns of each index and different volatility measures. Actions are the continuous relative portfolio allocations for each day. The experiment described by Sood et al. (2023) consists of training 5 agents with different seeds over a 5-year period, periodically evaluating their performance on the following year, and testing the best-performing agent in the year after that. Then, the time period is shifted by one year into the future, resulting in a total of 10 time periods. We re-implement this experiment by training agents using PPO with the cumulative differential Sharpe ratio (DIFFSHARPE) or the exact Sharpe ratio (SHARPE) as their objective. As depicted in Figure 4, the SHARPE algorithm significantly outperforms the DIFFSHARPE algorithm in the training years. However, as shown in Table 2, on the evaluation and test years there is no significant difference, indicating over-fitting of the agents' policies to the years they are trained on. Nonetheless, we expect training on the exact Sharpe ratio to give consistently better results if the problem of over-fitting is solved, e.g. by using realistic stock-market simulators, which are for example being developed using Generative Adversarial Networks [Li et al. (2020)]. The experiments were performed with 5 sets of 5 seeds each, and the standard deviation across different sets of seeds is reported.

While the Sharpe ratio is most widely adopted in finance, our method opens up the possibility to maximize it also in other scenarios where risk-adjusted rewards are desirable, i.e. all problems where consistent rewards with low variance are more important than a higher cumulative reward. For example, in chronic disease management, maintaining stable health metrics is preferable to sporadic improvements. In emergency or customer service, ensuring predictable response times is often more important than occasional fast responses mixed with slow ones.
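
As a minimal sketch of what maximizing the exact Sharpe ratio entails (our own illustration, assuming the population standard deviation and a small regularizer to avoid dividing by zero on the first step), the extra state information can be the running count, sum, and sum of squares of the simple returns, and the adapted reward is the temporal difference of the Sharpe ratio:

    import numpy as np

    def sharpe_adapted_rewards(simple_returns, eps=1e-8):
        """Per-step adapted rewards whose sum equals (up to the eps regularizer) the Sharpe
        ratio mean/std of the whole trajectory (illustrative sketch). The running statistics
        (count, sum, sum of squares) play the role of the extra state information x_t."""
        count, total, total_sq = 0, 0.0, 0.0
        prev_sharpe = 0.0                # f of the empty prefix, taken as 0 by convention here
        adapted = []
        for r in simple_returns:
            count += 1
            total += r
            total_sq += r * r
            mean = total / count
            var = max(total_sq / count - mean ** 2, 0.0)
            sharpe = mean / (np.sqrt(var) + eps)      # eps keeps the first step finite
            adapted.append(sharpe - prev_sharpe)      # temporal difference of f
            prev_sharpe = sharpe
        return adapted

    daily_returns = np.random.normal(0.001, 0.01, size=252)   # hypothetical year of daily returns
    total_reward = sum(sharpe_adapted_rewards(daily_returns))
    # total_reward is, up to the eps regularization, np.mean(daily_returns) / np.std(daily_returns)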

3.3 Discrete Optimization Problems

Next, we consider a large class of applications where RL is commonly used: problems where the agent iteratively transforms a state by its actions to find a state with a lower associated cost. These problems are common in scientific applications such as physics, e.g. to reduce the length of quantum logic circuits [Fösel et al. (2021)], or chemistry, e.g. for molecular discovery [Zhou et al. (2019)]. Another prominent example is the discovery of new algorithms [Mankowitz et al. (2023)]. Intuitively, these problems can be understood as searching for the state with the lowest cost within an equivalence class defined by all states that can be reached from the start state by the agent's actions.

The immediate rewards are given by the decrease in cost between consecutive states, $\tilde{r}_t = c(\tilde{s}_t) - c(\tilde{s}_{t+1})$. Additionally, we are interested in the state with the lowest cost found during a trajectory, i.e. the goal is to maximize

\mathbb{E}_{\pi}\left[ \max_{0 \le t \le T-1} \sum_{t'=0}^{t} \tilde{r}_{t'} \right] . \qquad (8)

We conjecture that maximizing Equation 8 will yield better results than maximizing $\mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1} \tilde{r}_t\right]$ due to the following reasons:

1. The agent does not need to learn an optimal stopping point.

2. Considering only the rewards up to the minimum found cost might decrease the variance of the gradient estimate.

3. The agent does not receive negative rewards for trying to escape a local cost minimum during a trajectory and is therefore not discouraged from exploring. This makes it easier to learn difficult optimization strategies that require an intermittent cost increase.
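The reward redefinition behind this objective can be sketched in a few lines (an illustrative stand-alone function with hypothetical naming, not the exact implementation used in our experiments): a reward is emitted only when a new best cost is reached, so the rewards of a trajectory sum to the improvement from the start state to the best state visited, i.e. exactly the quantity of Equation 8 for that trajectory.

```python
def best_state_rewards(costs):
    """costs[t] = c(s_t) along one trajectory, with costs[0] the start-state cost.

    Returns redefined rewards such that sum(rewards) == costs[0] - min(costs),
    the best cost improvement found during the trajectory (illustrative sketch)."""
    best = costs[0]
    rewards = []
    for c in costs[1:]:
        rewards.append(max(0.0, best - c))  # positive only when a new best cost is reached
        best = min(best, c)
    return rewards


# Example: costs 5 -> 6 -> 3 -> 4 -> 2 yield rewards [0, 2, 0, 1], which sum to 5 - 2 = 3.
```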

Peak environment

To facilitate an in-depth analysis, we first consider a toy environment with the cost function depicted in the inset of Figure 5 (a). The cost function was chosen to be simple while still requiring the optimal policy to take intermittently cost-increasing actions. Each trajectory lasts 10 steps, and the agent's actions are stepping left, stepping right, or doing nothing. To minimize the number of hyperparameters, we use the REINFORCE algorithm [ Williams ( 1992 ) ] with a tabular policy. We compare agents trained with the non-cumulative objective of Equation 8 (MAXREINFORCE) with agents that maximize the cumulative rewards (REINFORCE). As shown in Figure 5 (a), the MAXREINFORCE agent trains much faster. Two possible sources of this speed-up are a different, more advantageous direction of the gradient updates and a reduced variance when estimating these gradients. To investigate which is the case, we periodically stop training and run $n=1000$ trajectories with a fixed policy. We then compute the empirical variance of the gradient update, inspired by Kaledin et al. ( 2022 ), as

where $\vec{g}_{i}$ is the gradient from a single trajectory, derived either by the MAXREINFORCE or the REINFORCE algorithm. As this variance scales with the squared magnitude of the average gradients, we normalize it by this value to ensure a fair comparison between the two algorithms. In Figure 5 (b), we show that in the initial phase of training, the MAXREINFORCE algorithm significantly reduces the variance. To compare the direction of the gradients, we use the same data to compute the normalized average gradients of both algorithms and show their dot product in Figure 5 (c). We find that the gradients are correlated (i.e. the dot product is greater than zero) but not identical. Therefore, we conclude that the training speed-up derives both from the lower variance and from a better true gradient direction. All reported results are averaged over 10 seeds, with standard deviations shown as shaded regions.
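For reference, the normalized variance could be computed as follows (a sketch under the assumption that the metric is the trace of the empirical covariance of the per-trajectory gradients divided by the squared norm of their mean; the exact estimator of Kaledin et al. ( 2022 ) may differ in details such as the normalization constant):

```python
import numpy as np


def normalized_gradient_variance(grads):
    """grads: array of shape (n, d) holding one flattened policy gradient per trajectory.

    Returns the empirical variance of the per-trajectory gradients, normalized by the
    squared magnitude of the mean gradient (assumed form of the reported metric)."""
    grads = np.asarray(grads)
    mean_grad = grads.mean(axis=0)
    variance = np.mean(np.sum((grads - mean_grad) ** 2, axis=1))  # trace of the empirical covariance
    return variance / (np.sum(mean_grad ** 2) + 1e-12)
```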

ZX-diagrams

As a real-world use case, we consider the simplification of ZX-diagrams, which are graph representations of quantum processes [ Coecke & Kissinger ( 2017 ) ] with applications e.g. in the compilation of quantum programs [ Duncan et al. ( 2020 ); Riu et al. ( 2023 ) ]. An example of a typical ZX-diagram is shown in the inset of Figure 6 (b). We consider the environment described by Nägele & Marquardt ( 2023 ), where the cost function of a diagram is given by its number of nodes, the start states are randomly sampled ZX-diagrams, and the actions are a set of local graph transformations. In total, there are 6 actions per node and 6 actions per edge in the diagram. This is a challenging reinforcement learning task that requires graph neural networks to accommodate the changing size of the state and action space. We use PPO to train agents to maximize either Equation 8 (MAXPPO) or the cumulative reward (PPO) with a trajectory length of 20 steps. As shown in Figure 6 (a), the MAXPPO agent initially trains faster than the PPO agent (left inset). This is likely due to the reduced variance and different gradient direction described above. The PPO agent then briefly catches up, but ultimately requires about twice as many training steps as the MAXPPO agent to reach optimal performance (right inset). We argue that this is because the MAXPPO agent is better at exploring and therefore at learning difficult optimization strategies (reason 3 above). This is reflected in the entropy of the MAXPPO agent's policy, which stays much higher than that of the PPO agent, as shown in Figure 6 (b). All reported results are averaged over 5 seeds, with standard deviations shown as shaded regions.

Quantum error correction

Next, we focus on optimization problems where the starting state is always the same and the optimal stopping point is known. The specific tasks we consider are the search for quantum programs that either find new quantum error correction codes or prepare logical states of a given quantum error correction code, both of which are critical for the eventual realization of quantum computation [ Terhal ( 2015 ) ]. In both cases, the agent iteratively adds elementary quantum logic gates until the program delivers the desired result. The agents' policies are encoded by standard multilayer perceptrons. We train agents on five differently sized instances of both tasks. For details on the environments used, see Zen et al. ( 2024 ) and Olle et al. ( 2023 ). To obtain a single performance measure encompassing both final performance and training speed, we continuously evaluate the mean cost improvement of the agents and average it over the training process, as suggested by Andrychowicz et al. ( 2020b ). Intuitively, this measure can be understood as the area under the agent's training curve. In Table 3, we report the quotient of this performance measure for the MAXPPO and PPO algorithms, both for the best of 10 and for the mean of 10 trained agents. The different tasks are denoted by three integers, as customary in the quantum error correction community. We find that the MAXPPO algorithm performs similarly to PPO in quantum error correction code discovery (subscript e), and significantly better in logical state preparation (subscript s). We argue that the large degeneracy present in the solution space of the code discovery task diverts the more exploratory MAXPPO, lowering its performance to the level of PPO.
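As a rough sketch of how this aggregated performance measure and the reported quotient could be computed (function names and inputs are hypothetical; the evaluation schedule follows Andrychowicz et al. ( 2020b )):

```python
import numpy as np


def training_curve_area(eval_scores):
    """Mean of the periodic evaluations of the agent's mean cost improvement,
    i.e. (up to a constant) the area under the training curve."""
    return float(np.mean(eval_scores))


def performance_quotient(maxppo_scores, ppo_scores):
    """Quotient of the aggregated performance measure of MAXPPO and PPO,
    as reported per task in Table 3 (illustrative sketch)."""
    return training_curve_area(maxppo_scores) / training_curve_area(ppo_scores)
```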

When applying MAXPPO to discrete optimization problems with long trajectories of many hundreds of steps, we empirically observed an initially slow learning speed. This could be because the agent initially mostly increases the cost and therefore receives zero reward for almost the entire trajectory. A possible solution could be to dynamically adjust the trajectory length in such problems, going from shorter to longer trajectories over the course of training.

4 Related Work

A special case of NCMDPs was first considered by Quah & Quek ( 2006 ), who adapt Q-learning to the $\max$ objective by redefining the temporal difference error of their learning algorithms and demonstrate their algorithms on an optimal stopping problem. However, they do not adapt the state space and do not provide theoretical convergence guarantees. Gottipati et al. ( 2020 ) rediscover the same algorithm, apply it to molecule generation, and provide convergence guarantees for their method, while Eyckerman et al. ( 2022 ) consider the same algorithm for application placement in fog environments. Cui & Yu ( 2023 ) then show a shortcoming of the method used in the above papers: it is guaranteed to converge to the optimal policy only in deterministic settings. Additionally, they provide convergence guarantees of Q-learning in deterministic environments for a larger class of functions $f$, focusing on the $\min$ function for network routing applications. Independently, the $\max$ function is also used in the field of safe reinforcement learning in deterministic environments, both for Q-learning [ Fisac et al. ( 2015 , 2019 ); Hsu et al. ( 2021 ) ] and for policy-based reinforcement learning [ Yu et al. ( 2022 ) ]. Moflic & Paler ( 2023 ) use a reward function with parameters that depend on the past rewards of the trajectory to tackle quantum circuit optimization problems requiring large intermittent negative rewards, albeit without providing theoretical convergence guarantees. Additionally, Wang et al. ( 2020 ) investigate planning problems with non-cumulative objectives in deterministic settings. They provide provably efficient planning algorithms for a large class of functions $f$ by discretizing rewards and appending them to the states. Recently, Veviurko et al. ( 2024 ) showed how to use the $\max$ objective also in probabilistic settings by augmenting the state space, and provide convergence guarantees for both Q-learning and policy-based methods. They then show experiments in which their algorithm yields improvements in MDPs with shaped rewards. However, they do not redefine the rewards, so their approach requires adapting the implementation of the MDP solvers. In the special case of the $\max$ objective, our method described above reduces to an algorithm that is effectively equivalent to theirs.

A limitation of all works discussed above is that they require a potentially complicated adaptation of their reinforcement learning algorithms and only consider specific MDP solvers or specific non-cumulative objectives. They are also limited to deterministic settings, with the exception of Veviurko et al. ( 2024 ), who consider only the $\max$ objective.

5 Discussion & Conclusion

In this work, we described a mapping from a decision process with a general non-cumulative objective (NCMDP) to a standard Markov decision process (MDP) applicable in deterministic and probabilistic settings. Its implementation is straightforward and directly enables solving NCMDPs with state-of-the-art MDP solvers, allowing us to show improvements in a diverse set of tasks such as classical control problems, portfolio optimization, and discrete optimization problems. Note that these improvements are achieved without adding a single additional hyperparameter to the solving algorithms.

In further theoretical work, a classification of objective functions $f$ with constant-size state adaptations $x_{t}$ would be desirable. From an applications perspective, there are many interesting non-cumulative objectives $f$ that could not be maximized so far. For example, the geometric mean could be used to maximize average growth rates, or the function $f(\tilde{r}_{0},\dots,\tilde{r}_{T-1})=\delta^{T}\sum_{t=0}^{T-1}\tilde{r}_{t}$ with $\delta\in(0,1)$ could be used to define an exponential trade-off between trajectory length and cumulative reward in settings where long trajectories are undesirable. We believe that a multitude of other applications with non-cumulative objectives are still unknown to the reinforcement learning (RL) community, and conversely, that researchers working on non-cumulative problems are not aware of RL, simply because these two concepts could not straightforwardly be unified so far. This manuscript offers the exciting possibility of discovering and addressing this class of problems still unexplored by RL.
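For concreteness, such objectives fit a telescoping reward redefinition of the form (an illustrative sketch, with $f$ of the empty prefix defined as zero; the exact state adaptation $x_{t}$ may differ from this simple construction) $\hat{r}_{t}=f(\tilde{r}_{0},\dots,\tilde{r}_{t})-f(\tilde{r}_{0},\dots,\tilde{r}_{t-1})$, so that $\sum_{t=0}^{T-1}\hat{r}_{t}=f(\tilde{r}_{0},\dots,\tilde{r}_{T-1})$. For the exponential trade-off above, this gives $\hat{r}_{t}=\delta^{t+1}\sum_{t'=0}^{t}\tilde{r}_{t'}-\delta^{t}\sum_{t'=0}^{t-1}\tilde{r}_{t'}$, so only the running sum of past rewards has to be carried along in the augmented state.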

Code and Data Availability

We open-source all code used to produce the results of this manuscript in the following GitHub repositories: classical control, portfolio optimization, and peak environment: Nägele ( 2024a ); ZX-diagrams: Nägele ( 2024b ); quantum error correction code discovery: Olle ( 2024 ); logical state preparation: Zen ( 2024 ). In addition, we provide all generated data and agent weights at https://osf.io/ajwmk .

Acknowledgments

This research is part of the Munich Quantum Valley (K-4 and K-8), which is supported by the Bavarian state government with funds from the Hightech Agenda Bayern Plus.

Appendix A Details on experiments

Lunar lander

For training, we use the PPO implementation of stable-baselines3 [ Raffin et al. ( 2021 ) ]. The hyperparameters and network architecture of both algorithms were chosen following Raffin ( 2020 ) (which are optimized to give good performance without the velocity penalty), only increasing the batch size and the total training steps to ensure convergence. Training a single agent takes around one hour on a Quadro RTX 6000 GPU with the environment running in parallel on 32 CPUs.
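A minimal training setup along these lines (an illustrative sketch with placeholder values; the environment id, batch size, and step count are assumptions, and the actual settings follow Raffin ( 2020 )) could look like this:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# Illustrative sketch only: environment id, batch size, and step count are placeholders;
# the remaining hyperparameters are left at their stable-baselines3 defaults here.
vec_env = make_vec_env("LunarLander-v2", n_envs=32)   # 32 parallel environments, as described above
model = PPO("MlpPolicy", vec_env, batch_size=256, verbose=0)
model.learn(total_timesteps=2_000_000)
```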

Portfolio optimization

For training, we use the PPO implementation of stable-baselines3 [ Raffin et al. ( 2021 ) ]. The hyperparameters and network architecture of both algorithms were chosen following Sood et al. ( 2023 ), where they were optimized to give good performance of the DIFFSHARPE algorithm. Training a single agent takes around one hour on a Quadro RTX 6000 GPU with the environment running in parallel on 10 CPUs.

Peak environment

For training, we use a custom REINFORCE implementation with a tabular policy to keep the number of hyperparameters minimal, updating the agent's policy after each completed trajectory. We use a learning rate of $2^{-10}$ but also performed experiments scanning the learning rate, which yield qualitatively similar results. Training a single agent takes around 20 minutes on a Quadro RTX 6000 GPU with the environment running on a single CPU.
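A bare-bones version of such a tabular REINFORCE update (illustrative only; the state and action counts are placeholders, no baseline is used, and the trajectory objective is passed in as a single scalar) might look as follows:

```python
import numpy as np

n_states, n_actions, lr = 11, 3, 2.0 ** -10   # placeholder sizes; learning rate 2^-10 as stated
theta = np.zeros((n_states, n_actions))        # tabular policy parameters (logits per state)


def policy(state):
    """Softmax policy over the tabular logits of one state."""
    logits = theta[state]
    p = np.exp(logits - logits.max())
    return p / p.sum()


def reinforce_update(states, actions, objective):
    """One REINFORCE step after a completed trajectory.

    `objective` is the scalar assigned to the whole trajectory: the cumulative
    reward for REINFORCE, or the non-cumulative objective for MAXREINFORCE
    (a simplification of the actual per-step update used in the experiments)."""
    grad = np.zeros_like(theta)
    for s, a in zip(states, actions):
        p = policy(s)
        grad[s] -= p          # gradient of log-softmax w.r.t. logits: one_hot(a) - p
        grad[s, a] += 1.0
    theta[:] += lr * objective * grad
```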

ZX-diagrams

For training, we use the PPO implementation of Nägele & Marquardt ( 2023 ) to accommodate the changing observation and action space. The hyperparameters and network architecture were chosen following Nägele & Marquardt ( 2023 ), who originally chose them to give good performance of the PPO algorithm. This is the most compute-intensive experiment reported in this work, with one training run lasting 12 hours using two Quadro RTX 6000 GPUs and 32 CPUs.

We estimate the total compute time to reproduce the results of this manuscript to be around 1000 GPU hours and 25000 CPU hours.

  • Andrychowicz et al. (2020a) Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Józefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, Jonas Schneider, Szymon Sidor, Josh Tobin, Peter Welinder, Lilian Weng, and Wojciech Zaremba. Learning dexterous in-hand manipulation. The International Journal of Robotics Research , 39(1):3–20, 2020a. doi: 10.1177/0278364919887447 . URL https://journals.sagepub.com/doi/10.1177/0278364919887447 .
  • Andrychowicz et al. (2020b) Marcin Andrychowicz, Anton Raichuk, Piotr Stańczyk, Manu Orsini, Sertan Girgin, Raphaël Marinier, Leonard Hussenot, Matthieu Geist, Olivier Pietquin, and Marcin Michalski. What matters for on-policy deep actor-critic methods? A large-scale study. In International conference on learning representations , 2020b. doi: 10.48550/arXiv.2006.05990 . URL https://openreview.net/forum?id=nIAxjsniDzg .
  • Bacchus et al. (1996) Fahiem Bacchus, Craig Boutilier, and Adam Grove. Rewarding behaviors. In Proceedings of the AAAI Conference on Artificial Intelligence , pp.  1160–1167, 1996. ISBN 978-0-262-51091-2. URL https://aaai.org/papers/172-AAAI96-172-rewarding-behaviors/ .
  • Bacchus et al. (1997) Fahiem Bacchus, Craig Boutilier, and Adam Grove. Structured solution methods for non-markovian decision processes. In Proceedings of the AAAI Conference on Artificial Intelligence , pp.  112–117, 1997. ISBN 0262510952. URL https://dl.acm.org/doi/abs/10.5555/1867406.1867424 .
  • Coecke & Kissinger (2017) Bob Coecke and Aleks Kissinger. Picturing Quantum Processes: A First Course in Quantum Theory and Diagrammatic Reasoning . Cambridge University Press, 2017. ISBN 9781107104228.
  • Cui & Yu (2023) Wei Cui and Wei Yu. Reinforcement learning with non-cumulative objective. IEEE Transactions on Machine Learning in Communications and Networking , 1:124–137, 2023. doi: 10.1109/TMLCN.2023.3285543 . URL https://ieeexplore.ieee.org/document/10151914 .
  • Duncan et al. (2020) Ross Duncan, Aleks Kissinger, Simon Perdrix, and John Van De Wetering. Graph-theoretic simplification of quantum circuits with the ZX-calculus. Quantum , 4:279, 2020. doi: 10.22331/q-2020-06-04-279 . URL http://quantum-journal.org/papers/q-2020-06-04-279/ .
  • Eyckerman et al. (2022) Reinout Eyckerman, Phil Reiter, Steven Latré, Johann Marquez-Barja, and Peter Hellinckx. Application placement in fog environments using multi-objective reinforcement learning with maximum reward formulation. In NOMS 2022-2022 IEEE/IFIP Network Operations and Management Symposium , pp.  1–6, 2022. doi: 10.1109/NOMS54207.2022.9789757 . URL https://ieeexplore.ieee.org/document/9789757 .
  • Fisac et al. (2015) Jaime F Fisac, Mo Chen, Claire J Tomlin, and S Shankar Sastry. Reach-avoid problems with time-varying dynamics, targets and constraints. In Proceedings of the 18th international conference on hybrid systems: computation and control , pp.  11–20, 2015. doi: 10.1145/2728606.2728612 . URL https://dl.acm.org/doi/10.1145/2728606.2728612 .
  • Fisac et al. (2019) Jaime F. Fisac, Neil F. Lugovoy, Vicenç Rubies-Royo, Shromona Ghosh, and Claire J. Tomlin. Bridging Hamilton-Jacobi safety analysis and reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA) , pp.  8550–8556, 2019. doi: 10.1109/ICRA.2019.8794107 . URL https://ieeexplore.ieee.org/document/8794107 .
  • Fösel et al. (2021) Thomas Fösel, Murphy Yuezhen Niu, Florian Marquardt, and Li Li. Quantum circuit optimization with deep reinforcement learning. arXiv , 2021. doi: 10.48550/arXiv.2103.07585 . URL https://arxiv.org/abs/2103.07585 .
  • Gottipati et al. (2020) Sai Krishna Gottipati, Yashaswi Pathak, Rohan Nuttall, Raviteja Chunduru, Ahmed Touati, Sriram Ganapathi Subramanian, Matthew E Taylor, and Sarath Chandar. Maximum reward formulation in reinforcement learning. arXiv , 2020. doi: 10.48550/arXiv.2010.03744 . URL https://arxiv.org/abs/2010.03744 .
  • Hsu et al. (2021) Kai-Chieh Hsu, Vicenç Rubies-Royo, Claire J Tomlin, and Jaime F Fisac. Safety and liveness guarantees through reach-avoid reinforcement learning. In Proceedings of Robotics: Science and Systems , July 2021. doi: 10.15607/RSS.2021.XVII.077 . URL https://www.roboticsproceedings.org/rss17/p077.pdf .
  • Kaledin et al. (2022) Maxim Kaledin, Alexander Golubev, and Denis Belomestny. Variance reduction for policy-gradient methods via empirical variance minimization. arXiv , 2022. doi: 10.48550/arXiv.2206.06827 . URL https://arxiv.org/abs/2206.06827 .
  • Li et al. (2020) Junyi Li, Xintong Wang, Yaoyang Lin, Arunesh Sinha, and Michael Wellman. Generating realistic stock market order streams. Proceedings of the AAAI Conference on Artificial Intelligence , 34:727–734, 2020. doi: 10.1609/aaai.v34i01.5415 . URL https://ojs.aaai.org/index.php/AAAI/article/view/5415 .
  • Lu et al. (2022) Chris Lu, Jakub Kuba, Alistair Letcher, Luke Metz, Christian Schroeder de Witt, and Jakob Foerster. Discovered policy optimisation. Advances in Neural Information Processing Systems , 35:16455–16468, 2022. doi: 10.48550/arXiv.2210.05639 . URL https://proceedings.neurips.cc/paper_files/paper/2022/hash/688c7a82e31653e7c256c6c29fd3b438-Abstract-Conference.html .
  • Mankowitz et al. (2023) Daniel J. Mankowitz, Andrea Michi, Anton Zhernov, Marco Gelmi, Marco Selvi, Cosmin Paduraru, Edouard Leurent, Shariq Iqbal, Jean-Baptiste Lespiau, Alex Ahern, Thomas Köppe, Kevin Millikin, Stephen Gaffney, Sophie Elster, Jackson Broshear, Chris Gamble, Kieran Milan, Robert Tung, Minjae Hwang, Taylan Cemgil, Mohammadamin Barekatain, Yujia Li, Amol Mandhane, Thomas Hubert, Julian Schrittwieser, Demis Hassabis, Pushmeet Kohli, Martin Riedmiller, Oriol Vinyals, and David Silver. Faster sorting algorithms discovered using deep reinforcement learning. Nature , 618(7964):257–263, 2023. doi: 10.1038/s41586-023-06004-9 . URL https://doi.org/10.1038/s41586-023-06004-9 .
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature , 518(7540):529–533, 2015. doi: 10.1038/nature14236 . URL https://doi.org/10.1038/nature14236 .
  • Moflic & Paler (2023) Ioana Moflic and Alexandru Paler. Cost explosion for efficient reinforcement learning optimisation of quantum circuits. In 2023 IEEE International Conference on Rebooting Computing (ICRC) , pp.  1–5. IEEE, 2023. doi: 10.1109/ICRC60800.2023.10386864 . URL https://www.computer.org/csdl/proceedings-article/icrc/2023/10386864/1TJmieJCklW .
  • Moody & Saffell (2001) J. Moody and M. Saffell. Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks , 12(4):875–889, 2001. doi: 10.1109/72.935097 . URL https://ieeexplore.ieee.org/document/935097 .
  • Moody et al. (1998) John Moody, Lizhong Wu, Yuansong Liao, and Matthew Saffell. Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting , 17(5-6):441–470, 1998. doi: https://doi.org/10.1002/(SICI)1099-131X(1998090)17:5/6<441::AID-FOR707>3.0.CO;2-\# . URL https://onlinelibrary.wiley.com/doi/10.1002/%28SICI%291099-131X%281998090%2917%3A5/6%3C441%3A%3AAID-FOR707%3E3.0.CO%3B2-%23 .
  • Nägele (2024a) Maximilian Nägele. Code for classical control, portfolio optimization, and peak environment. https://github.com/MaxNaeg/ncmdp , 2024a.
  • Nägele (2024b) Maximilian Nägele. Code for ZX-diagrams. https://github.com/MaxNaeg/ZXreinforce , 2024b.
  • Nägele & Marquardt (2023) Maximilian Nägele and Florian Marquardt. Optimizing ZX-diagrams with deep reinforcement learning. arXiv , 2023. doi: 10.48550/arXiv.2311.18588 . URL https://arxiv.org/abs/2311.18588 .
  • Olle (2024) Jan Olle. Code for quantum error correction code discovery. https://github.com/jolle-ag/qdx , 2024.
  • Olle et al. (2023) Jan Olle, Remmy Zen, Matteo Puviani, and Florian Marquardt. Simultaneous discovery of quantum error correction codes and encoders with a noise-aware reinforcement learning agent. arXiv , 2023. doi: 10.48550/arXiv.2311.04750 . URL https://arxiv.org/abs/2311.04750 .
  • Quah & Quek (2006) K.H. Quah and Chai Quek. Maximum reward reinforcement learning: A non-cumulative reward criterion. Expert Systems with Applications , 31(2):351–359, 2006. doi: 10.1016/j.eswa.2005.09.054 . URL https://www.sciencedirect.com/science/article/pii/S0957417405002228 .
  • Raffin (2020) Antonin Raffin. Rl baselines3 zoo. https://github.com/DLR-RM/rl-baselines3-zoo , 2020.
  • Raffin et al. (2021) Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research , 22(268):1–8, 2021. URL http://jmlr.org/papers/v22/20-1364.html .
  • Riu et al. (2023) Jordi Riu, Jan Nogué, Gerard Vilaplana, Artur Garcia-Saez, and Marta P Estarellas. Reinforcement learning based quantum circuit optimization via ZX-calculus. arXiv , 2023. doi: 10.48550/arXiv.2312.11597 . URL https://arxiv.org/abs/2312.11597 .
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv , 2017. doi: 10.48550/arXiv.1707.06347 . URL https://arxiv.org/abs/1707.06347 .
  • Sharpe (1966) William F. Sharpe. Mutual fund performance. The Journal of Business , 39(1):119–138, 1966. ISSN 00219398, 15375374. URL http://www.jstor.org/stable/2351741 .
  • Sood et al. (2023) Srijan Sood, Kassiani Papasotiriou, Marius Vaiciulis, and Tucker Balch. Deep reinforcement learning for optimal portfolio allocation: A comparative study with mean-variance optimization. International Conference on Automated Planning and Scheduling , pp.  21, 2023. URL https://icaps23.icaps-conference.org/papers/finplan/FinPlan23_paper_4.pdf .
  • Sutton & Barto (2018) Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, Second edition . The MIT Press, 2018. ISBN 9780262039246. URL http://incompleteideas.net/book/RLbook2020.pdf .
  • Sutton et al. (1999) Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems , 12, 1999. URL https://papers.nips.cc/paper_files/paper/1999/hash/464d828b85b0bed98e80ade0a5c43b0f-Abstract.html .
  • Terhal (2015) Barbara M. Terhal. Quantum error correction for quantum memories. Rev. Mod. Phys. , 87:307–346, 2015. doi: 10.1103/RevModPhys.87.307 . URL https://link.aps.org/doi/10.1103/RevModPhys.87.307 .
  • Thiébaux et al. (2006) Sylvie Thiébaux, Charles Gretton, John Slaney, David Price, and Froduald Kabanza. Decision-theoretic planning with non-markovian rewards. Journal of Artificial Intelligence Research , 25:17–74, 2006. URL https://dl.acm.org/doi/abs/10.5555/1622543.1622545 .
  • Towers et al. (2023) Mark Towers, Jordan K. Terry, Ariel Kwiatkowski, John U. Balis, Gianluca de Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Arjun KG, Markus Krimmel, Rodrigo Perez-Vicente, Andrea Pierré, Sander Schulhoff, Jun Jet Tai, Andrew Tan Jin Shen, and Omar G. Younis. Gymnasium, 2023. URL https://zenodo.org/record/8127025 .
  • Veviurko et al. (2024) Grigorii Veviurko, Wendelin Böhmer, and Mathijs de Weerdt. To the max: Reinventing reward in reinforcement learning. arXiv , 2024. doi: 10.48550/arXiv.2402.01361 . URL https://arxiv.org/abs/2402.01361 .
  • Wang et al. (2020) Ruosong Wang, Peilin Zhong, Simon S Du, Russ R Salakhutdinov, and Lin Yang. Planning with general objective functions: Going beyond total rewards. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems , volume 33, pp.  14486–14497, 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/a6a767bbb2e3513233f942e0ff24272c-Paper.pdf .
  • Watkins & Dayan (1992) Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning , 8:279–292, 1992. doi: 10.1007/BF00992698 . URL https://link.springer.com/article/10.1007/BF00992698 .
  • Williams (1992) Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning , 8(3):229–256, 1992. doi: 10.1007/BF00992696 . URL https://doi.org/10.1007/BF00992696 .
  • Yu et al. (2022) Dongjie Yu, Haitong Ma, Shengbo Li, and Jianyu Chen. Reachability constrained reinforcement learning. In Proceedings of the 39th International Conference on Machine Learning , volume 162, pp.  25636–25655, 2022. doi: 10.48550/arXiv.2205.07536 . URL https://proceedings.mlr.press/v162/yu22d.html .
  • Zen (2024) Remmy Zen. Code for logical state preparation. https://github.com/remmyzen/rlftqc , 2024.
  • Zen et al. (2024) Remmy Zen, Jan Olle, Luis Colmenarez, Matteo Puviani, Markus Müller, and Florian Marquardt. Quantum circuit discovery for fault-tolerant logical state preparation with reinforcement learning. arXiv , 2024. doi: 10.48550/arXiv.2402.17761 . URL https://arxiv.org/abs/2402.17761 .
  • Zhou et al. (2019) Zhenpeng Zhou, Steven Kearnes, Li Li, Richard N. Zare, and Patrick Riley. Optimization of molecules via deep reinforcement learning. Scientific Reports , 9:10752, 2019. doi: 10.1038/s41598-019-47148-x . URL https://www.nature.com/articles/s41598-019-47148-x .
